WO2021258853A1 - Vocabulary error correction method and apparatus, computer device, and storage medium - Google Patents

Vocabulary error correction method and apparatus, computer device, and storage medium

Info

Publication number
WO2021258853A1
WO2021258853A1, PCT/CN2021/091066, CN2021091066W
Authority
WO
WIPO (PCT)
Prior art keywords
candidate
vocabulary
character
word set
characters
Prior art date
Application number
PCT/CN2021/091066
Other languages
English (en)
French (fr)
Inventor
陈乐清
刘东煜
曾增烽
李炫�
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021258853A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/31 - Indexing; Data structures therefor; Storage structures
    • G06F 16/316 - Indexing structures
    • G06F 16/319 - Inverted lists
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/903 - Querying
    • G06F 16/90335 - Query processing
    • G06F 16/90344 - Query processing by using string matching techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates

Definitions

  • This application relates to the field of data processing, and in particular to a method, device, computer equipment and storage medium for vocabulary error correction.
  • the embodiments of the present application provide a vocabulary error correction method, device, computer equipment, and storage medium to solve the problem of low efficiency when performing vocabulary error correction.
  • a vocabulary error correction method including:
  • obtaining a vocabulary to be processed, where the vocabulary to be processed includes N candidate characters;
  • using an inverted index method, obtaining the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, where the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
  • using an edit distance algorithm to determine a character to be replaced from the vocabulary to be processed, and determining a word set to be processed from the candidate word set collection based on the character to be replaced, where the word set to be processed includes target characters and the candidate word set corresponding to each target character;
  • performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, where the target vocabulary is the vocabulary obtained after error correction is performed on the vocabulary to be processed.
  • a vocabulary error correction device includes:
  • the first acquiring module is configured to acquire a vocabulary to be processed, where the vocabulary to be processed includes N candidate characters;
  • the second acquisition module is configured to use an inverted index method to obtain the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, where the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
  • the first determining module is configured to determine the character to be replaced from the vocabulary to be processed by using the edit distance algorithm, and to determine the word set to be processed from the candidate word set collection based on the character to be replaced, where the word set to be processed includes target characters and the candidate word set corresponding to each target character;
  • the first intersection processing module is configured to perform intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, wherein the target vocabulary is a vocabulary obtained after error correction is performed on the vocabulary to be processed.
  • a computer device includes a memory, a processor, and computer-readable instructions that are stored in the memory and can run on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • obtaining a vocabulary to be processed, where the vocabulary to be processed includes N candidate characters;
  • using an inverted index method, obtaining the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, where the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
  • using an edit distance algorithm to determine a character to be replaced from the vocabulary to be processed, and determining a word set to be processed from the candidate word set collection based on the character to be replaced, where the word set to be processed includes target characters and the candidate word set corresponding to each target character;
  • performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, where the target vocabulary is the vocabulary obtained after error correction is performed on the vocabulary to be processed.
  • One or more readable storage media storing computer readable instructions, where when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps:
  • obtaining a vocabulary to be processed, where the vocabulary to be processed includes N candidate characters;
  • using an inverted index method, obtaining the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, where the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
  • using an edit distance algorithm to determine a character to be replaced from the vocabulary to be processed, and determining a word set to be processed from the candidate word set collection based on the character to be replaced, where the word set to be processed includes target characters and the candidate word set corresponding to each target character;
  • performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, where the target vocabulary is the vocabulary obtained after error correction is performed on the vocabulary to be processed.
  • In the above vocabulary error correction method, apparatus, computer device and storage medium, a vocabulary to be processed is obtained, where the vocabulary to be processed includes N candidate characters; an inverted index method is used to obtain the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, where the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters; an edit distance algorithm is used to determine the character to be replaced from the vocabulary to be processed, and the word set to be processed is determined from the candidate word set collection based on the character to be replaced, where the word set to be processed includes target characters and the candidate word set corresponding to each target character; and intersection processing is performed on the candidate word set corresponding to each target character to obtain a target vocabulary, where the target vocabulary is the vocabulary obtained after error correction of the vocabulary to be processed.
  • This embodiment proposes an efficient hierarchical inverted index storage structure, which not only improves space utilization but can also flexibly recall candidate sets at any edit distance. When the candidate word sets of all characters are stored in the hierarchical inverted index dictionary, the candidate word set corresponding to each candidate character in the vocabulary to be processed can be obtained quickly through the inverted index, which solves the problem of low efficiency in vocabulary error correction. In addition, by combining the edit distance recall algorithm, this solution further reduces indexing time and improves the efficiency of data processing.
  • FIG. 1 is a schematic diagram of an application environment of a vocabulary error correction method in an embodiment of the present application
  • FIG. 2 is an example diagram of a vocabulary error correction method in an embodiment of the present application
  • Fig. 3 is another example diagram of a vocabulary error correction method in an embodiment of the present application.
  • Fig. 4 is another example diagram of a vocabulary error correction method in an embodiment of the present application.
  • FIG. 5 is another example diagram of a vocabulary error correction method in an embodiment of the present application.
  • Fig. 6 is another example diagram of a vocabulary error correction method in an embodiment of the present application.
  • Fig. 7 is a functional block diagram of a vocabulary error correction device in an embodiment of the present application.
  • FIG. 8 is another principle block diagram of a vocabulary error correction device in an embodiment of the present application.
  • FIG. 9 is another principle block diagram of a vocabulary error correction device in an embodiment of the present application.
  • Fig. 10 is a schematic diagram of a computer device in an embodiment of the present application.
  • the vocabulary error correction method provided by the embodiment of the present application can be applied in the application environment as shown in FIG. 1.
  • the vocabulary error correction method is applied in a vocabulary error correction system.
  • the vocabulary error correction system includes a client and a server as shown in FIG. 1; the client and the server communicate through a network, and the system is used to solve the problem of low efficiency in vocabulary error correction.
  • the client, also called the user terminal, refers to the program that corresponds to the server and provides local services to the user.
  • the client can be installed on, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented with a standalone server or a server cluster composed of multiple servers.
  • a method for lexical error correction is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:
  • S10: Obtain a vocabulary to be processed, where the vocabulary to be processed includes N candidate characters.
  • the vocabulary to be processed refers to the data to be processed.
  • the vocabulary to be processed is the vocabulary to be corrected.
  • the word to be processed can be syrup, diphtheria or trypsin, etc.
  • the vocabulary to be processed includes N candidate characters, and N is a positive integer.
  • Candidate characters are the characters that make up the vocabulary to be processed.
  • For example, the word to be processed "糖浆" (syrup) includes the two candidate characters "糖" and "浆"; the word to be processed "白喉病" (diphtheria) includes the three candidate characters "白", "喉" and "病"; and the word to be processed "胰蛋白酶" (trypsin) includes the four candidate characters "胰", "蛋", "白" and "酶".
  • The vocabulary to be processed may be obtained in real time from information input by the user that needs correction, may be pre-collected as words that need to be corrected, or may be any word taken directly from a dictionary library in which a large amount of vocabulary information is stored in advance.
  • S20 Use the inverted index method to obtain the candidate word set corresponding to each candidate character from the preset hierarchical inverted index dictionary to form a candidate word set set, where each character in the hierarchical inverted index dictionary The candidate word set is sorted and stored hierarchically based on the number of characters.
  • the inverted index method is an index method that needs to find records based on the value of an attribute.
  • the inverted index method is used to store a mapping of the storage location of a word in a document or a group of documents under a full-text search. It is the most commonly used data structure in document retrieval systems. Through the inverted index method, a candidate word set containing the corresponding candidate character can be quickly obtained according to each candidate character.
  • the inverted index method is used to query each candidate character in a preset hierarchical inverted index dictionary, so as to obtain the candidate word set corresponding to each candidate character.
  • each candidate word in the candidate word set corresponding to each candidate character contains the corresponding candidate character.
  • the candidate word set corresponding to each character in the hierarchical inverted index dictionary is classified and stored hierarchically by the number of characters.
  • the hierarchical inverted index dictionary contains several characters and a set of candidate words corresponding to each character. The candidate words contained in the candidate word set corresponding to each character are classified and stored hierarchically according to the number of characters.
  • For example, if the candidate word set corresponding to the character "白" includes [白云, 白屏, 白化病, 白喉病, 白塞病, 白纹伊蚊, 毛状白斑, 结构蛋白, 分泌蛋白, 脑白质病, ...], then in the hierarchical inverted index dictionary [白云, 白屏] form the first class and are stored in the first layer, [白化病, 白喉病, 白塞病] form the second class and are stored in the second layer, and [白纹伊蚊, 毛状白斑, 结构蛋白, 分泌蛋白, 脑白质病] form the third class and are stored in the third layer.
  • Storing the candidate word set corresponding to each character in this classified and layered manner reduces the space of each layer of word sets and therefore the indexing time, and at the same time allows candidate word sets at any edit distance to be searched flexibly.
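  • For illustration only, the following sketch (not code from the application) shows how such a hierarchical inverted index can be queried: each character maps to word sets grouped by word length, so the candidate word set of a candidate character is retrieved directly from the layer matching the length of the vocabulary to be processed. All data and names are hypothetical.

```python
# Minimal lookup sketch for a hierarchical inverted index (illustrative data):
# each character maps to candidate word sets grouped (layered) by word length.
hier_index = {
    "白": {2: {"白云", "白屏"}, 3: {"白化病", "白喉病", "白塞病"}},
    "病": {3: {"白化病", "白喉病", "结核病", "狂犬病"}},
}

def candidate_set(index, char, num_chars):
    """Candidate words that contain `char` and have exactly `num_chars` characters."""
    return index.get(char, {}).get(num_chars, set())

print(candidate_set(hier_index, "白", 3))  # {'白化病', '白喉病', '白塞病'}
```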
  • S30 Use the edit distance algorithm to determine the characters to be replaced from the vocabulary to be processed, and determine the word set to be processed from the set of candidate word sets based on the characters to be replaced.
  • the word set to be processed includes the target character and the candidate word set corresponding to each target character .
  • Edit distance recall algorithm refers to an algorithm that calculates the similarity between two strings.
  • Edit Distance (Edit Distance): Also known as Levenshtein distance, it refers to the minimum number of editing operations required to convert two strings from one to the other. The permitted editing operations include replacing one character with another, inserting a character, and deleting a character. The smaller the edit distance of the string, the greater the similarity between the two strings.
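  • As a reference point, a standard dynamic-programming implementation of the Levenshtein distance described above is sketched below; this is the textbook formulation, not code taken from the application.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete a character
                            curr[j - 1] + 1,             # insert a character
                            prev[j - 1] + (ca != cb)))   # replace a character
        prev = curr
    return prev[-1]

print(edit_distance("白侯病", "白喉病"))  # 1: a single substitution
```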
  • the edit distance algorithm is used to calculate the similarity between the vocabulary to be processed and the correct standard vocabulary set in advance, so as to determine the character to be replaced.
  • Here, the standard vocabulary refers to correct vocabulary that is set in advance and meets the requirements. Optionally, there may be one or more characters to be replaced.
  • the candidate word set corresponding to the character to be replaced is removed from the candidate word set set, thereby obtaining the word set to be processed.
  • the word set to be processed includes a candidate word set corresponding to each target character.
  • the target character is a character not to be replaced in the vocabulary to be processed.
  • For example, if the candidate word set collection includes candidate word set A4 corresponding to candidate character a, candidate word set B4 corresponding to candidate character b, candidate word set C4 corresponding to candidate character c, and candidate word set D4 corresponding to candidate character d, and the character to be replaced determined from the vocabulary to be processed using the edit distance algorithm is a, then the word set to be processed determined from the candidate word set collection includes candidate word set B4, candidate word set C4 and candidate word set D4.
  • the candidate word set B4 is a word set corresponding to the target character b
  • the candidate word set C4 is a word set corresponding to the target character c
  • the candidate word set D4 is a word set corresponding to the target character d.
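  • A minimal sketch of this step, with hypothetical data: the candidate word set of the character to be replaced is dropped from the candidate word set collection, and the remaining sets form the word set to be processed.

```python
# Illustrative only: form the word set to be processed by removing the candidate
# word set of the character to be replaced (here "a") from the collection.
candidate_collection = {"a": {"w1"}, "b": {"w1", "w2"}, "c": {"w2"}, "d": {"w1", "w2"}}
char_to_replace = "a"
word_set_to_process = {ch: s for ch, s in candidate_collection.items() if ch != char_to_replace}
print(sorted(word_set_to_process))  # ['b', 'c', 'd']
```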
  • S40 Perform intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, where the target vocabulary is a vocabulary obtained after error correction is performed on the vocabulary to be processed.
  • After the word set to be processed is determined, intersection processing is performed on the candidate word sets of the target characters in the word set to be processed, that is, the words in the candidate word set corresponding to each target character are matched with one another so as to filter out the words that appear simultaneously in the candidate word set of every target character; these words are taken as the target vocabulary.
  • the target vocabulary is a word that contains all target characters at the same time.
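  • The intersection step can be pictured with the following sketch (illustrative data only): the target vocabulary is the set intersection of the candidate word sets of all target characters.

```python
from functools import reduce

# Illustrative data: candidate word sets of the target characters "白" and "病".
candidates = {
    "白": {"白化病", "白细胞", "蛋白质", "白塞病"},
    "病": {"白化病", "结核病", "白塞病", "狂犬病"},
}
target_words = reduce(set.intersection, candidates.values())
print(target_words)  # words present in every set: {'白化病', '白塞病'}
```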
  • For example, if the vocabulary to be processed is "白喉病" and the character to be replaced is "喉", the candidate word set corresponding to the target character "白" is [白线疝, 尿蛋白, 白细胞, 白胡椒, 漂白粉, 白花油, 舌白斑, 白茅根, 蛋白尿, 酪蛋白, 蛋白粉, 蛋白质, 白虎汤, 球蛋白, 白蛉热, 白化病, 白塞病, ...] and the candidate word set corresponding to the target character "病" is [重链病, 瘙痒病, 呆小病, 结核病, 腰痛病, 曲霉病, 空调病, 硬皮病, 狂犬病, 白化病, 白喉病, 白塞病, 口疮病, 红皮病, ...]; after the intersection of the candidate word sets of the target characters, the target words obtained are "白化病" (albinism) and "白塞病" (Behcet's disease), both of which appear in the candidate word sets corresponding to "白" and "病". Understandably, the target vocabulary is the correct vocabulary obtained after error correction of the vocabulary to be processed.
  • In this embodiment, the vocabulary to be processed is obtained, where it includes N candidate characters; the inverted index method is used to obtain the candidate word set corresponding to each candidate character from the preset hierarchical inverted index dictionary to form a candidate word set collection, where the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters; the edit distance algorithm is used to determine the character to be replaced from the vocabulary to be processed, and the word set to be processed is determined from the candidate word set collection based on the character to be replaced, where the word set to be processed includes target characters and the candidate word set corresponding to each target character; and intersection processing is performed on the candidate word set corresponding to each target character to obtain the target vocabulary, where the target vocabulary is the vocabulary obtained after error correction of the vocabulary to be processed.
  • This embodiment proposes an efficient hierarchical inverted index storage structure, which not only improves space utilization but can also flexibly recall candidate sets at any edit distance; when the candidate word sets of all characters are stored in the hierarchical inverted index dictionary, the candidate word set corresponding to each candidate character in the vocabulary to be processed can be obtained quickly through the inverted index, which solves the problem of low efficiency in vocabulary error correction. In addition, by combining the edit distance recall algorithm, this solution further reduces indexing time and improves the efficiency of data processing.
  • In an embodiment, before the vocabulary to be processed is obtained, the vocabulary error correction method further specifically includes the following steps:
  • S11: Obtain data to be stored, where the data to be stored includes N sample characters and a to-be-stored word set corresponding to each sample character.
  • the data to be stored refers to the data to be stored.
  • the data to be stored includes N sample characters and the word set to be stored corresponding to each sample character.
  • the sample character can be any character containing one byte.
  • For example, the sample characters can be "白", "候", "病", "脑" and so on.
  • the word set to be stored corresponding to each sample character is a set of expanded words obtained by expanding the edit distance of the corresponding sample character. Understandably, each extended word in the to-be-stored vocabulary set corresponding to each sample character contains a corresponding sample character.
  • For example, the to-be-stored word set corresponding to the sample character "白" can be [白云, 白屏, 白化病, 白喉病, 白塞病, 白纹伊蚊, 毛状白斑, 结构蛋白, 分泌蛋白, 脑白质病, ...].
  • Specifically, the data to be stored can be obtained by collecting sample characters in advance and then expanding each sample character according to a preset edit distance threshold to obtain the to-be-stored word set corresponding to each sample character, taking each sample character and its corresponding to-be-stored word set as the data to be stored; it is also possible to directly obtain several candidate characters and the candidate word set corresponding to each candidate character from an inverted index dictionary as the data to be stored.
  • S12 Classify the word set to be stored corresponding to each sample character according to the number of characters, and obtain a hierarchical candidate word set of each sample character.
  • The expansion words in a to-be-stored word set may be two-character words, three-character words, four-character words, and so on. Therefore, in this step the to-be-stored word set of each sample character is classified by the number of characters, that is, the expansion words in the to-be-stored word set are classified according to how many characters they contain: expansion words with the same number of characters are put into the same class and expansion words with different numbers of characters into different classes, which yields the hierarchical candidate word set of each sample character.
  • the hierarchical candidate word set of each sample character includes multiple types of candidate word sets, and the number of characters of all expanded words in the same type of candidate word set is the same.
  • For example, if the to-be-stored word set corresponding to the sample character "白" is [白云, 分泌蛋白, 白屏, 白纹伊蚊, 白化病, 白喉病, 结构蛋白, 白塞病, 毛状白斑, 脑白质病, ...], then after the to-be-stored word set is classified by the number of characters, the hierarchical candidate word set of the sample character "白" includes the first candidate word set [白云, 白屏, ...], the second candidate word set [白化病, 白喉病, 白塞病, ...] and the third candidate word set [白纹伊蚊, 毛状白斑, 结构蛋白, 分泌蛋白, 脑白质病, ...].
  • S13: Store each sample character and the corresponding hierarchical candidate word set hierarchically to generate a hierarchical inverted index dictionary.
  • The hierarchical inverted index dictionary refers to a dictionary used to store each sample character and the corresponding hierarchical candidate word set. In the hierarchical inverted index dictionary, the hierarchical candidate word set corresponding to each sample character is stored in layers according to the number of characters of the candidate words.
  • Specifically, after the hierarchical candidate word set of each sample character is determined, each sample character is associated with its corresponding hierarchical candidate word set, and then each class of candidate words in that hierarchical candidate word set is stored layer by layer; for example, the first candidate word set in the hierarchical candidate word set is stored in the first layer, the second candidate word set is stored in the second layer, and the third candidate word set is stored in the third layer, thereby generating the hierarchical inverted index dictionary.
  • In this embodiment, storing the hierarchical candidate word set of each sample character hierarchically reduces the space of each layer of word sets and therefore the indexing time, and at the same time makes it possible to search flexibly for candidate word sets at any edit distance.
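  • A minimal construction sketch, under the assumption that the data to be stored is simply a mapping from sample characters to expansion words (names and data are hypothetical, not from the application):

```python
from collections import defaultdict

def build_hierarchical_dictionary(data_to_store):
    """data_to_store: dict mapping a sample character to its to-be-stored word set."""
    dictionary = {}
    for sample_char, word_set in data_to_store.items():
        layers = defaultdict(set)
        for word in word_set:
            layers[len(word)].add(word)          # same character count -> same layer
        dictionary[sample_char] = dict(layers)   # layer key = number of characters
    return dictionary

data = {"白": ["白云", "白屏", "白化病", "白喉病", "白塞病", "白纹伊蚊", "脑白质病"]}
print(build_hierarchical_dictionary(data)["白"][4])  # {'白纹伊蚊', '脑白质病'}
```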
  • In this embodiment, the data to be stored is obtained, where the data to be stored includes N sample characters and the to-be-stored word set corresponding to each sample character; the to-be-stored word set corresponding to each sample character is classified by the number of characters to obtain the hierarchical candidate word set of each sample character; and each sample character and the corresponding hierarchical candidate word set are stored hierarchically to generate the hierarchical inverted index dictionary. Storing the hierarchical candidate word set of each sample character in this classified and layered manner reduces the space of each layer of word sets and therefore the indexing time, and at the same time allows candidate word sets at any edit distance to be searched flexibly.
  • In an embodiment, after the candidate word set corresponding to each candidate character is obtained from the preset hierarchical inverted index dictionary, the vocabulary error correction method further specifically includes the following steps:
  • S21 Combine each candidate character in the vocabulary to be processed according to a preset strategy to obtain a candidate character combination set, where the candidate character combination set includes at least one candidate character combination.
  • each candidate character in the vocabulary to be processed is combined according to a preset strategy to obtain a candidate character combination set.
  • the candidate character combination set includes at least one candidate character combination.
  • the preset strategy is to combine two adjacent candidate characters in the vocabulary to be processed according to the number of candidate characters contained in the vocabulary to be processed.
  • the preset strategy may be that if the vocabulary to be processed includes four candidate characters, the first candidate character and the second candidate character are combined, and the third candidate character and the fourth candidate character are combined. For example: if the word to be processed is "abcd", the candidate character a and the candidate character b are combined to obtain the candidate character combination ab, and the candidate character c and the candidate character d are combined to obtain the candidate character combination cd. Therefore, if the vocabulary to be processed is "abcd", after the preset strategy combines the candidate characters, the obtained candidate character combination set includes the candidate character combination ab and the candidate character combination cd.
  • If the vocabulary to be processed includes three candidate characters, the first candidate character and the second candidate character are combined, or the second candidate character and the third candidate character are combined.
  • the word to be processed is "abc"
  • the candidate character a and the candidate character b are combined to obtain the candidate character combination ab
  • the candidate character b and the candidate character c are combined to obtain the candidate character combination bc.
  • In a specific embodiment, the user can also combine the candidate characters in pairs according to regular vocabulary relationships to obtain the candidate character combinations.
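  • The pairing strategy described above can be sketched as follows; the three-character branch returns both adjacent pairs, which is one possible reading of the "or" in the strategy (hypothetical helper, not from the application).

```python
def combine_candidates(word):
    """Combine adjacent candidate characters in pairs, as described for 3- and 4-character words."""
    if len(word) == 4:
        return [(word[0], word[1]), (word[2], word[3])]
    if len(word) == 3:
        return [(word[0], word[1]), (word[1], word[2])]
    return [tuple(word)]

print(combine_candidates("abcd"))  # [('a', 'b'), ('c', 'd')]
print(combine_candidates("abc"))   # [('a', 'b'), ('b', 'c')]
```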
  • S22 Obtain a candidate word set corresponding to each candidate character in each candidate character combination in the candidate character combination set, and perform intersection processing on the candidate word sets in each candidate character combination to obtain a candidate word set set of the vocabulary to be processed.
  • As can be seen from step S21, the candidate character combination set includes at least one candidate character combination, each candidate character combination consists of two candidate characters, and each candidate character corresponds to one candidate word set. Therefore, the two candidate word sets corresponding to the candidate characters in each candidate character combination are obtained and intersected to obtain the candidate word set corresponding to that candidate character combination; finally, the candidate word sets corresponding to the candidate character combinations are combined to obtain the candidate word set collection of the vocabulary to be processed.
  • For example, if the obtained candidate character combination set includes the candidate character combination ab and the candidate character combination cd, the candidate word set of candidate character a in combination ab is A and that of candidate character b is B, and the candidate word set of candidate character c in combination cd is C and that of candidate character d is D, then candidate word set A and candidate word set B are intersected (Inter(AB) = A ∩ B) and candidate word set C and candidate word set D are intersected (Inter(CD) = C ∩ D), so the candidate word set of combination ab is Inter(AB) and that of combination cd is Inter(CD); finally, Inter(AB) and Inter(CD) together form the candidate word set collection of the vocabulary to be processed "abcd".
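  • A small sketch of this pre-intersection step with hypothetical sets: each pair is reduced to a single set such as Inter(AB) = A ∩ B before the final intersection is computed.

```python
def pairwise_intersections(pairs, candidate_sets):
    """Intersect the two candidate word sets inside each candidate character pair."""
    return {pair: candidate_sets[pair[0]] & candidate_sets[pair[1]] for pair in pairs}

sets = {"a": {"w1", "w2"}, "b": {"w2", "w3"}, "c": {"w2", "w4"}, "d": {"w2"}}
print(pairwise_intersections([("a", "b"), ("c", "d")], sets))
# {('a', 'b'): {'w2'}, ('c', 'd'): {'w2'}}
```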
  • In this embodiment, each candidate character in the vocabulary to be processed is combined according to a preset strategy to obtain a candidate character combination set, where the candidate character combination set includes at least one candidate character combination; the candidate word set corresponding to each candidate character in each candidate character combination is obtained, and the candidate word sets within each candidate character combination are intersected to obtain the candidate word set collection of the vocabulary to be processed. Intersecting the candidate word sets corresponding to the candidate characters of the vocabulary to be processed in advance reduces the amount of repeated intersection calculation when the target vocabulary is determined later.
  • the inverted index technology is used to obtain the candidate word set corresponding to each candidate character from the preset hierarchical inverted index dictionary, which specifically includes the following steps:
  • S201 Determine the number of characters of the vocabulary to be processed.
  • Specifically, because the hierarchical word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters, the hierarchical word set corresponding to each character includes several layers of word sets with different numbers of characters. Therefore, in this step, to improve the efficiency of subsequent data processing, the hierarchical word set corresponding to each candidate character is obtained based on the number of characters of the vocabulary to be processed. Understandably, the number of characters of the vocabulary to be processed is the number N of candidate characters included in the vocabulary to be corrected.
  • S202: Use the inverted index method to obtain the hierarchical word set corresponding to each candidate character from the preset hierarchical inverted index dictionary.
  • Specifically, the inverted index method is first used to query each candidate character in the preset hierarchical inverted index dictionary, so as to obtain the hierarchical word set corresponding to each candidate character.
  • the hierarchical word set corresponding to each candidate character is composed of multiple layers of different word sets. The number of characters in words in different levels of vocabulary is different, and the number of characters in words in the same level of vocabulary is the same.
  • the obtained hierarchical word set of the candidate character a includes: the first-level word set A1, the second-level word set A2, and the third-level word set A3.
  • S203 Extract a word set with the same number of characters as the vocabulary to be processed from the hierarchical word set corresponding to each candidate character as the candidate word set of the corresponding candidate character.
  • the hierarchical word set corresponding to each candidate character is filtered, and a word set with the same number of characters as the number of characters contained in the vocabulary to be processed is filtered from each hierarchical word set The set of candidate words as the corresponding candidate characters.
  • the number of characters in each word in the candidate word set of each candidate character is the same as the number of characters in the vocabulary to be processed. Understandably, if the word to be processed is abcd, the number of characters in each word in the candidate word set corresponding to candidate character a, candidate character b, candidate character c, and candidate character d are all four.
  • For example, if the vocabulary to be processed is abcd and candidate character a is "白", the hierarchical word set corresponding to "白" includes the first word set of two-character words A2 = [白色, 白墙, 百威, 白云, 白屏, ...], the second word set of three-character words A3 = [白线疝, 尿蛋白, 白化病, 白喉病, 白塞病, ...] and the third word set of four-character words A4 = [毛状白斑, 结构蛋白, 白膜侵睛, 分泌蛋白, 脑白质病, ...]. Because the vocabulary to be processed abcd includes four candidate characters, the candidate word set selected for "白" is A4 = [毛状白斑, 结构蛋白, 白膜侵睛, 分泌蛋白, 脑白质病, ...], that is, every word in the candidate word set obtained for "白" has four characters.
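  • Steps S201 to S203 can be summarized by the following sketch (illustrative data): for every candidate character, only the layer whose word length equals the length of the vocabulary to be processed is kept.

```python
# Hypothetical hierarchical word sets for two characters.
hier_index = {
    "白": {2: {"白云"}, 3: {"白化病"}, 4: {"毛状白斑", "脑白质病"}},
    "病": {3: {"结核病"}, 4: {"脑白质病"}},
}
word = "脑白质病"                 # vocabulary to be processed, 4 candidate characters
n = len(word)                     # S201: number of characters
candidate_sets = {ch: hier_index.get(ch, {}).get(n, set()) for ch in word}  # S202 + S203
print(candidate_sets["白"])       # {'毛状白斑', '脑白质病'}
```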
  • In this embodiment, the number of characters of the vocabulary to be processed is determined; the inverted index method is used to obtain the hierarchical word set corresponding to each candidate character from the preset hierarchical inverted index dictionary; and the word set whose number of characters equals that of the vocabulary to be processed is extracted from the hierarchical word set of each candidate character as the candidate word set of that candidate character. Obtaining the candidate word set of each candidate character through the inverted index method improves the efficiency and accuracy of candidate word set generation.
  • the use of the edit distance algorithm to determine the character to be replaced from the vocabulary to be processed specifically includes the following steps:
  • S301: Obtain each candidate character in the vocabulary to be processed to form a target character string.
  • The target character string is a character string composed of the candidate characters contained in the vocabulary to be processed.
  • the vocabulary to be processed includes a number of candidate characters, and each candidate character included in the vocabulary to be processed is formed into a target string.
  • S302 Calculate the edit distance between the target character string and the preset character string using an edit distance algorithm, where the preset character string is a character string corresponding to the preset standard data.
  • the edit distance algorithm refers to an algorithm that calculates the similarity between two character strings.
  • the edit distance algorithm is used to calculate the similarity between the target character string and the preset character string, so as to obtain the edit distance between the target character string and the preset character string.
  • the preset character string is a character string corresponding to the preset correct standard data.
  • the edit distance algorithm is used to calculate the edit distance between the target character string and the preset character string.
  • The edit distance refers to the minimum number of editing operations required to convert one character string into the other.
  • the permitted editing operations include: replacing one character with another, inserting a character, and deleting a character.
  • The smaller the edit distance, the greater the similarity between the two character strings. In this embodiment, the smaller the edit distance, the closer the target character string is to the preset character string, that is, the closer the vocabulary to be processed is to the standard data.
  • S303: Determine the character to be replaced based on the edit distance.
  • Specifically, after the edit distance is determined, a character in the target character string that differs from the characters in the preset character string is determined as the character to be replaced based on that edit distance; determining the character to be replaced through the edit distance improves the accuracy of the determined character to be replaced.
  • In this embodiment, the character string corresponding to the vocabulary to be processed is obtained as the target character string; the edit distance algorithm is used to calculate the edit distance between the target character string and the preset character string, where the preset character string is the character string corresponding to preset standard data; and the character to be replaced is determined based on the edit distance, which improves the accuracy of the determined character to be replaced.
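  • Under the simplifying assumption that the preset standard data is a plain list of correct words of the same length, steps S301 to S303 can be sketched as below; the helper names are hypothetical and only character substitutions are illustrated.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def edit_distance(a: str, b: str) -> int:
    """Memoized recursive Levenshtein distance."""
    if not a or not b:
        return len(a) + len(b)
    cost = a[-1] != b[-1]
    return min(edit_distance(a[:-1], b) + 1,          # delete a character
               edit_distance(a, b[:-1]) + 1,          # insert a character
               edit_distance(a[:-1], b[:-1]) + cost)  # replace a character

def chars_to_replace(target, standard_words):
    """Pick the closest standard word and mark positions where the characters differ."""
    best = min(standard_words, key=lambda w: edit_distance(target, w))
    if len(best) != len(target):      # only the substitution case is illustrated
        return []
    return [(i, c) for i, (c, s) in enumerate(zip(target, best)) if c != s]

print(chars_to_replace("白侯病", ["白化病", "白喉病", "结核病"]))  # [(1, '侯')]
```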
  • a vocabulary error correction device is provided, and the vocabulary error correction device corresponds to the vocabulary error correction method in the above-mentioned embodiment in a one-to-one correspondence.
  • the vocabulary error correction device includes a first acquisition module 10, a second acquisition module 20, a first determination module 30 and a first intersection processing module 40.
  • the detailed description of each functional module is as follows:
  • the first obtaining module 10 is configured to obtain a vocabulary to be processed, and the vocabulary to be processed includes N candidate characters;
  • the second acquisition module 20 is configured to use the inverted index method to obtain the candidate word set corresponding to each candidate character from the preset hierarchical inverted index dictionary to form a candidate word set collection, where the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
  • the first determining module 30 is configured to determine the character to be replaced from the vocabulary to be processed by using the edit distance algorithm, and to determine the word set to be processed from the candidate word set collection based on the character to be replaced, where the word set to be processed includes target characters and the candidate word set corresponding to each target character;
  • the first intersection processing module 40 is configured to perform intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, where the target vocabulary is a vocabulary obtained after error correction is performed on the vocabulary to be processed.
  • the vocabulary error correction device further includes:
  • the third acquiring module 11 is configured to acquire data to be stored, and the data to be stored includes N sample characters and a word set to be stored corresponding to each sample character;
  • the classification module 12 is used to classify the to-be-stored word set corresponding to each sample character according to the number of characters to obtain a hierarchical candidate word set of each sample character;
  • the storage module 13 is used to store each sample character and the corresponding hierarchical candidate word set hierarchically to generate a hierarchical inverted index dictionary.
  • the vocabulary error correction device further includes:
  • the combination module 21 is configured to combine each candidate character in the vocabulary to be processed according to a preset strategy to obtain a candidate character combination set, and the candidate character combination set includes at least one candidate character combination;
  • the second intersection processing module 22 is configured to obtain the candidate word set corresponding to each candidate character in each candidate character combination of the candidate character combination set, and to perform intersection processing on the candidate word sets within each candidate character combination to obtain the candidate word set collection of the vocabulary to be processed.
  • the second acquisition module 20 includes:
  • the number of characters determining unit is used to determine the number of characters of the vocabulary to be processed
  • the first obtaining unit is configured to obtain a hierarchical word set corresponding to each candidate character from a preset hierarchical inverted index dictionary by adopting an inverted index method;
  • the extraction unit is configured to extract a word set with the same number of characters as the vocabulary to be processed from the hierarchical word set corresponding to each candidate character as the candidate word set of the corresponding candidate characters.
  • the first determining module 30 includes:
  • the second acquiring unit is used to acquire each candidate character in the vocabulary to be processed as a target string
  • the calculation unit is used to calculate the edit distance between the target character string and the preset character string using an edit distance algorithm, where the preset character string is a character string corresponding to the preset standard data;
  • the determining unit is used to determine the character to be replaced based on the editing distance.
  • Each module in the above-mentioned vocabulary error correction device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 10.
  • the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a readable storage medium and an internal memory.
  • the readable storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer readable instructions in the readable storage medium.
  • the database of the computer device is used to store the data used in the vocabulary error correction method in the foregoing embodiment.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to implement a method for lexical error correction.
  • the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • obtaining a vocabulary to be processed, where the vocabulary to be processed includes N candidate characters;
  • using an inverted index method, obtaining the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, where the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
  • using an edit distance algorithm to determine a character to be replaced from the vocabulary to be processed, and determining a word set to be processed from the candidate word set collection based on the character to be replaced, where the word set to be processed includes target characters and the candidate word set corresponding to each target character;
  • performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, where the target vocabulary is the vocabulary obtained after error correction is performed on the vocabulary to be processed.
  • one or more readable storage media storing computer readable instructions are provided.
  • the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media; the readable storage media store computer readable instructions, and when the computer readable instructions are executed by one or more processors, the one or more processors implement the following steps:
  • obtaining a vocabulary to be processed, where the vocabulary to be processed includes N candidate characters;
  • using an inverted index method, obtaining the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, where the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
  • using an edit distance algorithm to determine a character to be replaced from the vocabulary to be processed, and determining a word set to be processed from the candidate word set collection based on the character to be replaced, where the word set to be processed includes target characters and the candidate word set corresponding to each target character;
  • performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, where the target vocabulary is the vocabulary obtained after error correction is performed on the vocabulary to be processed.
  • a person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions can be stored in a non-volatile readable storage medium or a volatile readable storage medium, and when the computer-readable instructions are executed, the processes of the above method embodiments may be included.
  • any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

A vocabulary error correction method and apparatus, a computer device, and a storage medium. The method includes: obtaining a vocabulary to be processed, the vocabulary to be processed including N candidate characters (S10); using an inverted index method to obtain the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, wherein the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters (S20); using an edit distance algorithm to determine a character to be replaced from the vocabulary to be processed, and determining a word set to be processed from the candidate word set collection based on the character to be replaced, the word set to be processed including target characters and the candidate word set corresponding to each target character (S30); and performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, wherein the target vocabulary is the vocabulary obtained after error correction of the vocabulary to be processed (S40). The method solves the problem of low efficiency in vocabulary error correction.

Description

Vocabulary error correction method and apparatus, computer device, and storage medium
This application is based on the Chinese patent application No. 202010587455.3, filed on June 24, 2020 and entitled "Vocabulary error correction method and apparatus, computer device, and storage medium", and claims priority to that application.
Technical Field
This application relates to the field of data processing, and in particular to a vocabulary error correction method and apparatus, a computer device, and a storage medium.
Background
As Internet technology is applied more and more widely, users increasingly need to enter information through a computer to complete human-computer interaction. However, the inventors realized that in many cases users may enter incorrect information, so the entered information often needs to be corrected. Correcting information usually involves data processing and query procedures. At present, when querying candidate words for correcting data, the data to be corrected is usually expanded by edit distance and then compared against an expanded dictionary. As a result, loading the expanded dictionary as resident memory often consumes an excessive amount of memory space, and comparing against an expanded dictionary of enormous size consumes a long indexing time.
Technical Problem
The embodiments of this application provide a vocabulary error correction method and apparatus, a computer device, and a storage medium to solve the problem of low efficiency in vocabulary error correction.
Technical Solution
A vocabulary error correction method, including:
obtaining a vocabulary to be processed, the vocabulary to be processed including N candidate characters;
using an inverted index method to obtain the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, wherein the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
using an edit distance algorithm to determine a character to be replaced from the vocabulary to be processed, and determining a word set to be processed from the candidate word set collection based on the character to be replaced, the word set to be processed including target characters and the candidate word set corresponding to each target character;
performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, wherein the target vocabulary is the vocabulary obtained after error correction of the vocabulary to be processed.
A vocabulary error correction apparatus, including:
a first obtaining module, configured to obtain a vocabulary to be processed, the vocabulary to be processed including N candidate characters;
a second obtaining module, configured to use an inverted index method to obtain the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, wherein the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
a first determining module, configured to determine a character to be replaced from the vocabulary to be processed by using an edit distance algorithm, and to determine a word set to be processed from the candidate word set collection based on the character to be replaced, the word set to be processed including target characters and the candidate word set corresponding to each target character;
a first intersection processing module, configured to perform intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, wherein the target vocabulary is the vocabulary obtained after error correction of the vocabulary to be processed.
A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
obtaining a vocabulary to be processed, the vocabulary to be processed including N candidate characters;
using an inverted index method to obtain the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, wherein the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
using an edit distance algorithm to determine a character to be replaced from the vocabulary to be processed, and determining a word set to be processed from the candidate word set collection based on the character to be replaced, the word set to be processed including target characters and the candidate word set corresponding to each target character;
performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, wherein the target vocabulary is the vocabulary obtained after error correction of the vocabulary to be processed.
One or more readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to perform the following steps:
obtaining a vocabulary to be processed, the vocabulary to be processed including N candidate characters;
using an inverted index method to obtain the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, wherein the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
using an edit distance algorithm to determine a character to be replaced from the vocabulary to be processed, and determining a word set to be processed from the candidate word set collection based on the character to be replaced, the word set to be processed including target characters and the candidate word set corresponding to each target character;
performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, wherein the target vocabulary is the vocabulary obtained after error correction of the vocabulary to be processed.
In the above vocabulary error correction method and apparatus, computer device and storage medium, a vocabulary to be processed is obtained, the vocabulary to be processed including N candidate characters; an inverted index method is used to obtain the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, wherein the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters; an edit distance algorithm is used to determine a character to be replaced from the vocabulary to be processed, and a word set to be processed is determined from the candidate word set collection based on the character to be replaced, the word set to be processed including target characters and the candidate word set corresponding to each target character; and intersection processing is performed on the candidate word set corresponding to each target character to obtain a target vocabulary, wherein the target vocabulary is the vocabulary obtained after error correction of the vocabulary to be processed. This embodiment proposes an efficient hierarchical inverted index storage structure, which not only improves space utilization but can also flexibly recall candidate sets at any edit distance; when the candidate word sets of all characters are stored in the hierarchical inverted index dictionary, the candidate word set corresponding to each candidate character in the vocabulary to be processed can be obtained quickly through the inverted index, which solves the problem of low efficiency in vocabulary error correction. In addition, by combining the edit distance recall algorithm, this solution further reduces indexing time and improves the efficiency of data processing.
Beneficial Effects
The details of one or more embodiments of this application are set forth in the accompanying drawings and the description below, and other features and advantages of this application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application, and a person of ordinary skill in the art may obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment of a vocabulary error correction method in an embodiment of this application;
FIG. 2 is an example diagram of a vocabulary error correction method in an embodiment of this application;
FIG. 3 is another example diagram of a vocabulary error correction method in an embodiment of this application;
FIG. 4 is another example diagram of a vocabulary error correction method in an embodiment of this application;
FIG. 5 is another example diagram of a vocabulary error correction method in an embodiment of this application;
FIG. 6 is another example diagram of a vocabulary error correction method in an embodiment of this application;
FIG. 7 is a schematic block diagram of a vocabulary error correction apparatus in an embodiment of this application;
FIG. 8 is another schematic block diagram of a vocabulary error correction apparatus in an embodiment of this application;
FIG. 9 is another schematic block diagram of a vocabulary error correction apparatus in an embodiment of this application;
FIG. 10 is a schematic diagram of a computer device in an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The vocabulary error correction method provided by the embodiments of this application can be applied in the application environment shown in FIG. 1. Specifically, the vocabulary error correction method is applied in a vocabulary error correction system. The vocabulary error correction system includes a client and a server as shown in FIG. 1; the client and the server communicate through a network and are used to solve the problem of low efficiency in vocabulary error correction. The client, also called the user terminal, refers to the program that corresponds to the server and provides local services to the user. The client can be installed on, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server can be implemented as a standalone server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a vocabulary error correction method is provided. Taking the method applied to the server in FIG. 1 as an example for description, it includes the following steps:
S10: Obtain a vocabulary to be processed, where the vocabulary to be processed includes N candidate characters.
The vocabulary to be processed refers to the data to be processed. In this embodiment, the vocabulary to be processed is the vocabulary to be corrected. For example, the vocabulary to be processed may be 糖浆 (syrup), 白喉病 (diphtheria), 胰蛋白酶 (trypsin), and so on. Specifically, the vocabulary to be processed includes N candidate characters, where N is a positive integer. The candidate characters are the characters that make up the vocabulary to be processed. For example, the word 糖浆 includes the two candidate characters 糖 and 浆; the word 白喉病 includes the three candidate characters 白, 喉 and 病; and the word 胰蛋白酶 includes the four candidate characters 胰, 蛋, 白 and 酶. Specifically, the vocabulary to be processed may be obtained in real time from information input by the user that needs correction, may be pre-collected as words that need to be corrected, or may be any word obtained directly from a dictionary library in which a large amount of vocabulary information is stored in advance.
S20: Use an inverted index method to obtain the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, where the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters.
The inverted index method is an index method that looks up records according to the value of an attribute. It is used to store, under full-text search, a mapping from a word to its storage locations in a document or a group of documents, and it is the most commonly used data structure in document retrieval systems. Through the inverted index method, the candidate word set containing a candidate character can be obtained quickly for each candidate character.
Specifically, the inverted index method is used to query each candidate character in the preset hierarchical inverted index dictionary, so as to obtain the candidate word set corresponding to each candidate character. Understandably, every candidate word in the candidate word set corresponding to a candidate character contains that candidate character. The candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters. Specifically, the hierarchical inverted index dictionary contains several characters and the candidate word set corresponding to each character, and the candidate words in the candidate word set corresponding to each character are classified and stored in layers according to the number of characters. For example, if the candidate word set corresponding to the character 白 includes [白云, 白屏, 白化病, 白喉病, 白塞病, 白纹伊蚊, 毛状白斑, 结构蛋白, 分泌蛋白, 脑白质病, ...], then in the hierarchical inverted index dictionary [白云, 白屏] form the first class and are stored in the first layer, [白化病, 白喉病, 白塞病] form the second class and are stored in the second layer, and [白纹伊蚊, 毛状白斑, 结构蛋白, 分泌蛋白, 脑白质病] form the third class and are stored in the third layer. In this embodiment, storing the candidate word set corresponding to each character in the hierarchical inverted index dictionary in this classified and layered manner reduces the space of each layer of word sets and therefore the indexing time, and at the same time allows candidate word sets at any edit distance to be searched flexibly.
S30: Use an edit distance algorithm to determine a character to be replaced from the vocabulary to be processed, and determine a word set to be processed from the candidate word set collection based on the character to be replaced, where the word set to be processed includes target characters and the candidate word set corresponding to each target character.
The character to be replaced is the character that needs to be replaced. The edit distance recall algorithm is an algorithm that calculates the similarity between two character strings. Edit distance, also called Levenshtein distance, is the minimum number of editing operations required to convert one character string into the other; the permitted editing operations include replacing one character with another, inserting a character, and deleting a character. The smaller the edit distance, the greater the similarity between the two character strings. Specifically, the edit distance algorithm is used to calculate the similarity between the vocabulary to be processed and the correct standard vocabulary set in advance, so as to determine the character to be replaced. Here, the standard vocabulary refers to correct vocabulary set in advance that meets the requirements. Optionally, there may be one or more characters to be replaced. Specifically, after the character to be replaced is determined, the candidate word set corresponding to the character to be replaced is removed from the candidate word set collection based on the character to be replaced, thereby obtaining the word set to be processed. The word set to be processed includes the candidate word set corresponding to each target character, where a target character is a character in the vocabulary to be processed that is not to be replaced. For example, if the candidate word set collection includes candidate word set A4 corresponding to candidate character a, candidate word set B4 corresponding to candidate character b, candidate word set C4 corresponding to candidate character c, and candidate word set D4 corresponding to candidate character d, and the character to be replaced determined from the vocabulary to be processed using the edit distance algorithm is a, then the word set to be processed determined from the candidate word set collection includes candidate word set B4, candidate word set C4 and candidate word set D4, where candidate word set B4 corresponds to target character b, candidate word set C4 corresponds to target character c, and candidate word set D4 corresponds to target character d.
S40: Perform intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, where the target vocabulary is the vocabulary obtained after error correction of the vocabulary to be processed.
Specifically, after the word set to be processed is determined, intersection processing is performed on the candidate word sets of the target characters in the word set to be processed, that is, the words in the candidate word set corresponding to each target character are matched with one another so as to filter out the words that appear simultaneously in the candidate word set of every target character; these words are taken as the target vocabulary. Understandably, the target vocabulary is a word that contains all the target characters at the same time. For example, if the vocabulary to be processed is 白喉病 and the character to be replaced is 喉, the candidate word set corresponding to the target character 白 is [白线疝, 尿蛋白, 白细胞, 白胡椒, 漂白粉, 白花油, 舌白斑, 白茅根, 蛋白尿, 酪蛋白, 蛋白粉, 蛋白质, 白虎汤, 球蛋白, 白蛉热, 白化病, 白塞病, ...] and the candidate word set corresponding to the target character 病 is [重链病, 瘙痒病, 呆小病, 结核病, 腰痛病, 曲霉病, 空调病, 硬皮病, 狂犬病, 白化病, 白喉病, 白塞病, 口疮病, 红皮病, ...]; after the intersection of the candidate word sets of the target characters, the target words obtained are 白化病 (albinism) and 白塞病 (Behcet's disease), both of which appear in the candidate word sets corresponding to 白 and 病. Understandably, the target vocabulary is the correct vocabulary obtained after error correction of the vocabulary to be processed.
In this embodiment, a vocabulary to be processed is obtained, the vocabulary to be processed including N candidate characters; an inverted index method is used to obtain the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, wherein the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters; an edit distance algorithm is used to determine a character to be replaced from the vocabulary to be processed, and a word set to be processed is determined from the candidate word set collection based on the character to be replaced, the word set to be processed including target characters and the candidate word set corresponding to each target character; and intersection processing is performed on the candidate word set corresponding to each target character to obtain a target vocabulary, wherein the target vocabulary is the vocabulary obtained after error correction of the vocabulary to be processed. This embodiment proposes an efficient hierarchical inverted index storage structure, which not only improves space utilization but can also flexibly recall candidate sets at any edit distance; when the candidate word sets of all characters are stored in the hierarchical inverted index dictionary, the candidate word set corresponding to each candidate character in the vocabulary to be processed can be obtained quickly through the inverted index, which solves the problem of low efficiency in vocabulary error correction. In addition, by combining the edit distance recall algorithm, this solution further reduces indexing time and improves the efficiency of data processing.
In an embodiment, as shown in FIG. 3, before the vocabulary to be processed is obtained, the vocabulary error correction method further includes the following steps:
S11: Obtain data to be stored, where the data to be stored includes N sample characters and a to-be-stored word set corresponding to each sample character.
The data to be stored refers to the data that is to be stored. The data to be stored includes N sample characters and the to-be-stored word set corresponding to each sample character. A sample character can be any character containing one byte, for example 白, 候, 病, 脑 and so on. The to-be-stored word set corresponding to each sample character is the set of expansion words obtained by expanding the corresponding sample character by edit distance. Understandably, every expansion word in the to-be-stored word set corresponding to a sample character contains that sample character. For example, the to-be-stored word set corresponding to the sample character 白 can be [白云, 白屏, 白化病, 白喉病, 白塞病, 白纹伊蚊, 毛状白斑, 结构蛋白, 分泌蛋白, 脑白质病, ...]. Specifically, the data to be stored can be obtained by collecting sample characters in advance and then expanding each sample character according to a preset edit distance threshold to obtain the to-be-stored word set corresponding to each sample character, and taking each sample character and its corresponding to-be-stored word set as the data to be stored; it is also possible to directly obtain several candidate characters and the candidate word set corresponding to each candidate character from an inverted index dictionary as the data to be stored.
S12: Classify the to-be-stored word set corresponding to each sample character by the number of characters to obtain a hierarchical candidate word set of each sample character.
Specifically, the to-be-stored word set of each sample character obtained in step S11 contains expansion words with different numbers of characters: the expansion words may be two-character words, three-character words, four-character words, and so on. Therefore, in this step the to-be-stored word set of each sample character is classified by the number of characters, that is, the expansion words in the to-be-stored word set are classified according to how many characters they contain: expansion words with the same number of characters are put into the same class and expansion words with different numbers of characters into different classes, which yields the hierarchical candidate word set of each sample character. Understandably, the hierarchical candidate word set of each sample character includes several classes of candidate word sets, and all expansion words within the same class have the same number of characters. For example, if the to-be-stored word set corresponding to the sample character 白 is [白云, 分泌蛋白, 白屏, 白纹伊蚊, 白化病, 白喉病, 结构蛋白, 白塞病, 毛状白斑, 脑白质病, ...], then after the to-be-stored word set is classified by the number of characters, the hierarchical candidate word set of the sample character 白 includes the first candidate word set [白云, 白屏, ...], the second candidate word set [白化病, 白喉病, 白塞病, ...] and the third candidate word set [白纹伊蚊, 毛状白斑, 结构蛋白, 分泌蛋白, 脑白质病, ...].
S13: Store each sample character and the corresponding hierarchical candidate word set hierarchically to generate a hierarchical inverted index dictionary.
The hierarchical inverted index dictionary refers to a dictionary used to store each sample character and the corresponding hierarchical candidate word set. In the hierarchical inverted index dictionary, the hierarchical candidate word set corresponding to each sample character is stored in layers according to the number of characters of the candidate words. Specifically, after the hierarchical candidate word set of each sample character is determined, each sample character is associated with its corresponding hierarchical candidate word set, and then each class of candidate words in that hierarchical candidate word set is stored layer by layer; for example, the first candidate word set in the hierarchical candidate word set is stored in the first layer, the second candidate word set is stored in the second layer, and the third candidate word set is stored in the third layer, thereby generating the hierarchical inverted index dictionary. In this embodiment, storing the hierarchical candidate word set of each sample character hierarchically reduces the space of each layer of word sets and therefore the indexing time, and at the same time makes it possible to search flexibly for candidate word sets at any edit distance.
In this embodiment, data to be stored is obtained, where the data to be stored includes N sample characters and the to-be-stored word set corresponding to each sample character; the to-be-stored word set corresponding to each sample character is classified by the number of characters to obtain the hierarchical candidate word set of each sample character; and each sample character and the corresponding hierarchical candidate word set are stored hierarchically to generate the hierarchical inverted index dictionary. Storing the hierarchical candidate word set of each sample character in this classified and layered manner reduces the space of each layer of word sets and therefore the indexing time, and at the same time allows candidate word sets at any edit distance to be searched flexibly.
在一实施例中,如图4所示,在从预设的分层倒排索引字典中获取每一候选字符对应的候选词集之后,该词汇纠错方法,还具体包括如下步骤:
S21:根据预设策略对待处理词汇中的每一候选字符进行组合,得到候选字符组合集,候选字符组合集包括至少一个候选字符组合。
具体地,当待处理词汇所包含的候选字符的数量较多时,后续在生成目标词汇时往往需要对每一候选字符所对应的候选词集做多次重复的交集计算。因此,在本实施例中,为了减少后续在确定目标词汇时做重复交集的计算量,预先对待处理词汇中的每一候选字符所对应的候选词集进行预处理。在本步骤中,先根据预设策略对待处理词汇中的每一候选字符进行组合,得到候选字符组合集,候选字符组合集包括至少一个候选字符组合。具体地,预设策略为根据待处理词汇所包含的候选字符的数量,对待处理词汇中相邻的两个候选字符进行两两组合。待处理词汇所包含的候选字符的数量不同,进行候选字符组合的方式可能不同。可选地,预设策略可以为若待处理词汇包括四个候选字符,则将第一个候选字符和第二候选字符进行组合,以及将第三个候选字符和第四候选字符进行组合。例如:若待处理词汇为“abcd”,则将候选字符a和候选字符b进行组合,得到候选字符组合ab,以及将候选字符c和候选字符d进行组合,得到候选字符组合cd。因此,若待处理词汇为“abcd”,则预设策略对候选字符进行组合后,得到的候选字符组合集包括候选字符组合ab、和候选字符组合cd。若待处理词汇包括三个候选字符,则将第一个候选字符和第二候选字符进行组合,或者将第二个候选字符和第三候选字符进行组合。例如:若待处理词汇为“abc”,则将候选字符a和候选字符b进行组合,得到候选字符组合ab;或者将候选字符b和候选字符c进行组合,得到候选字符组合bc。在一具体实施例中,用户可结合常规词汇关系对候选字符进行两两组合,从而得到候选字符组合。
S22: acquire the candidate word set corresponding to each candidate character in each candidate character combination of the candidate character combination set, and perform intersection processing on the candidate word sets within each candidate character combination to obtain the candidate word set collection of the vocabulary to be processed.
Specifically, as known from step S21, the candidate character combination set includes at least one candidate character combination, each candidate character combination consists of two candidate characters, and each candidate character corresponds to one candidate word set. Therefore, the two candidate word sets corresponding to the two candidate characters of each combination are acquired and intersected, so as to obtain the candidate word set corresponding to each candidate character combination; finally, the candidate word sets corresponding to the candidate character combinations are put together to obtain the candidate word set collection of the vocabulary to be processed. For example, if the acquired candidate character combination set includes combinations ab and cd, the candidate word set of candidate character a in combination ab is A and that of candidate character b is B, and the candidate word set of candidate character c in combination cd is C and that of candidate character d is D, then candidate word sets A and B are intersected (let Inter(AB)=A∩B) and candidate word sets C and D are intersected (let Inter(CD)=C∩D), so the candidate word set of combination ab is Inter(AB) and that of combination cd is Inter(CD); finally, Inter(AB) and Inter(CD) are taken as the candidate word set collection of the vocabulary to be processed "abcd".
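The pairwise intersection can be sketched as follows with toy data; Inter(AB) and Inter(CD) in the example above correspond to the values of the returned dictionary.

```python
# Sketch of step S22: intersect the candidate word sets inside each pair of
# candidate characters (pairs as produced by the previous sketch).
def intersect_pairs(pairs, candidate_sets):
    result = {}
    for pair in pairs:
        sets = [candidate_sets[ch] for ch in pair]
        inter = set(sets[0])
        for s in sets[1:]:
            inter &= set(s)            # Inter(pair) = intersection over its members
        result[pair] = inter
    return result

candidate_sets = {
    "a": {"w1", "w2", "w3"}, "b": {"w2", "w3"},
    "c": {"w2", "w4"},       "d": {"w2", "w5"},
}
print(intersect_pairs(["ab", "cd"], candidate_sets))
# {'ab': {'w2', 'w3'}, 'cd': {'w2'}}
```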
In this embodiment, the candidate characters of the vocabulary to be processed are combined according to a preset strategy to obtain a candidate character combination set that includes at least one candidate character combination; the candidate word set corresponding to each candidate character in each candidate character combination is acquired, and the candidate word sets within each combination are intersected to obtain the candidate word set collection of the vocabulary to be processed. Performing these intersection computations on the candidate word sets of the candidate characters in advance reduces the amount of repeated intersection computation when the target vocabulary is later determined.
In an embodiment, as shown in Fig. 5, acquiring the candidate word set corresponding to each candidate character from the preset hierarchical inverted index dictionary by using the inverted index method specifically includes the following steps:
S201: determine the character count of the vocabulary to be processed.
Specifically, the layered word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by character count, that is, the layered word set of each character in the dictionary includes several layers of word sets with different character counts. Therefore, in this step, to improve the efficiency of subsequent data processing, the layered word set corresponding to each candidate character is acquired on the basis of the character count of the vocabulary to be processed. It can be understood that the character count of the vocabulary to be processed is the number N of candidate characters that the vocabulary to be processed includes.
S202: using the inverted index method, acquire the layered word set corresponding to each candidate character from the preset hierarchical inverted index dictionary.
Specifically, each candidate character is first looked up in the preset hierarchical inverted index dictionary by the inverted index method, so as to acquire the layered word set corresponding to each candidate character. It can be understood that the layered word set of each candidate character is composed of several different layers of word sets: words in different layers have different character counts, and words within the same layer have the same character count. For example, the acquired layered word set of candidate character a may include a first-layer word set A1, a second-layer word set A2 and a third-layer word set A3.
S203: from the layered word set corresponding to each candidate character, extract the word set whose character count is the same as that of the vocabulary to be processed as the candidate word set of that candidate character.
Specifically, based on the character count of the vocabulary to be processed, the layered word set corresponding to each candidate character is filtered, and the word set whose character count equals that of the vocabulary to be processed is selected from each layered word set as the candidate word set of the corresponding candidate character. Every word in the candidate word set of each candidate character has the same character count as the vocabulary to be processed. It can be understood that if the vocabulary to be processed is abcd, every word in the candidate word sets acquired for candidate characters a, b, c and d has four characters. For example, if the vocabulary to be processed is abcd and candidate character a of the vocabulary is "白", the layered word set corresponding to "白" includes a first word set of two-character words A2=[白色, 白墙, 百威, 白云, 白屏...], a second word set of three-character words A3=[白线疝, 尿蛋白, 白化病, 白喉病, 白塞病...], a third word set of four-character words A4=[毛状白斑, 结构蛋白, 白膜侵睛, 分泌蛋白, 脑白质病...], and so on; since the vocabulary to be processed abcd includes four candidate characters, the candidate word set selected for candidate character "白" is A4=[毛状白斑, 结构蛋白, 白膜侵睛, 分泌蛋白, 脑白质病...], that is, every word in the candidate word set obtained for candidate character "白" has four characters.
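Recall of the matching layer can be sketched end to end as follows, assuming the nested character→length→words layout from the earlier storage sketch; the data is abridged and illustrative.

```python
# Sketch of steps S201-S203: determine the character count of the vocabulary,
# look up each candidate character in the hierarchical inverted index, and
# keep only the layer whose word length matches that count.
def recall_candidate_sets(word, hier_index):
    n = len(word)                              # character count of the vocabulary
    recalled = {}
    for ch in word:
        layered = hier_index.get(ch, {})       # inverted-index lookup per character
        recalled[ch] = layered.get(n, set())   # keep only the layer of length n
    return recalled

hier_index = {"白": {2: {"白云"}, 3: {"白化病", "白喉病"}, 4: {"毛状白斑"}}}
print(recall_candidate_sets("白喉病", hier_index)["白"])   # {'白化病', '白喉病'}
```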
In this embodiment, the character count of the vocabulary to be processed is determined; the layered word set corresponding to each candidate character is acquired from the preset hierarchical inverted index dictionary by the inverted index method; and the word set whose character count matches that of the vocabulary to be processed is extracted from each candidate character's layered word set as that candidate character's candidate word set. Acquiring the candidate word set of each candidate character through the inverted index method improves the efficiency and accuracy of generating the candidate word sets.
In an embodiment, as shown in Fig. 6, determining the character to be replaced from the vocabulary to be processed by using the edit distance algorithm specifically includes the following steps:
S301: acquire each candidate character of the vocabulary to be processed to form a target character string.
The target character string is the string composed of the candidate characters contained in the vocabulary to be processed. Specifically, the vocabulary to be processed includes a number of candidate characters, and these candidate characters are assembled into the target character string.
S302: use the edit distance algorithm to calculate the edit distance between the target character string and a preset character string, where the preset character string is the string corresponding to preset standard data.
The edit distance algorithm is an algorithm for computing the similarity between two character strings. In this embodiment, the edit distance algorithm is used to compute the similarity between the target character string and the preset character string, thereby obtaining the edit distance between them, where the preset character string is the string corresponding to preset, correct standard data. Specifically, the edit distance between the target character string and the preset character string is calculated with the edit distance algorithm. The edit distance is the minimum number of edit operations required to transform one string into the other, the permitted operations being replacing one character with another, inserting a character, and deleting a character. The smaller the edit distance, the more similar the two strings are; in this embodiment, a smaller edit distance means that the target character string is closer to the preset character string, that is, the vocabulary to be processed is closer to the standard data.
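For concreteness, the textbook dynamic-programming form of the Levenshtein distance is sketched below; this is the standard algorithm, not code taken from the application.

```python
# Standard dynamic-programming Levenshtein (edit) distance, shown to make the
# computation in step S302 concrete.
def edit_distance(s, t):
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                   # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                   # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

print(edit_distance("白喉病", "白化病"))   # 1 -- one substitution
```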
S303: determine the character to be replaced on the basis of the edit distance.
Specifically, after the edit distance has been determined, the characters of the target character string that differ from the characters of the preset character string are determined as the characters to be replaced on the basis of that edit distance; determining the character to be replaced through the edit distance improves the accuracy of the determination.
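One possible reading of this step for the simple case of equal-length strings is sketched below, reusing the edit_distance function from the previous sketch; how ties between standards and unequal string lengths are handled is not specified in the text and is assumed away here.

```python
# Sketch of step S303 for equal-length strings: pick the preset standard
# string with the smallest edit distance (edit_distance defined above) and
# mark the positions whose characters differ.  Tie-breaking is an assumption.
def chars_to_replace(target, standards):
    best = min(standards, key=lambda s: edit_distance(target, s))
    return [ch for ch, ref in zip(target, best) if ch != ref]

standards = ["白化病", "白塞病", "结核病"]
print(chars_to_replace("白喉病", standards))   # ['喉']
```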
In this embodiment, the character string corresponding to the vocabulary to be processed is acquired as the target character string; the edit distance between the target character string and the preset character string is calculated with the edit distance algorithm, the preset character string being the string corresponding to preset standard data; and the character to be replaced is determined on the basis of the edit distance, which improves the accuracy of the determined character to be replaced.
It should be understood that the magnitude of the step numbers in the above embodiments does not imply an order of execution; the order in which the processes are executed should be determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of this application.
In an embodiment, a vocabulary error correction apparatus is provided, and the vocabulary error correction apparatus corresponds one-to-one to the vocabulary error correction method in the above embodiments. As shown in Fig. 7, the vocabulary error correction apparatus includes a first acquisition module 10, a second acquisition module 20, a first determination module 30 and a first intersection processing module 40. The functional modules are described in detail as follows:
the first acquisition module 10 is configured to acquire a vocabulary to be processed, the vocabulary to be processed including N candidate characters;
the second acquisition module 20 is configured to acquire, by using an inverted index method, the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, where the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
the first determination module 30 is configured to determine a character to be replaced from the vocabulary to be processed by using an edit distance algorithm and, on the basis of the character to be replaced, determine a word set to be processed from the candidate word set collection, the word set to be processed including target characters and the candidate word set corresponding to each target character;
the first intersection processing module 40 is configured to perform intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, where the target vocabulary is the vocabulary obtained by correcting the vocabulary to be processed.
Preferably, as shown in Fig. 8, the vocabulary error correction apparatus further includes:
a third acquisition module 11, configured to acquire data to be stored, the data to be stored including N sample characters and the word set to be stored corresponding to each sample character;
a classification module 12, configured to classify the word set to be stored corresponding to each sample character by character count to obtain a layered candidate word set for each sample character;
a storage module 13, configured to store each sample character and its corresponding layered candidate word set in layers to generate a hierarchical inverted index dictionary.
Preferably, as shown in Fig. 9, the vocabulary error correction apparatus further includes:
a combination module 21, configured to combine the candidate characters of the vocabulary to be processed according to a preset strategy to obtain a candidate character combination set including at least one candidate character combination;
a second intersection processing module 22, configured to acquire the candidate word set corresponding to each candidate character in each candidate character combination of the candidate character combination set, and to perform intersection processing on the candidate word sets within each candidate character combination to obtain the candidate word set collection of the vocabulary to be processed.
Preferably, the second acquisition module 20 includes:
a character count determination unit, configured to determine the character count of the vocabulary to be processed;
a first acquisition unit, configured to acquire, by using an inverted index method, the layered word set corresponding to each candidate character from the preset hierarchical inverted index dictionary;
an extraction unit, configured to extract, from the layered word set corresponding to each candidate character, the word set whose character count is the same as that of the vocabulary to be processed as the candidate word set of the corresponding candidate character.
Preferably, the first determination module 30 includes:
a second acquisition unit, configured to acquire each candidate character of the vocabulary to be processed as a target character string;
a calculation unit, configured to calculate the edit distance between the target character string and a preset character string by using the edit distance algorithm, where the preset character string is the string corresponding to preset standard data;
a determination unit, configured to determine the character to be replaced on the basis of the edit distance.
For the specific limitations of the vocabulary error correction apparatus, reference may be made to the above limitations of the vocabulary error correction method, which are not repeated here. The modules of the above vocabulary error correction apparatus may be implemented wholly or partly by software, hardware or a combination thereof. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 10. The computer device includes a processor, a memory, a network interface and a database connected via a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer-readable instructions and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the readable storage medium. The database of the computer device is used to store the data used by the vocabulary error correction method in the above embodiments. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement a vocabulary error correction method. The readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented:
acquiring a vocabulary to be processed, the vocabulary to be processed including N candidate characters;
using an inverted index method, acquiring the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, where the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
determining a character to be replaced from the vocabulary to be processed by using an edit distance algorithm and, on the basis of the character to be replaced, determining a word set to be processed from the candidate word set collection, the word set to be processed including target characters and the candidate word set corresponding to each target character;
performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, where the target vocabulary is the vocabulary obtained by correcting the vocabulary to be processed.
In one embodiment, one or more readable storage media storing computer-readable instructions are provided; the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media. The readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the following steps:
acquiring a vocabulary to be processed, the vocabulary to be processed including N candidate characters;
using an inverted index method, acquiring the candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, where the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
determining a character to be replaced from the vocabulary to be processed by using an edit distance algorithm and, on the basis of the character to be replaced, determining a word set to be processed from the candidate word set collection, the word set to be processed including target characters and the candidate word set corresponding to each target character;
performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, where the target vocabulary is the vocabulary obtained by correcting the vocabulary to be processed.
A person of ordinary skill in the art can understand that all or part of the processes of the methods in the above embodiments may be accomplished by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium or a volatile readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to a memory, storage, database or other medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division of functional units and modules is used as an example; in practical applications, the above functions may be assigned to different functional units and modules as required, that is, the internal structure of the apparatus may be divided into different functional units or modules to accomplish all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.

Claims (20)

  1. A vocabulary error correction method, comprising:
     acquiring a vocabulary to be processed, the vocabulary to be processed comprising N candidate characters;
     using an inverted index method, acquiring a candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, wherein the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
     determining a character to be replaced from the vocabulary to be processed by using an edit distance algorithm, and determining, on the basis of the character to be replaced, a word set to be processed from the candidate word set collection, the word set to be processed comprising target characters and a candidate word set corresponding to each target character;
     performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, wherein the target vocabulary is a vocabulary obtained by correcting the vocabulary to be processed.
  2. The vocabulary error correction method according to claim 1, wherein before the acquiring of the vocabulary to be processed, the vocabulary error correction method further comprises:
     acquiring data to be stored, the data to be stored comprising N sample characters and a word set to be stored corresponding to each sample character;
     classifying the word set to be stored corresponding to each sample character by character count to obtain a layered candidate word set of each sample character;
     storing each sample character and the corresponding layered candidate word set in layers to generate the hierarchical inverted index dictionary.
  3. The vocabulary error correction method according to claim 1, wherein after the acquiring of the candidate word set corresponding to each candidate character from the preset hierarchical inverted index dictionary, the vocabulary error correction method further comprises:
     combining the candidate characters of the vocabulary to be processed according to a preset strategy to obtain a candidate character combination set, the candidate character combination set comprising at least one candidate character combination;
     acquiring the candidate word set corresponding to each candidate character in each candidate character combination of the candidate character combination set, and performing intersection processing on the candidate word sets in each candidate character combination to obtain the candidate word set collection of the vocabulary to be processed.
  4. The vocabulary error correction method according to claim 1, wherein the acquiring, by using the inverted index method, of the candidate word set corresponding to each candidate character from the preset hierarchical inverted index dictionary comprises:
     determining the character count of the vocabulary to be processed;
     using the inverted index method, acquiring a layered word set corresponding to each candidate character from the preset hierarchical inverted index dictionary;
     extracting, from the layered word set corresponding to each candidate character, the word set having the same character count as the vocabulary to be processed as the candidate word set of the corresponding candidate character.
  5. The vocabulary error correction method according to claim 1, wherein the determining of the character to be replaced from the vocabulary to be processed by using the edit distance algorithm comprises:
     acquiring each candidate character of the vocabulary to be processed as a target character string;
     calculating an edit distance between the target character string and a preset character string by using the edit distance algorithm, wherein the preset character string is a string corresponding to preset standard data;
     determining the character to be replaced on the basis of the edit distance.
  6. A vocabulary error correction apparatus, comprising:
     a first acquisition module, configured to acquire a vocabulary to be processed, the vocabulary to be processed comprising N candidate characters;
     a second acquisition module, configured to acquire, by using an inverted index method, a candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, wherein the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
     a first determination module, configured to determine a character to be replaced from the vocabulary to be processed by using an edit distance algorithm and, on the basis of the character to be replaced, determine a word set to be processed from the candidate word set collection, the word set to be processed comprising target characters and a candidate word set corresponding to each target character;
     a first intersection processing module, configured to perform intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, wherein the target vocabulary is a vocabulary obtained by correcting the vocabulary to be processed.
  7. The vocabulary error correction apparatus according to claim 6, further comprising:
     a third acquisition module, configured to acquire data to be stored, the data to be stored comprising N sample characters and a word set to be stored corresponding to each sample character;
     a classification module, configured to classify the word set to be stored corresponding to each sample character by character count to obtain a layered candidate word set of each sample character;
     a storage module, configured to store each sample character and the corresponding layered candidate word set in layers to generate the hierarchical inverted index dictionary.
  8. The vocabulary error correction apparatus according to claim 6, further comprising:
     a combination module, configured to combine the candidate characters of the vocabulary to be processed according to a preset strategy to obtain a candidate character combination set, the candidate character combination set comprising at least one candidate character combination;
     a second intersection processing module, configured to acquire the candidate word set corresponding to each candidate character in each candidate character combination of the candidate character combination set, and to perform intersection processing on the candidate word sets in each candidate character combination to obtain the candidate word set collection of the vocabulary to be processed.
  9. The vocabulary error correction apparatus according to claim 6, wherein the second acquisition module comprises:
     a character count determination unit, configured to determine the character count of the vocabulary to be processed;
     a first acquisition unit, configured to acquire, by using an inverted index method, a layered word set corresponding to each candidate character from the preset hierarchical inverted index dictionary;
     an extraction unit, configured to extract, from the layered word set corresponding to each candidate character, the word set having the same character count as the vocabulary to be processed as the candidate word set of the corresponding candidate character.
  10. The vocabulary error correction apparatus according to claim 6, wherein the first determination module comprises:
     a second acquisition unit, configured to acquire each candidate character of the vocabulary to be processed as a target character string;
     a calculation unit, configured to calculate an edit distance between the target character string and a preset character string by using the edit distance algorithm, wherein the preset character string is a string corresponding to preset standard data;
     a determination unit, configured to determine the character to be replaced on the basis of the edit distance.
  11. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
     acquiring a vocabulary to be processed, the vocabulary to be processed comprising N candidate characters;
     using an inverted index method, acquiring a candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, wherein the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
     determining a character to be replaced from the vocabulary to be processed by using an edit distance algorithm, and determining, on the basis of the character to be replaced, a word set to be processed from the candidate word set collection, the word set to be processed comprising target characters and a candidate word set corresponding to each target character;
     performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, wherein the target vocabulary is a vocabulary obtained by correcting the vocabulary to be processed.
  12. The computer device according to claim 11, wherein before the acquiring of the vocabulary to be processed, the processor, when executing the computer-readable instructions, further implements the following steps:
     acquiring data to be stored, the data to be stored comprising N sample characters and a word set to be stored corresponding to each sample character;
     classifying the word set to be stored corresponding to each sample character by character count to obtain a layered candidate word set of each sample character;
     storing each sample character and the corresponding layered candidate word set in layers to generate the hierarchical inverted index dictionary.
  13. The computer device according to claim 11, wherein after the acquiring of the candidate word set corresponding to each candidate character from the preset hierarchical inverted index dictionary, the processor, when executing the computer-readable instructions, further implements the following steps:
     combining the candidate characters of the vocabulary to be processed according to a preset strategy to obtain a candidate character combination set, the candidate character combination set comprising at least one candidate character combination;
     acquiring the candidate word set corresponding to each candidate character in each candidate character combination of the candidate character combination set, and performing intersection processing on the candidate word sets in each candidate character combination to obtain the candidate word set collection of the vocabulary to be processed.
  14. The computer device according to claim 11, wherein the acquiring, by using the inverted index method, of the candidate word set corresponding to each candidate character from the preset hierarchical inverted index dictionary comprises:
     determining the character count of the vocabulary to be processed;
     using the inverted index method, acquiring a layered word set corresponding to each candidate character from the preset hierarchical inverted index dictionary;
     extracting, from the layered word set corresponding to each candidate character, the word set having the same character count as the vocabulary to be processed as the candidate word set of the corresponding candidate character.
  15. The computer device according to claim 11, wherein the determining of the character to be replaced from the vocabulary to be processed by using the edit distance algorithm comprises:
     acquiring each candidate character of the vocabulary to be processed as a target character string;
     calculating an edit distance between the target character string and a preset character string by using the edit distance algorithm, wherein the preset character string is a string corresponding to preset standard data;
     determining the character to be replaced on the basis of the edit distance.
  16. One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
     acquiring a vocabulary to be processed, the vocabulary to be processed comprising N candidate characters;
     using an inverted index method, acquiring a candidate word set corresponding to each candidate character from a preset hierarchical inverted index dictionary to form a candidate word set collection, wherein the candidate word set corresponding to each character in the hierarchical inverted index dictionary is stored in a manner classified and layered by the number of characters;
     determining a character to be replaced from the vocabulary to be processed by using an edit distance algorithm, and determining, on the basis of the character to be replaced, a word set to be processed from the candidate word set collection, the word set to be processed comprising target characters and a candidate word set corresponding to each target character;
     performing intersection processing on the candidate word set corresponding to each target character to obtain a target vocabulary, wherein the target vocabulary is a vocabulary obtained by correcting the vocabulary to be processed.
  17. The readable storage media according to claim 16, wherein before the acquiring of the vocabulary to be processed, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further perform the following steps:
     acquiring data to be stored, the data to be stored comprising N sample characters and a word set to be stored corresponding to each sample character;
     classifying the word set to be stored corresponding to each sample character by character count to obtain a layered candidate word set of each sample character;
     storing each sample character and the corresponding layered candidate word set in layers to generate the hierarchical inverted index dictionary.
  18. The readable storage media according to claim 16, wherein after the acquiring of the candidate word set corresponding to each candidate character from the preset hierarchical inverted index dictionary, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further perform the following steps:
     combining the candidate characters of the vocabulary to be processed according to a preset strategy to obtain a candidate character combination set, the candidate character combination set comprising at least one candidate character combination;
     acquiring the candidate word set corresponding to each candidate character in each candidate character combination of the candidate character combination set, and performing intersection processing on the candidate word sets in each candidate character combination to obtain the candidate word set collection of the vocabulary to be processed.
  19. The readable storage media according to claim 16, wherein the acquiring, by using the inverted index method, of the candidate word set corresponding to each candidate character from the preset hierarchical inverted index dictionary comprises:
     determining the character count of the vocabulary to be processed;
     using the inverted index method, acquiring a layered word set corresponding to each candidate character from the preset hierarchical inverted index dictionary;
     extracting, from the layered word set corresponding to each candidate character, the word set having the same character count as the vocabulary to be processed as the candidate word set of the corresponding candidate character.
  20. The readable storage media according to claim 16, wherein the determining of the character to be replaced from the vocabulary to be processed by using the edit distance algorithm comprises:
     acquiring each candidate character of the vocabulary to be processed as a target character string;
     calculating an edit distance between the target character string and a preset character string by using the edit distance algorithm, wherein the preset character string is a string corresponding to preset standard data;
     determining the character to be replaced on the basis of the edit distance.
     
PCT/CN2021/091066 2020-06-24 2021-04-29 词汇纠错方法、装置、计算机设备及存储介质 WO2021258853A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010587455.3A CN111737981A (zh) 2020-06-24 2020-06-24 词汇纠错方法、装置、计算机设备及存储介质
CN202010587455.3 2020-06-24

Publications (1)

Publication Number Publication Date
WO2021258853A1 true WO2021258853A1 (zh) 2021-12-30

Family

ID=72652039

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/091066 WO2021258853A1 (zh) 2020-06-24 2021-04-29 词汇纠错方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN111737981A (zh)
WO (1) WO2021258853A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116720812A (zh) * 2023-08-11 2023-09-08 合肥恒艺德机械有限公司 一种基于数据编码的大数据智慧仓储管理系统
CN116719424A (zh) * 2023-08-09 2023-09-08 腾讯科技(深圳)有限公司 一种类型识别模型的确定方法及相关装置

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737981A (zh) * 2020-06-24 2020-10-02 平安科技(深圳)有限公司 词汇纠错方法、装置、计算机设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105468719A (zh) * 2015-11-20 2016-04-06 北京齐尔布莱特科技有限公司 一种查询纠错方法、装置和计算设备
CN108664467A (zh) * 2018-04-11 2018-10-16 广州视源电子科技股份有限公司 候选词评估方法、装置、计算机设备和存储介质
US10402490B1 (en) * 2015-08-14 2019-09-03 Shutterstock, Inc. Edit distance based spellcheck
CN110348020A (zh) * 2019-07-17 2019-10-18 杭州嘉云数据科技有限公司 一种英文单词拼写纠错方法、装置、设备及可读存储介质
CN111079412A (zh) * 2018-10-18 2020-04-28 北京嘀嘀无限科技发展有限公司 文本纠错方法及装置
CN111737981A (zh) * 2020-06-24 2020-10-02 平安科技(深圳)有限公司 词汇纠错方法、装置、计算机设备及存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462085B (zh) * 2013-09-12 2019-04-12 腾讯科技(深圳)有限公司 检索关键词纠错方法及装置
CN110019647B (zh) * 2017-10-25 2023-12-15 华为技术有限公司 一种关键词搜索方法、装置和搜索引擎

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402490B1 (en) * 2015-08-14 2019-09-03 Shutterstock, Inc. Edit distance based spellcheck
CN105468719A (zh) * 2015-11-20 2016-04-06 北京齐尔布莱特科技有限公司 一种查询纠错方法、装置和计算设备
CN108664467A (zh) * 2018-04-11 2018-10-16 广州视源电子科技股份有限公司 候选词评估方法、装置、计算机设备和存储介质
CN111079412A (zh) * 2018-10-18 2020-04-28 北京嘀嘀无限科技发展有限公司 文本纠错方法及装置
CN110348020A (zh) * 2019-07-17 2019-10-18 杭州嘉云数据科技有限公司 一种英文单词拼写纠错方法、装置、设备及可读存储介质
CN111737981A (zh) * 2020-06-24 2020-10-02 平安科技(深圳)有限公司 词汇纠错方法、装置、计算机设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YIBULAYIN MAYIRE, MIJITI ABULIMITI, ASKAR HAMDULLA: "A Minimum Edit Distance Based Uighur Spelling Check", JOURNAL OF CHINESE INFORMATION PROCESSING, vol. 22, no. 3, 1 May 2008 (2008-05-01), pages 110 - 114, XP055882754, ISSN: 1003-0077 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719424A (zh) * 2023-08-09 2023-09-08 腾讯科技(深圳)有限公司 一种类型识别模型的确定方法及相关装置
CN116719424B (zh) * 2023-08-09 2024-03-22 腾讯科技(深圳)有限公司 一种类型识别模型的确定方法及相关装置
CN116720812A (zh) * 2023-08-11 2023-09-08 合肥恒艺德机械有限公司 一种基于数据编码的大数据智慧仓储管理系统
CN116720812B (zh) * 2023-08-11 2023-10-20 合肥恒艺德机械有限公司 一种基于数据编码的大数据智慧仓储管理系统

Also Published As

Publication number Publication date
CN111737981A (zh) 2020-10-02

Similar Documents

Publication Publication Date Title
WO2021258853A1 (zh) 词汇纠错方法、装置、计算机设备及存储介质
WO2021258848A1 (zh) 数据字典生成方法、数据查询方法、装置、设备及介质
JP6998928B2 (ja) データを記憶およびクエリするための方法、装置、設備、および媒体
WO2022142613A1 (zh) 训练语料扩充方法及装置、意图识别模型训练方法及装置
CN110532347B (zh) 一种日志数据处理方法、装置、设备和存储介质
CN111666370B (zh) 面向多源异构航天数据的语义索引方法和装置
WO2019161645A1 (zh) 基于Shell的数据表提取方法、终端、设备及存储介质
WO2021253688A1 (zh) 数据同步方法及装置、数据查询方法及装置
JP2012533819A (ja) 文書インデックス化およびデータクエリングのための方法およびシステム
CN107301214A (zh) 在hive中数据迁移方法、装置及终端设备
EP3926484B1 (en) Improved fuzzy search using field-level deletion neighborhoods
CN110569289A (zh) 基于大数据的列数据处理方法、设备及介质
CN109656947B (zh) 数据查询方法、装置、计算机设备和存储介质
US10558636B2 (en) Index page with latch-free access
CN109213775B (zh) 搜索方法、装置、计算机设备和存储介质
CN116383238B (zh) 基于图结构的数据虚拟化系统、方法、装置、设备及介质
CN114139040A (zh) 一种数据存储及查询方法、装置、设备及可读存储介质
US8321429B2 (en) Accelerating queries using secondary semantic column enumeration
US7672925B2 (en) Accelerating queries using temporary enumeration representation
US20230153455A1 (en) Query-based database redaction
US9305080B2 (en) Accelerating queries using delayed value projection of enumerated storage
WO2022262240A1 (zh) 数据处理方法、电子设备及存储介质
US20170031909A1 (en) Locality-sensitive hashing for algebraic expressions
CN111159218B (zh) 数据处理方法、装置及可读存储介质
CN110471901B (zh) 数据导入方法及终端设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21828309

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21828309

Country of ref document: EP

Kind code of ref document: A1