CN112115232B - Data error correction method, device and server - Google Patents

Publication number
CN112115232B
Authority
CN
China
Prior art keywords
word
search
reference word
dictionary tree
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011016203.1A
Other languages
Chinese (zh)
Other versions
CN112115232A (en
Inventor
韩时通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011016203.1A
Publication of CN112115232A
Application granted
Publication of CN112115232B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the present invention discloses a data error correction method, device and server, the method comprising: obtaining a search word input by a user; matching the search word with a pre-created dictionary tree to obtain a matching result, the dictionary tree comprising multiple nodes, each of the multiple nodes being used to represent a word segment of a reference word in a reference word list; if the matching result indicates that the search word does not match the dictionary tree, obtaining a feature vector of the search word, and determining a target reference word from multiple reference words included in the reference word list according to the feature vector of the search word; and using the content in the database that matches the target reference word as the search result of the search word. The method can accurately and automatically correct the search word and improve the efficiency and accuracy of data query.

Description

Data error correction method, device and server
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data error correction method, apparatus, and server.
Background
With the rapid development of internet technology, the amount of information on the internet keeps growing, and how to obtain the required information more effectively has attracted increasing attention. Most people look for information through a search engine. However, when a user enters a search word, the input often contains a wrong character, extra characters, or missing characters; for example, a user may type a homophone of the intended word (rendered in this translation as typing "cocks" for "principals"). The search engine may then return results that do not meet the user's expectations. The user has to look for the required information across a large number of result pages, and usually notices the input error only after reviewing the results, then corrects the search word and searches again, or keeps replacing the search word until useful information appears. Such a search method does not achieve intelligent query and is inefficient.
Disclosure of Invention
In view of the above, the embodiment of the invention provides a data error correction method, which can accurately perform automatic error correction on search words and improve the efficiency and accuracy of data query.
In a first aspect, an embodiment of the present invention provides a data error correction method, including:
Acquiring search words input by a user;
matching the search word with a pre-created dictionary tree to obtain a matching result, wherein the dictionary tree comprises a plurality of nodes, and each node in the plurality of nodes is used for representing one word segmentation segment of a reference word in a reference word list;
if the matching result indicates that the search word is not matched with the dictionary tree, obtaining a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;
and taking the content matched with the target reference word in the database as the search result of the search word.
In a second aspect, an embodiment of the present invention provides a data error correction apparatus, including:
The data acquisition module is used for acquiring search words input by a user;
the data matching module is used for matching the search word with a pre-established dictionary tree to obtain a matching result, the dictionary tree comprises a plurality of nodes, and each node in the plurality of nodes is used for representing a word segmentation segment of a reference word in a reference word list;
The data determining module is used for acquiring the feature vector of the search word if the matching result indicates that the search word is not matched with the dictionary tree, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;
And the data output module is used for taking the content matched with the target reference word in the database as the search result of the search word.
In a third aspect, an embodiment of the present application provides a server, where the server includes a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected to each other, where the memory is configured to store a computer program, and the computer program includes program instructions, and the processor is configured to invoke the program instructions to perform an operation related to the foregoing data error correction method.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium storing a computer program, where the processor executes a program related to the above-mentioned data error correction method.
In a fifth aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs a data error correction method as described above.
According to the embodiment of the invention, for the obtained search word, the search word is firstly matched with the pre-established dictionary tree, whether the search word needs error correction is determined according to the obtained matching result, when the matching result indicates that the search word is not matched with the dictionary tree, the feature vector of the search word is obtained, the target reference word is determined according to the similarity between the feature vector of the search word and a plurality of reference words included in the reference word list, and finally, the content matched with the target reference word in the database is used as the search result of the search word, so that the automatic error correction can be accurately performed on the search word, and the efficiency and the accuracy of data query are improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a data retrieval system according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a data error correction method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating steps for creating a dictionary tree provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a dictionary tree according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an error correction log interface provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a data error correction device according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, cloud storage, big data processing, operation/interaction systems, electromechanical integration, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Cloud computing is a computing model that distributes computing tasks across a large pool of computers, enabling various application systems to acquire computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". From the user's point of view, resources in the cloud appear infinitely expandable: they can be acquired at any time, used on demand, expanded at any time, and paid for according to use.
Cloud storage is a newer concept that extends and develops the concept of cloud computing. A distributed cloud storage system is a storage system that, through functions such as cluster applications, grid technology, and distributed storage file systems, uses application software or application interfaces to make a large number of storage devices of different types in a network (storage devices are also called storage nodes) work cooperatively, and that jointly provides data storage and service access functions.
A database can be regarded as an electronic filing cabinet, that is, a place for storing electronic files, in which users can add, query, update, and delete data. A "database" is a collection of data that is stored together in a certain manner, can be shared among multiple users, has as little redundancy as possible, and is independent of the application.
When the data error correction method provided by the application is used to correct an input search word, technologies such as cloud computing, cloud storage, and databases within the field of artificial intelligence are required. By matching the search word input by the user against the dictionary tree, the search word can be corrected automatically, which improves the efficiency and accuracy of data queries.
Before explaining the embodiment of the present application in detail, an application scenario of the embodiment of the present application is described.
The data error correction method in the embodiment of the application can be applied to correct search words on small and medium-sized portal websites, for example government service websites. Existing government service websites often have little technical accumulation and lack network-related talent, which makes subsequent operation of the website difficult; moreover, after the system goes online only its availability is guaranteed, not its practical usefulness. The government service website is only an illustration; the method can also be applied to other small and medium-sized portal websites, such as enterprise portals.
Fig. 1 is a schematic diagram of an architecture of a data retrieval system according to an embodiment of the present invention. The data retrieval system may include a user terminal 101, a network 102, and a server 103; the user terminal 101 and the server 103 communicate via the network 102. The user terminal 101 obtains a search word, which may be text input by a user of the user terminal 101, and sends it to the server 103 through the network 102. The server 103 matches the search word to determine whether an accurate search result can be obtained by using the search word directly, for example by determining whether the search word is in a dictionary tree. If the search word is determined not to match the dictionary tree, the server automatically corrects it and uses the content matching the target reference word obtained by the correction as the search result of the search word. The network 102 may include various connection types, such as wired or wireless communication links or optical fiber cables. The user terminal 101 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The server 103 may be implemented as a stand-alone server or as a server cluster formed by a plurality of servers, for example a government server or a server cluster of a government platform; it may also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
It may be understood that the schematic diagram of the architecture of the system described in the embodiment of the present application is for more clearly describing the technical solution of the embodiment of the present application, and does not constitute a limitation on the technical solution provided by the embodiment of the present application, and those skilled in the art can know that, with the evolution of the architecture of the system and the appearance of a new service scenario, the technical solution provided by the embodiment of the present application is equally applicable to similar technical problems.
In one embodiment, as shown in fig. 2, a data error correction method is provided according to an embodiment of the present invention based on the data retrieval system of fig. 1. The present embodiment is mainly exemplified by the application of the method to the server 103 in fig. 1, and includes the following steps:
Step S201, obtaining search words input by a user.
In the embodiment of the invention, when a user needs to search for information through the user terminal, the user can input a search word in the search box. The user terminal thereby obtains the search word and sends it to the server, and the server obtains the search word. The search box is an interaction control in the search engine system that is used to retrieve the relevant content from a large amount of information according to the text entered in it.
It should be noted that, in practical applications, the user may type the search word into the search box, input it by voice, and so on; the embodiment of the present application does not limit the way in which the user inputs the search word.
Step S202, matching the search word with a pre-created dictionary tree to obtain a matching result, wherein the dictionary tree comprises a plurality of nodes, and each node in the plurality of nodes is used for representing a word segmentation segment of a reference word in a reference word list.
A dictionary tree (trie), also called a prefix tree, is a tree-shaped data structure made up of a plurality of nodes, and can be used for string matching, fast lookup, and similar tasks. It minimizes meaningless string comparisons and improves the efficiency of word frequency statistics and string sorting. The key idea is to exploit the common prefixes of strings: by building a tree structure, it trades space for time and reduces the cost of a query. A dictionary tree generally has three properties: 1) the root node contains no characters, and every node other than the root node contains exactly one character string, which may be a word segmentation segment of a reference word in the reference word list; 2) concatenating the character strings along the path from the root node to a certain leaf node yields the combined character string corresponding to that node, and each combined character string may be a reference word; 3) the child nodes of any given node all contain different character strings.
In the embodiment of the application, the dictionary tree is created in advance as follows. A plurality of reference words are obtained from the reference word list, and word segmentation is performed on each reference word to obtain its word segmentation segments. Taking each segment as a unit, the segments of each reference word are stored in order in successive nodes along one path of the dictionary tree. While the tree is being created, each segment of a reference word, starting from the first, is checked against the existing nodes: if a node for that segment already exists, the pointer is moved to that node; otherwise, a new node is created for the segment.
In one embodiment, the server obtains a plurality of original corpora from the database, wherein the corpora include some key data, and the key data includes keywords and corresponding occurrence numbers. The server can also obtain the search records of the user in a certain time from the search engine, obtain a plurality of search words and corresponding occurrence times, and generate a reference word list by counting key data and key words and search words in the search records of the user.
Specifically, after receiving the search word sent by the user terminal, the server performs word segmentation on it to obtain a plurality of word segmentation segments. Starting with the first segment of the search word and with the first-layer child nodes of the root node, the server checks whether a node exists for the current segment. If a node exists and matches the segment, the server continues from that node with the next segment of the search word. If some segment of the search word has no corresponding node in the dictionary tree, the server generates a matching result indicating that the search word does not match the dictionary tree.
As a specific example of this embodiment, suppose the search word input at the user terminal is "driver license order", and the dictionary tree contains nodes for each word segmentation segment of the reference word "driver license claim". Word segmentation of the search word yields the segments "driver", "license", and "order", which are matched in turn against the nodes of the dictionary tree. When the segment "order" is reached, no corresponding node can be found in the dictionary tree, because "claim" was mistyped as "order"; the search word is therefore determined not to match the dictionary tree.
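The segment-by-segment matching described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the nested-dictionary layout, the `END` marker, and the function name `matches` are assumptions made for the example, and the requirement that a match must end on a complete reference word is likewise an assumption.

```python
from typing import Dict, List

# Hypothetical trie layout: nested dicts keyed by word segmentation segment.
# The END key marks the end of a complete reference word (an assumption of
# this sketch; the source does not specify how word ends are recorded).
END = "$"

trie: Dict = {
    "driver": {"license": {"claim": {END: True}}},
}


def matches(trie: Dict, segments: List[str]) -> bool:
    """Return True if the segments trace a path from the root to a word end."""
    node = trie
    for seg in segments:
        if seg not in node:
            # No node exists for this segment: the search word does not match.
            return False
        node = node[seg]
    return END in node


print(matches(trie, ["driver", "license", "claim"]))
print(matches(trie, ["driver", "license", "order"]))
```

The second call fails at the segment "order", which mirrors the "driver license order" example: the first two segments are found, and the mismatch is detected only at the third.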
Step S203, if the matching result indicates that the search word is not matched with the dictionary tree, obtaining a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word.
A feature vector is a vector that represents the semantics of natural-language text. A text representation algorithm can be used to vectorize the search word and the reference words in the reference word list to obtain the corresponding feature vectors; such algorithms include methods based on the vector space model, methods based on topic models, methods based on neural networks, and the like.
Specifically, to determine the target reference word, the feature vector of each reference word in the reference word list is obtained first. The similarity between the feature vector of the search word and the feature vector of each reference word is then calculated, the similarities are sorted, and the reference word with the highest similarity is taken as the target reference word.
The similarity measures how close the feature vector of the search word is to the feature vector of each reference word in the reference word list. It may be calculated with a similarity algorithm, including but not limited to the Euclidean distance, cosine similarity, the Pearson correlation coefficient, and the Jaccard similarity coefficient.
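For reference, the four measures named above can be written in a few lines of plain Python. This is a sketch using the standard textbook definitions; the function names are chosen for the example, and the source does not say which measure or which implementation is actually used.

```python
import math
from typing import Sequence, Set


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def pearson_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson coefficient, computed as the cosine of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)
```

Note that the Euclidean distance is a distance rather than a similarity, so when it is used for ranking, the reference word with the smallest value is the closest one.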
For example, suppose the search word input at the user terminal is "driver license order" and the reference word list includes the reference word "driver license claim". The server converts "driver license order" into a feature vector and calculates its similarity to the feature vector of each reference word in the reference word list. When "driver license claim" is returned as the reference word with the highest similarity, it is used as the target reference word.
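The selection of the target reference word can be illustrated with a toy feature vector. In the sketch below, character-bigram counts stand in for the feature vector; this is an assumption made so the example runs without a trained model, and a real system would more likely use one of the text representation methods mentioned earlier. The names `feature_vector`, `cosine`, and `pick_target`, and the sample reference words, are hypothetical.

```python
import math
from collections import Counter
from typing import List


def feature_vector(text: str) -> Counter:
    """Toy feature vector: counts of character bigrams (a stand-in only)."""
    padded = f" {text} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def pick_target(search_word: str, reference_words: List[str]) -> str:
    """Return the reference word whose vector is most similar to the query."""
    query = feature_vector(search_word)
    return max(reference_words, key=lambda w: cosine(query, feature_vector(w)))


reference_words = ["driver license claim", "public accumulation fund"]
print(pick_target("driver license order", reference_words))
```

Because `max` scans every reference word, the cost grows linearly with the size of the reference word list; precomputing the reference vectors once would avoid recomputing them for each query.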
Step S204, taking the content matched with the target reference word in the database as the search result of the search word.
Specifically, after obtaining the target reference word, the server searches the database directly with the target reference word and uses the result found as the search result of the search word. For example, when the search word input at the user terminal is "driver license order" and the target reference word "driver license claim" has been obtained through steps S202 and S203, the server searches the database directly with "driver license claim" and uses the corresponding data resources as the search result of the search word.
With this data error correction method, after the search word is obtained, it is matched against the pre-created dictionary tree, and the matching result determines whether the search word needs to be corrected. When the matching result indicates that the search word does not match the dictionary tree, the target reference word is determined according to the similarity between the feature vector of the search word and the plurality of reference words included in the reference word list. Finally, the content in the database that matches the target reference word is used as the search result of the search word. The method can thus intelligently judge whether the search word contains an error and correct it without the user having to do so manually, which effectively improves the efficiency of data queries.
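Steps S201 to S204 can be tied together in a short end-to-end sketch. Everything concrete here is an assumption for illustration: whitespace splitting stands in for word segmentation, `difflib.SequenceMatcher` from the Python standard library is swapped in for the feature-vector similarity, and `REFERENCE_WORDS` and `DATABASE` are hypothetical in-memory stand-ins for the reference word list and the database.

```python
from difflib import SequenceMatcher
from typing import Dict, List

# Hypothetical reference word list and database contents.
REFERENCE_WORDS = ["driver license claim", "public accumulation fund"]
DATABASE = {
    "driver license claim": ["How to claim a driver license"],
    "public accumulation fund": ["Public accumulation fund withdrawal guide"],
}
END = "$"  # marks the end of a complete reference word (assumption)


def build_trie(words: List[str]) -> Dict:
    """Build a nested-dict trie; whitespace splitting stands in for segmentation."""
    trie: Dict = {}
    for word in words:
        node = trie
        for seg in word.split():
            node = node.setdefault(seg, {})
        node[END] = True
    return trie


def in_trie(trie: Dict, word: str) -> bool:
    """S202: match the search word against the trie segment by segment."""
    node = trie
    for seg in word.split():
        if seg not in node:
            return False
        node = node[seg]
    return END in node


def search(word: str, trie: Dict) -> List[str]:
    """S203 and S204: correct the word if needed, then look it up."""
    if not in_trie(trie, word):
        # Similarity stand-in: string ratio instead of feature vectors.
        word = max(
            REFERENCE_WORDS,
            key=lambda ref: SequenceMatcher(None, word, ref).ratio(),
        )
    return DATABASE.get(word, [])


TRIE = build_trie(REFERENCE_WORDS)
print(search("driver license order", TRIE))
```

The trie check acts as a cheap gate: a correctly typed search word goes straight to the database lookup, and the more expensive similarity scan runs only when the gate fails.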
In one embodiment, as shown in fig. 3, before the search word is matched with the pre-created dictionary tree to obtain a matching result, the method further includes a step of creating the dictionary tree, where the step specifically includes the following steps:
Step 301, extracting key data from contents included in a database, wherein the key data comprises a plurality of keywords and the occurrence number of each keyword in the plurality of keywords;
The server obtains a large number of original corpora from the database, extracts the keywords they contain (such as "driver license claim"), and counts the number of occurrences of each keyword.
Step 302, obtaining a user search record, wherein the user search record comprises a plurality of search words and the occurrence frequency of each search word in the plurality of search words;
Specifically, the server may obtain a search record of a user in a certain time from the search engine, and obtain a plurality of search terms and corresponding occurrence numbers.
In one embodiment, the server obtains the search records of users within a certain period from the search engine and sorts them by time. For a given user, search behavior is generally segmented, with a relatively obvious interval between segments; each segment is called a search session. The server obtains the search words of each search session of the user and counts their occurrences. Because a user's searches within one session usually aim at solving a single problem, the search words entered in that session tend to be related. Counting the search words within a session therefore makes it possible to mine repeated words or synonyms, which makes the reference word list more complete and improves query accuracy.
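One simple way to split a user's time-ordered search records into sessions is to start a new session whenever the gap between consecutive searches exceeds a threshold. The sketch below assumes a 30-minute gap; the source gives no threshold, and `SESSION_GAP`, `split_sessions`, and the sample records are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical threshold: the source only says sessions are separated by a
# "relatively obvious interval" and does not give a value.
SESSION_GAP = timedelta(minutes=30)


def split_sessions(records: List[Tuple[datetime, str]]) -> List[List[str]]:
    """Group one user's (timestamp, search word) records into sessions."""
    sessions: List[List[str]] = []
    previous_time = None
    for timestamp, word in sorted(records):
        if previous_time is None or timestamp - previous_time > SESSION_GAP:
            sessions.append([])  # gap too large: start a new session
        sessions[-1].append(word)
        previous_time = timestamp
    return sessions


records = [
    (datetime(2020, 9, 18, 10, 0), "driver license order"),
    (datetime(2020, 9, 18, 10, 2), "driver license claim"),
    (datetime(2020, 9, 18, 15, 0), "public accumulation fund"),
]
sessions = split_sessions(records)
counts = Counter(word for session in sessions for word in session)
print(sessions)
```

Here the first two searches fall into one session, which is the kind of grouping that lets a misspelled search word be associated with the corrected word the user typed next.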
In one embodiment, step 302 may be performed first, and then step 301 may be performed. The specific order of steps 301 and 302 is not limited in this embodiment of the present invention.
Step 303, creating a reference word list according to the key data and the user search record.
Specifically, the keywords in the key data and the search words in the user search record are counted to generate the reference word column of the reference word list, and the number of occurrences of each reference word is used as the word frequency of that reference word.
Illustratively, the list of reference words is shown in Table 1:
Table 1. List of reference words
Reference word               Word frequency   Creation time
Long nurse                   42               2020-09-18 10:04:01
Accumulation of money        210              2020-09-18 10:04:01
Fertility body paste         21               2020-09-18 10:04:01
Fertility patch              22               2020-09-18 10:04:01
Driver license declaration   125              2020-09-18 10:04:01
Xishan house                 1                2020-09-18 10:04:01
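A reference word list of this shape can be assembled by merging the keyword counts from the database with the search-word counts from the user search records. The sketch below is one possible reading: summing the two counts into a single word frequency is an assumption, and the function name `build_reference_list` and the way the sample counts are split between the two sources are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List, Tuple

Row = Tuple[str, int, datetime]  # (reference word, word frequency, creation time)


def build_reference_list(
    keyword_counts: Dict[str, int],
    search_counts: Dict[str, int],
    created_at: datetime,
) -> List[Row]:
    """Merge both count sources and emit rows sorted by descending frequency."""
    merged = Counter(keyword_counts)
    merged.update(search_counts)  # adds counts for words present in both sources
    return [(word, freq, created_at) for word, freq in merged.most_common()]


# Hypothetical split of the frequencies between the two sources.
keyword_counts = {"Driver license declaration": 100, "Accumulation of money": 200}
search_counts = {"Driver license declaration": 25, "Accumulation of money": 10}
rows = build_reference_list(
    keyword_counts, search_counts, datetime(2020, 9, 18, 10, 4, 1)
)
for row in rows:
    print(row)
```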
Step 304, obtaining a reference word list, wherein the reference word list comprises a plurality of reference words;
Step 305, performing word segmentation processing on each reference word in the plurality of reference words to obtain a plurality of word segmentation segments of each reference word;
In this embodiment, word segmentation refers to splitting a text sequence into individual words, and a word segmentation segment is one of the pieces obtained by processing the text sequence in this way. For example, if the search word input by the user is "driving license claim", word segmentation processing splits it into several pieces, and each piece (for example "driver") is a word segmentation segment.
Step 306, generating nodes of the dictionary tree according to each word segmentation segment of each reference word so as to create the dictionary tree corresponding to the plurality of reference words.
In this embodiment, the word segmentation segments of each reference word in the reference word list are stored in order in successive nodes along one path of the dictionary tree. When creating the dictionary tree, the server checks, starting from the first (earliest-occurring) segment of the reference word, whether a node already exists for the segment; if so, the pointer is moved to that node, and if not, a node is created for the segment.
For example, as shown in fig. 4, take three reference words included in the reference word list: "public accumulation fund" and two reference words that both begin with the segment "child". The server creates a root node; all nodes below the root node are child nodes. For the first segments "public" and "child" of these reference words, the server determines that no child node connected to the root node matches the segment, and creates child nodes "public" and "child" connected to the root node. Similarly, for the segment "accumulation" in the reference word "public accumulation fund", the server determines that no child node connected to the node "public" matches the segment, and creates a child node "accumulation" connected to the node "public"; for the segment "gold" in the same reference word, the server determines that no child node connected to the node "accumulation" matches the segment, and creates a child node "gold" connected to the node "accumulation". The server thereby obtains the tree structure shown in fig. 4 (1). The remaining reference words are processed in the same way to obtain the tree structure shown in fig. 4.
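The node creation described for steps 304 to 306 can be sketched with nested dictionaries. This is an illustration under stated assumptions: the reference words are taken as already segmented, the pinyin syllables from Table 3 stand in for the individual segments, and the names `build_trie` and `reference_segments` are hypothetical.

```python
def build_trie(segmented_words):
    """Build a dictionary tree (trie) from words given as sequences of segments.

    Each node is a dict that maps a segment to its child node; the root is
    an empty-keyed dict that holds the first-layer child nodes.
    """
    root = {}
    for segments in segmented_words:
        node = root
        for seg in segments:
            # Reuse the child node if it exists, otherwise create it.
            node = node.setdefault(seg, {})
    return root

# Assumption: pinyin syllables stand in for the word segmentation segments.
reference_segments = [
    ("gong", "ji", "jin"),
    ("sheng", "yu", "jin", "tie"),
    ("sheng", "yu", "bu", "tie"),
]
trie = build_trie(reference_segments)
```

The two words that begin with "sheng", "yu" share those two nodes and only branch at the third segment, which is the sharing behaviour the paragraph above describes.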
In one embodiment, matching the search term with a pre-created dictionary tree to obtain a matching result includes: performing word segmentation processing on the search word to obtain a plurality of word segmentation fragments of the search word; matching each word segmentation segment in the plurality of word segmentation segments with each node in a pre-created dictionary tree; if the word segmentation segment is not matched with the nodes in the dictionary tree, a matching result is generated, and the matching result is used for indicating that the search word is not matched with the dictionary tree.
In this embodiment, after receiving a search word input by a user terminal, the server performs word segmentation processing on the search word to obtain a plurality of word segmentation segments. Starting from the first segment, the server matches the segments against the pre-created dictionary tree: it checks whether a first-layer child node of the root node matches the current segment and, if so, continues matching the next segment of the search word from that node. If some segment of the search word has no matching node in the dictionary tree, the search word is considered not to match the dictionary tree, and a matching result indicating this is generated.
As a specific example of this embodiment, suppose the search term input by the user terminal is "driver license order". After word segmentation processing, the segments of the search term, the last of which is "command", are matched one by one against the nodes in the dictionary tree. Because the segment "collar" was mistakenly input as "command", no corresponding node can be found in the dictionary tree when the segment "command" is matched, so it is determined that the search term does not match the dictionary tree.
According to the embodiment, the obtained search words are matched with the dictionary tree, so that whether the search words need error correction or not can be effectively confirmed, and the accuracy of search results returned by inquiry is ensured.
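The matching walk described above can be sketched as follows. The English glosses used as segments ("driver", "license", "claim", "collar", "command") are illustrative stand-ins for the individual segments of the example, and both function names are hypothetical.

```python
def build_trie(segmented_words):
    """Build a nested-dict dictionary tree from sequences of segments."""
    root = {}
    for segments in segmented_words:
        node = root
        for seg in segments:
            node = node.setdefault(seg, {})
    return root

def matches_trie(trie, segments):
    """Return True if every segment finds a node along one path from the root."""
    node = trie
    for seg in segments:
        if seg not in node:
            return False  # this segment has no node: the search word needs correction
        node = node[seg]
    return True

# Assumption: glosses stand in for the segments of the reference word.
trie = build_trie([("driver", "license", "claim", "collar")])
correct = matches_trie(trie, ("driver", "license", "claim", "collar"))
mistyped = matches_trie(trie, ("driver", "license", "claim", "command"))
```

As written, a search word that is only a prefix of a reference word also counts as matched, which follows the description literally; an implementation could additionally require an end-of-word marker.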
In one embodiment, after obtaining the feature vector of each reference word in the plurality of reference words included in the reference word list, the server calculates the similarity between the feature vector of the search word and the feature vector of each reference word. The calculation may be run concurrently over sub-reference word lists by a plurality of preset threads, each yielding the reference word with the highest similarity in its sub-reference word list. Specifically, the server may use bigram, trigram or other strategies to control the split size, and thus the amount of calculation, and divide the reference word list into K sub-reference word lists, where K is greater than or equal to 2 and each sub-reference word list contains N reference words, where N is greater than or equal to 1. The server calls the preset threads to calculate, for each sub-reference word list, the similarity between the feature vector of the search word input by the user and the feature vectors of its N reference words, so as to obtain the reference word with the highest similarity in each sub-reference word list. These results are then sorted in descending order of similarity, and the reference word with the highest similarity is taken as the target reference word.
In this embodiment, each thread may calculate the similarity between the feature vector of the search term and the feature vectors of N reference terms in one sub-reference list, and multiple threads may process at the same time, so that the time required for searching may be reduced, and the processing speed of searching may be improved.
As a specific example of this embodiment, the server splits the reference word list of Table 1 into two sub-reference word lists and obtains the feature vectors of the reference words in each of them, yielding reference word feature vector list (a) and reference word feature vector list (b), as shown in Table 2. For example, when the user terminal inputs the search word "driver license claim", the server converts it into a feature vector and calculates its similarity with the feature vector of each reference word in lists (a) and (b). The calculation against list (a) returns [principal=0.0123], and the calculation against list (b) returns [driver license claim=0.89102]. From the results returned for lists (a) and (b), the server selects the reference word with the highest similarity, "driver license claim", as the target reference word.
Illustratively, the reference word feature vector lists are shown in Table 2:
Table 2 Reference word feature vector list (a)
Reference word               Feature vector
Nurse certificate            [0.1233422,10.1292920d,101.929101]
Accumulation of money        [1.1233422,10.1292920d,101.929101]
Fertility body paste         [2.1233422,10.1292920d,101.929101]
Table 2 Reference word feature vector list (b)
Reference word               Feature vector
Fertility patch              [3.1233422,10.1292920d,101.929101]
Driver license declaration   [4.1233422,10.1292920d,101.929101]
Xishan house                 [5.1233422,10.1292920d,101.929101]
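The split-and-compare procedure can be sketched with a thread pool. Several points are assumptions: the text does not name the similarity measure, so cosine similarity is used as a stand-in; the feature vectors are small hypothetical values rather than those of Table 2; and the round-robin split and all function names are illustrative.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split(items, k):
    """Split items into k sub-lists of nearly equal size (round-robin)."""
    return [items[i::k] for i in range(k)]

def best_in_sublist(query_vec, sublist):
    """Return (similarity, reference_word) of the best match in one sub-list."""
    return max((cosine(query_vec, vec), word) for word, vec in sublist)

def find_target(query_vec, reference_vectors, k=2):
    """Search K sub-reference word lists concurrently and keep the overall best."""
    sublists = [s for s in split(list(reference_vectors.items()), k) if s]
    if not sublists:
        return None
    with ThreadPoolExecutor(max_workers=len(sublists)) as pool:
        bests = list(pool.map(lambda s: best_in_sublist(query_vec, s), sublists))
    return max(bests)

# Hypothetical feature vectors, for illustration only.
reference_vectors = {
    "Accumulation of money": [0.0, 1.0, 0.0],
    "Fertility patch": [0.0, 0.0, 1.0],
    "Driver license declaration": [1.0, 0.0, 0.0],
    "Xishan house": [0.0, 0.7, 0.7],
}
score, target = find_target([1.0, 0.0, 0.0], reference_vectors, k=2)
```

In CPython, threads running pure-Python arithmetic do not execute in parallel because of the global interpreter lock, so this sketch shows the partitioning rather than a real speed-up; a vectorized library or separate processes would be needed for that.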
In one embodiment, after the reference word list is obtained, the reference words in it may be converted into pinyin. The pinyin of each reference word is then subjected to word segmentation processing, and a pinyin dictionary tree is constructed from the resulting pinyin word segmentation segments.
Illustratively, the pinyin list of the reference words is shown in Table 3:
Table 3 Pinyin list of reference words
Reference word               Pinyin            Creation time
Long nurse                   hushizhang        2020-09-18 10:05:10
Accumulation of money        gongjijin         2020-09-18 10:05:10
Fertility body paste         shengyujintie     2020-09-18 10:05:10
Fertility patch              shengyubutie      2020-09-18 10:05:10
Driver license declaration   jiazhaoshenling   2020-09-18 10:05:10
Xishan house                 xishanju          2020-09-18 10:05:10
If, after the server matches the search word input by the user with the dictionary tree, the matching result shows that a word segmentation segment of the search word has no corresponding node in the dictionary tree, the server further matches the pinyin of the search word against the pinyin dictionary tree. If the pinyin of the search word finds corresponding nodes in the pinyin dictionary tree, the reference word corresponding to the pinyin character sequence of the matched nodes is used as the target reference word. If it does not, the server calculates the similarity between the feature vector of the search word's pinyin and the feature vector of each pinyin in the pinyin list of the reference words, obtains the pinyin with the highest similarity, and uses the reference word corresponding to that pinyin as the target reference word.
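The pinyin fallback can be sketched as follows, using the entries of Table 3. The sketch makes three assumptions that are not in the text: the pinyin is segmented letter by letter, an end marker stores the reference word at the end of each path, and `difflib.SequenceMatcher` is swapped in for the feature-vector similarity. Converting a search word to pinyin is taken as already done.

```python
from difflib import SequenceMatcher

END = "$"  # marker key that stores the reference word at the end of a pinyin path

def build_pinyin_trie(pinyin_to_word):
    """Build a letter-level pinyin dictionary tree (assumed segmentation)."""
    root = {}
    for pinyin, word in pinyin_to_word.items():
        node = root
        for ch in pinyin:
            node = node.setdefault(ch, {})
        node[END] = word
    return root

def lookup(trie, pinyin):
    """Return the reference word stored for an exact pinyin path, else None."""
    node = trie
    for ch in pinyin:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(END)

def correct_by_pinyin(query_pinyin, pinyin_to_word):
    """Try the pinyin tree first, then fall back to the most similar pinyin."""
    hit = lookup(build_pinyin_trie(pinyin_to_word), query_pinyin)
    if hit is not None:
        return hit
    # Stand-in similarity; the text uses feature vectors of the pinyin instead.
    best = max(pinyin_to_word,
               key=lambda p: SequenceMatcher(None, query_pinyin, p).ratio())
    return pinyin_to_word[best]

pinyin_to_word = {
    "hushizhang": "Long nurse",
    "gongjijin": "Accumulation of money",
    "shengyujintie": "Fertility body paste",
    "shengyubutie": "Fertility patch",
    "jiazhaoshenling": "Driver license declaration",
    "xishanju": "Xishan house",
}
exact = correct_by_pinyin("jiazhaoshenling", pinyin_to_word)
fuzzy = correct_by_pinyin("gongjijn", pinyin_to_word)  # hypothetical typo
```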
In one embodiment, the reference word list further includes a word frequency of each reference word in the plurality of reference words, and the step of using the reference word with the highest corresponding similarity as the target reference word includes: acquiring a first reference word with the highest corresponding similarity and a second reference word with the second highest corresponding similarity; obtaining a difference value between the similarity corresponding to the first reference word and the similarity corresponding to the second reference word; judging whether the difference value is smaller than or equal to a preset difference value threshold value or not; if yes, inquiring word frequency of the first reference word and word frequency of the second reference word from the reference word list, and taking the reference word with the highest word frequency in the first reference word and the second reference word as a target reference word; and if not, taking the first reference word as a target reference word.
In this embodiment, after the similarity between the feature vector of the search word and the feature vectors of the plurality of reference words in the reference word list is calculated, the first reference word with the highest similarity and the second reference word with the second highest similarity are obtained. When the edit distances between the search word and both of these reference words are small, the difference between the two similarities may be smaller than or equal to the preset difference threshold; in that case the word frequencies of the two reference words are used as the basis, and the one with the higher word frequency is selected as the target reference word. For example, suppose the search word input by the user is "fertility" and the reference word list includes the reference words "Fertility body paste" and "Fertility patch". The edit distance between the search word and each of the two reference words is 2, and the difference between the two similarities is smaller than the preset difference threshold. Since the word frequency of "Fertility body paste" is 21 and the word frequency of "Fertility patch" is 22, "Fertility patch" is taken as the target reference word.
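The tie-break rule can be sketched in a few lines. The similarity values and the default threshold of 0.05 are hypothetical; the word frequencies are those of Table 1, and `choose_target` is an illustrative name.

```python
def choose_target(scored, word_freq, diff_threshold=0.05):
    """Pick the target reference word from (reference_word, similarity) pairs.

    If the top two similarities differ by no more than diff_threshold, the
    word with the higher word frequency wins; otherwise the top word wins.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    first = ranked[0]
    if len(ranked) == 1:
        return first[0]
    second = ranked[1]
    if first[1] - second[1] <= diff_threshold:
        return max((first[0], second[0]), key=lambda w: word_freq.get(w, 0))
    return first[0]

# Hypothetical similarities; word frequencies taken from Table 1.
scored = [
    ("Fertility body paste", 0.81),
    ("Fertility patch", 0.80),
    ("Xishan house", 0.10),
]
word_freq = {"Fertility body paste": 21, "Fertility patch": 22, "Xishan house": 1}
close_call = choose_target(scored, word_freq)
clear_win = choose_target(scored, word_freq, diff_threshold=0.001)
```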
In one embodiment, the data error correction method further comprises the steps of: if the matching result indicates that the search word is matched with the dictionary tree, searching candidate content matched with the search word from a database; acquiring the correlation degree between the candidate content and the search word; if the correlation degree is smaller than or equal to a preset correlation degree threshold value, acquiring a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word; and taking the content matched with the target reference word in the database as a search result of the search word.
In this embodiment, word segmentation fragments of search words input by a user are matched with a dictionary tree, when the word segmentation fragments of the search words input by the user can find corresponding nodes in the dictionary tree, the matching result indicates that the search words are matched with the dictionary tree, the search words input by the user are queried in a database, and corresponding search results are obtained as candidate contents.
Further, the relevance between the search word and the candidate content is obtained. For example, if the candidate content returned from the database consists of ten web page articles, a relevance algorithm such as the BM25 algorithm is used to calculate the relevance between the search word and each web page article, and the average of the ten relevance values is compared with a preset relevance threshold. If the full score is 100 and the preset relevance threshold is 50, then when the relevance between the search word input by the user and the candidate content is lower than 50, the target reference word is determined from the plurality of reference words included in the reference word list according to the feature vector of the search word.
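The threshold check can be sketched as a small decision function. It assumes the per-article relevance values have already been computed and scaled to a 0 to 100 range (how BM25 scores would be mapped onto that scale is not stated in the text), and it treats an empty result as low relevance, which is also an assumption.

```python
def needs_correction(relevance_scores, threshold=50.0):
    """Decide whether a matched search word should still be error-corrected.

    relevance_scores: relevance of the search word to each candidate article,
    assumed to be on a 0-100 scale.
    """
    if not relevance_scores:
        return True  # assumption: no candidate content counts as low relevance
    average = sum(relevance_scores) / len(relevance_scores)
    return average <= threshold

relevant = needs_correction([80, 70, 90, 65, 75, 85, 60, 95, 70, 80])
irrelevant = needs_correction([10, 20, 30, 15, 25, 5, 40, 35, 20, 10])
```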
In one embodiment, after a search word input by a user is queried in a database to obtain a corresponding search result as candidate content, whether the candidate content is a search result interesting to the user can be calculated by using a click model, for example, the click probability of the user clicking on a webpage article in the candidate content is calculated, when the predicted click probability is lower than a preset threshold value, a target reference word is determined from a plurality of reference words included in a reference word list according to a feature vector of the search word, and the content matched with the target reference word in the database is used as the search result of the search word.
The click model builds a probabilistic graphical model under certain assumptions by mining information such as the search words, the search content corresponding to each search word, and the search results that were clicked within that content, so as to model the search behavior of the user. Click models include, but are not limited to, the cascade model, the dynamic Bayesian network model, and the like.
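As one concrete instance of the click models mentioned above, the cascade model can be sketched as follows. In that model the user scans results from top to bottom and clicks the first attractive one, so the click probability at a rank is the attractiveness of that result times the probability that no earlier result was clicked. The attractiveness values here are hypothetical; estimating them from click logs is not shown.

```python
def cascade_click_probs(attractiveness):
    """Per-rank click probabilities under the cascade model."""
    probs = []
    no_click_so_far = 1.0
    for r in attractiveness:
        probs.append(no_click_so_far * r)
        no_click_so_far *= 1.0 - r
    return probs

def any_click_prob(attractiveness):
    """Probability that the user clicks at least one result."""
    return sum(cascade_click_probs(attractiveness))

# Hypothetical attractiveness of the top two candidate articles.
probs = cascade_click_probs([0.5, 0.5])
total = any_click_prob([0.5, 0.5])
```

A predicted value such as `total` could then be compared with the preset threshold to decide whether to fall back to the target reference word.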
In one embodiment, the data error correction method further comprises the steps of: obtaining a search word error correction log, wherein the search word error correction log comprises a plurality of error correction records, and each error correction record in the plurality of error correction records comprises an input search word and a corresponding target reference word; acquiring a reference word to be added which is input according to an error correction record with errors in the search word error correction log; and adding the reference words to be added into the reference word list, and updating the dictionary tree.
As shown in fig. 5, the search word input by the user and the corresponding target reference word are recorded in the error correction log, so as to obtain a plurality of error correction records. For example, the content of a government website may use only the formal term for a driver's license and never the colloquial term "driver license", so the colloquial term is not added to the reference word list in Table 1, and a search for it is likely to be wrongly corrected to "passport". When an operator obtains the error correction log from the server and finds such an automatic error correction mistake, the operator can enter the colloquial term as the reference word to be added, and the server adds the reference word to be added to the reference word list.
Further, the server searches the reference word to be added in the dictionary tree, matches nodes in the dictionary tree with word segmentation fragments of the reference word to be added, and updates the nodes of the dictionary tree when the matching result indicates that the reference word to be added cannot find corresponding nodes in the dictionary tree for matching. Therefore, the nodes in the dictionary tree can be effectively and dynamically updated according to the error correction log, so that the reference words are effectively expanded, and the completeness and accuracy of the reference words stored in the dictionary tree are enhanced.
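The update step can be sketched as follows. The initial word frequency of 1 for a newly added word, the function names, and the choice of "xishanju" as the word an operator enters are assumptions made purely for illustration; pinyin syllables again stand in for the segments.

```python
def insert_word(trie, segments):
    """Insert segments into the trie; return True if any new node was created."""
    node, created = trie, False
    for seg in segments:
        if seg not in node:
            node[seg] = {}
            created = True
        node = node[seg]
    return created

def add_reference_word(reference_list, trie, word, segments, freq=1):
    """Add a reference word to the list and update the dictionary tree.

    Assumption: a newly added word starts with word frequency `freq`.
    """
    if word not in reference_list:
        reference_list[word] = freq
    return insert_word(trie, segments)

reference_list, trie = {}, {}
add_reference_word(reference_list, trie, "gongjijin", ("gong", "ji", "jin"))
first_add = add_reference_word(reference_list, trie, "xishanju", ("xi", "shan", "ju"))
second_add = add_reference_word(reference_list, trie, "xishanju", ("xi", "shan", "ju"))
```

The second call returns False because every segment already has a node, which mirrors the check that the tree is only updated when no corresponding nodes are found.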
As shown in fig. 6, fig. 6 is a schematic structural diagram of a data error correction device according to an embodiment of the present application, including:
a data acquisition module 601, configured to acquire a search term input by a user;
The data matching module 602 is configured to match the search word with a pre-created dictionary tree to obtain a matching result, where the dictionary tree includes a plurality of nodes, and each node in the plurality of nodes is configured to represent a word segmentation segment of a reference word in a reference word list;
The data determining module 603 is configured to obtain a feature vector of the search word if the matching result indicates that the search word does not match the dictionary tree, and determine a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;
and the data output module 604 is used for taking the content matched with the target reference word in the database as the search result of the search word.
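The way the four modules cooperate can be sketched as one small class. This is an illustrative wiring, not the device itself: words are segmented letter by letter, feature vectors are letter counts compared by cosine similarity, and the database is a plain dictionary; all of these, and the class and method names, are assumptions.

```python
import math
from collections import Counter

class ErrorCorrector:
    """Toy pipeline mirroring modules 601 to 604."""

    def __init__(self, reference_words, database):
        self.reference_words = list(reference_words)
        self.database = database  # reference word -> list of content
        self.trie = {}
        for word in self.reference_words:
            node = self.trie
            for seg in word:  # assumption: one segment per letter
                node = node.setdefault(seg, {})

    def matches(self, search_word):
        """Data matching module 602: walk the dictionary tree."""
        node = self.trie
        for seg in search_word:
            if seg not in node:
                return False
            node = node[seg]
        return True

    @staticmethod
    def _cosine(u, v):
        dot = sum(u[k] * v[k] for k in u)
        norm = (math.sqrt(sum(x * x for x in u.values()))
                * math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    def target_word(self, search_word):
        """Data determining module 603: most similar reference word."""
        query = Counter(search_word)  # assumed feature vector: letter counts
        return max(self.reference_words,
                   key=lambda w: self._cosine(query, Counter(w)))

    def search(self, search_word):
        """Modules 601 and 604: take the search word, return matching content."""
        word = search_word if self.matches(search_word) else self.target_word(search_word)
        return self.database.get(word, [])

corrector = ErrorCorrector(
    ["gongjijin", "jiazhaoshenling"],
    {"gongjijin": ["doc1"], "jiazhaoshenling": ["doc2"]},
)
direct = corrector.search("gongjijin")
corrected = corrector.search("jiazhaoshenlimg")  # hypothetical typo
```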
In one embodiment, the data matching module 602 matches the search term with a pre-created dictionary tree to obtain a matching result, including:
performing word segmentation processing on the search word to obtain a plurality of word segmentation fragments of the search word;
Matching each word segmentation segment in the plurality of word segmentation segments with each node in a pre-created dictionary tree;
if the word segmentation segment is not matched with the nodes in the dictionary tree, a matching result is generated, and the matching result is used for indicating that the search word is not matched with the dictionary tree.
In one embodiment, the data determining module 603 determines, according to the feature vector of the search term, a target reference term from a plurality of reference terms included in the reference term list, including:
Acquiring a feature vector of each reference word in a plurality of reference words included in the reference word list;
calculating the similarity between the feature vector of the search word and the feature vector of each reference word;
And taking the corresponding reference word with the highest similarity as the target reference word.
In one embodiment, the data determining module 603 uses, as the target reference word, the reference word with the highest corresponding similarity, including:
acquiring a first reference word with the highest corresponding similarity and a second reference word with the second highest corresponding similarity;
Obtaining a difference value between the similarity corresponding to the first reference word and the similarity corresponding to the second reference word;
Judging whether the difference value is smaller than or equal to a preset difference value threshold value or not;
If yes, inquiring word frequency of the first reference word and word frequency of the second reference word from the reference word list, and taking the reference word with the highest word frequency in the first reference word and the second reference word as a target reference word;
and if not, taking the first reference word as a target reference word.
In one embodiment, if the matching result indicates that the search term matches the dictionary tree, the data determining module 603 is further configured to query the database for candidate content matching the search term;
the data determining module 603 is further configured to obtain a correlation degree between the candidate content and the search term;
The data determining module 603 is further configured to obtain a feature vector of the search term if the correlation is less than or equal to a preset correlation threshold, and determine a target reference term from a plurality of reference terms included in the reference term list according to the feature vector of the search term;
The data output module 604 is further configured to use the content in the database that matches the target reference word as a search result of the search word.
In one embodiment, before the search word is matched with the pre-created dictionary tree to obtain a matching result, the data obtaining module 601 is further configured to obtain a reference word list, where the reference word list includes a plurality of reference words;
The data obtaining module 601 is further configured to perform word segmentation processing on each reference word in the plurality of reference words, so as to obtain a plurality of word segmentation segments of each reference word;
The data obtaining module 601 is further configured to generate nodes of a dictionary tree according to each word segmentation segment of each reference word, so as to create a dictionary tree corresponding to the plurality of reference words.
In one embodiment, the data acquisition module 601 acquires a list of reference words, including:
extracting key data from content included in a database, wherein the key data comprises a plurality of keywords and the occurrence number of each keyword in the plurality of keywords;
Acquiring a user search record, wherein the user search record comprises a plurality of search words and the occurrence frequency of each search word in the plurality of search words;
and creating a reference word list according to the key data and the user search record.
In one embodiment, the data obtaining module 601 is further configured to obtain a search word error correction log, where the search word error correction log includes a plurality of error correction records, and each error correction record in the plurality of error correction records includes an input search word and a corresponding target reference word;
The data obtaining module 601 is further configured to obtain a reference word to be added that is input according to an erroneous error correction record in the search word error correction log;
the data obtaining module 601 is further configured to add the reference word to be added to the reference word list, and update the dictionary tree.
According to the data error correction device provided by the embodiment of the application, after the search word is obtained, the search word is matched with the pre-established dictionary tree, whether the search word needs error correction is determined according to the obtained matching result, when the matching result indicates that the search word is not matched with the dictionary tree, the target reference word is determined according to the similarity between the feature vector of the search word and a plurality of reference words included in the reference word list, and finally, the content matched with the target reference word in the database is used as the search result of the search word, so that the automatic error correction can be accurately performed on the search word, and the efficiency and the accuracy of data query are improved.
Fig. 7 is a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 7, the server includes an input device 701, an output device 702, a processor 703, a memory 704, a program 705, and a communication bus 706, where the input device 701, the output device 702, the processor 703, and the memory 704 communicate with each other through the communication bus 706.
A memory 704 for storing a program 705;
The processor 703 is configured to execute the program 705 stored in the memory 704, thereby implementing the following steps:
Acquiring search words input by a user;
matching the search word with a pre-created dictionary tree to obtain a matching result, wherein the dictionary tree comprises a plurality of nodes, and each node in the plurality of nodes is used for representing one word segmentation segment of a reference word in a reference word list;
if the matching result indicates that the search word is not matched with the dictionary tree, obtaining a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;
and taking the content matched with the target reference word in the database as the search result of the search word.
In one embodiment, the processor 703 matches the search term to a pre-created dictionary tree to obtain a matching result, including:
performing word segmentation processing on the search word to obtain a plurality of word segmentation fragments of the search word;
Matching each word segmentation segment in the plurality of word segmentation segments with each node in a pre-created dictionary tree;
if the word segmentation segment is not matched with the nodes in the dictionary tree, a matching result is generated, and the matching result is used for indicating that the search word is not matched with the dictionary tree.
In one embodiment, the processor 703 determines a target reference word from the plurality of reference words included in the reference word list according to the feature vector of the search word, including:
Acquiring a feature vector of each reference word in a plurality of reference words included in the reference word list;
calculating the similarity between the feature vector of the search word and the feature vector of each reference word;
And taking the corresponding reference word with the highest similarity as the target reference word.
In one embodiment, the processor 703 takes the corresponding reference word with the highest similarity as the target reference word, and includes:
acquiring a first reference word with the highest corresponding similarity and a second reference word with the second highest corresponding similarity;
Obtaining a difference value between the similarity corresponding to the first reference word and the similarity corresponding to the second reference word;
Judging whether the difference value is smaller than or equal to a preset difference value threshold value or not;
If yes, inquiring word frequency of the first reference word and word frequency of the second reference word from the reference word list, and taking the reference word with the highest word frequency in the first reference word and the second reference word as a target reference word;
and if not, taking the first reference word as a target reference word.
In one embodiment, the processor 703 is further configured to perform the following:
if the matching result indicates that the search word is matched with the dictionary tree, searching candidate content matched with the search word from a database;
acquiring the correlation degree between the candidate content and the search word;
if the correlation degree is smaller than or equal to a preset correlation degree threshold value, acquiring a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;
and taking the content matched with the target reference word in the database as a search result of the search word.
In one embodiment, the processor 703 is further configured to perform the following operations before matching the search term with the pre-created dictionary tree to obtain a matching result:
acquiring a reference word list, wherein the reference word list comprises a plurality of reference words;
performing word segmentation processing on each reference word in the plurality of reference words to obtain a plurality of word segmentation fragments of each reference word;
Generating nodes of a dictionary tree according to each word segmentation segment of each reference word so as to create the dictionary tree corresponding to the plurality of reference words.
In one embodiment, the processor 703 obtains a list of reference words, including:
extracting key data from content included in a database, wherein the key data comprises a plurality of keywords and the occurrence number of each keyword in the plurality of keywords;
Acquiring a user search record, wherein the user search record comprises a plurality of search words and the occurrence frequency of each search word in the plurality of search words;
and creating a reference word list according to the key data and the user search record.
In one embodiment, the processor 703 is further configured to perform the following:
Obtaining a search word error correction log, wherein the search word error correction log comprises a plurality of error correction records, and each error correction record in the plurality of error correction records comprises an input search word and a corresponding target reference word;
acquiring a reference word to be added which is input according to an error correction record with errors in the search word error correction log;
and adding the reference words to be added into the reference word list, and updating the dictionary tree.
According to the server provided by the embodiment of the application, after the search word is obtained, the search word is matched with the pre-created dictionary tree, and whether the search word needs error correction is determined according to the obtained matching result. When the matching result indicates that the search word does not match the dictionary tree, the target reference word is determined according to the similarity between the feature vector of the search word and the plurality of reference words included in the reference word list. Finally, the content matched with the target reference word in the database is used as the search result of the search word, so that automatic error correction can be accurately performed on the search word, and the efficiency and accuracy of data query are improved.
The embodiment of the present application also provides a computer readable storage medium storing a computer program, where the computer program includes program instructions, and when the program instructions are executed by a processor, the program instructions may execute the steps executed by the server in the above embodiment.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a computer readable storage medium, and that the program, when executed, may include the flows of the embodiments of the data error correction method described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps performed in the embodiments of the methods described above.
The above examples illustrate only a few embodiments of the application. Although they are described in detail, they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (8)

1. A data error correction method, characterized in that it is applied to the error correction of search words on small and medium-sized portal websites, comprising:
extracting key data from the content included in a database, the key data including a plurality of keywords and the number of occurrences of each of the plurality of keywords;
obtaining user search records within a certain period of time, the user search records including a plurality of search words and the number of occurrences of each of the plurality of search words;
creating a reference word list according to the key data and the user search records;
obtaining a search word input by a user;
matching the search word with a pre-created dictionary tree to obtain a matching result, wherein the dictionary tree includes a plurality of nodes, and each of the plurality of nodes represents one word segment of a reference word in the reference word list;
if the matching result indicates that the search word does not match the dictionary tree, obtaining a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;
using the content in the database that matches the target reference word as the search result of the search word;
obtaining a search word error correction log, wherein the search word error correction log includes a plurality of error correction records, and each of the plurality of error correction records includes an input search word and a corresponding target reference word;
obtaining a reference word to be added that is input according to an erroneous error correction record in the search word error correction log;
adding the reference word to be added to the reference word list, and updating the dictionary tree.

2. The method according to claim 1, characterized in that matching the search word with the pre-created dictionary tree to obtain the matching result comprises:
performing word segmentation on the search word to obtain a plurality of word segments of the search word;
matching each of the plurality of word segments with the nodes in the pre-created dictionary tree;
if any word segment does not match a node in the dictionary tree, generating a matching result indicating that the search word does not match the dictionary tree.

3. The method according to claim 1, characterized in that determining the target reference word from the plurality of reference words included in the reference word list according to the feature vector of the search word comprises:
obtaining a feature vector of each of the plurality of reference words included in the reference word list;
calculating the similarity between the feature vector of the search word and the feature vector of each reference word;
using the reference word with the highest similarity as the target reference word.

4. The method according to claim 3, characterized in that the reference word list further includes the word frequency of each of the plurality of reference words, and using the reference word with the highest similarity as the target reference word comprises:
obtaining a first reference word with the highest similarity and a second reference word with the second-highest similarity;
obtaining the difference between the similarity corresponding to the first reference word and the similarity corresponding to the second reference word;
determining whether the difference is less than or equal to a preset difference threshold;
if so, querying the reference word list for the word frequency of the first reference word and the word frequency of the second reference word, and using whichever of the first reference word and the second reference word has the higher word frequency as the target reference word;
if not, using the first reference word as the target reference word.

5. The method according to claim 1, characterized in that the method further comprises:
if the matching result indicates that the search word matches the dictionary tree, querying the database for candidate content matching the search word;
obtaining the relevance between the candidate content and the search word;
if the relevance is less than or equal to a preset relevance threshold, obtaining the feature vector of the search word, and determining a target reference word from the plurality of reference words included in the reference word list according to the feature vector of the search word;
using the content in the database that matches the target reference word as the search result of the search word.

6. The method according to any one of claims 1 to 5, characterized in that, before matching the search word with the pre-created dictionary tree to obtain the matching result, the method further comprises:
obtaining a reference word list, the reference word list including a plurality of reference words;
performing word segmentation on each of the plurality of reference words to obtain a plurality of word segments of each reference word;
generating a node of the dictionary tree for each word segment of each reference word, so as to create the dictionary tree corresponding to the plurality of reference words.

7. A data error correction device, characterized in that it is applied to the error correction of search words on small and medium-sized portal websites, comprising:
a data acquisition module, configured to extract key data from the content included in a database, the key data including a plurality of keywords and the number of occurrences of each of the plurality of keywords; obtain user search records within a certain period of time, the user search records including a plurality of search words and the number of occurrences of each of the plurality of search words; create a reference word list according to the key data and the user search records; and obtain a search word input by a user;
a data matching module, configured to match the search word with a pre-created dictionary tree to obtain a matching result, wherein the dictionary tree includes a plurality of nodes, and each of the plurality of nodes represents one word segment of a reference word in the reference word list;
a data determination module, configured to, if the matching result indicates that the search word does not match the dictionary tree, obtain a feature vector of the search word, and determine a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;
a data output module, configured to use the content in the database that matches the target reference word as the search result of the search word;
the data acquisition module being further configured to obtain a search word error correction log, wherein the search word error correction log includes a plurality of error correction records, and each of the plurality of error correction records includes an input search word and a corresponding target reference word; obtain a reference word to be added that is input according to an erroneous error correction record in the search word error correction log; and add the reference word to be added to the reference word list and update the dictionary tree.

8. A server, characterized in that it comprises a memory and a processor, wherein the memory stores a set of program code, and the processor calls the program code stored in the memory to execute the method according to any one of claims 1 to 6.
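As an illustration only, the reference word list of claim 1 and the selection rule of claims 3 and 4 might be sketched as follows. The function names, the default threshold of 0.05, and the choice to define word frequency as the sum of keyword and search-record occurrence counts are assumptions made for this sketch; the claims do not fix any of them.

```python
from collections import Counter


def build_reference_list(keyword_counts, search_counts):
    """Merge keyword and search-record occurrence counts into word frequencies.

    Assumption: the word frequency of a reference word is the sum of both counts.
    """
    return Counter(keyword_counts) + Counter(search_counts)


def pick_target(similarities, frequencies, diff_threshold=0.05):
    """Choose the target reference word from precomputed similarities.

    similarities: reference word -> similarity to the search word
    frequencies:  reference word -> word frequency
    """
    if not similarities:
        raise ValueError("no reference words to choose from")
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    first = ranked[0]
    if len(ranked) == 1:
        return first
    second = ranked[1]
    # When the two best candidates are nearly tied, prefer the more frequent word.
    if similarities[first] - similarities[second] <= diff_threshold:
        return max((first, second), key=lambda w: frequencies.get(w, 0))
    return first
```

The frequency comparison only applies when the two best candidates are within the threshold of each other; otherwise the most similar reference word is returned directly, as in claim 3. If both candidates have the same frequency, this sketch keeps the more similar one.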
CN202011016203.1A 2020-09-24 2020-09-24 Data error correction method, device and server Active CN112115232B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011016203.1A CN112115232B (en) 2020-09-24 2020-09-24 Data error correction method, device and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011016203.1A CN112115232B (en) 2020-09-24 2020-09-24 Data error correction method, device and server

Publications (2)

Publication Number Publication Date
CN112115232A CN112115232A (en) 2020-12-22
CN112115232B true CN112115232B (en) 2024-11-22

Family

ID=73801711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011016203.1A Active CN112115232B (en) 2020-09-24 2020-09-24 Data error correction method, device and server

Country Status (1)

Country Link
CN (1) CN112115232B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632285A (en) * 2020-12-31 2021-04-09 北京有竹居网络技术有限公司 Text clustering method and device, electronic equipment and storage medium
CN112800315B (en) * 2021-01-29 2023-08-04 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
CN112965998B (en) * 2021-02-04 2023-05-09 成都健数科技有限公司 A compound database establishment and retrieval method and system
CN113342848B (en) * 2021-05-25 2024-04-02 中国平安人寿保险股份有限公司 Information searching method, device, terminal equipment and computer readable storage medium
CN113553398B (en) * 2021-07-15 2024-01-26 杭州网易云音乐科技有限公司 Search word correction method, search word correction device, electronic equipment and computer storage medium
CN114090735B (en) * 2021-11-18 2025-09-16 金蝶云科技有限公司 Text matching method, device, equipment and storage medium
CN114254627A (en) * 2021-12-15 2022-03-29 阳光保险集团股份有限公司 Text error correction method, device, equipment and readable storage medium
CN115310447A (en) * 2022-08-04 2022-11-08 平安银行股份有限公司 Entity standardization method, device, electronic equipment and computer readable storage medium
CN115203379A (en) * 2022-09-15 2022-10-18 太平金融科技服务(上海)有限公司深圳分公司 Retrieval method, retrieval apparatus, computer device, storage medium, and program product
CN116187303A (en) * 2023-02-28 2023-05-30 北京智通云联科技有限公司 A dictionary-based search word query error correction method and system
CN116932781A (en) * 2023-07-29 2023-10-24 企知道科技有限公司 Enterprise information matching method and system based on ac automaton
CN117574243B (en) * 2024-01-15 2024-04-26 河北网新数字技术股份有限公司 A data analysis method, device and system
CN118245618B (en) * 2024-04-10 2025-08-01 中电信人工智能科技(北京)有限公司 Combined word matching detection method, system, electronic equipment and storage medium
CN118152428A (en) * 2024-05-09 2024-06-07 烟台海颐软件股份有限公司 A method and device for predicting and enhancing query instructions of power customer service system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414763A (en) * 2020-02-28 2020-07-14 长沙千博信息技术有限公司 A semantic disambiguation method, device, device and storage device for sign language computing

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10301362B4 (en) * 2003-01-16 2005-06-09 GEMAC-Gesellschaft für Mikroelektronikanwendung Chemnitz mbH A block data compression system consisting of a compression device and a decompression device, and methods for fast block data compression with multi-byte search
US10936611B2 (en) * 2015-10-30 2021-03-02 Salesforce.Com, Inc. Search promotion systems and method
CN105824798A (en) * 2016-03-03 2016-08-03 云南电网有限责任公司教育培训评价中心 Examination question de-duplicating method of examination question base based on examination question key word likeness

Also Published As

Publication number Publication date
CN112115232A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
CN112115232B (en) Data error correction method, device and server
US12282504B1 (en) Systems and methods for graph-based dynamic information retrieval and synthesis
Phan et al. Pair-linking for collective entity disambiguation: Two could be better than all
US20130226846A1 (en) System and Method for Universal Translating From Natural Language Questions to Structured Queries
US20110270820A1 (en) Dynamic Indexing while Authoring and Computerized Search Methods
US8805755B2 (en) Decomposable ranking for efficient precomputing
JP6176017B2 (en) SEARCH DEVICE, SEARCH METHOD, AND PROGRAM
CN104657440A (en) Structured query statement generating system and method
US20220114340A1 (en) System and method for an automatic search and comparison tool
CN114391142A (en) Parsing queries using structured and unstructured data
CN112560425B (en) Template generation method, device, electronic device and storage medium
Elshater et al. godiscovery: Web service discovery made efficient
CN119988572B (en) A GraphRAG-based intelligent question answering method, system, device, and medium
CN115335819B (en) Methods and systems for searching and retrieving information
CN113609847A (en) Information extraction method and device, electronic equipment and storage medium
CN113590755B (en) Word weight generation method and device, electronic equipment and storage medium
CN120560649A (en) A method, device and medium for generating SQL statements based on natural language
CN114385777A (en) Text data processing method and device, computer equipment and storage medium
CN113420219B (en) Method, device, electronic device and readable storage medium for querying information error correction
CN115718821A (en) A search ranking model generation method, ranking display method, device and equipment
US20250348762A1 (en) Knowledge base and interface for efficient response to user queries
CN110309258B (en) Input checking method, server and computer readable storage medium
CN115309995B (en) A method and apparatus for pushing scientific and technological resources based on demand text
CN118114681A (en) Semantic analysis method, device and computer readable storage medium
CN118093805A (en) Question answering method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant