Detailed Description
The embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
Artificial intelligence technology is a comprehensive discipline that spans a wide range of fields, covering both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, cloud storage, big data processing, operation/interaction systems, electromechanical integration, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Cloud computing is a computing model that distributes computing tasks across a large pool of computers, enabling various application systems to acquire computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". From the user's perspective, resources in the cloud appear infinitely expandable: they can be acquired at any time, used on demand, expanded at any time, and paid for according to use.
Cloud storage is a newer concept that extends and develops the concept of cloud computing. A distributed cloud storage system is a storage system that, through functions such as cluster applications, grid technology, and distributed storage file systems, uses application software or application interfaces to make a large number of storage devices of different types in a network (storage devices are also called storage nodes) work cooperatively, and that jointly provides data storage and service access functions.
A database can be regarded as an electronic filing cabinet, that is, a place for storing electronic files, in which users can add, query, update, and delete data. A "database" is a collection of data that is stored together, can be shared by multiple users, has as little redundancy as possible, and is independent of the application.
When the data error correction method provided by the application is used to correct an input search word, technologies such as cloud computing, cloud storage, and databases are used. By matching the search word input by the user with a dictionary tree, the search word can be corrected automatically, which improves the efficiency and accuracy of data queries.
Before explaining the embodiment of the present application in detail, an application scenario of the embodiment of the present application is described.
The data error correction method in the embodiment of the application can be applied in particular to correcting search words on small and medium-sized portal websites, for example government service websites. Existing government service websites often have little technical accumulation and lack network-related talent, which makes subsequent operation of the website difficult; moreover, once the system is online, only its availability is ensured, while its practicality is not. The government service website is used only for illustration; the method can also be applied to other small and medium-sized portal websites, such as enterprise portal websites.
Fig. 1 is a schematic diagram of the architecture of a data retrieval system according to an embodiment of the present invention. The data retrieval system may include a user terminal 101, a network 102, and a server 103, with the user terminal 101 and the server 103 communicating via the network 102. The user terminal 101 obtains a search word, which may be text input by a user of the user terminal 101, and sends the search word to the server 103 through the network 102. The server 103 matches the search word to determine whether an accurate search result can be obtained by using the search word directly, for example by determining whether the search word is in a dictionary tree. If the search word is determined not to match the dictionary tree, the server 103 automatically corrects the search word and uses the content matching the target reference word obtained after correction as the search result of the search word. The network 102 may include various connection types, such as wired or wireless communication links or optical fiber cables. The user terminal 101 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The server 103 may be implemented as a stand-alone server or as a server cluster formed by a plurality of servers, for example a government server or a server cluster of a government platform, and may also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
It may be understood that the schematic diagram of the architecture of the system described in the embodiment of the present application is for more clearly describing the technical solution of the embodiment of the present application, and does not constitute a limitation on the technical solution provided by the embodiment of the present application, and those skilled in the art can know that, with the evolution of the architecture of the system and the appearance of a new service scenario, the technical solution provided by the embodiment of the present application is equally applicable to similar technical problems.
In one embodiment, as shown in fig. 2, a data error correction method is provided according to an embodiment of the present invention based on the data retrieval system of fig. 1. The present embodiment is mainly exemplified by the application of the method to the server 103 in fig. 1, and includes the following steps:
Step S201, obtaining search words input by a user.
In the embodiment of the invention, when a user needs to search information through the user terminal, the user can input the search word in the search box, so that the user terminal obtains the search word input by the user, the user terminal sends the search word to the server, and the server obtains the search word. The search box refers to an interaction control in the search engine system and is used for extracting corresponding accurate contents in massive information according to search characters input in the search box.
It should be noted that, in practical application, when a user inputs a search word in a search box, the user may manually input the search word, or may input the search word in a voice form, etc., and the method of inputting the search word by the user is not limited in the embodiment of the present application.
Step S202, matching the search word with a pre-created dictionary tree to obtain a matching result, wherein the dictionary tree comprises a plurality of nodes, and each node in the plurality of nodes is used for representing a word segmentation segment of a reference word in a reference word list.
A dictionary tree (Trie), also called a prefix tree, is a tree-shaped data structure comprising a plurality of nodes that can be used for string matching, fast lookup, and similar tasks. It minimizes meaningless string comparisons and improves the efficiency of word frequency statistics and string sorting. The key idea is to exploit the common prefixes of strings: by building a tree structure, it trades space for time and reduces the cost of queries. A dictionary tree generally has three properties: 1) the root node contains no characters, and each node other than the root node contains exactly one character string, which may be a word segmentation fragment of a reference word in the reference word list; 2) concatenating the character strings on the path from the root node to a given leaf node yields the combined character string corresponding to that node, and each combined character string may be a reference word; 3) the child nodes of any given node all contain different character strings.
In the embodiment of the application, the dictionary tree is created in advance as follows. A plurality of reference words are obtained from the reference word list, and word segmentation is performed on each reference word to obtain its word segmentation fragments. Taking each fragment as a unit, the fragments of each reference word are stored in order in successive nodes along one path of the dictionary tree. During creation, for each fragment of a reference word, starting from the first, it is checked whether a node already exists for that fragment at the current position: if so, the pointer is moved to that node; if not, a node is created for the fragment.
In one embodiment, the server obtains a plurality of original corpora from the database, wherein the corpora include some key data, and the key data includes keywords and corresponding occurrence numbers. The server can also obtain the search records of the user in a certain time from the search engine, obtain a plurality of search words and corresponding occurrence times, and generate a reference word list by counting key data and key words and search words in the search records of the user.
Specifically, after receiving a search word from the user terminal, the server performs word segmentation on the search word to obtain its word segmentation fragments. Starting with the first fragment and with the first-layer child nodes of the root node, the server checks whether the pre-created dictionary tree contains a node for the current fragment. When such a node exists, matching continues from that node with the next fragment of the search word. When a fragment of the search word has no corresponding node in the dictionary tree, a matching result is generated indicating that the search word does not match the dictionary tree.
As a specific example of this embodiment, suppose the search word input at the user terminal is "driver license order" and the dictionary tree includes nodes corresponding to each word segmentation fragment of the reference word "driver license claim". Word segmentation of the search word yields the fragments "driver", "license", and "order", which are matched against the nodes of the dictionary tree in turn. Because the user entered "order" where "claim" was intended, no corresponding node can be found when the fragment "order" is matched, and the search word is determined not to match the dictionary tree.
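The matching walk described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the class names `TrieNode` and `Trie`, the `is_word` end marker, and the English fragments are assumptions introduced here, and the input is assumed to be already segmented.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # fragment -> TrieNode
        self.is_word = False  # assumption: marks the end of a reference word


class Trie:
    def __init__(self):
        self.root = TrieNode()  # the root node holds no fragment

    def insert(self, fragments):
        """Store the fragments of one reference word along one path."""
        node = self.root
        for frag in fragments:
            node = node.children.setdefault(frag, TrieNode())
        node.is_word = True

    def match(self, fragments):
        """Return (matched, index of the first fragment without a node, or None)."""
        node = self.root
        for i, frag in enumerate(fragments):
            if frag not in node.children:
                return False, i  # no node for this fragment: not matched
            node = node.children[frag]
        return node.is_word, None


trie = Trie()
trie.insert(["driver", "license", "claim"])

print(trie.match(["driver", "license", "claim"]))  # (True, None)
print(trie.match(["driver", "license", "order"]))  # (False, 2)
```

Returning the index of the failing fragment is a design choice of this sketch; the description only requires a result indicating that the search word does not match.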
Step S203, if the matching result indicates that the search word is not matched with the dictionary tree, obtaining a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word.
A feature vector is a vector that represents the semantics of natural-language text. A text representation algorithm can be used to vectorize the search word and the plurality of reference words in the reference word list to obtain the corresponding feature vectors. Text representation algorithms include methods based on the vector space model, methods based on topic models, methods based on neural networks, and the like.
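As one possible illustration of a vector space model representation, the sketch below maps each text to a vector of character bigram counts over a shared vocabulary. The choice of character bigrams and the function names `char_ngrams`, `build_vocabulary`, and `vectorize` are assumptions of this sketch; the description does not fix a particular text representation algorithm.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Character n-grams of the text with spaces removed."""
    s = text.replace(" ", "")
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_vocabulary(texts, n=2):
    """Assign one vector dimension to every n-gram seen in the texts."""
    grams = sorted({g for t in texts for g in char_ngrams(t, n)})
    return {g: i for i, g in enumerate(grams)}


def vectorize(text, vocab, n=2):
    """Count vector of the text; n-grams outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for gram, count in Counter(char_ngrams(text, n)).items():
        if gram in vocab:
            vec[vocab[gram]] = float(count)
    return vec


reference_words = ["driver license claim", "public accumulation fund"]
vocab = build_vocabulary(reference_words)
query_vector = vectorize("driver license order", vocab)
```

Because the search word and the reference words share one vocabulary, their vectors have the same length and can be compared directly by a similarity measure.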
Specifically, to determine the target reference word, the feature vector of each reference word in the reference word list is first obtained. The similarity between the feature vector of the search word and the feature vector of each reference word is then calculated, the similarities are sorted, and the reference word with the highest similarity is used as the target reference word.
The similarity measures how close the feature vector of the search word is to the feature vector of each reference word in the reference word list. It may be calculated using a similarity algorithm, including but not limited to Euclidean distance, cosine similarity, the Pearson correlation coefficient, and the Jaccard similarity coefficient.
For example, the search word input at the user terminal is "driver license order", and the reference word list includes the reference word "driver license claim". After converting "driver license order" into a feature vector, the server calculates its similarity with the feature vector of each reference word in the reference word list; when "driver license claim" is returned as the reference word with the highest similarity, "driver license claim" is used as the target reference word.
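The selection of the reference word with the highest similarity can be sketched with cosine similarity, one of the measures listed above. The three-dimensional vectors are made-up illustrative values rather than data from the embodiment, and `pick_target` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_target(query_vec, reference_vecs):
    """Rank reference words by similarity and return (best word, full ranking)."""
    ranking = sorted(
        ((cosine_similarity(query_vec, vec), word)
         for word, vec in reference_vecs.items()),
        reverse=True,
    )
    return ranking[0][1], ranking


# Illustrative vectors only.
reference_vecs = {
    "driver license claim": [0.9, 0.1, 0.0],
    "public accumulation fund": [0.0, 0.2, 0.9],
}
target, ranking = pick_target([0.8, 0.2, 0.0], reference_vecs)
print(target)  # driver license claim
```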
Step S204, taking the content matched with the target reference word in the database as the search result of the search word.
Specifically, after obtaining the target reference word, the server searches the database directly with it and uses the retrieved content as the search result of the search word. For example, when the search word input at the user terminal is "driver license order" and the target reference word "driver license claim" has been obtained through steps S202 and S203, the database is searched directly with "driver license claim", and the corresponding data resources are returned as the search result of the search word.
According to this data error correction method, after the search word is obtained, it is matched with the pre-created dictionary tree, and whether the search word needs to be corrected is determined according to the matching result. When the matching result indicates that the search word does not match the dictionary tree, the target reference word is determined according to the similarity between the feature vector of the search word and the feature vectors of the plurality of reference words included in the reference word list. Finally, the content in the database matching the target reference word is used as the search result of the search word. In this way, whether the search word contains an error can be judged and corrected automatically, without requiring the user to correct the search word manually, which effectively improves the efficiency of data queries.
In one embodiment, as shown in fig. 3, before the search word is matched with the pre-created dictionary tree to obtain a matching result, the method further includes a step of creating the dictionary tree, where the step specifically includes the following steps:
Step 301, extracting key data from contents included in a database, wherein the key data comprises a plurality of keywords and the occurrence number of each keyword in the plurality of keywords;
The server obtains a large number of original corpora from the database, extracts the keywords they contain (such as "driver license claim"), and counts the number of occurrences of each keyword.
Step 302, obtaining a user search record, wherein the user search record comprises a plurality of search words and the occurrence frequency of each search word in the plurality of search words;
Specifically, the server may obtain a search record of a user in a certain time from the search engine, and obtain a plurality of search terms and corresponding occurrence numbers.
In one embodiment, the server obtains the search records of users within a certain period from the search engine and sorts them by time. For a given user, search behavior is generally segmented, with a relatively obvious interval between segments; each segment is called a search session. The server obtains the search words of each search session of the user and counts their occurrences. Because a user's searches within one search session usually aim to solve one problem, the search words input within a session tend to be related. By counting the search words within a session, repeated words or synonyms can be mined, which further improves the completeness of the reference word list and the accuracy of queries.
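A minimal sketch of splitting one user's time-sorted search records into search sessions and counting the search words per session is given below. The 30-minute gap, the record layout `(timestamp, search_word)`, and the function names are assumptions made for illustration; the description only says that sessions are separated by a relatively obvious interval.

```python
from collections import Counter
from datetime import datetime, timedelta


def split_sessions(records, max_gap=timedelta(minutes=30)):
    """Group (timestamp, search_word) records into sessions by time gap."""
    sessions = []
    for ts, word in sorted(records):
        # Continue the current session if the gap to its last record is small.
        if sessions and ts - sessions[-1][-1][0] <= max_gap:
            sessions[-1].append((ts, word))
        else:
            sessions.append([(ts, word)])
    return sessions


def count_words(sessions):
    """One Counter of search words per session."""
    return [Counter(word for _, word in session) for session in sessions]


# Hypothetical search records of one user.
records = [
    (datetime(2020, 9, 18, 10, 0), "driver license order"),
    (datetime(2020, 9, 18, 10, 2), "driver license claim"),
    (datetime(2020, 9, 18, 10, 5), "driver license claim"),
    (datetime(2020, 9, 18, 15, 0), "public accumulation fund"),
]
sessions = split_sessions(records)
counts = count_words(sessions)
```

In this sample the first three searches fall into one session, so the mistyped and the corrected search word appear together, which is the kind of co-occurrence the session statistics are meant to expose.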
In one embodiment, step 302 may be performed first, and then step 301 may be performed. The specific order of steps 301 and 302 is not limited in this embodiment of the present invention.
Step 303, creating a reference word list according to the key data and the user search record.
Specifically, by collecting the keywords in the key data and the search words in the user search record, the reference word column of the reference word list is generated, and the number of occurrences of each reference word is used as its word frequency.
Illustratively, the list of reference words is shown in Table 1:
Table 1: List of reference words
| Reference words | Word frequency | Creation time |
| --- | --- | --- |
| Long nurse | 42 | 2020-09-18 10:04:01 |
| Accumulation of money | 210 | 2020-09-18 10:04:01 |
| Fertility body paste | 21 | 2020-09-18 10:04:01 |
| Fertility patch | 22 | 2020-09-18 10:04:01 |
| Driver license declaration | 125 | 2020-09-18 10:04:01 |
| Xishan house | 1 | 2020-09-18 10:04:01 |
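Creating the reference word list from the key data and the user search record can be sketched as follows. Summing the two occurrence counts for a word that appears in both sources is an assumption of this sketch, as are the field names and the sample counts.

```python
from collections import Counter
from datetime import datetime


def build_reference_list(keyword_counts, search_counts, now=None):
    """Merge keyword and search-word counts into reference word rows."""
    now = now or datetime.now()
    freq = Counter(keyword_counts)
    freq.update(search_counts)  # Counter.update adds counts for shared words
    return [
        {"reference_word": word, "word_frequency": count, "created_at": now}
        for word, count in freq.most_common()
    ]


# Hypothetical counts from the database corpora and from the search record.
keyword_counts = {"driver license claim": 100, "public accumulation fund": 210}
search_counts = {"driver license claim": 25}

reference_list = build_reference_list(
    keyword_counts, search_counts, now=datetime(2020, 9, 18, 10, 4, 1)
)
for row in reference_list:
    print(row["reference_word"], row["word_frequency"])
```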
Step 304, obtaining a reference word list, wherein the reference word list comprises a plurality of reference words;
Step 305, performing word segmentation processing on each reference word in the plurality of reference words to obtain a plurality of word segmentation fragments of each reference word;
In this embodiment, word segmentation refers to splitting a text sequence into individual words, and a word segmentation fragment is one of the pieces obtained by that processing. For example, if the text is "driver license claim", word segmentation yields fragments such as "driver" and "license", each of which is a word segmentation fragment.
Step 306, generating nodes of the dictionary tree according to each word segmentation fragment of each reference word, so as to create the dictionary tree corresponding to the plurality of reference words.
In this embodiment, the word segmentation fragments of each reference word in the reference word list are stored in order in successive nodes along one path of the dictionary tree. During creation, for each fragment of a reference word, starting from the first, it is checked whether a node already exists for that fragment at the current position: if so, the pointer is moved to that node; if not, a node is created for the fragment.
For example, as shown in fig. 4, take three reference words in the reference word list: "public accumulation fund" and two reference words beginning with the word segmentation fragment "child". The server first creates a root node; all nodes below the root node are child nodes. For the first fragments "public" and "child", the server determines that no child node connected to the root node matches them, and creates the child nodes "public" and "child" connected to the root node. Similarly, for the fragment "accumulation" of the reference word "public accumulation fund", the server determines that no child node connected to the node "public" matches it, and creates a child node "accumulation" connected to the node "public"; for the fragment "fund", the server determines that no child node connected to the node "accumulation" matches it, and creates a child node "fund" connected to the node "accumulation". The server thereby obtains the tree structure shown in fig. 4 (1). The remaining fragments are handled in the same way to obtain the tree structure shown in fig. 4.
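The node creation rule above (reuse a node if one exists for the fragment, otherwise create it) can be sketched with nested dictionaries. The representation and the fragment lists are assumptions for illustration; in particular, the fragments of the two reference words that begin with "child" are hypothetical placeholders, since only their shared first fragment is given.

```python
def build_trie(segmented_words):
    """Build a dictionary tree as nested dicts: fragment -> child nodes."""
    root = {}  # the root node holds no fragment
    for fragments in segmented_words:
        node = root
        for frag in fragments:
            if frag not in node:   # no node yet for this fragment: create it
                node[frag] = {}
            node = node[frag]      # move the pointer to the (existing or new) node
    return root


def count_nodes(node):
    """Number of nodes below the given node."""
    return sum(1 + count_nodes(child) for child in node.values())


segmented_words = [
    ["public", "accumulation", "fund"],
    ["child", "care", "allowance"],  # hypothetical fragments
    ["child", "care", "subsidy"],    # hypothetical fragments
]
trie = build_trie(segmented_words)
print(sorted(trie))       # ['child', 'public']
print(count_nodes(trie))  # 7
```

The nine fragments occupy only seven nodes because the two "child" words share the nodes of their common prefix, which is the space saving the prefix tree relies on.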
In one embodiment, matching the search term with a pre-created dictionary tree to obtain a matching result includes: performing word segmentation processing on the search word to obtain a plurality of word segmentation fragments of the search word; matching each word segmentation segment in the plurality of word segmentation segments with each node in a pre-created dictionary tree; if the word segmentation segment is not matched with the nodes in the dictionary tree, a matching result is generated, and the matching result is used for indicating that the search word is not matched with the dictionary tree.
In this embodiment, after receiving a search word from the user terminal, the server performs word segmentation on the search word to obtain its word segmentation fragments. Starting with the first fragment and with the first-layer child nodes of the root node, the server checks whether the pre-created dictionary tree contains a node for the current fragment. When such a node exists, matching continues from that node with the next fragment of the search word. When a fragment of the search word has no corresponding node in the dictionary tree, a matching result is generated indicating that the search word does not match the dictionary tree.
As a specific example of this embodiment, if the search word input at the user terminal is "driver license order", word segmentation yields the fragments "driver", "license", and "order", which are matched against the nodes of the dictionary tree in turn. Because the user entered "order" where "claim" was intended, no corresponding node can be found when the fragment "order" is matched, and the search word is determined not to match the dictionary tree.
According to the embodiment, the obtained search words are matched with the dictionary tree, so that whether the search words need error correction or not can be effectively confirmed, and the accuracy of search results returned by inquiry is ensured.
In one embodiment, after obtaining the feature vector of each reference word in the reference word list, the server may, when calculating the similarity between the feature vector of the search word and the feature vector of each reference word, process sub-reference word lists of the reference word list concurrently through a plurality of preset threads, obtaining the reference word with the highest similarity in each sub-reference word list. Specifically, the server may use bigram, trigram, or other strategies to control the split size and thereby the amount of calculation, dividing the reference word list into K sub-reference word lists, where K is greater than or equal to 2 and each sub-reference word list contains N reference words, where N is greater than or equal to 1. The server calls the plurality of preset threads to calculate, for each sub-reference word list, the similarity between the feature vector of the search word input by the user and the feature vectors of the N reference words in that list, obtaining the reference word with the highest similarity in each sub-reference word list. These per-list results are then sorted in descending order of similarity, and the reference word with the highest similarity is taken as the target reference word.
In this embodiment, each thread calculates the similarity between the feature vector of the search word and the feature vectors of the N reference words in one sub-reference word list. Because multiple threads work at the same time, the time required for the search can be reduced and the processing speed improved.
As a specific example of this embodiment, as shown in Table 2, the server splits the reference word list of Table 1 into two sub-reference word lists and obtains the feature vectors of the reference words in each, yielding reference word feature vector list (a) and reference word feature vector list (b). For example, when the user terminal inputs the search word "driver license order", the server converts it into a feature vector and calculates its similarity with the feature vector of each reference word in lists (a) and (b). The calculation against list (a) returns [principal = 0.0123], and the calculation against list (b) returns [driver license claim = 0.89102]. From the results returned for lists (a) and (b), the server selects the reference word with the highest similarity, "driver license claim", as the target reference word.
Illustratively, the list of reference word feature vectors is shown in Table 2:
Table 2: Reference word feature vector list (a)
| Reference words | Feature vector |
| --- | --- |
| Nurse certificate | [0.1233422,10.1292920d,101.929101] |
| Accumulation of money | [1.1233422,10.1292920d,101.929101] |
| Fertility body paste | [2.1233422,10.1292920d,101.929101] |
Table 2: Reference word feature vector list (b)
| Reference words | Feature vector |
| --- | --- |
| Fertility patch | [3.1233422,10.1292920d,101.929101] |
| Driver license declaration | [4.1233422,10.1292920d,101.929101] |
| Xishan house | [5.1233422,10.1292920d,101.929101] |
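The concurrent calculation over sub-reference word lists can be sketched with a thread pool. The even split by list size, the cosine similarity measure, and the vectors are assumptions of this sketch (the vectors are made up and are not the values in Table 2), and `find_target` is a hypothetical helper name.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_list(items, k):
    """Split a non-empty list into at most k consecutive sub-lists."""
    size = math.ceil(len(items) / k)
    return [items[i:i + size] for i in range(0, len(items), size)]


def best_in_sublist(query_vec, sublist):
    """(similarity, word) of the most similar reference word in one sub-list."""
    return max((cosine_similarity(query_vec, vec), word) for word, vec in sublist)


def find_target(query_vec, reference_items, k=2):
    sublists = split_list(reference_items, k)
    # One worker thread per sub-reference word list.
    with ThreadPoolExecutor(max_workers=len(sublists)) as pool:
        partial = list(pool.map(lambda s: best_in_sublist(query_vec, s), sublists))
    return max(partial)  # best result across all sub-lists


# Illustrative (word, vector) pairs.
reference_items = [
    ("nurse certificate", [1.0, 0.0, 0.0]),
    ("accumulation of money", [0.0, 1.0, 0.0]),
    ("fertility body paste", [0.0, 0.0, 1.0]),
    ("fertility patch", [0.0, 0.1, 1.0]),
    ("driver license declaration", [1.0, 1.0, 0.0]),
    ("xishan house", [0.5, 0.0, 0.5]),
]
similarity, target = find_target([0.9, 1.0, 0.1], reference_items, k=2)
print(target)  # driver license declaration
```

In CPython, pure-Python arithmetic in threads is limited by the global interpreter lock, so the speedup described for the embodiment would in practice depend on the similarity computation releasing the lock (for example in native code) or on using processes instead.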
In one embodiment, after the reference word list is obtained, the reference words in it may be converted into pinyin, word segmentation may be performed on the pinyin of each reference word, and a pinyin dictionary tree may be constructed from the resulting pinyin word segmentation fragments.
Illustratively, the Pinyin list of the reference words is shown in Table 3:
Table 3: Pinyin list of reference words
| Reference words | Pinyin | Creation time |
| --- | --- | --- |
| Long nurse | hushizhang | 2020-09-18 10:05:10 |
| Accumulation of money | gongjijin | 2020-09-18 10:05:10 |
| Fertility body paste | shengyujintie | 2020-09-18 10:05:10 |
| Fertility patch | shengyubutie | 2020-09-18 10:05:10 |
| Driver license declaration | jiazhaoshenling | 2020-09-18 10:05:10 |
| Xishan house | xishanju | 2020-09-18 10:05:10 |
After the server matches the search word input by the user with the dictionary tree, if the matching result shows that a word segmentation fragment of the search word has no corresponding node in the dictionary tree, the pinyin corresponding to the search word is further matched with the pinyin dictionary tree. If the pinyin of the search word finds corresponding nodes in the pinyin dictionary tree, the reference word corresponding to the pinyin character sequence of the matched nodes is used as the target reference word. If it does not, the similarity between the feature vector of the pinyin of the search word and the feature vector of each pinyin in the pinyin list of reference words is calculated, the pinyin with the highest similarity is obtained, and the reference word corresponding to that pinyin is used as the target reference word.
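The pinyin fallback order can be sketched as below, with two simplifications that are assumptions of this sketch: a plain dictionary keyed by the full pinyin string stands in for the pinyin dictionary tree, and `difflib.SequenceMatcher` string similarity stands in for feature vector similarity. The pinyin of the search word is assumed to have been produced upstream; the entries come from Table 3.

```python
from difflib import SequenceMatcher

# Pinyin -> reference word, taken from Table 3.
PINYIN_INDEX = {
    "hushizhang": "Long nurse",
    "gongjijin": "Accumulation of money",
    "shengyujintie": "Fertility body paste",
    "shengyubutie": "Fertility patch",
    "jiazhaoshenling": "Driver license declaration",
    "xishanju": "Xishan house",
}


def correct_by_pinyin(query_pinyin, pinyin_index):
    """Exact pinyin hit first; otherwise the reference word with the closest pinyin."""
    if query_pinyin in pinyin_index:
        return pinyin_index[query_pinyin]
    closest = max(
        pinyin_index,
        key=lambda p: SequenceMatcher(None, query_pinyin, p).ratio(),
    )
    return pinyin_index[closest]


# A search word mistyped with a same-sounding character has the same pinyin.
print(correct_by_pinyin("jiazhaoshenling", PINYIN_INDEX))  # Driver license declaration
# A pinyin that is not in the index falls back to the closest entry.
print(correct_by_pinyin("gongjijing", PINYIN_INDEX))       # Accumulation of money
```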
In one embodiment, the reference word list further includes a word frequency of each reference word in the plurality of reference words, and the step of using the reference word with the highest corresponding similarity as the target reference word includes: acquiring a first reference word with the highest corresponding similarity and a second reference word with the second highest corresponding similarity; obtaining a difference value between the similarity corresponding to the first reference word and the similarity corresponding to the second reference word; judging whether the difference value is smaller than or equal to a preset difference value threshold value or not; if yes, inquiring word frequency of the first reference word and word frequency of the second reference word from the reference word list, and taking the reference word with the highest word frequency in the first reference word and the second reference word as a target reference word; and if not, taking the first reference word as a target reference word.
In this embodiment, after the similarity between the feature vector of the search word and the feature vector of each reference word in the reference word list is calculated, the first reference word with the highest similarity and the second reference word with the second highest similarity are obtained. When the edit distances from the first and second reference words to the search word are both small, so that the difference between their similarities is less than or equal to the preset difference threshold, the reference word with the higher word frequency is selected as the target reference word. For example, suppose the search word input by the user is "fertility" and the reference word list includes the reference words "fertility body paste" and "fertility patch". The edit distance between the search word and each of these reference words is 2, and the difference between their similarities to the search word is smaller than the preset difference threshold. Since the word frequency of "fertility body paste" is 21 and that of "fertility patch" is 22, "fertility patch" is used as the target reference word.
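The word frequency tie-break can be sketched as follows. The similarity values and the default threshold of 0.05 are made-up illustrative numbers; the word frequencies are those of Table 1, and `choose_target` is a hypothetical helper name.

```python
def choose_target(scored, word_freq, diff_threshold=0.05):
    """scored: list of (reference_word, similarity) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    first = ranked[0]
    if len(ranked) == 1:
        return first[0]
    second = ranked[1]
    # Similarities too close to separate: prefer the more frequent word.
    if first[1] - second[1] <= diff_threshold:
        return max((first[0], second[0]), key=lambda word: word_freq[word])
    return first[0]


word_freq = {"fertility body paste": 21, "fertility patch": 22}
scored = [
    ("fertility body paste", 0.81),  # illustrative similarities
    ("fertility patch", 0.80),
]
print(choose_target(scored, word_freq))                        # fertility patch
print(choose_target(scored, word_freq, diff_threshold=0.001))  # fertility body paste
```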
In one embodiment, the data error correction method further comprises the steps of: if the matching result indicates that the search word is matched with the dictionary tree, searching candidate content matched with the search word from a database; acquiring the correlation degree between the candidate content and the search word; if the correlation degree is smaller than or equal to a preset correlation degree threshold value, acquiring a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word; and taking the content matched with the target reference word in the database as a search result of the search word.
In this embodiment, the word segmentation fragments of the search word input by the user are matched with the dictionary tree. When corresponding nodes can be found in the dictionary tree for the word segmentation fragments, the matching result indicates that the search word matches the dictionary tree; the search word is then queried in the database, and the corresponding search results are obtained as the candidate content.
Further, the relevance (that is, the correlation degree) between the search word and the candidate content is obtained. For example, if the candidate content returned from the database consists of ten web page articles, the relevance between the search word and each web page article is calculated using a relevance algorithm such as the BM25 algorithm, and the average of the ten relevance values is compared with the preset relevance threshold. Assuming a full score of 100 and a preset relevance threshold of 50, if the relevance between the search word input by the user and the candidate content is lower than 50, the target reference word is determined from the plurality of reference words included in the reference word list according to the feature vector of the search word input by the user.
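The threshold check described above can be sketched as follows. This is a minimal sketch, assuming the per-article relevance scores have already been computed (for example, by BM25) and scaled to a 0 to 100 range; the function name `needs_error_correction` and the handling of an empty result set are assumptions, not part of the described method.

```python
from statistics import mean
from typing import Sequence


def needs_error_correction(
    relevance_scores: Sequence[float],
    threshold: float = 50.0,  # preset relevance threshold from the example
) -> bool:
    """Return True when the average relevance does not exceed the threshold."""
    if not relevance_scores:
        # Assumption: an empty candidate list is treated as low relevance.
        return True
    # "Smaller than or equal to" follows the wording of the method steps.
    return mean(relevance_scores) <= threshold
```

When the function returns True, the search word would be passed on to the feature-vector comparison against the reference word list; otherwise the candidate content would be returned as the search result.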
In one embodiment, after the search word input by the user is queried in the database and the corresponding search results are obtained as the candidate content, a click model can be used to estimate whether the candidate content is a search result of interest to the user. For example, the probability that the user clicks on a web page article in the candidate content is calculated. When the predicted click probability is lower than a preset threshold, the target reference word is determined from the plurality of reference words included in the reference word list according to the feature vector of the search word, and the content in the database that matches the target reference word is used as the search result of the search word.
The click model models the search behavior of the user by mining information such as search words, the search content corresponding to each search word, and the search results clicked within that content, and by building a probabilistic graphical model based on certain preconditions. Click models include, but are not limited to, the cascade model, the dynamic Bayesian network model, and the like.
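As an illustration of one such click model, the following sketch applies the cascade model, which assumes that the user examines results from top to bottom and stops at the first click. The attractiveness values are assumed to have been estimated beforehand from click logs; the function names and the threshold parameter are hypothetical.

```python
from typing import List, Sequence


def cascade_click_probabilities(attractiveness: Sequence[float]) -> List[float]:
    """Click probability at each rank under the cascade model."""
    probabilities = []
    examine = 1.0  # probability that the user reaches the current rank
    for a in attractiveness:
        probabilities.append(examine * a)
        # The user moves on to the next rank only if this result is skipped.
        examine *= 1.0 - a
    return probabilities


def probability_of_any_click(attractiveness: Sequence[float]) -> float:
    """Probability that at least one result in the list is clicked."""
    return sum(cascade_click_probabilities(attractiveness))


def should_fall_back(attractiveness: Sequence[float], threshold: float) -> bool:
    """True when the predicted click probability is below the preset threshold."""
    return probability_of_any_click(attractiveness) < threshold
```

Under these assumptions, a candidate list whose results are all unattractive yields a low overall click probability, which would trigger the determination of a target reference word.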
In one embodiment, the data error correction method further comprises the steps of: obtaining a search word error correction log, wherein the search word error correction log comprises a plurality of error correction records, and each error correction record in the plurality of error correction records comprises an input search word and a corresponding target reference word; acquiring a reference word to be added which is input according to an error correction record with errors in the search word error correction log; and adding the reference words to be added into the reference word list, and updating the dictionary tree.
As shown in Fig. 5, the search word input by the user and the corresponding target reference word are recorded in the error correction log, so that a plurality of error correction records are obtained. On a government website, for example, the term "driver license" does not actually appear, although "driver's license" is itself a correct word; "driver's license" was therefore not added to the reference word list in Table 1 and is likely to be wrongly corrected to "passport". When an operator obtains the error correction log from the server and finds such an automatic error correction mistake, the operator can input "driver's license" as the reference word to be added, and the server adds it to the reference word list.
Further, the server searches the reference word to be added in the dictionary tree, matches nodes in the dictionary tree with word segmentation fragments of the reference word to be added, and updates the nodes of the dictionary tree when the matching result indicates that the reference word to be added cannot find corresponding nodes in the dictionary tree for matching. Therefore, the nodes in the dictionary tree can be effectively and dynamically updated according to the error correction log, so that the reference words are effectively expanded, and the completeness and accuracy of the reference words stored in the dictionary tree are enhanced.
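The update step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary tree is represented as nested dictionaries, whitespace-separated tokens stand in for word segmentation fragments, a newly added reference word starts with a word frequency of 0, and all function names are hypothetical.

```python
from typing import Dict, Iterable, List

END = None  # marker key for "a reference word ends here"; never equals a fragment


def segment(word: str) -> List[str]:
    # Assumption: whitespace tokens stand in for word segmentation fragments.
    return word.split()


def trie_contains(trie: dict, word: str) -> bool:
    """Check whether every fragment of the word is found along one path."""
    node = trie
    for fragment in segment(word):
        if fragment not in node:
            return False
        node = node[fragment]
    return END in node


def trie_insert(trie: dict, word: str) -> None:
    """Create the missing nodes for the fragments of the word."""
    node = trie
    for fragment in segment(word):
        node = node.setdefault(fragment, {})
    node[END] = True


def add_reference_words(
    reference_words: Dict[str, int], trie: dict, words_to_add: Iterable[str]
) -> List[str]:
    """Add operator-supplied words to the list and update the dictionary tree."""
    added = []
    for word in words_to_add:
        reference_words.setdefault(word, 0)  # assumed initial word frequency
        if not trie_contains(trie, word):
            trie_insert(trie, word)
            added.append(word)
    return added
```

With this sketch, adding "driver's license" creates the missing nodes, so that a later search for the same word matches the dictionary tree and is no longer corrected.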
Fig. 6 is a schematic structural diagram of a data error correction device according to an embodiment of the present application. As shown in Fig. 6, the device includes:
a data acquisition module 601, configured to acquire a search term input by a user;
The data matching module 602 is configured to match the search word with a pre-created dictionary tree to obtain a matching result, where the dictionary tree includes a plurality of nodes, and each node in the plurality of nodes is configured to represent a word segmentation segment of a reference word in a reference word list;
The data determining module 603 is configured to obtain a feature vector of the search word if the matching result indicates that the search word does not match the dictionary tree, and determine a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;
and the data output module 604 is used for taking the content matched with the target reference word in the database as the search result of the search word.
In one embodiment, the data matching module 602 matches the search term with a pre-created dictionary tree to obtain a matching result, including:
performing word segmentation processing on the search word to obtain a plurality of word segmentation fragments of the search word;
Matching each word segmentation segment in the plurality of word segmentation segments with each node in a pre-created dictionary tree;
if the word segmentation segment is not matched with the nodes in the dictionary tree, a matching result is generated, and the matching result is used for indicating that the search word is not matched with the dictionary tree.
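The matching steps above can be sketched as follows. This is an illustrative sketch only: whitespace-separated tokens stand in for word segmentation fragments, the names `TrieNode`, `build_trie`, and `matches_trie` are hypothetical, and treating a match as valid only when the path ends at a complete reference word is an assumption.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """One node of the dictionary tree, representing one fragment."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_end = False  # True if a reference word ends at this node


def segment(word: str) -> List[str]:
    # Assumption: whitespace tokens stand in for word segmentation fragments.
    return word.split()


def build_trie(reference_words: Iterable[str]) -> TrieNode:
    """Create the dictionary tree from the reference word list."""
    root = TrieNode()
    for word in reference_words:
        node = root
        for fragment in segment(word):
            node = node.children.setdefault(fragment, TrieNode())
        node.is_end = True
    return root


def matches_trie(root: TrieNode, search_word: str) -> bool:
    """Return the matching result: True if the search word matches the tree."""
    fragments = segment(search_word)
    if not fragments:
        return False
    node = root
    for fragment in fragments:
        child = node.children.get(fragment)
        if child is None:
            return False  # a fragment has no corresponding node
        node = child
    # Assumption: the path must end at a complete reference word.
    return node.is_end
```

A result of False corresponds to the matching result indicating that the search word does not match the dictionary tree, which is the case in which a target reference word is determined.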
In one embodiment, the data determining module 603 determines, according to the feature vector of the search term, a target reference term from a plurality of reference terms included in the reference term list, including:
Acquiring a feature vector of each reference word in a plurality of reference words included in the reference word list;
calculating the similarity between the feature vector of the search word and the feature vector of each reference word;
And taking the corresponding reference word with the highest similarity as the target reference word.
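The three steps above can be sketched as follows. The sketch assumes cosine similarity as the similarity measure and assumes that the feature vectors have already been produced by some embedding step; the text here only requires a similarity between feature vectors, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar_reference_word(
    search_vector: Sequence[float],
    reference_vectors: Dict[str, Sequence[float]],
) -> str:
    """Return the reference word whose feature vector is most similar."""
    if not reference_vectors:
        raise ValueError("the reference word list is empty")
    return max(
        reference_vectors,
        key=lambda word: cosine_similarity(search_vector, reference_vectors[word]),
    )
```

Other similarity measures, such as Euclidean distance converted into a similarity, could be substituted without changing the overall flow.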
In one embodiment, the data determining module 603 uses, as the target reference word, the reference word with the highest corresponding similarity, including:
acquiring a first reference word with the highest corresponding similarity and a second reference word with the second highest corresponding similarity;
Obtaining a difference value between the similarity corresponding to the first reference word and the similarity corresponding to the second reference word;
Judging whether the difference value is smaller than or equal to a preset difference value threshold value or not;
If yes, inquiring word frequency of the first reference word and word frequency of the second reference word from the reference word list, and taking the reference word with the highest word frequency in the first reference word and the second reference word as a target reference word;
and if not, taking the first reference word as a target reference word.
In one embodiment, if the matching result indicates that the search term matches the dictionary tree, the data determining module 603 is further configured to query the database for candidate content matching the search term;
the data determining module 603 is further configured to obtain a correlation degree between the candidate content and the search term;
The data determining module 603 is further configured to obtain a feature vector of the search term if the correlation is less than or equal to a preset correlation threshold, and determine a target reference term from a plurality of reference terms included in the reference term list according to the feature vector of the search term;
The data output module 604 is further configured to use the content in the database that matches the target reference word as a search result of the search word.
In one embodiment, before the search word is matched with the pre-created dictionary tree to obtain a matching result, the data acquisition module 601 is further configured to obtain a reference word list, where the reference word list includes a plurality of reference words;
The data acquisition module 601 is further configured to perform word segmentation processing on each reference word in the plurality of reference words, so as to obtain a plurality of word segmentation segments of each reference word;
The data acquisition module 601 is further configured to generate nodes of a dictionary tree according to each word segmentation segment of each reference word, so as to create a dictionary tree corresponding to the plurality of reference words.
In one embodiment, the data acquisition module 601 acquires a list of reference words, including:
extracting key data from content included in a database, wherein the key data comprises a plurality of keywords and the occurrence number of each keyword in the plurality of keywords;
Acquiring a user search record, wherein the user search record comprises a plurality of search words and the occurrence frequency of each search word in the plurality of search words;
and creating a reference word list according to the key data and the user search record.
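The creation of the reference word list can be sketched as follows. This is a minimal sketch, assuming that the word frequency of a reference word is the sum of its occurrence count in the database content and its occurrence count in the user search record; the function name and the `min_freq` filter are hypothetical.

```python
from collections import Counter
from typing import Dict, Mapping


def create_reference_word_list(
    keyword_counts: Mapping[str, int],
    search_counts: Mapping[str, int],
    min_freq: int = 1,  # hypothetical cut-off for rarely seen words
) -> Dict[str, int]:
    """Merge key data and the user search record into word -> word frequency."""
    combined = Counter(keyword_counts)
    # Counter.update adds the counts of words that are already present.
    combined.update(search_counts)
    return {word: count for word, count in combined.items() if count >= min_freq}
```

Raising `min_freq` is one possible way to keep rare or mistyped search words out of the reference word list; whether such filtering is applied is not stated in the text.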
In one embodiment, the data acquisition module 601 is further configured to obtain a search word error correction log, where the search word error correction log includes a plurality of error correction records, and each error correction record in the plurality of error correction records includes an input search word and a corresponding target reference word;
The data acquisition module 601 is further configured to obtain a reference word to be added, which is input according to an error correction record with errors in the search word error correction log;
The data acquisition module 601 is further configured to add the reference word to be added to the reference word list, and update the dictionary tree.
According to the data error correction device provided by the embodiment of the application, after the search word is obtained, the search word is matched with the pre-created dictionary tree, and whether the search word needs error correction is determined according to the obtained matching result. When the matching result indicates that the search word does not match the dictionary tree, the target reference word is determined according to the similarity between the feature vector of the search word and the feature vectors of the plurality of reference words included in the reference word list. Finally, the content in the database that matches the target reference word is used as the search result of the search word. In this way, the search word can be automatically and accurately corrected, and the efficiency and accuracy of data query are improved.
Fig. 7 is a schematic structural diagram of a server according to an embodiment of the present application. As shown in Fig. 7, the server includes an input device 701, an output device 702, a processor 703, a memory 704, a program 705, and a communication bus 706, where the input device 701, the output device 702, the processor 703, and the memory 704 communicate with each other through the communication bus 706.
A memory 704 for storing a program 705;
The processor 703 is configured to execute the program 705 stored in the memory 704, thereby implementing the following steps:
Acquiring search words input by a user;
matching the search word with a pre-created dictionary tree to obtain a matching result, wherein the dictionary tree comprises a plurality of nodes, and each node in the plurality of nodes is used for representing one word segmentation segment of a reference word in a reference word list;
if the matching result indicates that the search word is not matched with the dictionary tree, obtaining a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;
and taking the content matched with the target reference word in the database as the search result of the search word.
In one embodiment, the processor 703 matches the search term to a pre-created dictionary tree to obtain a matching result, including:
performing word segmentation processing on the search word to obtain a plurality of word segmentation fragments of the search word;
Matching each word segmentation segment in the plurality of word segmentation segments with each node in a pre-created dictionary tree;
if the word segmentation segment is not matched with the nodes in the dictionary tree, a matching result is generated, and the matching result is used for indicating that the search word is not matched with the dictionary tree.
In one embodiment, the processor 703 determines a target reference word from the plurality of reference words included in the reference word list according to the feature vector of the search word, including:
Acquiring a feature vector of each reference word in a plurality of reference words included in the reference word list;
calculating the similarity between the feature vector of the search word and the feature vector of each reference word;
And taking the corresponding reference word with the highest similarity as the target reference word.
In one embodiment, the processor 703 takes the corresponding reference word with the highest similarity as the target reference word, and includes:
acquiring a first reference word with the highest corresponding similarity and a second reference word with the second highest corresponding similarity;
Obtaining a difference value between the similarity corresponding to the first reference word and the similarity corresponding to the second reference word;
Judging whether the difference value is smaller than or equal to a preset difference value threshold value or not;
If yes, inquiring word frequency of the first reference word and word frequency of the second reference word from the reference word list, and taking the reference word with the highest word frequency in the first reference word and the second reference word as a target reference word;
and if not, taking the first reference word as a target reference word.
In one embodiment, the processor 703 is further configured to perform the following:
if the matching result indicates that the search word is matched with the dictionary tree, searching candidate content matched with the search word from a database;
acquiring the correlation degree between the candidate content and the search word;
if the correlation degree is smaller than or equal to a preset correlation degree threshold value, acquiring a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;
and taking the content matched with the target reference word in the database as a search result of the search word.
In one embodiment, the processor 703 is further configured to perform the following operations before matching the search term with the pre-created dictionary tree to obtain a matching result:
acquiring a reference word list, wherein the reference word list comprises a plurality of reference words;
performing word segmentation processing on each reference word in the plurality of reference words to obtain a plurality of word segmentation fragments of each reference word;
Generating nodes of a dictionary tree according to each word segmentation segment of each reference word so as to create the dictionary tree corresponding to the plurality of reference words.
In one embodiment, the processor 703 obtains a list of reference words, including:
extracting key data from content included in a database, wherein the key data comprises a plurality of keywords and the occurrence number of each keyword in the plurality of keywords;
Acquiring a user search record, wherein the user search record comprises a plurality of search words and the occurrence frequency of each search word in the plurality of search words;
and creating a reference word list according to the key data and the user search record.
In one embodiment, the processor 703 is further configured to perform the following:
Obtaining a search word error correction log, wherein the search word error correction log comprises a plurality of error correction records, and each error correction record in the plurality of error correction records comprises an input search word and a corresponding target reference word;
acquiring a reference word to be added which is input according to an error correction record with errors in the search word error correction log;
and adding the reference words to be added into the reference word list, and updating the dictionary tree.
According to the server provided by the embodiment of the application, after the search word is obtained, the search word is matched with the pre-created dictionary tree, and whether the search word needs error correction is determined according to the obtained matching result. When the matching result indicates that the search word does not match the dictionary tree, the target reference word is determined according to the similarity between the feature vector of the search word and the feature vectors of the plurality of reference words included in the reference word list. Finally, the content in the database that matches the target reference word is used as the search result of the search word. In this way, the search word can be automatically and accurately corrected, and the efficiency and accuracy of data query are improved.
The embodiment of the present application also provides a computer readable storage medium storing a computer program, where the computer program includes program instructions, and when the program instructions are executed by a processor, the program instructions may execute the steps executed by the server in the above embodiment.
Those skilled in the art will appreciate that all or part of the flows of the above-described method embodiments may be implemented by a computer program stored on a computer-readable storage medium; when executed, the program may include the flows of the embodiments of the data error correction method described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps performed in the embodiments of the methods described above.
The above examples illustrate only a few embodiments of the application. Although they are described in detail, they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.