CN112115232A

CN112115232A - Data error correction method and device and server

Info

Publication number: CN112115232A
Application number: CN202011016203.1A
Authority: CN
Inventors: 韩时通
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-09-24
Filing date: 2020-09-24
Publication date: 2020-12-22

Abstract

The embodiment of the invention discloses a data error correction method, a device and a server, wherein the method comprises the following steps: acquiring a search word input by a user; matching the search word with a pre-established dictionary tree to obtain a matching result, wherein the dictionary tree comprises a plurality of nodes, and each node in the plurality of nodes is used for representing a word segmentation segment of a reference word in a reference word list; if the matching result indicates that the search word is not matched with the dictionary tree, acquiring a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word; and taking the content matched with the target reference word in the database as the search result of the search word. The method can accurately carry out automatic error correction on the search terms, and improves the efficiency and accuracy of data query.

Description

Data error correction method and device and server

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data error correction method, apparatus, and server.

Background

With the rapid development of Internet technology, the amount of information on the Internet keeps growing, and people pay increasing attention to how to obtain the information they need more effectively. Most people search for information through search engines. However, when entering a search word, users often mistype characters or enter extra characters for various reasons. For example, because of homophones, a user may enter "public accumulation fund" as "cock fund". The search engine may then return results that do not meet the user's expectations. The user has to look through a large number of result pages, usually spending a lot of time, before noticing the input error and searching again with a corrected search word, or has to keep changing the search word to obtain useful information. This way of searching does not achieve intelligent query and is inefficient.

Disclosure of Invention

In view of this, the embodiment of the present invention provides a data error correction method, which can accurately perform automatic error correction on a search term, and improve the efficiency and accuracy of data query.

In a first aspect, an embodiment of the present invention provides a data error correction method, including:

acquiring a search word input by a user;

matching the search word with a pre-established dictionary tree to obtain a matching result, wherein the dictionary tree comprises a plurality of nodes, and each node in the plurality of nodes is used for representing a word segmentation segment of a reference word in a reference word list;

if the matching result indicates that the search word is not matched with the dictionary tree, acquiring a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;

and taking the content matched with the target reference word in the database as the search result of the search word.

In a second aspect, an embodiment of the present invention provides a data error correction apparatus, including:

the data acquisition module is used for acquiring search terms input by a user;

the data matching module is used for matching the search word with a pre-established dictionary tree to obtain a matching result, the dictionary tree comprises a plurality of nodes, and each node in the plurality of nodes is used for representing a word segmentation segment of a reference word in a reference word list;

the data determination module is used for acquiring the feature vector of the search word if the matching result indicates that the search word is not matched with the dictionary tree, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;

and the data output module is used for taking the content matched with the target reference word in the database as the search result of the search word.

In a third aspect, an embodiment of the present application provides a server, where the server includes a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected to each other, where the memory is used to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to perform an operation involved in the above-mentioned data error correction method.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the above-mentioned data error correction method is performed.

In a fifth aspect, embodiments of the present application provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to execute the above-mentioned data error correction method.

According to the embodiment of the invention, the obtained search word is first matched with the pre-created dictionary tree, and whether the search word needs to be corrected is determined from the matching result. When the matching result indicates that the search word is not matched with the dictionary tree, the feature vector of the search word is obtained, and the target reference word is determined according to the similarity between the feature vector of the search word and the plurality of reference words included in the reference word list. Finally, the content matched with the target reference word in the database is used as the search result of the search word. In this way the search word can be corrected automatically and accurately, and the efficiency and accuracy of data query are improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic structural diagram of a data retrieval system according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a data error correction method according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating steps of creating a dictionary tree according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating a structure of a dictionary tree according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an error correction log interface provided by an embodiment of the invention;

FIG. 6 is a schematic structural diagram of a data error correction apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a server according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, cloud storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Cloud computing is a computing model that distributes computing tasks over a resource pool formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user to be infinitely expandable: they can be obtained at any time, used on demand, expanded at any time, and paid for according to use.

A distributed cloud storage system is a storage system that uses functions such as cluster applications, grid technology, and distributed storage file systems to make a large number of storage devices of different types in a network (storage devices are also called storage nodes) work together through application software or application interfaces, and that provides data storage and service access functions to the outside.

A database can be regarded, in short, as an electronic filing cabinet: a place for storing electronic files, in which a user can add, query, update, and delete data. A "database" is a collection of data that is stored together, can be shared by multiple users, has as little redundancy as possible, and is independent of the applications that use it.

When the data error correction method provided by the present application is used to correct an input search word, technologies such as cloud computing, cloud storage, and databases within artificial intelligence technology are involved. By matching the search word input by the user against the dictionary tree, the search word can be corrected automatically, which improves the efficiency and accuracy of data query.

Before explaining the embodiments of the present application in detail, an application scenario of the embodiments of the present application will be described.

The data error correction method in the embodiment of the present application can be applied in particular to correcting search words on small and medium-sized portal websites, such as government affairs service websites. Existing government affairs service websites have accumulated little technology and lack network-related talent, which makes subsequent operation of the websites difficult; moreover, once the system goes online, only its availability is guaranteed, not its practicality. The government affairs service website is only an example; the method can also be applied to other small and medium-sized portals, such as enterprise portals.

FIG. 1 is a schematic structural diagram of a data retrieval system according to an embodiment of the present invention. The data retrieval system may comprise a user terminal 101, a network 102, and a server 103, with the user terminal 101 and the server 103 communicating via the network 102. The user terminal 101 obtains a search word, which may be text input by a user of the user terminal 101, and sends the search word to the server 103 through the network 102. The server 103 matches the search word to determine whether an accurate search result can be obtained by using the search word directly, for example by determining whether the search word is in the dictionary tree. If the search word is not in the dictionary tree, the server 103 determines that the search word is not matched, automatically corrects it, and uses the content matching the target reference word obtained by the correction as the search result of the search word. The network 102 may include various connection types, such as wired links, wireless communication links, and optical fiber cables. The user terminal 101 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The server 103 may be implemented as an independent server or as a server cluster formed by a plurality of servers, such as a government affairs server or a server cluster of a government affairs platform; it may also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.

It should be understood that the system architecture diagram described in the embodiment of the present application is intended to illustrate the technical solution of the embodiment more clearly and does not limit it. As a person having ordinary skill in the art knows, as the system architecture evolves and new service scenarios appear, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.

In one embodiment, as shown in FIG. 2, an embodiment of the present invention provides a data error correction method based on the data retrieval system of FIG. 1. This embodiment is illustrated mainly by applying the method to the server 103 in FIG. 1, and includes the following steps:

Step S201, obtaining a search word input by a user.

In the embodiment of the invention, when a user needs to search for information through the user terminal, the user can input a search word in the search box. The user terminal thereby obtains the search word input by the user and sends it to the server, and the server obtains the search word. The search box is an interactive control in a search engine system and is used to retrieve the corresponding content from a mass of information according to the search text input in the search box.

It should be noted that, in practical applications, when a user inputs a search term in a search box, the user may manually input the search term, or input the search term in a form of voice, and the like.

Step S202, matching the search word with a dictionary tree which is created in advance to obtain a matching result, wherein the dictionary tree comprises a plurality of nodes, and each node in the plurality of nodes is used for representing a word segmentation segment of a reference word in a reference word list.

The dictionary tree (Trie), also called a prefix tree, is a tree-like data structure that includes a plurality of nodes and can be used for tasks such as string matching and fast lookup. It minimizes meaningless string comparisons and improves the efficiency of word frequency statistics and string sorting. Its core idea is to trade space for time: by building a tree structure that exploits the common prefixes of strings, query overhead is reduced. A dictionary tree generally has three properties: 1) the root node contains no characters, and every node other than the root node contains exactly one character string, which can be a word segmentation segment of a reference word in the reference word list; 2) concatenating all the characters on the path from the root node to a given leaf node yields the combined character string corresponding to that node, and each combined character string can be a reference word; 3) all child nodes of any one node contain different characters.

In the embodiment of the present application, the dictionary tree is created in advance as follows: a plurality of reference words are obtained from a reference word list; word segmentation is performed on each reference word to obtain its word segmentation segments; and, taking each segment as a unit, the segments of each reference word are stored in order in different nodes along one path of the dictionary tree. During creation, if a node already exists for the first (or an earlier-appearing) segment of a reference word, the pointer is moved to that node; otherwise, a node is created for the segment.
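The node layout and insertion rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `TrieNode` class, the `is_word_end` flag, and the English stand-in segments are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One node of the dictionary tree; holds a single word segment."""
    segment: str = ""                               # empty for the root node
    children: dict = field(default_factory=dict)    # segment -> TrieNode
    is_word_end: bool = False                       # assumed flag: a reference word ends here


def insert(root: TrieNode, segments: list[str]) -> None:
    """Store one reference word, given as its ordered segments."""
    node = root
    for seg in segments:
        # Reuse the child if this segment already exists on the path,
        # otherwise create a new node for it.
        if seg not in node.children:
            node.children[seg] = TrieNode(segment=seg)
        node = node.children[seg]
    node.is_word_end = True


def count_nodes(node: TrieNode) -> int:
    """Count all nodes below (and excluding) the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())


root = TrieNode()
insert(root, ["birth", "allowance"])   # illustrative stand-in segments
insert(root, ["birth", "subsidy"])     # shares the "birth" node with the word above
```

Because both words share the prefix segment "birth", only three nodes are created below the root, which is the space saving the common-prefix property is meant to provide.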

In one embodiment, the server obtains a large amount of raw corpus data from a database. The corpus contains key data, and the key data includes keywords and their occurrence counts. The server can also obtain the user's search records over a certain period from the search engine, obtain a plurality of search words and their occurrence counts, and generate the reference word list by counting the keywords in the key data and the search words in the user's search records.

Specifically, after receiving the search word from the user terminal, the server performs word segmentation on it to obtain a plurality of word segmentation segments. Starting from the first segment, the server checks whether the first-layer child nodes of the root node of the pre-created dictionary tree contain a node for the current segment. If such a node exists, the server continues from that node and matches the next segment of the search word. If no node exists for a segment, the server generates a matching result indicating that the search word is not matched with the dictionary tree.

As a specific example of this embodiment, suppose the dictionary tree includes nodes for the word segmentation segments of the reference word "driving license application", namely "driving", "license", "claim", and "collar", and the user terminal inputs a mistyped form of this word. Word segmentation of the search word yields the segments "driving", "license", "claim", and "command", which are matched against the nodes in the dictionary tree. The first three segments find matching nodes, but because "collar" was mistyped as "command", no node can be found for the segment "command", and the search word is therefore determined not to match the dictionary tree.
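The segment-by-segment matching in this example could look roughly like the sketch below. It assumes a nested-dictionary trie and an `END` marker for complete reference words; both are illustrative choices, and the source does not say whether a query that is only a prefix of a reference word counts as matched.

```python
END = object()  # hypothetical end-of-word marker, never equal to any segment


def build_trie(words):
    """Build a nested-dict trie; each key is one word segment."""
    root = {}
    for segments in words:
        node = root
        for seg in segments:
            node = node.setdefault(seg, {})
        node[END] = True
    return root


def matches(trie, segments):
    """Walk the trie one segment at a time, starting below the root."""
    node = trie
    for seg in segments:
        if seg not in node:
            return False      # no node for this segment: not matched
        node = node[seg]      # continue matching from the current node
    return END in node        # assumption: only complete reference words match


trie = build_trie([["driving", "license", "claim", "collar"]])
correct = ["driving", "license", "claim", "collar"]
mistyped = ["driving", "license", "claim", "command"]
print(matches(trie, correct), matches(trie, mistyped))
```

The mistyped query fails at its last segment, which is the point at which the method falls back to feature-vector similarity.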

Step S203, if the matching result indicates that the search word is not matched with the dictionary tree, acquiring a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word.

A feature vector is a vector that represents natural language text in a way that expresses its semantics. A text representation algorithm can be used to vectorize the search word and the reference words in the reference word list to obtain the corresponding feature vectors. Such algorithms include methods based on the vector space model, methods based on topic models, methods based on neural networks, and the like.

Specifically, to determine the target reference word, the feature vector of each reference word in the reference word list is obtained first. The similarity between the feature vector of the search word and the feature vector of each reference word is then calculated, the similarities are ranked, and the reference word with the highest similarity is taken as the target reference word.

The similarity measures how close the feature vector of the search word is to the feature vector of each reference word in the reference word list. It may be calculated using similarity algorithms including, but not limited to, the Euclidean distance algorithm, the cosine similarity algorithm, the Pearson correlation coefficient algorithm, and the Jaccard similarity coefficient algorithm.

For example, suppose the user terminal inputs a mistyped form of "driving license application" and the reference word list includes the reference word "driving license application". The server converts the search word into a feature vector, calculates its similarity with the feature vector of each reference word in the reference word list, and, on finding that "driving license application" has the highest similarity, takes it as the target reference word.
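As a rough sketch of this selection step, the code below uses cosine similarity, one of the measures named above, and returns the reference word with the highest score. The three-dimensional vectors are invented toy values; the source does not specify how feature vectors are produced.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def pick_target(query_vec, reference_vecs):
    """Return the (reference word, similarity) pair with the highest similarity."""
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in reference_vecs.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Toy vectors, assumed for illustration only.
reference_vecs = {
    "driving license application": [0.9, 0.1, 0.0],
    "accumulation fund": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]
target, score = pick_target(query_vec, reference_vecs)
```

With a distance measure such as Euclidean distance, the selection would instead take the smallest value rather than the largest.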

Step S204, taking the content matched with the target reference word in the database as the search result of the search word.

Specifically, after obtaining the target reference word, the server searches the database directly with the target reference word and uses what it finds as the search result of the search word. For example, when the user terminal inputs a mistyped form of "driving license application", the target reference word "driving license application" is obtained through step S202 and step S203, and the server then searches the database for the corresponding data resources using "driving license application" and returns them as the search result of the search word.

With the above data error correction method, after the search word is obtained it is matched against the pre-created dictionary tree, and whether the search word needs correction is determined from the matching result. When the matching result indicates that the search word is not matched with the dictionary tree, the target reference word is determined according to the similarity between the feature vector of the search word and the reference words in the reference word list, and the content matched with the target reference word in the database is used as the search result of the search word. The method can therefore judge whether a search word contains an error and correct it without the user having to correct the search word manually, which effectively improves the efficiency of data query.
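Steps S202 to S204 can be tied together as in the sketch below. The function name, the two callables, and the dictionary used as a database are all hypothetical stand-ins; the dictionary-tree check and the similarity search are passed in so that the control flow stays visible.

```python
from typing import Callable


def search_with_correction(
    query: str,
    in_dictionary_tree: Callable[[str], bool],
    nearest_reference: Callable[[str], str],
    database: dict[str, list[str]],
) -> tuple[str, list[str]]:
    """Return (word actually searched, matching content)."""
    if in_dictionary_tree(query):
        # S202: a query that matches the dictionary tree is used as-is.
        target = query
    else:
        # S203: otherwise fall back to the most similar reference word.
        target = nearest_reference(query)
    # S204: content matched by the target word is the result for the query.
    return target, database.get(target, [])


# Stub collaborators, assumed for illustration only.
database = {"driving license application": ["Guide: applying for a driving license"]}
result = search_with_correction(
    "driving license aplication",
    in_dictionary_tree=lambda q: q in database,
    nearest_reference=lambda q: "driving license application",
    database=database,
)
```

Keeping the matcher and the similarity search behind plain callables makes it easy to swap in a real trie lookup or a different vector model without touching the flow.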

In an embodiment, as shown in fig. 3, before matching the search term with a pre-created dictionary tree and obtaining a matching result, the method further includes a step of creating a dictionary tree, where the step specifically includes the following steps:

Step 301, extracting key data from the content included in a database, wherein the key data includes a plurality of keywords and the occurrence count of each of the keywords;

The server obtains a large amount of raw corpus data from the database, wherein the corpus contains keywords (such as "driving license application"), and counts the occurrences of each keyword.

Step 302, obtaining a user search record, wherein the user search record includes a plurality of search words and the occurrence count of each of the search words;

Specifically, the server may obtain the user's search records over a certain period from the search engine, and obtain a plurality of search words and their occurrence counts.

In one embodiment, the server obtains the user's search records over a certain period from the search engine and sorts them by time. For the search records of a given user, the user's search behavior is divided into segments separated by relatively obvious intervals, and each segment is called a search session. The server obtains the search words of each search session of the user and counts their occurrences.
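One way to split a user's time-ordered search records into sessions is sketched below. The 30-minute gap is an assumed threshold, since the source only speaks of a "relatively obvious interval"; the record layout and function names are likewise hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def split_sessions(records, gap=timedelta(minutes=30)):
    """Split one user's (timestamp, search word) records into sessions.

    A new session starts whenever the pause since the previous search
    exceeds `gap` (an assumed threshold, not taken from the source).
    """
    sessions = []
    for ts, word in sorted(records):          # sort by time first
        if sessions and ts - sessions[-1][-1][0] <= gap:
            sessions[-1].append((ts, word))   # still within the current session
        else:
            sessions.append([(ts, word)])     # obvious interval: open a new session
    return sessions


def count_words(sessions):
    """Count how often each search word occurs across all sessions."""
    return Counter(word for session in sessions for _, word in session)


records = [
    (datetime(2020, 9, 18, 10, 0), "accumulation fund"),
    (datetime(2020, 9, 18, 10, 5), "birth subsidy"),
    (datetime(2020, 9, 18, 14, 0), "accumulation fund"),
]
sessions = split_sessions(records)
counts = count_words(sessions)
```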

In one embodiment, step 302 may be performed first, and then step 301 may be performed. The embodiment of the present invention does not limit the specific order of steps 301 and 302.

Step 303, creating a reference word list according to the key data and the user search record.

Specifically, the reference words in the reference word list are generated by counting the keywords in the key data and the search words in the user search records, and the occurrence count of each reference word is used as the word frequency of that reference word.

Illustratively, the list of reference words is shown in table 1:

Table 1: list of reference words

Reference word	Word frequency	Creation time
Nurse chief	42	2020-09-18 10:04:01
Accumulation fund	210	2020-09-18 10:04:01
Birth promoting and body fluid promoting plaster	21	2020-09-18 10:04:01
Birth patch	22	2020-09-18 10:04:01
Driving license application collar	125	2020-09-18 10:04:01
Xishan ju (a Chinese character of 'xi' mountain)	1	2020-09-18 10:04:01
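A reference word list with the three columns of Table 1 could be assembled along the following lines. Summing the counts from the two sources is an assumption, as the source only says that the occurrence count becomes the word frequency; the function and field names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def build_reference_list(keyword_counts, search_counts, created=None):
    """Merge keyword and search-word counts into reference-word rows."""
    created = created or datetime.now()
    # Assumption: a word seen in both sources gets the sum of both counts.
    merged = Counter(keyword_counts) + Counter(search_counts)
    return [
        {"reference_word": word, "word_frequency": freq, "creation_time": created}
        for word, freq in merged.most_common()   # highest frequency first
    ]


rows = build_reference_list(
    {"accumulation fund": 200, "birth subsidy": 22},   # counts from database content
    {"accumulation fund": 10},                         # counts from user search records
    created=datetime(2020, 9, 18, 10, 4, 1),
)
```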

Step 304, obtaining the reference word list, wherein the reference word list includes a plurality of reference words;

Step 305, performing word segmentation processing on each of the reference words to obtain a plurality of word segmentation segments of each reference word;

In this embodiment, word segmentation processing means splitting a text sequence into individual words, and a word segmentation segment is one of the pieces obtained by that processing. For example, if the input text is "driving license application", word segmentation processing yields "driving", "license", "claim", and "collar", each of which is one word segmentation segment.

Step 306, generating nodes of a dictionary tree according to each word segmentation segment of each reference word, so as to create the dictionary tree corresponding to the plurality of reference words.

In this embodiment, the word segmentation segments of each reference word in the reference word list are stored in order in different nodes along one path of the dictionary tree. During creation of the dictionary tree, the server checks whether a node already exists for the first (or an earlier-appearing) segment of the reference word; if so, the pointer is moved to that node, and if not, a node is created for the segment.

For example, as shown in FIG. 4, take the three reference words "public accumulation fund", "birth allowance", and "birth subsidy" in the reference word list. The server first creates a root node; all nodes below the root node are child nodes. For the first segments "public" and "birth" of these reference words, the server determines that no child node connected to the root node matches the segment, and creates the child nodes "public" and "birth" connected to the root node. Similarly, for the segment "accumulation" in the reference word "public accumulation fund", the server determines that no child node connected to the node "public" matches the segment, and creates a child node "accumulation" connected to the node "public"; for the segment "fund", the server determines that no child node connected to the node "accumulation" matches the segment, and creates a child node "fund" connected to the node "accumulation". The server thereby obtains the tree structure shown in (1) of FIG. 4. The remaining tree structures shown in FIG. 4 are obtained in the same way.
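The node-by-node walkthrough above can be reproduced with a small trace that records which nodes have to be created and which are reused. The English tokens are illustrative stand-ins for the segments in FIG. 4, and the nested-dictionary layout is an assumption.

```python
def insert_with_trace(trie, segments, created):
    """Insert one reference word and record every node that had to be created."""
    node = trie
    path = []
    for seg in segments:
        path.append(seg)
        if seg not in node:
            node[seg] = {}                  # no matching child: create the node
            created.append(tuple(path))     # remember where it was created
        node = node[seg]                    # existing or new, continue from it


trie, created = {}, []
for word in (
    ["public", "accumulation", "fund"],
    ["birth", "allowance"],
    ["birth", "subsidy"],
):
    insert_with_trace(trie, word, created)

for path in created:
    print(" -> ".join(path))
```

The node for "birth" appears only once in the trace: the third word reuses it and adds a single new child.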

In one embodiment, matching the search term with a dictionary tree created in advance to obtain a matching result includes: performing word segmentation processing on the search word to obtain a plurality of word segmentation segments of the search word; matching each word segmentation segment in the plurality of word segmentation segments with each node in a pre-established dictionary tree; and if the word segmentation segment is not matched with the node in the dictionary tree, generating a matching result, wherein the matching result is used for indicating that the search word is not matched with the dictionary tree.

In this embodiment, after receiving the search word from the user terminal, the server performs word segmentation on it to obtain a plurality of word segmentation segments. Starting from the first segment, the server checks whether the first-layer child nodes of the root node of the pre-created dictionary tree contain a node for the current segment. If such a node exists, the server continues from that node and matches the next segment of the search word. If a segment of the search word has no node in the dictionary tree, the server generates a matching result indicating that the search word is not matched with the dictionary tree.

As a specific example of this embodiment, suppose the user terminal inputs a mistyped form of "driving license application". Word segmentation of the search word yields the segments "driving", "license", "claim", and "command", which are matched against the nodes in the dictionary tree. Because "collar" was mistyped as "command", no node can be found for the segment "command", and the search word is therefore determined not to match the dictionary tree.

According to the embodiment, the obtained search words are matched with the dictionary tree, whether the search words need to be corrected or not can be efficiently confirmed, and the accuracy of the search results returned by the query is guaranteed.

In one embodiment, after obtaining the feature vector of each reference word in the reference word list, the server may, when calculating the similarity between the feature vector of the search word and the feature vector of each reference word, process the sub-reference word lists of the reference word list concurrently with a plurality of preset threads, so as to obtain the reference word with the highest similarity in each sub-reference word list. Specifically, the server may use strategies such as bigram and trigram to control the split size, and thereby the amount of computation, and divide the reference word list into K sub-reference word lists, where K is greater than or equal to 2 and each sub-reference word list includes N reference words, where N is greater than or equal to 1. The server then calls the preset threads to calculate, for each sub-reference word list, the similarity between the feature vector of the search word and the feature vectors of its N reference words, obtains the reference word with the highest similarity in each sub-reference word list, and sorts these reference words in descending order of similarity to obtain the final target reference word, which has the highest similarity.

It should be noted that, in this embodiment, each thread may calculate the similarity between the feature vector of the search word and the feature vectors of the N reference words in one sub-reference word list. The threads run simultaneously without affecting one another, so the time required for a search can be reduced and the processing speed of the search increased.
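The per-list computation and the final merge can be sketched as below. The sketch rests on several assumptions: the embodiment does not name a similarity measure, so cosine similarity is used as a stand-in; the function names and the toy three-dimensional vectors are hypothetical; and one task is submitted per sub-reference word list.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    # Stand-in similarity measure (assumption; the text does not specify one).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_in_sublist(query: Vector, sublist: Dict[str, Vector]) -> Tuple[str, float]:
    """Return the (reference word, similarity) pair with the highest similarity."""
    scored = ((word, cosine_similarity(query, vec)) for word, vec in sublist.items())
    return max(scored, key=lambda pair: pair[1])


def find_target(query: Vector, sublists: List[Dict[str, Vector]]) -> Tuple[str, float]:
    # One task per sub-reference word list; tasks share no mutable state.
    with ThreadPoolExecutor(max_workers=len(sublists)) as pool:
        winners = list(pool.map(lambda s: best_in_sublist(query, s), sublists))
    # Sort the per-list winners in descending order of similarity.
    winners.sort(key=lambda pair: pair[1], reverse=True)
    return winners[0]


# Toy data: two sub-reference word lists (K = 2) with made-up vectors.
sublist_a = {"nurse certificate": [1.0, 0.0, 0.0], "accumulation fund": [0.0, 1.0, 0.0]}
sublist_b = {"driving license application": [0.0, 0.0, 1.0], "birth supplement": [0.5, 0.5, 0.0]}
target, score = find_target([0.1, 0.0, 0.9], [sublist_a, sublist_b])
```

In CPython, threads running pure-Python arithmetic are serialized by the global interpreter lock, so this sketch shows the structure of the split-and-merge step rather than a measured speed-up.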

As a specific example of this embodiment, as shown in Table 2, the server splits the reference word list in Table 1 into two sub-reference word lists and obtains the feature vectors of the reference words in each, yielding a reference word feature vector list (a) and a reference word feature vector list (b). For example, when a user terminal inputs the search word "driver license claim", the server converts it into a feature vector and calculates its similarity with the feature vector of each reference word in lists (a) and (b). The calculation against list (a) returns [accumulation fund: 0.0123], and the calculation against list (b) returns [driving license application: 0.89102]. From the results returned for lists (a) and (b), the server selects the reference word with the highest similarity, "driving license application", as the target reference word.

Illustratively, the reference word feature vector lists are shown in Table 2:

Table 2 Reference word feature vector list (a)

Reference word	Feature vector
Nurse certificate	[0.1233422,10.1292920d,101.929101]
Accumulation fund	[1.1233422,10.1292920d,101.929101]
Birth allowance	[2.1233422,10.1292920d,101.929101]

Table 2 Reference word feature vector list (b)

Reference word	Feature vector
Birth supplement	[3.1233422,10.1292920d,101.929101]
Driving license application	[4.1233422,10.1292920d,101.929101]
Xishan ju	[5.1233422,10.1292920d,101.929101]

In one embodiment, on the basis of the above embodiment, after the reference word list is obtained, the reference words in the reference word list may be converted into pinyin, the pinyin of each reference word may be subjected to word segmentation, and the resulting pinyin segments may be used to construct a pinyin dictionary tree.

Illustratively, the pinyin list of the reference words is shown in Table 3:

Table 3 Pinyin list of reference words

Reference word	Pinyin	Creation time
Nurse chief	hushizhang	2020-09-18 10:05:10
Accumulation fund	gongjijin	2020-09-18 10:05:10
Birth allowance	shengyujintie	2020-09-18 10:05:10
Birth supplement	shengyubutie	2020-09-18 10:05:10
Driving license application	jiazhaoshenling	2020-09-18 10:05:10
Xishan ju	xishanju	2020-09-18 10:05:10

After the search word input by the user has been matched against the dictionary tree, if the matching result shows that a word segmentation segment of the search word has no corresponding node in the dictionary tree, the server further matches the pinyin of the search word against the pinyin dictionary tree. If the pinyin of the search word finds corresponding nodes in the pinyin dictionary tree, the reference word corresponding to the pinyin character sequence of the matched nodes is taken as the target reference word. If it does not, the server converts the pinyin of the search word into a feature vector, calculates its similarity with the feature vector of each pinyin in the pinyin list of the reference words, obtains the pinyin with the highest similarity, and takes the reference word corresponding to that pinyin as the target reference word.
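The pinyin fallback can be sketched as follows, under stated assumptions. The pinyin of the search word is assumed to be available already (no conversion library is used), the pinyin dictionary tree is built character by character, and the standard-library `difflib.SequenceMatcher` ratio stands in for the feature vector similarity described above. The function names and the `"$word"` marker key are hypothetical; the two entries come from Table 3.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional

# Pinyin of two reference words, taken from Table 3.
PINYIN_TO_REFERENCE: Dict[str, str] = {
    "hushizhang": "Nurse chief",
    "gongjijin": "Accumulation fund",
}


def build_pinyin_trie(entries: Dict[str, str]) -> dict:
    """Build a character-level trie; the terminal node stores the reference word."""
    root: dict = {}
    for pinyin, word in entries.items():
        node = root
        for ch in pinyin:
            node = node.setdefault(ch, {})
        node["$word"] = word  # marker key cannot collide with a single character
    return root


def lookup(trie: dict, pinyin: str) -> Optional[str]:
    node = trie
    for ch in pinyin:
        node = node.get(ch)
        if node is None:
            return None
    return node.get("$word")


def correct_by_pinyin(trie: dict, entries: Dict[str, str], query_pinyin: str) -> str:
    hit = lookup(trie, query_pinyin)
    if hit is not None:
        return hit  # exact pinyin match, e.g. a homophone typo
    # Fallback: closest pinyin by string similarity (stand-in for vector similarity).
    closest = max(entries, key=lambda p: SequenceMatcher(None, query_pinyin, p).ratio())
    return entries[closest]


pinyin_trie = build_pinyin_trie(PINYIN_TO_REFERENCE)
exact = correct_by_pinyin(pinyin_trie, PINYIN_TO_REFERENCE, "gongjijin")
fuzzy = correct_by_pinyin(pinyin_trie, PINYIN_TO_REFERENCE, "gongjijing")
```

A homophone error leaves the pinyin unchanged, so it is resolved by the exact lookup; only a pinyin that differs from every entry reaches the similarity fallback.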

In one embodiment, the reference word list further includes a word frequency of each of the plurality of reference words, and the taking the corresponding reference word with the highest similarity as the target reference word includes: acquiring a first reference word with the highest corresponding similarity and a second reference word with the second highest corresponding similarity; acquiring a difference value between the similarity corresponding to the first reference word and the similarity corresponding to the second reference word; judging whether the difference value is smaller than or equal to a preset difference value threshold value or not; if yes, searching the word frequency of the first reference word and the word frequency of the second reference word from the reference word list, and taking the reference word with the highest word frequency in the first reference word and the second reference word as a target reference word; and if not, taking the first reference word as a target reference word.

In this embodiment, after the similarity between the feature vector of the search word and the feature vectors of the plurality of reference words in the reference word list has been calculated, the first reference word, with the highest similarity, and the second reference word, with the second highest similarity, are obtained. When the edit distance between the first reference word and the search word and the edit distance between the second reference word and the search word are both small, the difference between the two similarities may be less than or equal to the preset difference threshold. In that case, the word frequencies of the first and second reference words are used as the basis for selection, and the reference word with the higher word frequency is taken as the target reference word. For example, suppose the search word input by the user is "birth" and the reference word list includes the reference words "birth allowance" and "birth supplement". The edit distances between the search word "birth" and the reference words "birth allowance" and "birth supplement" are both 2, and the difference between the corresponding similarities is less than the preset difference threshold. Since the word frequency of "birth allowance" is 21 and that of "birth supplement" is 22, "birth supplement" is taken as the target reference word.
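The tie-breaking rule can be sketched as below. The function name `pick_target`, the similarity values, and the threshold values are hypothetical and chosen only for illustration; the word frequencies 21 and 22 are those of the example above.

```python
from typing import Dict, List, Tuple


def pick_target(scored: List[Tuple[str, float]],
                word_freq: Dict[str, int],
                diff_threshold: float) -> str:
    """Choose the target reference word from (word, similarity) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    first, first_sim = ranked[0]
    if len(ranked) == 1:
        return first
    second, second_sim = ranked[1]
    if first_sim - second_sim <= diff_threshold:
        # Similarities are too close to separate: fall back to word frequency.
        # On equal frequency, max() keeps the first (more similar) word.
        return max((first, second), key=lambda w: word_freq.get(w, 0))
    return first


scored = [("birth allowance", 0.81), ("birth supplement", 0.80)]  # made-up scores
word_freq = {"birth allowance": 21, "birth supplement": 22}

close_call = pick_target(scored, word_freq, diff_threshold=0.05)
clear_call = pick_target(scored, word_freq, diff_threshold=0.001)
```

With the looser threshold the two candidates count as tied and the more frequent "birth supplement" is chosen; with the stricter threshold the most similar word, "birth allowance", is kept.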

In one embodiment, the data error correction method further comprises the steps of: if the matching result indicates that the search word is matched with the dictionary tree, searching candidate content matched with the search word from a database; obtaining the correlation degree between the candidate content and the search word; if the correlation degree is smaller than or equal to a preset correlation degree threshold value, acquiring a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word; and taking the content matched with the target reference word in the database as a search result of the search word.

In this embodiment, the word segmentation segments of the search word input by the user are matched with the dictionary tree. When corresponding nodes can be found in the dictionary tree for the word segmentation segments of the search word, the matching result indicates that the search word matches the dictionary tree; the search word is then queried in the database, and the corresponding search result is obtained and used as the candidate content.

Further, the relevance between the search word and the candidate content is obtained. For example, if the candidate content returned from the database consists of ten web articles, the relevance between the search word and each web article is calculated with a relevance algorithm such as BM25, and the average of the ten relevance values is compared with a preset relevance threshold. Assuming a full score of 100 and a preset relevance threshold of 50, if the relevance between the search word input by the user and the candidate content is lower than 50, the target reference word is likewise determined from the plurality of reference words included in the reference word list according to the feature vector of the search word.
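The threshold check on the averaged relevance can be sketched as follows. The relevance scores are assumed to have been computed already (for instance by a BM25 scorer, which is not implemented here) and scaled to 0 to 100; the function name and the sample scores are hypothetical. The comparison uses "less than or equal to", following the statement of the method above, and treating an empty result set as needing correction is an additional assumption.

```python
from statistics import mean
from typing import Sequence


def needs_correction(relevance_scores: Sequence[float], threshold: float = 50.0) -> bool:
    """Decide whether a matched search word should still be corrected."""
    if not relevance_scores:
        return True  # assumption: no candidate content counts as irrelevant
    # Compare the average relevance of the candidates with the preset threshold.
    return mean(relevance_scores) <= threshold


weak_results = [62, 55, 48, 40, 35, 30, 28, 25, 20, 17]  # ten made-up scores
strong_results = [80] * 10

correct_weak = needs_correction(weak_results)
correct_strong = needs_correction(strong_results)
```

The first list averages 36 and so triggers the feature vector comparison, while the second averages 80 and its candidate content is returned unchanged.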

In one embodiment, after the search word input by the user has been queried in the database and the corresponding search result obtained as the candidate content, a click model may further be used to estimate whether the candidate content is a search result the user is interested in, for example by calculating the probability that the user clicks a web article in the candidate content. When the predicted click probability is lower than a preset threshold, the target reference word is determined from the plurality of reference words included in the reference word list according to the feature vector of the search word, and the content in the database that matches the target reference word is used as the search result of the search word. By using the click model to predict in advance the probability that the user clicks the candidate content, the input search word can be replaced automatically, so that the user does not spend time on search results of no interest and query efficiency is improved.

The click model mines information such as search words, the search content corresponding to each search word, and the search results clicked within that content, and builds a probabilistic graphical model on the basis of a number of assumptions, thereby modeling the search behavior of the user. Click models include, but are not limited to, cascade models and dynamic Bayesian network models.
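As an illustration of one of the model families named above, the sketch below computes click probabilities under the cascade model, in which the user is assumed to scan results from top to bottom and stop at the first click. The per-result attractiveness values would normally be estimated from click logs; here they are made-up inputs, and the function names and threshold are hypothetical.

```python
from typing import List, Sequence


def cascade_click_probabilities(attractiveness: Sequence[float]) -> List[float]:
    """P(click at rank i) = a_i * product over j < i of (1 - a_j)."""
    probs: List[float] = []
    p_reach = 1.0  # probability that the user examines the current rank
    for a in attractiveness:
        probs.append(p_reach * a)
        p_reach *= 1.0 - a  # the user continues only if this result was skipped
    return probs


def should_replace_search_word(attractiveness: Sequence[float], threshold: float) -> bool:
    # Probability of at least one click across the candidate content.
    p_any_click = sum(cascade_click_probabilities(attractiveness))
    return p_any_click < threshold


per_rank = cascade_click_probabilities([0.5, 0.5])
replace = should_replace_search_word([0.05, 0.05], threshold=0.2)
```

With two results of attractiveness 0.5 each, the per-rank probabilities are 0.5 and 0.25. With two weak results of 0.05 each, the chance of any click is about 0.0975, below the example threshold, so the search word would be replaced.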

In one embodiment, the data error correction method further comprises the steps of: acquiring a search word error correction log, wherein the search word error correction log comprises a plurality of error correction records, and each error correction record in the plurality of error correction records comprises an input search word and a corresponding target reference word; acquiring a reference word to be added, the reference word to be added being input on the basis of an erroneous error correction record in the search word error correction log; and adding the reference word to be added into the reference word list, and updating the dictionary tree.

As shown in fig. 5, the error correction log records each search word input by a user together with the corresponding target reference word, yielding a plurality of error correction records. On a government website, for example, the term "driver license" may in fact not appear in the reference word list in Table 1 even though it is a correct term; until it is added to that list, a search for it is likely to be wrongly corrected to "passport". When an operator obtains the error correction log from the server and finds that such an automatic correction was wrong, "driver license" can be supplied as a reference word to be added, and the server adds the reference word to be added, "driver license", to the reference word list.

Further, the server looks up the reference word to be added in the dictionary tree by matching its word segmentation segments against the nodes of the dictionary tree, and updates the nodes of the dictionary tree when the matching result indicates that no corresponding nodes can be found for the reference word to be added. In this way, the nodes of the dictionary tree can be updated dynamically according to the error correction log, which expands the set of reference words and improves the completeness and accuracy of the reference words stored in the dictionary tree.
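The update step can be sketched as below, with the dictionary tree represented as nested dictionaries keyed by segment. The representation, the function names, and the segment split of "driver license" are assumptions made for the example; the review of the error correction log by an operator is outside the code.

```python
from typing import Dict, List

Trie = Dict[str, dict]


def trie_contains(trie: Trie, segments: List[str]) -> bool:
    node = trie
    for seg in segments:
        if seg not in node:
            return False
        node = node[seg]
    return True


def add_reference_word(reference_words: List[str], trie: Trie,
                       word: str, segments: List[str]) -> bool:
    """Add an operator-supplied word; return True if the trie gained nodes."""
    if word not in reference_words:
        reference_words.append(word)
    if trie_contains(trie, segments):
        return False  # all nodes already exist, nothing to update
    node = trie
    for seg in segments:
        node = node.setdefault(seg, {})  # create only the missing nodes
    return True


reference_words = ["passport"]
trie: Trie = {"passport": {}}

changed = add_reference_word(reference_words, trie, "driver license", ["driver", "license"])
unchanged = add_reference_word(reference_words, trie, "driver license", ["driver", "license"])
```

The first call extends both the reference word list and the tree; the second finds the nodes already present and leaves the tree untouched.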

Fig. 6 is a schematic structural diagram of a data error correction apparatus provided in an embodiment of the present application. As shown in fig. 6, the apparatus includes:

a data obtaining module 601, configured to obtain a search term input by a user;

a data matching module 602, configured to match the search word with a pre-created dictionary tree to obtain a matching result, where the dictionary tree includes multiple nodes, and each node in the multiple nodes is used to represent a word segmentation segment of a reference word in a reference word list;

a data determining module 603, configured to, if the matching result indicates that the search word is not matched with the dictionary tree, obtain a feature vector of the search word, and determine a target reference word from multiple reference words included in the reference word list according to the feature vector of the search word;

and a data output module 604, configured to use the content in the database that matches the target reference word as a search result of the search word.
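How the four modules cooperate can be sketched as a single class. This is only an illustrative skeleton: the class name `DataErrorCorrector`, the injected callables, and the in-memory dictionary standing in for the database are all hypothetical, and the matching and similarity logic is supplied from outside rather than implemented here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DataErrorCorrector:
    matches_trie: Callable[[str], bool]     # role of the data matching module 602
    determine_target: Callable[[str], str]  # role of the data determining module 603
    database: Dict[str, List[str]]          # content queried by the data output module 604

    def search(self, search_word: str) -> List[str]:
        # The argument plays the role of the data obtaining module 601.
        if self.matches_trie(search_word):
            return self.database.get(search_word, [])
        # No match in the dictionary tree: search with the target reference word.
        target = self.determine_target(search_word)
        return self.database.get(target, [])


database = {"accumulation fund": ["Accumulation fund withdrawal guide"]}
corrector = DataErrorCorrector(
    matches_trie=lambda word: word in database,        # toy stand-in for the trie
    determine_target=lambda word: "accumulation fund",  # toy stand-in for similarity
    database=database,
)
results = corrector.search("cock fund")
```

The mistyped search word "cock fund" fails the match and is answered with the content stored for "accumulation fund".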

In one embodiment, the data matching module 602 matches the search term with a pre-created dictionary tree to obtain a matching result, including:

performing word segmentation processing on the search word to obtain a plurality of word segmentation segments of the search word;

matching each word segmentation segment in the plurality of word segmentation segments with each node in a pre-established dictionary tree;

and if the word segmentation segment is not matched with the node in the dictionary tree, generating a matching result, wherein the matching result is used for indicating that the search word is not matched with the dictionary tree.

In one embodiment, the determining the target reference word from the plurality of reference words included in the reference word list by the data determining module 603 according to the feature vector of the search word includes:

acquiring a feature vector of each reference word in a plurality of reference words included in the reference word list;

calculating the similarity between the feature vector of the search word and the feature vector of each reference word;

and taking the corresponding reference word with the highest similarity as a target reference word.

In one embodiment, the data determining module 603 sets the corresponding reference word with the highest similarity as the target reference word, including:

acquiring a first reference word with the highest corresponding similarity and a second reference word with the second highest corresponding similarity;

acquiring a difference value between the similarity corresponding to the first reference word and the similarity corresponding to the second reference word;

judging whether the difference value is smaller than or equal to a preset difference value threshold value or not;

if yes, searching the word frequency of the first reference word and the word frequency of the second reference word from the reference word list, and taking the reference word with the highest word frequency in the first reference word and the second reference word as a target reference word;

and if not, taking the first reference word as a target reference word.

In one embodiment, if the matching result indicates that the search term matches the dictionary tree, the data determination module 603 is further configured to query a database for candidate content matching the search term;

the data determining module 603 is further configured to obtain a correlation between the candidate content and the search term;

the data determining module 603 is further configured to, if the correlation degree is less than or equal to a preset correlation degree threshold, obtain a feature vector of the search word, and determine a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;

the data output module 604 is further configured to use the content in the database that matches the target reference word as a search result of the search word.

In an embodiment, before the search word is matched with a dictionary tree created in advance and a matching result is obtained, the data obtaining module 601 is further configured to obtain a reference word list, where the reference word list includes a plurality of reference words;

the data obtaining module 601 is further configured to perform word segmentation processing on each reference word in the plurality of reference words to obtain a plurality of word segmentation segments of each reference word;

the data obtaining module 601 is further configured to generate a node of a dictionary tree according to each word segmentation segment of each reference word, so as to create a dictionary tree corresponding to the plurality of reference words.

In one embodiment, the data obtaining module 601 obtains the reference word list, including:

extracting key data from contents included in a database, the key data including a plurality of keywords and a number of occurrences of each of the plurality of keywords;

acquiring a user search record, wherein the user search record comprises a plurality of search terms and the occurrence frequency of each search term in the search terms;

and creating a reference word list according to the key data and the user search records.
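One way to combine the two sources is sketched below. The embodiment does not state how the keyword counts and the search word counts are merged, so summing them into a single word frequency is an assumption, as are the function name, the `min_count` parameter, and the sample counts.

```python
from collections import Counter
from typing import Dict


def build_reference_word_list(keyword_counts: Dict[str, int],
                              search_counts: Dict[str, int],
                              min_count: int = 1) -> Dict[str, int]:
    """Merge database keywords and user search words into word -> frequency."""
    total = Counter(keyword_counts)
    total.update(search_counts)  # adds the counts of words present in both sources
    # most_common() orders by descending frequency; rare words can be filtered out.
    return {word: count for word, count in total.most_common() if count >= min_count}


keyword_counts = {"accumulation fund": 12, "birth supplement": 20}  # from database content
search_counts = {"birth supplement": 2, "nurse chief": 5}           # from user search records

reference_list = build_reference_word_list(keyword_counts, search_counts)
```

The resulting frequencies are the kind of value that the word frequency tie-break between two close reference words relies on.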

In one embodiment, the data obtaining module 601 is further configured to obtain a search term error correction log, where the search term error correction log includes a plurality of error correction records, and each error correction record in the plurality of error correction records includes an input search term and a corresponding target reference term;

the data obtaining module 601 is further configured to acquire a reference word to be added, the reference word to be added being input on the basis of an erroneous error correction record in the search word error correction log;

the data obtaining module 601 is further configured to add the reference word to be added to the reference word list, and update the dictionary tree.

According to the data error correction device provided by the embodiment of the application, after the search word is obtained, the search word is matched with the dictionary tree which is created in advance, whether the search word needs error correction or not is determined according to the obtained matching result, when the matching result indicates that the search word is not matched with the dictionary tree, the target reference word is determined according to the similarity between the feature vector of the search word and the multiple reference words included in the reference word list, and finally the content matched with the target reference word in the database is used as the search result of the search word, so that automatic error correction can be accurately performed on the search word, and the efficiency and accuracy of data query are improved.

Fig. 7 is a schematic structural diagram of a server according to an embodiment of the present disclosure. As shown in fig. 7, the server includes an input device 701, an output device 702, a processor 703, a memory 704, a program 705, and a communication bus 706, where the input device 701, the output device 702, the processor 703, and the memory 704 communicate with each other through the communication bus 706.

A memory 704 for storing a program 705;

the processor 703, when executing the program 705 stored in the memory 704, implements the following steps:

acquiring a search word input by a user;

In one embodiment, the processor 703 matches the search term with a dictionary tree created in advance to obtain a matching result, including:

In one embodiment, the processor 703 determines a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word, including:

In one embodiment, the processor 703 takes the corresponding reference word with the highest similarity as the target reference word, including:

and if not, taking the first reference word as a target reference word.

In one embodiment, the processor 703 is further configured to perform the following operations:

if the matching result indicates that the search word is matched with the dictionary tree, searching candidate content matched with the search word from a database;

obtaining the correlation degree between the candidate content and the search word;

if the correlation degree is smaller than or equal to a preset correlation degree threshold value, acquiring a feature vector of the search word, and determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word;

and taking the content matched with the target reference word in the database as a search result of the search word.

In one embodiment, before matching the search term with a pre-created dictionary tree to obtain a matching result, the processor 703 is further configured to:

acquiring a reference word list, wherein the reference word list comprises a plurality of reference words;

performing word segmentation processing on each reference word in the plurality of reference words to obtain a plurality of word segmentation segments of each reference word;

and generating nodes of a dictionary tree according to each word segmentation segment of each reference word, so as to create a dictionary tree corresponding to the plurality of reference words.

In one embodiment, the processor 703 obtains a list of reference words, including:

acquiring a search word error correction log, wherein the search word error correction log comprises a plurality of error correction records, and each error correction record in the plurality of error correction records comprises an input search word and a corresponding target reference word;

acquiring a reference word to be added, the reference word to be added being input on the basis of an erroneous error correction record in the search word error correction log;

and adding the reference word to be added into the reference word list, and updating the dictionary tree.

According to the server provided by the embodiment of the application, after the server acquires the search word, the search word is matched with the dictionary tree which is created in advance, whether the search word needs to be corrected or not is determined according to the obtained matching result, when the matching result indicates that the search word is not matched with the dictionary tree, the target reference word is determined according to the similarity between the feature vector of the search word and the multiple reference words included in the reference word list, and finally the content matched with the target reference word in the database is used as the search result of the search word, so that automatic correction can be accurately performed on the search word, and the efficiency and accuracy of data query are improved.

Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a processor, the steps performed by the server in the foregoing embodiments may be performed.

It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps performed in the embodiments of the methods described above.

The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method of data error correction, comprising:

acquiring a search word input by a user;

2. The method of claim 1, wherein matching the search term with a pre-created dictionary tree to obtain a matching result comprises:

3. The method according to claim 1, wherein the determining a target reference word from a plurality of reference words included in the reference word list according to the feature vector of the search word comprises:

4. The method according to claim 3, wherein the reference word list further includes a word frequency of each of the plurality of reference words, and the using the corresponding reference word with the highest similarity as the target reference word comprises:

and if not, taking the first reference word as a target reference word.

5. The method of claim 1, further comprising:

6. The method according to any one of claims 1 to 5, wherein before the matching the search word with a pre-created dictionary tree to obtain a matching result, the method further comprises:

7. The method of claim 6, wherein obtaining the reference word list comprises:

8. The method of claim 1, further comprising:

9. A data error correction apparatus, comprising:

the data acquisition module is used for acquiring search terms input by a user;

10. A server, comprising a memory and a processor, wherein the memory stores a set of program codes, and the processor calls the program codes stored in the memory to execute the method of any one of claims 1 to 8.