CN113779176A

CN113779176A - Query request completion method and device, electronic equipment and storage medium

Info

Publication number: CN113779176A
Application number: CN202011476378.0A
Authority: CN
Inventors: 邹波; 刘丹; 邱立坤
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2020-12-14
Filing date: 2020-12-14
Publication date: 2021-12-10

Abstract

The disclosure provides a query request completion method and apparatus, an electronic device, and a computer-readable storage medium, and relates to the technical field of retrieval. The query request completion method comprises the following steps: constructing a standardized corpus of query requests; constructing a prefix tree based on the standardized corpus; when a prefix of a query request input by a user is acquired, querying the prefix tree for a node query corpus matching the prefix; and completing the query request based on the node query corpus. With this technical scheme, prefix retrieval based on the prefix tree addresses the high system overhead and overlong request time caused by the Elasticsearch (es) recall plus ranking approach of the related art, which in turn reduces the stuttering that occurs while a user types continuously, improves the completion effect, and improves the user's input experience.

Description

Query request completion method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of retrieval technologies, and in particular, to a method and an apparatus for completing a query request, an electronic device, and a computer-readable storage medium.

Background

Query auto-completion is commonly used in search engines. (A query, or query request, is a message sent to a search engine or a database in order to find a specific file, website, record, or series of records.) Its aim is to predict the complete query while the user is still typing it: candidate queries are ranked by relevance and recommended to the user, which assists the user's input, improves the user experience, and helps prevent misspelled or ambiguously worded queries from being entered. With the rise of commercial dialogue systems, query auto-completion has been introduced there as well.

In the related art, the main process of query auto-completion is as follows: after the prefix input by the user is obtained, a batch of candidate queries related to the user input is recalled from a preset query database by a specific recall algorithm; the candidate queries are then ranked by relevance using a specific ranking algorithm; and finally the several highest-ranked queries are recommended to the user. However, this approach currently has the following defect:

The completion process involves an Elasticsearch (es for short) retrieval over a corpus of tens of millions of entries, followed by ranking and matching of the retrieved results. As a result, the interface stutters while the user types continuously, or the user starts typing the next word before the completion result for the current query has been returned, which degrades both the completion effect and the user's input experience.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The present disclosure is directed to a query request completion method, apparatus, electronic device, and computer-readable storage medium, which can solve, at least to some extent, the problems of high system overhead and long request time caused by a completion method in the related art.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.

According to an aspect of the present disclosure, there is provided a query request completion method, including: constructing a standardized corpus of the query request; constructing a prefix tree based on the standardized corpus; when a prefix of a query request input by a user is acquired, querying a node query corpus matched with the prefix in the prefix tree; and completing the query request based on the node query corpus.

In one embodiment, constructing the standardized corpus of query requests includes: constructing the standardized corpus based on a historical query corpus and a corresponding first consulting volume, and/or pre-stored standard query questions between the user and the robot and a corresponding second consulting volume.

In one embodiment, constructing the standardized corpus based on the historical query corpus and the corresponding first consulting volume, and/or the pre-stored standard query questions between the user and the robot and the corresponding second consulting volume, comprises: performing a screening operation on the historical query corpus based on preset screening conditions, and generating a first corpus based on the screening result and the first consulting volume; acquiring the corresponding standard query questions based on a preset intention classification, and generating a second corpus based on the standard query questions and the second consulting volume; and generating the standardized corpus based on the first corpus and/or the second corpus.

In one embodiment, performing the screening operation on the historical query corpus based on the preset screening conditions, and generating the first corpus based on the screening result and the first consulting volume, includes: selecting dialog logs that relate to query requests and fall within a preset length of time preceding the current moment; deleting stop words from the dialog logs to generate a corpus to be processed; extracting multiple classes of similar query requests from the corpus to be processed based on an edit distance algorithm, and merging the similar query requests within each class to obtain multiple classes of merged query requests; counting the consulting volume of each class of similar query requests as the first consulting volume; and screening the classes of merged query requests based on the relation between the first consulting volume and a consulting volume threshold, and determining the merged query requests that pass the screening as the first corpus.

In one embodiment, the method further comprises: generating an inquiry table based on the merged query requests and the first consulting volume. Generating the second corpus based on the standard query questions and the second consulting volume comprises: performing similarity matching between the standard query questions and the inquiry table to determine the second consulting volume; and generating the second corpus based on the standard query questions and the consulting volume of each standard query question.

In one embodiment, constructing the prefix tree based on the standardized corpus comprises: performing word segmentation on the corpora in the standardized corpus to form word cutting character strings of different layers; determining the number of consultations of each word cutting character string based on the first consulting volume and/or the second consulting volume; and constructing the prefix tree by taking the word cutting character strings as edges and the numbers of consultations of the word cutting character strings as nodes.

In one embodiment, constructing the prefix tree by taking the word cutting character strings as edges and the numbers of consultations of the word cutting character strings as nodes further includes: sorting the nodes generated by the word cutting character strings of each layer according to the numbers of consultations of the word cutting character strings to construct the prefix tree.

In one embodiment, constructing the prefix tree based on the standardized corpus further comprises: acquiring entity information in a specified field; extracting entity characters from the word cutting character strings based on the entity information; and replacing the entity characters with a common generalized character to generate a generalized prefix tree.

In one embodiment, when a prefix of a query request input by a user is acquired, querying the prefix tree for the node query corpus matching the prefix includes: when the prefix is acquired, extracting the entity characters in the prefix by a named entity recognition operation; replacing the entity characters with the generalized character so as to generalize them; and executing the query operation in the prefix tree based on the generalized character and the other characters in the prefix to obtain the corresponding node query corpus.

In an embodiment, when a prefix of a query request input by a user is obtained, querying a node query corpus matching the prefix in the prefix tree further includes: and replacing the generalized characters in the node query corpus with the entity characters so as to complete the query request based on the replaced node query corpus.
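The generalize, query, and restore sequence described in the two preceding embodiments can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: a small entity dictionary stands in for the named entity recognition operation, the placeholder `<ENTITY>` stands in for the generalized character, and a linear scan over a list stands in for the generalized prefix tree. The function names are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

PLACEHOLDER = "<ENTITY>"  # hypothetical generalized character


def generalize(text: str, entities: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Replace the first known entity found in text with the placeholder.

    A dictionary lookup is used here in place of a real NER model;
    longer entity names are tried first so they win over substrings.
    """
    for entity in sorted(entities, key=len, reverse=True):
        if entity in text:
            return text.replace(entity, PLACEHOLDER), entity
    return text, None


def restore(text: str, entity: Optional[str]) -> str:
    """Put the user's original entity back into a generalized sentence."""
    return text if entity is None else text.replace(PLACEHOLDER, entity)


def complete_with_generalization(
    prefix: str, generalized_corpus: List[str], entities: Iterable[str]
) -> List[str]:
    """Generalize the prefix, match it, then restore the entity."""
    general_prefix, entity = generalize(prefix, entities)
    # Linear scan used for brevity; the disclosure queries a prefix tree.
    matches = [s for s in generalized_corpus if s.startswith(general_prefix)]
    return [restore(s, entity) for s in matches]


corpus = [
    "my <ENTITY> screen is broken",
    "my <ENTITY> will not charge",
    "coupon usage restrictions",
]
print(complete_with_generalization("my phone s", corpus, {"phone", "laptop"}))
```

Because one generalized sentence covers every entity of the same kind, the stored corpus stays small while the completion shown to the user still contains the entity the user actually typed.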

In one embodiment, the method further comprises: periodically acquiring an updated dialog log of query requests; generating a standardized update corpus based on the updated dialog log and the standard query questions; generalizing the standardized update corpus to obtain a generalized corpus; and updating the prefix tree based on the generalized corpus.

According to a second aspect of the present disclosure, there is provided a query request completion apparatus, including: the construction module is used for constructing a standardized corpus of the query request; the construction module is used for constructing a prefix tree based on the standardized corpus; the query module is used for querying a node query corpus matched with a prefix in the prefix tree when the prefix of a query request input by a user is acquired; and the completion module is used for completing the query request based on the node query corpus.

According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions for the processor; wherein the processor is configured to perform the query request completion method of any of the above via execution of executable instructions.

According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the query request completion method of any one of the above.

According to the query request completion scheme provided by the embodiments of the disclosure, the prefix tree is constructed from the standardized corpus, so that when the prefix of a query request is received, the node query corpus matching the prefix is looked up in the prefix tree and used as the part of the query still to be completed, thereby completing the query. Constructing the standardized corpus helps standardize the user's input, which in turn helps ensure the reliability of query completion.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.

FIG. 1 is a diagram illustrating a structure of a query completion system according to an embodiment of the present disclosure;

FIG. 2 is a flow chart illustrating a query request completion method according to an embodiment of the present disclosure;

FIG. 3 is a flow diagram illustrating another method for query completion in an embodiment of the present disclosure;

FIG. 4 is a flow chart illustrating a method for completion of a query request according to yet another embodiment of the present disclosure;

FIG. 5 is a block diagram illustrating a prefix tree in an embodiment of the present disclosure;

FIG. 6 illustrates a flow chart of yet another query request completion method of an embodiment of the present disclosure;

FIG. 7 illustrates a flow chart of yet another query request completion method of an embodiment of the present disclosure;

FIG. 8 illustrates a modular flow diagram of a query request completion method of an embodiment of the present disclosure;

FIG. 9 is a schematic diagram illustrating an apparatus for query completion in an embodiment of the present disclosure;

FIG. 10 shows a schematic diagram of an electronic device in an embodiment of the disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The scheme provided by the application constructs the prefix tree from the standardized corpus. When the prefix of a query request is received, the node query corpus matching the prefix is looked up in the prefix tree and used as the part of the query still to be completed, thereby completing the query. Constructing the standardized corpus helps standardize the user's input, which helps ensure the reliability of query completion. Prefix retrieval based on the prefix tree addresses the high system overhead and overlong request time caused by the es recall plus ranking approach of the related art, which reduces stuttering while the user types continuously, improves the completion effect, and improves the user's input experience.

For ease of understanding, the following first explains several terms referred to in this application.

Query: a query is a message sent to a search engine or database in order to find a particular file, web site, record, or series of records in the database. In this disclosure, query refers to a query request.

Prefix: the intermediate text produced while the user is typing a query, usually the first part of the final query.

Elasticsearch (es): a search and data analysis engine.

Named Entity Recognition (NER) is an important basic tool in application fields of information extraction, question-answering systems, syntactic analysis, machine translation and the like, and plays an important role in the process of bringing natural language processing technology into practical use. Generally speaking, the task of named entity recognition is to identify named entities in three major categories (entity category, time category and number category), seven minor categories (person name, organization name, place name, time, date, currency and percentage) in the text to be processed.

Prefix tree: also known as a word-lookup tree or trie, a tree structure that is a variant of the hash tree. Typical applications are counting, sorting and storing large numbers of strings (but not only strings), so it is often used by search engine systems for text word-frequency statistics. Its advantage is that it uses the common prefixes of strings to reduce query time, minimizing unnecessary string comparisons, so its query efficiency is higher than that of a hash tree.

Edit distance algorithm: the Minimum Edit Distance (MED), also called the Levenshtein distance, is the minimum number of edit operations required to change one character string into another. The allowed edit operations are: replacing one character with another (substitution, s), inserting a character (insert, i), and deleting a character (delete, d).
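As an illustration of the definition above, the following is a minimal sketch of the standard dynamic-programming computation of the Levenshtein distance, with unit cost assumed for each of the three operations. The function name is chosen here for illustration and does not come from the disclosure.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the processed part of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string is i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute ca with cb, or keep a match
                )
            )
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.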

The scheme provided by the embodiment of the application relates to technologies such as face recognition and machine learning, and is specifically explained by the following embodiment.

Fig. 1 shows a schematic structural diagram of a query completion system in an embodiment of the present disclosure, which includes a plurality of terminals 120 and a server cluster 140.

The terminal 120 may be a mobile terminal such as a mobile phone, a game console, a tablet Computer, an e-book reader, smart glasses, an MP4(Moving Picture Experts Group Audio Layer IV) player, an intelligent home device, an AR (Augmented Reality) device, a VR (Virtual Reality) device, or a Personal Computer (PC), such as a laptop Computer and a desktop Computer.

Among them, the terminal 120 may have an application installed therein for providing the query request completion.

The terminals 120 are connected to the server cluster 140 through a communication network. Optionally, the communication network is a wired network or a wireless network.

The server cluster 140 is a server, or is composed of a plurality of servers, or is a virtualization platform, or is a cloud computing service center. The server cluster 140 is used to provide background services for providing the query request completion application. Optionally, the server cluster 140 undertakes primary computational work and the terminal 120 undertakes secondary computational work; alternatively, the server cluster 140 undertakes secondary computing work and the terminal 120 undertakes primary computing work; alternatively, the terminal 120 and the server cluster 140 perform cooperative computing by using a distributed computing architecture.

In some alternative embodiments, the server cluster 140 is used to store a query request completion model, and the like.

Optionally, the clients of the applications installed in different terminals 120 are the same, or the clients installed on two terminals 120 are clients of the same type of application for different operating system platforms. Depending on the terminal platform, the specific form of the client of the application program may also differ; for example, it may be a mobile phone client, a PC client, or a World Wide Web (Web) client.

Those skilled in the art will appreciate that the number of terminals 120 described above may be greater or fewer. For example, the number of the terminals may be only one, or several tens or hundreds of the terminals, or more. The number of terminals and the type of the device are not limited in the embodiments of the present application.

Optionally, the system may further include a management device (not shown in fig. 1), and the management device is connected to the server cluster 140 through a communication network. Optionally, the communication network is a wired network or a wireless network.

Optionally, the wireless network or wired network described above uses standard communication techniques and/or protocols. The Network is typically the Internet, but may be any Network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wireline or wireless Network, a private Network, or any combination of virtual private networks. In some embodiments, data exchanged over a network is represented using techniques and/or formats including Hypertext Mark-up Language (HTML), Extensible markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.

Hereinafter, each step in the query request completion method in the present exemplary embodiment will be described in more detail with reference to the drawings and the embodiments.

Fig. 2 is a flowchart illustrating a query completion method in an embodiment of the disclosure. The method provided by the embodiment of the present disclosure may be performed by any electronic device with computing processing capability, for example, the terminal 120 and/or the server cluster 140 in fig. 1. In the following description, the terminal 120 is taken as an execution subject for illustration.

As shown in fig. 2, the terminal 120 executes the query request completion method, which includes the following steps:

Step S202, a standardized corpus of query requests is constructed.

The standardized corpus is a corpus that is easy for the robot handling the query request to understand. Constructing it helps standardize the user's input and reduces the difficulty of the robot's subsequent intention recognition and response.

Step S204, a prefix tree is constructed based on the standardized corpus.

The search path can be optimized based on the prefix tree constructed by the standardized corpus set, and therefore time consumption of query completion is reduced.

Step S206, when the prefix of the query request input by the user is obtained, querying the node query corpus matched with the prefix in the prefix tree.

The query request is then completed based on the result of this prefix tree lookup.

Step S208, the query request is completed based on the node query corpus.

The query request is completed based on the node query corpus, and is specifically displayed in a prefix input box according to a specified mode.

In this embodiment, the prefix tree is constructed from the standardized corpus, so that when the prefix of a query request is received, the node query corpus matching the prefix is looked up in the prefix tree and used as the part of the query still to be completed, thereby completing the query.

Furthermore, prefix retrieval based on the prefix tree addresses the high system overhead and long request time caused by the es recall plus ranking approach of the related art, which reduces stuttering while the user types continuously, improves the completion effect, and improves the user's input experience.
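The flow of steps S202 to S208 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the tree is built per character rather than per segmented word, the class and method names are hypothetical, and the consulting volumes in the usage example are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    children: Dict[str, "Node"] = field(default_factory=dict)
    count: int = 0            # total volume of all sentences passing through this node
    sentence_volume: int = 0  # volume of the sentence ending here (0 if none ends here)


class PrefixTree:
    def __init__(self) -> None:
        self.root = Node()

    def insert(self, sentence: str, volume: int) -> None:
        """Add one standardized sentence together with its consulting volume."""
        node = self.root
        for ch in sentence:
            node = node.children.setdefault(ch, Node())
            node.count += volume
        node.sentence_volume += volume

    def complete(self, prefix: str, limit: int = 3) -> List[str]:
        """Return up to `limit` sentences starting with prefix, highest volume first."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []  # no stored sentence starts with this prefix
            node = node.children[ch]
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.sentence_volume > 0:
                found.append((current.sentence_volume, text))
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        found.sort(key=lambda item: (-item[0], item[1]))
        return [text for _, text in found[:limit]]


tree = PrefixTree()
tree.insert("refund not reached", 120)        # volumes are illustrative only
tree.insert("refund status", 80)
tree.insert("coupon usage restrictions", 50)
print(tree.complete("refund"))
```

The lookup walks one node per typed character and then visits only the subtree below the prefix, so no search over the full corpus is needed while the user is typing.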

In one embodiment, step S202 of constructing the standardized corpus of query requests includes: constructing the standardized corpus based on the historical query corpus and the corresponding first consulting volume, and/or pre-stored standard query questions between the user and the robot and the corresponding second consulting volume.

The historical query corpus represents collected related corpora input by the user, and the first consulting volume represents the query frequency of each historical query corpus.

The standard query questions between the user and the robot are the standard questions that the intelligent customer service system predefines for each user intention category. A standard question is a manually screened sentence that conforms to grammatical norms and is easy to understand; if the user asks such a question directly, the robot can complete intention recognition and response smoothly. The second consulting volume represents the query frequency of each standard query question.

In this embodiment, a standardized corpus is constructed from the historical query corpus and the standard query questions, so that standard queries can be recommended to the user, the user's input is normalized, and the difficulty of the robot's subsequent intention recognition and response is reduced. At the same time, because the data set is built from high-frequency and standardized sentences, it naturally filters out long-tail queries, non-standard expressions, offensive words, and particularly short or particularly long sentences.

For example, in a chat robot a standard query question is a sentence whose intention can be recognized correctly. In a chat robot built on a text classification system, one category may be "price guarantee", and the corresponding standard question is "I want to apply for price guarantee".

In one embodiment, constructing a standardized corpus of query requests based on historical query corpuses and corresponding first consulting volumes, and/or pre-stored standard query sentences between a user and a robot and corresponding second consulting volumes comprises:

and performing screening operation on the historical query corpus based on a preset screening condition, and generating a first corpus set based on a screening result and a first consulting amount.

The preset screening conditions relate to time and consulting volume, for example: dialog logs from the past week or month, dialog logs whose consulting volume is greater than a certain threshold, or dialog logs whose consulting volume ranks among the top few.

Acquiring corresponding standard query sentences based on preset intention classification, and generating a second corpus based on the standard query sentences and a second consulting volume; a normalized corpus is generated based on the first corpus and/or the second corpus.

The historical query corpus comprises sentences similar to the standard query question, and the consulting amount of the standard query question is counted based on the consulting amount of the similar sentences.

As shown in FIG. 3, in one embodiment, a specific implementation of performing the screening operation on the historical query corpus and generating the first corpus, acquiring the standard query questions and generating the second corpus, and generating the standardized corpus based on the first corpus and/or the second corpus comprises:

Step S302, dialog logs related to query requests within a preset length of time preceding the current moment are selected.

Step S304, stop words in the dialog logs are deleted to generate a corpus to be processed.

Step S306, multiple classes of similar query requests are extracted from the corpus to be processed based on the edit distance algorithm, and the similar query requests in each class are merged to obtain multiple classes of merged query requests.

Step S308, the consulting volume of each class of similar query requests is counted as the first consulting volume.

Step S310, the classes of merged query requests are screened based on the relation between the first consulting volume and the consulting volume threshold, and the merged query requests that pass the screening are determined as the first corpus.

Step S312, an inquiry table is generated based on the merged query requests and the first consulting volume.

The historical query corpus is specifically a historical high-frequency corpus. Under normal conditions the robot can understand and answer high-frequency questions well, so high-frequency corpora are preferred when preparing data. Specifically, the high-frequency corpus can be found as follows:

The dialog system logs of the last week are selected, and all queries are organized by day.

Stop words are removed from all queries, such as: where | the | those | where | there.

Using the edit distance algorithm, queries whose similarity exceeds a threshold of 0.85 are treated as equal; similar queries are found and merged, and their consulting volume is calculated. This step yields query_competition_map, an inquiry table in which the key is a processed similar query and the value is the first consulting volume of that query.

The entries in query_competition_map whose consulting volume ranks in the top 50% are selected as candidates; all candidates are then added together, duplicates are removed, and the result is taken as the screening result to generate the first corpus.
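The procedure above (stop-word removal, merging of near-duplicate queries, and selection of the top half by consulting volume) can be sketched as follows. Two points are assumptions, since the text does not spell them out: the 0.85 threshold is applied to a normalized similarity computed as 1 minus the edit distance divided by the length of the longer string, and queries are split on whitespace for stop-word removal. The function names and the sample log lines are hypothetical.

```python
from typing import Dict, Iterable, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def build_inquiry_table(
    queries: Iterable[str], stop_words: Set[str], threshold: float = 0.85
) -> Dict[str, int]:
    """Map each merged query to the number of times it (or a near-duplicate) was asked."""
    table: Dict[str, int] = {}
    for raw in queries:
        query = " ".join(w for w in raw.split() if w not in stop_words)
        if not query:
            continue
        for key in table:
            if similarity(query, key) > threshold:
                table[key] += 1  # merge into the existing similar query
                break
        else:
            table[query] = 1     # no similar query seen yet
    return table


def top_half(table: Dict[str, int]) -> Dict[str, int]:
    """Keep the entries whose consulting volume ranks in the top 50%."""
    ranked = sorted(table.items(), key=lambda kv: -kv[1])
    return dict(ranked[: max(1, len(ranked) // 2)])


logs = [
    "where is my refund",
    "where is my refund",
    "where is my refunds",
    "coupon usage restrictions",
    "the price guarantee",
]
table = build_inquiry_table(logs, stop_words={"the"})
print(table)
print(top_half(table))
```

Each incoming query is compared with every key already in the table, which is quadratic in the number of distinct queries. That is acceptable for an offline job in a sketch like this, though a production system would likely bucket or index the queries first.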

Step S314, similarity matching is performed between the standard query questions and the inquiry table to determine the second consulting volume.

Step S316, the second corpus is generated based on the standard query questions and the consulting volume of each standard query question.

Step S318, the first corpus and the second corpus are merged and deduplicated to generate the standardized corpus.

Specifically, the intelligent customer service system will typically predefine the user's intent categories and give each category a standard query sentence. These standard query sentences are sentences that have been manually screened, conform to grammatical specifications, and are easy to understand. If the user directly asks the questions, the robot can smoothly finish the intention identification and response process.

These standard query questions can be found directly in the robot database, for example: "refund not reached", "do a white charge", "coupon usage restrictions". About 2000 standard query questions can be obtained in this way.

Consulting volume data is then assigned to each standard query question, for use after the prefix tree is built. Each standard query question is matched in turn for similarity against every element of the query_competition_map obtained in the first step; as before, two sentences are considered equal when their edit distance similarity is greater than 0.85. This amounts to using the real consultation behaviour in the online data to construct consulting volume information for the offline standard query questions, which yields the second corpus.

The first corpus and the second corpus are added together and deduplicated to obtain the standardized data set.
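A minimal sketch of steps S314 to S318 follows. Three points are assumptions not stated in the text: the second consulting volume of a standard question is taken as the sum of the volumes of all inquiry-table entries judged similar to it; when the same sentence appears in both corpora, deduplication keeps the larger volume; and the similarity test is passed in as a predicate. The usage example uses a simple containment check as a placeholder for the edit-distance test with the 0.85 threshold. All names are hypothetical.

```python
from typing import Callable, Dict, Iterable


def build_second_corpus(
    standard_questions: Iterable[str],
    inquiry_table: Dict[str, int],
    is_similar: Callable[[str, str], bool],
) -> Dict[str, int]:
    """Give each standard question the summed volume of its similar logged queries."""
    corpus: Dict[str, int] = {}
    for question in standard_questions:
        corpus[question] = sum(
            volume
            for query, volume in inquiry_table.items()
            if is_similar(question, query)
        )
    return corpus


def merge_corpora(first: Dict[str, int], second: Dict[str, int]) -> Dict[str, int]:
    """Union of both corpora; a sentence present in both keeps the larger volume."""
    merged = dict(first)
    for sentence, volume in second.items():
        merged[sentence] = max(volume, merged.get(sentence, 0))
    return merged


def contains(a: str, b: str) -> bool:
    """Placeholder predicate for the example, not the edit-distance test."""
    return a in b or b in a


inquiry_table = {
    "refund not reached": 40,
    "refund not reached yet": 10,
    "coupon usage restrictions": 5,
}
second = build_second_corpus(
    ["refund not reached", "price guarantee"], inquiry_table, contains
)
first = {"refund not reached": 40}
print(merge_corpora(first, second))
```

A standard question with no similar logged query receives a volume of 0 here; whether such questions are kept in the standardized corpus is a design choice the text leaves open.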

In this embodiment, because the standardized corpus is generated from high-frequency and standardized corpora, irregular content such as long-tail queries, non-standard expressions, offensive words, and particularly short or particularly long sentences is filtered out of the data set. The standardized corpus can therefore be used to recommend standard queries to the user, and the user enters a standard query by accepting a completion, which normalizes the content of the user's input and reduces the difficulty of the robot's subsequent intention recognition and response.

As shown in fig. 4, step S204 of constructing the prefix tree based on the standardized corpus includes:

Step S402, performing word segmentation processing on the corpus in the standardized corpus to form word cutting character strings of different layers.

Step S404, determining the number of consultations of the word cutting character string based on the first consulting volume and/or the second consulting volume.

Step S406, constructing a prefix tree by taking the word cutting character string as an edge and the number of consultations of the word cutting character string as a node.

Step S408, for the nodes generated by the word cutting character strings of each layer, sorting the nodes according to the number of consultations of the word cutting character strings to generate the prefix tree.

Specifically, word cutting is performed on all corpora in the standardized corpus, and the number of consultations of each word cutting character string is counted. For example, for "when will my things be shipped", the number of consultations cot_1 of the query can be looked up in the query_containment_map described in 3.2.2; for "what i bought yesterday is today priced down", the number of consultations is cot_2; for "my mobile phone screen is broken", the number of consultations is cot_3. As shown in table 1, the word cutting character string "i" then has cot_1 + cot_2 + cot_3 consultations, the word cutting character string "my" has cot_1 + cot_3 consultations, and the word cutting character string "i just" has cot_2 consultations.

TABLE 1

All the word cutting character strings are used as the edges of the prefix tree, and the number of consultations of each word cutting character string is recorded in the corresponding child node. The child nodes of each layer are arranged from left to right in descending order of the number of consultations, so the more consultations a node has, the further left it is placed. Each leaf node records the number of times the corresponding sentence of standard corpus appears in the corpus. The prefix tree built by this process is shown in fig. 5.
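
A minimal sketch of such a tree is given below. It assumes whitespace tokenization in place of the word cutting step, and the names `Node`, `build_prefix_tree` and `sort_children` are hypothetical. Each edge label is the cumulative word cutting character string of its layer, each node accumulates the consultation counts of every sentence passing through it, and the children of each node are ordered by descending count.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    count: int = 0                                # consultations passing through this node
    children: dict = field(default_factory=dict)  # edge label (prefix string) -> Node


def build_prefix_tree(corpus):
    """corpus maps each standardized sentence to its number of consultations."""
    root = Node()
    for sentence, count in corpus.items():
        tokens = sentence.split()  # stand-in for the word cutting step (assumption)
        node = root
        for depth in range(1, len(tokens) + 1):
            edge = " ".join(tokens[:depth])  # word cutting string of this layer
            node = node.children.setdefault(edge, Node())
            node.count += count
    sort_children(root)
    return root


def sort_children(node):
    """Order the children of every node by descending consultation count."""
    node.children = dict(
        sorted(node.children.items(), key=lambda kv: kv[1].count, reverse=True)
    )
    for child in node.children.values():
        sort_children(child)
```

With this layout the most consulted continuation of any prefix is always the first child of the matching node, so a lookup does not need a separate ranking pass.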

In this embodiment, constructing the prefix tree allows every prefix retrieval to be served by the tree. Because of the search efficiency of the prefix tree, the problems of high system overhead and overlong request time caused by the traditional two-step es recall + sorting approach can be well solved.

As shown in fig. 6, in one embodiment, constructing the prefix tree based on the standardized corpus comprises:

Step S602, acquiring entity information in the specified field.

Step S604, extracting entity characters in the word cutting character string based on the entity information.

Step S606, the same generalized character is used to replace the entity character to generate a generalized prefix tree.

Corpus generalization is implemented with NER technology, which further reduces the size of the whole corpus while allowing more user questions to be recognized.

The generalization processing procedure will be specifically described below with the e-commerce field as a designated field.

NER technology for the e-commerce field can identify commodities in a query. For example, "computer" and "mobile phone" are recognized as commodity entities and marked as [PRODSORT], which is the generalized character.

The user question "why has the mobile phone I bought not arrived yet" is generalized to "why has the [PRODSORT] I bought not arrived yet"; the user question "my computer needs to return" is generalized to "my [PRODSORT] needs to return" and saved in the data set.

Correspondingly, an NER module for commodity name recognition is added to the online flow. When a user actually asks "the mobile phone I bought", it is generalized to "the [PRODSORT] I bought" before retrieval is performed.

The retrieval result is displayed to the user after the corresponding replacement. Here [PRODSORT] is replaced by "goods", so that "why have the goods I bought not arrived yet" is finally returned.

In this embodiment, generalizing the corpus effectively reduces the probability that the retrieval result is empty because of mismatched entities, and the information the user wants can be provided through flexible matching.
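
The generalization and the later replacement can be illustrated with a small sketch. A fixed product lexicon stands in for the NER model here, which is an assumption made only to keep the example self-contained; `PRODUCT_TERMS`, `generalize` and `restore` are hypothetical names. Only the [PRODSORT] marker and the fallback word "goods" come from the description.

```python
import re

PLACEHOLDER = "[PRODSORT]"
# Hypothetical product lexicon standing in for a trained NER model.
PRODUCT_TERMS = ("mobile phone", "computer")
# Longer terms first so that multi-word products win over shorter overlaps.
_PRODUCT_RE = re.compile(
    "|".join(re.escape(t) for t in sorted(PRODUCT_TERMS, key=len, reverse=True))
)


def generalize(text):
    """Return the generalized text and the entities found, in order of appearance."""
    entities = _PRODUCT_RE.findall(text)
    return _PRODUCT_RE.sub(PLACEHOLDER, text), entities


def restore(text, entities, default="goods"):
    """Put entities back in order; use a generic word when none are left."""
    remaining = iter(entities)
    return re.sub(re.escape(PLACEHOLDER), lambda _m: next(remaining, default), text)
```

For example, `generalize("my computer needs to return")` yields the generalized sentence together with the list `["computer"]`, and `restore` either reinserts that entity or falls back to "goods" when no entity was recognized in the user's input.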

As shown in fig. 7, in one embodiment, in step S206, when a prefix of a query request input by a user is obtained, querying a node query corpus matching the prefix in the prefix tree includes:

Step S702, when the prefix is obtained, extracting entity characters in the prefix based on a named entity recognition operation.

Step S704, replacing the entity characters with the generalized character, so as to generalize the entity characters.

Step S706, based on the generalized character and the other characters in the prefix, executing a query operation in the prefix tree to obtain the corresponding node query corpus.

Step S708, replacing generalized characters in the node query corpus with entity characters to complete the query request based on the replaced node query corpus.

Specifically, when a user who has come online for consultation inputs a prefix, the query automatic completion service is triggered, as follows:

Entity recognition is first performed on the prefix to obtain the entity characters. For example, when the user inputs "my mobile phone", it is generalized to "my [PRODSORT]", where [PRODSORT] is the generalized character.

Based on the generalization result, the prefix tree is searched directly to find the child node with the largest consulting volume; the edge of that child node is the query to be returned.

To avoid returning recommended queries that are all similar sentences, a small trick is used in the actual search. As shown in fig. 6, when the user inputs "you", directly taking the TOP3 child nodes with the largest consulting volume is likely to return nodes that all lie under the "your" node. To make the returned information more diverse, the child node with the largest consulting volume is instead selected under each of the three nodes "your", "hello" and "you is".

The retrieved data is returned directly to the user; if an entity recognition result exists, the corresponding replacement is made before the data is returned.
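
The diversified lookup described in these steps can be sketched as follows. The tree is represented here as nested dictionaries whose edge labels are cumulative, space-separated prefix strings; this layout and the names `locate`, `best_completion` and `complete` are assumptions for illustration only. Instead of taking the globally most consulted sentences, the sketch takes the top k direct children of the matched node and follows the most consulted path under each of them.

```python
def locate(children, prefix):
    """Return the child table of the node whose edge equals `prefix`, or None."""
    for edge, node in children.items():
        if edge == prefix:
            return node["children"]
        if prefix.startswith(edge + " "):
            return locate(node["children"], prefix)
    return None


def best_completion(edge, node):
    """Follow the most consulted child at every layer down to a leaf."""
    while node["children"]:
        edge, node = max(node["children"].items(), key=lambda kv: kv[1]["count"])
    return edge


def complete(tree, prefix, k=3):
    """One suggestion per top-k direct child of the node matching `prefix`."""
    children = locate(tree, prefix)
    if not children:
        return []
    top = sorted(children.items(), key=lambda kv: kv[1]["count"], reverse=True)[:k]
    return [best_completion(edge, node) for edge, node in top]
```

Because each suggestion comes from a different branch, two sentences under the same heavily consulted branch cannot crowd out the other branches. A node that both ends a sentence and has children is always extended to a leaf in this sketch, which a production system would likely handle differently.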

In this embodiment, query automatic completion based on the prefix tree differs from the traditional two-step recall and sorting flow: the result can be returned with a single query of the prefix tree, which greatly simplifies the previous online processing flow.

Due to the natural advantages of the tree model in querying, the overall query time can be controlled to the millisecond level. In practical experiments, the time of a single query over data on the order of hundreds of thousands of entries can be kept to about 2 ms.

In one embodiment, an updating dialog log of the query request is obtained periodically; generating a standardized updating corpus based on the updating dialog log and the standard query question; generalizing the standardized updated corpus to obtain a generalized corpus; the prefix tree is updated based on the generalized corpus.

The traditional es recall and sorting strategy is limited by the huge es corpus and a long corpus preparation process that requires manual intervention, so automatic updating of the corpus is difficult to achieve.

The method comprises corpus standardization, corpus generalization and prefix tree construction. Because the whole process is lightweight, the prefix tree can be updated automatically. The specific automatic updating process is as follows:

The dialog system logs are acquired periodically. Generally, all user queries of the day are acquired automatically in the early morning of each day.

The data standardization process generally uses data from the most recent 10 days; in practice, data from a longer period can also be used. First, query_compensation_map is calculated, where the consulting volume is

$$\text{cot\_all} = a \cdot \text{cot\_today} + b \cdot \text{cot\_yesterday} + \dots + i \cdot \text{cot\_last\_9day} + j \cdot \text{cot\_last\_10day}$$

with the weights a to j set in turn to a = 1.0, b = 0.9, c = 0.8, d = 0.7, e = 0.6, f = 0.5, g = 0.4, h = 0.3, i = 0.2 and j = 0.1, so that the closer the data is to the current date, the larger its weight in the consulting volume. Consulting volume data calculated in this way not only reflects the questions users are currently asking most, but also retains the consulting information of a week to a certain extent.
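
The recency weighting can be sketched in a few lines. The weights are the ones listed in the text; the function name and the input layout (a list of per-day count maps, most recent day first) are assumptions made for the example.

```python
# Weights a..j from the description, most recent day first.
WEIGHTS = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]


def weighted_consult_counts(daily_counts):
    """daily_counts[0] is the most recent day's {query: count}, then the day before, etc."""
    totals = {}
    # zip stops at the shorter input, so days beyond the tenth are ignored.
    for weight, day in zip(WEIGHTS, daily_counts):
        for query, count in day.items():
            totals[query] = totals.get(query, 0.0) + weight * count
    return totals
```

A query asked 10 times on the most recent day and 5 times the day before therefore scores 10 × 1.0 + 5 × 0.9 = 14.5, while the same 15 consultations spread over older days would score lower.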

The data generalization process follows the generalization process described above. Currently, processing fewer than one hundred thousand records takes about 1 hour in total.

The prefix tree construction process follows the construction process described above. Construction is fast: a prefix tree over data on the order of hundreds of thousands of entries can be built within a few minutes.

In this embodiment, with automatic data acquisition and prefix tree construction, the whole process can be completed within 2 hours. The prefix tree can therefore be automatically updated and optimized entirely within the early-morning hours when no users come online for consultation, so that the underlying data is always up to date. This in turn ensures that enough recommendable queries remain available as the chat robot service changes, which sustains the user click rate on the recommended queries.

As shown in FIG. 8, the entire query completion process can be divided into online and offline portions.

Wherein, the online part comprises a step S802 of obtaining a prefix input by a user; step S804, prefix entity identification; step S806, searching prefix tree; and step S808, returning to query.

The offline part comprises step S810 of constructing a standardized corpus; step S812, corpus generalization; step S814, constructing a prefix tree; and step S816, periodically acquiring a log. Each time the log is acquired in step S816, steps S810 to S814 are performed again to update the prefix tree.

It is to be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the method according to an exemplary embodiment of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."

The query request completion apparatus 900 according to this embodiment of the present invention is described below with reference to fig. 9. The query request completion apparatus 900 shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of the embodiments of the present invention.

The query request completion apparatus 900 is represented in the form of a hardware module. The components of the query request completion apparatus 900 may include, but are not limited to: a constructing module 902, configured to construct a standardized corpus of query requests; a building module 904, configured to construct a prefix tree based on the standardized corpus; a query module 906, configured to query, when a prefix of a query request input by a user is obtained, a node query corpus matched with the prefix in the prefix tree; a completion module 908 configured to complete the query request based on the node query corpus.

In one embodiment, the constructing module 902 is further configured to: and constructing the standardized corpus based on the historical query corpus and the corresponding first consulting volume and/or the pre-stored standard query sentence between the user and the robot and the corresponding second consulting volume.

In one embodiment, the constructing module 902 is further configured to: performing screening operation on the historical query corpus based on preset screening conditions, and generating a first corpus set based on screening results and the first consulting volume; acquiring the corresponding standard query question based on preset intention classification, and generating a second corpus based on the standard query question and the second consulting volume; and generating the standardized corpus based on the first corpus and/or the second corpus.

In one embodiment, the constructing module 902 is further configured to: select dialog logs that are related to the query request and fall within a preset time length before the current moment; delete stop words in the dialog logs to generate a corpus to be processed; extract multiple classes of similar query requests in the corpus to be processed based on an edit distance algorithm, and merge the similar query requests of each class to obtain multiple classes of merged query requests; count the consulting volume of each class of similar query requests as the first consulting volume; and screen the multiple classes of merged query requests based on the relation between the first consulting volume and a consulting volume threshold, and determine the screened merged query requests as the first corpus.
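
The screening pipeline in this embodiment can be sketched as follows. The stop word list, the greedy clustering strategy, the normalized similarity formula, and both thresholds are assumptions chosen for illustration; the description only states that similar queries are merged with an edit distance algorithm and that merged queries are screened against a consulting volume threshold.

```python
from collections import Counter

STOP_WORDS = {"please", "um"}  # hypothetical stop word list


def strip_stop_words(query):
    return " ".join(w for w in query.lower().split() if w not in STOP_WORDS)


def similarity(a, b):
    """1 - Levenshtein distance / longer length (assumed normalization)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - prev[-1] / longest


def build_first_corpus(log_queries, sim_threshold=0.85, min_count=2):
    """Merge near-duplicate queries, then keep clusters with enough consultations."""
    cleaned = Counter(strip_stop_words(q) for q in log_queries)
    clusters = {}  # representative query -> merged consulting volume
    # Visit frequent spellings first so each cluster keeps its most common form.
    for query, count in cleaned.most_common():
        for rep in clusters:
            if similarity(rep, query) > sim_threshold:
                clusters[rep] += count
                break
        else:
            clusters[query] = count
    return {q: c for q, c in clusters.items() if c >= min_count}
```

The returned mapping corresponds to the first corpus together with its first consulting volume; the unfiltered `clusters` mapping would play the role of the inquiry table used when matching the standard query questions.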

In one embodiment, the constructing module 902 is further configured to: generating an inquiry table based on the merged query request and the first consulting volume; the generating a second corpus based on the standard query question and the second consulting volume comprises: performing similarity matching on the standard query question and the inquiry table to determine the second consulting volume; and generating the second corpus based on the standard query question and the consulting volume of the standard query question.

In one embodiment, the building module 904 is further configured to: performing word segmentation on the linguistic data in the standardized corpus set to form word segmentation character strings of different layers; determining the number of consultations of the word cutting character string based on the first consultations amount and/or the second consultations amount; and constructing the prefix tree by taking the word cutting character string as an edge and the consultation times of the word cutting character string as nodes.

In one embodiment, the building module 904 is further configured to: and sequencing the nodes generated by the word cutting character strings of each layer according to the consultation times of the word cutting character strings to construct the prefix tree.

In one embodiment, the building module 904 is further configured to: acquiring entity information in a specified field; extracting entity characters in the word cutting character string based on the entity information; replacing the entity character with the same generalized character to generate the generalized prefix tree.

In one embodiment, the query module 906 is further configured to: when the prefix is obtained, extract the entity character in the prefix based on a named entity recognition operation; replace the entity character with the generalized character to generalize the entity character; and execute a query operation in the prefix tree based on the generalized character and the other characters in the prefix so as to obtain the corresponding node query corpus.

In one embodiment, the query module 906 is further for: and replacing the generalized characters in the node query corpus with the entity characters so as to complete the query request based on the replaced node query corpus.

In one embodiment, the system further comprises an update module 910, and the update module 910 is configured to: regularly acquiring an update dialog log of the query request; generating a standardized update corpus based on the update dialog log and the standard query question; generalizing the standardized updating corpus to obtain a generalized corpus; updating the prefix tree based on the generalized corpus.

An electronic device 1000 according to this embodiment of the invention is described below with reference to fig. 10. The electronic device 1000 shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 10, the electronic device 1000 is embodied in the form of a general purpose computing device. The components of the electronic device 1000 may include, but are not limited to: the at least one processing unit 1010, the at least one memory unit 1020, and a bus 1030 that couples various system components including the memory unit 1020 and the processing unit 1010.

The storage unit stores program code that may be executed by the processing unit 1010 to cause the processing unit 1010 to perform the steps according to various exemplary embodiments of the present invention described in the "exemplary methods" section above in this specification. For example, the processing unit 1010 may perform steps S202, S204 to S208 as shown in fig. 2, and other steps defined in the query request completion method of the present disclosure.

The storage unit 1020 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 10201 and/or a cache memory unit 10202, and may further include a read-only memory unit (ROM) 10203.

The memory unit 1020 may also include a program/utility 10204 having a set (at least one) of program modules 10205, such program modules 10205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 1030 may be any one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, and a local bus using any of a variety of bus architectures.

The electronic device 1000 may also communicate with one or more external devices 1060 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1000 to communicate with one or more other computing devices. Such communication may occur through input/output (I/O) interfaces 1050. Also, the electronic device 1000 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 1050. As shown, the network adapter 1050 communicates with the other modules of the electronic device 1000 via a bus 1030. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of the present description, when the program product is run on the terminal device.

The program product for implementing the above method may take the form of a portable compact disc read only memory (CD-ROM) that includes program code, and may be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited in this regard; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims

1. A method for completing a query request, comprising:

constructing a standardized corpus of the query request;

constructing a prefix tree based on the standardized corpus;

when a prefix of the query request input by a user is obtained, querying a node query corpus matched with the prefix in the prefix tree;

and completing the query request based on the node query corpus.

2. The method according to claim 1, wherein constructing the standardized corpus of the query request comprises:

and constructing the standardized corpus based on the historical query corpus and the corresponding first consulting volume and/or the pre-stored standard query sentence between the user and the robot and the corresponding second consulting volume.

3. The method according to claim 2, wherein the constructing the standardized corpus based on the historical query corpus and the corresponding first consulting volume and/or the pre-stored standard query question between the user and the robot and the corresponding second consulting volume comprises:

performing screening operation on the historical query corpus based on preset screening conditions, and generating a first corpus set based on screening results and the first consulting volume;

acquiring the corresponding standard query question based on preset intention classification, and generating a second corpus based on the standard query question and the second consulting volume;

and generating the standardized corpus based on the first corpus and/or the second corpus.

4. The method according to claim 3, wherein the performing a screening operation on the historical query corpus based on preset screening conditions, and the generating a first corpus based on the screening results and the first consulting volume comprises:

selecting dialog logs that are related to the query request and fall within a preset time length before the current moment;

deleting stop words in the dialog log to generate a corpus set to be processed;

extracting multiple classes of similar query requests in the corpus to be processed based on an edit distance algorithm, and merging the similar query requests by each class to obtain multiple classes of merged query requests;

counting the consulting quantity of each type of similar query request to serve as the first consulting quantity;

and screening the multi-class merged query requests based on the relation between the first consulting quantity and the consulting quantity threshold value, and determining the screened merged query requests as the first corpus.

5. The query request completion method according to claim 4, further comprising:

generating an inquiry table based on the merged query request and the first consulting volume;

the generating a second corpus based on the standard query question and the second consulting volume comprises:

performing similarity matching on the standard query question and the inquiry table to determine the second consulting volume;

and generating the second corpus based on the standard query question and the consulting amount of the standard query question.

6. The query request completion method according to any one of claims 2 to 5, wherein said constructing a prefix tree based on said standardized corpus comprises:

performing word segmentation on the linguistic data in the standardized corpus set to form word segmentation character strings of different layers;

determining the number of consultations of the word cutting character string based on the first consultations amount and/or the second consultations amount;

and constructing the prefix tree by taking the word cutting character string as an edge and the consultation times of the word cutting character string as nodes.

7. The method according to claim 6, wherein the constructing the prefix tree using the word-cutting character string as an edge and the number of consulting times of the word-cutting character string as a node further comprises:

and sequencing the nodes generated by the word cutting character strings of each layer according to the consultation times of the word cutting character strings to construct the prefix tree.

8. The query request completion method according to claim 6, wherein said constructing a prefix tree based on said standardized corpus comprises:

acquiring entity information in a specified field;

extracting entity characters in the word cutting character string based on the entity information;

replacing the entity character with the same generalized character to generate the generalized prefix tree.

9. The method according to claim 8, wherein the querying the node query corpus matching the prefix in the prefix tree when the prefix of the query request input by the user is obtained comprises:

when the prefix is obtained, extracting the entity character in the prefix based on a named entity recognition operation;

replacing the entity character with the generalization character to perform generalization processing on the entity character;

and executing query operation in the prefix tree based on the generalized characters and other characters in the prefix so as to obtain the corresponding node query corpus.

10. The method according to claim 9, wherein when a prefix of a query request input by a user is obtained, querying a node query corpus matching the prefix in the prefix tree further comprises:

and replacing the generalized characters in the node query corpus with the entity characters so as to complete the query request based on the replaced node query corpus.

11. The query request completion method according to any one of claims 2 to 5, further comprising:

regularly acquiring an update dialog log of the query request;

generating a standardized update corpus based on the update dialog log and the standard query question;

generalizing the standardized updating corpus to obtain a generalized corpus;

updating the prefix tree based on the generalized corpus.

12. An apparatus for query completion, comprising:

the constructing module is used for constructing a standardized corpus of the query request;

the building module is used for constructing a prefix tree based on the standardized corpus;

the query module is used for querying a node query corpus matched with a prefix in the prefix tree when the prefix of the query request input by a user is acquired;

and the completion module is used for completing the query request based on the node query corpus.

13. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the query request completion method of any one of claims 1-11 via execution of the executable instructions.

14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the query request completion method according to any one of claims 1 to 11.