CN111858899A - Statement processing method, device, system and medium - Google Patents


Info

Publication number
CN111858899A
CN111858899A (application CN202010764814.8A)
Authority
CN
China
Prior art keywords
question
statement
determining
sentence
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010764814.8A
Other languages
Chinese (zh)
Other versions
CN111858899B (en)
Inventor
范晓东
张文慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
ICBC Technology Co Ltd
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
ICBC Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC, ICBC Technology Co Ltd filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202010764814.8A priority Critical patent/CN111858899B/en
Publication of CN111858899A publication Critical patent/CN111858899A/en
Application granted granted Critical
Publication of CN111858899B publication Critical patent/CN111858899B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3337Translation of the query language, e.g. Chinese to English
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Abstract

The present disclosure provides a sentence processing method, including: acquiring a query sentence; determining a question category for the query sentence from a pre-constructed question library, where the question library includes question sentences of p categories; determining candidate matter sentences for the query sentence from a pre-constructed matter library, where the matter library includes q matter sentences; and determining a reply sentence for the query sentence according to the question category and the candidate matter sentences. The present disclosure also provides a sentence processing apparatus, a computer system, and a computer-readable storage medium. The method and apparatus provided by the disclosure can be used in the field of artificial intelligence, the field of big data, and other fields.

Description

Statement processing method, device, system and medium
Technical Field
The present disclosure relates to the field of intelligent question answering technologies, and in particular, to a sentence processing method, apparatus, system, and medium.
Background
With the rapid development of artificial intelligence technology, the iterative updating of learning algorithms, and the accumulation of massive question-answering knowledge data, intelligent question-answering technology has advanced rapidly in many fields.
In implementing the disclosed concept, the inventors found at least the following problems in the related art: question-answering technology is mainly realized by hand-crafting a large number of expert rules or by pre-training a multi-layer deep neural network, where the expert rules specify the correspondence between questions and answers. In specialized fields such as government affairs, informatization has progressed slowly, so a large corpus of question-answer pairs cannot be accumulated, which hinders both rule making and algorithm training. Furthermore, since a user's query sentence is usually highly colloquial, an answer satisfactory to the user cannot be matched accurately.
Disclosure of Invention
In view of the above, the present disclosure provides a sentence processing method, apparatus, system, and medium for improving accuracy of a reply sentence obtained by matching.
One aspect of the present disclosure provides a sentence processing method, including: acquiring a query sentence; determining a question category for the query sentence from a pre-constructed question library, where the question library includes question sentences of p categories; determining candidate matter sentences for the query sentence from a pre-constructed matter library, where the matter library includes q matter sentences; and determining a reply sentence for the query sentence according to the question category and the candidate matter sentences, where p and q are integers greater than or equal to 2.
According to an embodiment of the present disclosure, determining the question category for the query sentence from the pre-constructed question library includes: inputting the query sentence into a pre-trained classification model to determine a candidate question category for the query sentence and the probability value of the query sentence for that candidate category; and, when the probability value is greater than or equal to a preset probability value, determining the candidate question category as the question category for the query sentence, where the classification model is trained on the question sentences of the p categories.
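The thresholded acceptance step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `classify` is a hypothetical stand-in for the pre-trained classification model, and the category names and scores are invented for demonstration.

```python
# Hypothetical sketch: accept the top predicted question category only when
# its probability clears the preset probability value; otherwise report no match.

def classify(query: str) -> tuple[str, float]:
    # Stand-in for the real classification model:
    # returns (candidate question category, probability value).
    if "materials" in query:
        return ("transact_materials", 0.92)
    return ("transact_flow", 0.41)

def determine_question_category(query: str, preset_probability: float = 0.6):
    category, prob = classify(query)
    if prob >= preset_probability:
        return category
    return None  # probability too low: no reliable question category

print(determine_question_category("what materials do I need?"))  # transact_materials
print(determine_question_category("hello"))                      # None
```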
According to an embodiment of the present disclosure, the sentence processing method further includes determining the preset probability value from the question sentences of the p categories, by: obtaining m training samples and n test samples from the question sentences of the p categories, where the m training samples are used to train a classification model to obtain the pre-trained classification model; inputting the n test samples into the pre-trained classification model and determining the probability value of each test sample for its candidate question category, obtaining n probability values; and setting the preset probability value to the average of the n probability values, where m and n are integers greater than or equal to 2.
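The averaging step can be sketched as below. `predicted_probabilities` is a hypothetical stand-in for running the n test samples through the trained classifier; the fixed scores are illustrative only.

```python
# Sketch of deriving the preset probability value: run the n test samples
# through the trained classifier and average the probabilities it assigns
# to each sample's candidate category.

def predicted_probabilities(test_samples):
    # Stand-in for model inference; fixed scores for illustration.
    fake_scores = {"q1": 0.9, "q2": 0.7, "q3": 0.8}
    return [fake_scores[s] for s in test_samples]

def preset_probability(test_samples):
    probs = predicted_probabilities(test_samples)
    return sum(probs) / len(probs)  # mean of the n probability values

print(round(preset_probability(["q1", "q2", "q3"]), 4))
```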
According to an embodiment of the present disclosure, obtaining the m training samples and n test samples includes: obtaining r associated question sentences for the question sentences of the p categories by at least one of: replacing words in the question sentences of the p categories according to a synonym library; replacing the matter sentences contained in the question sentences according to the matter sentences in the matter library; and back-translating the question sentences of the p categories; assigning each of the r associated question sentences to the category of the question sentence it is associated with, obtaining enhanced question sentences of the p categories; and dividing the enhanced question sentences of the p categories into the m training samples and the n test samples, where r is an integer greater than or equal to 1.
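One of the augmentation routes named above, synonym replacement, can be sketched as follows. The synonym lexicon here is an invented toy example; back-translation and matter-sentence swapping would generate associated question sentences by the same pattern.

```python
# Sketch of synonym-replacement augmentation: for each word with entries in
# a synonym lexicon, emit one associated question sentence per synonym.

SYNONYMS = {"materials": ["documents", "papers"], "need": ["require"]}

def synonym_variants(sentence: str) -> list[str]:
    variants = []
    words = sentence.split()
    for i, w in enumerate(words):
        for syn in SYNONYMS.get(w, []):
            variants.append(" ".join(words[:i] + [syn] + words[i + 1:]))
    return variants

print(synonym_variants("what materials do I need"))
```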
According to an embodiment of the present disclosure, determining the candidate matter sentences for the query sentence includes: determining a word vector for the query sentence as a first word vector; acquiring a word vector for each of the q matter sentences, obtaining q second word vectors; determining the similarity between each of the q second word vectors and the first word vector, obtaining q first similarities; determining a target word vector among the q second word vectors according to the relationship between the q first similarities and a preset similarity; and determining the matter sentence corresponding to the target word vector as a candidate matter sentence for the query sentence.
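A minimal sketch of this matching step, under the assumption that the word vectors are compared with Jaccard similarity over word sets (one of the measures the disclosure mentions). The matter library contents and threshold are illustrative.

```python
# Sketch: compare the query's words with each matter sentence's words and
# keep the matter sentences whose similarity clears a preset similarity.

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def candidate_matters(query_words, matter_library, preset_similarity=0.3):
    q = set(query_words)
    scored = [(m, jaccard(q, set(w))) for m, w in matter_library.items()]
    return [m for m, s in scored if s >= preset_similarity]

library = {
    "social insurance": ["social", "insurance", "payment"],
    "pension": ["pension", "claim"],
}
print(candidate_matters(["social", "insurance", "card"], library))
```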
According to an embodiment of the present disclosure, determining the word vector for the query sentence as the first word vector includes: performing word segmentation on the query sentence to obtain s first words; removing stop words from the s first words according to a stop-word library to obtain t second words; counting the number of occurrences of each of the t second words in the query sentence; and determining the first word vector of the query sentence according to a preset lexicon and the occurrence counts, where s and t are integers greater than or equal to 2 and s ≥ t.
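These steps amount to a bag-of-words vector over a fixed lexicon, which can be sketched as below. The stop-word list and preset lexicon are illustrative stand-ins, and whitespace splitting stands in for proper word segmentation.

```python
# Sketch of building the first word vector: tokenize, drop stop words,
# count occurrences, then lay counts out against a fixed preset lexicon.

STOP_WORDS = {"the", "a", "do", "i"}
PRESET_LEXICON = ["social", "insurance", "pension", "materials"]

def first_word_vector(query: str) -> list[int]:
    tokens = query.lower().split()                      # the s first words
    kept = [t for t in tokens if t not in STOP_WORDS]   # the t second words
    return [kept.count(w) for w in PRESET_LEXICON]      # counts per lexicon entry

print(first_word_vector("Do I need the social insurance materials"))
```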
According to an embodiment of the present disclosure, the similarity between each second word vector and the first word vector includes a Jaccard similarity. The sentence processing method further includes: determining a word vector for each matter sentence in the matter library to obtain the q second word vectors; and storing the q second word vectors in a file in NPZ format for later reading, where the q second word vectors are determined according to the preset lexicon.
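The NPZ caching step can be sketched with NumPy's `savez`/`load`, which implement the NPZ format. The vectors and file name here are illustrative.

```python
# Sketch: precompute the q second word vectors once and cache them in an
# NPZ file so they can be read back quickly at query time.
import os
import tempfile

import numpy as np

second_vectors = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])  # q=2 illustrative vectors
path = os.path.join(tempfile.gettempdir(), "matter_vectors.npz")
np.savez(path, vectors=second_vectors)

loaded = np.load(path)["vectors"]
roundtrip_ok = np.array_equal(loaded, second_vectors)
print(roundtrip_ok)  # True
```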
According to an embodiment of the present disclosure, the sentence processing method further includes determining the preset similarity from the matter sentences, by: determining the similarity between each pair of the q matter sentences, obtaining q(q-1)/2 second similarities; and determining the preset similarity according to the value distribution of the q(q-1)/2 second similarities, where the preset similarity includes a first preset similarity and a second preset similarity, and the first preset similarity is greater than the second preset similarity.
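A sketch of this calibration step follows. The disclosure does not specify how the thresholds are read off the distribution; taking the maximum and the median here is an illustrative assumption, as are the toy matter sentences.

```python
# Sketch: compute all q(q-1)/2 pairwise second similarities among matter
# sentences and pick the two preset similarities from their distribution.
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

matters = [{"social", "insurance"}, {"pension", "claim"}, {"insurance", "claim"}]
pair_sims = sorted(jaccard(a, b) for a, b in combinations(matters, 2))
assert len(pair_sims) == len(matters) * (len(matters) - 1) // 2  # q(q-1)/2

# Illustrative choice: first preset = max, second preset = median.
second_preset = pair_sims[len(pair_sims) // 2]
first_preset = pair_sims[-1]
print(len(pair_sims), first_preset >= second_preset)
```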
According to an embodiment of the present disclosure, determining the target word vector among the second word vectors includes: determining whether the q second word vectors include any candidate word vector whose similarity with the first word vector is greater than or equal to the first preset similarity; if the q second word vectors include such candidate word vectors, determining them as the target word vectors; and if they do not: selecting, among the q second word vectors, the word vectors whose similarity with the first word vector is less than the first preset similarity but greater than or equal to the second preset similarity; sorting the selected word vectors in descending order of similarity with the first word vector to obtain a candidate word-vector sequence; and determining a preset number of the highest-ranked word vectors in the sequence as the target word vectors.
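The two-threshold selection logic can be sketched as below; the threshold values, matter identifiers, and `top_k` cutoff are illustrative assumptions.

```python
# Sketch: vectors at or above the first preset similarity win outright;
# otherwise take the best few between the two thresholds, in descending
# order of similarity with the first word vector.

def select_targets(sims, first_preset=0.8, second_preset=0.4, top_k=2):
    # sims: list of (matter_id, similarity to the first word vector)
    strong = [m for m, s in sims if s >= first_preset]
    if strong:
        return strong
    mid = [(m, s) for m, s in sims if second_preset <= s < first_preset]
    mid.sort(key=lambda x: x[1], reverse=True)
    return [m for m, _ in mid[:top_k]]

print(select_targets([("a", 0.9), ("b", 0.5)]))              # ['a']
print(select_targets([("a", 0.5), ("b", 0.7), ("c", 0.1)]))  # ['b', 'a']
```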
According to an embodiment of the present disclosure, performing word segmentation on the query sentence to obtain the s first words includes: replacing words in the query sentence according to the synonym library and a target-field lexicon to obtain a replaced query sentence; and performing word segmentation on the replaced query sentence to obtain the s first words.
According to an embodiment of the present disclosure, the number of question categories is one, and determining the reply sentence for the query sentence includes: when the number of candidate matter sentences is one, determining, from a pre-constructed reply sentence library, the reply sentence that has a mapping relationship with both the candidate matter sentence and the question category as the reply sentence for the query sentence; and when the number of candidate matter sentences is at least two: determining, among the at least two candidate matter sentences, a target matter sentence for the query sentence using a lightweight semantic model; and determining, from the pre-constructed reply sentence library, the reply sentence that has a mapping relationship with both the target matter sentence and the question category as the reply sentence for the query sentence.
According to an embodiment of the present disclosure, determining the target matter sentence for the query sentence includes: generating, according to the question category and the at least two candidate matter sentences, at least two standard query sentences, one for each candidate matter sentence; inputting the query sentence paired with each of the at least two standard query sentences into the lightweight semantic model to obtain the similarity between the query sentence and each standard query sentence; and determining the candidate matter sentence corresponding to the standard query sentence with the highest similarity to the query sentence as the target matter sentence.
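This disambiguation step can be sketched as follows. The template in `make_standard_query` and the token-overlap scorer are invented stand-ins; the real lightweight semantic model would score each sentence pair instead.

```python
# Sketch: build a standard query for each candidate matter sentence from the
# question category, score each (query, standard query) pair, keep the best.

def make_standard_query(category: str, matter: str) -> str:
    return f"{matter} {category}"  # illustrative template

def semantic_similarity(a: str, b: str) -> float:
    # Stand-in for the lightweight semantic model: token-overlap ratio.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def target_matter(query: str, category: str, candidates: list[str]) -> str:
    pairs = [(m, semantic_similarity(query, make_standard_query(category, m)))
             for m in candidates]
    return max(pairs, key=lambda p: p[1])[0]

print(target_matter("social insurance flow", "flow", ["social insurance", "pension"]))
```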
Another aspect of the present disclosure provides a sentence processing apparatus, including: a sentence acquisition module for acquiring the query sentence; a category determination module for determining the question category for the query sentence from a pre-constructed question library, where the question library includes question sentences of p categories; a matter determination module for determining candidate matter sentences for the query sentence from a pre-constructed matter library, where the matter library includes q matter sentences; and a reply determination module for determining a reply sentence for the query sentence according to the question category and the candidate matter sentences, where p and q are integers greater than or equal to 2.
Another aspect of the present disclosure provides a computer system comprising: one or more processors; and a storage device for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the above statement processing method.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for performing the sentence processing method as described above when executed by a processor.
Another aspect of the present disclosure provides a computer program comprising computer executable instructions for implementing the statement processing method as described above when executed.
According to the embodiments of the present disclosure, the technical problem in the related art that answers satisfactory to the user cannot be matched in fields with slow informatization and many specialized expressions can be at least partially avoided. By matching the question category and the candidate matter sentence from the pre-constructed question library and matter library respectively according to the query sentence, the core intent of the user's question can be effectively represented, so the accuracy of the reply sentence determined from the question category and the candidate matter sentence is improved, improving the user experience.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which proceeds with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario of a statement processing method, apparatus, system and medium according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow diagram of a statement processing method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow diagram for determining question categories for an inquiry statement according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart for determining a preset probability value according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow diagram for determining an alternative statement for an inquiry statement according to an embodiment of the disclosure;
FIG. 6 schematically illustrates a distribution histogram of a plurality of second similarities according to an embodiment of the present disclosure;
fig. 7 schematically shows a block diagram of a sentence processing apparatus according to an embodiment of the present disclosure; and
fig. 8 schematically shows a block diagram of a computer system adapted to perform a statement processing method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B, and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
The embodiment of the present disclosure provides a sentence processing method, which first acquires a query sentence. Then, the question category for the query sentence is determined from a pre-constructed question library that includes question sentences of multiple categories. Next, candidate matter sentences for the query sentence are determined from a pre-constructed matter library that includes multiple matter sentences. Finally, a reply sentence for the query sentence is determined according to the question category and the candidate matter sentences.
Fig. 1 schematically illustrates an application scenario of a statement processing method, apparatus, system and medium according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of an application scenario in which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, but does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the application scenario 100 of this embodiment may include, for example, terminal devices 111, 112, 113, a network 120 and an application server 130, the network being a medium for providing communication links between the terminal devices 111, 112, 113 and the application server 130. Network 120 may include various connection types, such as wired, wireless communication links, and so forth.
The terminal devices 111, 112, 113 may be, for example, various electronic devices with display screens and processing capability, including but not limited to smartphones, tablets, laptop computers, desktop computers, smart wearable devices, and the like. The terminal devices may have various client applications installed, such as web-browsing applications, client applications of service organizations, instant-messaging applications, and the like.
Illustratively, various knowledge bases may be maintained in the application server 130 in advance, for example, and the client application in the terminal device 111, 112, 113 may acquire an inquiry sentence in response to a user through an operation on an input device or voice information of the user, acquire knowledge from the knowledge base maintained by the application server 130 according to the inquiry sentence, and present the acquired knowledge to the user via the terminal device.
Illustratively, knowledge in the knowledge base may be set according to actual needs. For the field of government affairs, for example, the knowledge base may include a question library and a reply sentence library. The question library includes a plurality of question sentences, the reply sentence library includes a plurality of reply sentences, and mapping relationships are established between the question sentences and the reply sentences, so that a reply sentence can be determined by looking up the question sentence that matches the user's query and following its mapping relationship.
According to an embodiment of the present disclosure, considering fields with insufficient corpora and specialized wording, such as the government field, and in order to avoid failing to match a reply sentence accurately because the query sentence is highly colloquial, the knowledge base maintained by the application server may include not only a question library and a reply sentence library but also a matter sentence library; the question library may include question sentences from multiple question angles, and mapping relationships may be established among the reply sentences, matter sentences, and question angles. When determining a reply sentence, the matter sentence matching the query sentence and the question angle corresponding to it can be determined separately, and the matching reply sentence then determined. Compared with direct question matching, this grasps the key information of the query sentence at a finer granularity, judges the user's intent more accurately, and improves the accuracy of the determined reply sentence.
Illustratively, as shown in fig. 1, the application scenario may further include a server 140, for example, the server 140 interacting with the terminal device and the application server via the network 120. The terminal device may, for example, send the query sentence to the server 140, with the server 140 determining the reply sentence from a knowledge base maintained in the application server 130.
It should be noted that the statement processing method according to the embodiment of the present disclosure may be generally executed by a terminal device. Accordingly, the sentence processing apparatus of the embodiment of the present disclosure may be generally disposed in the terminal device. Alternatively, the statement processing method according to the embodiment of the present disclosure may be executed by a server. Accordingly, the sentence processing apparatus of the embodiment of the present disclosure may also be disposed in the server.
It should be understood that the terminal devices, networks, application servers, and servers in fig. 1 are merely illustrative. There may be any type of terminal device, network, application server, and server, as the implementation requires.
The statement processing method according to the embodiment of the present disclosure will be described in detail with reference to fig. 2 to 6 in the following application scenario described with reference to fig. 1.
FIG. 2 schematically shows a flow chart of a statement processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the sentence processing method of this embodiment may include operations S210 to S240.
In operation S210, a query statement is acquired.
According to the embodiment of the present disclosure, the query statement may be obtained in response to a user's operation of a peripheral input device of the terminal device, or an input module built in the terminal device, for example. Alternatively, the query sentence may be obtained by converting voice information into text information in response to the voice information of the user.
In operation S220, question categories for the question sentences in the pre-constructed question bank are determined, and the question bank includes p categories of question sentences. Wherein p is an integer of 2 or more.
According to an embodiment of the present disclosure, the categories of question sentences in the question library may be set according to actual needs. For example, in the field of government affairs, the p categories may include a "transaction materials" category, a "transaction location" category, a "transaction time limit" category, a "transaction scope" category, a "transaction fees" category, a "transaction flow" category, a "transaction conditions" category, a "materials requirements" category, and the like. Question sentences of the "transaction materials" category may include, for example, "What materials need to be prepared?" and "What materials need to be provided?"
According to an embodiment of the present disclosure, the query sentence may be matched against the question sentences of the p categories in the question library, and the category of the question sentence with the highest similarity to the query sentence may be determined as the question category for the query sentence. When matching, the query sentence and each question sentence may be converted into sentence vectors, and the similarity between the sentence vectors taken as the similarity between the sentences. The similarity can be any of the following measures: cosine similarity, the Jaccard similarity coefficient, the Pearson correlation coefficient, the Spearman correlation coefficient, and the like. A sentence vector can be obtained, for example, by segmenting the sentence and converting the resulting word sequence into a vector by a word-to-vector (word2vec) method.
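One of the similarity measures listed above, cosine similarity over sentence vectors, can be sketched as follows; the vectors here are invented toy inputs standing in for word2vec-style sentence embeddings.

```python
# Sketch of cosine similarity between two sentence vectors.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

print(round(cosine([1, 0, 1], [1, 0, 1]), 4))  # 1.0 (identical direction)
print(round(cosine([1, 0], [0, 1]), 4))        # 0.0 (orthogonal)
```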
According to the embodiment of the disclosure, a classification model can be obtained by training according to p categories of question sentences in a question bank. And then inputting the inquiry sentences into a pre-trained classification model, and determining the question categories aiming at the inquiry sentences through the output of the classification model.
Illustratively, the output of the classification model may be, for example, a probability vector composed of probabilities of the query statement for each of a plurality of predetermined question classes, one question class for each element in the probability vector. And finally, determining the question category corresponding to the element with the largest value in the probability vector as the question category aiming at the inquiry statement.
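Reading a question category off such a probability vector is a simple argmax, sketched below; the category names are illustrative stand-ins.

```python
# Sketch: the question category is the one whose element in the probability
# vector has the largest value.

CATEGORIES = ["transact_materials", "transact_location", "transact_flow"]

def category_from_probs(probs: list[float]) -> str:
    best = max(range(len(probs)), key=probs.__getitem__)  # index of max probability
    return CATEGORIES[best]

print(category_from_probs([0.1, 0.7, 0.2]))  # transact_location
```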
For example, the output vector of the classification model may be a vector with values of 0 or 1 for each element, and each element corresponds to a problem category. And finally, determining the question category corresponding to the element with the value of 1 as the question category aiming at the inquiry statement.
Illustratively, the output of the classification model may also include, for example, a word vector for the question category of the question sentence and the probability that the question sentence is for the question category. The question category with the highest degree of matching can be determined as the question category for the question sentence directly by matching the word vector of the question category with the predetermined question category.
In an embodiment, the question category for the query statement may also be implemented by the following procedure described in fig. 3, which is not described herein again.
In operation S230, candidate item statements for the query statement in a pre-constructed item library are determined, the item library including q item statements. Wherein q is an integer of 2 or more.
According to the embodiment of the present disclosure, the q transaction statements included in the transaction library may be set according to actual requirements, for example. For example, for the field of government affairs, the matter sentence may include, for example, a sentence for indicating a matter such as "social insurance", "pension", "complaint", and the like.
According to an embodiment of the present disclosure, the query sentence may be matched against the q matter sentences in the matter library, and the matter sentence with the highest similarity to the query sentence determined as a candidate matter sentence for the query sentence. When matching, the query sentence and each matter sentence may be converted into sentence vectors, and the similarity between the sentence vectors taken as the similarity between the query sentence and the matter sentence. A sentence vector can be obtained, for example, by segmenting the sentence and converting the resulting word sequence into a vector by a word-to-vector (word2vec) method.
According to the embodiment of the present disclosure, in order to improve the accuracy of the determined similarity, the similarity may be calculated at the word level, considering that in a highly specialized field (for example, the government affairs field) a user's query statement tends to lack professional phrasing. Operation S230 may therefore be realized, for example, by the process described below with reference to fig. 5, which is not repeated here.
In operation S240, a reply statement for the query statement is determined according to the question category and the candidate matter statement.
According to an embodiment of the present disclosure, in order to facilitate determining the reply statement, a plurality of reply statements may be maintained in advance in the knowledge base of this embodiment, constituting a pre-constructed reply statement library, and for each reply statement a mapping relationship is established between that reply statement and a question category and a matter statement. In one embodiment, the reply statement having a mapping relationship with both the question category and the candidate matter statement may be determined as the reply statement for the query statement.
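The mapping relationship described above can be sketched as a lookup table keyed by the (question category, matter statement) pair; all entries below are hypothetical:

```python
# Sketch: a pre-constructed reply statement library in which each reply is
# keyed by a (question category, matter statement) pair. Entries are
# hypothetical examples, not the patent's actual data.
reply_library = {
    ("how to apply", "social insurance"): "To apply for social insurance, ...",
    ("how to apply", "pension"): "To apply for a pension, ...",
    ("fee", "social insurance"): "The social insurance contribution is ...",
}

def find_reply(question_category, candidate_matter):
    """Return the reply mapped to the category/matter pair, or None."""
    return reply_library.get((question_category, candidate_matter))

print(find_reply("how to apply", "pension"))
```

A miss (no mapping) returns None, which an application could handle with a fallback reply.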
According to the embodiment of the present disclosure, in order to avoid a situation in which the user cannot be answered accurately because at least two determined matter statements lead to at least two reply statements, this embodiment may first determine whether a candidate matter statement exists. If candidate matter statements exist, the number of candidate matter statements is determined. If the number is one, then since only one question category is determined by the method of operation S220, the reply statement having a mapping relationship with both the candidate matter statement and the question category can be determined directly, from the pre-constructed reply statements, as the reply statement for the query statement.
For example, when the number of candidate matter statements is at least two, each of the at least two candidate matter statements may first be concatenated with the word representing the question category to form a new statement, yielding at least two new statements. The candidate matter statement included in the new statement with the highest similarity to the query statement is then determined as the target matter statement. Finally, the reply statement having a mapping relationship with both the target matter statement and the question category is determined, from the pre-constructed reply statements, as the reply statement for the query statement.
For example, to improve accuracy, a predetermined model may further be used to determine the target matter statement for the query statement among the at least two candidate matter statements. In order to ensure the response speed of the intelligent question answering, the predetermined model may adopt a lightweight semantic model. In one embodiment, the lightweight semantic model may be, for example, an ALBERT model.
According to the embodiment of the present disclosure, when the target matter statement is determined by the lightweight semantic model, at least two standard query statements, one for each of the at least two candidate matter statements, may be generated according to the question category and the candidate matter statements. Each standard query statement may be formed by concatenating a candidate matter statement with a word characterizing the question category. The query statement is then paired with each standard query statement, resulting in at least two statement pairs. The statement pairs are input into the lightweight semantic model one by one, which outputs the similarity between each standard query statement and the query statement, giving at least two similarities. Finally, the largest of these similarities is determined, and the candidate matter statement from which the standard query statement corresponding to the largest similarity was generated is taken as the target matter statement.
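The selection procedure above can be sketched as follows. Note that score_pair() is a toy stand-in for the lightweight semantic model (such as ALBERT): it simply measures character overlap, whereas a real implementation would score each statement pair with the trained model.

```python
# Sketch of target-matter selection: each candidate matter statement is
# concatenated with the question-category word into a "standard query
# statement"; the pair with the highest model similarity wins.
def score_pair(query, standard_query):
    # Toy stand-in for the semantic model: Jaccard overlap of characters.
    return len(set(query) & set(standard_query)) / len(set(query) | set(standard_query))

def select_target_matter(query, category_word, candidate_matters):
    standard_queries = {m: m + category_word for m in candidate_matters}
    return max(candidate_matters,
               key=lambda m: score_pair(query, standard_queries[m]))

print(select_target_matter("how to pay social security", " how",
                           ["social security", "pension"]))
```

With the toy scorer, "social security how" overlaps the query far more than "pension how", so "social security" is selected.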
In summary, according to the embodiment of the present disclosure, the question category and the candidate matter statement are matched for the query statement from the pre-constructed question library and the pre-constructed matter library respectively, which effectively captures the core intention of the user's question. The accuracy of the reply statement determined from the question category and the candidate matter statement can therefore be improved, improving the user experience.
FIG. 3 schematically illustrates a flow diagram for determining the question category for a query statement according to an embodiment of the present disclosure.
As shown in fig. 3, in the present embodiment, the operation S220 of determining the question category for the query statement may include, for example, operations S321 to S322.
In operation S321, the query statement is input into the pre-trained classification model, which determines a candidate question category for the query statement and the probability value that the query statement belongs to that candidate question category.
In operation S322, in the case where the probability value is greater than or equal to a preset probability value, the candidate question category is determined to be the question category for the query statement.
According to embodiments of the present disclosure, the pre-trained classification model may comprise, for example, a Convolutional Neural Network (CNN) model. For example, to improve the understanding of the intent of the query statement, the pre-trained classification model may be a CNN + CRF (Conditional Random Field) classification model.
According to an embodiment of the present disclosure, in order to handle the case where the question asked by the user is unrelated to the functions provided by the current client application, the question category determined from the output of the pre-trained classification model may be treated only as a candidate question category, and a preset probability value may be set. The candidate question category is determined to be the question category for the query statement only when the probability value that the query statement belongs to the candidate question category, as output by the pre-trained classification model, is greater than or equal to the preset probability value. The preset probability value may be set according to actual requirements; for example, it may be any value greater than 0.5.
In an embodiment, in order to improve the accuracy of the determined question category, the statement processing method of this embodiment may further include an operation of determining the preset probability value according to the p categories of question statements. This operation may be implemented, for example, by the process described below with reference to fig. 4.
Fig. 4 schematically shows a flow chart for determining a preset probability value according to an embodiment of the present disclosure.
As shown in fig. 4, the operation of determining the preset probability value according to the p categories of question sentences may include, for example, operations S451 to S453.
In operation S451, m training samples and n test samples are obtained according to the p categories of question statements, the m training samples being used to train a predetermined classification model to obtain the pre-trained classification model. Here, m and n are integers greater than or equal to 2.
According to an embodiment of the present disclosure, the predetermined classification model may be, for example, an open-source classification model. This embodiment may divide all question statements in the question library into two parts, one used to generate the training samples and the other used to generate the test samples. In order to improve the accuracy of the model, the division may be balanced across question categories: for example, the question statements of each question category may be divided into two parts, one part for generating training samples and the other for generating test samples.
After all question statements are divided into two parts, each question statement is assigned a label indicating the question category to which it belongs, and the labeled question statements are converted into statement vectors to obtain the training samples or test samples. The conversion into a statement vector may be implemented, for example, by segmenting the statement into words and then converting the resulting word sequence into a vector by a word2vec method.
After the training samples are obtained, the m training samples can be input into the predetermined classification model in turn, and the parameters of the predetermined classification model are adjusted and optimized by comparing its output with the labels of the training samples, thereby obtaining the pre-trained classification model.
According to the embodiment of the present disclosure, in order to avoid insufficient model accuracy caused by the limited corpus available in a specialized field, this embodiment may also supplement the question statements in the question library by Data Augmentation before generating the training samples and the test samples. Specifically, a plurality of associated question statements associated with the question statements of the p categories may be obtained by at least one of the following methods, and the question statements are expanded with these associated question statements.
For example, the application server may maintain a synonym library in advance, and in this embodiment words in the question statements of the p categories may be replaced according to the synonym library to obtain r associated question statements. For example, for the question statement "how to pay social security", an associated question statement can be obtained by replacing the word "pay" with a synonym from the library. Here, r is an integer greater than or equal to 1.
Illustratively, the matter statement included in a question statement of the p categories may be replaced with another matter statement from the plurality of matter statements, resulting in r associated question statements. For example, for the question statement "how to pay social security", the associated question statement "how to pay pension" can be obtained by replacing the matter statement "social security" with "pension".
Illustratively, r associated question statements may be obtained by back-translating the question statements of the p categories. For example, a question statement expressed in Chinese may be translated into an English statement, and the translated English statement may then be translated back into a statement expressed in Chinese, giving an associated question statement. It can be understood that associated question statements can also be obtained by back-translation through other intermediate languages.
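The first two augmentation methods (synonym replacement and matter-statement replacement) can be sketched as follows; the synonym library and matter list below are hypothetical examples:

```python
# Sketch of data augmentation by synonym and matter-statement replacement.
# The synonym library and the matter statements are hypothetical examples.
synonym_library = {"pay": ["contribute to"], "handle": ["apply for"]}
matter_statements = ["social security", "pension"]

def augment(question):
    associated = []
    # 1) synonym replacement
    for word, synonyms in synonym_library.items():
        if word in question:
            associated += [question.replace(word, s) for s in synonyms]
    # 2) matter-statement replacement
    for matter in matter_statements:
        if matter in question:
            associated += [question.replace(matter, other)
                           for other in matter_statements if other != matter]
    return associated

print(augment("how to pay social security"))
```

Each generated associated question statement inherits the category of the question statement it was derived from, as described below.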
After the associated question statements are obtained, in order to facilitate generating the training samples and the test samples, each of the r associated question statements may be assigned to the category of the question statement from which it was derived, and the r associated question statements together with the original question statements are treated as the augmented p categories of question statements. Finally, the augmented p categories of question statements are divided into two parts to obtain the m training samples and the n test samples.
In operation S452, the n test samples are input into the pre-trained classification model, and the probability values of the n test samples for their respective candidate question categories are determined, giving n probability values.
In operation S453, the preset probability value is determined as the average of the n probability values.
According to the embodiment of the present disclosure, by taking each of the n test samples in turn as the input of the pre-trained classification model, the candidate question category for each test sample and the probability value of that test sample for its candidate question category can be obtained from the model output, giving n probability values in one-to-one correspondence with the n test samples. Finally, the average of the n probability values is taken as the preset probability value.
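This averaging step is straightforward; a sketch with hypothetical model-output probabilities follows:

```python
# Sketch: deriving the preset probability value as the average of the
# model's candidate-category probabilities over the n test samples.
# The probability values below are hypothetical model outputs (n = 5).
test_sample_probabilities = [0.91, 0.85, 0.78, 0.88, 0.93]

preset_probability = sum(test_sample_probabilities) / len(test_sample_probabilities)
print(round(preset_probability, 2))  # 0.87
```

At inference time, a candidate question category is accepted only if its probability reaches this threshold.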
FIG. 5 schematically shows a flow diagram for determining the candidate matter statement for a query statement according to an embodiment of the disclosure.
As shown in fig. 5, operation S230 of determining the candidate matter statement for the query statement may include, for example, operations S531 to S535.
In operation S531, a word vector for the query statement is determined as a first word vector.
According to the embodiment of the disclosure, word segmentation may first be performed on the query statement to obtain s first words of the query statement; the segmentation may be performed, for example, with a BERT-based tokenizer. The s first words are then converted according to a predetermined dictionary to obtain the first word vector. It is to be understood that this word segmentation method is only an example to facilitate understanding of the present disclosure; the present disclosure is not limited thereto, and any word segmentation method may be adopted. Here, s is an integer greater than or equal to 2.
According to the embodiment of the present disclosure, stop words such as "what" and "how" are generally absent from matter statements, but are generally present in query statements because their expression is relatively colloquial. Therefore, in order to avoid interference from stop words during statement matching, after the s first words are obtained by word segmentation, this embodiment may maintain a stop-word library and remove the stop words among the s first words according to that library, obtaining t second words. The first word vector is then obtained by converting the t second words. Here, t is an integer greater than or equal to 2, and s is greater than or equal to t.
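The stop-word filtering step can be sketched as follows; the stop-word library and the segmented words are hypothetical examples:

```python
# Sketch of stop-word removal after word segmentation. The stop-word
# library and the example first words are hypothetical.
stop_word_library = {"how", "to", "what", "the", "a"}

def remove_stop_words(first_words):
    """Drop every first word found in the stop-word library."""
    return [w for w in first_words if w not in stop_word_library]

second_words = remove_stop_words(["how", "to", "handle", "social", "insurance"])
print(second_words)  # ['handle', 'social', 'insurance']
```

Only the t remaining second words are converted into the first word vector, so colloquial filler no longer disturbs the matching.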
According to the embodiment of the present disclosure, in order to further avoid low matching accuracy caused by the colloquial phrasing of the query statement, before the word segmentation described above, words in the query statement may first be replaced according to the synonym library and a target-domain lexicon, obtaining a replaced query statement. Word segmentation is then performed on the replaced query statement to obtain the s first words. The target-domain lexicon may maintain the various professional terms of the target field; for the government affairs field, for example, professional terms such as "social insurance", "payment proportion" and "visit" may be maintained. When the query statement includes "social security", the target-domain lexicon may be used to replace "social security" with "social insurance".
According to an embodiment of the present disclosure, in order to further improve the accuracy of the similarity calculation, this embodiment may, when converting the words (first words or second words) of the query statement into a word vector, count the number of occurrences of each word in the query statement. The first word vector of the query statement is then determined based on these occurrence counts and a predetermined word bank: each element of the first word vector corresponds to a word in the predetermined word bank, and the value of each element is the number of occurrences of the corresponding word in the query statement.
For example, for the query statement "how to handle social insurance and loss of business", the resulting second words may be: "what", "do", "reason", "society", "party", "insurance", "loss", "business" and "insurance". Counting occurrences, each word other than "insurance" occurs once, and "insurance" occurs twice. If the predetermined word bank includes 15 words and the words of the query statement occupy positions 1, 2, 4, 5, 7, 9, 10, 11, 13 and 15 among those 15 words, the converted first word vector A can be represented as | 1 1 0 1 1 0 1 0 1 1 2 0 1 0 1 |.
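The construction of such a word-frequency vector over a predetermined word bank can be sketched as follows; the word bank and second words here are hypothetical English stand-ins for the example above:

```python
from collections import Counter

# Sketch: convert segmented second words into a word-frequency vector over
# a predetermined word bank. Each element is the count of the corresponding
# bank word in the query statement. The word bank is a hypothetical example.
word_bank = ["handle", "social", "insurance", "loss", "business", "pension"]

def to_count_vector(words, bank):
    counts = Counter(words)
    return [counts[w] for w in bank]

vec = to_count_vector(
    ["handle", "social", "insurance", "loss", "business", "insurance"],
    word_bank)
print(vec)  # [1, 1, 2, 1, 1, 0]
```

Words absent from the statement contribute a 0 element, and repeated words (here "insurance") contribute counts greater than 1, exactly as in the vector A above.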
In operation S532, a word vector of each of the q matter statements is obtained, giving q second word vectors.
According to the embodiment of the present disclosure, operation S532 may obtain the word vector of each matter statement in a manner similar to operation S531. It should be noted that, since matter statements generally do not include stop words, the stop-word removal operation need not be performed.
According to the embodiment of the present disclosure, in order to increase the statement processing rate, and considering that the matter library is maintained in advance and changes little, the word vector of each matter statement in the matter library may be determined in advance by a method similar to operation S531, giving the q second word vectors. The q second word vectors may then be stored in a predetermined file, and the compressed file stored in a predetermined storage space for reading during statement processing. The predetermined file may be, for example, a file in NPZ format, i.e. a compressed file with the suffix ".npz". Such a file can be written, for example, with the numpy.savez() (or numpy.savez_compressed()) function and read with the numpy.load() function of the NumPy tool in Python. Accordingly, operation S532 may obtain the q second word vectors by reading the ".npz" file in the predetermined storage space. NumPy is an open-source numerical computing extension for Python; it can be used to store and process large matrices, supports high-dimensional array and matrix operations, and provides a large library of mathematical functions for array operations.
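The caching scheme just described can be sketched as follows; the vector matrix is a toy example, and the file path is chosen here only for illustration:

```python
import os
import tempfile

import numpy as np

# Sketch: precompute the matter-statement word vectors once, cache them in
# a compressed .npz archive, and load them back during statement processing.
second_word_vectors = np.array([[1, 0, 2],
                                [0, 1, 1]])  # toy q x vocabulary matrix

path = os.path.join(tempfile.gettempdir(), "matter_vectors.npz")
np.savez_compressed(path, vectors=second_word_vectors)  # write the archive

loaded = np.load(path)["vectors"]                       # read it back
print(np.array_equal(loaded, second_word_vectors))      # True
```

Because the matter library rarely changes, the precomputed archive is written once and only np.load runs on each query.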
In operation S533, a similarity between each of the q second word vectors and the first word vector is determined, resulting in q first similarities.
According to the embodiment of the present disclosure, the first similarity may be embodied by any one of the foregoing parameter forms.
Illustratively, the similarity may be embodied by the Jaccard similarity coefficient, the first word vector and the second word vectors being word-frequency vectors. For the first word vector A and any second word vector B, the similarity between the two can be calculated by the following formula:
sim(A, B) = |A ∩ B| / |A ∪ B| = Σᵢ min(aᵢ, bᵢ) / Σᵢ max(aᵢ, bᵢ)
According to the formula, the Jaccard similarity coefficient is the ratio of the size of the intersection of the two vectors to the size of their union, where the intersection represents the co-occurrence information of the query statement and the matter statement. The size of the intersection divided by the size of the union characterizes the proportion of the co-occurrence information within the overall information of the query statement and the matter statement.
According to the embodiments of the present disclosure, it is considered that the query statement is generally a colloquial expression while the matter statement is generally a professional expression, which can lead to large differences between the two in character length and expressed content. If the similarity between the query statement and the matter statement were calculated directly with the above formula, the many words appearing in the colloquial expression but absent from the matter statement would inflate the size of the union uncontrollably, so the calculated similarity would be low and no matching matter statement could be obtained. To avoid this problem, the present embodiment may determine the first word vector and the second word vectors from the same predetermined dictionary, thereby fixing the range of words the union involves. In this way, the first word vector and each second word vector include equal numbers of elements.
Illustratively, following the embodiment in which the query statement is "how to handle social insurance and loss of business", A = | 1 1 0 1 1 0 1 0 1 1 2 0 1 0 1 |. If the matter statement is "social insurance", then B = | 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 |, where A and B each include a number of elements equal to the number of words in the predetermined dictionary. For elements aᵢ and bᵢ at the same position in A and B, the element at the corresponding position of the intersection is the smaller of aᵢ and bᵢ, and the element at the corresponding position of the union is the larger of aᵢ and bᵢ. The vector obtained by the intersection of A and B can thus be represented as | 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 |; in the above formula, the value of |A ∩ B| is the sum of all elements in this vector, giving |A ∩ B| = 4. The vector obtained by the union of A and B can be represented as | 1 1 0 1 1 0 1 0 1 1 2 0 1 0 1 |; the value of |A ∪ B| is the sum of all elements in this vector, giving |A ∪ B| = 11. The resulting Jaccard similarity coefficient is 4/11, which is approximately equal to 0.364.
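The calculation above can be reproduced directly from the two example vectors:

```python
# Word-level Jaccard similarity on word-frequency vectors, using the
# document's example: A is the query-statement vector, B the vector of the
# matter statement "social insurance".
def jaccard(a, b):
    intersection = sum(min(x, y) for x, y in zip(a, b))  # co-occurrence mass
    union = sum(max(x, y) for x, y in zip(a, b))         # total mass
    return intersection / union

A = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 2, 0, 1, 0, 1]
B = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0]
print(round(jaccard(A, B), 3))  # 0.364, i.e. 4/11
```

Because both vectors range over the same predetermined dictionary, zip pairs corresponding elements and the union cannot grow beyond the dictionary.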
In operation S534, a target word vector of the q second word vectors is determined according to a relationship between the q first similarities and a preset similarity.
In operation S535, the matter statement to which the target word vector corresponds is determined to be the candidate matter statement for the query statement.
According to the embodiment of the disclosure, a second word vector whose first similarity with the first word vector is not less than a preset similarity may be determined as the target word vector. The preset similarity can be set according to actual requirements; for example, it may take any value less than 1, such as 0.5 or 0.6.
According to an embodiment of the present disclosure, in order to improve adaptability, in the case where the first similarity between every second word vector and the first word vector is less than the preset similarity, this embodiment may instead select, from the q second word vectors, a predetermined number of word vectors with the largest first similarities as the target word vectors. The predetermined number may be determined according to the total number of matter statements in the matter library; for example, it may be a predetermined proportion of the total number, such as 0.1 or 0.2 times the total number. Alternatively, the predetermined number may take any value less than the total number, for example 2, 5, 8 or 10, or may be set according to actual requirements, which the present disclosure does not limit.
According to the embodiment of the disclosure, in order to avoid the situation in which the reply statement fed back for the obtained candidate matter statement cannot meet the user's need because the similarity is too low, two preset similarities may also be set: the first preset similarity serves as an upper similarity bound and the second preset similarity serves as a lower similarity bound. Accordingly, the first preset similarity is greater than the second preset similarity, and both can be set according to actual requirements.
Accordingly, operation S534 may first determine whether the q second word vectors include candidate word vectors whose similarity with the first word vector is greater than or equal to the first preset similarity. If so, those candidate word vectors are determined to be the target word vectors. If not, the word vectors among the q second word vectors whose similarity with the first word vector is less than the first preset similarity and greater than or equal to the second preset similarity are determined, and these word vectors are sorted in descending order of their similarity with the first word vector to obtain a candidate word vector sequence. Finally, the predetermined number of word vectors ranked first in the candidate word vector sequence are determined to be the target word vectors.
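The two-threshold selection of operation S534 can be sketched as follows; the threshold values, the predetermined number k and the similarity scores are hypothetical:

```python
# Sketch of the two-threshold target selection: vectors reaching the upper
# threshold win outright; otherwise the top-k vectors between the lower and
# upper thresholds are taken. Thresholds, k and scores are hypothetical.
def select_targets(similarities, upper=0.9, lower=0.1, k=2):
    """Return the indices of the target word vectors."""
    above = [i for i, s in enumerate(similarities) if s >= upper]
    if above:
        return above
    between = [i for i, s in enumerate(similarities) if lower <= s < upper]
    between.sort(key=lambda i: similarities[i], reverse=True)
    return between[:k]

print(select_targets([0.35, 0.05, 0.62, 0.48]))  # no score reaches 0.9
```

Here no similarity reaches the upper bound 0.9, and 0.05 falls below the lower bound 0.1, so the two highest remaining vectors (indices 2 and 3) are selected.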
In summary, by calculating word-level similarity, this embodiment weakens as far as possible the redundant information in the query statement and the matter statement, so that the determined similarity effectively focuses on the part that the query statement has in common with the matter statement.
According to the embodiment of the disclosure, in order to avoid the situation in which the preset similarity cannot be set according to actual requirements under cold-start conditions, the preset similarity may be set according to the similarities between the matter statements in the matter library. Specifically, the similarity between any two of the q matter statements in the matter library may first be determined, giving q(q-1)/2 second similarities. The preset similarity is then determined according to the distribution of the values of the q(q-1)/2 second similarities.
Fig. 6 schematically shows a distribution histogram of the q(q-1)/2 second similarities according to an embodiment of the present disclosure.
Illustratively, the similarity between any two matter statements can be obtained by a method similar to the method of calculating the first similarity described above. In an embodiment, the distribution of the values of the q(q-1)/2 second similarities may be reflected by a histogram such as that shown in fig. 6, in which the abscissa is the value of the similarity, and the ordinate is the percentage, among the q(q-1)/2 second similarities, of the number of similarities corresponding to each value.
According to the histogram of fig. 6, the values of the second similarities are mainly concentrated between 0.1 and 0.9. This indicates to some extent that many matter statements are not well distinguished from one another, and that few pairs of matter statements have a similarity above 0.9 or below 0.1. In an embodiment, the first preset similarity may therefore be set to 0.9 and the second preset similarity to 0.1, for example.
It should be understood that the method for determining the preset similarity and the method for embodying the distribution of the values of the plurality of second similarities are only examples to facilitate understanding of the disclosure, and the disclosure is not limited thereto.
Fig. 7 schematically shows a block diagram of a sentence processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the sentence processing apparatus 700 of this embodiment may include a sentence acquisition module 710, a category determination module 720, a matter determination module 730, and a reply determination module 740.
The statement obtaining module 710 is configured to obtain the query statement. In an embodiment, the statement obtaining module 710 may be configured to perform operation S210 described in fig. 2, for example, and is not described herein again.
The category determination module 720 is used to determine the question category for the query statement in a pre-constructed question library, the question library including p categories of question statements, where p is an integer greater than or equal to 2. In an embodiment, the category determining module 720 may be configured to perform operation S220 described in fig. 2, which is not repeated here.
The matter determining module 730 is configured to determine candidate matter statements for the query statement in a pre-constructed matter library, the matter library including q matter statements, where q is an integer greater than or equal to 2. In an embodiment, the matter determining module 730 may be configured to perform operation S230 described in fig. 2, which is not repeated here.
The reply determining module 740 is configured to determine a reply statement for the query statement according to the question category and the candidate matter statement. In an embodiment, the reply determining module 740 may be configured to perform operation S240 described in fig. 2, which is not repeated here.
According to the embodiment of the present disclosure, the category determining module 720 may specifically determine the question category for the question sentence by performing operations S321 to S322 described in fig. 3, for example, and details are not repeated here.
According to an embodiment of the present disclosure, the sentence processing apparatus 700 may further include a probability value determining module, for example, configured to determine the preset probability value by performing operations S451 to S453 described in fig. 4, which is not described herein again.
According to the embodiment of the present disclosure, the matter determining module 730 may determine the candidate matter statement for the query statement by performing operations S531 to S535 described in fig. 5, which is not repeated here.
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
Fig. 8 schematically shows a block diagram of a computer system adapted to perform a statement processing method according to an embodiment of the present disclosure.
As shown in fig. 8, a computer system 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 801 may also include onboard memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing different actions of the method flows according to embodiments of the present disclosure.
In the RAM 803, various programs and data necessary for the operation of the computer system 800 are stored. The processor 801, the ROM802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM802 and/or RAM 803. Note that the programs may also be stored in one or more memories other than the ROM802 and RAM 803. The processor 801 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the computer system 800 may further include an input/output (I/O) interface 805, which is also connected to the bus 804. The computer system 800 may further include one or more of the following components connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read therefrom is installed into the storage section 808 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program, when executed by the processor 801, performs the above-described functions defined in the computer system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM802 and/or RAM 803 described above and/or one or more memories other than the ROM802 and RAM 803.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or sub-combinations of the features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or sub-combinations are not expressly recited in the present disclosure. In particular, various combinations and/or sub-combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or sub-combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (15)

1. A statement processing method, comprising:
acquiring a query statement;
determining a question category for the query statement in a pre-constructed question library, wherein the question library comprises question sentences of p categories;
determining a candidate matter statement for the query statement in a pre-constructed matter library, wherein the matter library comprises q matter statements; and
determining a reply sentence to the query statement according to the question category and the candidate matter statement,
wherein p and q are integers of 2 or more.
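As a non-authoritative illustration of the claimed flow, the four steps above can be sketched as a small pipeline in which each stage is injected as a callable; the function names and the (category, matter) reply lookup are assumptions, not part of the claim:

```python
def process_query(query, classify, match_matters, reply_library):
    """End-to-end flow of claim 1: classify the query statement into a
    question category, retrieve candidate matter statements, then look
    up the reply mapped to each (category, matter) pair."""
    category = classify(query)        # question category (one of p classes)
    matters = match_matters(query)    # candidate matter statements (from q entries)
    return [reply_library[(category, m)] for m in matters]
```

Each stage corresponds to one of the dependent claims below (classification in claims 2-4, matter matching in claims 5-10, reply selection in claims 11-12).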
2. The method of claim 1, wherein the determining a question category for the query statement in the pre-constructed question library comprises:
inputting the query statement into a pre-trained classification model, and determining an alternative question category for the query statement and a probability value of the query statement for the alternative question category; and
determining the alternative question category as the question category for the query statement in a case where the probability value is greater than or equal to a preset probability value,
wherein the classification model is obtained by training with the question sentences of the p categories.
3. The method of claim 2, further comprising determining the preset probability value according to the question sentences of the p categories, which comprises:
obtaining m training samples and n test samples according to the question sentences of the p categories, wherein the m training samples are used for training a preset classification model to obtain the pre-trained classification model;
inputting the n test samples into the pre-trained classification model, and determining a probability value of each of the n test samples for its alternative question category, to obtain n probability values; and
determining the preset probability value as an average of the n probability values,
wherein m and n are integers greater than or equal to 2.
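One way to realize the threshold selection of claims 2-3 is to average the pre-trained model's confidence over the n held-out test samples. A minimal sketch, assuming only that the model is a callable returning a category-to-probability mapping (the function names are illustrative):

```python
from statistics import mean

def preset_probability_threshold(model, test_samples, category):
    """Claim 3: the preset probability value is the mean of the model's
    probabilities for `category` over the n test samples."""
    return mean(model(sample)[category] for sample in test_samples)

def accept_category(model, query, category, threshold):
    """Claim 2: accept the alternative question category only when the
    model's probability reaches the preset value."""
    return model(query)[category] >= threshold
```

A query whose top category scores below the data-derived threshold is thus left uncategorized rather than forced into a low-confidence class.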
4. The method of claim 3, wherein the obtaining m training samples and n test samples comprises:
obtaining r associated question sentences associated with the question sentences of the p categories by at least one of: replacing words in the question sentences of the p categories according to a synonym library; replacing the matter statements included in the question sentences of the p categories according to the q matter statements; and back-translating the question sentences of the p categories;
dividing the r associated question sentences into the categories to which the question sentences associated with them belong, to obtain enhanced question sentences of p categories; and
dividing the enhanced question sentences of the p categories into the m training samples and the n test samples,
wherein r is an integer of 1 or more.
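The first augmentation option of claim 4 (synonym substitution) can be sketched as follows; whitespace tokenization stands in for a real Chinese word segmenter, and the lexicon shape (word to list of synonyms) is an assumption:

```python
def synonym_augment(question, synonym_lexicon):
    """Claim 4, option 1: produce associated question sentences by
    substituting, one at a time, each word that has synonyms in the
    lexicon. Each substitution yields one associated sentence, which
    inherits the category of the original question sentence."""
    words = question.split()
    variants = []
    for i, word in enumerate(words):
        for synonym in synonym_lexicon.get(word, []):
            variants.append(" ".join(words[:i] + [synonym] + words[i + 1:]))
    return variants
```

The other two options follow the same pattern: generate a variant (by swapping the embedded matter statement, or by back-translation), then file it under the category of the question sentence it came from.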
5. The method of claim 1, wherein the determining a candidate matter statement for the query statement comprises:
determining a word vector for the query statement as a first word vector;
acquiring a word vector for each of the q matter statements to obtain q second word vectors;
determining a similarity between each of the q second word vectors and the first word vector to obtain q first similarities;
determining a target word vector among the q second word vectors according to a relation between the q first similarities and a preset similarity; and
determining the matter statement corresponding to the target word vector as the candidate matter statement for the query statement.
6. The method of claim 5, wherein the determining a word vector for the query statement as a first word vector comprises:
performing word segmentation processing on the query statement to obtain s first words;
removing stop words from the s first words according to a stop word library to obtain t second words;
counting the numbers of occurrences of the t second words in the query statement; and
determining the first word vector of the query statement according to a predetermined word bank and the numbers of occurrences,
wherein s and t are integers greater than or equal to 2, and s is greater than or equal to t.
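Claim 6 describes a count-based (bag-of-words) vector. A sketch, under the assumptions that segmentation is whitespace splitting (a deployed system would use a Chinese word segmenter) and that the predetermined word bank is an ordered list:

```python
from collections import Counter

def first_word_vector(query, stop_words, word_bank):
    """Claim 6: segment the query statement, remove stop words, count
    the remaining (second) words, and lay the counts out over the
    predetermined word bank to form the first word vector."""
    second_words = [w for w in query.split() if w not in stop_words]
    counts = Counter(second_words)
    return [counts[w] for w in word_bank]  # Counter yields 0 for absent words
```

Fixing the word bank's order once means the first word vector and all q second word vectors are directly comparable position by position.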
7. The method of claim 5, wherein the similarity between each second word vector and the first word vector comprises a Jaccard similarity, and the method further comprises:
determining a word vector of each matter statement in the matter library to obtain the q second word vectors; and
storing the q second word vectors in a compressed file for reading,
wherein the q second word vectors are determined according to a predetermined word bank.
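For count vectors such as those of claim 6, the Jaccard similarity of claim 7 is commonly generalized as the sum of element-wise minima over the sum of element-wise maxima; the patent does not fix the exact variant, so this generalization is an assumption:

```python
def jaccard_similarity(u, v):
    """Generalized Jaccard similarity between two equal-length
    non-negative count vectors: sum(min) / sum(max), defined as 0.0
    when both vectors are all zeros."""
    numerator = sum(min(a, b) for a, b in zip(u, v))
    denominator = sum(max(a, b) for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0
```

On 0/1 vectors this reduces to the familiar set form |A ∩ B| / |A ∪ B|, which is why precomputing and compressing the q second word vectors (as the claim suggests) is enough to score a new query cheaply.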
8. The method of claim 5, further comprising determining the preset similarity according to the q matter statements, which comprises:
determining similarities among the q matter statements to obtain [q(q-1)/2] second similarities; and
determining the preset similarity according to the value distribution of the [q(q-1)/2] second similarities,
wherein the preset similarity comprises a first preset similarity and a second preset similarity, and the first preset similarity is greater than the second preset similarity.
9. The method of claim 8, wherein the determining a target word vector among the q second word vectors comprises:
determining whether the q second word vectors include an alternative word vector whose similarity to the first word vector is greater than or equal to the first preset similarity;
determining the alternative word vector as the target word vector in a case where the q second word vectors include the alternative word vector; and
in a case where the q second word vectors do not include the alternative word vector:
determining, among the q second word vectors, word vectors to be selected whose similarity to the first word vector is smaller than the first preset similarity and greater than or equal to the second preset similarity;
sorting the word vectors to be selected in descending order of their similarity to the first word vector to obtain a word vector sequence to be selected; and
determining a preset number of top-ranked word vectors to be selected in the word vector sequence to be selected as the target word vector.
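The branching in claim 9 amounts to: take every vector clearing the high threshold; only if none does, rank the mid-band vectors and keep the top few. A sketch with the similarity function injected (function and parameter names are illustrative):

```python
def select_target_vectors(first_vec, second_vecs, hi, lo, top_k, sim):
    """Claim 9: vectors with sim >= hi become target word vectors
    outright; otherwise the vectors with lo <= sim < hi are sorted by
    descending similarity and the top_k of them are selected."""
    scored = [(sim(first_vec, v), i) for i, v in enumerate(second_vecs)]
    winners = [i for s, i in scored if s >= hi]
    if winners:
        return winners
    mid_band = sorted((p for p in scored if lo <= p[0] < hi), reverse=True)
    return [i for _, i in mid_band[:top_k]]
```

Returning indices (rather than vectors) makes it easy to map the selected target word vectors back to their matter statements, as claim 5 requires.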
10. The method of claim 6, wherein the performing word segmentation processing on the query statement to obtain s first words comprises:
replacing words in the query statement according to a synonym library and a target field thesaurus to obtain a replaced query statement; and
performing word segmentation processing on the replaced query statement to obtain the s first words.
11. The method of claim 1, wherein the number of question categories is one, and the determining a reply sentence to the query statement comprises:
in a case where the number of candidate matter statements is one, determining, from a pre-constructed reply sentence library, a reply sentence having a mapping relationship with both the candidate matter statement and the question category as the reply sentence to the query statement; and
in a case where the number of candidate matter statements is at least two:
determining a target matter statement for the query statement among the at least two candidate matter statements by using a lightweight semantic model; and
determining, from the pre-constructed reply sentence library, a reply sentence having a mapping relationship with both the target matter statement and the question category as the reply sentence to the query statement.
12. The method of claim 11, wherein the determining a target matter statement for the query statement comprises:
generating, according to the question category and the at least two candidate matter statements, at least two standard query statements for the at least two candidate matter statements respectively;
taking the query statement and each of the at least two standard query statements as a statement pair, and inputting each statement pair into the lightweight semantic model to obtain a similarity between the query statement and each standard query statement; and
determining the candidate matter statement corresponding to the standard query statement with the greatest similarity to the query statement as the target matter statement.
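Claim 12 can be sketched as a rank-and-pick over composed standard query statements. The template used to build a standard query and the injected similarity function (standing in for the lightweight semantic model) are both assumptions:

```python
def pick_target_matter(query, question_category, candidate_matters, similarity):
    """Claim 12: build one standard query statement per candidate
    matter statement, score each (query, standard query) pair, and
    return the candidate whose standard query is most similar."""
    def standard_query(matter):
        # Hypothetical template; the patent does not fix the wording.
        return f"{question_category} {matter}"
    return max(candidate_matters,
               key=lambda m: similarity(query, standard_query(m)))
```

In practice the similarity callable would wrap a sentence-pair model; a token-overlap function is enough to exercise the selection logic.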
13. A sentence processing apparatus comprising:
a statement acquisition module for acquiring a query statement;
a category determination module for determining a question category for the query statement in a pre-constructed question library, wherein the question library comprises question sentences of p categories;
a matter determination module for determining a candidate matter statement for the query statement in a pre-constructed matter library, wherein the matter library comprises q matter statements; and
a reply determination module for determining a reply sentence to the query statement according to the question category and the candidate matter statement,
wherein p and q are integers of 2 or more.
14. A computer system, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-12.
15. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 12.
CN202010764814.8A 2020-07-31 2020-07-31 Statement processing method, device, system and medium Active CN111858899B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010764814.8A CN111858899B (en) 2020-07-31 2020-07-31 Statement processing method, device, system and medium


Publications (2)

Publication Number Publication Date
CN111858899A true CN111858899A (en) 2020-10-30
CN111858899B CN111858899B (en) 2023-09-15

Family

ID=72954343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010764814.8A Active CN111858899B (en) 2020-07-31 2020-07-31 Statement processing method, device, system and medium

Country Status (1)

Country Link
CN (1) CN111858899B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959529A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Determination method, apparatus, equipment and the storage medium of problem answers type
CN109840277A (en) * 2019-02-20 2019-06-04 西南科技大学 A kind of government affairs Intelligent Service answering method and system
CN110377721A (en) * 2019-07-26 2019-10-25 京东方科技集团股份有限公司 Automatic question-answering method, device, storage medium and electronic equipment
US20200065389A1 (en) * 2017-10-10 2020-02-27 Tencent Technology (Shenzhen) Company Limited Semantic analysis method and apparatus, and storage medium
CN111241245A (en) * 2020-01-14 2020-06-05 百度在线网络技术(北京)有限公司 Human-computer interaction processing method and device and electronic equipment
CN111428010A (en) * 2019-01-10 2020-07-17 北京京东尚科信息技术有限公司 Man-machine intelligent question and answer method and device


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154501A (en) * 2022-02-09 2022-03-08 南京擎天科技有限公司 Chinese address word segmentation method and system based on unsupervised learning
CN114154501B (en) * 2022-02-09 2022-04-26 南京擎天科技有限公司 Chinese address word segmentation method and system based on unsupervised learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant