WO2022222942A1 - Method, apparatus, electronic device, and storage medium for generating question and answer records - Google Patents

Method, apparatus, electronic device, and storage medium for generating question and answer records

Info

Publication number
WO2022222942A1
Authority
WO
WIPO (PCT)
Prior art keywords
question
word
answer
record
chat
Prior art date
Application number
PCT/CN2022/087818
Other languages
English (en)
French (fr)
Inventor
朱章春
Original Assignee
康键信息技术(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 康键信息技术(深圳)有限公司
Publication of WO2022222942A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a method, apparatus, electronic device, and computer-readable storage medium for generating a question and answer record.
  • the inventor realized that asking and answering questions is of central importance, whether in a live broadcast room that relies heavily on manual operation, in a teacher's classroom, or in everyday communication, and that the generated question and answer record can also serve as a reference for subsequent communication.
  • the existing question and answer record generation method usually matches each acquired question directly against an existing question and answer database and responds, without considering how frequently the question occurs. As a result, identical or similar questions are matched repeatedly, which makes the generation of question and answer records inefficient.
  • a method for generating question and answer records includes:
  • Select one of the word segmentations in turn according to the arrangement order of the word segmentations in the hot word list, and use the selected word segmentation as a search term to search in the chat record to obtain the question corresponding to the search term;
  • the question obtained by retrieval is input into the question intent classification model to obtain a question intent, the question is answered according to the question intent, a question and answer record is generated, and the question and answer record is pushed to the client.
  • the present application also provides a question and answer record generating device, the device comprising:
  • the word segmentation extraction module is used to perform word segmentation processing on the acquired chat records, and count the frequency of each word segmentation;
  • a hot word list generation module configured to summarize the word segments whose frequencies are greater than a preset threshold to obtain a hot word segment set; sort the segment words in the hot word segment set according to the frequency to generate a hot word list;
  • a record retrieval module configured to sequentially select one of the word segmentations according to the arrangement order of the word segmentations in the hot word list, and use the selected word segmentation as a search term to search in the chat records to obtain the question corresponding to the search term;
  • the training corpus generation module is used to obtain the original question and answer data set, extract each process node in the original question and answer data set and the corpus data corresponding to the process node, and mark the corpus data with process nodes and merge them to obtain a training corpus;
  • a model training module configured to perform feature encoding on the training corpus to obtain a training corpus vector, and use the training corpus to train a preset multi-classification model to obtain a question intent classification model;
  • a question-and-answer record generation module is used to input the question obtained by retrieval into the question intent classification model to obtain a question intent, answer the question according to the question intent, generate a question and answer record, and push the question and answer record to the client.
  • the present application also provides an electronic device, the electronic device comprising:
  • the processor executes the instructions stored in the memory to implement the method for generating question and answer records as described below:
  • Select one of the word segmentations in turn according to the arrangement order of the word segmentations in the hot word list, and use the selected word segmentation as a search term to search in the chat record to obtain the question corresponding to the search term;
  • the question obtained by retrieval is input into the question intent classification model to obtain a question intent, the question is answered according to the question intent, a question and answer record is generated, and the question and answer record is pushed to the client.
  • the present application also provides a computer-readable storage medium, where at least one instruction is stored, and the at least one instruction is executed by a processor in an electronic device to implement the following method for generating a question and answer record:
  • Select one of the word segmentations in turn according to the arrangement order of the word segmentations in the hot word list, and use the selected word segmentation as a search term to search in the chat record to obtain the question corresponding to the search term;
  • the question obtained by retrieval is input into the question intent classification model to obtain a question intent, the question is answered according to the question intent, a question and answer record is generated, and the question and answer record is pushed to the client.
  • FIG. 1 is a schematic flowchart of a method for generating a question and answer record provided by an embodiment of the present application
  • Fig. 2 is a schematic flowchart of one of the steps in the question and answer record generation method shown in Fig. 1;
  • FIG. 3 is a functional block diagram of an apparatus for generating a question and answer record provided by an embodiment of the present application
  • FIG. 4 is a schematic structural diagram of an electronic device implementing the method for generating a question and answer record provided by an embodiment of the present application.
  • the embodiment of the present application provides a method for generating a question and answer record.
  • the execution body of the method for generating a question and answer record includes, but is not limited to, at least one of electronic devices, such as a server and a terminal, that can be configured to execute the method provided by the embodiments of the present application.
  • the method for generating the question and answer record may be executed by software or hardware installed on the terminal device or the server device, and the software may be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
  • Referring to FIG. 1, a schematic flowchart of a method for generating a question and answer record provided by an embodiment of the present application is shown.
  • the method for generating a question and answer record includes:
  • chat records may be acquired from a preset IM (Instant Messenger) system, where the IM system is a customizable communication system integrating a multi-person video conference function.
  • performing word segmentation processing on the acquired chat records includes:
  • the word segmentation is filtered from the word segmentation chat set.
  • the preset rule refers to removing special symbols and stop words in the chat record
  • the special symbols refer to symbols that are used infrequently and are difficult to input directly, such as mathematical symbols, unit symbols, and tabs, for example @, #, $, etc.
  • the stop words refer to words that have no actual meaning, such as "ah", "he", "de", etc.
  • the stop word list can be the "Harbin Institute of Technology Stop Word Thesaurus" or the "Sichuan University Machine Learning Intelligent Laboratory Stop Word Thesaurus".
  • a preset Jieba tokenizer may be used to perform word segmentation on the initial chat record to obtain a segmented chat set.
  • the preset keyword dictionary contains professional terms in a preset field. Filtering the word segmentations from the word segmentation chat set against this dictionary extracts keywords that fit the actual application scenario and avoids extracting useless data and creating data redundancy.
  • the chat record includes: "#Will high blood lipids cause cerebral infarction?" and "Doctor, how should cerebral infarction be treated?". The chat record is processed by removing special symbols and stop words, so that the two special symbols "#" and " ⁇ " in the chat record are removed, and the processed chat record is then subjected to word segmentation processing to obtain a word segmentation chat set: "Excuse me/high blood lipids/will/cause/cerebral infarction/?
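  • The preprocessing, segmentation, and frequency counting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the stop word set, the symbol pattern, the English sample sentences, and the whitespace tokenizer are all hypothetical stand-ins. For Chinese text, a tokenizer such as Jieba could be passed in as `tokenizer` instead.

```python
import re
from collections import Counter

# Hypothetical stop words and special symbols, chosen only for this sketch.
STOP_WORDS = {"ah", "he", "de"}
SPECIAL_SYMBOLS = re.compile(r"[@#$\t]")


def segment(text, tokenizer=str.split):
    """Remove special symbols, split the text into word segments, drop stop words."""
    cleaned = SPECIAL_SYMBOLS.sub(" ", text)
    return [w for w in tokenizer(cleaned.lower()) if w not in STOP_WORDS]


def count_frequencies(chat_records, keyword_dictionary=None):
    """Count occurrences of each word segment, optionally keeping only dictionary terms."""
    counts = Counter()
    for record in chat_records:
        for word in segment(record):
            if keyword_dictionary is None or word in keyword_dictionary:
                counts[word] += 1
    return counts


records = ["#will hyperlipidemia cause stroke", "doctor how should stroke be treated"]
freq = count_frequencies(records, keyword_dictionary={"stroke", "hyperlipidemia"})
```

  • Passing a keyword dictionary restricts the counts to domain terms, which mirrors the filtering step against the preset keyword dictionary.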
  • the method before performing word segmentation on the acquired chat record, the method further includes:
  • the chat record corresponding to the user is retained.
  • the identity verification of the user corresponding to the chat record checks whether the user is on a predetermined user list; only the speech of users on the user list is processed in subsequent steps.
  • the chat records of users who are not on the user list are not accepted, because such chat records have no practical reference value.
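  • A minimal sketch of this identity check, assuming each chat record is a dictionary with hypothetical `user` and `text` fields and that the predetermined user list is held as a set:

```python
def filter_by_user_list(chat_records, allowed_users):
    """Keep only the chat records whose sender is on the predetermined user list."""
    return [record for record in chat_records if record["user"] in allowed_users]


records = [
    {"user": "alice", "text": "How should cerebral infarction be treated?"},
    {"user": "mallory", "text": "Unrelated message"},
]
kept = filter_by_user_list(records, allowed_users={"alice", "bob"})
```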
  • the preset threshold is the criterion for determining whether the word segment corresponding to a frequency is a popular word segment. If the frequency is greater than the preset threshold, the corresponding word segment is a popular word segment. If the frequency is less than or equal to the preset threshold, the word segment occurs relatively rarely and cannot be determined to be a popular word segment.
  • the word segmentation in the popular word segmentation set is sorted according to the size of the frequency to generate a hot word list, including:
  • the preset number may be 10.
  • the purpose of intercepting the preset number of word segmentations in the initial list is to further screen the initial list.
  • the initial list is obtained.
  • the initial list may contain many low-frequency word segmentations, so a preset number of word segmentations in the initial list are intercepted to generate a hot word list.
  • the hot word list helps the host in the live broadcast room understand more intuitively the topics and questions the audience wants to know about, and the host can answer the questions on the hot word list in order from top to bottom.
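  • The threshold filtering, sorting, and truncation that produce the hot word list can be sketched as below. The function name, the sample frequencies, and the alphabetical tie-break between equally frequent words are assumptions made for this illustration; the default of 10 follows the preset number mentioned above.

```python
def build_hot_word_list(frequencies, threshold, top_n=10):
    """Keep words whose frequency exceeds the threshold, sort by frequency, truncate."""
    hot = {word: f for word, f in frequencies.items() if f > threshold}
    # Sort by descending frequency; ties are broken alphabetically (an assumption).
    initial_list = sorted(hot, key=lambda word: (-hot[word], word))
    return initial_list[:top_n]


freq = {"cerebral infarction": 12, "blood lipids": 7, "heart disease": 7, "hello": 2}
hot_words = build_hot_word_list(freq, threshold=5, top_n=2)
```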
  • S403 Calculate the similarity between the search term vector and the keyword vector, and select the question corresponding to the keyword whose similarity is greater than or equal to a preset similarity threshold as the question corresponding to the search term.
  • the initial chat record may include multiple chat keywords, and the chat keywords are medical terms in the medical field. The search term is compared with each of the multiple chat keywords; as long as one chat keyword in a sentence of the initial chat record matches, that sentence is used as the question corresponding to the search term.
  • when the question corresponding to a keyword whose similarity is greater than or equal to the preset similarity threshold is selected as the question corresponding to the search term, the question may be a chat record phrased as a question sentence. If the retrieved keyword appears in a chat record that is not a question sentence, that chat record is not treated as a question corresponding to the search term.
  • the search term is "cerebral infarction"
  • the initial chat record is: "Cerebral infarction and heart disease occur very frequently in modern society. Is there a truly effective method for treating cerebral infarction?"
  • "The truly effective ones are mainly the following three, including"
  • the chat keywords in the initial chat record are "cerebral infarction" and "heart disease". By calculating the similarity between the search term vector and the keyword vectors, the question in the initial chat record containing "cerebral infarction" may be used as the question corresponding to the search term.
  • many methods can be used to calculate the similarity between the search word vector and the keyword vector, including, but not limited to, the cosine similarity formula and the Euclidean distance.
  • the calculating the similarity between the search word vector and the keyword vector includes:
  • This embodiment of the present application may perform vectorization processing on the search term and the chat keyword according to a preset word2vec algorithm, to obtain the search term vector and the keyword vector.
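  • As an illustration of the similarity-based retrieval, the sketch below computes cosine similarity between a search term vector and keyword vectors and keeps only sentences phrased as questions. The two-dimensional vectors are toy values standing in for word2vec embeddings, the 0.8 threshold is hypothetical, and detecting a question by its trailing question mark is an assumption of this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_questions(search_vector, records, threshold=0.8):
    """records is a list of (sentence, keyword_vectors) pairs."""
    matches = []
    for sentence, keyword_vectors in records:
        is_question = sentence.rstrip().endswith(("?", "？"))
        similar = any(
            cosine_similarity(search_vector, k) >= threshold for k in keyword_vectors
        )
        if is_question and similar:
            matches.append(sentence)
    return matches


records = [
    ("Is there an effective treatment for cerebral infarction?", [[0.9, 0.1], [0.0, 1.0]]),
    ("The effective ones are mainly the following three.", [[1.0, 0.0]]),
    ("Does heart disease run in families?", [[0.0, 1.0]]),
]
questions = retrieve_questions([1.0, 0.0], records)
```

  • The second record matches the search term but is not a question sentence, so it is excluded, consistent with the rule that only chat records in question form are returned.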
  • the original question answering data set may be medical questions and corresponding answers contained in the business scenario of intelligent question answering.
  • each process node in the original question-and-answer data set may be one round of dialogue between the medical expert and the questioner, and the corresponding corpus data refers to the medical expert's answers in each round of dialogue.
  • the corpus data are marked with process nodes and then merged to obtain training corpus.
  • one-hot encoding, target encoding, Bayesian target encoding, etc. may be used to perform feature encoding on the training corpus; the present application uses preset one-hot encoding to perform feature encoding on the training corpus and obtain the training corpus vector.
  • performing feature encoding on the training corpus to obtain a training corpus vector including:
  • the initial matrix vector is constructed and obtained;
  • the entry in the column of the initial matrix vector that corresponds to the training corpus is set to a first numerical value, and the entries in the other columns are set to a second numerical value, to obtain the training corpus vector.
  • For example, if the total number of corpus items is 5, that is, there are five rounds of dialogue, the training corpus vector of the first round of training corpus is [1, 0, 0, 0, 0].
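  • The one-hot encoding step can be sketched as follows, with the first and second numerical values assumed to be 1 and 0 as in the five-round example; the function name and the zero-based index are choices made for this illustration.

```python
def one_hot(index, total, first_value=1, second_value=0):
    """Build a vector of length `total` with `first_value` at `index`."""
    if not 0 <= index < total:
        raise ValueError("index must lie in [0, total)")
    vector = [second_value] * total  # the initial vector, filled with the second value
    vector[index] = first_value
    return vector


# Five rounds of dialogue give five training corpus vectors.
vectors = [one_hot(i, 5) for i in range(5)]
```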
  • the preset multi-classification model may be a shallow neural network or a support vector machine model.
  • the question intent classification model obtained by training the multi-classification model can identify the intent corresponding to each question.
  • a problem intent classification model including:
  • Classify the training corpus by using the preset multi-classification model to obtain one or more classification intentions;
  • the embodiment of the present application marks the intent category in the original question answering dataset by judging the actual intent of the medical expert's answer in the dialogue between the medical expert and the questioner in each round.
  • the first round of dialogue is: Questioner: "Can you answer questions related to cerebral infarction?" Medical expert: "Yes, I can."
  • the intent category of the first round of dialogue is to confirm the field in which the medical expert answers. The second round of dialogue is: Questioner: "What are the effective treatment options for cerebral infarction?" Medical expert: "Currently, there are mainly the following common treatment options for the treatment of cerebral infarction. First, ". The intent category of the second round of dialogue is to confirm the solution to a specific problem, and so on.
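  • To make the idea of training a multi-class intent model concrete, the sketch below trains a small multi-class perceptron on bag-of-words features. The perceptron is a deliberately simple stand-in for the shallow neural network or support vector machine mentioned above, and the intent labels and training sentences are hypothetical.

```python
from collections import defaultdict


def featurize(text):
    """Bag-of-words features: the set of lowercase words in the text."""
    return set(text.lower().replace("?", " ").split())


def predict(weights, text):
    """Return the intent with the highest score; ties go to the first name in sorted order."""
    words = featurize(text)
    scores = {
        intent: sum(w.get(word, 0.0) for word in words) for intent, w in weights.items()
    }
    return max(sorted(scores), key=lambda intent: scores[intent])


def train_perceptron(samples, epochs=10):
    """samples is a list of (question_text, intent_label) pairs."""
    intents = sorted({intent for _, intent in samples})
    weights = {intent: defaultdict(float) for intent in intents}
    for _ in range(epochs):
        for text, gold in samples:
            predicted = predict(weights, text)
            if predicted != gold:
                # Move weight towards the correct intent and away from the wrong one.
                for word in featurize(text):
                    weights[gold][word] += 1.0
                    weights[predicted][word] -= 1.0
    return weights


samples = [
    ("can you answer questions about cerebral infarction", "confirm_field"),
    ("what are the treatment options for cerebral infarction", "confirm_solution"),
    ("can you answer questions about heart disease", "confirm_field"),
    ("what are the treatment options for heart disease", "confirm_solution"),
]
model = train_perceptron(samples)
```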
  • the question obtained by the retrieval is input into the question intent classification model to obtain the question intent.
  • the question intent gives a rough estimate of the scope and field of the question and corresponds to different user intents, which can improve the efficiency of answering questions.
  • the question obtained by the retrieval is answered according to the question intention to obtain the corresponding answer, and the question and the corresponding answer are aggregated to generate the question and answer record.
  • answering the question and generating a question and answer record according to the question intent includes:
  • a question and answer record is generated based on the unanswered question and the answer to the unanswered question.
  • question-and-answer database includes common questions corresponding to some related intents and answers corresponding to the common questions.
  • in the embodiment of the present application, a question for which no matching question can be found in the question and answer database is marked as an unanswered question.
  • the answer to an unanswered question can be obtained by manual answering.
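  • The lookup and unanswered-question handling can be sketched as below. The database layout (intent mapped to question mapped to answer), the exact-match lookup, the `manual_answers` dictionary standing in for manual answering, and all sample strings are assumptions made for this illustration.

```python
def generate_qa_records(questions_with_intents, qa_database, manual_answers=None):
    """Answer each question from the database for its intent, or mark it unanswered."""
    manual_answers = manual_answers or {}
    records = []
    for question, intent in questions_with_intents:
        answer = qa_database.get(intent, {}).get(question)
        status = "answered"
        if answer is None:
            # No matching question in the database: mark as unanswered,
            # then use a manually supplied answer if one is available.
            status = "unanswered"
            answer = manual_answers.get(question)
        records.append(
            {"question": question, "intent": intent, "answer": answer, "status": status}
        )
    return records


qa_database = {
    "confirm_solution": {
        "How should cerebral infarction be treated?": "Placeholder answer from the database."
    }
}
qa_records = generate_qa_records(
    [
        ("How should cerebral infarction be treated?", "confirm_solution"),
        ("Will high blood lipids cause cerebral infarction?", "confirm_solution"),
    ],
    qa_database,
    manual_answers={
        "Will high blood lipids cause cerebral infarction?": "Placeholder manual answer."
    },
)
```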
  • the pushing the question and answer record to the client includes:
  • a hot word list is generated by summarizing the word segmentations whose frequency in the chat records is greater than a preset threshold and sorting them by frequency. The hot word list includes a plurality of frequently mentioned word segmentations, which is convenient for subsequent targeted answers.
  • the hot word list also represents the common concerns of everyone.
  • Referring to FIG. 3, a functional block diagram of an apparatus for generating a question and answer record provided by an embodiment of the present application is shown.
  • the question and answer record generating apparatus 100 described in this application may be installed in an electronic device. According to the realized functions, the question and answer record generating apparatus 100 may include a word segmentation extraction module 101, a hot word list generation module 102, a record retrieval module 103, a training corpus generation module 104, a model training module 105, and a question and answer record generation module 106.
  • the modules described in this application may also be referred to as units, which refer to a series of computer program segments that can be executed by the processor of an electronic device and can perform fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • the word segmentation extraction module 101 is used to perform word segmentation processing on the acquired chat records, and count the frequency of occurrence of each word segmentation;
  • the hot word list generating module 102 is configured to summarize the word segments whose frequencies are greater than a preset threshold to obtain a hot word segment set, and sort the word segments in the hot word segment set according to the frequency to generate a hot word list;
  • the record retrieval module 103 is configured to select one of the word segmentations in turn according to the arrangement order of the word segmentations in the hot word list, and use the selected word segmentation as a search term to search in the chat records to obtain the question corresponding to the search term;
  • the training corpus generation module 104 is used to obtain the original question and answer data set, extract each process node in the original question and answer data set and the corpus data corresponding to the process node, and mark the corpus data with process nodes and merge them to obtain a training corpus;
  • the model training module 105 is configured to perform feature encoding on the training corpus to obtain a training corpus vector, and use the training corpus to train a preset multi-classification model to obtain a question intent classification model;
  • the question and answer record generating module 106 is configured to input the question obtained by retrieval into the question intent classification model to obtain the question intent, answer the question according to the question intent, generate a question and answer record, and push the question and answer record to the client.
  • each module of the question and answer record generating apparatus 100 is as follows:
  • Step 1 The word segmentation extraction module 101 performs word segmentation processing on the acquired chat records, and counts the frequency of occurrence of each word segmentation.
  • chat records may be acquired from a preset IM (Instant Messenger) system, where the IM system is a customizable communication system integrating a multi-person video conference function.
  • performing word segmentation processing on the acquired chat records includes:
  • the word segmentation is filtered from the word segmentation chat set.
  • the preset rule refers to removing special symbols and stop words in the chat record
  • the special symbols refer to symbols that are used infrequently and are difficult to input directly, such as mathematical symbols, unit symbols, and tabs, for example @, #, $, etc.
  • the stop words refer to words that have no actual meaning, such as "ah", "he", "de", etc.
  • the stop word list can be the "Harbin Institute of Technology Stop Word Thesaurus" or the "Sichuan University Machine Learning Intelligent Laboratory Stop Word Thesaurus".
  • a preset Jieba tokenizer may be used to perform word segmentation on the initial chat record to obtain a segmented chat set.
  • the preset keyword dictionary contains professional terms in a preset field. Filtering the word segmentations from the word segmentation chat set against this dictionary extracts keywords that fit the actual application scenario and avoids extracting useless data and creating data redundancy.
  • the chat record includes: "#Will high blood lipids cause cerebral infarction?" and "Doctor, how should cerebral infarction be treated?". The chat record is processed by removing special symbols and stop words, so that the two special symbols "#" and " ⁇ " in the chat record are removed, and the processed chat record is then subjected to word segmentation processing to obtain a word segmentation chat set: "Excuse me/high blood lipids/will/cause/cerebral infarction/?
  • before performing word segmentation processing on the acquired chat records, the word segmentation extraction module 101 is further configured to:
  • the chat record corresponding to the user is retained.
  • the identity verification of the user corresponding to the chat record checks whether the user is on a predetermined user list; only the speech of users on the user list is processed in subsequent steps.
  • the chat records of users who are not on the user list are not accepted, because such chat records have no practical reference value.
  • Step 2 The hot word list generation module 102 summarizes the word segmentations whose frequencies are greater than a preset threshold to obtain a popular word segmentation set, and sorts the word segmentations in the popular word segmentation set according to the frequency to generate a hot word list.
  • the preset threshold is the criterion for determining whether the word segment corresponding to a frequency is a popular word segment. If the frequency is greater than the preset threshold, the corresponding word segment is a popular word segment. If the frequency is less than or equal to the preset threshold, the word segment occurs relatively rarely and cannot be determined to be a popular word segment.
  • Step 3 Sort the word segments in the hot word segment set according to the frequency, and generate a hot word list.
  • the word segmentation in the popular word segmentation set is sorted according to the size of the frequency to generate a hot word list, including:
  • the preset number may be 10.
  • the purpose of intercepting the preset number of word segmentations in the initial list is to further screen the initial list.
  • the initial list is obtained.
  • the initial list may contain many low-frequency word segmentations, so a preset number of word segmentations in the initial list are intercepted to generate a hot word list.
  • the hot word list helps the host in the live broadcast room understand more intuitively the topics and questions the audience wants to know about, and the host can answer the questions on the hot word list in order from top to bottom.
  • Step 4 Select one of the word segmentations in turn according to the order of the word segmentations in the hot word list, and use the selected word segmentation as a search term to search in the chat record to obtain the question corresponding to the search term.
  • the record retrieval module 103 selects one of the word segmentations in turn according to the arrangement order of the word segmentations in the hot word list, and uses the selected word segmentation as a search word to search in the chat records to obtain the question corresponding to the search term, as follows:
  • one of the word segmentations is selected in turn according to the order of the word segmentations in the hot word list, the selected word segmentation is used as a search word, and the search word is vectorized to obtain a search word vector; chat keywords in the initial chat record are extracted, and the chat keywords are vectorized to obtain keyword vectors;
  • the similarity between the search term vector and the keyword vector is calculated, and the question corresponding to the keyword whose similarity is greater than or equal to a preset similarity threshold is selected as the question corresponding to the search term.
  • the initial chat record may include multiple chat keywords, and the chat keywords are medical terms in the medical field. The search term is compared with each of the multiple chat keywords; as long as one chat keyword in a sentence of the initial chat record matches, that sentence is used as the question corresponding to the search term.
  • when the question corresponding to a keyword whose similarity is greater than or equal to the preset similarity threshold is selected as the question corresponding to the search term, the question may be a chat record phrased as a question sentence. If the retrieved keyword appears in a chat record that is not a question sentence, that chat record is not treated as a question corresponding to the search term.
  • the search term is "cerebral infarction"
  • the initial chat record is: "Cerebral infarction and heart disease occur very frequently in modern society. Is there a truly effective method for treating cerebral infarction?"
  • "The truly effective ones are mainly the following three, including"
  • the chat keywords in the initial chat record are "cerebral infarction" and "heart disease". By calculating the similarity between the search term vector and the keyword vectors, the question in the initial chat record containing "cerebral infarction" may be used as the question corresponding to the search term.
  • many methods can be used to calculate the similarity between the search word vector and the keyword vector, including, but not limited to, the cosine similarity formula and the Euclidean distance.
  • the calculating the similarity between the search word vector and the keyword vector includes:
  • This embodiment of the present application may perform vectorization processing on the search term and the chat keyword according to a preset word2vec algorithm, to obtain the search term vector and the keyword vector.
  • Step 5 Obtain the original question and answer data set, extract each process node in the original question and answer data set and the corpus data corresponding to the process node, and mark and merge the process nodes of the corpus data to obtain training corpus.
  • the original question answering data set may be medical questions and corresponding answers contained in the business scenario of intelligent question answering.
  • each process node in the original question-and-answer data set may be one round of dialogue between the medical expert and the questioner, and the corresponding corpus data refers to the medical expert's answers in each round of dialogue.
  • the corpus data are marked with process nodes and then merged to obtain training corpus.
  • Step 6 Perform feature coding on the training corpus to obtain a training corpus vector, and use the training corpus to train a preset multi-classification model to obtain a problem intent classification model.
  • one-hot encoding, target encoding, Bayesian target encoding, etc. may be used to perform feature encoding on the training corpus; the present application uses preset one-hot encoding to perform feature encoding on the training corpus and obtain the training corpus vector.
  • performing feature encoding on the training corpus to obtain a training corpus vector including:
  • the initial matrix vector is constructed and obtained;
  • the entry in the column of the initial matrix vector that corresponds to the training corpus is set to a first numerical value, and the entries in the other columns are set to a second numerical value, to obtain the training corpus vector.
  • For example, if the total number of corpus items is 5, that is, there are five rounds of dialogue, the training corpus vector of the first round of training corpus is [1, 0, 0, 0, 0].
  • the preset multi-classification model may be a shallow neural network or a support vector machine model.
  • the question intent classification model obtained by training the multi-classification model can identify the intent corresponding to each question.
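  • training the preset multi-classification model with the training corpus to obtain the question intent classification model includes:
  • classifying the training corpus by using the preset multi-classification model to obtain one or more classification intentions;
  • labeling the intent categories in the original question and answer data set, and calculating the degree of overlap between the intent categories and the classification intentions;
  • when the degree of overlap is less than a preset classification threshold, iteratively updating the preset classification model and classifying the final representation vector again;
  • when the degree of overlap is greater than or equal to the preset classification threshold, obtaining the question intent classification model.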
  • the embodiment of the present application labels the intent categories in the original question and answer data set by judging the actual intent of the medical expert's answer in each round of dialogue between the medical expert and the questioner.
  • for example, the first round of dialogue is: questioner: "Can you answer questions related to cerebral infarction?", medical expert: "Yes, I can."; the intent category of the first round of dialogue is therefore confirming the medical expert's field of answers.
  • the second round of dialogue is: questioner: "What are the effective treatment options for cerebral infarction?", medical expert: "Currently, there are mainly the following common treatment options for cerebral infarction. First, ..."; the intent category of the second round of dialogue is therefore confirming the solution to a specific problem, and so on.
  • Step 7: The question and answer record generation module 106 inputs the retrieved question into the question intent classification model to obtain the question intent, answers the question according to the question intent, generates a question and answer record, and pushes the question and answer record to the client.
  • the question obtained by the retrieval is input into the question intent classification model to obtain the question intent.
  • the question intent gives a rough estimate of the scope and field of the question, and answering according to the different user intents can improve the efficiency of answering questions.
  • the question obtained by the retrieval is answered according to the question intention to obtain the corresponding answer, and the question and the corresponding answer are aggregated to generate the question and answer record.
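  • answering the question and generating a question and answer record according to the question intent includes:
  • selecting the corresponding preset question and answer database according to the question intent, matching the question against the question and answer database, and determining whether the question matches a question in the question and answer database;
  • if the question matches a question in the question and answer database, using the answer corresponding to that question in the database as the answer to the question, and generating a question and answer record based on the question and the answer;
  • if the question does not match any question in the question and answer database, marking the question as an unanswered question and answering it to obtain the answer to the unanswered question;
  • generating a question and answer record based on the unanswered question and the answer to the unanswered question.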
  • the question and answer database includes some common questions and answers corresponding to the common questions.
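  • if the question does not match any question in the question and answer database, the embodiment of the present application marks the question as an unanswered question; because no matching question can be found for it in the question and answer database, it can be answered manually.
  • pushing the question and answer record to the client includes:
  • transmitting the question and answer record to a data push engine according to the transmission file of the question and answer record;
  • pushing the question and answer record to the client by using the data push engine.
  • FIG. 4 is a schematic structural diagram of an electronic device for implementing the method for generating a question and answer record provided by an embodiment of the present application.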
  • the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as a question and answer record generation program 12.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc, etc.; the computer-readable storage medium may be non-volatile or volatile.
  • in some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a mobile hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash memory card (Flash Card) equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the question and answer record generation program 12, etc., but also can be used to temporarily store data that has been output or will be output.
  • in some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same function or different functions, including a combination of one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, and various control chips.
  • the processor 10 is the control core (Control Unit) of the electronic device; it uses various interfaces and lines to connect the components of the entire electronic device, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (such as the question and answer record generation program) and calling the data stored in the memory 11.
  • the bus may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (Extended industry standard architecture, EISA for short) bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 4 only shows an electronic device with some components. Those skilled in the art can understand that the structure shown in FIG. 4 does not constitute a limitation on the electronic device 1, which may include fewer or more components than those shown, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power supply (such as a battery) for powering the various components; preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may further include a user interface, which may be a display (Display) or an input unit (e.g., a keyboard (Keyboard)); optionally, the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch panel, and the like.
  • the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the question and answer record generation program 12 stored in the memory 11 in the electronic device 1 is a combination of multiple instructions, and when running in the processor 10, can realize:
  • Select one of the word segmentations in turn according to the arrangement order of the word segmentations in the hot word list, and use the selected word segmentation as a search term to search in the chat record to obtain the question corresponding to the search term;
  • the question obtained by retrieval is input into the question intent classification model to obtain a question intent, the question is answered according to the question intent, a question and answer record is generated, and the question and answer record is pushed to the client.
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM, Read-Only Memory).
  • the present application also provides a computer-readable storage medium, where the readable storage medium stores a computer program, and when executed by a processor of an electronic device, the computer program can realize:
  • Select one of the word segmentations in turn according to the arrangement order of the word segmentations in the hot word list, and use the selected word segmentation as a search term to search in the chat record to obtain the question corresponding to the search term;
  • the question obtained by retrieval is input into the question intent classification model to obtain the question intent, the question is answered according to the question intent, a question and answer record is generated, and the question and answer record is pushed to the client.
  • modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods, where each data block contains a batch of network transaction information that is used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Abstract

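This application relates to data analysis technology and discloses a question and answer record generation method. The method includes: extracting the word segments in a chat record and calculating the frequency with which each word segment occurs; aggregating the word segments whose frequency is greater than a preset threshold to obtain a hot word-segment set, sorting the word segments in that set, and generating a hot word list from the sorted order; selecting the word segments in the hot word list one at a time, and using each selected word segment as a search term to search the chat record and obtain the question corresponding to the search term; classifying the intent of each retrieved question, answering it, generating a question and answer record, and pushing the question and answer record to the user terminal. In addition, this application relates to blockchain technology: the hot word-segment set may be stored in a node of a blockchain. This application also proposes a question and answer record generation apparatus, an electronic device, and a computer-readable storage medium. This application can solve the problem of low efficiency when question and answer records are generated by matching questions against a question-and-answer database.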

Description

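Question and answer record generation method, apparatus, electronic device and storage medium
This application claims priority to the Chinese patent application No. 202110429297.3, filed with the China Patent Office on April 21, 2021 and entitled "Question and answer record generation method, apparatus, electronic device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a question and answer record generation method, apparatus, electronic device, and computer-readable storage medium.
Background
With the rapid development of science and technology, the inventor has realized that whether in a live-streaming room that relies heavily on manual operation, in a teacher's classroom, or in everyday communication, raising questions and answering them is of central importance, and the resulting question and answer records can also serve as a reference for later communication.
Existing methods for generating question and answer records usually match each acquired question directly against an existing question-and-answer database and reply. They do not take into account how frequently a question occurs, so the same question, or the same type of question, is matched repeatedly, and generating question and answer records this way is inefficient.
Summary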
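A question and answer record generation method provided by this application includes:
performing word segmentation on an acquired chat record, and counting the frequency with which each word segment occurs;
aggregating the word segments whose frequency is greater than a preset threshold to obtain a hot word-segment set;
sorting the word segments in the hot word-segment set by frequency to generate a hot word list;
selecting the word segments one at a time in the order in which they appear in the hot word list, and using each selected word segment as a search term to search the chat record, to obtain the question corresponding to the search term;
acquiring an original question-and-answer data set, extracting each process node in the original question-and-answer data set and the corpus data corresponding to the process node, and labeling and merging the process nodes of the corpus data to obtain a training corpus;
performing feature encoding on the training corpus to obtain training corpus vectors, and training a preset multi-classification model with the training corpus to obtain a question intent classification model;
inputting the retrieved question into the question intent classification model to obtain a question intent, answering the question according to the question intent and generating a question and answer record, and pushing the question and answer record to a client.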
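This application also provides a question and answer record generation apparatus, the apparatus including:
a word segment extraction module, configured to perform word segmentation on an acquired chat record and count the frequency with which each word segment occurs;
a hot word list generation module, configured to aggregate the word segments whose frequency is greater than a preset threshold to obtain a hot word-segment set, and to sort the word segments in the hot word-segment set by frequency to generate a hot word list;
a record retrieval module, configured to select the word segments one at a time in the order in which they appear in the hot word list, and to use each selected word segment as a search term to search the chat record, to obtain the question corresponding to the search term;
a training corpus generation module, configured to acquire an original question-and-answer data set, extract each process node in the original question-and-answer data set and the corpus data corresponding to the process node, and label and merge the process nodes of the corpus data to obtain a training corpus;
a model training module, configured to perform feature encoding on the training corpus to obtain training corpus vectors, and to train a preset multi-classification model with the training corpus to obtain a question intent classification model;
a question and answer record generation module, configured to input the retrieved question into the question intent classification model to obtain a question intent, answer the question according to the question intent and generate a question and answer record, and push the question and answer record to a client.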
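This application also provides an electronic device, the electronic device including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the following question and answer record generation method:
performing word segmentation on an acquired chat record, and counting the frequency with which each word segment occurs;
aggregating the word segments whose frequency is greater than a preset threshold to obtain a hot word-segment set;
sorting the word segments in the hot word-segment set by frequency to generate a hot word list;
selecting the word segments one at a time in the order in which they appear in the hot word list, and using each selected word segment as a search term to search the chat record, to obtain the question corresponding to the search term;
acquiring an original question-and-answer data set, extracting each process node in the original question-and-answer data set and the corpus data corresponding to the process node, and labeling and merging the process nodes of the corpus data to obtain a training corpus;
performing feature encoding on the training corpus to obtain training corpus vectors, and training a preset multi-classification model with the training corpus to obtain a question intent classification model;
inputting the retrieved question into the question intent classification model to obtain a question intent, answering the question according to the question intent and generating a question and answer record, and pushing the question and answer record to a client.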
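This application also provides a computer-readable storage medium storing at least one instruction, the at least one instruction being executed by a processor in an electronic device to implement the following question and answer record generation method:
performing word segmentation on an acquired chat record, and counting the frequency with which each word segment occurs;
aggregating the word segments whose frequency is greater than a preset threshold to obtain a hot word-segment set;
sorting the word segments in the hot word-segment set by frequency to generate a hot word list;
selecting the word segments one at a time in the order in which they appear in the hot word list, and using each selected word segment as a search term to search the chat record, to obtain the question corresponding to the search term;
acquiring an original question-and-answer data set, extracting each process node in the original question-and-answer data set and the corpus data corresponding to the process node, and labeling and merging the process nodes of the corpus data to obtain a training corpus;
performing feature encoding on the training corpus to obtain training corpus vectors, and training a preset multi-classification model with the training corpus to obtain a question intent classification model;
inputting the retrieved question into the question intent classification model to obtain a question intent, answering the question according to the question intent and generating a question and answer record, and pushing the question and answer record to a client.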
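Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a question and answer record generation method provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of one of the steps of the question and answer record generation method shown in FIG. 1;
FIG. 3 is a functional module diagram of a question and answer record generation apparatus provided by an embodiment of this application;
FIG. 4 is a schematic structural diagram of an electronic device that implements the question and answer record generation method, provided by an embodiment of this application.
The realization of the objectives, the functional features, and the advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.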
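Detailed Description
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
An embodiment of this application provides a question and answer record generation method. The method may be executed by, without limitation, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiments of this application. In other words, the method may be executed by software or hardware installed on a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server, or a cloud server cluster.
FIG. 1 is a schematic flowchart of a question and answer record generation method provided by an embodiment of this application. In this embodiment, the question and answer record generation method includes: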
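S1. Perform word segmentation on the acquired chat record, and count the frequency with which each word segment occurs.
In this embodiment of the application, the chat record may be acquired from a preset IM (Instant Messenger) system, where the IM system is a customizable communication system that integrates multi-party video conferencing.
In one embodiment of this application, performing word segmentation on the acquired chat record includes:
preprocessing the chat record according to a preset rule to obtain an initial chat record;
performing word segmentation on the initial chat record with a word segmentation tool to obtain a segmented chat set;
selecting word segments from the segmented chat set according to a preset keyword dictionary.
In this embodiment, the preset rule is to remove special symbols and stop words from the chat record. Special symbols are rarely used symbols that are hard to type directly, such as mathematical symbols, unit symbols, and tab characters, for example @, #, and ¥. Stop words are words that carry no real meaning, such as "啊", "呵", and "的". Stop words can be removed with reference to a preset stop-word list, which may be the acquired "哈工大停用词词库" (Harbin Institute of Technology stop-word lexicon) and "四川大学机器学习智能实验室停用词词库" (Sichuan University Machine Learning Intelligence Laboratory stop-word lexicon).
One embodiment of this application may use a preset Jieba tokenizer to segment the initial chat record and obtain the segmented chat set.
Further, in this embodiment, the preset keyword dictionary contains the technical terms of a preset domain. Word segments are selected from the segmented chat set in order to extract keywords that fit the actual application scenario and to avoid extracting useless, redundant data.
For example, suppose the chat record includes "#请问高血脂会导致脑¥梗吗?" ("#Excuse me, can hyperlipidemia cause cerebral ¥infarction?") and "医生,脑梗应该怎么治疗?" ("Doctor, how should cerebral infarction be treated?"). Special symbols and stop words are removed from the chat record, which removes the two special symbols "#" and "¥". The processed chat record is then segmented, giving the segmented chat set "请问/高血脂/会/导致/脑梗/吗?" and "医生/脑梗/应该/怎么/治疗?". The frequency of each word segment in the segmented chat set is then counted, and each word segment is checked against the preset keyword dictionary. Here "脑梗" (cerebral infarction) occurs 2 times and "高血脂" (hyperlipidemia) occurs 1 time.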
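The preprocessing, segmentation, and counting described in S1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the forward-maximum-matching tokenizer is a hypothetical stand-in for the Jieba tokenizer mentioned above, and `SPECIAL_SYMBOLS`, `STOP_WORDS`, `VOCAB`, and `KEYWORD_DICT` are made-up toy values.

```python
from collections import Counter

# Toy values assumed for illustration only; real lists would be much larger.
SPECIAL_SYMBOLS = set("@#¥")
STOP_WORDS = {"啊", "呵", "的"}
VOCAB = {"请问", "高血脂", "会", "导致", "脑梗", "吗", "医生", "应该", "怎么", "治疗"}
KEYWORD_DICT = {"高血脂", "脑梗"}  # domain terms worth counting


def preprocess(message: str) -> str:
    """Remove special symbols to obtain the 'initial chat record'."""
    return "".join(ch for ch in message if ch not in SPECIAL_SYMBOLS)


def segment(text: str, vocab: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching: a simple stand-in for a real tokenizer."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def keyword_frequencies(messages: list[str]) -> Counter:
    """Count how often each dictionary keyword occurs in the chat record."""
    counts: Counter = Counter()
    for message in messages:
        tokens = segment(preprocess(message), VOCAB)
        counts.update(
            t for t in tokens if t not in STOP_WORDS and t in KEYWORD_DICT
        )
    return counts


chat_record = ["#请问高血脂会导致脑¥梗吗?", "医生,脑梗应该怎么治疗?"]
frequencies = keyword_frequencies(chat_record)
```

With these toy values, `frequencies` counts "脑梗" twice and "高血脂" once. Filtering against the keyword dictionary after segmentation keeps the counts focused on domain terms rather than on common function words.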
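In addition, in another embodiment of this application, before word segmentation is performed on the acquired chat record, the method further includes:
identifying the user corresponding to the chat record;
determining whether the user passes identity verification;
if the user does not pass the identity verification, deleting the user's chat record;
if the user passes the identity verification, retaining the chat record corresponding to the user.
In detail, the identity verification checks whether the user is on a predefined user list. Only statements from users on the user list can serve as chat records for subsequent processing; statements from users who are not on the list are not adopted, because chat records obtained from them have no practical reference value.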
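The identity check above amounts to filtering messages against an allowlist. The sketch below assumes a hypothetical `ChatMessage` type and user identifiers; the application does not specify how users or messages are represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatMessage:
    user_id: str  # hypothetical identifier of the sender
    text: str


def keep_verified(
    messages: list[ChatMessage], allowed_users: set[str]
) -> list[ChatMessage]:
    """Keep only messages whose sender is on the predefined user list."""
    return [m for m in messages if m.user_id in allowed_users]


messages = [
    ChatMessage("u1", "医生,脑梗应该怎么治疗?"),
    ChatMessage("guest", "hello"),
]
verified = keep_verified(messages, allowed_users={"u1", "u2"})
```

Messages from senders outside the list are dropped before segmentation, so they never contribute to the frequency counts.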
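S2. Aggregate the word segments whose frequency is greater than the preset threshold to obtain the hot word-segment set.
In this embodiment, the preset threshold is the criterion for deciding whether the word segment corresponding to a frequency is a hot word segment. If the frequency is greater than the preset threshold, the corresponding word segment is a hot word segment; if the frequency is less than or equal to the preset threshold, the word segment occurs relatively rarely and cannot be classed as a hot word segment.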
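S3. Sort the word segments in the hot word-segment set by frequency to generate the hot word list.
In one embodiment of this application, sorting the word segments in the hot word-segment set by frequency to generate the hot word list includes:
acquiring the word segments in the hot word-segment set and the frequency corresponding to each word segment;
arranging the word segments in descending order of frequency to obtain an initial list;
taking a preset number of word segments from the initial list to generate the hot word list.
The preset number may be 10.
In detail, the preset number of word segments is taken from the initial list in order to filter the initial list further. Although the word segments whose frequency is greater than the preset threshold have already been aggregated and sorted into the initial list, that list may still contain many relatively low-frequency word segments, so only the preset number of word segments is taken from it to generate the hot word list. In one application scenario of the invention, the hot word list lets the host of a live-streaming room see more directly which topics and questions the audience wants to know about, and the host can answer the questions on the hot word list in order from top to bottom.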
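Steps S2 and S3 together can be sketched as a filter followed by a sort and a truncation. The function name, the sample frequencies, and the alphabetical tie-break are assumptions made for illustration; the application only specifies the threshold test, the descending order, and the preset number.

```python
from collections import Counter


def build_hot_word_list(
    frequencies: Counter, threshold: int, top_n: int = 10
) -> list[tuple[str, int]]:
    """Keep segments with frequency > threshold (S2), then rank them (S3)."""
    hot_set = {word: freq for word, freq in frequencies.items() if freq > threshold}
    # Descending frequency; ties broken by the word itself (an assumption).
    ranked = sorted(hot_set.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


frequencies = Counter({"脑梗": 12, "高血脂": 7, "感冒": 3, "心脏病": 5})
hot_words = build_hot_word_list(frequencies, threshold=3, top_n=2)
```

Here "感冒" is excluded because its frequency equals the threshold rather than exceeding it, and "心脏病" is cut off by `top_n=2`, leaving "脑梗" and "高血脂" in that order.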
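S4. Select the word segments one at a time in the order in which they appear in the hot word list, and use each selected word segment as a search term to search the chat record, to obtain the question corresponding to the search term.
In this embodiment, referring to FIG. 2, this step includes:
S401. Traverse the hot word list, selecting the word segments one at a time in the order in which they appear, use each selected word segment as the search term, and vectorize the search term to obtain a search term vector;
S402. Extract the chat keywords from the initial chat record, and vectorize the chat keywords to obtain keyword vectors;
S403. Calculate the similarity between the search term vector and each keyword vector, and select the question corresponding to any keyword whose similarity is greater than or equal to a preset similarity threshold as the question corresponding to the search term.
Specifically, in this embodiment, the initial chat record may contain multiple chat keywords, which are medical terms from the medical domain. The similarity between the search term and each of the chat keywords is calculated, and as long as one chat keyword in a sentence of the initial chat record qualifies, that sentence is taken as the question corresponding to the search term.
In detail, when this embodiment selects the question corresponding to a keyword whose similarity is greater than or equal to the preset similarity threshold, a question may be a chat record that is phrased as an interrogative sentence. If the retrieved keyword appears in a chat record that is not an interrogative sentence, that chat record is not classed as a question corresponding to the search term.
For example, the search term is "脑梗" (cerebral infarction), and the initial chat record is: "Cerebral infarction and heart disease occur very frequently in people in modern society. Do truly effective treatments for cerebral infarction actually exist? ... but the truly effective ones are mainly the following three, including ...". The chat keywords in the initial chat record are "脑梗" and "心脏病" (heart disease). By calculating the similarity between the search term vector and the keyword vectors, the question in the initial chat record that contains "脑梗" can be taken as the question corresponding to the search term.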
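The retrieval in S401 to S403 can be sketched as below. Everything concrete here is an assumption for illustration: the two-dimensional vectors in `EMBEDDINGS` are made-up stand-ins for word2vec embeddings, keywords are found by substring lookup, and a sentence counts as a question when it ends with a question mark.

```python
import math
import re

# Toy 2-d vectors standing in for word2vec embeddings (values are made up).
EMBEDDINGS = {
    "脑梗": [0.9, 0.1],
    "心脏病": [0.1, 0.9],
}


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def split_sentences(text: str) -> list[str]:
    """Split after sentence-ending punctuation, keeping the punctuation."""
    return [s for s in re.findall(r"[^。！？!?]+[。！？!?]?", text) if s.strip()]


def is_question(sentence: str) -> bool:
    return sentence.rstrip().endswith(("？", "?"))


def retrieve_questions(
    search_term: str, chat_text: str, threshold: float = 0.95
) -> list[str]:
    """Return interrogative sentences containing a sufficiently similar keyword."""
    query = EMBEDDINGS[search_term]
    hits = []
    for sentence in split_sentences(chat_text):
        if not is_question(sentence):
            continue  # keywords in non-questions are ignored
        keywords = [k for k in EMBEDDINGS if k in sentence]
        if any(cosine(query, EMBEDDINGS[k]) >= threshold for k in keywords):
            hits.append(sentence)
    return hits


chat_text = "脑梗和心脏病很常见。治疗脑梗的有效方法存在吗?心脏病会遗传吗?"
questions = retrieve_questions("脑梗", chat_text)
```

Only the second sentence is returned: the first mentions "脑梗" but is not a question, and the third is a question whose only keyword, "心脏病", falls below the similarity threshold.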
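This embodiment may use many methods to calculate the similarity between the search term vector and the keyword vector, including but not limited to the cosine similarity formula and the Euclidean distance.
Optionally, in one embodiment of this application, calculating the similarity between the search term vector and the keyword vector includes:
calculating the similarity between the search term vector and the keyword vector with the following formula:
$$\cos(a,b)=\frac{a\cdot b}{|a|\,|b|}$$
where cos(a, b) is the similarity, a is the search term vector, b is the keyword vector, and |a| and |b| are the moduli of the search term vector and the keyword vector, respectively.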
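The cosine similarity defined above, the dot product of a and b divided by the product of their moduli, can be computed directly. The sketch below is a plain-Python version; returning 0.0 for a zero vector is an assumption added here to avoid division by zero and is not part of the application.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical directions give a similarity of 1 and orthogonal vectors give 0, so a threshold close to 1 selects keywords whose vectors point almost the same way as the search term vector.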
This embodiment may vectorize the search term and the chat keywords with a preset word2vec algorithm to obtain the search term vector and the keyword vectors.
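S5. Acquire an original question-and-answer data set, extract each process node in the original question-and-answer data set and the corpus data corresponding to the process node, and label and merge the process nodes of the corpus data to obtain a training corpus.
In this embodiment, the original question-and-answer data set may be the medical questions and corresponding answers contained in an intelligent question-answering business scenario.
In detail, each process node in the original question-and-answer data set may be one round of dialogue between the medical expert and the questioner, and the corresponding corpus data is the medical expert's answer in that round of dialogue.
This embodiment labels the corpus data with their process nodes and then merges them to obtain the training corpus.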
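S6. Perform feature encoding on the training corpus to obtain training corpus vectors, and train a preset multi-classification model with the training corpus to obtain a question intent classification model.
In this embodiment, the training corpus may be feature-encoded with methods such as one-hot encoding, target encoding, or Bayesian target encoding. This application uses preset one-hot encoding to feature-encode the training corpus and obtain the training corpus vectors.
Specifically, performing feature encoding on the training corpus to obtain the training corpus vectors includes:
counting the training corpus items in the original question-and-answer data set to obtain the total number of corpus items;
constructing an initial matrix vector in which the training corpus items give the rows of a preset matrix and the total number of corpus items gives its number of columns;
setting the position of the column corresponding to each training corpus item in the initial matrix vector to a first value and the remaining columns to a second value, to obtain the training corpus vector.
For example, if the total number of corpus items is 5, that is, if there are five rounds of dialogue, the training corpus vector of the first round is [1,0,0,0,0].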
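The one-hot construction above can be sketched in a few lines. The function name and the defaults of 1 and 0 for the first and second values are assumptions chosen to match the five-round example.

```python
def one_hot_encode(
    num_items: int, first_value: int = 1, second_value: int = 0
) -> list[list[int]]:
    """Row i holds first_value in column i and second_value everywhere else."""
    return [
        [first_value if col == row else second_value for col in range(num_items)]
        for row in range(num_items)
    ]


vectors = one_hot_encode(5)
first_round_vector = vectors[0]
```

Each row is the training corpus vector for one round of dialogue, so with five rounds `first_round_vector` is `[1, 0, 0, 0, 0]`.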
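In this embodiment, the preset multi-classification model may be a shallow neural network or a support vector machine model. The question intent classification model obtained by training the multi-classification model can identify the intent corresponding to each question.
In detail, training the preset multi-classification model with the training corpus to obtain the question intent classification model includes:
classifying the training corpus with the preset multi-classification model to obtain one or more classified intents;
labeling the intent categories in the original question-and-answer data set, and calculating the degree of overlap between the intent categories and the classified intents;
when the degree of overlap is less than a preset classification threshold, iteratively updating the preset classification model and classifying the final representation vectors again;
when the degree of overlap is greater than or equal to the preset classification threshold, obtaining the question intent classification model.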
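The train-until-overlap loop above can be illustrated with a toy classifier. This sketch swaps in a multiclass perceptron over whitespace tokens in place of the shallow neural network or support vector machine named in the application, and it reads "degree of overlap" as the fraction of samples whose predicted intent equals the labeled intent; both choices, and the intent names, are assumptions.

```python
from collections import defaultdict


def featurize(text: str) -> list[str]:
    return text.lower().split()


def predict(weights, labels, text):
    scores = {lab: sum(weights[lab][tok] for tok in featurize(text)) for lab in labels}
    return max(labels, key=lambda lab: scores[lab])


def overlap(weights, labels, samples) -> float:
    """Fraction of samples whose predicted intent equals the labeled intent."""
    correct = sum(predict(weights, labels, text) == intent for text, intent in samples)
    return correct / len(samples)


def train_intent_classifier(samples, threshold=1.0, max_rounds=50):
    labels = sorted({intent for _, intent in samples})
    weights = {lab: defaultdict(float) for lab in labels}
    for _ in range(max_rounds):
        if overlap(weights, labels, samples) >= threshold:
            break  # overlap reached the classification threshold
        for text, intent in samples:  # one perceptron update pass
            guess = predict(weights, labels, text)
            if guess != intent:
                for tok in featurize(text):
                    weights[intent][tok] += 1.0
                    weights[guess][tok] -= 1.0
    return weights, labels


samples = [
    ("can you answer questions about cerebral infarction", "confirm_field"),
    ("what are the treatment options for cerebral infarction", "confirm_solution"),
    ("can you answer questions about heart disease", "confirm_field"),
    ("what are the treatment options for heart disease", "confirm_solution"),
]
weights, labels = train_intent_classifier(samples)
```

Training stops as soon as the overlap meets the threshold, which mirrors the stopping rule in the steps above; `max_rounds` is an added safeguard for data that never reaches it.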
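In detail, this embodiment labels the intent categories in the original question-and-answer data set by judging the actual intent of the medical expert's answer in each round of dialogue between the medical expert and the questioner. For example, the first round of dialogue is: questioner: "Can you answer questions related to cerebral infarction?", medical expert: "Yes, I can.", so the intent category of the first round is confirming the medical expert's field of answers. The second round is: questioner: "What are the effective treatment options for cerebral infarction?", medical expert: "The common treatment options for cerebral infarction currently include the following. First, ...", so the intent category of the second round is confirming the solution to a specific problem, and so on.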
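S7. Input the retrieved question into the question intent classification model to obtain a question intent, answer the question according to the question intent and generate a question and answer record, and push the question and answer record to a client.
In this embodiment, the retrieved question is input into the question intent classification model to obtain the question intent. The question intent gives a rough estimate of the scope and field the question belongs to, and answering according to the different user intents improves the efficiency of answering questions.
The retrieved question is answered according to the question intent to obtain the corresponding answer, and the question and its answer are combined to generate the question and answer record.
Specifically, answering the question according to the question intent and generating the question and answer record includes:
selecting the corresponding preset question-and-answer database according to the question intent, matching the question against the question-and-answer database, and determining whether the question matches a question in the question-and-answer database;
if the question matches a question in the question-and-answer database, taking the answer corresponding to that question in the database as the answer to the question, and generating the question and answer record from the question and the answer;
if the question does not match any question in the question-and-answer database, marking the question as an unanswered question and answering it to obtain the answer to the unanswered question;
generating the question and answer record from the unanswered question and its answer.
Different question intents correspond to different question-and-answer databases, and each question-and-answer database contains common questions for the relevant intent and the answers to those common questions.
In detail, if the question does not match any question in the question-and-answer database, this embodiment marks the question as an unanswered question. Because no matching question can be found for it in the question-and-answer database, the answer can be obtained by manual answering.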
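The intent-routed lookup with a manual fallback can be sketched as follows. The database contents, the intent names, the exact-match comparison, and the `manual_answer` callback are all hypothetical; the application does not specify how matching is performed or how manual answers are collected.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QARecord:
    question: str
    answer: str
    source: str  # "database" or "manual"


# Hypothetical per-intent databases of common questions and placeholder answers.
QA_DATABASES: dict[str, dict[str, str]] = {
    "confirm_solution": {"How is cerebral infarction treated?": "placeholder answer A"},
    "confirm_field": {},
}


def answer_question(
    question: str, intent: str, manual_answer: Callable[[str], str]
) -> QARecord:
    """Look the question up in the database selected by its intent."""
    database = QA_DATABASES.get(intent, {})
    if question in database:  # exact match stands in for the matching step
        return QARecord(question, database[question], "database")
    # No match: the question is marked unanswered and handed to a human.
    return QARecord(question, manual_answer(question), "manual")


matched = answer_question(
    "How is cerebral infarction treated?", "confirm_solution", lambda q: "manual reply"
)
unmatched = answer_question(
    "Is heart disease hereditary?", "confirm_solution", lambda q: "manual reply"
)
```

Selecting the database by intent first means each lookup only scans the questions relevant to that intent, which is where the efficiency gain described above comes from.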
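Specifically, pushing the question and answer record to the client includes:
transmitting the question and answer record to a data push engine according to the transmission file of the question and answer record;
pushing the question and answer record to the client with the data push engine.
This application aggregates the word segments in the chat record whose frequency is greater than the preset threshold and sorts them by frequency to generate a hot word list. The hot word list contains multiple frequently mentioned word segments, which makes targeted answers easier later on, and it also represents the questions of common concern. A question intent classification model is trained to classify the intent of the questions retrieved with the word segments. The question intent gives a rough estimate of the scope and field the question belongs to, and answering according to the different user intents improves the efficiency of answering questions. The question and answer record generation method proposed by this application can therefore solve the problem of low efficiency when question and answer records are generated by matching questions against a question-and-answer database.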
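FIG. 3 is a functional module diagram of a question and answer record generation apparatus provided by an embodiment of this application.
The question and answer record generation apparatus 100 described in this application may be installed in an electronic device. According to the functions implemented, the apparatus 100 may include a word segment extraction module 101, a hot word list generation module 102, a record retrieval module 103, a training corpus generation module 104, a model training module 105, and a question and answer record generation module 106. A module, which may also be called a unit, is a series of computer program segments that can be executed by the processor of the electronic device to perform a fixed function and that are stored in the memory of the electronic device.
In this embodiment, the functions of the modules/units are as follows:
the word segment extraction module 101 is configured to perform word segmentation on an acquired chat record and count the frequency with which each word segment occurs;
the hot word list generation module 102 is configured to aggregate the word segments whose frequency is greater than a preset threshold to obtain a hot word-segment set, and to sort the word segments in the hot word-segment set by frequency to generate a hot word list;
the record retrieval module 103 is configured to select the word segments one at a time in the order in which they appear in the hot word list, and to use each selected word segment as a search term to search the chat record, to obtain the question corresponding to the search term;
the training corpus generation module 104 is configured to acquire an original question-and-answer data set, extract each process node in the original question-and-answer data set and the corpus data corresponding to the process node, and label and merge the process nodes of the corpus data to obtain a training corpus;
the model training module 105 is configured to perform feature encoding on the training corpus to obtain training corpus vectors, and to train a preset multi-classification model with the training corpus to obtain a question intent classification model;
the question and answer record generation module 106 is configured to input the retrieved question into the question intent classification model to obtain a question intent, answer the question according to the question intent and generate a question and answer record, and push the question and answer record to a client.
In detail, the modules of the question and answer record generation apparatus 100 are implemented as follows: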
步骤一、所述分词提取模块101对获取的聊天记录进行分词处理,并统计每个分词出 现的频率。
本申请实施例中,可以从预设的IM(Instant Messenger)系统中获取到聊天记录,其中,所述IM系统是可定制的、集成多人视频会议功能的通信系统。
本申请其中一个实施例中,所述对获取的聊天记录进行分词处理,包括:
按照预设规则对所述聊天记录进行预处理,得到初始聊天记录;
利用分词工具对所述初始聊天记录进行分词处理,得到分词聊天集;
根据预设的关键词词典,从所述分词聊天集中筛选出分词。
其中,在本申请实施例中,所述预设规则是指去除所述聊天记录中的特殊符号和停用词,所述特殊符号是指一些使用频率较低,且难以直接输入的数学符号、单位符号和制表符,例如,@、#、¥等。所述停用词是指一些没有实际含义的词语,例如:“啊”、“呵”、“的”等词语,去除聊天记录中的停用词可以参考预设的停用词表,所述停用词表可以为获取到的“哈工大停用词词库”和“四川大学机器学习智能实验室停用词词库”。
本申请其中一个实施例可以利用预设的Jieba分词器对所述初始聊天记录进行分词处理,得到分词聊天集。
进一步地,在本申请实施例中,所述预设的关键词词典中包含预设领域的专业术语,从所述分词聊天集中筛选出分词是为了提取出贴合实际应用场景的关键词,避免提取出无用的数据且造成数据冗余。
例如,所述聊天记录包括:“#请问高血脂会导致脑¥梗吗?”,“医生,脑梗应该怎么治疗?”,对所述聊天记录进行去除特殊符号处理和去除停用词处理,将所述聊天记录中“#”和“¥”这两个特殊符号进行去除,对处理后的聊天记录进行分词处理,得到分词聊天集:“请问/高血脂/会/导致/脑梗/吗?”,“医生/脑梗/应该/怎么/治疗?”,进一步地统计所述分词聊天集中各个分词出现的频率,并判断其是否出现在预设的关键词词典中,其中,“脑梗”出现的频率为2次,“血脂”出现的频率为1次。
此外,本申请另一个实施例中,在所述分词提取模块101对获取的聊天记录进行分词处理之前,所述分词提取模块101还用于:
识别出所述聊天记录对应的用户;
判断所述用户是否通过身份校验;
若所述用户未通过所述身份校验,则删除所述用户的聊天记录;
若所述用户通过所述身份校验,则保留所述用户对应的聊天记录。
详细地,对所述聊天记录对应的用户进行所述身份校验是为了核对所述用户是否在预先规定好的用户名单上,只有在所述用户名单上的用户的发言才能够作为后续进行处理的聊天记录,没有在所述用户名单上的用户发言将不会被采纳,此时获取到的的聊天记录没有实际参考意义。
步骤二、所述热词榜单生成模块102对所述频率大于预设阈值的分词进行汇总,得到热门分词集,得到热门分词集,并对所述热门分词集中的分词进行排序处理,根据所述排序生成热词榜单。
本申请实施例中,所述预设阈值是判定所述频率的对应分词是否是热门分词的标准,若所述频率大于预设阈值,则所述频率对应的分词为热门分词,若所述频率小于或者等于所述预设阈值,则所述频率对应的分词出现的次数相对较少,无法判定为热门分词。
步骤三、对所述热门分词集中的分词按照频率的大小进行排序,生成热词榜单。
本申请其中一个实施例中,所述对所述热门分词集中的分词按照频率的大小进行排序,生成热词榜单,包括:
获取所述热门分词集中的分词和所述分词的对应频率;
对所述分词按照其对应频率从大到小的顺序进行排列处理,得到初始榜单;截取所述初始榜单中预设个数的分词,生成热词榜单。
其中,所述预设个数可以为10。
详细地,截取所述初始榜单中预设个数的分词是为了对所述初始榜单进行进一步地筛选,虽然已经将所述频率大于预设阈值的分词进行汇总进行排列处理,得到了初始榜单,但是所述初始榜单中可能会包含较多低频率的分词,因此截取所述初始榜单中预设个数的分词,生成热词榜单。在发明其中一个应用场景中,所述热词榜单可以便于直播间中的主播更加直观地了解观众想要了解的相关话题和问题,主播可以对所述热词榜单上的问题按照从上到下的顺序进行解答。
步骤四、按照所述热词榜单中分词的排列顺序依次选择其中一个分词,将选择的所述分词作为检索词在所述聊天记录中检索,得到所述检索词对应的问题。
本申请实施例中,所述记录检索模块103按照所述热词榜单中分词的排列顺序依次选择其中一个分词,将选择的所述分词作为检索词在所述聊天记录中检索,得到所述检索词对应的问题:
通过遍历操作,按照所述热词榜单中分词的排列顺序依次选择其中一个分词,将选择的所述分词作为检索词,并对所述检索词进行向量化处理,得到检索词向量提取所述初始聊天记录中的聊天关键词,并对所述聊天关键词进行向量化处理,得到关键词向量;
计算所述检索词向量和所述关键词向量之间的相似度,选择所述相似度大于或者等于预设的相似阈值的关键词对应的问题作为所述检索词对应的问题。
具体地,在本申请实施例中,所述初始聊天记录中可能包含多个聊天关键词,所述聊天关键词为医学领域的医学名词,计算所述检索词和所述多个聊天关键词之间的相似度,只要所述初始聊天记录中的句子中一个聊天关键词符合,则将对应的句子作为所述检索词对应的问题。
详细地,本申请实施例选择所述相似度大于或者等于预设的相似阈值的关键词对应的问题作为所述检索词对应的问题时,问题可以是以问句形式呈现出来的聊天记录。若检索到的所述关键词出现在不是问句的聊天记录中,则对应的聊天记录不会归为所述检索词对应的问题。
例如,所述检索词为“脑梗”,所述初始聊天记录为:“脑梗和心脏病在现代社会人的身上出现的频率很高,治疗脑梗的真正有效的方法到底存在吗?,但其中真正有效的主要是以下三种,包括·······。”,所述初始聊天记录中的聊天关键词为:“脑梗”和心脏病,通过计算所述检索词向量和所述关键词向量之间的相似度,可以将所述初始聊天记录对应的包含“脑梗”的问题作为所述检索词对应的问题。
其中,本申请实施例可以采用很多计算方法计算所述检索词向量和所述关键词向量之间的相似度,包括,但不限于,采用余弦相似度公式进行计算,采用欧几里得距离进行计算等。
可选地,本申请其中一个实施例中,所述计算所述检索词向量和所述关键词向量之间的相似度,包括:
利用下述公式计算所述检索词向量和所述关键词向量之间的相似度:
Figure PCTCN2022087818-appb-000002
其中,cos(a,b)为相似度,a为所述检索词向量,b为所述关键词向量,|a|、|b|分别为所述检索词向量对应的模和所述关键词向量对应的模。
本申请实施例可以根据预设的word2vec算法对所述检索词和所述聊天关键词进行向量化处理,得到所述检索词向量和所述关键词向量。
步骤五、获取原始问答数据集,提取所述原始问答数据集中的每个流程节点和所述流程节点对应的语料数据,将所述语料数据的流程节点进行标记、合并,得到训练语料。
本申请实施例中,所述原始问答数据集可以是智能问答的业务场景下包含的医学问题和对应的答案。
详细地,所述原始问答数据集中的每个流程节点可以是所述医学专家和提问者之间的每一轮对话,以及对应的语料数据是指每轮对话中医学专家和提问者之间对话中医学专家的回答。
本申请实施例对所述语料数据进行流程节点标记后进行合并,得到训练语料。
步骤六、对所述训练语料进行特征编码,得到训练语料向量,利用所述训练语料对预设的多分类模型进行训练,得到问题意图分类模型。
本申请实施例中,对所述训练语料进行特征编码可以采用独热编码、目标编码、贝叶斯目标编码等方法,其中,本申请利用预设的独热编码对所述训练语料进行特征编码,得到训练语料向量。
具体地,所述对所述训练语料进行特征编码,得到训练语料向量,包括:
对所述原始问答数据集中的训练语料进行语料总数汇总,得到语料总数;
以所述训练语料为预设矩阵的行数,以所述语料总数为所述预设矩阵的列数,构建得到初始矩阵向量;
设置所述初始矩阵向量中所述训练语料对应的列数所在的位置为第一数值,其余列数为第二数值,得到训练语料向量。
例如,所述语料总数为5,即假设有五轮对话,那么第一轮训练语料的训练语料向量就是[1,0,0,0,0]。
本申请实施例中,所述预设的多分类模型可以为浅层的神经网络或者支持向量机模型。本申请实施例中,对所述多分类模型进行训练得到的问题意图分类模型可以识别出每个问题对应的意图。
详细地,所述利用所述训练语料对预设的多分类模型进行训练,得到问题意图分类模型,包括:
利用所述预设的多分类模型对所述训练语料进行分类,得到一种或者多种分类意图;
标注所述原始问答数据集中的意图类别,计算所述意图类别和所述分类意图之间的重复度;
当所述重复度小于预设的分类阈值时,对所述预设的分类模型进行迭代更新,重新对所述最终表示向量进行分类;
当所述重复度大于或者等于预设的分类阈值时,得到问题意图分类模型。
详细地,本申请实施例通过判断所述每轮医学专家和提问者之间对话中医学专家的回答的实际意图来标注所述原始问答数据集中的意图类别。例如,第一轮对话为:提问者人员:“请问您能对脑梗相关的问题进行解答吗?”,医学专家:“是的,我可以。”,则所述第一轮对话的意图类别为确认医学专家解答领域,第二轮对话为:提问者:“脑梗的有效治疗方案有哪些呢?”,医学专家:“现在关于脑梗治疗的普遍治疗方案主要有以下几种,第一………”,则所述第二轮对话的意图类别为确认具体问题的解决方法等。
步骤七、所述问答记录生成模块106将检索得到的所述问题输入至所述问题意图分类模型中,得到问题意图,根据所述问题意图对所述问题进行解答并生成问答记录,将所述问答记录推送到客户端。
本申请实施例中,将检索得到的所述问题输入至所述问题意图分类模型中,得到问题意图,所述问题意图可以粗略估计出问题所述的范围和领域,根据不同的用户意图进行对应的解答,可以提高解答问题的效率。
其中,根据所述问题意图对检索得到的所述问题进行解答,得到对应的答案,并将所述问题和对应的答案进行汇总,生成所述问答记录。
具体地,所述根据所述问题意图对所述问题进行解答并生成问答记录,包括:
根据所述问题意图选择对应的预设的问答数据库,利用所述问答数据库对所述问题进行匹配处理,判断所述问题是否与所述问答数据库中的问题匹配;
若所述问题与所述问答数据库中的问题匹配,将所述问答数据库中的问题对应的答案作为所述问题的答案,并根据所述问题和所述答案生成问答记录;
若所述问题与所述问答数据库中的问题不匹配,将所述问题标记为未解答问题并进行对其进行问题解答,得到所述未解答问题的答案;
根据所述未解答问题和所述未解答问题的答案生成问答记录。
其中,所述问答数据库中包含一些常见问题及所述常见问题对应的回答。
详细地,若所述问题与所述问题数据库中的问题不匹配,本申请实施例将将所述问题标记为未解答问题,所述未解答问题在所述问答数据库中搜索不到匹配的问题,可以通过人工解答得到答案。
具体地,所述将所述问答记录推送到用户端,包括:
根据问答记录的传输文件将所述问答记录传输至数据推送引擎;
利用所述数据推送引擎推送所述问答记录至用户端。
如图4所示,是本申请一实施例提供的实现问答记录生成方法的电子设备的结构示意图。
所述电子设备1可以包括处理器10、存储器11和总线,还可以包括存储在所述存储器11中并可在所述处理器10上运行的计算机程序,如问答记录生成程序12。
其中,所述存储器11至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、移动硬盘、多媒体卡、卡型存储器(例如:SD或DX存储器等)、磁性存储器、磁盘、光盘等,所述计算机可读存储介质可以是非易失性的,也可以是易失性的。所述存储器11在一些实施例中可以是电子设备1的内部存储单元,例如该电子设备1的移动硬盘。所述存储器11在另一些实施例中也可以是电子设备1的外部存储设备,例如电子设备1上配备的插接式移动硬盘、智能存储卡(Smart Media Card,SMC)、安全数字(Secure Digital,SD)卡、闪存卡(Flash Card)等。进一步地,所述存储器11还可以既包括电子设备1的内部存储单元也包括外部存储设备。所述存储器11不仅可以用于存储安装于电子设备1的应用软件及各类数据,例如问答记录生成程序12的代码等,还可以用于暂时地存储已经输出或者将要输出的数据。
所述处理器10在一些实施例中可以由集成电路组成,例如可以由单个封装的集成电路所组成,也可以是由多个相同功能或不同功能封装的集成电路所组成,包括一个或者多个中央处理器(Central Processing unit,CPU)、微处理器、数字处理芯片、图形处理器及各种控制芯片的组合等。所述处理器10是所述电子设备的控制核心(Control Unit),利用各种接口和线路连接整个电子设备的各个部件,通过运行或执行存储在所述存储器11内的程序或者模块(例如问答记录生成程序等),以及调用存储在所述存储器11内的数据,以执行电子设备1的各种功能和处理数据。
所述总线可以是外设部件互连标准(peripheral component interconnect,简称PCI)总线或扩展工业标准结构(extended industry standard architecture,简称EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。所述总线被设置为实现所述存储器11以及至少一个处理器10等之间的连接通信。
图4仅示出了具有部件的电子设备,本领域技术人员可以理解的是,图4示出的结构并不构成对所述电子设备1的限定,可以包括比图示更少或者更多的部件,或者组合某些部件,或者不同的部件布置。
例如,尽管未示出,所述电子设备1还可以包括给各个部件供电的电源(比如电池),优选地,电源可以通过电源管理装置与所述至少一个处理器10逻辑相连,从而通过电源管理装置实现充电管理、放电管理、以及功耗管理等功能。电源还可以包括一个或一个以 上的直流或交流电源、再充电装置、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。所述电子设备1还可以包括多种传感器、蓝牙模块、Wi-Fi模块等,在此不再赘述。
进一步地,所述电子设备1还可以包括网络接口,可选地,所述网络接口可以包括有线接口和/或无线接口(如WI-FI接口、蓝牙接口等),通常用于在该电子设备1与其他电子设备之间建立通信连接。
可选地,该电子设备1还可以包括用户接口,用户接口可以是显示器(Display)、输入单元(比如键盘(Keyboard)),可选地,用户接口还可以是标准的有线接口、无线接口。可选地,在一些实施例中,显示器可以是LED显示器、液晶显示器、触控式液晶显示器以及OLED(Organic Light-Emitting Diode,有机发光二极管)触摸器等。其中,显示器也可以适当的称为显示屏或显示单元,用于显示在电子设备1中处理的信息以及用于显示可视化的用户界面。
应该了解,所述实施例仅为说明之用,在专利申请范围上并不受此结构的限制。
所述电子设备1中的所述存储器11存储的问答记录生成程序12是多个指令的组合,在所述处理器10中运行时,可以实现:
对获取的聊天记录进行分词处理,并统计每个分词出现的频率;
对所述频率大于预设阈值的分词进行汇总,得到热门分词集;
对所述热门分词集中的分词按照频率的大小进行排序,生成热词榜单;
按照所述热词榜单中分词的排列顺序依次选择其中一个分词,将选择的所述分词作为检索词在所述聊天记录中检索,得到所述检索词对应的问题;
获取原始问答数据集,提取所述原始问答数据集中的每个流程节点和所述流程节点对应的语料数据,将所述语料数据的流程节点进行标记、合并,得到训练语料;
对所述训练语料进行特征编码,得到训练语料向量,利用所述训练语料对预设的多分类模型进行训练,得到问题意图分类模型;
将检索得到的所述问题输入至所述问题意图分类模型中,得到问题意图,根据所述问题意图对所述问题进行解答并生成问答记录,将所述问答记录推送到客户端。
具体地,所述处理器10对上述指令的具体实现方法可参考图1至图4对应实施例中相关步骤的描述,在此不赘述。
进一步地,所述电子设备1集成的模块/单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读存储介质中。所述计算机可读存储介质可以是易失性的,也可以是非易失性的。例如,所述计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)。
本申请还提供一种计算机可读存储介质,所述可读存储介质存储有计算机程序,所述计算机程序在被电子设备的处理器所执行时,可以实现:
对获取的聊天记录进行分词处理,并统计每个分词出现的频率;
对所述频率大于预设阈值的分词进行汇总,得到热门分词集;
对所述热门分词集中的分词按照频率的大小进行排序,生成热词榜单;
按照所述热词榜单中分词的排列顺序依次选择其中一个分词,将选择的所述分词作为检索词在所述聊天记录中检索,得到所述检索词对应的问题;
获取原始问答数据集,提取所述原始问答数据集中的每个流程节点和所述流程节点对应的语料数据,将所述语料数据的流程节点进行标记、合并,得到训练语料;
对所述训练语料进行特征编码,得到训练语料向量,利用所述训练语料对预设的多分类模型进行训练,得到问题意图分类模型;
将检索得到的所述问题输入至所述问题意图分类模型中,得到问题意图,根据所述问 题意图对所述问题进行解答并生成问答记录,将所述问答记录推送到客户端。
在本申请所提供的几个实施例中,应该理解到,所揭露的设备,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
所述作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能模块的形式实现。
对于本领域技术人员而言,显然本申请不限于上述示范性实施例的细节,而且在不背离本申请的精神或基本特征的情况下,能够以其他的具体形式实现本申请。
因此,无论从哪一点来看,均应将实施例看作是示范性的,而且是非限制性的,本申请的范围由所附权利要求而不是上述说明限定,因此旨在将落在权利要求的等同要件的含义和范围内的所有变化涵括在本申请内。不应将权利要求中的任何附关联图标记视为限制所涉及的权利要求。
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。
此外,显然“包括”一词不排除其他单元或步骤,单数不排除复数。系统权利要求中陈述的多个单元或装置也可以由一个单元或装置通过软件或者硬件来实现。第二等词语用来表示名称,而并不表示任何特定的顺序。
最后应说明的是,以上实施例仅用以说明本申请的技术方案而非限制,尽管参照较佳实施例对本申请进行了详细说明,本领域的普通技术人员应当理解,可以对本申请的技术方案进行修改或等同替换,而不脱离本申请技术方案的精神和范围。

Claims (20)

  1. 一种问答记录生成方法,其中,所述方法包括:
    对获取的聊天记录进行分词处理,并统计每个分词出现的频率;
    对所述频率大于预设阈值的分词进行汇总,得到热门分词集;
    对所述热门分词集中的分词按照频率的大小进行排序,生成热词榜单;
    按照所述热词榜单中分词的排列顺序依次选择其中一个分词,将选择的所述分词作为检索词在所述聊天记录中检索,得到所述检索词对应的问题;
    获取原始问答数据集,提取所述原始问答数据集中的每个流程节点和所述流程节点对应的语料数据,将所述语料数据的流程节点进行标记、合并,得到训练语料;
    对所述训练语料进行特征编码,得到训练语料向量,利用所述训练语料对预设的多分类模型进行训练,得到问题意图分类模型;
    将检索得到的所述问题输入至所述问题意图分类模型中,得到问题意图,根据所述问题意图对所述问题进行解答并生成问答记录,将所述问答记录推送到客户端。
  2. 如权利要求1所述的问答记录生成方法,其中,所述对获取的聊天记录进行分词处理,包括:
    按照预设规则对所述聊天记录进行预处理,得到初始聊天记录;
    利用分词工具对所述初始聊天记录进行分词处理,得到分词聊天集;
    根据预设的关键词词典,从所述分词聊天集中筛选出分词。
  3. 如权利要求1所述的问答记录生成方法,其中,所述对获取的聊天记录进行分词处理之前,所述方法还包括:
    识别出所述聊天记录对应的用户;
    判断所述用户是否通过身份校验;
    若所述用户未通过所述身份校验,则删除所述用户的聊天记录;
    若所述用户通过所述身份校验,则保留所述用户对应的聊天记录。
  4. 如权利要求1所述的问答记录生成方法,其中,所述按照所述热词榜单中分词的排列顺序依次选择其中一个分词,将选择的所述分词作为检索词在所述聊天记录中检索,得到所述检索词对应的问题,包括:
    通过遍历操作,按照所述热词榜单中分词的排列顺序依次选择其中一个分词,将选择的所述分词作为检索词,并对所述检索词进行向量化处理,得到检索词向量;
    提取所述初始聊天记录中的聊天关键词,并对所述聊天关键词进行向量化处理,得到关键词向量;
    计算所述检索词向量和所述关键词向量之间的相似度,选择所述相似度大于或者等于预设的相似阈值的关键词对应的问题作为所述检索词对应的问题。
  5. 如权利要求1所述的问答记录生成方法,其中,所述对所述训练语料进行特征编码,得到训练语料向量,包括:
    对所述原始问答数据集中的训练语料进行语料总数汇总,得到语料总数;
    以所述训练语料为预设矩阵的行数,以所述语料总数为所述预设矩阵的列数,构建得到初始矩阵向量;
    设置所述初始矩阵向量中所述训练语料对应的列数所在的位置为第一数值,其余列数为第二数值,得到训练语料向量。
  6. 如权利要求1所述的问答记录生成方法,其中,所述利用所述训练语料对预设的多分类模型进行训练,得到问题意图分类模型,包括:
    利用所述预设的多分类模型对所述训练语料进行分类,得到一种或者多种分类意图;
    标注所述原始问答数据集中的意图类别,计算所述意图类别和所述分类意图之间的重复度;
    当所述重复度小于预设的分类阈值时,对所述预设的分类模型进行迭代更新,重新对所述最终表示向量进行分类;
    当所述重复度大于或者等于预设的分类阈值时,得到问题意图分类模型。
  7. 如权利要求1至5中任意一项所述的问答记录生成方法,其中,所述根据所述问题意图对所述问题进行解答并生成问答记录,包括:
    根据所述问题意图选择对应的预设的问答数据库,利用所述问答数据库对所述问题进行匹配处理,判断所述问题是否与所述问答数据库中的问题匹配;
    若所述问题与所述问答数据库中的问题匹配,将所述问答数据库中的问题对应的答案作为所述问题的答案,并根据所述问题和所述答案生成问答记录;
    若所述问题与所述问答数据库中的问题不匹配,将所述问题标记为未解答问题并进行对其进行问题解答,得到所述未解答问题的答案;
    根据所述未解答问题和所述未解答问题的答案生成问答记录。
  8. 一种问答记录生成装置,其中,所述装置包括:
    分词提取模块,用于对获取的聊天记录进行分词处理,并统计每个分词出现的频率;
    热词榜单生成模块,用于对所述频率大于预设阈值的分词进行汇总,得到热门分词集,对所述热门分词集中的分词按照频率的大小进行排序,生成热词榜单;
    记录检索模块,用于按照所述热词榜单中分词的排列顺序依次选择其中一个分词,将选择的所述分词作为检索词在所述聊天记录中检索,得到所述检索词对应的问题;
    训练语料生成模块,用于获取原始问答数据集,提取所述原始问答数据集中的每个流程节点和所述流程节点对应的语料数据,将所述语料数据的流程节点进行标记、合并,得到训练语料;
    模型训练模块,用于对所述训练语料进行特征编码,得到训练语料向量,利用所述训练语料对预设的多分类模型进行训练,得到问题意图分类模型;
    问答记录生成模块,用于将检索得到的所述问题输入至所述问题意图分类模型中,得到问题意图,根据所述问题意图对所述问题进行解答并生成问答记录,将所述问答记录推送到客户端。
  9. 一种电子设备,其中,所述电子设备包括:
    至少一个处理器;以及,
    与所述至少一个处理器通信连接的存储器;其中,
    所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行如下所述的问答记录生成方法:
    对获取的聊天记录进行分词处理,并统计每个分词出现的频率;
    对所述频率大于预设阈值的分词进行汇总,得到热门分词集;
    对所述热门分词集中的分词按照频率的大小进行排序,生成热词榜单;
    按照所述热词榜单中分词的排列顺序依次选择其中一个分词,将选择的所述分词作为检索词在所述聊天记录中检索,得到所述检索词对应的问题;
    获取原始问答数据集,提取所述原始问答数据集中的每个流程节点和所述流程节点对应的语料数据,将所述语料数据的流程节点进行标记、合并,得到训练语料;
    对所述训练语料进行特征编码,得到训练语料向量,利用所述训练语料对预设的多分类模型进行训练,得到问题意图分类模型;
    将检索得到的所述问题输入至所述问题意图分类模型中,得到问题意图,根据所述问题意图对所述问题进行解答并生成问答记录,将所述问答记录推送到客户端。
  10. 如权利要求9所述的电子设备,其中,所述对获取的聊天记录进行分词处理,包 括:
    按照预设规则对所述聊天记录进行预处理,得到初始聊天记录;
    利用分词工具对所述初始聊天记录进行分词处理,得到分词聊天集;
    根据预设的关键词词典,从所述分词聊天集中筛选出分词。
  11. 如权利要求9所述的电子设备,其中,所述对获取的聊天记录进行分词处理之前,所述方法还包括:
    识别出所述聊天记录对应的用户;
    判断所述用户是否通过身份校验;
    若所述用户未通过所述身份校验,则删除所述用户的聊天记录;
    若所述用户通过所述身份校验,则保留所述用户对应的聊天记录。
  12. 如权利要求9所述的电子设备,其中,所述按照所述热词榜单中分词的排列顺序依次选择其中一个分词,将选择的所述分词作为检索词在所述聊天记录中检索,得到所述检索词对应的问题,包括:
    通过遍历操作,按照所述热词榜单中分词的排列顺序依次选择其中一个分词,将选择的所述分词作为检索词,并对所述检索词进行向量化处理,得到检索词向量;
    提取所述初始聊天记录中的聊天关键词,并对所述聊天关键词进行向量化处理,得到关键词向量;
    计算所述检索词向量和所述关键词向量之间的相似度,选择所述相似度大于或者等于预设的相似阈值的关键词对应的问题作为所述检索词对应的问题。
  13. 如权利要求9所述的电子设备,其中,所述对所述训练语料进行特征编码,得到训练语料向量,包括:
    对所述原始问答数据集中的训练语料进行语料总数汇总,得到语料总数;
    以所述训练语料为预设矩阵的行数,以所述语料总数为所述预设矩阵的列数,构建得到初始矩阵向量;
    设置所述初始矩阵向量中所述训练语料对应的列数所在的位置为第一数值,其余列数为第二数值,得到训练语料向量。
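  14. The electronic device according to claim 9, wherein the training a preset multi-class classification model with the training corpus to obtain a question intent classification model comprises:
    classifying the training corpus with the preset multi-class classification model to obtain one or more classified intents;
    labeling the intent categories in the original question-and-answer data set, and calculating the degree of overlap between the intent categories and the classified intents;
    when the degree of overlap is less than a preset classification threshold, iteratively updating the preset classification model and classifying the final representation vector again;
    when the degree of overlap is greater than or equal to the preset classification threshold, obtaining the question intent classification model.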
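
The stopping rule in claim 14 (keep updating until the overlap between predicted and labelled intents reaches a threshold) can be sketched with a toy classifier. Several things here are assumptions rather than details from the application: the degree of overlap is taken to be the fraction of samples whose predicted intent equals the label, the model is a bag-of-words multi-class perceptron, and `max_rounds` is an added safety cap.

```python
from collections import defaultdict


def predict(weights, text, intents):
    """Score each intent by summing per-word weights and return the best one."""
    scores = {i: sum(weights[(i, w)] for w in text.split()) for i in intents}
    return max(intents, key=lambda i: scores[i])


def overlap(predicted, labelled):
    """Assumed overlap measure: share of predictions that match the labels."""
    return sum(p == y for p, y in zip(predicted, labelled)) / len(labelled)


def train_intent_model(samples, threshold, max_rounds=100):
    """Iteratively update the model until the overlap reaches `threshold`."""
    intents = sorted({label for _, label in samples})
    labels = [label for _, label in samples]
    weights = defaultdict(float)
    for _ in range(max_rounds):
        predicted = [predict(weights, text, intents) for text, _ in samples]
        if overlap(predicted, labels) >= threshold:
            break  # overlap is high enough: accept the current model
        # Perceptron-style update on every misclassified sample.
        for (text, label), guess in zip(samples, predicted):
            if guess != label:
                for word in text.split():
                    weights[(label, word)] += 1.0
                    weights[(guess, word)] -= 1.0
    return weights, intents


# Hypothetical labelled training corpus.
samples = [
    ("where is my refund", "refund"),
    ("cannot log in", "login"),
    ("refund not received", "refund"),
    ("reset login password", "login"),
]
weights, intents = train_intent_model(samples, threshold=1.0)
print(predict(weights, "refund please", intents))
```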
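  15. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the following question-and-answer record generation method:
    performing word segmentation on acquired chat records, and counting the frequency with which each segmented word occurs;
    collecting the segmented words whose frequency is greater than a preset threshold to obtain a hot segmented-word set;
    sorting the segmented words in the hot segmented-word set by frequency to generate a hot-word list;
    selecting the segmented words one at a time in the order in which they appear in the hot-word list, and searching the chat records using the selected segmented word as a search term to obtain the question corresponding to the search term;
    acquiring an original question-and-answer data set, extracting each process node in the original question-and-answer data set and the corpus data corresponding to the process node, and labeling and merging the process nodes of the corpus data to obtain a training corpus;
    performing feature encoding on the training corpus to obtain training corpus vectors, and training a preset multi-class classification model with the training corpus to obtain a question intent classification model;
    inputting the retrieved question into the question intent classification model to obtain a question intent, answering the question according to the question intent and generating a question-and-answer record, and pushing the question-and-answer record to a client.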
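  16. The computer-readable storage medium according to claim 15, wherein the performing word segmentation on acquired chat records comprises:
    preprocessing the chat records according to a preset rule to obtain initial chat records;
    performing word segmentation on the initial chat records with a word segmentation tool to obtain a segmented chat set;
    selecting segmented words from the segmented chat set according to a preset keyword dictionary.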
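  17. The computer-readable storage medium according to claim 15, wherein, before the performing word segmentation on acquired chat records, the method further comprises:
    identifying the user corresponding to the chat records;
    determining whether the user has passed identity verification;
    if the user has not passed the identity verification, deleting the chat records of the user;
    if the user has passed the identity verification, retaining the chat records corresponding to the user.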
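  18. The computer-readable storage medium according to claim 15, wherein the selecting the segmented words one at a time in the order in which they appear in the hot-word list, and searching the chat records using the selected segmented word as a search term to obtain the question corresponding to the search term, comprises:
    through a traversal operation, selecting the segmented words one at a time in the order in which they appear in the hot-word list, using the selected segmented word as the search term, and vectorizing the search term to obtain a search-term vector;
    extracting chat keywords from the initial chat records, and vectorizing the chat keywords to obtain keyword vectors;
    calculating the similarity between the search-term vector and the keyword vectors, and selecting, as the question corresponding to the search term, the question corresponding to a keyword whose similarity is greater than or equal to a preset similarity threshold.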
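  19. The computer-readable storage medium according to claim 15, wherein the performing feature encoding on the training corpus to obtain training corpus vectors comprises:
    totalling the training corpora in the original question-and-answer data set to obtain a total corpus count;
    constructing an initial matrix vector with the training corpora as the rows of a preset matrix and the total corpus count as the number of columns of the preset matrix;
    setting, in the initial matrix vector, the position of the column corresponding to each training corpus to a first value and the remaining columns to a second value, to obtain the training corpus vectors.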
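  20. The computer-readable storage medium according to claim 15, wherein the training a preset multi-class classification model with the training corpus to obtain a question intent classification model comprises:
    classifying the training corpus with the preset multi-class classification model to obtain one or more classified intents;
    labeling the intent categories in the original question-and-answer data set, and calculating the degree of overlap between the intent categories and the classified intents;
    when the degree of overlap is less than a preset classification threshold, iteratively updating the preset classification model and classifying the final representation vector again;
    when the degree of overlap is greater than or equal to the preset classification threshold, obtaining the question intent classification model.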
PCT/CN2022/087818 2021-04-21 2022-04-20 Question-and-answer record generation method and apparatus, electronic device, and storage medium WO2022222942A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110429297.3 2021-04-21
CN202110429297.3A CN113111159A (zh) 2021-04-21 2021-04-21 Question-and-answer record generation method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022222942A1 (zh) 2022-10-27

Family

ID=76719045

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/087818 WO2022222942A1 (zh) 2021-04-21 2022-04-20 Question-and-answer record generation method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN113111159A (zh)
WO (1) WO2022222942A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111159A (zh) * 2021-04-21 2021-07-13 康键信息技术(深圳)有限公司 Question-and-answer record generation method and apparatus, electronic device, and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11580350B2 (en) * 2016-12-21 2023-02-14 Microsoft Technology Licensing, Llc Systems and methods for an emotionally intelligent chat bot
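CN106878819B (zh) * 2017-01-20 2019-07-26 合一网络技术(北京)有限公司 Method, system, and apparatus for information interaction in live webcasting
CN112347760A (zh) * 2020-11-16 2021-02-09 北京京东尚科信息技术有限公司 Training method and apparatus for intent recognition model, and intent recognition method and apparatus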

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
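CN108304437A (zh) * 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 Automatic question-answering method, apparatus, and storage medium
CN108388558A (zh) * 2018-02-07 2018-08-10 平安普惠企业管理有限公司 Question matching method and apparatus, customer service robot, and storage medium
CN110232114A (zh) * 2019-05-06 2019-09-13 平安科技(深圳)有限公司 Sentence intent recognition method and apparatus, and computer-readable storage medium
CN110321564A (zh) * 2019-07-05 2019-10-11 浙江工业大学 Multi-turn dialogue intent recognition method
CN112650829A (zh) * 2019-10-11 2021-04-13 阿里巴巴集团控股有限公司 Customer service processing method and apparatus
CN111782785A (zh) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Automatic question-answering method, apparatus, device, and storage medium
CN112632257A (zh) * 2020-12-29 2021-04-09 深圳赛安特技术服务有限公司 Question processing method and apparatus based on semantic matching, terminal, and storage medium
CN113111159A (zh) * 2021-04-21 2021-07-13 康键信息技术(深圳)有限公司 Question-and-answer record generation method and apparatus, electronic device, and storage medium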

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
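CN116662523A (zh) * 2023-08-01 2023-08-29 宁波甬恒瑶瑶智能科技有限公司 Biochemical knowledge question-answering method, system, and storage medium based on a GPT model
CN116662523B (zh) * 2023-08-01 2023-10-20 宁波甬恒瑶瑶智能科技有限公司 Biochemical knowledge question-answering method, system, and storage medium based on a GPT model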

Also Published As

Publication number Publication date
CN113111159A (zh) 2021-07-13

Similar Documents

Publication Publication Date Title
US11537820B2 (en) Method and system for generating and correcting classification models
US10586155B2 (en) Clarification of submitted questions in a question and answer system
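WO2022141861A1 (zh) Sentiment classification method and apparatus, electronic device, and storage medium
CN107818815B (zh) Electronic medical record retrieval method and system
CN110929125B (zh) Search recall method, apparatus, device, and storage medium thereof
WO2022222942A1 (zh) Question-and-answer record generation method and apparatus, electronic device, and storage medium
WO2022105115A1 (zh) Question-answer pair matching method and apparatus, electronic device, and storage medium
CN112395395B (zh) Text keyword extraction method, apparatus, device, and storage medium
WO2016033239A1 (en) Data clustering system and methods
WO2023029512A1 (zh) Knowledge-graph-based medical question answering method, apparatus, device, and medium
WO2022222943A1 (zh) Department recommendation method and apparatus, electronic device, and storage medium
CN113312461A (zh) Intelligent question-answering method, apparatus, device, and medium based on natural language processing
CN111078837A (zh) Intelligent question-answering information processing method, electronic device, and computer-readable storage medium
US11461680B2 (en) Identifying attributes in unstructured data files using a machine-learning model
CN116134432A (zh) System and method for providing answers to queries
WO2022160454A1 (zh) Medical literature retrieval method and apparatus, electronic device, and storage medium
CN111797245B (zh) Information matching method based on knowledge graph model, and related apparatus
CN108304381B (zh) Artificial-intelligence-based entity edge establishment method, apparatus, device, and storage medium
CN112132238A (zh) Method, apparatus, device, and readable medium for identifying private data
WO2022227171A1 (zh) Key information extraction method and apparatus, electronic device, and medium
WO2021174923A1 (zh) Concept word sequence generation method and apparatus, computer device, and storage medium
CN114416939A (zh) Intelligent question-answering method, apparatus, device, and storage medium
US20210117448A1 (en) Iterative sampling based dataset clustering
CN116450916A (zh) Information query method and apparatus based on fixed-segment grading, electronic device, and medium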
Xia et al. Content-irrelevant tag cleansing via bi-layer clustering and peer cooperation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22791049

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE