WO2022105119A1 - Training corpus generation method for intention recognition model, and related device thereof - Google Patents

Training corpus generation method for intention recognition model, and related device thereof

Info

Publication number
WO2022105119A1
WO2022105119A1 (PCT/CN2021/090462)
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
inquiry
query
target
related corpus
Prior art date
Application number
PCT/CN2021/090462
Other languages
French (fr)
Chinese (zh)
Inventor
孙向欣
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022105119A1 publication Critical patent/WO2022105119A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of big data technology, and in particular, to a training corpus generation method for an intent recognition model and related devices.
  • At present, human-machine dialogues mostly use intent recognition models to identify customer intentions.
  • In some scenarios a customer's intention depends on the AI (Artificial Intelligence) inquiry, while in other scenarios it does not. Therefore, during training of the intent recognition model, whether the corresponding AI inquiry is filled into a training sample is mostly decided according to this dependency.
  • However, the inventor realized that, because the dependence of customer intentions on AI inquiries cannot be judged in the actual production process, the prediction inputs to the model all contain AI inquiries. As a result, the model training mode and the model prediction mode are inconsistent, and the accuracy of the intent recognition model in production is lower than its accuracy in the training environment.
  • the purpose of the embodiments of the present application is to propose a training corpus generation method for an intent recognition model and related equipment, so as to improve the quality of the training corpus of the intent recognition model.
  • the embodiment of the present application provides a training corpus generation method for an intent recognition model, which adopts the following technical solutions:
  • a training corpus generation method for an intent recognition model comprising the following steps:
  • acquiring the target inquiry-related corpus from the target inquiry-related corpus library, determining the inquiry category corresponding to the target inquiry-related corpus based on the AI inquiry corpus, and generating a first training sample based on that inquiry category and the target inquiry-related corpus;
  • the first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
  • the embodiment of the present application also provides a training corpus generation device for an intent recognition model, which adopts the following technical solutions:
  • a training corpus generation device for an intent recognition model comprising:
  • the matching module is used to receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship;
  • an establishing module, configured to establish an inquiry-related corpus library and a non-inquiry-related corpus library based on the inquiry-related corpus and the non-inquiry-related corpus, respectively;
  • a calculation module, configured to calculate the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library, and to adjust the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity to obtain a target inquiry-related corpus library and a target non-inquiry-related corpus library;
  • a generating module, configured to acquire the target inquiry-related corpus from the target inquiry-related corpus library, determine the inquiry category corresponding to the target inquiry-related corpus based on the AI inquiry corpus, and generate a first training sample based on that inquiry category and the target inquiry-related corpus;
  • an association module, configured to acquire the target non-inquiry-related corpus from the target non-inquiry-related corpus library and, based on the intent label, associate the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample;
  • the output module is used for outputting the first training sample and the second training sample as training corpus, wherein the training corpus is used for training the intention recognition model.
  • the embodiment of the present application also provides a computer device, which adopts the following technical solutions:
  • a computer device comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the processor executes the computer-readable instructions, the following method for generating a training corpus of an intent recognition model is implemented:
  • acquiring the target inquiry-related corpus from the target inquiry-related corpus library, determining the inquiry category corresponding to the target inquiry-related corpus based on the AI inquiry corpus, and generating a first training sample based on that inquiry category and the target inquiry-related corpus;
  • the first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
  • the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
  • a computer-readable storage medium where computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the following method for generating a training corpus of an intent recognition model is implemented:
  • acquiring the target inquiry-related corpus from the target inquiry-related corpus library, determining the inquiry category corresponding to the target inquiry-related corpus based on the AI inquiry corpus, and generating a first training sample based on that inquiry category and the target inquiry-related corpus;
  • the first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
  • This application calculates the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library, and adjusts the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity, so that the determined target inquiry-related corpus and target non-inquiry-related corpus are more accurate.
  • By associating the target non-inquiry-related corpus with preset inquiry categories based on the intent labels, the problem of not filling an inquiry category into training corpus that does not depend on the AI inquiry corpus is solved without causing an explosion of the training corpus, which preserves the efficiency of model training.
  • The training corpus generated in this way keeps the accuracy of the intent recognition model at a high level.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of a training corpus generation method for an intent recognition model according to the present application
  • FIG. 3 is a schematic structural diagram of an embodiment of an apparatus for generating training corpus of an intent recognition model according to the present application
  • FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications can be installed on the terminal devices 101, 102, and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
  • The terminal devices 101, 102, and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
  • the method for generating training corpus for the intent recognition model is generally performed by a server/terminal device, and accordingly, the apparatus for generating training corpus for the intent recognition model is generally set in the server/terminal device.
  • terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • the training corpus generation method for the intent recognition model includes the following steps:
  • S1: Receive the AI inquiry corpus pre-labeled with inquiry categories and the customer answer corpus pre-labeled with intent labels, and perform a screening operation on the customer answer corpus based on a preset regular expression to obtain inquiry-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship.
  • an annotator pre-marks an intent label on the customer answer corpus under each inquiry category, where the inquiry category may include six categories from Q1 to Q6.
  • The successfully matched customer answer corpus is used as inquiry-related corpus, and the remaining customer answer corpus is used as non-inquiry-related corpus, which facilitates further processing of the customer answer corpus.
  • The electronic device (for example, the server/terminal device shown in FIG. 1) on which the training corpus generation method for the intent recognition model runs can receive the AI inquiry corpus and the customer answer corpus through a wired connection or a wireless connection.
  • The above wireless connection methods may include, but are not limited to, 3G/4G connection, WiFi connection, Bluetooth connection, WiMAX connection, Zigbee connection, UWB (ultra wideband) connection, and other wireless connection methods currently known or developed in the future.
  • In this embodiment, matching the customer answer corpus based on a preset regular expression, using the successfully matched customer answer corpus as inquiry-related corpus and the unmatched customer answer corpus as non-inquiry-related corpus includes the following steps.
  • Regular matching is used to extract suspected inquiry-related corpus from the customer answer corpus; the remaining corpus, that is, the corpus that fails to match, is suspected non-inquiry-related corpus.
  • The suspected inquiry-related corpus is handed over to a designated person for confirmation, and is marked as inquiry-related or non-inquiry-related based on that confirmation.
  • The suspected inquiry-related corpus marked as inquiry-related is regarded as inquiry-related corpus; the suspected non-inquiry-related corpus and the suspected inquiry-related corpus marked as non-inquiry-related are regarded as non-inquiry-related corpus.
  • In this way, inquiry-related corpus can be further confirmed, and the accuracy of the division of the customer answer corpus is improved; a minimal sketch of the screening step is given below.
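The screening step above can be illustrated with a minimal Python sketch. The regular expression, the helper name, and the example answers are illustrative assumptions; the application does not disclose the actual patterns used.

```python
import re

# Hypothetical pattern: answers that directly confirm or deny something are treated
# as suspected inquiry-related, since they only make sense together with the AI
# inquiry. The pattern itself is an assumption for illustration.
SUSPECTED_INQUIRY_PATTERN = re.compile(
    r"\b(yes|no|correct|that's right|I do|I don't)\b", re.IGNORECASE
)

def screen_customer_answers(customer_answers):
    """Split customer answer corpora into suspected inquiry-related and
    suspected non-inquiry-related corpora by regular-expression matching."""
    suspected_related, suspected_unrelated = [], []
    for answer in customer_answers:
        if SUSPECTED_INQUIRY_PATTERN.search(answer):
            suspected_related.append(answer)    # later confirmed by a designated person
        else:
            suspected_unrelated.append(answer)  # treated as suspected non-inquiry-related
    return suspected_related, suspected_unrelated

# Example: "I have saved it." does not match and stays suspected non-inquiry-related.
print(screen_customer_answers(["Yes, that's right.", "I have saved it."]))
```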
  • S2: Establish an inquiry-related corpus library and a non-inquiry-related corpus library based on the inquiry-related corpus and the non-inquiry-related corpus, respectively.
  • S3: Calculate the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library, and adjust the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity to obtain a target inquiry-related corpus library and a target non-inquiry-related corpus library.
  • In this embodiment, the similarity between each non-inquiry-related corpus and the inquiry-related corpus library is calculated, and the corpora in the inquiry-related corpus library and the non-inquiry-related corpus library are adjusted according to the similarity, so that more rigorous target inquiry-related and target non-inquiry-related corpus libraries are obtained.
  • Calculating the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library includes: inputting the current inquiry-related corpus and the non-inquiry-related corpus into a pre-trained language representation model to obtain inquiry-related and non-inquiry-related word vectors, calculating by traversal the cosine similarity between the current non-inquiry-related word vector and each inquiry-related word vector, and taking the cosine similarity with the largest value as the similarity between the current non-inquiry-related corpus and the inquiry-related corpus library.
  • In this embodiment, the language representation model is called to embed the inquiry-related corpus, converting each inquiry-related corpus into a 768-dimensional inquiry-related word vector; each inquiry-related word vector is the embedding of one corpus, where an embedding is a low-dimensional vector representation of a corpus.
  • the language representation model can be the BERT (Bidirectional Encoder Representations from Transformers) model.
  • The BERT model is highly versatile and can capture longer-distance dependencies.
  • Likewise, the language representation model is called to convert each non-inquiry-related corpus into a 768-dimensional non-inquiry-related word vector; these vectors can represent information in both directions.
  • The cosine similarity between the current non-inquiry-related word vector and each of the inquiry-related word vectors is calculated by traversal; after the traversal, the maximum cosine similarity is taken as the similarity between the current non-inquiry-related corpus and the inquiry-related corpus library.
  • For example, for a non-inquiry-related corpus such as "I have saved it.", the corresponding non-inquiry-related word vector is a 768-dimensional vector [0.07, 0.002, 0.04, ..., 0.009], and the cosine similarity between this non-inquiry-related word vector and each inquiry-related word vector is calculated in turn.
  • For two vectors A and B of the same dimension, the cosine similarity is cos(A, B) = (A · B) / (||A|| ||B||), that is, the dot product of A and B divided by the product of their norms; this is illustrated in the sketch below.
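As a concrete illustration of the similarity step, the sketch below computes the maximum cosine similarity between one non-inquiry-related word vector and a set of inquiry-related word vectors with NumPy. The random 768-dimensional vectors merely stand in for the BERT embeddings described above; how the embeddings are produced is outside this sketch.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two vectors of the same dimension."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_to_library(non_inquiry_vec, inquiry_vecs):
    """Traverse every inquiry-related word vector and keep the largest cosine
    similarity as the similarity between this non-inquiry-related corpus and
    the inquiry-related corpus library."""
    return max(cosine_similarity(non_inquiry_vec, v) for v in inquiry_vecs)

# Stand-in 768-dimensional embeddings (assumption; real vectors come from BERT).
rng = np.random.default_rng(0)
inquiry_vecs = [rng.normal(size=768) for _ in range(5)]
non_inquiry_vec = rng.normal(size=768)
print(round(similarity_to_library(non_inquiry_vec, inquiry_vecs), 3))
```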
  • Adjusting the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity to obtain the target inquiry-related corpus library and the target non-inquiry-related corpus library includes: identifying whether the similarity is greater than a preset first similarity threshold; when it is, using the corresponding non-inquiry-related corpus as to-be-confirmed corpus and notifying a designated person to classify it; and, when the designated person completes the classification, allocating the to-be-confirmed corpus to the non-inquiry-related corpus library or the inquiry-related corpus library according to that classification, so as to obtain the target inquiry-related corpus library and the target non-inquiry-related corpus library.
  • If the similarity is not greater than the first similarity threshold, the non-inquiry-related corpus remains in the non-inquiry-related corpus library.
  • For example, if the similarity is 0.3, which is less than the first similarity threshold of 0.6, the corpus remains non-inquiry-related corpus.
  • If the similarity is 0.9, which is greater than the first similarity threshold of 0.6, the corpus becomes to-be-confirmed corpus.
  • When a non-inquiry-related corpus is used as to-be-confirmed corpus, it is extracted from the non-inquiry-related corpus library for redistribution, so that more rigorous target inquiry-related and target non-inquiry-related corpus libraries are obtained.
  • In some embodiments, adjusting the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity to obtain the target inquiry-related corpus library and the target non-inquiry-related corpus library further uses a second similarity threshold.
  • When the similarity is greater than the preset second similarity threshold, the corresponding non-inquiry-related corpus is directly deleted, which can effectively improve the processing speed of the computer.
  • First, whether the similarity is greater than the preset first similarity threshold is identified; when it is, the corresponding non-inquiry-related corpus is used as first to-be-confirmed corpus, the designated person is notified to classify it, and the first to-be-confirmed corpus is allocated to the non-inquiry-related corpus library or the inquiry-related corpus library according to the classification of the designated person, so as to obtain a first inquiry-related corpus library and a first non-inquiry-related corpus library.
  • In the same way, second to-be-confirmed corpus is obtained and allocated to the first non-inquiry-related corpus library or the first inquiry-related corpus library according to the classification of the designated person, so as to obtain a second inquiry-related corpus library and a second non-inquiry-related corpus library.
  • The designated person in this application may be an annotator. If the similarity is greater than the first similarity threshold, the corpus is included in the corpus to be confirmed by the business, that is, the corresponding non-inquiry-related corpus is regarded as to-be-confirmed corpus, and this part of the corpus is returned to the annotator.
  • The annotator confirms whether the corpus is related to the AI inquiry. According to the annotation, the to-be-confirmed corpus related to the AI inquiry is added to the inquiry-related corpus library, and the to-be-confirmed corpus not related to the AI inquiry is added to the non-inquiry-related corpus library.
  • As for the maximum similarity, that is, the second similarity: if it is greater than the preset second similarity threshold, the corresponding corpus is directly deleted; if it is less than the second similarity threshold, the corpus simply remains non-inquiry-related and stays in the non-inquiry-related corpus library. A sketch of this threshold logic follows.
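The two-threshold adjustment can be summarized as the following sketch. The first similarity threshold of 0.6 comes from the example above; the value of the second similarity threshold is not stated in the text and is chosen here only for illustration.

```python
FIRST_SIMILARITY_THRESHOLD = 0.6    # value used in the example above
SECOND_SIMILARITY_THRESHOLD = 0.95  # illustrative assumption; not given in the text

def adjust_libraries(non_inquiry_items):
    """non_inquiry_items: list of (corpus, similarity) pairs, where similarity is
    the maximum cosine similarity to the inquiry-related corpus library.
    Returns corpora that stay non-inquiry-related, corpora routed to the
    designated person for confirmation, and corpora deleted outright."""
    stays, to_confirm, deleted = [], [], []
    for corpus, similarity in non_inquiry_items:
        if similarity > SECOND_SIMILARITY_THRESHOLD:
            deleted.append(corpus)      # near-duplicate of inquiry-related corpus
        elif similarity > FIRST_SIMILARITY_THRESHOLD:
            to_confirm.append(corpus)   # returned to the annotator for confirmation
        else:
            stays.append(corpus)        # remains in the non-inquiry-related library
    return stays, to_confirm, deleted

print(adjust_libraries([("corpus A", 0.3), ("corpus B", 0.9), ("corpus C", 0.97)]))
```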
  • S4: Acquire the target inquiry-related corpus from the target inquiry-related corpus library, determine the inquiry category corresponding to the target inquiry-related corpus based on the AI inquiry corpus, and generate a first training sample based on that inquiry category and the target inquiry-related corpus.
  • The generated first training sample belongs to the training samples in which the customer intent depends on the AI inquiry corpus.
  • In this embodiment, the first training sample is generated from the target inquiry-related corpus and its corresponding inquiry category, which preserves the dependency between the first training sample and the customer intent; a minimal sketch is given below.
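A minimal sketch of assembling the first training samples is given below. It assumes the one-to-one mapping between customer answers and AI inquiries is available as dictionaries; the field names are illustrative, not taken from the application.

```python
def build_first_training_samples(target_inquiry_corpora, answer_to_inquiry,
                                 inquiry_to_category, answer_to_intent):
    """Pair each target inquiry-related corpus with the inquiry category of its
    mapped AI inquiry corpus (using the stated one-to-one mapping) and keep the
    pre-labeled intent label."""
    samples = []
    for corpus in target_inquiry_corpora:
        inquiry = answer_to_inquiry[corpus]  # one-to-one mapping from the text
        samples.append({
            "inquiry_category": inquiry_to_category[inquiry],  # e.g. "Q3"
            "customer_answer": corpus,
            "intent_label": answer_to_intent[corpus],
        })
    return samples
```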
  • S5: Acquire the target non-inquiry-related corpus from the target non-inquiry-related corpus library, associate the target non-inquiry-related corpus with a preset inquiry category based on the intent label, and obtain a second training sample.
  • the target non-inquiry-related corpus is associated with a preset inquiry category to obtain a second training sample.
  • the second training sample belongs to the training sample in which the customer's intention does not depend on the AI query corpus.
  • associating the target non-inquiry-related corpus with a preset inquiry category based on the intent tag, and obtaining the second training sample includes:
  • sample equalization processing is performed on the target non-inquiry-related corpus corresponding to each of the intent tags, to obtain balanced corpus;
  • the balanced corpus corresponding to each of the intent labels is associated with a preset query category to obtain the second training sample.
  • In this embodiment, sample balancing is performed on the non-inquiry-related corpus under each intent label to prevent the sample counts under different intent labels from differing too much and affecting the subsequent training of the model; the association with preset inquiry categories is sketched below.
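Following the device description, the balanced corpus under each intent label is associated with a preset inquiry category with equal probability. The sketch below assumes the six categories Q1 to Q6 mentioned earlier; the dictionary layout is an assumption for illustration.

```python
import random

PRESET_INQUIRY_CATEGORIES = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]  # categories named in the text

def build_second_training_samples(balanced_by_intent, seed=7):
    """Associate each balanced non-inquiry-related corpus with an inquiry
    category chosen uniformly at random, keeping its intent label."""
    rng = random.Random(seed)
    samples = []
    for intent_label, corpora in balanced_by_intent.items():
        for corpus in corpora:
            samples.append({
                "inquiry_category": rng.choice(PRESET_INQUIRY_CATEGORIES),  # equal probability
                "customer_answer": corpus,
                "intent_label": intent_label,
            })
    return samples
```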
  • The quantity threshold includes a first quantity threshold and a second quantity threshold, wherein the first quantity threshold is greater than the second quantity threshold. Performing sample equalization processing on the target non-inquiry-related corpus corresponding to each intent label based on the preset quantity thresholds to obtain balanced corpus includes:
  • when the quantity of target non-inquiry-related corpus corresponding to the current intent label is greater than the first quantity threshold, randomly filtering the target non-inquiry-related corpus corresponding to the current intent label until its quantity is less than or equal to the first quantity threshold;
  • when the quantity of target non-inquiry-related corpus corresponding to the current intent label is less than the second quantity threshold, performing corpus expansion on the target non-inquiry-related corpus corresponding to the current intent label until its quantity is greater than or equal to the second quantity threshold.
  • the first number threshold may be set to 2500
  • the second number threshold may be set to 1000.
  • the specific values of the first quantity threshold and/or the second quantity threshold can be adjusted according to actual needs, as long as they are applicable.
  • For intent labels with more than 2,500 corpora, 2,500 non-inquiry-related corpora are randomly selected and retained.
  • For intent labels with fewer than 1,000 corpora, the corpus is expanded to 1,000.
  • The corpus of each intent label is limited to no more than 2,500 and no fewer than 1,000 corpora because the intent labels available for model training are otherwise seriously imbalanced.
  • Performing corpus expansion on the target non-inquiry-related corpus corresponding to the current intent label includes:
  • calling a preset random oversampling package and randomly copying the target non-inquiry-related corpus corresponding to the current intent label through the random oversampling package.
  • In this embodiment, the method of corpus expansion is to use Python to call the RandomOverSample (random oversampling) package.
  • With the random oversampling package, some corpora in the corpus can be randomly copied to expand the corpus to a predetermined size.
  • The random oversampling package is often used to randomly replicate and repeat minority-class samples; the goal is to make the number of minority-class samples equal to that of the majority class, obtaining a new balanced dataset. A sketch of the sample equalization step is given below.
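The sample equalization described above can be sketched as follows. The text mentions calling a Python random oversampling package; this sketch reproduces the same random screening and random duplication directly with the standard library so that it stays self-contained, and the helper name is an assumption.

```python
import random

FIRST_QUANTITY_THRESHOLD = 2500   # upper bound per intent label (from the text)
SECOND_QUANTITY_THRESHOLD = 1000  # lower bound per intent label (from the text)

def balance_by_intent(corpora_by_intent, seed=42):
    """corpora_by_intent: dict mapping intent label -> list of target
    non-inquiry-related corpora. Randomly screen labels above the first
    threshold and randomly duplicate corpora for labels below the second."""
    rng = random.Random(seed)
    balanced = {}
    for intent_label, corpora in corpora_by_intent.items():
        if len(corpora) > FIRST_QUANTITY_THRESHOLD:
            corpora = rng.sample(corpora, FIRST_QUANTITY_THRESHOLD)
        elif len(corpora) < SECOND_QUANTITY_THRESHOLD:
            extra = rng.choices(corpora, k=SECOND_QUANTITY_THRESHOLD - len(corpora))
            corpora = list(corpora) + extra
        balanced[intent_label] = list(corpora)
    return balanced
```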
  • S6: Use the first training sample and the second training sample as training corpus and output them, wherein the training corpus is used for training an intent recognition model.
  • a better training corpus is obtained, and the consistency between the accuracy in the training environment and the accuracy in the production environment is improved.
  • The intent recognition model trained with this corpus can identify customer intent more accurately.
  • the preset intent recognition model is trained by the training corpus, and the trained intent recognition model is obtained.
  • the above training corpus can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a chain of data blocks linked by cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the present application can be applied in the field of smart medical care, thereby promoting the construction of smart cities.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • the present application provides an embodiment of a training corpus generation device for an intent recognition model, which corresponds to the method embodiment shown in FIG. 2 .
  • the device can be specifically applied to various electronic devices.
  • The training corpus generation device 300 for the intent recognition model described in this embodiment includes: a matching module 301, an establishing module 302, a computing module 303, a generating module 304, an association module 305, and an output module 306.
  • The matching module 301 is configured to receive the AI inquiry corpus pre-labeled with inquiry categories and the customer answer corpus pre-labeled with intent labels, and to perform a screening operation on the customer answer corpus based on a preset regular expression to obtain inquiry-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship. The establishing module 302 is configured to establish an inquiry-related corpus library and a non-inquiry-related corpus library based on the inquiry-related corpus and the non-inquiry-related corpus, respectively. The computing module 303 is configured to calculate the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library, and to adjust the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity to obtain a target inquiry-related corpus library and a target non-inquiry-related corpus library.
  • The generating module 304 is configured to acquire the target inquiry-related corpus from the target inquiry-related corpus library, determine the inquiry category corresponding to the target inquiry-related corpus based on the AI inquiry corpus, and generate a first training sample based on that inquiry category and the target inquiry-related corpus.
  • The association module 305 is configured to acquire the target non-inquiry-related corpus from the target non-inquiry-related corpus library and, based on the intent label, associate the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample. The output module 306 is configured to use the first training sample and the second training sample as training corpus and output them, wherein the training corpus is used for training the intent recognition model.
  • The present application calculates the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library, and adjusts the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity, so that the determined target inquiry-related corpus and target non-inquiry-related corpus are more accurate.
  • By associating the target non-inquiry-related corpus with preset inquiry categories based on the intent labels, the problem of not filling an inquiry category into training corpus that does not depend on the AI inquiry corpus is solved without causing an explosion of the training corpus, which preserves the efficiency of model training.
  • The training corpus generated in this way keeps the accuracy of the intent recognition model at a high level.
  • The matching module 301 includes a matching sub-module, a presentation sub-module, a marking sub-module, and a generating sub-module.
  • The matching sub-module is configured to match the customer answer corpus based on a preset regular expression, use the successfully matched customer answer corpus as suspected inquiry-related corpus, and use the unmatched customer answer corpus as suspected non-inquiry-related corpus.
  • The presentation sub-module is configured to display the suspected inquiry-related corpus on a preset front-end page and notify a designated person to confirm the suspected inquiry-related corpus.
  • The marking sub-module is configured to, when it is recognized that the designated person has completed the confirmation, mark the suspected inquiry-related corpus as inquiry-related or non-inquiry-related based on the designated person's confirmation.
  • The generating sub-module is configured to use the suspected inquiry-related corpus marked as inquiry-related as inquiry-related corpus, and to use the suspected non-inquiry-related corpus together with the suspected inquiry-related corpus marked as non-inquiry-related as non-inquiry-related corpus.
  • The calculation module 303 includes a first vector sub-module, a second vector sub-module, a similarity calculation sub-module, and a similarity confirmation sub-module.
  • The first vector sub-module is configured to input the current inquiry-related corpus into a pre-trained language representation model to obtain inquiry-related word vectors. The second vector sub-module is configured to input the non-inquiry-related corpus into the pre-trained language representation model to obtain non-inquiry-related word vectors.
  • The similarity calculation sub-module is configured to calculate, by traversal, the cosine similarity between the current non-inquiry-related word vector and each of the inquiry-related word vectors.
  • The similarity confirmation sub-module is configured to take the cosine similarity with the largest value as the similarity between the current non-inquiry-related corpus and the inquiry-related corpus library.
  • The computing module 303 further includes a first identification sub-module and a first allocation sub-module.
  • The first identification sub-module is configured to identify whether the similarity is greater than a preset first similarity threshold and, when it is, to use the corresponding non-inquiry-related corpus as to-be-confirmed corpus and notify the designated person to classify the to-be-confirmed corpus.
  • The first allocation sub-module is configured to, when it is recognized that the designated person has completed the classification of the to-be-confirmed corpus, allocate the to-be-confirmed corpus to the non-inquiry-related corpus library or the inquiry-related corpus library according to the designated person's classification, so as to obtain a target inquiry-related corpus library and a target non-inquiry-related corpus library.
  • The above calculation module 303 is further configured to identify whether the similarity is greater than a preset second similarity threshold and, when it is, to delete the corresponding non-inquiry-related corpus from the non-inquiry-related corpus library, so as to obtain a target inquiry-related corpus library and a target non-inquiry-related corpus library.
  • In some embodiments, the calculation module 303 further includes a first calculation sub-module, a second identification sub-module, a second allocation sub-module, a second calculation sub-module, a third identification sub-module, a third allocation sub-module, a third calculation sub-module, and a deletion sub-module.
  • The first calculation sub-module is configured to calculate the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library.
  • The second identification sub-module is configured to identify whether the similarity is greater than the preset first similarity threshold and, when it is, to use the corresponding non-inquiry-related corpus as first to-be-confirmed corpus and notify the designated person to classify the first to-be-confirmed corpus.
  • The second allocation sub-module is configured to, when it is recognized that the designated person has completed the classification of the first to-be-confirmed corpus, allocate the first to-be-confirmed corpus to the non-inquiry-related corpus library or the inquiry-related corpus library according to the designated person's classification, so as to obtain a first inquiry-related corpus library and a first non-inquiry-related corpus library.
  • The second calculation sub-module is configured to calculate the similarity between each first non-inquiry-related corpus in the first non-inquiry-related corpus library and the first inquiry-related corpus library; the third identification sub-module and the third allocation sub-module are configured, in the same manner, to obtain second to-be-confirmed corpus and allocate it to the first non-inquiry-related corpus library or the first inquiry-related corpus library, so as to obtain a second inquiry-related corpus library and a second non-inquiry-related corpus library.
  • The third calculation sub-module is configured to calculate the second similarity between each second non-inquiry-related corpus in the second non-inquiry-related corpus library and the second inquiry-related corpus library.
  • The deletion sub-module is configured to identify whether the second similarity is greater than a preset second similarity threshold and, when it is, to delete the corresponding second non-inquiry-related corpus from the second non-inquiry-related corpus library, so as to obtain the target inquiry-related corpus library and the target non-inquiry-related corpus library.
  • The association module 305 includes a determination sub-module, an equalization sub-module, and an association sub-module.
  • The determination sub-module is configured to determine the target non-inquiry-related corpus corresponding to each of the intent labels.
  • The equalization sub-module is configured to perform sample equalization processing on the target non-inquiry-related corpus corresponding to each of the intent labels based on a preset quantity threshold, so as to obtain balanced corpus.
  • The association sub-module is configured to associate the balanced corpus corresponding to each of the intent labels with a preset inquiry category based on a preset equal probability, so as to obtain the second training sample.
  • The quantity threshold includes a first quantity threshold and a second quantity threshold, wherein the first quantity threshold is greater than the second quantity threshold, and the equalization sub-module includes an identification unit, a screening unit, and an expansion unit.
  • The identification unit is configured to identify whether the quantity of target non-inquiry-related corpus corresponding to the current intent label is greater than the first quantity threshold or less than the second quantity threshold.
  • The screening unit is configured to, when the quantity of target non-inquiry-related corpus corresponding to the current intent label is greater than the first quantity threshold, randomly screen the target non-inquiry-related corpus corresponding to the current intent label until its quantity is less than or equal to the first quantity threshold.
  • The expansion unit is configured to, when the quantity of target non-inquiry-related corpus corresponding to the current intent label is less than the second quantity threshold, perform corpus expansion on the target non-inquiry-related corpus corresponding to the current intent label until its quantity is greater than or equal to the second quantity threshold.
  • In some embodiments, the expansion unit is further configured to call a preset random oversampling package and use it to randomly copy the target non-inquiry-related corpus corresponding to the current intent label.
  • This application calculates the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library, and adjusts the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity, so that the determined target inquiry-related corpus and target non-inquiry-related corpus are more accurate.
  • By associating the target non-inquiry-related corpus with preset inquiry categories based on the intent labels, the problem of not filling an inquiry category into training corpus that does not depend on the AI inquiry corpus is solved without causing an explosion of the training corpus, which preserves the efficiency of model training.
  • The training corpus generated in this way keeps the accuracy of the intent recognition model at a high level.
  • FIG. 4 is a block diagram of a basic structure of a computer device according to this embodiment.
  • The computer device 200 includes a memory 201, a processor 202, and a network interface 203 that communicate with each other through a system bus. It should be noted that only the computer device 200 with components 201-203 is shown in the figure, but it should be understood that implementing all of the shown components is not required, and more or fewer components may be implemented instead. Those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to pre-set or stored instructions, and its hardware includes but is not limited to a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded equipment, and the like.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
  • the memory 201 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), Magnetic Memory, Magnetic Disk, Optical Disk, etc.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the memory 201 may be an internal storage unit of the computer device 200 , such as a hard disk or a memory of the computer device 200 .
  • the memory 201 may also be an external storage device of the computer device 200, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 201 may also include both an internal storage unit of the computer device 200 and an external storage device thereof.
  • the memory 201 is generally used to store the operating system and various application software installed on the computer device 200 , such as computer-readable instructions for a method for generating training corpus of an intent recognition model.
  • the memory 201 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 202 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 202 is typically used to control the overall operation of the computer device 200 .
  • the processor 202 is configured to execute computer-readable instructions stored in the memory 201 or process data, for example, computer-readable instructions for executing a method for generating training corpus of the intent recognition model.
  • the network interface 203 may include a wireless network interface or a wired network interface, and the network interface 203 is generally used to establish a communication connection between the computer device 200 and other electronic devices.
  • In this way, the problem of not filling an inquiry category into training corpus that does not depend on the AI inquiry corpus is solved; at the same time, better training corpus is obtained without causing an explosion of the training corpus, and the accuracy of the intent recognition model's ability to identify customer intent is effectively improved through the training corpus.
  • the present application also provides another embodiment, that is, to provide a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor to The at least one processor is caused to perform the above-described method for generating training corpus of an intent recognition model.
  • In this way, the problem of not filling an inquiry category into training corpus that does not depend on the AI inquiry corpus is solved; at the same time, better training corpus is obtained without causing an explosion of the training corpus, and the accuracy of the intent recognition model's ability to identify customer intent is effectively improved through the training corpus.
  • The method of the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application can be embodied in the form of a software product in essence or in a part that contributes to the prior art, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM), including several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of this application.

Abstract

The present method belongs to the field of big data and is applied in the field of smart medical care. Disclosed are a training corpus generation method for an intention recognition model, and a related device thereof. The method comprises: receiving an AI inquiry corpus pre-annotated with an inquiry class, and a customer answer corpus pre-annotated with an intention tag, wherein the customer answer corpus comprises an inquiry related corpus and a non-inquiry related corpus; establishing an inquiry related corpus library and a non-inquiry related corpus library; adjusting the inquiry related corpus library and the non-inquiry related corpus library on the basis of the similarity between the non-inquiry related corpus and the inquiry related corpus library, so as to obtain a target inquiry related corpus library and a target non-inquiry related corpus library; establishing a first training sample on the basis of the target inquiry related corpus library; establishing a second training sample on the basis of the intention tag and the non-inquiry related corpus library; and taking the first training sample and the second training sample as a training corpus and outputting same. The training corpus can be stored in a blockchain. By means of the method, the quality of a training corpus is improved.

Description

意图识别模型的训练语料生成方法及其相关设备Training corpus generation method for intent recognition model and related equipment
本申请要求于2020年11月17日提交中国专利局、申请号为202011288871.X,发明名称为“意图识别模型的训练语料生成方法及其相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application filed on November 17, 2020 with the application number 202011288871.X and the invention titled "Method for generating training corpus for intent recognition model and related equipment", the entire contents of which are Incorporated herein by reference.
技术领域technical field
本申请涉及大数据技术领域,尤其涉及意图识别模型的训练语料生成方法及其相关设备。The present application relates to the field of big data technology, and in particular, to a training corpus generation method for an intent recognition model and related devices.
背景技术Background technique
随着计算机技术的不断变革和发展,人工智能已经逐渐应用于各行各业中,改善人们的生活。人机对话是人工智能的重要发展领域,对话场景复杂多样,要求计算机能够在对话的过程中精准识别客户意图,以便于更好的展开对话。With the continuous change and development of computer technology, artificial intelligence has been gradually applied in all walks of life to improve people's lives. Human-computer dialogue is an important development area of artificial intelligence. The dialogue scenes are complex and diverse, requiring computers to accurately identify customer intentions in the process of dialogue, so as to facilitate better dialogue.
目前,人机对话多是采用意图识别模型来识别客户的意图,由于存在有些场景中客户意图依赖AI(Artificial Intelligence,人工智能)问询,有些场景中客户意图不依赖AI问询。因此在意图识别模型训练过程大多是根据依赖情况决定训练样本中是否填充对应的AI问询。At present, human-machine dialogues mostly use intent recognition models to identify customer intentions. In some scenarios, customer intentions rely on AI (Artificial Intelligence) inquiries, and in some scenarios, customer intentions do not rely on AI inquiries. Therefore, in the training process of the intent recognition model, it is mostly determined whether to fill the corresponding AI query in the training sample according to the dependency situation.
但是,发明人意识到,由于在实际生产过程中无法判断客户意图对AI问询的依赖情况,输入模型的预测参数都是含有AI问询的。导致模型训练模式和模型预测模式不一致,进而使得意图识别模型在生产上的精准度,比在训练环境中的精准度低。However, the inventor realized that since the dependence of customer intentions on AI queries cannot be judged in the actual production process, the prediction parameters of the input model all contain AI queries. As a result, the model training mode and the model prediction mode are inconsistent, and the accuracy of the intent recognition model in production is lower than that in the training environment.
发明内容SUMMARY OF THE INVENTION
本申请实施例的目的在于提出一种意图识别模型的训练语料生成方法及其相关设备,提高意图识别模型的训练语料的质量。The purpose of the embodiments of the present application is to propose a training corpus generation method for an intent recognition model and related equipment, so as to improve the quality of the training corpus of the intent recognition model.
为了解决上述技术问题,本申请实施例提供一种意图识别模型的训练语料生成方法,采用了如下所述的技术方案:In order to solve the above technical problems, the embodiment of the present application provides a training corpus generation method for an intent recognition model, which adopts the following technical solutions:
一种意图识别模型的训练语料生成方法,包括下述步骤:A training corpus generation method for an intent recognition model, comprising the following steps:
接收预标注问询类别的AI问询语料和预标注意图标签的客户回答语料,并基于预设的正则表达式对所述客户回答语料进行筛选操作,得到问询相关语料以及非问询相关语料,其中,所述客户回答语料与所述AI问询语料具有一一对应的映射关系;Receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus. Corpus, wherein the customer answer corpus and the AI query corpus have a one-to-one mapping relationship;
分别基于所述问询相关语料和所述非问询相关语料建立问询相关语料库和非问询相关语料库;establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus and the non-inquiry-related corpus, respectively;
计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度,并基于所述相似度调整所述问询相关语料库和所述非问询相关语料库,获得目标问询相关语料库和目标非问询相关语料库;Calculate the similarity between each of the non-inquiry-related corpora and the inquiry-related corpus in the non-inquiry-related corpus, and adjust the inquiry-related corpus and the non-inquiry-related corpus based on the similarity Relevant corpus, obtain the target inquiry-related corpus and the target non-inquiry-related corpus;
获取所述目标问询相关语料库中的目标问询相关语料,并基于所述AI问询语料确定所述目标问询相关语料对应的问询类别,基于所述目标问询相关语料对应的问询类别和所述目标问询相关语料生成第一训练样本;Obtain the target query related corpus in the target query related corpus, and determine the query category corresponding to the target query related corpus based on the AI query corpus, and based on the target query related corpus The corresponding query The category and the target query related corpus generate a first training sample;
获取所述目标非问询相关语料库中的目标非问询相关语料,基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本;acquiring the target non-inquiry-related corpus in the target non-inquiry-related corpus, and based on the intent label, associating the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample;
将所述第一训练样本和所述第二训练样本作为训练语料并输出,其中,所述训练语料用于意图识别模型的训练。The first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
为了解决上述技术问题,本申请实施例还提供一种意图识别模型的训练语料生成装置,采用了如下所述的技术方案:In order to solve the above technical problems, the embodiment of the present application also provides a training corpus generation device for an intent recognition model, which adopts the following technical solutions:
一种意图识别模型的训练语料生成装置,包括:A training corpus generation device for an intent recognition model, comprising:
匹配模块,用于接收预标注问询类别的AI问询语料和预标注意图标签的客户回答语料,并基于预设的正则表达式对所述客户回答语料进行筛选操作,得到问询相关语料以及非问询相关语料,其中,所述客户回答语料与所述AI问询语料具有一一对应的映射关系;The matching module is used to receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship;
建立模块,用于分别基于所述问询相关语料和所述非问询相关语料建立问询相关语料库和非问询相关语料库;an establishment module for establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus and the non-inquiry-related corpus respectively;
计算模块,用于计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度,并基于所述相似度调整所述问询相关语料库和所述非问询相关语料库,获得目标问询相关语料库和目标非问询相关语料库;A calculation module, configured to calculate the similarity between each non-inquiry-related corpus and the inquiry-related corpus in the non-inquiry-related corpus, and adjust the inquiry-related corpus and the inquiry-related corpus based on the similarity. For the non-inquiry-related corpus, a target inquiry-related corpus and a target non-inquiry-related corpus are obtained;
生成模块,用于获取所述目标问询相关语料库中的目标问询相关语料,并基于所述AI问询语料确定所述目标问询相关语料对应的问询类别,基于所述目标问询相关语料对应的问询类别和所述目标问询相关语料生成第一训练样本;The generating module is configured to obtain the target query related corpus in the target query related corpus, and determine the query category corresponding to the target query related corpus based on the AI query corpus, and based on the target query related corpus The query category corresponding to the corpus and the target query-related corpus generate a first training sample;
关联模块,用于获取所述目标非问询相关语料库中的目标非问询相关语料,基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本;以及The association module is used to obtain the target non-inquiry-related corpus in the target non-inquiry-related corpus, and based on the intent tag, associate the target non-inquiry-related corpus with a preset inquiry category, and obtain the first two training samples; and
输出模块,用于将所述第一训练样本和所述第二训练样本作为训练语料并输出,其中,所述训练语料用于意图识别模型的训练。The output module is used for outputting the first training sample and the second training sample as training corpus, wherein the training corpus is used for training the intention recognition model.
为了解决上述技术问题,本申请实施例还提供一种计算机设备,采用了如下所述的技术方案:In order to solve the above-mentioned technical problems, the embodiment of the present application also provides a computer device, which adopts the following technical solutions:
一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现如下所述的意图识别模型的训练语料生成方法:A computer device, comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the processor executes the computer-readable instructions, the following method for generating a training corpus of an intent recognition model is implemented:
receiving an AI inquiry corpus pre-labeled with inquiry categories and a customer answer corpus pre-labeled with intent labels, and performing a screening operation on the customer answer corpus based on preset regular expressions to obtain inquiry-related corpus entries and non-inquiry-related corpus entries, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship;
establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus entries and the non-inquiry-related corpus entries, respectively;
calculating the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus;
acquiring the target inquiry-related corpus entries in the target inquiry-related corpus, determining the inquiry category corresponding to each target inquiry-related corpus entry based on the AI inquiry corpus, and generating first training samples based on the target inquiry-related corpus entries and their corresponding inquiry categories;
acquiring the target non-inquiry-related corpus entries in the target non-inquiry-related corpus and, based on the intent labels, associating the target non-inquiry-related corpus entries with preset inquiry categories to obtain second training samples;
outputting the first training samples and the second training samples as training corpora, wherein the training corpora are used for training the intent recognition model.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solution:
A computer-readable storage medium, on which computer-readable instructions are stored, wherein when the computer-readable instructions are executed by a processor, the following training corpus generation method for an intent recognition model is implemented:
receiving an AI inquiry corpus pre-labeled with inquiry categories and a customer answer corpus pre-labeled with intent labels, and performing a screening operation on the customer answer corpus based on preset regular expressions to obtain inquiry-related corpus entries and non-inquiry-related corpus entries, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship;
establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus entries and the non-inquiry-related corpus entries, respectively;
calculating the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus;
acquiring the target inquiry-related corpus entries in the target inquiry-related corpus, determining the inquiry category corresponding to each target inquiry-related corpus entry based on the AI inquiry corpus, and generating first training samples based on the target inquiry-related corpus entries and their corresponding inquiry categories;
acquiring the target non-inquiry-related corpus entries in the target non-inquiry-related corpus and, based on the intent labels, associating the target non-inquiry-related corpus entries with preset inquiry categories to obtain second training samples;
outputting the first training samples and the second training samples as training corpora, wherein the training corpora are used for training the intent recognition model.
Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects:
The present application calculates the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjusts the inquiry-related corpus and the non-inquiry-related corpus based on the similarity, so that the determined target inquiry-related corpus entries and target non-inquiry-related corpus entries are highly accurate. By associating the target non-inquiry-related corpus entries with preset inquiry categories based on the intent labels, the problem that the inquiry category is left unfilled for training corpora that do not depend on the AI inquiry corpus is solved, without causing an explosion of training corpora, which guarantees the efficiency of model training. The training corpora generated in this way enable the intent recognition model to maintain a high level of accuracy.
Description of the drawings
In order to illustrate the solutions in the present application more clearly, the accompanying drawings used in the description of the embodiments of the present application are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
FIG. 2 is a flowchart of an embodiment of a training corpus generation method for an intent recognition model according to the present application;
FIG. 3 is a schematic structural diagram of an embodiment of a training corpus generation apparatus for an intent recognition model according to the present application;
FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
Reference numerals: 200, computer device; 201, memory; 202, processor; 203, network interface; 300, training corpus generation apparatus for an intent recognition model; 301, matching module; 302, establishing module; 303, calculating module; 304, generating module; 305, association module; 306, output module.
Detailed description
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terms used in the specification of the application are only for the purpose of describing specific embodiments and are not intended to limit the application. The terms "comprising" and "having" and any variations thereof in the specification, claims and the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second" and the like in the specification and claims of the present application or the above drawings are used to distinguish different objects, rather than to describe a specific order.
Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102 and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients and social platform software.
The terminal devices 101, 102 and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers and the like.
The server 105 may be a server that provides various services, for example a background server that provides support for the pages displayed on the terminal devices 101, 102 and 103.
It should be noted that the training corpus generation method for an intent recognition model provided by the embodiments of the present application is generally executed by the server/terminal device, and accordingly the training corpus generation apparatus for the intent recognition model is generally arranged in the server/terminal device.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
Continuing to refer to FIG. 2, a flowchart of an embodiment of a training corpus generation method for an intent recognition model according to the present application is shown. The training corpus generation method for an intent recognition model comprises the following steps:
S1: receiving an AI inquiry corpus pre-labeled with inquiry categories and a customer answer corpus pre-labeled with intent labels, and performing a screening operation on the customer answer corpus based on preset regular expressions to obtain inquiry-related corpus entries and non-inquiry-related corpus entries, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship.
In this embodiment, annotators label the customer answer corpus under each inquiry category with intent labels in advance, where the inquiry categories may include six categories, Q1 to Q6. The AI inquiry corpus pre-labeled with inquiry categories and the customer answer corpus pre-labeled with intent labels are received, and regular-expression matching is applied to the customer answer corpus to determine whether a customer answer is related to the AI inquiry: the successfully matched customer answers are taken as inquiry-related corpus entries, and the remaining customer answers are taken as non-inquiry-related corpus entries, which facilitates subsequent further processing of the customer answer corpus.
Examples of inquiry-related corpus entries are as follows:
[Table of example inquiry-related corpus entries; see Figure PCTCN2021090462-appb-000001 of the original filing.]
In this embodiment, the electronic device on which the training corpus generation method for an intent recognition model runs (for example, the server/terminal device shown in FIG. 1) may receive the AI inquiry corpus and the customer answer corpus through a wired connection or a wireless connection. It should be pointed out that the above wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra wideband) connection, and other wireless connection methods now known or developed in the future.
Specifically, matching the customer answer corpus based on the preset regular expressions, taking the successfully matched customer answers as inquiry-related corpus entries and the unmatched customer answers as non-inquiry-related corpus entries comprises:
matching the customer answer corpus based on the preset regular expressions, taking the successfully matched customer answers as suspected inquiry-related corpus entries and the unmatched customer answers as suspected non-inquiry-related corpus entries;
displaying the suspected inquiry-related corpus entries on a preset front-end page, and notifying designated personnel to confirm the suspected inquiry-related corpus entries;
when it is recognized that the designated personnel have completed the confirmation, marking the suspected inquiry-related corpus entries as inquiry-related or non-inquiry-related based on the confirmation by the designated personnel;
taking the suspected inquiry-related corpus entries marked as inquiry-related as the inquiry-related corpus entries, and taking the suspected non-inquiry-related corpus entries together with the suspected inquiry-related corpus entries marked as non-inquiry-related as the non-inquiry-related corpus entries.
In this embodiment, regular-expression matching is used to extract the suspected inquiry-related corpus entries from the customer answer corpus; the remaining entries, i.e., those that fail to match, are suspected non-inquiry-related corpus entries. The suspected inquiry-related corpus entries are handed over to designated personnel for confirmation, and the confirmed inquiry-related corpus entries are determined. The suspected inquiry-related corpus entries marked as inquiry-related are taken as inquiry-related corpus entries, while the suspected non-inquiry-related corpus entries, together with the suspected inquiry-related corpus entries marked as non-inquiry-related, are taken as non-inquiry-related corpus entries. Alternatively, the non-inquiry-related corpus entries may be obtained by removing the confirmed inquiry-related corpus entries from the complete customer answer corpus. Through the confirmation by the designated personnel, the inquiry-related corpus entries can be further determined, and the accuracy of dividing the customer answer corpus is improved. A minimal sketch of the regular-expression screening follows.
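The following Python sketch illustrates, under assumptions not stated in the application, how the regular-expression screening of step S1 might be organized; the acknowledgement pattern, function name and example data are hypothetical and are only meant to show the split into suspected inquiry-related and suspected non-inquiry-related entries.
    import re

    # Hypothetical acknowledgement pattern; the actual preset regular
    # expressions are business-defined and are not given in the application.
    ACKNOWLEDGEMENT_PATTERN = re.compile(r"(好的|嗯|知道了|晓得|了解|OK|ok)")

    def screen_customer_answers(customer_answers):
        """Split customer answers into suspected inquiry-related and
        suspected non-inquiry-related corpus entries."""
        suspected_related, suspected_unrelated = [], []
        for answer in customer_answers:
            if ACKNOWLEDGEMENT_PATTERN.search(answer):
                suspected_related.append(answer)
            else:
                suspected_unrelated.append(answer)
        return suspected_related, suspected_unrelated

    # The suspected inquiry-related entries would then be shown to designated
    # personnel on a front-end page for manual confirmation.
    related, unrelated = screen_customer_answers(["嗯嗯，知道了", "我存好了"])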
S2: establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus entries and the non-inquiry-related corpus entries, respectively.
In this embodiment, establishing the inquiry-related corpus and the non-inquiry-related corpus facilitates subsequent further processing of the inquiry-related corpus entries and the non-inquiry-related corpus entries.
S3: calculating the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus.
In this embodiment, the similarity between each non-inquiry-related corpus entry and the inquiry-related corpus is calculated, and the entries in the inquiry-related corpus and the non-inquiry-related corpus are adjusted based on the similarity, so that a more rigorous target inquiry-related corpus and target non-inquiry-related corpus are obtained.
Specifically, calculating the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus comprises:
inputting the current inquiry-related corpus entries into a pre-trained language representation model to obtain inquiry-related word vectors;
inputting the non-inquiry-related corpus entries into the pre-trained language representation model to obtain non-inquiry-related word vectors;
traversing and calculating the cosine similarity between the current non-inquiry-related word vector and each inquiry-related word vector;
taking the cosine similarity with the largest value as the similarity between the current non-inquiry-related corpus entry and the inquiry-related corpus.
In this embodiment, the language representation model is called to embed the inquiry-related corpus entries, converting each inquiry-related corpus entry into a 768-dimensional inquiry-related word vector, where each inquiry-related word vector represents the embedding of one corpus entry; an embedding represents a corpus entry with a low-dimensional vector. The language representation model may be a BERT (Bidirectional Encoder Representations from Transformers) model; the BERT model has broad generality, can capture longer-distance dependencies and represents information bidirectionally. The language representation model is likewise called to convert the non-inquiry-related corpus entries into 768-dimensional non-inquiry-related word vectors. The cosine similarity between each non-inquiry-related word vector and every inquiry-related word vector is calculated by traversal. After the traversal, the maximum cosine similarity is taken as the similarity between the current non-inquiry-related corpus entry and the inquiry-related corpus.
Examples of inquiry-related word vectors are as follows:
    Inquiry-related corpus entry      Inquiry-related word vector
    是的，好 ("yes, good")            [0.3, 0.2, 0.0005, ..., 0.006]
    嗯嗯 ("uh-huh")                   [0.1, 0.003, 0.002, ..., 0.03]
    了解 ("understood")               [0.13, 0.001, 0.05, ..., 0.07]
    晓得了 ("got it")                 [0.27, 0.006, 0.04, ..., 0.4]
    OK                                [0.09, 0.03, 0.08, ..., 0.004]
    知道了 ("got it")                 [0.19, 0.3, 0.02, ..., 0.008]
    ...
An example of the calculation process is as follows: for a non-inquiry-related corpus entry such as "我存好了" ("I have saved it"), the corresponding non-inquiry-related word vector is a 768-dimensional vector [0.07, 0.002, 0.04, ..., 0.009]. The cosine similarity between this non-inquiry-related word vector and each inquiry-related word vector is calculated. For two vectors A and B of the same dimension, the cosine similarity is computed as:
similarity = cos(A, B) = (A · B) / (‖A‖ ‖B‖) = Σᵢ AᵢBᵢ / ( √(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²) )
Using the above formula, the cosine similarity between the non-inquiry-related corpus entry "我存好了" and every entry in the inquiry-related corpus is calculated; after the calculation, the maximum cosine similarity is taken as the similarity between the current non-inquiry-related corpus entry and the inquiry-related corpus. A minimal sketch of this embedding and similarity calculation follows.
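As a non-authoritative illustration of step S3, the sketch below assumes the Hugging Face transformers library and the bert-base-chinese checkpoint; the application only states that the language representation model may be a BERT model and does not name a specific implementation, and the function names are hypothetical.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")
    model.eval()

    def embed(sentence):
        """Return a 768-dimensional sentence embedding (here, the [CLS] vector)."""
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs.last_hidden_state[:, 0, :].squeeze(0)

    def cosine(a, b):
        return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

    def similarity_to_corpus(entry, inquiry_related_entries):
        """Maximum cosine similarity between one non-inquiry-related entry
        and every entry in the inquiry-related corpus."""
        entry_vec = embed(entry)
        return max(cosine(entry_vec, embed(e)) for e in inquiry_related_entries)

    # Example from the text: compare "我存好了" with the inquiry-related corpus.
    score = similarity_to_corpus("我存好了", ["是的，好", "嗯嗯", "了解", "晓得了", "OK", "知道了"])
In practice, pre-computing and caching the embeddings of the inquiry-related corpus would avoid re-embedding it for every non-inquiry-related entry; this is an implementation choice, not something stated in the application.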
Further, adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain the target inquiry-related corpus and the target non-inquiry-related corpus comprises:
identifying whether the similarity is greater than a preset first similarity threshold, and when the similarity is greater than the preset first similarity threshold, taking the corresponding non-inquiry-related corpus entry as a corpus entry to be confirmed and notifying designated personnel to classify the corpus entry to be confirmed;
when it is recognized that the designated personnel have completed the classification of the corpus entry to be confirmed, allocating the corpus entry to be confirmed to the non-inquiry-related corpus or the inquiry-related corpus according to the classification by the designated personnel, so as to obtain the target inquiry-related corpus and the target non-inquiry-related corpus.
In this embodiment, when the similarity is less than or equal to the preset first similarity threshold, the entry is considered to still belong to the non-inquiry-related corpus entries. For example, if the similarity is 0.3, which is less than the first similarity threshold of 0.6, the entry remains a non-inquiry-related corpus entry; if the similarity is 0.9, which is greater than the first similarity threshold of 0.6, the entry becomes a corpus entry to be confirmed. When a non-inquiry-related corpus entry is taken as a corpus entry to be confirmed, it is extracted from the non-inquiry-related corpus for reallocation. In this way, a more rigorous target inquiry-related corpus and target non-inquiry-related corpus are obtained.
In addition, as another embodiment of the present application, adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain the target inquiry-related corpus and the target non-inquiry-related corpus comprises:
identifying whether the similarity is greater than a preset second similarity threshold, and when the similarity is greater than the preset second similarity threshold, deleting the corresponding non-inquiry-related corpus entry from the non-inquiry-related corpus to obtain the target inquiry-related corpus and the target non-inquiry-related corpus.
In this embodiment, directly deleting the corresponding non-inquiry-related corpus entry when the similarity is greater than the preset second similarity threshold can effectively improve the processing speed of the computer. A minimal sketch of this threshold-based adjustment follows.
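The sketch below, an assumption rather than the application's implementation, shows one way the two adjustment embodiments could be organized around the 0.6 threshold used in the example: entries above the threshold are either routed to manual confirmation or, in the alternative embodiment, deleted outright. The callback and function names are hypothetical.
    FIRST_SIMILARITY_THRESHOLD = 0.6   # value taken from the example in the text

    def adjust_corpora(non_related_entries, inquiry_related_entries, similarity_fn,
                       confirmed_as_related=lambda entry: False,
                       delete_instead_of_confirm=False):
        """Return (target_inquiry_related, target_non_inquiry_related).

        similarity_fn(entry, corpus) should return the maximum cosine similarity,
        e.g. the similarity_to_corpus() sketch shown earlier."""
        target_related = list(inquiry_related_entries)
        target_non_related = []
        for entry in non_related_entries:
            score = similarity_fn(entry, inquiry_related_entries)
            if score <= FIRST_SIMILARITY_THRESHOLD:
                target_non_related.append(entry)   # remains non-inquiry-related
            elif delete_instead_of_confirm:
                continue                           # alternative embodiment: drop the entry
            elif confirmed_as_related(entry):
                target_related.append(entry)       # designated personnel: inquiry-related
            else:
                target_non_related.append(entry)   # designated personnel: non-inquiry-related
        return target_related, target_non_related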
As yet another embodiment of the present application, calculating the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain the target inquiry-related corpus and the target non-inquiry-related corpus may also comprise:
calculating the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus;
identifying whether the similarity is greater than a preset first similarity threshold, and when the similarity is greater than the preset first similarity threshold, taking the corresponding non-inquiry-related corpus entry as a first corpus entry to be confirmed and notifying designated personnel to classify the first corpus entry to be confirmed;
when it is recognized that the designated personnel have completed the classification of the first corpus entry to be confirmed, allocating the first corpus entry to be confirmed to the non-inquiry-related corpus or the inquiry-related corpus according to the classification by the designated personnel, so as to obtain a first inquiry-related corpus and a first non-inquiry-related corpus;
calculating a first similarity between each first non-inquiry-related corpus entry in the first non-inquiry-related corpus and the first inquiry-related corpus;
identifying whether the first similarity is greater than the preset first similarity threshold, and when the first similarity is greater than the preset first similarity threshold, taking the corresponding first non-inquiry-related corpus entry as a second corpus entry to be confirmed and notifying the designated personnel to classify the second corpus entry to be confirmed;
when it is recognized that the designated personnel have completed the classification of the second corpus entry to be confirmed, allocating the second corpus entry to be confirmed to the first non-inquiry-related corpus or the first inquiry-related corpus according to the classification by the designated personnel, so as to obtain a second inquiry-related corpus and a second non-inquiry-related corpus;
calculating a second similarity between each second non-inquiry-related corpus entry in the second non-inquiry-related corpus and the second inquiry-related corpus;
identifying whether the second similarity is greater than a preset second similarity threshold, and when the second similarity is greater than the preset second similarity threshold, deleting the corresponding second non-inquiry-related corpus entry from the second non-inquiry-related corpus to obtain the target inquiry-related corpus and the target non-inquiry-related corpus.
In this embodiment, the designated personnel in the present application may be annotators. If the similarity is greater than the first similarity threshold, the entry is included among the corpus entries requiring business confirmation, i.e., the corresponding non-inquiry-related corpus entry is taken as a corpus entry to be confirmed, and this part of the corpus is returned to the annotators. The annotators label whether each corpus entry to be confirmed is related to the AI inquiry. According to the annotators' labels, the corpus entries to be confirmed that are related to the AI inquiry are added to the inquiry-related corpus, and those that are not related to the AI inquiry are added to the non-inquiry-related corpus. In practice, after two such rounds, i.e., two rounds of business confirmation, the inquiry-related corpus is considered sufficiently rich. To save labeling labor cost, in the third round, if the maximum similarity (the second similarity described above) is greater than the preset second similarity threshold, the entry is deleted directly; if it is less than the second similarity threshold, the entry remains a non-inquiry-related corpus entry and stays in the non-inquiry-related corpus. Through the above process, the target inquiry-related corpus and the target non-inquiry-related corpus are obtained.
S4: acquiring the target inquiry-related corpus entries in the target inquiry-related corpus, determining the inquiry category corresponding to each target inquiry-related corpus entry based on the AI inquiry corpus, and generating first training samples based on the target inquiry-related corpus entries and their corresponding inquiry categories.
In this embodiment, the generated first training samples are training samples in which the customer intent depends on the AI inquiry corpus. Generating the first training samples from the target inquiry-related corpus entries and the corresponding inquiry categories guarantees the dependency between the first training samples and the customer intent.
S5: acquiring the target non-inquiry-related corpus entries in the target non-inquiry-related corpus and, based on the intent labels, associating the target non-inquiry-related corpus entries with preset inquiry categories to obtain second training samples.
In this embodiment, the target non-inquiry-related corpus entries are associated with preset inquiry categories based on the intent labels to obtain the second training samples. The second training samples are training samples in which the customer intent does not depend on the AI inquiry corpus. By associating the non-inquiry-related corpus entries with preset inquiry categories, the inquiry-category field of the obtained second training samples is not left vacant.
Specifically, associating the target non-inquiry-related corpus entries with preset inquiry categories based on the intent labels to obtain the second training samples comprises:
determining the target non-inquiry-related corpus entries corresponding to each intent label;
performing sample balancing on the target non-inquiry-related corpus entries corresponding to each intent label based on preset quantity thresholds to obtain balanced corpus entries;
associating the balanced corpus entries corresponding to each intent label with the preset inquiry categories with equal preset probability to obtain the second training samples.
In this embodiment, the non-inquiry-related corpus entries under each intent label are sample-balanced, so that the samples under different intent labels do not differ too greatly in number, which would otherwise impair the subsequent training of the model. The entries under each intent label are filled with the categories Q1 to Q6 with equal probability, so that the inquiry-category field of the second training samples is not vacant while each balanced corpus entry whose intent does not depend on the AI inquiry is filled uniformly with equal probability, avoiding sample bias. A minimal sketch of this equal-probability filling follows.
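A minimal sketch of the equal-probability filling is given below; it assumes a simple dictionary representation of the training samples, since the sample structure is not specified in the application.
    import random

    INQUIRY_CATEGORIES = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]   # categories named in the text

    def build_second_training_samples(balanced_entries_by_label):
        """balanced_entries_by_label maps each intent label to its balanced corpus entries."""
        samples = []
        for intent_label, entries in balanced_entries_by_label.items():
            for entry in entries:
                # Pair each entry with one of Q1-Q6 chosen uniformly at random,
                # so the inquiry-category field is never left vacant.
                category = random.choice(INQUIRY_CATEGORIES)
                samples.append({"inquiry_category": category,
                                "customer_answer": entry,
                                "intent_label": intent_label})
        return samples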
Wherein the quantity thresholds include a first quantity threshold and a second quantity threshold, the first quantity threshold being greater than the second quantity threshold, and performing sample balancing on the target non-inquiry-related corpus entries corresponding to each intent label based on the preset quantity thresholds to obtain the balanced corpus entries comprises:
identifying whether the number of target non-inquiry-related corpus entries corresponding to the current intent label is greater than the first quantity threshold or less than the second quantity threshold;
when the number of target non-inquiry-related corpus entries corresponding to the current intent label is greater than the first quantity threshold, randomly screening the target non-inquiry-related corpus entries corresponding to the current intent label until their number is less than or equal to the first quantity threshold;
when the number of target non-inquiry-related corpus entries corresponding to the current intent label is less than the second quantity threshold, expanding the target non-inquiry-related corpus entries corresponding to the current intent label until their number is greater than or equal to the second quantity threshold.
In this embodiment, the first quantity threshold may be set to 2500 and the second quantity threshold may be set to 1000. In practical application, the specific values of the first quantity threshold and/or the second quantity threshold can be adjusted according to actual needs. For intent labels with more than 2500 corpus entries, 2500 non-inquiry-related corpus entries are randomly selected and retained. For intent labels with fewer than 1000 corpus entries, the corpus is expanded to 1000 entries. Each intent label is limited to no more than 2500 and no fewer than 1000 corpus entries because the intent labels used for model training are severely imbalanced. Extensive experiments on the training corpus show that when a label already has more than 2500 corpus entries, adding more entries under that intent label yields only a very limited improvement in model accuracy and aggravates the sample imbalance of the training set, resulting in low recognition accuracy for intent labels with relatively few samples. For intent labels with fewer than 1000 corpus entries, the weight of the label in the model is too small, so the model's recognition accuracy for that label is insufficient. A minimal sketch of this balancing step follows.
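The following sketch, using the 2500 and 1000 thresholds mentioned above, is one possible reading of the balancing step; the simple random duplication used for expansion stands in for the random oversampling package discussed next, and the function names are hypothetical.
    import random

    FIRST_QUANTITY_THRESHOLD = 2500    # upper bound per intent label
    SECOND_QUANTITY_THRESHOLD = 1000   # lower bound per intent label

    def balance_label(entries):
        """Balance the corpus entries of one intent label."""
        if len(entries) > FIRST_QUANTITY_THRESHOLD:
            # Randomly keep at most 2500 entries.
            return random.sample(entries, FIRST_QUANTITY_THRESHOLD)
        if len(entries) < SECOND_QUANTITY_THRESHOLD:
            # Randomly duplicate entries until at least 1000 remain.
            expanded = list(entries)
            while len(expanded) < SECOND_QUANTITY_THRESHOLD:
                expanded.append(random.choice(entries))
            return expanded
        return list(entries)

    def balance_all(entries_by_label):
        return {label: balance_label(entries) for label, entries in entries_by_label.items()}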
Further, expanding the target non-inquiry-related corpus entries corresponding to the current intent label comprises:
calling a preset random oversampling package, and randomly copying the target non-inquiry-related corpus entries corresponding to the current intent label through the random oversampling package.
In this embodiment, the corpus is expanded by calling a RandomOverSample (random oversampling) package with Python; through this package, some entries in the corpus can be randomly copied so that the corpus is expanded to a predetermined size. Such a random oversampling package is commonly used to randomly copy and repeat minority-class samples, with the goal of making the number of minority-class samples equal to that of the majority class, thereby obtaining a new, balanced data set. A hedged usage sketch follows.
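The application refers to a Python "RandomOverSample" package without naming a specific library. As an assumption, the sketch below uses RandomOverSampler from the imbalanced-learn (imblearn) library, which implements the described behavior of randomly duplicating minority-class samples; its default strategy balances every class up to the majority-class count, and a sampling_strategy argument could be used instead to target a fixed count such as 1000.
    import numpy as np
    from imblearn.over_sampling import RandomOverSampler

    def oversample_corpus(entries, labels):
        """entries: list of corpus strings; labels: list of intent labels."""
        X = np.array(entries, dtype=object).reshape(-1, 1)  # RandomOverSampler expects 2-D features
        y = np.array(labels)
        sampler = RandomOverSampler(random_state=0)
        X_resampled, y_resampled = sampler.fit_resample(X, y)
        return X_resampled.ravel().tolist(), y_resampled.tolist()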
S6: outputting the first training samples and the second training samples as training corpora, wherein the training corpora are used for training the intent recognition model.
In this embodiment, a better training corpus is obtained from the first training samples and the second training samples, which improves the consistency between the accuracy observed in the training environment and the accuracy observed in the production environment; using this training corpus for the intent recognition model enables more accurate recognition of customer intent.
Examples of training corpora are as follows:
[Table of example training corpora; see Figure PCTCN2021090462-appb-000003 of the original filing.]
After the training corpora are obtained, a preset intent recognition model is trained with the training corpora to obtain a trained intent recognition model. A customer answer corpus to be recognized and an AI inquiry corpus to be recognized are received; the inquiry category that has a one-to-one mapping relationship with the AI inquiry corpus to be recognized is determined as the inquiry category to be recognized; and the customer answer corpus to be recognized and the inquiry category to be recognized are input into the trained intent recognition model to obtain the customer intent. A hedged inference sketch follows.
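The architecture of the intent recognition model is not specified in the application; purely as an illustration, the sketch below assumes a fine-tuned BERT sequence classifier (Hugging Face transformers) whose input pairs the inquiry category to be recognized with the customer answer to be recognized, mirroring the training-sample structure described above. The model path is hypothetical.
    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    MODEL_DIR = "path/to/trained-intent-model"   # hypothetical location of the trained model
    tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
    model = BertForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()

    def predict_intent(inquiry_category, customer_answer):
        """Return the index of the predicted intent label."""
        inputs = tokenizer(inquiry_category, customer_answer,
                           return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return int(logits.argmax(dim=-1))

    predicted = predict_intent("Q2", "我存好了")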
It should be emphasized that, in order to further ensure the privacy and security of the above training corpora, the training corpora may also be stored in a node of a blockchain.
The blockchain referred to in the present application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer and so on.
The present application can be applied in the field of smart medical care, thereby promoting the construction of smart cities.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through computer-readable instructions, which can be stored in a computer-readable storage medium; when executed, the computer-readable instructions may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM), etc.
It should be understood that although the steps in the flowchart of the accompanying drawings are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in the flowchart of the accompanying drawings may include multiple sub-steps or multiple stages, which are not necessarily executed at the same moment but may be executed at different moments, and their execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
With further reference to FIG. 3, as an implementation of the method shown in FIG. 2 above, the present application provides an embodiment of a training corpus generation apparatus for an intent recognition model. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
As shown in FIG. 3, the training corpus generation apparatus 300 for an intent recognition model described in this embodiment comprises: a matching module 301, an establishing module 302, a calculating module 303, a generating module 304, an association module 305 and an output module 306. The matching module 301 is configured to receive an AI inquiry corpus pre-labeled with inquiry categories and a customer answer corpus pre-labeled with intent labels, and perform a screening operation on the customer answer corpus based on preset regular expressions to obtain inquiry-related corpus entries and non-inquiry-related corpus entries, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship. The establishing module 302 is configured to establish an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus entries and the non-inquiry-related corpus entries, respectively. The calculating module 303 is configured to calculate the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjust the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus. The generating module 304 is configured to acquire the target inquiry-related corpus entries in the target inquiry-related corpus, determine the inquiry category corresponding to each target inquiry-related corpus entry based on the AI inquiry corpus, and generate first training samples based on the target inquiry-related corpus entries and their corresponding inquiry categories. The association module 305 is configured to acquire the target non-inquiry-related corpus entries in the target non-inquiry-related corpus and, based on the intent labels, associate the target non-inquiry-related corpus entries with preset inquiry categories to obtain second training samples. The output module 306 is configured to output the first training samples and the second training samples as training corpora, wherein the training corpora are used for training the intent recognition model.
In this embodiment, the present application calculates the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjusts the inquiry-related corpus and the non-inquiry-related corpus based on the similarity, so that the determined target inquiry-related corpus entries and target non-inquiry-related corpus entries are highly accurate. By associating the target non-inquiry-related corpus entries with preset inquiry categories based on the intent labels, the problem that the inquiry category is left unfilled for training corpora that do not depend on the AI inquiry corpus is solved, without causing an explosion of training corpora, which guarantees the efficiency of model training. The training corpora generated in this way enable the intent recognition model to maintain a high level of accuracy.
The matching module 301 includes a matching sub-module, a display sub-module, a marking sub-module and a generating sub-module. The matching sub-module is configured to match the customer answer corpus based on the preset regular expressions, taking the successfully matched customer answers as suspected inquiry-related corpus entries and the unmatched customer answers as suspected non-inquiry-related corpus entries. The display sub-module is configured to display the suspected inquiry-related corpus entries on a preset front-end page and notify designated personnel to confirm the suspected inquiry-related corpus entries. The marking sub-module is configured to, when it is recognized that the designated personnel have completed the confirmation, mark the suspected inquiry-related corpus entries as inquiry-related or non-inquiry-related based on the confirmation by the designated personnel. The generating sub-module is configured to take the suspected inquiry-related corpus entries marked as inquiry-related as the inquiry-related corpus entries, and take the suspected non-inquiry-related corpus entries together with the suspected inquiry-related corpus entries marked as non-inquiry-related as the non-inquiry-related corpus entries.
The calculating module 303 includes a first vector sub-module, a second vector sub-module, a similarity calculation sub-module and a similarity confirmation sub-module. The first vector sub-module is configured to input the current inquiry-related corpus entries into a pre-trained language representation model to obtain inquiry-related word vectors. The second vector sub-module is configured to input the non-inquiry-related corpus entries into the pre-trained language representation model to obtain non-inquiry-related word vectors. The similarity calculation sub-module is configured to traverse and calculate the cosine similarity between the current non-inquiry-related word vector and each inquiry-related word vector. The similarity confirmation sub-module is configured to take the cosine similarity with the largest value as the similarity between the current non-inquiry-related corpus entry and the inquiry-related corpus.
The calculating module 303 further includes a first identification sub-module and a first allocation sub-module. The first identification sub-module is configured to identify whether the similarity is greater than a preset first similarity threshold and, when the similarity is greater than the preset first similarity threshold, take the corresponding non-inquiry-related corpus entry as a corpus entry to be confirmed and notify designated personnel to classify the corpus entry to be confirmed. The first allocation sub-module is configured to, when it is recognized that the designated personnel have completed the classification of the corpus entry to be confirmed, allocate the corpus entry to be confirmed to the non-inquiry-related corpus or the inquiry-related corpus according to the classification by the designated personnel, so as to obtain the target inquiry-related corpus and the target non-inquiry-related corpus.
In some optional implementations of this embodiment, the above calculating module 303 is further configured to: identify whether the similarity is greater than a preset second similarity threshold and, when the similarity is greater than the preset second similarity threshold, delete the corresponding non-inquiry-related corpus entry from the non-inquiry-related corpus to obtain the target inquiry-related corpus and the target non-inquiry-related corpus.
In some optional implementations of this embodiment, the calculating module 303 further includes a first calculation sub-module, a second identification sub-module, a second allocation sub-module, a second calculation sub-module, a third identification sub-module, a third allocation sub-module, a third calculation sub-module and a deletion sub-module. The first calculation sub-module is configured to calculate the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus. The second identification sub-module is configured to identify whether the similarity is greater than a preset first similarity threshold and, when the similarity is greater than the preset first similarity threshold, take the corresponding non-inquiry-related corpus entry as a first corpus entry to be confirmed and notify designated personnel to classify the first corpus entry to be confirmed. The second allocation sub-module is configured to, when it is recognized that the designated personnel have completed the classification of the first corpus entry to be confirmed, allocate the first corpus entry to be confirmed to the non-inquiry-related corpus or the inquiry-related corpus according to the classification by the designated personnel, so as to obtain a first inquiry-related corpus and a first non-inquiry-related corpus. The second calculation sub-module is configured to calculate a first similarity between each first non-inquiry-related corpus entry in the first non-inquiry-related corpus and the first inquiry-related corpus. The third identification sub-module is configured to identify whether the first similarity is greater than the preset first similarity threshold and, when the first similarity is greater than the preset first similarity threshold, take the corresponding first non-inquiry-related corpus entry as a second corpus entry to be confirmed and notify the designated personnel to classify the second corpus entry to be confirmed. The third allocation sub-module is configured to, when it is recognized that the designated personnel have completed the classification of the second corpus entry to be confirmed, allocate the second corpus entry to be confirmed to the first non-inquiry-related corpus or the first inquiry-related corpus according to the classification by the designated personnel, so as to obtain a second inquiry-related corpus and a second non-inquiry-related corpus. The third calculation sub-module is configured to calculate a second similarity between each second non-inquiry-related corpus entry in the second non-inquiry-related corpus and the second inquiry-related corpus. The deletion sub-module is configured to identify whether the second similarity is greater than a preset second similarity threshold and, when the second similarity is greater than the preset second similarity threshold, delete the corresponding second non-inquiry-related corpus entry from the second non-inquiry-related corpus to obtain the target inquiry-related corpus and the target non-inquiry-related corpus.
关联模块305包括确定子模块、均衡子模块和关联子模块。其中,确定子模块用于确定每种所述意图标签所对应的目标非问询相关语料;均衡子模块用于基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理,获得均衡语料;关联子模块用于基于预设的相同概率,将每种所述意图标签所对应的均衡语料与预设的问询类别进行关联,获得所述第二训练样本。The association module 305 includes a determination sub-module, an equalization sub-module and an association sub-module. Wherein, the determination sub-module is used to determine the target non-inquiry-related corpus corresponding to each of the intent tags; the equalization sub-module is used to separately analyze the target non-inquiry corresponding to each of the intent tags based on a preset quantity threshold The relevant corpus is subjected to sample equalization processing to obtain a balanced corpus; the association sub-module is used to associate the balanced corpus corresponding to each of the intention labels with the preset query category based on the preset same probability, and obtain the second Training samples.
所述数量阈值包括第一数量阈值和第二数量阈值,其中,所述第一数量阈值大于所述第二数量阈值,均衡子模块包括识别单元、筛选单元和扩充单元。其中,识别单元用于识别当前意图标签所对应的目标非问询相关语料的数量是否大于所述第一数量阈值或是否小于所述第二数量阈值;筛选单元用于在当前意图标签所对应的目标非问询相关语料的数量大于所述第一数量阈值时,对所述当前意图标签所对应的目标非问询相关语料进行随机筛选,直至当前意图标签所对应的目标非问询相关语料的数量小于等于所述第一数量阈值;扩充单元用于在当前意图标签所对应的目标非问询相关语料的数量小于所述第二数量阈 值时,对所述当前意图标签所对应的目标非问询相关语料进行语料扩充,直至当前意图标签所对应的目标非问询相关语料数量大于等于所述第二数量阈值。The quantity threshold includes a first quantity threshold and a second quantity threshold, wherein the first quantity threshold is greater than the second quantity threshold, and the equalization sub-module includes an identification unit, a screening unit and an expansion unit. Wherein, the identifying unit is used to identify whether the quantity of the target non-inquiry related corpus corresponding to the current intent label is greater than the first quantity threshold or less than the second quantity threshold; When the quantity of the target non-inquiry-related corpus is greater than the first quantity threshold, the target non-inquiry-related corpus corresponding to the current intent label is randomly screened until the target non-inquiry-related corpus corresponding to the current intent label is The quantity is less than or equal to the first quantity threshold; the expansion unit is configured to, when the quantity of the target non-inquiry-related corpus corresponding to the current intention label is less than the second quantity threshold, to The query-related corpus is expanded until the quantity of the target non-query-related corpus corresponding to the current intent tag is greater than or equal to the second quantity threshold.
在本实施例的一些可选的实现方式中，上述扩充单元进一步用于：调用预设的随机过采样包，通过所述随机过采样包对所述当前意图标签所对应的目标非问询相关语料进行随机复制。In some optional implementations of this embodiment, the expansion unit is further configured to: call a preset random oversampling package, and use the random oversampling package to randomly copy the target non-inquiry-related corpus corresponding to the current intent tag.
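The application does not name a specific oversampling package. Assuming imbalanced-learn's RandomOverSampler as one possible such package, random duplication of the samples of under-represented intent labels could look like the sketch below. Note that the sampler's default strategy balances every label up to the majority count, whereas the embodiment above only expands up to the second quantity threshold, so a custom sampling strategy would be needed to match it exactly.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler  # assumed "random oversampling package"

def oversample_by_intent(texts, intent_labels, seed=0):
    """Randomly duplicate samples of minority intent labels by treating each
    text as a single feature column."""
    sampler = RandomOverSampler(random_state=seed)
    X = np.array(texts, dtype=object).reshape(-1, 1)
    X_resampled, y_resampled = sampler.fit_resample(X, intent_labels)
    return list(X_resampled.ravel()), list(y_resampled)
```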
本申请通过计算非问询相关语料库中每条非问询相关语料与问询相关语料库之间的相似度，并基于相似度对问询相关语料库和非问询相关语料库进行调整，实现确定出的目标问询相关语料和目标非问询相关语料的准确性较高。通过基于意图标签，将目标非问询相关语料与预设的问询类别进行关联的方式，解决了对不依赖AI问询语料的训练语料不进行问询类别的填充的问题，同时没有造成训练语料的爆炸，保证了模型训练的效率。通过此方式生成的训练语料能够使意图识别模型的准确率保持在高水平。This application calculates the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus and the inquiry-related corpus, and adjusts the inquiry-related corpus and the non-inquiry-related corpus based on the similarity, so that the determined target query-related corpus and target non-query-related corpus are highly accurate. By associating the target non-query-related corpus with preset query categories based on the intent tags, the problem of not filling query categories for training corpus that does not rely on the AI query corpus is solved without causing an explosion of the training corpus, which preserves the efficiency of model training. The training corpus generated in this way keeps the accuracy of the intent recognition model at a high level.
为解决上述技术问题，本申请实施例还提供计算机设备。具体请参阅图4，图4为本实施例计算机设备基本结构框图。To solve the above technical problems, the embodiments of the present application also provide a computer device. For details, please refer to FIG. 4, which is a block diagram of the basic structure of the computer device according to this embodiment.
所述计算机设备200包括通过系统总线相互通信连接存储器201、处理器202、网络接口203。需要指出的是，图中仅示出了具有组件201-203的计算机设备200，但是应理解的是，并不要求实施所有示出的组件，可以替代的实施更多或者更少的组件。其中，本技术领域技术人员可以理解，这里的计算机设备是一种能够按照事先设定或存储的指令，自动进行数值计算和/或信息处理的设备，其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit，ASIC)、可编程门阵列(Field-Programmable Gate Array，FPGA)、数字处理器(Digital Signal Processor，DSP)、嵌入式设备等。The computer device 200 includes a memory 201, a processor 202, and a network interface 203 that communicate with each other through a system bus. It should be noted that only the computer device 200 with components 201-203 is shown in the figure, but it should be understood that implementation of all of the shown components is not required, and more or fewer components may be implemented instead. Those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
所述计算机设备可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。所述计算机设备可以与用户通过键盘、鼠标、遥控器、触摸板或声控设备等方式进行人机交互。The computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment. The computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
所述存储器201至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等。所述计算机可读存储介质可以是非易失性,也可以是易失性。在一些实施例中,所述存储器201可以是所述计算机设备200的内部存储单元,例如该计算机设备200的硬盘或内存。在另一些实施例中,所述存储器201也可以是所述计算机设备200的外部存储设备,例如该计算机设备200上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,所述存储器201还可以既包括所述计算机设备200的内部存储单元也包括其外部存储设备。本实施例中,所述存储器201通常用于存储安装于所述计算机设备200的操作系统和各类应用软件,例如意图识别模型的训练语料生成方法的计算机可读指令等。此外,所述存储器201还可以用于暂时地存储已经输出或者将要输出的各类数据。The memory 201 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), Magnetic Memory, Magnetic Disk, Optical Disk, etc. The computer-readable storage medium may be non-volatile or volatile. In some embodiments, the memory 201 may be an internal storage unit of the computer device 200 , such as a hard disk or a memory of the computer device 200 . In other embodiments, the memory 201 may also be an external storage device of the computer device 200, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc. Of course, the memory 201 may also include both an internal storage unit of the computer device 200 and an external storage device thereof. In this embodiment, the memory 201 is generally used to store the operating system and various application software installed on the computer device 200 , such as computer-readable instructions for a method for generating training corpus of an intent recognition model. In addition, the memory 201 can also be used to temporarily store various types of data that have been output or will be output.
所述处理器202在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器202通常用于控制所述计算机设备200的总体操作。本实施例中,所述处理器202用于运行所述存储器201中存储的计算机可读指令或者处理数据,例如运行所述意图识别模型的训练语料生成方法的计算机可读指令。In some embodiments, the processor 202 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips. The processor 202 is typically used to control the overall operation of the computer device 200 . In this embodiment, the processor 202 is configured to execute computer-readable instructions stored in the memory 201 or process data, for example, computer-readable instructions for executing a method for generating training corpus of the intent recognition model.
所述网络接口203可包括无线网络接口或有线网络接口,该网络接口203通常用于在所述计算机设备200与其他电子设备之间建立通信连接。The network interface 203 may include a wireless network interface or a wired network interface, and the network interface 203 is generally used to establish a communication connection between the computer device 200 and other electronic devices.
在本实施例中，解决了对不依赖AI问询语料的训练语料不进行问询类别的填充的问题，同时获得较佳的训练语料，没有造成训练语料的爆炸，通过训练语料有效提升了意图识别模型对客户意图识别的准确性。In this embodiment, the problem of not filling the query category for the training corpus that does not rely on the AI query corpus is solved; at the same time, better training corpus is obtained without causing an explosion of the training corpus, and the training corpus effectively improves the accuracy with which the intent recognition model identifies customer intent.
本申请还提供了另一种实施方式，即提供一种计算机可读存储介质，所述计算机可读存储介质存储有计算机可读指令，所述计算机可读指令可被至少一个处理器执行，以使所述至少一个处理器执行如上述的意图识别模型的训练语料生成方法。The present application also provides another embodiment, that is, a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor, so that the at least one processor performs the above-described method for generating training corpus of an intent recognition model.
在本实施例中，解决了对不依赖AI问询语料的训练语料不进行问询类别的填充的问题，同时获得较佳的训练语料，没有造成训练语料的爆炸，通过训练语料有效提升了意图识别模型对客户意图识别的准确性。In this embodiment, the problem of not filling the query category for the training corpus that does not rely on the AI query corpus is solved; at the same time, better training corpus is obtained without causing an explosion of the training corpus, and the training corpus effectively improves the accuracy with which the intent recognition model identifies customer intent.
通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机，计算机，服务器，空调器，或者网络设备等)执行本申请各个实施例所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the various embodiments of this application.
显然，以上所描述的实施例仅仅是本申请一部分实施例，而不是全部的实施例，附图中给出了本申请的较佳实施例，但并不限制本申请的专利范围。本申请可以以许多不同的形式来实现，相反地，提供这些实施例的目的是使对本申请的公开内容的理解更加透彻全面。尽管参照前述实施例对本申请进行了详细的说明，对于本领域的技术人员而言，其依然可以对前述各具体实施方式所记载的技术方案进行修改，或者对其中部分技术特征进行等效替换。凡是利用本申请说明书及附图内容所做的等效结构，直接或间接运用在其他相关的技术领域，均同理在本申请专利保护范围之内。Obviously, the above-described embodiments are only some of the embodiments of the present application, rather than all of them. The accompanying drawings show preferred embodiments of the present application, but do not limit the scope of the patent of the present application. This application may be embodied in many different forms; rather, these embodiments are provided so that the disclosure of this application is understood thoroughly and completely. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments, or make equivalent replacements for some of the technical features. Any equivalent structure made by using the contents of the description and drawings of the present application, and directly or indirectly used in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims (20)

  1. 一种意图识别模型的训练语料生成方法,包括下述步骤:A training corpus generation method for an intent recognition model, comprising the following steps:
    接收预标注问询类别的AI问询语料和预标注意图标签的客户回答语料，并基于预设的正则表达式对所述客户回答语料进行筛选操作，得到问询相关语料以及非问询相关语料，其中，所述客户回答语料与所述AI问询语料具有一一对应的映射关系；Receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI query corpus have a one-to-one mapping relationship;
    分别基于所述问询相关语料和所述非问询相关语料建立问询相关语料库和非问询相关语料库;establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus and the non-inquiry-related corpus, respectively;
    计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度，并基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库；Calculate the similarity between each of the non-inquiry-related corpora in the non-inquiry-related corpus and the inquiry-related corpus, and adjust the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus;
    获取所述目标问询相关语料库中的目标问询相关语料，并基于所述AI问询语料确定所述目标问询相关语料对应的问询类别，基于所述目标问询相关语料对应的问询类别和所述目标问询相关语料生成第一训练样本；Obtain the target query related corpus in the target query related corpus, determine the query category corresponding to the target query related corpus based on the AI query corpus, and generate a first training sample based on the query category corresponding to the target query related corpus and the target query related corpus;
    获取所述目标非问询相关语料库中的目标非问询相关语料,基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本;acquiring the target non-inquiry-related corpus in the target non-inquiry-related corpus, and based on the intent label, associating the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample;
    将所述第一训练样本和所述第二训练样本作为训练语料并输出,其中,所述训练语料用于意图识别模型的训练。The first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
  2. 根据权利要求1所述的意图识别模型的训练语料生成方法,其中,所述计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度包括:The method for generating training corpus for an intent recognition model according to claim 1, wherein the calculating the similarity between each of the non-question-related corpora in the non-question-related corpus and the query-related corpus include:
    将所述问询相关语料输入至预先训练的语言表征模型中,获得问询相关词向量;Inputting the query-related corpus into a pre-trained language representation model to obtain query-related word vectors;
    将所述非问询相关语料输入至预先训练的语言表征模型中,获得非问询相关词向量;Inputting the non-inquiry-related corpus into a pre-trained language representation model to obtain non-inquiry-related word vectors;
    历遍计算当前所述非问询相关词向量与每个所述问询相关词向量之间的余弦相似度;traversing the cosine similarity between the current non-query-related word vector and each of the query-related word vectors;
    将数值最大的余弦相似度作为所述非问询相关语料与所述问询相关语料库之间的相似度。The cosine similarity with the largest numerical value is taken as the similarity between the non-question-related corpus and the query-related corpus.
  3. 根据权利要求1所述的意图识别模型的训练语料生成方法,其中,所述基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本包括:The method for generating training corpus of an intent recognition model according to claim 1, wherein the target non-question related corpus is associated with a preset query category based on the intent label to obtain a second training sample include:
    确定每种所述意图标签所对应的目标非问询相关语料;Determine the target non-question-related corpus corresponding to each of the intent tags;
    基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理,获得均衡语料;Based on a preset quantity threshold, sample equalization processing is performed on the target non-inquiry-related corpus corresponding to each of the intent tags, to obtain balanced corpus;
    基于预设的相同概率,将每种所述意图标签所对应的均衡语料与预设的问询类别进行关联,获得所述第二训练样本。Based on a preset same probability, the balanced corpus corresponding to each of the intent labels is associated with a preset query category to obtain the second training sample.
  4. 根据权利要求3所述的意图识别模型的训练语料生成方法，其中，所述数量阈值包括第一数量阈值和第二数量阈值，其中，所述第一数量阈值大于所述第二数量阈值，所述基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理，获得均衡语料包括：The method for generating training corpus of an intent recognition model according to claim 3, wherein the quantity threshold includes a first quantity threshold and a second quantity threshold, wherein the first quantity threshold is greater than the second quantity threshold, and the performing sample equalization processing on the target non-inquiry-related corpus corresponding to each of the intent tags based on the preset quantity threshold to obtain the balanced corpus includes:
    识别当前意图标签所对应的目标非问询相关语料的数量是否大于所述第一数量阈值或是否小于所述第二数量阈值;Identifying whether the quantity of the target non-inquiry related corpus corresponding to the current intent label is greater than the first quantity threshold or less than the second quantity threshold;
    在当前意图标签所对应的目标非问询相关语料的数量大于所述第一数量阈值时，对所述当前意图标签所对应的目标非问询相关语料进行随机筛选，直至当前意图标签所对应的目标非问询相关语料的数量小于等于所述第一数量阈值；When the quantity of the target non-inquiry-related corpus corresponding to the current intent label is greater than the first quantity threshold, randomly filter the target non-inquiry-related corpus corresponding to the current intent label until the quantity of the target non-inquiry-related corpus corresponding to the current intent label is less than or equal to the first quantity threshold;
    在当前意图标签所对应的目标非问询相关语料的数量小于所述第二数量阈值时，对所述当前意图标签所对应的目标非问询相关语料进行语料扩充，直至当前意图标签所对应的目标非问询相关语料数量大于等于所述第二数量阈值。When the quantity of the target non-inquiry-related corpus corresponding to the current intent tag is less than the second quantity threshold, corpus expansion is performed on the target non-inquiry-related corpus corresponding to the current intent tag until the quantity of the target non-inquiry-related corpus corresponding to the current intent tag is greater than or equal to the second quantity threshold.
  5. 根据权利要求4所述的意图识别模型的训练语料生成方法,其中,所述对所述当前意图标签所对应的目标非问询相关语料进行语料扩充包括:The method for generating training corpus of an intent recognition model according to claim 4, wherein the performing corpus expansion on the target non-inquiry related corpus corresponding to the current intent label comprises:
    调用预设的随机过采样包，通过所述随机过采样包对所述当前意图标签所对应的目标非问询相关语料进行随机复制。The preset random oversampling package is called, and the target non-query related corpus corresponding to the current intent tag is randomly copied through the random oversampling package.
  6. 根据权利要求1所述的意图识别模型的训练语料生成方法，其中，所述基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库包括：The method for generating training corpus for an intent recognition model according to claim 1, wherein the adjusting the query-related corpus and the non-query-related corpus based on the similarity to obtain a target query-related corpus and a target non-query-related corpus includes:
    识别所述相似度是否大于预设的第一相似度阈值，当所述相似度大于预设的第一相似度阈值时，将对应的所述非问询相关语料作为待确认语料，并通知指定人员对所述待确认语料进行分类；Identify whether the similarity is greater than the preset first similarity threshold, and when the similarity is greater than the preset first similarity threshold, take the corresponding non-inquiry-related corpus as the corpus to be confirmed, and notify the designated person to classify the corpus to be confirmed;
    当识别到所述指定人员完成对所述待确认语料的分类时，根据所述指定人员的分类将所述待确认语料分配至所述非问询相关语料库中或所述问询相关语料库中，获得目标问询相关语料库和目标非问询相关语料库。When it is recognized that the designated person has completed the classification of the to-be-confirmed corpus, allocate the to-be-confirmed corpus to the non-inquiry-related corpus or the inquiry-related corpus according to the classification of the designated person, to obtain the target query-related corpus and the target non-query-related corpus.
  7. 根据权利要求1所述的意图识别模型的训练语料生成方法，其中，所述基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库包括：The method for generating training corpus for an intent recognition model according to claim 1, wherein the adjusting the query-related corpus and the non-query-related corpus based on the similarity to obtain a target query-related corpus and a target non-query-related corpus includes:
    识别所述相似度是否大于预设的第二相似度阈值，当所述相似度大于预设的第二相似度阈值时，从所述非问询相关语料库中删除对应的非问询相关语料，获得目标问询相关语料库和目标非问询相关语料库。Identifying whether the similarity is greater than a preset second similarity threshold, and when the similarity is greater than the preset second similarity threshold, deleting the corresponding non-inquiry-related corpus from the non-inquiry-related corpus to obtain the target query-related corpus and the target non-query-related corpus.
  8. 一种意图识别模型的训练语料生成装置,包括:A training corpus generation device for an intent recognition model, comprising:
    匹配模块,用于接收预标注问询类别的AI问询语料和预标注意图标签的客户回答语料,并基于预设的正则表达式对所述客户回答语料进行筛选操作,得到问询相关语料以及非问询相关语料,其中,所述客户回答语料与所述AI问询语料具有一一对应的映射关系;The matching module is used to receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship;
    建立模块,用于分别基于所述问询相关语料和所述非问询相关语料建立问询相关语料库和非问询相关语料库;an establishment module for establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus and the non-inquiry-related corpus respectively;
    计算模块，用于计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度，并基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库；A calculation module, configured to calculate the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus and the inquiry-related corpus, and adjust the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus;
    生成模块，用于获取所述目标问询相关语料库中的目标问询相关语料，并基于所述AI问询语料确定所述目标问询相关语料对应的问询类别，基于所述目标问询相关语料对应的问询类别和所述目标问询相关语料生成第一训练样本；The generating module is configured to obtain the target query related corpus in the target query related corpus, determine the query category corresponding to the target query related corpus based on the AI query corpus, and generate a first training sample based on the query category corresponding to the target query related corpus and the target query related corpus;
    关联模块，用于获取所述目标非问询相关语料库中的目标非问询相关语料，基于所述意图标签，将所述目标非问询相关语料与预设的问询类别进行关联，获得第二训练样本；以及The association module is used to obtain the target non-inquiry-related corpus in the target non-inquiry-related corpus, and based on the intent tag, associate the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample; and
    输出模块,用于将所述第一训练样本和所述第二训练样本作为训练语料并输出,其中,所述训练语料用于意图识别模型的训练。The output module is used for outputting the first training sample and the second training sample as training corpus, wherein the training corpus is used for training the intention recognition model.
  9. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现如下所述的意图识别模型的训练语料生成方法:A computer device, comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the processor executes the computer-readable instructions, the following method for generating a training corpus of an intent recognition model is implemented:
    接收预标注问询类别的AI问询语料和预标注意图标签的客户回答语料，并基于预设的正则表达式对所述客户回答语料进行筛选操作，得到问询相关语料以及非问询相关语料，其中，所述客户回答语料与所述AI问询语料具有一一对应的映射关系；Receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI query corpus have a one-to-one mapping relationship;
    分别基于所述问询相关语料和所述非问询相关语料建立问询相关语料库和非问询相关语料库;establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus and the non-inquiry-related corpus, respectively;
    计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度，并基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库；Calculate the similarity between each of the non-inquiry-related corpora in the non-inquiry-related corpus and the inquiry-related corpus, and adjust the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus;
    获取所述目标问询相关语料库中的目标问询相关语料，并基于所述AI问询语料确定所述目标问询相关语料对应的问询类别，基于所述目标问询相关语料对应的问询类别和所述目标问询相关语料生成第一训练样本；Obtain the target query related corpus in the target query related corpus, determine the query category corresponding to the target query related corpus based on the AI query corpus, and generate a first training sample based on the query category corresponding to the target query related corpus and the target query related corpus;
    获取所述目标非问询相关语料库中的目标非问询相关语料,基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本;acquiring the target non-inquiry-related corpus in the target non-inquiry-related corpus, and based on the intent label, associating the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample;
    将所述第一训练样本和所述第二训练样本作为训练语料并输出,其中,所述训练语料用于意图识别模型的训练。The first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
  10. 根据权利要求9所述的计算机设备,其中,所述计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度包括:The computer device according to claim 9, wherein the calculating the similarity between each of the non-question-related corpus and the query-related corpus in the non-question-related corpus comprises:
    将所述问询相关语料输入至预先训练的语言表征模型中,获得问询相关词向量;Inputting the query-related corpus into a pre-trained language representation model to obtain query-related word vectors;
    将所述非问询相关语料输入至预先训练的语言表征模型中,获得非问询相关词向量;Inputting the non-inquiry-related corpus into a pre-trained language representation model to obtain non-inquiry-related word vectors;
    历遍计算当前所述非问询相关词向量与每个所述问询相关词向量之间的余弦相似度;traversing the cosine similarity between the current non-query-related word vector and each of the query-related word vectors;
    将数值最大的余弦相似度作为所述非问询相关语料与所述问询相关语料库之间的相似度。The cosine similarity with the largest numerical value is taken as the similarity between the non-question-related corpus and the query-related corpus.
  11. 根据权利要求9所述的计算机设备,其中,所述基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本包括:The computer device according to claim 9, wherein, based on the intent tag, associating the target non-question-related corpus with a preset query category, and obtaining the second training sample comprises:
    确定每种所述意图标签所对应的目标非问询相关语料;Determine the target non-question-related corpus corresponding to each of the intent tags;
    基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理,获得均衡语料;Based on a preset quantity threshold, sample equalization processing is performed on the target non-inquiry-related corpus corresponding to each of the intent tags, to obtain balanced corpus;
    基于预设的相同概率,将每种所述意图标签所对应的均衡语料与预设的问询类别进行关联,获得所述第二训练样本。Based on a preset same probability, the balanced corpus corresponding to each of the intent labels is associated with a preset query category to obtain the second training sample.
  12. 根据权利要求11所述的计算机设备，其中，所述数量阈值包括第一数量阈值和第二数量阈值，其中，所述第一数量阈值大于所述第二数量阈值，所述基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理，获得均衡语料包括：The computer device according to claim 11, wherein the quantity threshold includes a first quantity threshold and a second quantity threshold, wherein the first quantity threshold is greater than the second quantity threshold, and the performing sample equalization processing on the target non-inquiry-related corpus corresponding to each of the intent labels based on the preset quantity threshold to obtain the balanced corpus includes:
    识别当前意图标签所对应的目标非问询相关语料的数量是否大于所述第一数量阈值或是否小于所述第二数量阈值;Identifying whether the quantity of the target non-inquiry related corpus corresponding to the current intent label is greater than the first quantity threshold or less than the second quantity threshold;
    在当前意图标签所对应的目标非问询相关语料的数量大于所述第一数量阈值时，对所述当前意图标签所对应的目标非问询相关语料进行随机筛选，直至当前意图标签所对应的目标非问询相关语料的数量小于等于所述第一数量阈值；When the quantity of the target non-inquiry-related corpus corresponding to the current intent label is greater than the first quantity threshold, randomly filter the target non-inquiry-related corpus corresponding to the current intent label until the quantity of the target non-inquiry-related corpus corresponding to the current intent label is less than or equal to the first quantity threshold;
    在当前意图标签所对应的目标非问询相关语料的数量小于所述第二数量阈值时，对所述当前意图标签所对应的目标非问询相关语料进行语料扩充，直至当前意图标签所对应的目标非问询相关语料数量大于等于所述第二数量阈值。When the quantity of the target non-inquiry-related corpus corresponding to the current intent tag is less than the second quantity threshold, corpus expansion is performed on the target non-inquiry-related corpus corresponding to the current intent tag until the quantity of the target non-inquiry-related corpus corresponding to the current intent tag is greater than or equal to the second quantity threshold.
  13. 根据权利要求12所述的计算机设备,其中,所述对所述当前意图标签所对应的目标非问询相关语料进行语料扩充包括:The computer device according to claim 12, wherein the performing corpus expansion on the target non-query related corpus corresponding to the current intent tag comprises:
    调用预设的随机过采样包,通过所述随机过采样包对所述当前意图标签所对应的目标非问询相关语料进行随机复制。A preset random oversampling package is called, and the target non-query related corpus corresponding to the current intent tag is randomly copied through the random oversampling package.
  14. 根据权利要求9所述的计算机设备,其中,所述基于所述相似度调整所述问询相关语料库和所述非问询相关语料库,获得目标问询相关语料库和目标非问询相关语料库包括:The computer device according to claim 9, wherein the adjusting the query-related corpus and the non-query-related corpus based on the similarity, and obtaining the target query-related corpus and the target non-query-related corpus comprises:
    识别所述相似度是否大于预设的第一相似度阈值，当所述相似度大于预设的第一相似度阈值时，将对应的所述非问询相关语料作为待确认语料，并通知指定人员对所述待确认语料进行分类；Identify whether the similarity is greater than the preset first similarity threshold, and when the similarity is greater than the preset first similarity threshold, take the corresponding non-inquiry-related corpus as the corpus to be confirmed, and notify the designated person to classify the corpus to be confirmed;
    当识别到所述指定人员完成对所述待确认语料的分类时，根据所述指定人员的分类将所述待确认语料分配至所述非问询相关语料库中或所述问询相关语料库中，获得目标问询相关语料库和目标非问询相关语料库。When it is recognized that the designated person has completed the classification of the to-be-confirmed corpus, allocate the to-be-confirmed corpus to the non-inquiry-related corpus or the inquiry-related corpus according to the classification of the designated person, to obtain the target query-related corpus and the target non-query-related corpus.
  15. 根据权利要求9所述的计算机设备，其中，所述基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库包括：The computer device according to claim 9, wherein the adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity, and obtaining the target inquiry-related corpus and the target non-inquiry-related corpus comprises:
    识别所述相似度是否大于预设的第二相似度阈值，当所述相似度大于预设的第二相似度阈值时，从所述非问询相关语料库中删除对应的非问询相关语料，获得目标问询相关语料库和目标非问询相关语料库。Identifying whether the similarity is greater than a preset second similarity threshold, and when the similarity is greater than the preset second similarity threshold, deleting the corresponding non-inquiry-related corpus from the non-inquiry-related corpus to obtain the target query-related corpus and the target non-query-related corpus.
  16. 一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现如下所述的意图识别模型的训练语料生成方法:A computer-readable storage medium, where computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the following method for generating a training corpus of an intent recognition model is implemented:
    接收预标注问询类别的AI问询语料和预标注意图标签的客户回答语料，并基于预设的正则表达式对所述客户回答语料进行筛选操作，得到问询相关语料以及非问询相关语料，其中，所述客户回答语料与所述AI问询语料具有一一对应的映射关系；Receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI query corpus have a one-to-one mapping relationship;
    分别基于所述问询相关语料和所述非问询相关语料建立问询相关语料库和非问询相关语料库;establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus and the non-inquiry-related corpus, respectively;
    计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度，并基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库；Calculate the similarity between each of the non-inquiry-related corpora in the non-inquiry-related corpus and the inquiry-related corpus, and adjust the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus;
    获取所述目标问询相关语料库中的目标问询相关语料，并基于所述AI问询语料确定所述目标问询相关语料对应的问询类别，基于所述目标问询相关语料对应的问询类别和所述目标问询相关语料生成第一训练样本；Obtain the target query related corpus in the target query related corpus, determine the query category corresponding to the target query related corpus based on the AI query corpus, and generate a first training sample based on the query category corresponding to the target query related corpus and the target query related corpus;
    获取所述目标非问询相关语料库中的目标非问询相关语料,基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本;acquiring the target non-inquiry-related corpus in the target non-inquiry-related corpus, and based on the intent label, associating the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample;
    将所述第一训练样本和所述第二训练样本作为训练语料并输出,其中,所述训练语料用于意图识别模型的训练。The first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度包括:The computer-readable storage medium according to claim 16, wherein the calculating the similarity between each of the non-question-related corpus and the query-related corpus in the non-question-related corpus comprises:
    将所述问询相关语料输入至预先训练的语言表征模型中,获得问询相关词向量;Inputting the query-related corpus into a pre-trained language representation model to obtain query-related word vectors;
    将所述非问询相关语料输入至预先训练的语言表征模型中,获得非问询相关词向量;Inputting the non-inquiry-related corpus into a pre-trained language representation model to obtain non-inquiry-related word vectors;
    历遍计算当前所述非问询相关词向量与每个所述问询相关词向量之间的余弦相似度;traversing the cosine similarity between the current non-query-related word vector and each of the query-related word vectors;
    将数值最大的余弦相似度作为所述非问询相关语料与所述问询相关语料库之间的相似度。The cosine similarity with the largest numerical value is taken as the similarity between the non-question-related corpus and the query-related corpus.
  18. 根据权利要求16所述的计算机可读存储介质,其中,所述基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本包括:The computer-readable storage medium according to claim 16, wherein the associating the target non-question-related corpus with a preset query category based on the intent tag, and obtaining the second training sample comprises:
    确定每种所述意图标签所对应的目标非问询相关语料;Determine the target non-question-related corpus corresponding to each of the intent tags;
    基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理,获得均衡语料;Based on a preset quantity threshold, sample equalization processing is performed on the target non-inquiry-related corpus corresponding to each of the intent tags, to obtain balanced corpus;
    基于预设的相同概率,将每种所述意图标签所对应的均衡语料与预设的问询类别进行关联,获得所述第二训练样本。Based on a preset same probability, the balanced corpus corresponding to each of the intent labels is associated with a preset query category to obtain the second training sample.
  19. 根据权利要求18所述的计算机可读存储介质，其中，所述数量阈值包括第一数量阈值和第二数量阈值，其中，所述第一数量阈值大于所述第二数量阈值，所述基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理，获得均衡语料包括：The computer-readable storage medium according to claim 18, wherein the quantity threshold comprises a first quantity threshold and a second quantity threshold, wherein the first quantity threshold is greater than the second quantity threshold, and the performing sample equalization processing on the target non-inquiry-related corpus corresponding to each of the intent labels based on the preset quantity threshold to obtain the balanced corpus includes:
    识别当前意图标签所对应的目标非问询相关语料的数量是否大于所述第一数量阈值或是否小于所述第二数量阈值;Identifying whether the quantity of the target non-inquiry related corpus corresponding to the current intent label is greater than the first quantity threshold or less than the second quantity threshold;
    在当前意图标签所对应的目标非问询相关语料的数量大于所述第一数量阈值时，对所述当前意图标签所对应的目标非问询相关语料进行随机筛选，直至当前意图标签所对应的目标非问询相关语料的数量小于等于所述第一数量阈值；When the quantity of the target non-inquiry-related corpus corresponding to the current intent label is greater than the first quantity threshold, randomly filter the target non-inquiry-related corpus corresponding to the current intent label until the quantity of the target non-inquiry-related corpus corresponding to the current intent label is less than or equal to the first quantity threshold;
    在当前意图标签所对应的目标非问询相关语料的数量小于所述第二数量阈值时，对所述当前意图标签所对应的目标非问询相关语料进行语料扩充，直至当前意图标签所对应的目标非问询相关语料数量大于等于所述第二数量阈值。When the quantity of the target non-inquiry-related corpus corresponding to the current intent tag is less than the second quantity threshold, corpus expansion is performed on the target non-inquiry-related corpus corresponding to the current intent tag until the quantity of the target non-inquiry-related corpus corresponding to the current intent tag is greater than or equal to the second quantity threshold.
  20. 根据权利要求19所述的计算机可读存储介质,其中,所述对所述当前意图标签所对应的目标非问询相关语料进行语料扩充包括:The computer-readable storage medium according to claim 19, wherein the performing corpus expansion on the target non-query related corpus corresponding to the current intent tag comprises:
    调用预设的随机过采样包,通过所述随机过采样包对所述当前意图标签所对应的目标非问询相关语料进行随机复制。A preset random oversampling package is called, and the target non-query related corpus corresponding to the current intent tag is randomly copied through the random oversampling package.
PCT/CN2021/090462 2020-11-17 2021-04-28 Training corpus generation method for intention recognition model, and related device thereof WO2022105119A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011288871.XA CN112395390B (en) 2020-11-17 2020-11-17 Training corpus generation method of intention recognition model and related equipment thereof
CN202011288871.X 2020-11-17

Publications (1)

Publication Number Publication Date
WO2022105119A1 true WO2022105119A1 (en) 2022-05-27

Family

ID=74606272

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090462 WO2022105119A1 (en) 2020-11-17 2021-04-28 Training corpus generation method for intention recognition model, and related device thereof

Country Status (2)

Country Link
CN (1) CN112395390B (en)
WO (1) WO2022105119A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395390B (en) * 2020-11-17 2023-07-25 平安科技(深圳)有限公司 Training corpus generation method of intention recognition model and related equipment thereof
CN114281968B (en) * 2021-12-20 2023-02-28 北京百度网讯科技有限公司 Model training and corpus generation method, device, equipment and storage medium
CN115408509B (en) * 2022-11-01 2023-02-14 杭州一知智能科技有限公司 Intention identification method, system, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619050B (en) * 2018-06-20 2023-05-09 华为技术有限公司 Intention recognition method and device
CN109508376A (en) * 2018-11-23 2019-03-22 四川长虹电器股份有限公司 It can online the error correction intension recognizing method and device that update
CN110032724B (en) * 2018-12-19 2022-11-25 阿里巴巴集团控股有限公司 Method and device for recognizing user intention
CN110135551B (en) * 2019-05-15 2020-07-21 西南交通大学 Robot chatting method based on word vector and recurrent neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170161363A1 (en) * 2015-12-04 2017-06-08 International Business Machines Corporation Automatic Corpus Expansion using Question Answering Techniques
CN108153780A (en) * 2016-12-05 2018-06-12 阿里巴巴集团控股有限公司 A kind of human-computer dialogue device and its interactive method of realization
WO2018157700A1 (en) * 2017-03-02 2018-09-07 腾讯科技(深圳)有限公司 Method and device for generating dialogue, and storage medium
CN111428010A (en) * 2019-01-10 2020-07-17 北京京东尚科信息技术有限公司 Man-machine intelligent question and answer method and device
CN111368043A (en) * 2020-02-19 2020-07-03 中国平安人寿保险股份有限公司 Event question-answering method, device, equipment and storage medium based on artificial intelligence
CN112395390A (en) * 2020-11-17 2021-02-23 平安科技(深圳)有限公司 Training corpus generation method of intention recognition model and related equipment thereof

Also Published As

Publication number Publication date
CN112395390B (en) 2023-07-25
CN112395390A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
WO2022105119A1 (en) Training corpus generation method for intention recognition model, and related device thereof
WO2022126971A1 (en) Density-based text clustering method and apparatus, device, and storage medium
US9460117B2 (en) Image searching
US11727053B2 (en) Entity recognition from an image
US10713306B2 (en) Content pattern based automatic document classification
WO2022126970A1 (en) Method and device for financial fraud risk identification, computer device, and storage medium
WO2022174491A1 (en) Artificial intelligence-based method and apparatus for medical record quality control, computer device, and storage medium
WO2022134584A1 (en) Real estate picture verification method and apparatus, computer device and storage medium
CN112632278A (en) Labeling method, device, equipment and storage medium based on multi-label classification
WO2022126962A1 (en) Knowledge graph-based method for detecting guiding and abetting corpus and related device
WO2021103594A1 (en) Tacitness degree detection method and device, server and readable storage medium
CN116956326A (en) Authority data processing method and device, computer equipment and storage medium
CN116661936A (en) Page data processing method and device, computer equipment and storage medium
CN114547257B (en) Class matching method and device, computer equipment and storage medium
WO2022105120A1 (en) Text detection method and apparatus from image, computer device and storage medium
WO2022142032A1 (en) Handwritten signature verification method and apparatus, computer device, and storage medium
CN113065354B (en) Method for identifying geographic position in corpus and related equipment thereof
CN113989618A (en) Recyclable article classification and identification method
CN112036501A (en) Image similarity detection method based on convolutional neural network and related equipment thereof
CN111597453A (en) User image drawing method and device, computer equipment and computer readable storage medium
CN115250200B (en) Service authorization authentication method and related equipment thereof
CN117076775A (en) Information data processing method, information data processing device, computer equipment and storage medium
CN117113400A (en) Data leakage tracing method, device, equipment and storage medium thereof
CN116796730A (en) Text error correction method, device, equipment and storage medium based on artificial intelligence
CN117827814A (en) Data verification method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893275

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21893275

Country of ref document: EP

Kind code of ref document: A1