WO2022105119A1 - Training corpus generation method for intention recognition model, and related device thereof - Google Patents

Training corpus generation method for intention recognition model, and related device thereof

Info

Publication number
WO2022105119A1
WO2022105119A1 (PCT/CN2021/090462)
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
inquiry
query
target
related corpus
Prior art date
Application number
PCT/CN2021/090462
Other languages
French (fr)
Chinese (zh)
Inventor
孙向欣
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022105119A1 publication Critical patent/WO2022105119A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of big data technology, and in particular, to a training corpus generation method for an intent recognition model and related devices.
  • At present, human-machine dialogues mostly use intent recognition models to identify customer intentions.
  • In some scenarios a customer's intention depends on the AI (Artificial Intelligence) inquiry, while in other scenarios it does not. Therefore, during training of the intent recognition model, whether the corresponding AI inquiry is filled into a training sample is mostly decided according to this dependency.
  • However, the inventor realized that, because the dependence of customer intentions on AI inquiries cannot be judged in the actual production process, the prediction inputs to the model all contain AI inquiries. As a result, the model training mode and the model prediction mode are inconsistent, and the accuracy of the intent recognition model in production is lower than its accuracy in the training environment.
  • the purpose of the embodiments of the present application is to propose a training corpus generation method for an intent recognition model and related equipment, so as to improve the quality of the training corpus of the intent recognition model.
  • the embodiment of the present application provides a training corpus generation method for an intent recognition model, which adopts the following technical solutions:
  • a training corpus generation method for an intent recognition model comprising the following steps:
  • acquiring the target inquiry-related corpus from the target inquiry-related corpus library, determining the inquiry category corresponding to the target inquiry-related corpus based on the AI inquiry corpus, and generating a first training sample based on that inquiry category and the target inquiry-related corpus;
  • the first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
  • the embodiment of the present application also provides a training corpus generation device for an intent recognition model, which adopts the following technical solutions:
  • a training corpus generation device for an intent recognition model comprising:
  • the matching module is used to receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship;
  • an establishing module, configured to establish an inquiry-related corpus library and a non-inquiry-related corpus library based on the inquiry-related corpus and the non-inquiry-related corpus, respectively;
  • a calculation module, configured to calculate the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library, and to adjust the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity to obtain a target inquiry-related corpus library and a target non-inquiry-related corpus library;
  • a generating module, configured to acquire the target inquiry-related corpus from the target inquiry-related corpus library, determine the inquiry category corresponding to the target inquiry-related corpus based on the AI inquiry corpus, and generate a first training sample based on that inquiry category and the target inquiry-related corpus;
  • an association module, configured to acquire the target non-inquiry-related corpus from the target non-inquiry-related corpus library and, based on the intent label, associate the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample;
  • the output module is used for outputting the first training sample and the second training sample as training corpus, wherein the training corpus is used for training the intention recognition model.
  • the embodiment of the present application also provides a computer device, which adopts the following technical solutions:
  • a computer device comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the processor executes the computer-readable instructions, the following method for generating a training corpus of an intent recognition model is implemented:
  • acquiring the target inquiry-related corpus from the target inquiry-related corpus library, determining the inquiry category corresponding to the target inquiry-related corpus based on the AI inquiry corpus, and generating a first training sample based on that inquiry category and the target inquiry-related corpus;
  • the first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
  • the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
  • a computer-readable storage medium where computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the following method for generating a training corpus of an intent recognition model is implemented:
  • acquiring the target inquiry-related corpus from the target inquiry-related corpus library, determining the inquiry category corresponding to the target inquiry-related corpus based on the AI inquiry corpus, and generating a first training sample based on that inquiry category and the target inquiry-related corpus;
  • the first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
  • This application calculates the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library, and adjusts the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity, so that the determined target inquiry-related corpus and target non-inquiry-related corpus are more accurate.
  • By associating the target non-inquiry-related corpus with preset inquiry categories based on the intent labels, the problem of not filling an inquiry category into training corpus that does not depend on the AI inquiry corpus is solved without causing an explosion of the training corpus, which preserves the efficiency of model training.
  • The training corpus generated in this way keeps the accuracy of the intent recognition model at a high level.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of a training corpus generation method for an intent recognition model according to the present application
  • FIG. 3 is a schematic structural diagram of an embodiment of an apparatus for generating training corpus of an intent recognition model according to the present application
  • FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications can be installed on the terminal devices 101, 102, and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
  • The terminal devices 101, 102, and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
  • the method for generating training corpus for the intent recognition model is generally performed by a server/terminal device, and accordingly, the apparatus for generating training corpus for the intent recognition model is generally set in the server/terminal device.
  • terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • the training corpus generation method for the intent recognition model includes the following steps:
  • S1: Receive the AI inquiry corpus pre-labeled with inquiry categories and the customer answer corpus pre-labeled with intent labels, and perform a screening operation on the customer answer corpus based on a preset regular expression to obtain inquiry-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship.
  • an annotator pre-marks an intent label on the customer answer corpus under each inquiry category, where the inquiry category may include six categories from Q1 to Q6.
  • The successfully matched customer answer corpus is used as inquiry-related corpus, and the remaining customer answer corpus is used as non-inquiry-related corpus, which facilitates further processing of the customer answer corpus.
  • The electronic device (for example, the server/terminal device shown in FIG. 1) on which the training corpus generation method for the intent recognition model runs can receive the AI inquiry corpus and the customer answer corpus through a wired connection or a wireless connection.
  • The above wireless connection methods may include, but are not limited to, 3G/4G connection, WiFi connection, Bluetooth connection, WiMAX connection, Zigbee connection, UWB (ultra wideband) connection, and other wireless connection methods currently known or developed in the future.
  • In this embodiment, matching the customer answer corpus based on a preset regular expression, using the successfully matched customer answer corpus as inquiry-related corpus and the unmatched customer answer corpus as non-inquiry-related corpus includes the following steps.
  • Regular matching is used to extract suspected inquiry-related corpus from the customer answer corpus; the remaining corpus, that is, the corpus that fails to match, is suspected non-inquiry-related corpus.
  • The suspected inquiry-related corpus is handed over to a designated person for confirmation, and is marked as inquiry-related or non-inquiry-related based on that confirmation.
  • The suspected inquiry-related corpus marked as inquiry-related is regarded as inquiry-related corpus; the suspected non-inquiry-related corpus and the suspected inquiry-related corpus marked as non-inquiry-related are regarded as non-inquiry-related corpus.
  • In this way, inquiry-related corpus can be further confirmed, and the accuracy of the division of the customer answer corpus is improved; a minimal sketch of the screening step is given below.
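The screening step above can be illustrated with a minimal Python sketch. The regular expression, the helper name, and the example answers are illustrative assumptions; the application does not disclose the actual patterns used.

```python
import re

# Hypothetical pattern: answers that directly confirm or deny something are treated
# as suspected inquiry-related, since they only make sense together with the AI
# inquiry. The pattern itself is an assumption for illustration.
SUSPECTED_INQUIRY_PATTERN = re.compile(
    r"\b(yes|no|correct|that's right|I do|I don't)\b", re.IGNORECASE
)

def screen_customer_answers(customer_answers):
    """Split customer answer corpora into suspected inquiry-related and
    suspected non-inquiry-related corpora by regular-expression matching."""
    suspected_related, suspected_unrelated = [], []
    for answer in customer_answers:
        if SUSPECTED_INQUIRY_PATTERN.search(answer):
            suspected_related.append(answer)    # later confirmed by a designated person
        else:
            suspected_unrelated.append(answer)  # treated as suspected non-inquiry-related
    return suspected_related, suspected_unrelated

# Example: "I have saved it." does not match and stays suspected non-inquiry-related.
print(screen_customer_answers(["Yes, that's right.", "I have saved it."]))
```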
  • S2: Establish an inquiry-related corpus library and a non-inquiry-related corpus library based on the inquiry-related corpus and the non-inquiry-related corpus, respectively.
  • S3: Calculate the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library, and adjust the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity to obtain a target inquiry-related corpus library and a target non-inquiry-related corpus library.
  • In this embodiment, the similarity between each non-inquiry-related corpus and the inquiry-related corpus library is calculated, and the corpora in the inquiry-related corpus library and the non-inquiry-related corpus library are adjusted according to the similarity, so that more rigorous target inquiry-related and target non-inquiry-related corpus libraries are obtained.
  • Calculating the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library includes: inputting the current inquiry-related corpus and the non-inquiry-related corpus into a pre-trained language representation model to obtain inquiry-related and non-inquiry-related word vectors, calculating by traversal the cosine similarity between the current non-inquiry-related word vector and each inquiry-related word vector, and taking the cosine similarity with the largest value as the similarity between the current non-inquiry-related corpus and the inquiry-related corpus library.
  • In this embodiment, the language representation model is called to embed the inquiry-related corpus, converting each inquiry-related corpus into a 768-dimensional inquiry-related word vector; each inquiry-related word vector is the embedding of one corpus, where an embedding is a low-dimensional vector representation of a corpus.
  • the language representation model can be the BERT (Bidirectional Encoder Representations from Transformers) model.
  • The BERT model is highly versatile and can capture longer-distance dependencies.
  • Likewise, the language representation model is called to convert each non-inquiry-related corpus into a 768-dimensional non-inquiry-related word vector; these vectors can represent information in both directions.
  • The cosine similarity between the current non-inquiry-related word vector and each of the inquiry-related word vectors is calculated by traversal; after the traversal, the maximum cosine similarity is taken as the similarity between the current non-inquiry-related corpus and the inquiry-related corpus library.
  • For example, for a non-inquiry-related corpus such as "I have saved it.", the corresponding non-inquiry-related word vector is a 768-dimensional vector [0.07, 0.002, 0.04, ..., 0.009], and the cosine similarity between this non-inquiry-related word vector and each inquiry-related word vector is calculated in turn.
  • For two vectors A and B of the same dimension, the cosine similarity is cos(A, B) = (A · B) / (||A|| ||B||), that is, the dot product of A and B divided by the product of their norms; this is illustrated in the sketch below.
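As a concrete illustration of the similarity step, the sketch below computes the maximum cosine similarity between one non-inquiry-related word vector and a set of inquiry-related word vectors with NumPy. The random 768-dimensional vectors merely stand in for the BERT embeddings described above; how the embeddings are produced is outside this sketch.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two vectors of the same dimension."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_to_library(non_inquiry_vec, inquiry_vecs):
    """Traverse every inquiry-related word vector and keep the largest cosine
    similarity as the similarity between this non-inquiry-related corpus and
    the inquiry-related corpus library."""
    return max(cosine_similarity(non_inquiry_vec, v) for v in inquiry_vecs)

# Stand-in 768-dimensional embeddings (assumption; real vectors come from BERT).
rng = np.random.default_rng(0)
inquiry_vecs = [rng.normal(size=768) for _ in range(5)]
non_inquiry_vec = rng.normal(size=768)
print(round(similarity_to_library(non_inquiry_vec, inquiry_vecs), 3))
```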
  • Adjusting the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity to obtain the target inquiry-related corpus library and the target non-inquiry-related corpus library includes: identifying whether the similarity is greater than a preset first similarity threshold; when it is, using the corresponding non-inquiry-related corpus as to-be-confirmed corpus and notifying a designated person to classify it; and, when the designated person completes the classification, allocating the to-be-confirmed corpus to the non-inquiry-related corpus library or the inquiry-related corpus library according to that classification, so as to obtain the target inquiry-related corpus library and the target non-inquiry-related corpus library.
  • If the similarity is not greater than the first similarity threshold, the non-inquiry-related corpus remains in the non-inquiry-related corpus library.
  • For example, if the similarity is 0.3, which is less than the first similarity threshold of 0.6, the corpus remains non-inquiry-related corpus.
  • If the similarity is 0.9, which is greater than the first similarity threshold of 0.6, the corpus becomes to-be-confirmed corpus.
  • When a non-inquiry-related corpus is used as to-be-confirmed corpus, it is extracted from the non-inquiry-related corpus library for redistribution, so that more rigorous target inquiry-related and target non-inquiry-related corpus libraries are obtained.
  • In some embodiments, adjusting the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity to obtain the target inquiry-related corpus library and the target non-inquiry-related corpus library further uses a second similarity threshold.
  • When the similarity is greater than the preset second similarity threshold, the corresponding non-inquiry-related corpus is directly deleted, which can effectively improve the processing speed of the computer.
  • First, whether the similarity is greater than the preset first similarity threshold is identified; when it is, the corresponding non-inquiry-related corpus is used as first to-be-confirmed corpus, the designated person is notified to classify it, and the first to-be-confirmed corpus is allocated to the non-inquiry-related corpus library or the inquiry-related corpus library according to the classification of the designated person, so as to obtain a first inquiry-related corpus library and a first non-inquiry-related corpus library.
  • In the same way, second to-be-confirmed corpus is obtained and allocated to the first non-inquiry-related corpus library or the first inquiry-related corpus library according to the classification of the designated person, so as to obtain a second inquiry-related corpus library and a second non-inquiry-related corpus library.
  • The designated person in this application may be an annotator. If the similarity is greater than the first similarity threshold, the corpus is included in the corpus to be confirmed by the business, that is, the corresponding non-inquiry-related corpus is regarded as to-be-confirmed corpus, and this part of the corpus is returned to the annotator.
  • The annotator confirms whether the corpus is related to the AI inquiry. According to the annotation, the to-be-confirmed corpus related to the AI inquiry is added to the inquiry-related corpus library, and the to-be-confirmed corpus not related to the AI inquiry is added to the non-inquiry-related corpus library.
  • As for the maximum similarity, that is, the second similarity: if it is greater than the preset second similarity threshold, the corresponding corpus is directly deleted; if it is less than the second similarity threshold, the corpus simply remains non-inquiry-related and stays in the non-inquiry-related corpus library. A sketch of this threshold logic follows.
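The two-threshold adjustment can be summarized as the following sketch. The first similarity threshold of 0.6 comes from the example above; the value of the second similarity threshold is not stated in the text and is chosen here only for illustration.

```python
FIRST_SIMILARITY_THRESHOLD = 0.6    # value used in the example above
SECOND_SIMILARITY_THRESHOLD = 0.95  # illustrative assumption; not given in the text

def adjust_libraries(non_inquiry_items):
    """non_inquiry_items: list of (corpus, similarity) pairs, where similarity is
    the maximum cosine similarity to the inquiry-related corpus library.
    Returns corpora that stay non-inquiry-related, corpora routed to the
    designated person for confirmation, and corpora deleted outright."""
    stays, to_confirm, deleted = [], [], []
    for corpus, similarity in non_inquiry_items:
        if similarity > SECOND_SIMILARITY_THRESHOLD:
            deleted.append(corpus)      # near-duplicate of inquiry-related corpus
        elif similarity > FIRST_SIMILARITY_THRESHOLD:
            to_confirm.append(corpus)   # returned to the annotator for confirmation
        else:
            stays.append(corpus)        # remains in the non-inquiry-related library
    return stays, to_confirm, deleted

print(adjust_libraries([("corpus A", 0.3), ("corpus B", 0.9), ("corpus C", 0.97)]))
```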
  • S4: Acquire the target inquiry-related corpus from the target inquiry-related corpus library, determine the inquiry category corresponding to the target inquiry-related corpus based on the AI inquiry corpus, and generate a first training sample based on that inquiry category and the target inquiry-related corpus.
  • The generated first training sample belongs to the training samples in which the customer intent depends on the AI inquiry corpus.
  • In this embodiment, the first training sample is generated from the target inquiry-related corpus and its corresponding inquiry category, which preserves the dependency between the first training sample and the customer intent; a minimal sketch is given below.
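A minimal sketch of assembling the first training samples is given below. It assumes the one-to-one mapping between customer answers and AI inquiries is available as dictionaries; the field names are illustrative, not taken from the application.

```python
def build_first_training_samples(target_inquiry_corpora, answer_to_inquiry,
                                 inquiry_to_category, answer_to_intent):
    """Pair each target inquiry-related corpus with the inquiry category of its
    mapped AI inquiry corpus (using the stated one-to-one mapping) and keep the
    pre-labeled intent label."""
    samples = []
    for corpus in target_inquiry_corpora:
        inquiry = answer_to_inquiry[corpus]  # one-to-one mapping from the text
        samples.append({
            "inquiry_category": inquiry_to_category[inquiry],  # e.g. "Q3"
            "customer_answer": corpus,
            "intent_label": answer_to_intent[corpus],
        })
    return samples
```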
  • S5: Acquire the target non-inquiry-related corpus from the target non-inquiry-related corpus library, associate the target non-inquiry-related corpus with a preset inquiry category based on the intent label, and obtain a second training sample.
  • the target non-inquiry-related corpus is associated with a preset inquiry category to obtain a second training sample.
  • the second training sample belongs to the training sample in which the customer's intention does not depend on the AI query corpus.
  • associating the target non-inquiry-related corpus with a preset inquiry category based on the intent tag, and obtaining the second training sample includes:
  • sample equalization processing is performed on the target non-inquiry-related corpus corresponding to each of the intent tags, to obtain balanced corpus;
  • the balanced corpus corresponding to each of the intent labels is associated with a preset query category to obtain the second training sample.
  • In this embodiment, sample balancing is performed on the non-inquiry-related corpus under each intent label to prevent the sample counts under different intent labels from differing too much and affecting the subsequent training of the model; the association with preset inquiry categories is sketched below.
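Following the device description, the balanced corpus under each intent label is associated with a preset inquiry category with equal probability. The sketch below assumes the six categories Q1 to Q6 mentioned earlier; the dictionary layout is an assumption for illustration.

```python
import random

PRESET_INQUIRY_CATEGORIES = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]  # categories named in the text

def build_second_training_samples(balanced_by_intent, seed=7):
    """Associate each balanced non-inquiry-related corpus with an inquiry
    category chosen uniformly at random, keeping its intent label."""
    rng = random.Random(seed)
    samples = []
    for intent_label, corpora in balanced_by_intent.items():
        for corpus in corpora:
            samples.append({
                "inquiry_category": rng.choice(PRESET_INQUIRY_CATEGORIES),  # equal probability
                "customer_answer": corpus,
                "intent_label": intent_label,
            })
    return samples
```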
  • The quantity threshold includes a first quantity threshold and a second quantity threshold, wherein the first quantity threshold is greater than the second quantity threshold. Performing sample equalization processing on the target non-inquiry-related corpus corresponding to each intent label based on the preset quantity thresholds to obtain balanced corpus includes:
  • when the quantity of target non-inquiry-related corpus corresponding to the current intent label is greater than the first quantity threshold, randomly filtering the target non-inquiry-related corpus corresponding to the current intent label until its quantity is less than or equal to the first quantity threshold;
  • when the quantity of target non-inquiry-related corpus corresponding to the current intent label is less than the second quantity threshold, performing corpus expansion on the target non-inquiry-related corpus corresponding to the current intent label until its quantity is greater than or equal to the second quantity threshold.
  • the first number threshold may be set to 2500
  • the second number threshold may be set to 1000.
  • the specific values of the first quantity threshold and/or the second quantity threshold can be adjusted according to actual needs, as long as they are applicable.
  • For intent labels with more than 2,500 corpora, 2,500 non-inquiry-related corpora are randomly selected and retained.
  • For intent labels with fewer than 1,000 corpora, the corpus is expanded to 1,000.
  • The corpus of each intent label is limited to no more than 2,500 and no fewer than 1,000 corpora because the intent labels available for model training are otherwise seriously imbalanced.
  • Performing corpus expansion on the target non-inquiry-related corpus corresponding to the current intent label includes:
  • calling a preset random oversampling package and randomly copying the target non-inquiry-related corpus corresponding to the current intent label through the random oversampling package.
  • In this embodiment, the method of corpus expansion is to use Python to call the RandomOverSample (random oversampling) package.
  • With the random oversampling package, some corpora in the corpus can be randomly copied to expand the corpus to a predetermined size.
  • The random oversampling package is often used to randomly replicate and repeat minority-class samples; the goal is to make the number of minority-class samples equal to that of the majority class, obtaining a new balanced dataset. A sketch of the sample equalization step is given below.
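The sample equalization described above can be sketched as follows. The text mentions calling a Python random oversampling package; this sketch reproduces the same random screening and random duplication directly with the standard library so that it stays self-contained, and the helper name is an assumption.

```python
import random

FIRST_QUANTITY_THRESHOLD = 2500   # upper bound per intent label (from the text)
SECOND_QUANTITY_THRESHOLD = 1000  # lower bound per intent label (from the text)

def balance_by_intent(corpora_by_intent, seed=42):
    """corpora_by_intent: dict mapping intent label -> list of target
    non-inquiry-related corpora. Randomly screen labels above the first
    threshold and randomly duplicate corpora for labels below the second."""
    rng = random.Random(seed)
    balanced = {}
    for intent_label, corpora in corpora_by_intent.items():
        if len(corpora) > FIRST_QUANTITY_THRESHOLD:
            corpora = rng.sample(corpora, FIRST_QUANTITY_THRESHOLD)
        elif len(corpora) < SECOND_QUANTITY_THRESHOLD:
            extra = rng.choices(corpora, k=SECOND_QUANTITY_THRESHOLD - len(corpora))
            corpora = list(corpora) + extra
        balanced[intent_label] = list(corpora)
    return balanced
```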
  • S6: Use the first training sample and the second training sample as training corpus and output them, wherein the training corpus is used for training an intent recognition model.
  • a better training corpus is obtained, and the consistency between the accuracy in the training environment and the accuracy in the production environment is improved.
  • The intent recognition model trained with this corpus can identify customer intent more accurately.
  • the preset intent recognition model is trained by the training corpus, and the trained intent recognition model is obtained.
  • the above training corpus can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a chain of data blocks linked by cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the present application can be applied in the field of smart medical care, thereby promoting the construction of smart cities.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • the present application provides an embodiment of a training corpus generation device for an intent recognition model, which corresponds to the method embodiment shown in FIG. 2 .
  • the device can be specifically applied to various electronic devices.
  • The training corpus generation device 300 for the intent recognition model described in this embodiment includes: a matching module 301, an establishing module 302, a computing module 303, a generating module 304, an association module 305, and an output module 306.
  • The matching module 301 is configured to receive the AI inquiry corpus pre-labeled with inquiry categories and the customer answer corpus pre-labeled with intent labels, and to perform a screening operation on the customer answer corpus based on a preset regular expression to obtain inquiry-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship. The establishing module 302 is configured to establish an inquiry-related corpus library and a non-inquiry-related corpus library based on the inquiry-related corpus and the non-inquiry-related corpus, respectively. The computing module 303 is configured to calculate the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library, and to adjust the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity to obtain a target inquiry-related corpus library and a target non-inquiry-related corpus library.
  • The generating module 304 is configured to acquire the target inquiry-related corpus from the target inquiry-related corpus library, determine the inquiry category corresponding to the target inquiry-related corpus based on the AI inquiry corpus, and generate a first training sample based on that inquiry category and the target inquiry-related corpus.
  • The association module 305 is configured to acquire the target non-inquiry-related corpus from the target non-inquiry-related corpus library and, based on the intent label, associate the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample. The output module 306 is configured to use the first training sample and the second training sample as training corpus and output them, wherein the training corpus is used for training the intent recognition model.
  • The present application calculates the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library, and adjusts the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity, so that the determined target inquiry-related corpus and target non-inquiry-related corpus are more accurate.
  • By associating the target non-inquiry-related corpus with preset inquiry categories based on the intent labels, the problem of not filling an inquiry category into training corpus that does not depend on the AI inquiry corpus is solved without causing an explosion of the training corpus, which preserves the efficiency of model training.
  • The training corpus generated in this way keeps the accuracy of the intent recognition model at a high level.
  • The matching module 301 includes a matching sub-module, a presentation sub-module, a marking sub-module, and a generating sub-module.
  • The matching sub-module is configured to match the customer answer corpus based on a preset regular expression, use the successfully matched customer answer corpus as suspected inquiry-related corpus, and use the unmatched customer answer corpus as suspected non-inquiry-related corpus.
  • The presentation sub-module is configured to display the suspected inquiry-related corpus on a preset front-end page and notify a designated person to confirm the suspected inquiry-related corpus.
  • The marking sub-module is configured to, when it is recognized that the designated person has completed the confirmation, mark the suspected inquiry-related corpus as inquiry-related or non-inquiry-related based on the designated person's confirmation.
  • The generating sub-module is configured to use the suspected inquiry-related corpus marked as inquiry-related as inquiry-related corpus, and to use the suspected non-inquiry-related corpus together with the suspected inquiry-related corpus marked as non-inquiry-related as non-inquiry-related corpus.
  • The calculation module 303 includes a first vector sub-module, a second vector sub-module, a similarity calculation sub-module, and a similarity confirmation sub-module.
  • The first vector sub-module is configured to input the current inquiry-related corpus into a pre-trained language representation model to obtain inquiry-related word vectors. The second vector sub-module is configured to input the non-inquiry-related corpus into the pre-trained language representation model to obtain non-inquiry-related word vectors.
  • The similarity calculation sub-module is configured to calculate, by traversal, the cosine similarity between the current non-inquiry-related word vector and each of the inquiry-related word vectors.
  • The similarity confirmation sub-module is configured to take the cosine similarity with the largest value as the similarity between the current non-inquiry-related corpus and the inquiry-related corpus library.
  • The computing module 303 further includes a first identification sub-module and a first allocation sub-module.
  • The first identification sub-module is configured to identify whether the similarity is greater than a preset first similarity threshold and, when it is, to use the corresponding non-inquiry-related corpus as to-be-confirmed corpus and notify the designated person to classify the to-be-confirmed corpus.
  • The first allocation sub-module is configured to, when it is recognized that the designated person has completed the classification of the to-be-confirmed corpus, allocate the to-be-confirmed corpus to the non-inquiry-related corpus library or the inquiry-related corpus library according to the designated person's classification, so as to obtain a target inquiry-related corpus library and a target non-inquiry-related corpus library.
  • The above calculation module 303 is further configured to identify whether the similarity is greater than a preset second similarity threshold and, when it is, to delete the corresponding non-inquiry-related corpus from the non-inquiry-related corpus library, so as to obtain a target inquiry-related corpus library and a target non-inquiry-related corpus library.
  • In some embodiments, the calculation module 303 further includes a first calculation sub-module, a second identification sub-module, a second allocation sub-module, a second calculation sub-module, a third identification sub-module, a third allocation sub-module, a third calculation sub-module, and a deletion sub-module.
  • The first calculation sub-module is configured to calculate the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library.
  • The second identification sub-module is configured to identify whether the similarity is greater than the preset first similarity threshold and, when it is, to use the corresponding non-inquiry-related corpus as first to-be-confirmed corpus and notify the designated person to classify the first to-be-confirmed corpus.
  • The second allocation sub-module is configured to, when it is recognized that the designated person has completed the classification of the first to-be-confirmed corpus, allocate the first to-be-confirmed corpus to the non-inquiry-related corpus library or the inquiry-related corpus library according to the designated person's classification, so as to obtain a first inquiry-related corpus library and a first non-inquiry-related corpus library.
  • The second calculation sub-module is configured to calculate the similarity between each first non-inquiry-related corpus in the first non-inquiry-related corpus library and the first inquiry-related corpus library; the third identification sub-module and the third allocation sub-module are configured, in the same manner, to obtain second to-be-confirmed corpus and allocate it to the first non-inquiry-related corpus library or the first inquiry-related corpus library, so as to obtain a second inquiry-related corpus library and a second non-inquiry-related corpus library.
  • The third calculation sub-module is configured to calculate the second similarity between each second non-inquiry-related corpus in the second non-inquiry-related corpus library and the second inquiry-related corpus library.
  • The deletion sub-module is configured to identify whether the second similarity is greater than a preset second similarity threshold and, when it is, to delete the corresponding second non-inquiry-related corpus from the second non-inquiry-related corpus library, so as to obtain the target inquiry-related corpus library and the target non-inquiry-related corpus library.
  • The association module 305 includes a determination sub-module, an equalization sub-module, and an association sub-module.
  • The determination sub-module is configured to determine the target non-inquiry-related corpus corresponding to each of the intent labels.
  • The equalization sub-module is configured to perform sample equalization processing on the target non-inquiry-related corpus corresponding to each of the intent labels based on a preset quantity threshold, so as to obtain balanced corpus.
  • The association sub-module is configured to associate the balanced corpus corresponding to each of the intent labels with a preset inquiry category based on a preset equal probability, so as to obtain the second training sample.
  • The quantity threshold includes a first quantity threshold and a second quantity threshold, wherein the first quantity threshold is greater than the second quantity threshold, and the equalization sub-module includes an identification unit, a screening unit, and an expansion unit.
  • The identification unit is configured to identify whether the quantity of target non-inquiry-related corpus corresponding to the current intent label is greater than the first quantity threshold or less than the second quantity threshold.
  • The screening unit is configured to, when the quantity of target non-inquiry-related corpus corresponding to the current intent label is greater than the first quantity threshold, randomly screen the target non-inquiry-related corpus corresponding to the current intent label until its quantity is less than or equal to the first quantity threshold.
  • The expansion unit is configured to, when the quantity of target non-inquiry-related corpus corresponding to the current intent label is less than the second quantity threshold, perform corpus expansion on the target non-inquiry-related corpus corresponding to the current intent label until its quantity is greater than or equal to the second quantity threshold.
  • In some embodiments, the expansion unit is further configured to call a preset random oversampling package and use it to randomly copy the target non-inquiry-related corpus corresponding to the current intent label.
  • This application calculates the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus library and the inquiry-related corpus library, and adjusts the inquiry-related corpus library and the non-inquiry-related corpus library based on the similarity, so that the determined target inquiry-related corpus and target non-inquiry-related corpus are more accurate.
  • By associating the target non-inquiry-related corpus with preset inquiry categories based on the intent labels, the problem of not filling an inquiry category into training corpus that does not depend on the AI inquiry corpus is solved without causing an explosion of the training corpus, which preserves the efficiency of model training.
  • The training corpus generated in this way keeps the accuracy of the intent recognition model at a high level.
  • FIG. 4 is a block diagram of a basic structure of a computer device according to this embodiment.
  • The computer device 200 includes a memory 201, a processor 202, and a network interface 203 that communicate with each other through a system bus. It should be noted that only the computer device 200 with components 201-203 is shown in the figure, but it should be understood that implementing all of the shown components is not required, and more or fewer components may be implemented instead. Those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to pre-set or stored instructions, and its hardware includes but is not limited to a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded equipment, and the like.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
  • the memory 201 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), Magnetic Memory, Magnetic Disk, Optical Disk, etc.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the memory 201 may be an internal storage unit of the computer device 200 , such as a hard disk or a memory of the computer device 200 .
  • the memory 201 may also be an external storage device of the computer device 200, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 201 may also include both an internal storage unit of the computer device 200 and an external storage device thereof.
  • the memory 201 is generally used to store the operating system and various application software installed on the computer device 200 , such as computer-readable instructions for a method for generating training corpus of an intent recognition model.
  • the memory 201 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 202 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 202 is typically used to control the overall operation of the computer device 200 .
  • the processor 202 is configured to execute computer-readable instructions stored in the memory 201 or process data, for example, computer-readable instructions for executing a method for generating training corpus of the intent recognition model.
  • the network interface 203 may include a wireless network interface or a wired network interface, and the network interface 203 is generally used to establish a communication connection between the computer device 200 and other electronic devices.
  • In this way, the problem of not filling an inquiry category into training corpus that does not depend on the AI inquiry corpus is solved; at the same time, better training corpus is obtained without causing an explosion of the training corpus, and the accuracy of the intent recognition model's ability to identify customer intent is effectively improved through the training corpus.
  • the present application also provides another embodiment, that is, to provide a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor to The at least one processor is caused to perform the above-described method for generating training corpus of an intent recognition model.
  • In this way, the problem of not filling an inquiry category into training corpus that does not depend on the AI inquiry corpus is solved; at the same time, better training corpus is obtained without causing an explosion of the training corpus, and the accuracy of the intent recognition model's ability to identify customer intent is effectively improved through the training corpus.
  • The method of the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application can be embodied in the form of a software product in essence or in a part that contributes to the prior art, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM), including several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of this application.

Abstract

The present method belongs to the field of big data and is applied in the field of smart medical care. Disclosed are a training corpus generation method for an intention recognition model, and a related device thereof. The method comprises: receiving an AI inquiry corpus pre-annotated with an inquiry class, and a customer answer corpus pre-annotated with an intention tag, wherein the customer answer corpus comprises an inquiry related corpus and a non-inquiry related corpus; establishing an inquiry related corpus library and a non-inquiry related corpus library; adjusting the inquiry related corpus library and the non-inquiry related corpus library on the basis of the similarity between the non-inquiry related corpus and the inquiry related corpus library, so as to obtain a target inquiry related corpus library and a target non-inquiry related corpus library; establishing a first training sample on the basis of the target inquiry related corpus library; establishing a second training sample on the basis of the intention tag and the non-inquiry related corpus library; and taking the first training sample and the second training sample as a training corpus and outputting same. The training corpus can be stored in a blockchain. By means of the method, the quality of a training corpus is improved.

Description

意图识别模型的训练语料生成方法及其相关设备Training corpus generation method for intent recognition model and related equipment
本申请要求于2020年11月17日提交中国专利局、申请号为202011288871.X,发明名称为“意图识别模型的训练语料生成方法及其相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application filed on November 17, 2020 with the application number 202011288871.X and the invention titled "Method for generating training corpus for intent recognition model and related equipment", the entire contents of which are Incorporated herein by reference.
技术领域technical field
本申请涉及大数据技术领域,尤其涉及意图识别模型的训练语料生成方法及其相关设备。The present application relates to the field of big data technology, and in particular, to a training corpus generation method for an intent recognition model and related devices.
背景技术Background technique
随着计算机技术的不断变革和发展,人工智能已经逐渐应用于各行各业中,改善人们的生活。人机对话是人工智能的重要发展领域,对话场景复杂多样,要求计算机能够在对话的过程中精准识别客户意图,以便于更好的展开对话。With the continuous change and development of computer technology, artificial intelligence has been gradually applied in all walks of life to improve people's lives. Human-computer dialogue is an important development area of artificial intelligence. The dialogue scenes are complex and diverse, requiring computers to accurately identify customer intentions in the process of dialogue, so as to facilitate better dialogue.
目前,人机对话多是采用意图识别模型来识别客户的意图,由于存在有些场景中客户意图依赖AI(Artificial Intelligence,人工智能)问询,有些场景中客户意图不依赖AI问询。因此在意图识别模型训练过程大多是根据依赖情况决定训练样本中是否填充对应的AI问询。At present, human-machine dialogues mostly use intent recognition models to identify customer intentions. In some scenarios, customer intentions rely on AI (Artificial Intelligence) inquiries, and in some scenarios, customer intentions do not rely on AI inquiries. Therefore, in the training process of the intent recognition model, it is mostly determined whether to fill the corresponding AI query in the training sample according to the dependency situation.
但是,发明人意识到,由于在实际生产过程中无法判断客户意图对AI问询的依赖情况,输入模型的预测参数都是含有AI问询的。导致模型训练模式和模型预测模式不一致,进而使得意图识别模型在生产上的精准度,比在训练环境中的精准度低。However, the inventor realized that since the dependence of customer intentions on AI queries cannot be judged in the actual production process, the prediction parameters of the input model all contain AI queries. As a result, the model training mode and the model prediction mode are inconsistent, and the accuracy of the intent recognition model in production is lower than that in the training environment.
发明内容SUMMARY OF THE INVENTION
本申请实施例的目的在于提出一种意图识别模型的训练语料生成方法及其相关设备,提高意图识别模型的训练语料的质量。The purpose of the embodiments of the present application is to propose a training corpus generation method for an intent recognition model and related equipment, so as to improve the quality of the training corpus of the intent recognition model.
为了解决上述技术问题,本申请实施例提供一种意图识别模型的训练语料生成方法,采用了如下所述的技术方案:In order to solve the above technical problems, the embodiment of the present application provides a training corpus generation method for an intent recognition model, which adopts the following technical solutions:
一种意图识别模型的训练语料生成方法,包括下述步骤:A training corpus generation method for an intent recognition model, comprising the following steps:
接收预标注问询类别的AI问询语料和预标注意图标签的客户回答语料,并基于预设的正则表达式对所述客户回答语料进行筛选操作,得到问询相关语料以及非问询相关语料,其中,所述客户回答语料与所述AI问询语料具有一一对应的映射关系;Receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus. Corpus, wherein the customer answer corpus and the AI query corpus have a one-to-one mapping relationship;
分别基于所述问询相关语料和所述非问询相关语料建立问询相关语料库和非问询相关语料库;establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus and the non-inquiry-related corpus, respectively;
计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度,并基于所述相似度调整所述问询相关语料库和所述非问询相关语料库,获得目标问询相关语料库和目标非问询相关语料库;Calculate the similarity between each of the non-inquiry-related corpora and the inquiry-related corpus in the non-inquiry-related corpus, and adjust the inquiry-related corpus and the non-inquiry-related corpus based on the similarity Relevant corpus, obtain the target inquiry-related corpus and the target non-inquiry-related corpus;
获取所述目标问询相关语料库中的目标问询相关语料,并基于所述AI问询语料确定所述目标问询相关语料对应的问询类别,基于所述目标问询相关语料对应的问询类别和所述目标问询相关语料生成第一训练样本;Obtain the target query related corpus in the target query related corpus, and determine the query category corresponding to the target query related corpus based on the AI query corpus, and based on the target query related corpus The corresponding query The category and the target query related corpus generate a first training sample;
获取所述目标非问询相关语料库中的目标非问询相关语料,基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本;acquiring the target non-inquiry-related corpus in the target non-inquiry-related corpus, and based on the intent label, associating the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample;
将所述第一训练样本和所述第二训练样本作为训练语料并输出,其中,所述训练语料用于意图识别模型的训练。The first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
为了解决上述技术问题,本申请实施例还提供一种意图识别模型的训练语料生成装置,采用了如下所述的技术方案:In order to solve the above technical problems, the embodiment of the present application also provides a training corpus generation device for an intent recognition model, which adopts the following technical solutions:
一种意图识别模型的训练语料生成装置,包括:A training corpus generation device for an intent recognition model, comprising:
匹配模块,用于接收预标注问询类别的AI问询语料和预标注意图标签的客户回答语料,并基于预设的正则表达式对所述客户回答语料进行筛选操作,得到问询相关语料以及非问询相关语料,其中,所述客户回答语料与所述AI问询语料具有一一对应的映射关系;The matching module is used to receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship;
建立模块,用于分别基于所述问询相关语料和所述非问询相关语料建立问询相关语料库和非问询相关语料库;an establishment module for establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus and the non-inquiry-related corpus respectively;
计算模块,用于计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度,并基于所述相似度调整所述问询相关语料库和所述非问询相关语料库,获得目标问询相关语料库和目标非问询相关语料库;A calculation module, configured to calculate the similarity between each non-inquiry-related corpus and the inquiry-related corpus in the non-inquiry-related corpus, and adjust the inquiry-related corpus and the inquiry-related corpus based on the similarity. For the non-inquiry-related corpus, a target inquiry-related corpus and a target non-inquiry-related corpus are obtained;
生成模块,用于获取所述目标问询相关语料库中的目标问询相关语料,并基于所述AI问询语料确定所述目标问询相关语料对应的问询类别,基于所述目标问询相关语料对应的问询类别和所述目标问询相关语料生成第一训练样本;The generating module is configured to obtain the target query related corpus in the target query related corpus, and determine the query category corresponding to the target query related corpus based on the AI query corpus, and based on the target query related corpus The query category corresponding to the corpus and the target query-related corpus generate a first training sample;
关联模块,用于获取所述目标非问询相关语料库中的目标非问询相关语料,基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本;以及The association module is used to obtain the target non-inquiry-related corpus in the target non-inquiry-related corpus, and based on the intent tag, associate the target non-inquiry-related corpus with a preset inquiry category, and obtain the first two training samples; and
输出模块,用于将所述第一训练样本和所述第二训练样本作为训练语料并输出,其中,所述训练语料用于意图识别模型的训练。The output module is used for outputting the first training sample and the second training sample as training corpus, wherein the training corpus is used for training the intention recognition model.
为了解决上述技术问题,本申请实施例还提供一种计算机设备,采用了如下所述的技术方案:In order to solve the above-mentioned technical problems, the embodiment of the present application also provides a computer device, which adopts the following technical solutions:
一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现如下所述的意图识别模型的训练语料生成方法:A computer device, comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the processor executes the computer-readable instructions, the following method for generating a training corpus of an intent recognition model is implemented:
receiving an AI inquiry corpus pre-labeled with inquiry categories and a customer answer corpus pre-labeled with intent labels, and performing a screening operation on the customer answer corpus based on preset regular expressions to obtain inquiry-related corpus entries and non-inquiry-related corpus entries, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship;
establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus entries and the non-inquiry-related corpus entries, respectively;
calculating the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus;
acquiring the target inquiry-related corpus entries in the target inquiry-related corpus, determining the inquiry category corresponding to each target inquiry-related corpus entry based on the AI inquiry corpus, and generating first training samples based on the target inquiry-related corpus entries and their corresponding inquiry categories;
acquiring the target non-inquiry-related corpus entries in the target non-inquiry-related corpus and, based on the intent labels, associating the target non-inquiry-related corpus entries with preset inquiry categories to obtain second training samples;
outputting the first training samples and the second training samples as training corpora, wherein the training corpora are used for training the intent recognition model.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solution:
A computer-readable storage medium, on which computer-readable instructions are stored, wherein when the computer-readable instructions are executed by a processor, the following training corpus generation method for an intent recognition model is implemented:
receiving an AI inquiry corpus pre-labeled with inquiry categories and a customer answer corpus pre-labeled with intent labels, and performing a screening operation on the customer answer corpus based on preset regular expressions to obtain inquiry-related corpus entries and non-inquiry-related corpus entries, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship;
establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus entries and the non-inquiry-related corpus entries, respectively;
calculating the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus;
acquiring the target inquiry-related corpus entries in the target inquiry-related corpus, determining the inquiry category corresponding to each target inquiry-related corpus entry based on the AI inquiry corpus, and generating first training samples based on the target inquiry-related corpus entries and their corresponding inquiry categories;
acquiring the target non-inquiry-related corpus entries in the target non-inquiry-related corpus and, based on the intent labels, associating the target non-inquiry-related corpus entries with preset inquiry categories to obtain second training samples;
outputting the first training samples and the second training samples as training corpora, wherein the training corpora are used for training the intent recognition model.
Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects:
The present application calculates the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjusts the inquiry-related corpus and the non-inquiry-related corpus based on the similarity, so that the determined target inquiry-related corpus entries and target non-inquiry-related corpus entries are highly accurate. By associating the target non-inquiry-related corpus entries with preset inquiry categories based on the intent labels, the problem that the inquiry category is left unfilled for training corpora that do not depend on the AI inquiry corpus is solved, without causing an explosion of training corpora, which guarantees the efficiency of model training. The training corpora generated in this way enable the intent recognition model to maintain a high level of accuracy.
Description of the drawings
In order to illustrate the solutions in the present application more clearly, the accompanying drawings used in the description of the embodiments of the present application are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
FIG. 2 is a flowchart of an embodiment of a training corpus generation method for an intent recognition model according to the present application;
FIG. 3 is a schematic structural diagram of an embodiment of a training corpus generation apparatus for an intent recognition model according to the present application;
FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
Reference numerals: 200, computer device; 201, memory; 202, processor; 203, network interface; 300, training corpus generation apparatus for an intent recognition model; 301, matching module; 302, establishing module; 303, calculating module; 304, generating module; 305, association module; 306, output module.
Detailed description
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terms used in the specification of the application are only for the purpose of describing specific embodiments and are not intended to limit the application. The terms "comprising" and "having" and any variations thereof in the specification, claims and the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second" and the like in the specification and claims of the present application or the above drawings are used to distinguish different objects, rather than to describe a specific order.
Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102 and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients and social platform software.
The terminal devices 101, 102 and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers and the like.
The server 105 may be a server that provides various services, for example a background server that provides support for the pages displayed on the terminal devices 101, 102 and 103.
It should be noted that the training corpus generation method for an intent recognition model provided by the embodiments of the present application is generally executed by the server/terminal device, and accordingly the training corpus generation apparatus for the intent recognition model is generally arranged in the server/terminal device.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
Continuing to refer to FIG. 2, a flowchart of an embodiment of a training corpus generation method for an intent recognition model according to the present application is shown. The training corpus generation method for an intent recognition model comprises the following steps:
S1: receiving an AI inquiry corpus pre-labeled with inquiry categories and a customer answer corpus pre-labeled with intent labels, and performing a screening operation on the customer answer corpus based on preset regular expressions to obtain inquiry-related corpus entries and non-inquiry-related corpus entries, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship.
In this embodiment, annotators label the customer answer corpus under each inquiry category with intent labels in advance, where the inquiry categories may include six categories, Q1 to Q6. The AI inquiry corpus pre-labeled with inquiry categories and the customer answer corpus pre-labeled with intent labels are received, and regular-expression matching is applied to the customer answer corpus to determine whether a customer answer is related to the AI inquiry: the successfully matched customer answers are taken as inquiry-related corpus entries, and the remaining customer answers are taken as non-inquiry-related corpus entries, which facilitates subsequent further processing of the customer answer corpus.
Examples of inquiry-related corpus entries are as follows:
[Table of example inquiry-related corpus entries; see Figure PCTCN2021090462-appb-000001 of the original filing.]
In this embodiment, the electronic device on which the training corpus generation method for an intent recognition model runs (for example, the server/terminal device shown in FIG. 1) may receive the AI inquiry corpus and the customer answer corpus through a wired connection or a wireless connection. It should be pointed out that the above wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra wideband) connection, and other wireless connection methods now known or developed in the future.
Specifically, matching the customer answer corpus based on the preset regular expressions, taking the successfully matched customer answers as inquiry-related corpus entries and the unmatched customer answers as non-inquiry-related corpus entries comprises:
matching the customer answer corpus based on the preset regular expressions, taking the successfully matched customer answers as suspected inquiry-related corpus entries and the unmatched customer answers as suspected non-inquiry-related corpus entries;
displaying the suspected inquiry-related corpus entries on a preset front-end page, and notifying designated personnel to confirm the suspected inquiry-related corpus entries;
when it is recognized that the designated personnel have completed the confirmation, marking the suspected inquiry-related corpus entries as inquiry-related or non-inquiry-related based on the confirmation by the designated personnel;
taking the suspected inquiry-related corpus entries marked as inquiry-related as the inquiry-related corpus entries, and taking the suspected non-inquiry-related corpus entries together with the suspected inquiry-related corpus entries marked as non-inquiry-related as the non-inquiry-related corpus entries.
In this embodiment, regular-expression matching is used to extract the suspected inquiry-related corpus entries from the customer answer corpus; the remaining entries, i.e., those that fail to match, are suspected non-inquiry-related corpus entries. The suspected inquiry-related corpus entries are handed over to designated personnel for confirmation, and the confirmed inquiry-related corpus entries are determined. The suspected inquiry-related corpus entries marked as inquiry-related are taken as inquiry-related corpus entries, while the suspected non-inquiry-related corpus entries, together with the suspected inquiry-related corpus entries marked as non-inquiry-related, are taken as non-inquiry-related corpus entries. Alternatively, the non-inquiry-related corpus entries may be obtained by removing the confirmed inquiry-related corpus entries from the complete customer answer corpus. Through the confirmation by the designated personnel, the inquiry-related corpus entries can be further determined, and the accuracy of dividing the customer answer corpus is improved. A minimal sketch of the regular-expression screening follows.
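The following Python sketch illustrates, under assumptions not stated in the application, how the regular-expression screening of step S1 might be organized; the acknowledgement pattern, function name and example data are hypothetical and are only meant to show the split into suspected inquiry-related and suspected non-inquiry-related entries.
    import re

    # Hypothetical acknowledgement pattern; the actual preset regular
    # expressions are business-defined and are not given in the application.
    ACKNOWLEDGEMENT_PATTERN = re.compile(r"(好的|嗯|知道了|晓得|了解|OK|ok)")

    def screen_customer_answers(customer_answers):
        """Split customer answers into suspected inquiry-related and
        suspected non-inquiry-related corpus entries."""
        suspected_related, suspected_unrelated = [], []
        for answer in customer_answers:
            if ACKNOWLEDGEMENT_PATTERN.search(answer):
                suspected_related.append(answer)
            else:
                suspected_unrelated.append(answer)
        return suspected_related, suspected_unrelated

    # The suspected inquiry-related entries would then be shown to designated
    # personnel on a front-end page for manual confirmation.
    related, unrelated = screen_customer_answers(["嗯嗯，知道了", "我存好了"])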
S2: establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus entries and the non-inquiry-related corpus entries, respectively.
In this embodiment, establishing the inquiry-related corpus and the non-inquiry-related corpus facilitates subsequent further processing of the inquiry-related corpus entries and the non-inquiry-related corpus entries.
S3: calculating the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus.
In this embodiment, the similarity between each non-inquiry-related corpus entry and the inquiry-related corpus is calculated, and the entries in the inquiry-related corpus and the non-inquiry-related corpus are adjusted based on the similarity, so that a more rigorous target inquiry-related corpus and target non-inquiry-related corpus are obtained.
Specifically, calculating the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus comprises:
inputting the current inquiry-related corpus entries into a pre-trained language representation model to obtain inquiry-related word vectors;
inputting the non-inquiry-related corpus entries into the pre-trained language representation model to obtain non-inquiry-related word vectors;
traversing and calculating the cosine similarity between the current non-inquiry-related word vector and each inquiry-related word vector;
taking the cosine similarity with the largest value as the similarity between the current non-inquiry-related corpus entry and the inquiry-related corpus.
In this embodiment, the language representation model is called to embed the inquiry-related corpus entries, converting each inquiry-related corpus entry into a 768-dimensional inquiry-related word vector, where each inquiry-related word vector represents the embedding of one corpus entry; an embedding represents a corpus entry with a low-dimensional vector. The language representation model may be a BERT (Bidirectional Encoder Representations from Transformers) model; the BERT model has broad generality, can capture longer-distance dependencies and represents information bidirectionally. The language representation model is likewise called to convert the non-inquiry-related corpus entries into 768-dimensional non-inquiry-related word vectors. The cosine similarity between each non-inquiry-related word vector and every inquiry-related word vector is calculated by traversal. After the traversal, the maximum cosine similarity is taken as the similarity between the current non-inquiry-related corpus entry and the inquiry-related corpus.
Examples of inquiry-related word vectors are as follows:
    Inquiry-related corpus entry      Inquiry-related word vector
    是的，好 ("yes, good")            [0.3, 0.2, 0.0005, ..., 0.006]
    嗯嗯 ("uh-huh")                   [0.1, 0.003, 0.002, ..., 0.03]
    了解 ("understood")               [0.13, 0.001, 0.05, ..., 0.07]
    晓得了 ("got it")                 [0.27, 0.006, 0.04, ..., 0.4]
    OK                                [0.09, 0.03, 0.08, ..., 0.004]
    知道了 ("got it")                 [0.19, 0.3, 0.02, ..., 0.008]
    ...
An example of the calculation process is as follows: for a non-inquiry-related corpus entry such as "我存好了" ("I have saved it"), the corresponding non-inquiry-related word vector is a 768-dimensional vector [0.07, 0.002, 0.04, ..., 0.009]. The cosine similarity between this non-inquiry-related word vector and each inquiry-related word vector is calculated. For two vectors A and B of the same dimension, the cosine similarity is computed as:
similarity = cos(A, B) = (A · B) / (‖A‖ ‖B‖) = Σᵢ AᵢBᵢ / ( √(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²) )
Using the above formula, the cosine similarity between the non-inquiry-related corpus entry "我存好了" and every entry in the inquiry-related corpus is calculated; after the calculation, the maximum cosine similarity is taken as the similarity between the current non-inquiry-related corpus entry and the inquiry-related corpus. A minimal sketch of this embedding and similarity calculation follows.
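As a non-authoritative illustration of step S3, the sketch below assumes the Hugging Face transformers library and the bert-base-chinese checkpoint; the application only states that the language representation model may be a BERT model and does not name a specific implementation, and the function names are hypothetical.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")
    model.eval()

    def embed(sentence):
        """Return a 768-dimensional sentence embedding (here, the [CLS] vector)."""
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs.last_hidden_state[:, 0, :].squeeze(0)

    def cosine(a, b):
        return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

    def similarity_to_corpus(entry, inquiry_related_entries):
        """Maximum cosine similarity between one non-inquiry-related entry
        and every entry in the inquiry-related corpus."""
        entry_vec = embed(entry)
        return max(cosine(entry_vec, embed(e)) for e in inquiry_related_entries)

    # Example from the text: compare "我存好了" with the inquiry-related corpus.
    score = similarity_to_corpus("我存好了", ["是的，好", "嗯嗯", "了解", "晓得了", "OK", "知道了"])
In practice, pre-computing and caching the embeddings of the inquiry-related corpus would avoid re-embedding it for every non-inquiry-related entry; this is an implementation choice, not something stated in the application.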
Further, adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain the target inquiry-related corpus and the target non-inquiry-related corpus comprises:
identifying whether the similarity is greater than a preset first similarity threshold, and when the similarity is greater than the preset first similarity threshold, taking the corresponding non-inquiry-related corpus entry as a corpus entry to be confirmed and notifying designated personnel to classify the corpus entry to be confirmed;
when it is recognized that the designated personnel have completed the classification of the corpus entry to be confirmed, allocating the corpus entry to be confirmed to the non-inquiry-related corpus or the inquiry-related corpus according to the classification by the designated personnel, so as to obtain the target inquiry-related corpus and the target non-inquiry-related corpus.
In this embodiment, when the similarity is less than or equal to the preset first similarity threshold, the entry is considered to still belong to the non-inquiry-related corpus entries. For example, if the similarity is 0.3, which is less than the first similarity threshold of 0.6, the entry remains a non-inquiry-related corpus entry; if the similarity is 0.9, which is greater than the first similarity threshold of 0.6, the entry becomes a corpus entry to be confirmed. When a non-inquiry-related corpus entry is taken as a corpus entry to be confirmed, it is extracted from the non-inquiry-related corpus for reallocation. In this way, a more rigorous target inquiry-related corpus and target non-inquiry-related corpus are obtained.
In addition, as another embodiment of the present application, adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain the target inquiry-related corpus and the target non-inquiry-related corpus comprises:
identifying whether the similarity is greater than a preset second similarity threshold, and when the similarity is greater than the preset second similarity threshold, deleting the corresponding non-inquiry-related corpus entry from the non-inquiry-related corpus to obtain the target inquiry-related corpus and the target non-inquiry-related corpus.
In this embodiment, directly deleting the corresponding non-inquiry-related corpus entry when the similarity is greater than the preset second similarity threshold can effectively improve the processing speed of the computer. A minimal sketch of this threshold-based adjustment follows.
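The sketch below, an assumption rather than the application's implementation, shows one way the two adjustment embodiments could be organized around the 0.6 threshold used in the example: entries above the threshold are either routed to manual confirmation or, in the alternative embodiment, deleted outright. The callback and function names are hypothetical.
    FIRST_SIMILARITY_THRESHOLD = 0.6   # value taken from the example in the text

    def adjust_corpora(non_related_entries, inquiry_related_entries, similarity_fn,
                       confirmed_as_related=lambda entry: False,
                       delete_instead_of_confirm=False):
        """Return (target_inquiry_related, target_non_inquiry_related).

        similarity_fn(entry, corpus) should return the maximum cosine similarity,
        e.g. the similarity_to_corpus() sketch shown earlier."""
        target_related = list(inquiry_related_entries)
        target_non_related = []
        for entry in non_related_entries:
            score = similarity_fn(entry, inquiry_related_entries)
            if score <= FIRST_SIMILARITY_THRESHOLD:
                target_non_related.append(entry)   # remains non-inquiry-related
            elif delete_instead_of_confirm:
                continue                           # alternative embodiment: drop the entry
            elif confirmed_as_related(entry):
                target_related.append(entry)       # designated personnel: inquiry-related
            else:
                target_non_related.append(entry)   # designated personnel: non-inquiry-related
        return target_related, target_non_related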
As yet another embodiment of the present application, calculating the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain the target inquiry-related corpus and the target non-inquiry-related corpus may also comprise:
calculating the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus;
identifying whether the similarity is greater than a preset first similarity threshold, and when the similarity is greater than the preset first similarity threshold, taking the corresponding non-inquiry-related corpus entry as a first corpus entry to be confirmed and notifying designated personnel to classify the first corpus entry to be confirmed;
when it is recognized that the designated personnel have completed the classification of the first corpus entry to be confirmed, allocating the first corpus entry to be confirmed to the non-inquiry-related corpus or the inquiry-related corpus according to the classification by the designated personnel, so as to obtain a first inquiry-related corpus and a first non-inquiry-related corpus;
calculating a first similarity between each first non-inquiry-related corpus entry in the first non-inquiry-related corpus and the first inquiry-related corpus;
identifying whether the first similarity is greater than the preset first similarity threshold, and when the first similarity is greater than the preset first similarity threshold, taking the corresponding first non-inquiry-related corpus entry as a second corpus entry to be confirmed and notifying the designated personnel to classify the second corpus entry to be confirmed;
when it is recognized that the designated personnel have completed the classification of the second corpus entry to be confirmed, allocating the second corpus entry to be confirmed to the first non-inquiry-related corpus or the first inquiry-related corpus according to the classification by the designated personnel, so as to obtain a second inquiry-related corpus and a second non-inquiry-related corpus;
calculating a second similarity between each second non-inquiry-related corpus entry in the second non-inquiry-related corpus and the second inquiry-related corpus;
identifying whether the second similarity is greater than a preset second similarity threshold, and when the second similarity is greater than the preset second similarity threshold, deleting the corresponding second non-inquiry-related corpus entry from the second non-inquiry-related corpus to obtain the target inquiry-related corpus and the target non-inquiry-related corpus.
In this embodiment, the designated personnel in the present application may be annotators. If the similarity is greater than the first similarity threshold, the entry is included among the corpus entries requiring business confirmation, i.e., the corresponding non-inquiry-related corpus entry is taken as a corpus entry to be confirmed, and this part of the corpus is returned to the annotators. The annotators label whether each corpus entry to be confirmed is related to the AI inquiry. According to the annotators' labels, the corpus entries to be confirmed that are related to the AI inquiry are added to the inquiry-related corpus, and those that are not related to the AI inquiry are added to the non-inquiry-related corpus. In practice, after two such rounds, i.e., two rounds of business confirmation, the inquiry-related corpus is considered sufficiently rich. To save labeling labor cost, in the third round, if the maximum similarity (the second similarity described above) is greater than the preset second similarity threshold, the entry is deleted directly; if it is less than the second similarity threshold, the entry remains a non-inquiry-related corpus entry and stays in the non-inquiry-related corpus. Through the above process, the target inquiry-related corpus and the target non-inquiry-related corpus are obtained.
S4: acquiring the target inquiry-related corpus entries in the target inquiry-related corpus, determining the inquiry category corresponding to each target inquiry-related corpus entry based on the AI inquiry corpus, and generating first training samples based on the target inquiry-related corpus entries and their corresponding inquiry categories.
In this embodiment, the generated first training samples are training samples in which the customer intent depends on the AI inquiry corpus. Generating the first training samples from the target inquiry-related corpus entries and the corresponding inquiry categories guarantees the dependency between the first training samples and the customer intent.
S5: acquiring the target non-inquiry-related corpus entries in the target non-inquiry-related corpus and, based on the intent labels, associating the target non-inquiry-related corpus entries with preset inquiry categories to obtain second training samples.
In this embodiment, the target non-inquiry-related corpus entries are associated with preset inquiry categories based on the intent labels to obtain the second training samples. The second training samples are training samples in which the customer intent does not depend on the AI inquiry corpus. By associating the non-inquiry-related corpus entries with preset inquiry categories, the inquiry-category field of the obtained second training samples is not left vacant.
Specifically, associating the target non-inquiry-related corpus entries with preset inquiry categories based on the intent labels to obtain the second training samples comprises:
determining the target non-inquiry-related corpus entries corresponding to each intent label;
performing sample balancing on the target non-inquiry-related corpus entries corresponding to each intent label based on preset quantity thresholds to obtain balanced corpus entries;
associating the balanced corpus entries corresponding to each intent label with the preset inquiry categories with equal preset probability to obtain the second training samples.
In this embodiment, the non-inquiry-related corpus entries under each intent label are sample-balanced, so that the samples under different intent labels do not differ too greatly in number, which would otherwise impair the subsequent training of the model. The entries under each intent label are filled with the categories Q1 to Q6 with equal probability, so that the inquiry-category field of the second training samples is not vacant while each balanced corpus entry whose intent does not depend on the AI inquiry is filled uniformly with equal probability, avoiding sample bias. A minimal sketch of this equal-probability filling follows.
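A minimal sketch of the equal-probability filling is given below; it assumes a simple dictionary representation of the training samples, since the sample structure is not specified in the application.
    import random

    INQUIRY_CATEGORIES = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]   # categories named in the text

    def build_second_training_samples(balanced_entries_by_label):
        """balanced_entries_by_label maps each intent label to its balanced corpus entries."""
        samples = []
        for intent_label, entries in balanced_entries_by_label.items():
            for entry in entries:
                # Pair each entry with one of Q1-Q6 chosen uniformly at random,
                # so the inquiry-category field is never left vacant.
                category = random.choice(INQUIRY_CATEGORIES)
                samples.append({"inquiry_category": category,
                                "customer_answer": entry,
                                "intent_label": intent_label})
        return samples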
Wherein the quantity thresholds include a first quantity threshold and a second quantity threshold, the first quantity threshold being greater than the second quantity threshold, and performing sample balancing on the target non-inquiry-related corpus entries corresponding to each intent label based on the preset quantity thresholds to obtain the balanced corpus entries comprises:
identifying whether the number of target non-inquiry-related corpus entries corresponding to the current intent label is greater than the first quantity threshold or less than the second quantity threshold;
when the number of target non-inquiry-related corpus entries corresponding to the current intent label is greater than the first quantity threshold, randomly screening the target non-inquiry-related corpus entries corresponding to the current intent label until their number is less than or equal to the first quantity threshold;
when the number of target non-inquiry-related corpus entries corresponding to the current intent label is less than the second quantity threshold, expanding the target non-inquiry-related corpus entries corresponding to the current intent label until their number is greater than or equal to the second quantity threshold.
In this embodiment, the first quantity threshold may be set to 2500 and the second quantity threshold may be set to 1000. In practical application, the specific values of the first quantity threshold and/or the second quantity threshold can be adjusted according to actual needs. For intent labels with more than 2500 corpus entries, 2500 non-inquiry-related corpus entries are randomly selected and retained. For intent labels with fewer than 1000 corpus entries, the corpus is expanded to 1000 entries. Each intent label is limited to no more than 2500 and no fewer than 1000 corpus entries because the intent labels used for model training are severely imbalanced. Extensive experiments on the training corpus show that when a label already has more than 2500 corpus entries, adding more entries under that intent label yields only a very limited improvement in model accuracy and aggravates the sample imbalance of the training set, resulting in low recognition accuracy for intent labels with relatively few samples. For intent labels with fewer than 1000 corpus entries, the weight of the label in the model is too small, so the model's recognition accuracy for that label is insufficient. A minimal sketch of this balancing step follows.
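The following sketch, using the 2500 and 1000 thresholds mentioned above, is one possible reading of the balancing step; the simple random duplication used for expansion stands in for the random oversampling package discussed next, and the function names are hypothetical.
    import random

    FIRST_QUANTITY_THRESHOLD = 2500    # upper bound per intent label
    SECOND_QUANTITY_THRESHOLD = 1000   # lower bound per intent label

    def balance_label(entries):
        """Balance the corpus entries of one intent label."""
        if len(entries) > FIRST_QUANTITY_THRESHOLD:
            # Randomly keep at most 2500 entries.
            return random.sample(entries, FIRST_QUANTITY_THRESHOLD)
        if len(entries) < SECOND_QUANTITY_THRESHOLD:
            # Randomly duplicate entries until at least 1000 remain.
            expanded = list(entries)
            while len(expanded) < SECOND_QUANTITY_THRESHOLD:
                expanded.append(random.choice(entries))
            return expanded
        return list(entries)

    def balance_all(entries_by_label):
        return {label: balance_label(entries) for label, entries in entries_by_label.items()}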
Further, expanding the target non-inquiry-related corpus entries corresponding to the current intent label comprises:
calling a preset random oversampling package, and randomly copying the target non-inquiry-related corpus entries corresponding to the current intent label through the random oversampling package.
In this embodiment, the corpus is expanded by calling a RandomOverSample (random oversampling) package with Python; through this package, some entries in the corpus can be randomly copied so that the corpus is expanded to a predetermined size. Such a random oversampling package is commonly used to randomly copy and repeat minority-class samples, with the goal of making the number of minority-class samples equal to that of the majority class, thereby obtaining a new, balanced data set. A hedged usage sketch follows.
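The application refers to a Python "RandomOverSample" package without naming a specific library. As an assumption, the sketch below uses RandomOverSampler from the imbalanced-learn (imblearn) library, which implements the described behavior of randomly duplicating minority-class samples; its default strategy balances every class up to the majority-class count, and a sampling_strategy argument could be used instead to target a fixed count such as 1000.
    import numpy as np
    from imblearn.over_sampling import RandomOverSampler

    def oversample_corpus(entries, labels):
        """entries: list of corpus strings; labels: list of intent labels."""
        X = np.array(entries, dtype=object).reshape(-1, 1)  # RandomOverSampler expects 2-D features
        y = np.array(labels)
        sampler = RandomOverSampler(random_state=0)
        X_resampled, y_resampled = sampler.fit_resample(X, y)
        return X_resampled.ravel().tolist(), y_resampled.tolist()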
S6: outputting the first training samples and the second training samples as training corpora, wherein the training corpora are used for training the intent recognition model.
In this embodiment, a better training corpus is obtained from the first training samples and the second training samples, which improves the consistency between the accuracy observed in the training environment and the accuracy observed in the production environment; using this training corpus for the intent recognition model enables more accurate recognition of customer intent.
Examples of training corpora are as follows:
[Table of example training corpora; see Figure PCTCN2021090462-appb-000003 of the original filing.]
After the training corpora are obtained, a preset intent recognition model is trained with the training corpora to obtain a trained intent recognition model. A customer answer corpus to be recognized and an AI inquiry corpus to be recognized are received; the inquiry category that has a one-to-one mapping relationship with the AI inquiry corpus to be recognized is determined as the inquiry category to be recognized; and the customer answer corpus to be recognized and the inquiry category to be recognized are input into the trained intent recognition model to obtain the customer intent. A hedged inference sketch follows.
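The architecture of the intent recognition model is not specified in the application; purely as an illustration, the sketch below assumes a fine-tuned BERT sequence classifier (Hugging Face transformers) whose input pairs the inquiry category to be recognized with the customer answer to be recognized, mirroring the training-sample structure described above. The model path is hypothetical.
    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    MODEL_DIR = "path/to/trained-intent-model"   # hypothetical location of the trained model
    tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
    model = BertForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()

    def predict_intent(inquiry_category, customer_answer):
        """Return the index of the predicted intent label."""
        inputs = tokenizer(inquiry_category, customer_answer,
                           return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return int(logits.argmax(dim=-1))

    predicted = predict_intent("Q2", "我存好了")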
It should be emphasized that, in order to further ensure the privacy and security of the above training corpora, the training corpora may also be stored in a node of a blockchain.
The blockchain referred to in the present application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer and so on.
The present application can be applied in the field of smart medical care, thereby promoting the construction of smart cities.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through computer-readable instructions, which can be stored in a computer-readable storage medium; when executed, the computer-readable instructions may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM), etc.
It should be understood that although the steps in the flowchart of the accompanying drawings are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in the flowchart of the accompanying drawings may include multiple sub-steps or multiple stages, which are not necessarily executed at the same moment but may be executed at different moments, and their execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
With further reference to FIG. 3, as an implementation of the method shown in FIG. 2 above, the present application provides an embodiment of a training corpus generation apparatus for an intent recognition model. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
As shown in FIG. 3, the training corpus generation apparatus 300 for an intent recognition model described in this embodiment comprises: a matching module 301, an establishing module 302, a calculating module 303, a generating module 304, an association module 305 and an output module 306. The matching module 301 is configured to receive an AI inquiry corpus pre-labeled with inquiry categories and a customer answer corpus pre-labeled with intent labels, and perform a screening operation on the customer answer corpus based on preset regular expressions to obtain inquiry-related corpus entries and non-inquiry-related corpus entries, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship. The establishing module 302 is configured to establish an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus entries and the non-inquiry-related corpus entries, respectively. The calculating module 303 is configured to calculate the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjust the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus. The generating module 304 is configured to acquire the target inquiry-related corpus entries in the target inquiry-related corpus, determine the inquiry category corresponding to each target inquiry-related corpus entry based on the AI inquiry corpus, and generate first training samples based on the target inquiry-related corpus entries and their corresponding inquiry categories. The association module 305 is configured to acquire the target non-inquiry-related corpus entries in the target non-inquiry-related corpus and, based on the intent labels, associate the target non-inquiry-related corpus entries with preset inquiry categories to obtain second training samples. The output module 306 is configured to output the first training samples and the second training samples as training corpora, wherein the training corpora are used for training the intent recognition model.
In this embodiment, the present application calculates the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus, and adjusts the inquiry-related corpus and the non-inquiry-related corpus based on the similarity, so that the determined target inquiry-related corpus entries and target non-inquiry-related corpus entries are highly accurate. By associating the target non-inquiry-related corpus entries with preset inquiry categories based on the intent labels, the problem that the inquiry category is left unfilled for training corpora that do not depend on the AI inquiry corpus is solved, without causing an explosion of training corpora, which guarantees the efficiency of model training. The training corpora generated in this way enable the intent recognition model to maintain a high level of accuracy.
The matching module 301 includes a matching sub-module, a display sub-module, a marking sub-module and a generating sub-module. The matching sub-module is configured to match the customer answer corpus based on the preset regular expressions, taking the successfully matched customer answers as suspected inquiry-related corpus entries and the unmatched customer answers as suspected non-inquiry-related corpus entries. The display sub-module is configured to display the suspected inquiry-related corpus entries on a preset front-end page and notify designated personnel to confirm the suspected inquiry-related corpus entries. The marking sub-module is configured to, when it is recognized that the designated personnel have completed the confirmation, mark the suspected inquiry-related corpus entries as inquiry-related or non-inquiry-related based on the confirmation by the designated personnel. The generating sub-module is configured to take the suspected inquiry-related corpus entries marked as inquiry-related as the inquiry-related corpus entries, and take the suspected non-inquiry-related corpus entries together with the suspected inquiry-related corpus entries marked as non-inquiry-related as the non-inquiry-related corpus entries.
The calculating module 303 includes a first vector sub-module, a second vector sub-module, a similarity calculation sub-module and a similarity confirmation sub-module. The first vector sub-module is configured to input the current inquiry-related corpus entries into a pre-trained language representation model to obtain inquiry-related word vectors. The second vector sub-module is configured to input the non-inquiry-related corpus entries into the pre-trained language representation model to obtain non-inquiry-related word vectors. The similarity calculation sub-module is configured to traverse and calculate the cosine similarity between the current non-inquiry-related word vector and each inquiry-related word vector. The similarity confirmation sub-module is configured to take the cosine similarity with the largest value as the similarity between the current non-inquiry-related corpus entry and the inquiry-related corpus.
The calculating module 303 further includes a first identification sub-module and a first allocation sub-module. The first identification sub-module is configured to identify whether the similarity is greater than a preset first similarity threshold and, when the similarity is greater than the preset first similarity threshold, take the corresponding non-inquiry-related corpus entry as a corpus entry to be confirmed and notify designated personnel to classify the corpus entry to be confirmed. The first allocation sub-module is configured to, when it is recognized that the designated personnel have completed the classification of the corpus entry to be confirmed, allocate the corpus entry to be confirmed to the non-inquiry-related corpus or the inquiry-related corpus according to the classification by the designated personnel, so as to obtain the target inquiry-related corpus and the target non-inquiry-related corpus.
In some optional implementations of this embodiment, the above calculating module 303 is further configured to: identify whether the similarity is greater than a preset second similarity threshold and, when the similarity is greater than the preset second similarity threshold, delete the corresponding non-inquiry-related corpus entry from the non-inquiry-related corpus to obtain the target inquiry-related corpus and the target non-inquiry-related corpus.
In some optional implementations of this embodiment, the calculating module 303 further includes a first calculation sub-module, a second identification sub-module, a second allocation sub-module, a second calculation sub-module, a third identification sub-module, a third allocation sub-module, a third calculation sub-module and a deletion sub-module. The first calculation sub-module is configured to calculate the similarity between each non-inquiry-related corpus entry in the non-inquiry-related corpus and the inquiry-related corpus. The second identification sub-module is configured to identify whether the similarity is greater than a preset first similarity threshold and, when the similarity is greater than the preset first similarity threshold, take the corresponding non-inquiry-related corpus entry as a first corpus entry to be confirmed and notify designated personnel to classify the first corpus entry to be confirmed. The second allocation sub-module is configured to, when it is recognized that the designated personnel have completed the classification of the first corpus entry to be confirmed, allocate the first corpus entry to be confirmed to the non-inquiry-related corpus or the inquiry-related corpus according to the classification by the designated personnel, so as to obtain a first inquiry-related corpus and a first non-inquiry-related corpus. The second calculation sub-module is configured to calculate a first similarity between each first non-inquiry-related corpus entry in the first non-inquiry-related corpus and the first inquiry-related corpus. The third identification sub-module is configured to identify whether the first similarity is greater than the preset first similarity threshold and, when the first similarity is greater than the preset first similarity threshold, take the corresponding first non-inquiry-related corpus entry as a second corpus entry to be confirmed and notify the designated personnel to classify the second corpus entry to be confirmed. The third allocation sub-module is configured to, when it is recognized that the designated personnel have completed the classification of the second corpus entry to be confirmed, allocate the second corpus entry to be confirmed to the first non-inquiry-related corpus or the first inquiry-related corpus according to the classification by the designated personnel, so as to obtain a second inquiry-related corpus and a second non-inquiry-related corpus. The third calculation sub-module is configured to calculate a second similarity between each second non-inquiry-related corpus entry in the second non-inquiry-related corpus and the second inquiry-related corpus. The deletion sub-module is configured to identify whether the second similarity is greater than a preset second similarity threshold and, when the second similarity is greater than the preset second similarity threshold, delete the corresponding second non-inquiry-related corpus entry from the second non-inquiry-related corpus to obtain the target inquiry-related corpus and the target non-inquiry-related corpus.
关联模块305包括确定子模块、均衡子模块和关联子模块。其中,确定子模块用于确定每种所述意图标签所对应的目标非问询相关语料;均衡子模块用于基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理,获得均衡语料;关联子模块用于基于预设的相同概率,将每种所述意图标签所对应的均衡语料与预设的问询类别进行关联,获得所述第二训练样本。The association module 305 includes a determination sub-module, an equalization sub-module and an association sub-module. Wherein, the determination sub-module is used to determine the target non-inquiry-related corpus corresponding to each of the intent tags; the equalization sub-module is used to separately analyze the target non-inquiry corresponding to each of the intent tags based on a preset quantity threshold The relevant corpus is subjected to sample equalization processing to obtain a balanced corpus; the association sub-module is used to associate the balanced corpus corresponding to each of the intention labels with the preset query category based on the preset same probability, and obtain the second Training samples.
所述数量阈值包括第一数量阈值和第二数量阈值,其中,所述第一数量阈值大于所述第二数量阈值,均衡子模块包括识别单元、筛选单元和扩充单元。其中,识别单元用于识别当前意图标签所对应的目标非问询相关语料的数量是否大于所述第一数量阈值或是否小于所述第二数量阈值;筛选单元用于在当前意图标签所对应的目标非问询相关语料的数量大于所述第一数量阈值时,对所述当前意图标签所对应的目标非问询相关语料进行随机筛选,直至当前意图标签所对应的目标非问询相关语料的数量小于等于所述第一数量阈值;扩充单元用于在当前意图标签所对应的目标非问询相关语料的数量小于所述第二数量阈 值时,对所述当前意图标签所对应的目标非问询相关语料进行语料扩充,直至当前意图标签所对应的目标非问询相关语料数量大于等于所述第二数量阈值。The quantity threshold includes a first quantity threshold and a second quantity threshold, wherein the first quantity threshold is greater than the second quantity threshold, and the equalization sub-module includes an identification unit, a screening unit and an expansion unit. Wherein, the identifying unit is used to identify whether the quantity of the target non-inquiry related corpus corresponding to the current intent label is greater than the first quantity threshold or less than the second quantity threshold; When the quantity of the target non-inquiry-related corpus is greater than the first quantity threshold, the target non-inquiry-related corpus corresponding to the current intent label is randomly screened until the target non-inquiry-related corpus corresponding to the current intent label is The quantity is less than or equal to the first quantity threshold; the expansion unit is configured to, when the quantity of the target non-inquiry-related corpus corresponding to the current intention label is less than the second quantity threshold, to The query-related corpus is expanded until the quantity of the target non-query-related corpus corresponding to the current intent tag is greater than or equal to the second quantity threshold.
在本实施例的一些可选的实现方式中，上述扩充单元进一步用于：调用预设的随机过采样包，通过所述随机过采样包对所述当前意图标签所对应的目标非问询相关语料进行随机复制。In some optional implementations of this embodiment, the expansion unit is further configured to: call a preset random oversampling package, and use the random oversampling package to randomly copy the target non-inquiry-related corpus corresponding to the current intent tag.
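The application does not name a specific oversampling package. Assuming imbalanced-learn's RandomOverSampler as one possible such package, random duplication of the samples of under-represented intent labels could look like the sketch below. Note that the sampler's default strategy balances every label up to the majority count, whereas the embodiment above only expands up to the second quantity threshold, so a custom sampling strategy would be needed to match it exactly.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler  # assumed "random oversampling package"

def oversample_by_intent(texts, intent_labels, seed=0):
    """Randomly duplicate samples of minority intent labels by treating each
    text as a single feature column."""
    sampler = RandomOverSampler(random_state=seed)
    X = np.array(texts, dtype=object).reshape(-1, 1)
    X_resampled, y_resampled = sampler.fit_resample(X, intent_labels)
    return list(X_resampled.ravel()), list(y_resampled)
```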
本申请通过计算非问询相关语料库中每条非问询相关语料与问询相关语料库之间的相似度，并基于相似度对问询相关语料库和非问询相关语料库进行调整，实现确定出的目标问询相关语料和目标非问询相关语料的准确性较高。通过基于意图标签，将目标非问询相关语料与预设的问询类别进行关联的方式，解决了对不依赖AI问询语料的训练语料不进行问询类别的填充的问题，同时没有造成训练语料的爆炸，保证了模型训练的效率。通过此方式生成的训练语料能够使意图识别模型的准确率保持在高水平。This application calculates the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus and the inquiry-related corpus, and adjusts the inquiry-related corpus and the non-inquiry-related corpus based on the similarity, so that the determined target query-related corpus and target non-query-related corpus are highly accurate. By associating the target non-query-related corpus with preset query categories based on the intent tags, the problem of not filling query categories for training corpus that does not rely on the AI query corpus is solved without causing an explosion of the training corpus, which preserves the efficiency of model training. The training corpus generated in this way keeps the accuracy of the intent recognition model at a high level.
为解决上述技术问题，本申请实施例还提供计算机设备。具体请参阅图4，图4为本实施例计算机设备基本结构框图。To solve the above technical problems, the embodiments of the present application also provide a computer device. For details, please refer to FIG. 4, which is a block diagram of the basic structure of the computer device according to this embodiment.
所述计算机设备200包括通过系统总线相互通信连接存储器201、处理器202、网络接口203。需要指出的是，图中仅示出了具有组件201-203的计算机设备200，但是应理解的是，并不要求实施所有示出的组件，可以替代的实施更多或者更少的组件。其中，本技术领域技术人员可以理解，这里的计算机设备是一种能够按照事先设定或存储的指令，自动进行数值计算和/或信息处理的设备，其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit，ASIC)、可编程门阵列(Field-Programmable Gate Array，FPGA)、数字处理器(Digital Signal Processor，DSP)、嵌入式设备等。The computer device 200 includes a memory 201, a processor 202, and a network interface 203 that communicate with each other through a system bus. It should be noted that only the computer device 200 with components 201-203 is shown in the figure, but it should be understood that implementation of all of the shown components is not required, and more or fewer components may be implemented instead. Those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
所述计算机设备可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。所述计算机设备可以与用户通过键盘、鼠标、遥控器、触摸板或声控设备等方式进行人机交互。The computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment. The computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
所述存储器201至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等。所述计算机可读存储介质可以是非易失性,也可以是易失性。在一些实施例中,所述存储器201可以是所述计算机设备200的内部存储单元,例如该计算机设备200的硬盘或内存。在另一些实施例中,所述存储器201也可以是所述计算机设备200的外部存储设备,例如该计算机设备200上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,所述存储器201还可以既包括所述计算机设备200的内部存储单元也包括其外部存储设备。本实施例中,所述存储器201通常用于存储安装于所述计算机设备200的操作系统和各类应用软件,例如意图识别模型的训练语料生成方法的计算机可读指令等。此外,所述存储器201还可以用于暂时地存储已经输出或者将要输出的各类数据。The memory 201 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), Magnetic Memory, Magnetic Disk, Optical Disk, etc. The computer-readable storage medium may be non-volatile or volatile. In some embodiments, the memory 201 may be an internal storage unit of the computer device 200 , such as a hard disk or a memory of the computer device 200 . In other embodiments, the memory 201 may also be an external storage device of the computer device 200, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc. Of course, the memory 201 may also include both an internal storage unit of the computer device 200 and an external storage device thereof. In this embodiment, the memory 201 is generally used to store the operating system and various application software installed on the computer device 200 , such as computer-readable instructions for a method for generating training corpus of an intent recognition model. In addition, the memory 201 can also be used to temporarily store various types of data that have been output or will be output.
所述处理器202在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器202通常用于控制所述计算机设备200的总体操作。本实施例中,所述处理器202用于运行所述存储器201中存储的计算机可读指令或者处理数据,例如运行所述意图识别模型的训练语料生成方法的计算机可读指令。In some embodiments, the processor 202 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips. The processor 202 is typically used to control the overall operation of the computer device 200 . In this embodiment, the processor 202 is configured to execute computer-readable instructions stored in the memory 201 or process data, for example, computer-readable instructions for executing a method for generating training corpus of the intent recognition model.
所述网络接口203可包括无线网络接口或有线网络接口,该网络接口203通常用于在所述计算机设备200与其他电子设备之间建立通信连接。The network interface 203 may include a wireless network interface or a wired network interface, and the network interface 203 is generally used to establish a communication connection between the computer device 200 and other electronic devices.
在本实施例中，解决了对不依赖AI问询语料的训练语料不进行问询类别的填充的问题，同时获得较佳的训练语料，没有造成训练语料的爆炸，通过训练语料有效提升了意图识别模型对客户意图识别的准确性。In this embodiment, the problem of not filling the query category for the training corpus that does not rely on the AI query corpus is solved; at the same time, better training corpus is obtained without causing an explosion of the training corpus, and the training corpus effectively improves the accuracy with which the intent recognition model identifies customer intent.
本申请还提供了另一种实施方式，即提供一种计算机可读存储介质，所述计算机可读存储介质存储有计算机可读指令，所述计算机可读指令可被至少一个处理器执行，以使所述至少一个处理器执行如上述的意图识别模型的训练语料生成方法。The present application also provides another embodiment, that is, a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor, so that the at least one processor performs the above-described method for generating training corpus of an intent recognition model.
在本实施例中，解决了对不依赖AI问询语料的训练语料不进行问询类别的填充的问题，同时获得较佳的训练语料，没有造成训练语料的爆炸，通过训练语料有效提升了意图识别模型对客户意图识别的准确性。In this embodiment, the problem of not filling the query category for the training corpus that does not rely on the AI query corpus is solved; at the same time, better training corpus is obtained without causing an explosion of the training corpus, and the training corpus effectively improves the accuracy with which the intent recognition model identifies customer intent.
通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机，计算机，服务器，空调器，或者网络设备等)执行本申请各个实施例所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the various embodiments of this application.
显然，以上所描述的实施例仅仅是本申请一部分实施例，而不是全部的实施例，附图中给出了本申请的较佳实施例，但并不限制本申请的专利范围。本申请可以以许多不同的形式来实现，相反地，提供这些实施例的目的是使对本申请的公开内容的理解更加透彻全面。尽管参照前述实施例对本申请进行了详细的说明，对于本领域的技术人员而言，其依然可以对前述各具体实施方式所记载的技术方案进行修改，或者对其中部分技术特征进行等效替换。凡是利用本申请说明书及附图内容所做的等效结构，直接或间接运用在其他相关的技术领域，均同理在本申请专利保护范围之内。Obviously, the above-described embodiments are only some of the embodiments of the present application, rather than all of them. The accompanying drawings show preferred embodiments of the present application, but do not limit the scope of the patent of the present application. This application may be embodied in many different forms; rather, these embodiments are provided so that the disclosure of this application is understood thoroughly and completely. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments, or make equivalent replacements for some of the technical features. Any equivalent structure made by using the contents of the description and drawings of the present application, and directly or indirectly used in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims (20)

  1. 一种意图识别模型的训练语料生成方法,包括下述步骤:A training corpus generation method for an intent recognition model, comprising the following steps:
    接收预标注问询类别的AI问询语料和预标注意图标签的客户回答语料，并基于预设的正则表达式对所述客户回答语料进行筛选操作，得到问询相关语料以及非问询相关语料，其中，所述客户回答语料与所述AI问询语料具有一一对应的映射关系；Receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI query corpus have a one-to-one mapping relationship;
    分别基于所述问询相关语料和所述非问询相关语料建立问询相关语料库和非问询相关语料库;establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus and the non-inquiry-related corpus, respectively;
    计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度，并基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库；Calculate the similarity between each of the non-inquiry-related corpora in the non-inquiry-related corpus and the inquiry-related corpus, and adjust the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus;
    获取所述目标问询相关语料库中的目标问询相关语料，并基于所述AI问询语料确定所述目标问询相关语料对应的问询类别，基于所述目标问询相关语料对应的问询类别和所述目标问询相关语料生成第一训练样本；Obtain the target query related corpus in the target query related corpus, determine the query category corresponding to the target query related corpus based on the AI query corpus, and generate a first training sample based on the query category corresponding to the target query related corpus and the target query related corpus;
    获取所述目标非问询相关语料库中的目标非问询相关语料,基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本;acquiring the target non-inquiry-related corpus in the target non-inquiry-related corpus, and based on the intent label, associating the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample;
    将所述第一训练样本和所述第二训练样本作为训练语料并输出,其中,所述训练语料用于意图识别模型的训练。The first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
  2. 根据权利要求1所述的意图识别模型的训练语料生成方法,其中,所述计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度包括:The method for generating training corpus for an intent recognition model according to claim 1, wherein the calculating the similarity between each of the non-question-related corpora in the non-question-related corpus and the query-related corpus include:
    将所述问询相关语料输入至预先训练的语言表征模型中,获得问询相关词向量;Inputting the query-related corpus into a pre-trained language representation model to obtain query-related word vectors;
    将所述非问询相关语料输入至预先训练的语言表征模型中,获得非问询相关词向量;Inputting the non-inquiry-related corpus into a pre-trained language representation model to obtain non-inquiry-related word vectors;
    历遍计算当前所述非问询相关词向量与每个所述问询相关词向量之间的余弦相似度;traversing the cosine similarity between the current non-query-related word vector and each of the query-related word vectors;
    将数值最大的余弦相似度作为所述非问询相关语料与所述问询相关语料库之间的相似度。The cosine similarity with the largest numerical value is taken as the similarity between the non-question-related corpus and the query-related corpus.
  3. 根据权利要求1所述的意图识别模型的训练语料生成方法,其中,所述基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本包括:The method for generating training corpus of an intent recognition model according to claim 1, wherein the target non-question related corpus is associated with a preset query category based on the intent label to obtain a second training sample include:
    确定每种所述意图标签所对应的目标非问询相关语料;Determine the target non-question-related corpus corresponding to each of the intent tags;
    基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理,获得均衡语料;Based on a preset quantity threshold, sample equalization processing is performed on the target non-inquiry-related corpus corresponding to each of the intent tags, to obtain balanced corpus;
    基于预设的相同概率,将每种所述意图标签所对应的均衡语料与预设的问询类别进行关联,获得所述第二训练样本。Based on a preset same probability, the balanced corpus corresponding to each of the intent labels is associated with a preset query category to obtain the second training sample.
  4. 根据权利要求3所述的意图识别模型的训练语料生成方法，其中，所述数量阈值包括第一数量阈值和第二数量阈值，其中，所述第一数量阈值大于所述第二数量阈值，所述基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理，获得均衡语料包括：The method for generating training corpus of an intent recognition model according to claim 3, wherein the quantity threshold includes a first quantity threshold and a second quantity threshold, wherein the first quantity threshold is greater than the second quantity threshold, and the performing sample equalization processing on the target non-inquiry-related corpus corresponding to each of the intent tags based on the preset quantity threshold to obtain the balanced corpus includes:
    识别当前意图标签所对应的目标非问询相关语料的数量是否大于所述第一数量阈值或是否小于所述第二数量阈值;Identifying whether the quantity of the target non-inquiry related corpus corresponding to the current intent label is greater than the first quantity threshold or less than the second quantity threshold;
    在当前意图标签所对应的目标非问询相关语料的数量大于所述第一数量阈值时，对所述当前意图标签所对应的目标非问询相关语料进行随机筛选，直至当前意图标签所对应的目标非问询相关语料的数量小于等于所述第一数量阈值；When the quantity of the target non-inquiry-related corpus corresponding to the current intent label is greater than the first quantity threshold, randomly filter the target non-inquiry-related corpus corresponding to the current intent label until the quantity of the target non-inquiry-related corpus corresponding to the current intent label is less than or equal to the first quantity threshold;
    在当前意图标签所对应的目标非问询相关语料的数量小于所述第二数量阈值时，对所述当前意图标签所对应的目标非问询相关语料进行语料扩充，直至当前意图标签所对应的目标非问询相关语料数量大于等于所述第二数量阈值。When the quantity of the target non-inquiry-related corpus corresponding to the current intent tag is less than the second quantity threshold, corpus expansion is performed on the target non-inquiry-related corpus corresponding to the current intent tag until the quantity of the target non-inquiry-related corpus corresponding to the current intent tag is greater than or equal to the second quantity threshold.
  5. 根据权利要求4所述的意图识别模型的训练语料生成方法,其中,所述对所述当前意图标签所对应的目标非问询相关语料进行语料扩充包括:The method for generating training corpus of an intent recognition model according to claim 4, wherein the performing corpus expansion on the target non-inquiry related corpus corresponding to the current intent label comprises:
    调用预设的随机过采样包，通过所述随机过采样包对所述当前意图标签所对应的目标非问询相关语料进行随机复制。The preset random oversampling package is called, and the target non-query related corpus corresponding to the current intent tag is randomly copied through the random oversampling package.
  6. 根据权利要求1所述的意图识别模型的训练语料生成方法，其中，所述基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库包括：The method for generating training corpus for an intent recognition model according to claim 1, wherein the adjusting the query-related corpus and the non-query-related corpus based on the similarity to obtain a target query-related corpus and a target non-query-related corpus includes:
    识别所述相似度是否大于预设的第一相似度阈值，当所述相似度大于预设的第一相似度阈值时，将对应的所述非问询相关语料作为待确认语料，并通知指定人员对所述待确认语料进行分类；Identify whether the similarity is greater than the preset first similarity threshold, and when the similarity is greater than the preset first similarity threshold, take the corresponding non-inquiry-related corpus as the corpus to be confirmed, and notify the designated person to classify the corpus to be confirmed;
    当识别到所述指定人员完成对所述待确认语料的分类时，根据所述指定人员的分类将所述待确认语料分配至所述非问询相关语料库中或所述问询相关语料库中，获得目标问询相关语料库和目标非问询相关语料库。When it is recognized that the designated person has completed the classification of the to-be-confirmed corpus, allocate the to-be-confirmed corpus to the non-inquiry-related corpus or the inquiry-related corpus according to the classification of the designated person, to obtain the target query-related corpus and the target non-query-related corpus.
  7. 根据权利要求1所述的意图识别模型的训练语料生成方法，其中，所述基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库包括：The method for generating training corpus for an intent recognition model according to claim 1, wherein the adjusting the query-related corpus and the non-query-related corpus based on the similarity to obtain a target query-related corpus and a target non-query-related corpus includes:
    识别所述相似度是否大于预设的第二相似度阈值，当所述相似度大于预设的第二相似度阈值时，从所述非问询相关语料库中删除对应的非问询相关语料，获得目标问询相关语料库和目标非问询相关语料库。Identifying whether the similarity is greater than a preset second similarity threshold, and when the similarity is greater than the preset second similarity threshold, deleting the corresponding non-inquiry-related corpus from the non-inquiry-related corpus to obtain the target query-related corpus and the target non-query-related corpus.
  8. 一种意图识别模型的训练语料生成装置,包括:A training corpus generation device for an intent recognition model, comprising:
    匹配模块,用于接收预标注问询类别的AI问询语料和预标注意图标签的客户回答语料,并基于预设的正则表达式对所述客户回答语料进行筛选操作,得到问询相关语料以及非问询相关语料,其中,所述客户回答语料与所述AI问询语料具有一一对应的映射关系;The matching module is used to receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI inquiry corpus have a one-to-one mapping relationship;
    建立模块,用于分别基于所述问询相关语料和所述非问询相关语料建立问询相关语料库和非问询相关语料库;an establishment module for establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus and the non-inquiry-related corpus respectively;
    计算模块，用于计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度，并基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库；A calculation module, configured to calculate the similarity between each non-inquiry-related corpus in the non-inquiry-related corpus and the inquiry-related corpus, and adjust the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus;
    生成模块，用于获取所述目标问询相关语料库中的目标问询相关语料，并基于所述AI问询语料确定所述目标问询相关语料对应的问询类别，基于所述目标问询相关语料对应的问询类别和所述目标问询相关语料生成第一训练样本；The generating module is configured to obtain the target query related corpus in the target query related corpus, determine the query category corresponding to the target query related corpus based on the AI query corpus, and generate a first training sample based on the query category corresponding to the target query related corpus and the target query related corpus;
    关联模块，用于获取所述目标非问询相关语料库中的目标非问询相关语料，基于所述意图标签，将所述目标非问询相关语料与预设的问询类别进行关联，获得第二训练样本；以及The association module is used to obtain the target non-inquiry-related corpus in the target non-inquiry-related corpus, and based on the intent tag, associate the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample; and
    输出模块,用于将所述第一训练样本和所述第二训练样本作为训练语料并输出,其中,所述训练语料用于意图识别模型的训练。The output module is used for outputting the first training sample and the second training sample as training corpus, wherein the training corpus is used for training the intention recognition model.
  9. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现如下所述的意图识别模型的训练语料生成方法:A computer device, comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the processor executes the computer-readable instructions, the following method for generating a training corpus of an intent recognition model is implemented:
    接收预标注问询类别的AI问询语料和预标注意图标签的客户回答语料，并基于预设的正则表达式对所述客户回答语料进行筛选操作，得到问询相关语料以及非问询相关语料，其中，所述客户回答语料与所述AI问询语料具有一一对应的映射关系；Receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI query corpus have a one-to-one mapping relationship;
    分别基于所述问询相关语料和所述非问询相关语料建立问询相关语料库和非问询相关语料库;establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus and the non-inquiry-related corpus, respectively;
    计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度，并基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库；Calculate the similarity between each of the non-inquiry-related corpora in the non-inquiry-related corpus and the inquiry-related corpus, and adjust the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus;
    获取所述目标问询相关语料库中的目标问询相关语料，并基于所述AI问询语料确定所述目标问询相关语料对应的问询类别，基于所述目标问询相关语料对应的问询类别和所述目标问询相关语料生成第一训练样本；Obtain the target query related corpus in the target query related corpus, determine the query category corresponding to the target query related corpus based on the AI query corpus, and generate a first training sample based on the query category corresponding to the target query related corpus and the target query related corpus;
    获取所述目标非问询相关语料库中的目标非问询相关语料,基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本;acquiring the target non-inquiry-related corpus in the target non-inquiry-related corpus, and based on the intent label, associating the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample;
    将所述第一训练样本和所述第二训练样本作为训练语料并输出,其中,所述训练语料用于意图识别模型的训练。The first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
  10. 根据权利要求9所述的计算机设备,其中,所述计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度包括:The computer device according to claim 9, wherein the calculating the similarity between each of the non-question-related corpus and the query-related corpus in the non-question-related corpus comprises:
    将所述问询相关语料输入至预先训练的语言表征模型中,获得问询相关词向量;Inputting the query-related corpus into a pre-trained language representation model to obtain query-related word vectors;
    将所述非问询相关语料输入至预先训练的语言表征模型中,获得非问询相关词向量;Inputting the non-inquiry-related corpus into a pre-trained language representation model to obtain non-inquiry-related word vectors;
    历遍计算当前所述非问询相关词向量与每个所述问询相关词向量之间的余弦相似度;traversing the cosine similarity between the current non-query-related word vector and each of the query-related word vectors;
    将数值最大的余弦相似度作为所述非问询相关语料与所述问询相关语料库之间的相似度。The cosine similarity with the largest numerical value is taken as the similarity between the non-question-related corpus and the query-related corpus.
  11. 根据权利要求9所述的计算机设备,其中,所述基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本包括:The computer device according to claim 9, wherein, based on the intent tag, associating the target non-question-related corpus with a preset query category, and obtaining the second training sample comprises:
    确定每种所述意图标签所对应的目标非问询相关语料;Determine the target non-question-related corpus corresponding to each of the intent tags;
    基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理,获得均衡语料;Based on a preset quantity threshold, sample equalization processing is performed on the target non-inquiry-related corpus corresponding to each of the intent tags, to obtain balanced corpus;
    基于预设的相同概率,将每种所述意图标签所对应的均衡语料与预设的问询类别进行关联,获得所述第二训练样本。Based on a preset same probability, the balanced corpus corresponding to each of the intent labels is associated with a preset query category to obtain the second training sample.
  12. 根据权利要求11所述的计算机设备，其中，所述数量阈值包括第一数量阈值和第二数量阈值，其中，所述第一数量阈值大于所述第二数量阈值，所述基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理，获得均衡语料包括：The computer device according to claim 11, wherein the quantity threshold includes a first quantity threshold and a second quantity threshold, wherein the first quantity threshold is greater than the second quantity threshold, and the performing sample equalization processing on the target non-inquiry-related corpus corresponding to each of the intent labels based on the preset quantity threshold to obtain the balanced corpus includes:
    识别当前意图标签所对应的目标非问询相关语料的数量是否大于所述第一数量阈值或是否小于所述第二数量阈值;Identifying whether the quantity of the target non-inquiry related corpus corresponding to the current intent label is greater than the first quantity threshold or less than the second quantity threshold;
    在当前意图标签所对应的目标非问询相关语料的数量大于所述第一数量阈值时，对所述当前意图标签所对应的目标非问询相关语料进行随机筛选，直至当前意图标签所对应的目标非问询相关语料的数量小于等于所述第一数量阈值；When the quantity of the target non-inquiry-related corpus corresponding to the current intent label is greater than the first quantity threshold, randomly filter the target non-inquiry-related corpus corresponding to the current intent label until the quantity of the target non-inquiry-related corpus corresponding to the current intent label is less than or equal to the first quantity threshold;
    在当前意图标签所对应的目标非问询相关语料的数量小于所述第二数量阈值时，对所述当前意图标签所对应的目标非问询相关语料进行语料扩充，直至当前意图标签所对应的目标非问询相关语料数量大于等于所述第二数量阈值。When the quantity of the target non-inquiry-related corpus corresponding to the current intent tag is less than the second quantity threshold, corpus expansion is performed on the target non-inquiry-related corpus corresponding to the current intent tag until the quantity of the target non-inquiry-related corpus corresponding to the current intent tag is greater than or equal to the second quantity threshold.
  13. 根据权利要求12所述的计算机设备,其中,所述对所述当前意图标签所对应的目标非问询相关语料进行语料扩充包括:The computer device according to claim 12, wherein the performing corpus expansion on the target non-query related corpus corresponding to the current intent tag comprises:
    调用预设的随机过采样包,通过所述随机过采样包对所述当前意图标签所对应的目标非问询相关语料进行随机复制。A preset random oversampling package is called, and the target non-query related corpus corresponding to the current intent tag is randomly copied through the random oversampling package.
  14. 根据权利要求9所述的计算机设备,其中,所述基于所述相似度调整所述问询相关语料库和所述非问询相关语料库,获得目标问询相关语料库和目标非问询相关语料库包括:The computer device according to claim 9, wherein the adjusting the query-related corpus and the non-query-related corpus based on the similarity, and obtaining the target query-related corpus and the target non-query-related corpus comprises:
    识别所述相似度是否大于预设的第一相似度阈值，当所述相似度大于预设的第一相似度阈值时，将对应的所述非问询相关语料作为待确认语料，并通知指定人员对所述待确认语料进行分类；Identify whether the similarity is greater than the preset first similarity threshold, and when the similarity is greater than the preset first similarity threshold, take the corresponding non-inquiry-related corpus as the corpus to be confirmed, and notify the designated person to classify the corpus to be confirmed;
    当识别到所述指定人员完成对所述待确认语料的分类时，根据所述指定人员的分类将所述待确认语料分配至所述非问询相关语料库中或所述问询相关语料库中，获得目标问询相关语料库和目标非问询相关语料库。When it is recognized that the designated person has completed the classification of the to-be-confirmed corpus, allocate the to-be-confirmed corpus to the non-inquiry-related corpus or the inquiry-related corpus according to the classification of the designated person, to obtain the target query-related corpus and the target non-query-related corpus.
  15. 根据权利要求9所述的计算机设备，其中，所述基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库包括：The computer device according to claim 9, wherein the adjusting the inquiry-related corpus and the non-inquiry-related corpus based on the similarity, and obtaining the target inquiry-related corpus and the target non-inquiry-related corpus comprises:
    识别所述相似度是否大于预设的第二相似度阈值，当所述相似度大于预设的第二相似度阈值时，从所述非问询相关语料库中删除对应的非问询相关语料，获得目标问询相关语料库和目标非问询相关语料库。Identifying whether the similarity is greater than a preset second similarity threshold, and when the similarity is greater than the preset second similarity threshold, deleting the corresponding non-inquiry-related corpus from the non-inquiry-related corpus to obtain the target query-related corpus and the target non-query-related corpus.
  16. 一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现如下所述的意图识别模型的训练语料生成方法:A computer-readable storage medium, where computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the following method for generating a training corpus of an intent recognition model is implemented:
    接收预标注问询类别的AI问询语料和预标注意图标签的客户回答语料，并基于预设的正则表达式对所述客户回答语料进行筛选操作，得到问询相关语料以及非问询相关语料，其中，所述客户回答语料与所述AI问询语料具有一一对应的映射关系；Receive the AI query corpus pre-labeled with the query category and the customer response corpus pre-labeled with the intent label, and perform a screening operation on the customer response corpus based on a preset regular expression to obtain query-related corpus and non-inquiry-related corpus, wherein the customer answer corpus and the AI query corpus have a one-to-one mapping relationship;
    分别基于所述问询相关语料和所述非问询相关语料建立问询相关语料库和非问询相关语料库;establishing an inquiry-related corpus and a non-inquiry-related corpus based on the inquiry-related corpus and the non-inquiry-related corpus, respectively;
    计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度，并基于所述相似度调整所述问询相关语料库和所述非问询相关语料库，获得目标问询相关语料库和目标非问询相关语料库；Calculate the similarity between each of the non-inquiry-related corpora in the non-inquiry-related corpus and the inquiry-related corpus, and adjust the inquiry-related corpus and the non-inquiry-related corpus based on the similarity to obtain a target inquiry-related corpus and a target non-inquiry-related corpus;
    获取所述目标问询相关语料库中的目标问询相关语料，并基于所述AI问询语料确定所述目标问询相关语料对应的问询类别，基于所述目标问询相关语料对应的问询类别和所述目标问询相关语料生成第一训练样本；Obtain the target query related corpus in the target query related corpus, determine the query category corresponding to the target query related corpus based on the AI query corpus, and generate a first training sample based on the query category corresponding to the target query related corpus and the target query related corpus;
    获取所述目标非问询相关语料库中的目标非问询相关语料,基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本;acquiring the target non-inquiry-related corpus in the target non-inquiry-related corpus, and based on the intent label, associating the target non-inquiry-related corpus with a preset inquiry category to obtain a second training sample;
    将所述第一训练样本和所述第二训练样本作为训练语料并输出,其中,所述训练语料用于意图识别模型的训练。The first training sample and the second training sample are used as training corpus and output, wherein the training corpus is used for training an intention recognition model.
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述计算所述非问询相关语料库中每条所述非问询相关语料与所述问询相关语料库之间的相似度包括:The computer-readable storage medium according to claim 16, wherein the calculating the similarity between each of the non-question-related corpus and the query-related corpus in the non-question-related corpus comprises:
    将所述问询相关语料输入至预先训练的语言表征模型中,获得问询相关词向量;Inputting the query-related corpus into a pre-trained language representation model to obtain query-related word vectors;
    将所述非问询相关语料输入至预先训练的语言表征模型中,获得非问询相关词向量;Inputting the non-inquiry-related corpus into a pre-trained language representation model to obtain non-inquiry-related word vectors;
    历遍计算当前所述非问询相关词向量与每个所述问询相关词向量之间的余弦相似度;traversing the cosine similarity between the current non-query-related word vector and each of the query-related word vectors;
    将数值最大的余弦相似度作为所述非问询相关语料与所述问询相关语料库之间的相似度。The cosine similarity with the largest numerical value is taken as the similarity between the non-question-related corpus and the query-related corpus.
  18. 根据权利要求16所述的计算机可读存储介质,其中,所述基于所述意图标签,将所述目标非问询相关语料与预设的问询类别进行关联,获得第二训练样本包括:The computer-readable storage medium according to claim 16, wherein the associating the target non-question-related corpus with a preset query category based on the intent tag, and obtaining the second training sample comprises:
    确定每种所述意图标签所对应的目标非问询相关语料;Determine the target non-question-related corpus corresponding to each of the intent tags;
    基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理,获得均衡语料;Based on a preset quantity threshold, sample equalization processing is performed on the target non-inquiry-related corpus corresponding to each of the intent tags, to obtain balanced corpus;
    基于预设的相同概率,将每种所述意图标签所对应的均衡语料与预设的问询类别进行关联,获得所述第二训练样本。Based on a preset same probability, the balanced corpus corresponding to each of the intent labels is associated with a preset query category to obtain the second training sample.
  19. 根据权利要求18所述的计算机可读存储介质，其中，所述数量阈值包括第一数量阈值和第二数量阈值，其中，所述第一数量阈值大于所述第二数量阈值，所述基于预设的数量阈值分别对每种所述意图标签所对应的目标非问询相关语料进行样本均衡处理，获得均衡语料包括：The computer-readable storage medium according to claim 18, wherein the quantity threshold comprises a first quantity threshold and a second quantity threshold, wherein the first quantity threshold is greater than the second quantity threshold, and the performing sample equalization processing on the target non-inquiry-related corpus corresponding to each of the intent labels based on the preset quantity threshold to obtain the balanced corpus includes:
    识别当前意图标签所对应的目标非问询相关语料的数量是否大于所述第一数量阈值或是否小于所述第二数量阈值;Identifying whether the quantity of the target non-inquiry related corpus corresponding to the current intent label is greater than the first quantity threshold or less than the second quantity threshold;
    在当前意图标签所对应的目标非问询相关语料的数量大于所述第一数量阈值时，对所述当前意图标签所对应的目标非问询相关语料进行随机筛选，直至当前意图标签所对应的目标非问询相关语料的数量小于等于所述第一数量阈值；When the quantity of the target non-inquiry-related corpus corresponding to the current intent label is greater than the first quantity threshold, randomly filter the target non-inquiry-related corpus corresponding to the current intent label until the quantity of the target non-inquiry-related corpus corresponding to the current intent label is less than or equal to the first quantity threshold;
    在当前意图标签所对应的目标非问询相关语料的数量小于所述第二数量阈值时，对所述当前意图标签所对应的目标非问询相关语料进行语料扩充，直至当前意图标签所对应的目标非问询相关语料数量大于等于所述第二数量阈值。When the quantity of the target non-inquiry-related corpus corresponding to the current intent tag is less than the second quantity threshold, corpus expansion is performed on the target non-inquiry-related corpus corresponding to the current intent tag until the quantity of the target non-inquiry-related corpus corresponding to the current intent tag is greater than or equal to the second quantity threshold.
  20. 根据权利要求19所述的计算机可读存储介质,其中,所述对所述当前意图标签所对应的目标非问询相关语料进行语料扩充包括:The computer-readable storage medium according to claim 19, wherein the performing corpus expansion on the target non-query related corpus corresponding to the current intent tag comprises:
    调用预设的随机过采样包,通过所述随机过采样包对所述当前意图标签所对应的目标非问询相关语料进行随机复制。A preset random oversampling package is called, and the target non-query related corpus corresponding to the current intent tag is randomly copied through the random oversampling package.
PCT/CN2021/090462 2020-11-17 2021-04-28 Training corpus generation method for intention recognition model, and related device thereof WO2022105119A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011288871.XA CN112395390B (en) 2020-11-17 2020-11-17 Training corpus generation method of intention recognition model and related equipment thereof
CN202011288871.X 2020-11-17

Publications (1)

Publication Number Publication Date
WO2022105119A1 true WO2022105119A1 (en) 2022-05-27

Family

ID=74606272

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090462 WO2022105119A1 (en) 2020-11-17 2021-04-28 Training corpus generation method for intention recognition model, and related device thereof

Country Status (2)

Country Link
CN (1) CN112395390B (en)
WO (1) WO2022105119A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395390B (en) * 2020-11-17 2023-07-25 平安科技(深圳)有限公司 Training corpus generation method of intention recognition model and related equipment thereof
CN114281968B (en) * 2021-12-20 2023-02-28 北京百度网讯科技有限公司 Model training and corpus generation method, device, equipment and storage medium
CN115408509B (en) * 2022-11-01 2023-02-14 杭州一知智能科技有限公司 Intention identification method, system, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619050B (en) * 2018-06-20 2023-05-09 华为技术有限公司 Intention recognition method and device
CN109508376A (en) * 2018-11-23 2019-03-22 四川长虹电器股份有限公司 It can online the error correction intension recognizing method and device that update
CN110032724B (en) * 2018-12-19 2022-11-25 阿里巴巴集团控股有限公司 Method and device for recognizing user intention
CN110135551B (en) * 2019-05-15 2020-07-21 西南交通大学 Robot chatting method based on word vector and recurrent neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170161363A1 (en) * 2015-12-04 2017-06-08 International Business Machines Corporation Automatic Corpus Expansion using Question Answering Techniques
CN108153780A (en) * 2016-12-05 2018-06-12 阿里巴巴集团控股有限公司 A kind of human-computer dialogue device and its interactive method of realization
WO2018157700A1 (en) * 2017-03-02 2018-09-07 腾讯科技(深圳)有限公司 Method and device for generating dialogue, and storage medium
CN111428010A (en) * 2019-01-10 2020-07-17 北京京东尚科信息技术有限公司 Man-machine intelligent question and answer method and device
CN111368043A (en) * 2020-02-19 2020-07-03 中国平安人寿保险股份有限公司 Event question-answering method, device, equipment and storage medium based on artificial intelligence
CN112395390A (en) * 2020-11-17 2021-02-23 平安科技(深圳)有限公司 Training corpus generation method of intention recognition model and related equipment thereof

Also Published As

Publication number Publication date
CN112395390B (en) 2023-07-25
CN112395390A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
WO2022105119A1 (en) Training corpus generation method for intention recognition model, and related device thereof
WO2022126971A1 (en) Density-based text clustering method and apparatus, device, and storage medium
US9460117B2 (en) Image searching
US11727053B2 (en) Entity recognition from an image
US10713306B2 (en) Content pattern based automatic document classification
WO2022126970A1 (en) Method and device for financial fraud risk identification, computer device, and storage medium
WO2022174491A1 (en) Artificial intelligence-based method and apparatus for medical record quality control, computer device, and storage medium
WO2022134584A1 (en) Real estate picture verification method and apparatus, computer device and storage medium
CN112632278A (en) Labeling method, device, equipment and storage medium based on multi-label classification
WO2022126962A1 (en) Knowledge graph-based method for detecting guiding and abetting corpus and related device
WO2021103594A1 (en) Tacitness degree detection method and device, server and readable storage medium
CN116956326A (en) Authority data processing method and device, computer equipment and storage medium
CN116661936A (en) Page data processing method and device, computer equipment and storage medium
CN114547257B (en) Class matching method and device, computer equipment and storage medium
WO2022105120A1 (en) Text detection method and apparatus from image, computer device and storage medium
WO2022142032A1 (en) Handwritten signature verification method and apparatus, computer device, and storage medium
CN113065354B (en) Method for identifying geographic position in corpus and related equipment thereof
CN113989618A (en) Recyclable article classification and identification method
CN112036501A (en) Image similarity detection method based on convolutional neural network and related equipment thereof
CN111597453A (en) User image drawing method and device, computer equipment and computer readable storage medium
CN115250200B (en) Service authorization authentication method and related equipment thereof
CN117076775A (en) Information data processing method, information data processing device, computer equipment and storage medium
CN117113400A (en) Data leakage tracing method, device, equipment and storage medium thereof
CN116796730A (en) Text error correction method, device, equipment and storage medium based on artificial intelligence
CN117827814A (en) Data verification method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893275

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21893275

Country of ref document: EP

Kind code of ref document: A1