WO2023040153A1 - Method, apparatus, and device for updating intent recognition model, and readable medium - Google Patents

Method, apparatus, and device for updating intent recognition model, and readable medium

Info

Publication number
WO2023040153A1
Authority
WO
WIPO (PCT)
Prior art keywords
intent
corpus
dialogue
category
recognition model
Prior art date
Application number
PCT/CN2022/071694
Other languages
French (fr)
Chinese (zh)
Inventor
罗圣西
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023040153A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to the field of artificial intelligence, and discloses a method, apparatus, device, and readable medium for updating an intent recognition model.
  • “Intent recognition” refers to determining the type of intent a user expresses, based on a piece of information the user inputs to express a query demand.
  • The inventors recognize that current intent recognition technology is mainly used in search engines, human-machine dialogue systems, and the like. In human-machine dialogue, an intent recognition model is built to identify the customer's intent. Because everyday human-machine dialogue is subject to environmental noise, a large amount of corpus that belongs to no existing intent category is generated. If the intent recognition model cannot correctly identify such corpus, user experience suffers considerably, and in severe cases there may be a risk of leaking user privacy.
  • The existing solution generates out-of-set corpus through data augmentation, generally by random insertion, deletion, swapping, and similar operations, and uses that corpus to train the rejection ability of the intent recognition model. However, this kind of data augmentation cannot guarantee that the generated corpus actually belongs to the out-of-set category, and it can entangle the training corpus, degrading the trained model's recognition of normal corpus. In other words, existing intent recognition models are updated with low accuracy.
  • The present application provides a method, apparatus, device, and readable medium for updating an intent recognition model, which are used to improve the intent recognition model's accuracy in recognizing the intent of a corpus.
  • A first aspect of the present application provides a method for updating an intent recognition model, including: obtaining an original dialogue corpus, and identifying a first intent category of each sentence in the original dialogue corpus through a preset intent recognition model; initializing a mask list corresponding to the original dialogue corpus, and adjusting a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list; constructing, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus, and identifying a second intent category of each sentence in the auxiliary dialogue corpus through the intent recognition model; detecting the degree of difference between the first intent category and the second intent category to obtain a first detection result, and, based on the first detection result, selecting sentences that satisfy a preset difference condition from the auxiliary dialogue corpus as a final out-of-set corpus; and labeling the final out-of-set corpus as out-of-set intent, and training the intent recognition model with the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
  • A second aspect of the present application provides a device for updating an intent recognition model, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: obtaining an original dialogue corpus, and identifying a first intent category of each sentence in the original dialogue corpus through a preset intent recognition model; initializing a mask list corresponding to the original dialogue corpus, and adjusting a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list; constructing, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus, and identifying a second intent category of each sentence in the auxiliary dialogue corpus through the intent recognition model; detecting the degree of difference between the first intent category and the second intent category to obtain a first detection result, and, based on the first detection result, selecting sentences that satisfy a preset difference condition from the auxiliary dialogue corpus as a final out-of-set corpus; and labeling the final out-of-set corpus as out-of-set intent, and training the intent recognition model with the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
  • A third aspect of the present application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps: obtaining an original dialogue corpus, and identifying a first intent category of each sentence in the original dialogue corpus through a preset intent recognition model; initializing a mask list corresponding to the original dialogue corpus, and adjusting a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list; constructing, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus, and identifying a second intent category of each sentence in the auxiliary dialogue corpus through the intent recognition model; detecting the degree of difference between the first intent category and the second intent category to obtain a first detection result, and, based on the first detection result, selecting sentences that satisfy a preset difference condition from the auxiliary dialogue corpus as a final out-of-set corpus; and labeling the final out-of-set corpus as out-of-set intent, and training the intent recognition model with the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
  • A fourth aspect of the present application provides an apparatus for updating an intent recognition model, including: a corpus acquisition module, configured to obtain an original dialogue corpus and identify a first intent category of each sentence in the original dialogue corpus through a preset intent recognition model; a mask construction module, configured to initialize a mask list corresponding to the original dialogue corpus and adjust a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list; a second intent module, configured to construct, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus and identify a second intent category of each sentence in the auxiliary dialogue corpus through the intent recognition model; a final out-of-set module, configured to detect the degree of difference between the first intent category and the second intent category to obtain a first detection result and, based on the first detection result, select sentences that satisfy a preset difference condition from the auxiliary dialogue corpus as a final out-of-set corpus; and a corpus training module, configured to label the final out-of-set corpus as out-of-set intent and train the intent recognition model with the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
  • The page to be obfuscated is obtained, a target detection model is used to determine the text area in the page to be obfuscated, and the position coordinates of the text area are determined; the text in the area is recognized to obtain the text; regular expressions are used to find, within that text, the text to be obfuscated and its position coordinates, and a color extraction algorithm extracts the color of the text to be obfuscated; according to that text color, an obfuscation layer is generated on the interface at the position coordinates of the text to be obfuscated, and the text is covered by the obfuscation layer to obtain a covered page.
  • This application uses a mask list, a computer data processing technique, to process the sentences of the original dialogue corpus: it constructs the mask list and replaces the relevant values to obtain candidate out-of-set corpus that meets the first difference condition. Because the corpus is processed according to preset mathematical rules, the generated candidate out-of-set corpus fits the out-of-set intent category more closely.
  • The method removes the need for additional comparison experiments that implement rejection by tuning a confidence threshold, which shortens the cycle for launching and optimizing the intent recognition model. It also gives the trained intent recognition model the ability to reject inputs while reducing the impact on recognition of normal corpus, thereby improving the update accuracy of the intent recognition model.
  • FIG. 1 is a schematic diagram of a first embodiment of the method for updating the intent recognition model of the present application.
  • FIG. 2 is a schematic diagram of a second embodiment of the method for updating the intent recognition model of the present application.
  • FIG. 3 is a schematic diagram of a third embodiment of the method for updating the intent recognition model of the present application.
  • FIG. 4 is a schematic diagram of a fourth embodiment of the method for updating the intent recognition model of the present application.
  • FIG. 5 is a schematic diagram of a fifth embodiment of the method for updating the intent recognition model of the present application.
  • FIG. 6 is a schematic diagram of an embodiment of the apparatus for updating the intent recognition model of the present application.
  • FIG. 7 is a schematic diagram of another embodiment of the apparatus for updating the intent recognition model of the present application.
  • FIG. 8 is a schematic diagram of an embodiment of the device for updating the intent recognition model of the present application.
  • Embodiments of the present application provide a method, apparatus, device, and readable medium for updating an intent recognition model.
  • The present application uses a mask list, a computer data processing technique, to process the sentences of the original dialogue corpus: it constructs the mask list and replaces the relevant values to obtain candidate out-of-set corpus that meets the first difference condition.
  • Because the corpus is processed according to preset mathematical rules, the generated candidate out-of-set corpus fits the out-of-set intent category more closely. The method avoids the additional comparison experiments needed to implement rejection by tuning a confidence threshold, and shortens the cycle for launching and optimizing the intent recognition model. It also gives the trained intent recognition model the ability to reject inputs while reducing the impact on recognition of normal corpus, which improves the update accuracy of the intent recognition model.
  • the first embodiment of the method for updating the intent recognition model in the embodiment of the present application includes:
  • the execution subject of the present application may be an intent recognition model updating device, or may also be a terminal or a server, which is not specifically limited here.
  • the embodiment of the present application is described by taking the server as an execution subject as an example.
  • Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • The original dialogue corpus here is generally collected and compiled by business personnel in the relevant industry from the questions frequently asked and the business frequently handled in that industry, and is used to build the intent recognition model.
  • The preset intent recognition model may be FastText (a fast text classifier), TextCNN (a convolutional text classification model), or any classification model based on a pre-trained language model. The original dialogue corpus is obtained and passed through the preset intent recognition model to identify the first intent category of each sentence. Intent categories can be recognized this way because the preset intent recognition model is essentially a series of mathematical operations that outputs a probability score for each intent; the label with the highest probability is selected, and that label and its probability are recorded as the output.
  • In practice, the target human-machine dialogue system obtains the original dialogue corpus collected by the relevant business personnel and, through the preset intent recognition model, performs intent recognition on each sentence to obtain its intent label and probability, where the probability represents how likely the sentence is to belong to that category. This yields the first intent category of each sentence in the original dialogue corpus.
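  • The label-selection step described above can be sketched as follows. This is a minimal illustration, not the application's implementation: `toy_model`, its label names, and its scores are hypothetical stand-ins for a real classifier such as FastText or TextCNN.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]


def toy_model(sentence: str) -> Scores:
    """Hypothetical stand-in for the preset intent recognition model."""
    if "password" in sentence:
        return {"change_password": 0.9, "check_balance": 0.1}
    return {"change_password": 0.2, "check_balance": 0.8}


def first_intent(sentence: str, model: Callable[[str], Scores]) -> Tuple[str, float]:
    """Return the highest-probability label and its probability."""
    scores = model(sentence)
    label = max(scores, key=scores.get)
    return label, scores[label]


def label_corpus(sentences: List[str], model: Callable[[str], Scores]) -> List[Tuple[str, float]]:
    """Identify the first intent category of every sentence in the corpus."""
    return [first_intent(s, model) for s in sentences]
```

  • Keeping the probability alongside the label matters for the later steps, which compare probabilities as well as labels.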
  • The mask here is a string of binary codes; in practice, other character elements may be used instead. Since computers essentially operate on binary codes, this solution uses a mask list, which is convenient for numerical conversion and computer processing. The preset selection rule is determined by the business type of the dialogue for which the intent recognition model is being trained. Depending on the length of the original dialogue corpus and the difficulty of processing the dialogue, two or more mask elements may be selected for numerical adjustment, and the adjustment rule only needs to follow some fixed mathematical rule, for example adjusting in the reading order of the corpus. A mask list is initialized for the obtained original dialogue corpus, and the elements of the mask list are adjusted and replaced according to the preset selection rule, yielding the adjusted mask list.
  • Based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus is constructed, and the second intent category of each sentence in the auxiliary dialogue corpus is identified through the intent recognition model.
  • Specifically, the first element values in the adjusted mask list are determined for each sentence, where a first element value is the mask element value assigned when the mask list was initialized from the original dialogue corpus. The first element values and their positions are determined, each is replaced with the word at the corresponding position in the original dialogue corpus, and the words are arranged in their original order to obtain the auxiliary dialogue corpus. The intent recognition model then identifies the label and corresponding probability of each sentence in the auxiliary dialogue corpus, which yields the second intent category.
  • In practice, the target human-machine dialogue system takes the numerically transformed mask list and first identifies the first mask elements (those still holding the initialized value) and their positions. It then replaces each identified first mask element with the word at the corresponding position in the original dialogue corpus, and arranges those words in the order of the original dialogue corpus to obtain the auxiliary dialogue corpus. The auxiliary dialogue corpus is then passed through the intent recognition model currently being trained to identify the label and probability of each sentence, yielding the second intent category of the auxiliary dialogue corpus.
  • The preset difference conditions here include a first difference condition and a second difference condition. Using the first and second intent categories obtained by recognition, the degree of difference is detected between the label and corresponding probability of each sentence.
  • Difference detection means checking whether intent recognition assigns different labels to a sentence in the original dialogue corpus and its counterpart in the auxiliary dialogue corpus, or comparing the two probabilities, that is, checking whether the difference between the label probability for the auxiliary dialogue corpus and the label probability for the original dialogue corpus exceeds a threshold (set according to the probability characteristics of the intent recognition model, and generally small). Mask list transformation, intent recognition, and difference detection are repeated in a loop until the preset loop condition is met, giving the first detection result. Based on the first detection result, sentences satisfying the preset first difference condition are selected from the auxiliary dialogue corpus as candidate out-of-set corpus. After category recognition and difference detection are performed on the candidate out-of-set corpus, a second detection result is obtained, and sentences satisfying the preset second difference condition are selected from the candidate out-of-set corpus as the final out-of-set corpus.
  • In practice, the target human-machine dialogue system uses the first and second intent categories to detect the degree of difference between the probabilities for each sentence, and checks whether that difference exceeds the preset value. Mask list transformation, intent recognition, and difference detection are repeated until the exit condition (all words in the sentence have been replaced) is met, giving the first detection result. Using the preset first difference condition, sentences whose two intent recognition probabilities differ widely are selected from each dialogue corpus as candidate out-of-set corpus. The intent recognition model then identifies a third intent category of each sentence in the candidate out-of-set corpus; the degree of difference between the first intent category and the third intent category is detected to obtain the second detection result; and, according to the second detection result, sentences satisfying the preset second difference condition are selected from the candidate out-of-set corpus as the final out-of-set corpus.
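  • A minimal sketch of the difference check follows, assuming an intent is represented as a (label, probability) pair. The function names and the 0.1 threshold are hypothetical; the text says only that the threshold depends on the model's probability characteristics and is generally small.

```python
from typing import List, Tuple

Intent = Tuple[str, float]  # (label, probability)


def differs(first: Intent, second: Intent, threshold: float = 0.1) -> bool:
    """True if the label changed or the probability shifted by more than threshold."""
    label_changed = first[0] != second[0]
    prob_shifted = abs(first[1] - second[1]) > threshold
    return label_changed or prob_shifted


def select_candidates(
    first: Intent,
    auxiliaries: List[Tuple[str, Intent]],
    threshold: float = 0.1,
) -> List[str]:
    """Keep auxiliary sentences whose intent differs enough from the original's."""
    return [s for s, intent in auxiliaries if differs(first, intent, threshold)]
```

  • The same predicate could be reused for the second pass, comparing the first intent category with the third intent category of each candidate, possibly with a different threshold.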
  • The final out-of-set corpus obtained in step 104 is labeled as the out-of-set intent of this model. The original dialogue corpus and the final out-of-set corpus are then merged into a new training corpus, and the intent recognition model is trained on it using a machine learning method, yielding a new intent recognition model with a rejection function.
  • The machine learning method here refers to training the previous intent recognition model on the new training corpus.
  • In practice, the target human-machine dialogue system labels the obtained final out-of-set corpus as the out-of-set intent of the model being trained, merges it with the original dialogue corpus to obtain a new training corpus, and repeatedly trains the intent recognition model on the new training corpus using a machine learning method, obtaining a new intent recognition model with a rejection function.
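  • The merge of the two corpora can be sketched as below. The label name `out_of_set` and the (sentence, label) pair layout are assumptions made for illustration; the training call itself is omitted because it depends on the chosen classifier.

```python
from typing import List, Tuple

OUT_OF_SET = "out_of_set"  # hypothetical name for the out-of-set intent label

Example = Tuple[str, str]  # (sentence, intent label)


def build_training_corpus(original: List[Example], final_out_of_set: List[str]) -> List[Example]:
    """Merge the labeled original corpus with out-of-set sentences."""
    merged = list(original)  # copy, so the caller's list is left untouched
    merged.extend((sentence, OUT_OF_SET) for sentence in final_out_of_set)
    return merged
```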
  • Once the new intent recognition model has been trained, a corpus to be recognized is passed to it. The new model recognizes the corpus, using what it learned from the out-of-set corpus to identify input that is unrelated to any real intent, and obtains the intent category of input that does correspond to a real intent. The recognized intent is then returned to the target human-machine dialogue system for display.
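  • At inference time, rejection then amounts to checking for the out-of-set label. The sketch below assumes the updated model returns a (label, probability) pair; `toy_updated_model` and the label name are hypothetical.

```python
from typing import Callable, Optional, Tuple

OUT_OF_SET = "out_of_set"  # hypothetical name for the out-of-set intent label


def toy_updated_model(sentence: str) -> Tuple[str, float]:
    """Hypothetical stand-in for the retrained intent recognition model."""
    if "password" in sentence:
        return ("change_password", 0.93)
    return (OUT_OF_SET, 0.88)


def recognize(sentence: str, model: Callable[[str], Tuple[str, float]]) -> Optional[str]:
    """Return the recognized intent, or None when the input is rejected."""
    label, _prob = model(sentence)
    return None if label == OUT_OF_SET else label
```

  • No separate confidence threshold appears here, which matches the text's point that rejection is learned as its own category rather than tuned by threshold experiments.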
  • In this embodiment, the present application uses a mask list, a computer data processing technique, to process the sentences of the original dialogue corpus: it constructs the mask list and replaces the relevant values to obtain candidate out-of-set corpus that meets the first difference condition.
  • Because the corpus is processed according to preset mathematical rules, the generated candidate out-of-set corpus fits the out-of-set intent category more closely. The method avoids the additional comparison experiments needed to implement rejection by tuning a confidence threshold, and shortens the cycle for launching and optimizing the intent recognition model. It also gives the trained intent recognition model the ability to reject inputs while reducing the impact on recognition of normal corpus, which improves the accuracy of updating the intent recognition model.
  • the second embodiment of the method for updating the intent recognition model in the embodiment of the present application includes:
  • Sentence segmentation here splits the original dialogue corpus into sentences by identifying basic punctuation marks, giving the multiple sentences of the original dialogue corpus. The string length of each sentence is then computed with a preset string function.
  • The mask list corresponding to the original dialogue corpus is constructed from masks.
  • In practice, the human-machine dialogue system applies a preset sentence segmentation function to the original dialogue corpus received as external input. The function uses a set of basic punctuation marks (such as the full stop and the exclamation mark) to recognize and split on the punctuation in the original dialogue corpus, producing multiple sentences. A string function then counts the characters in each sentence to obtain its string length, and for each sentence a mask is formed from the preset first mask value repeated to that length. These masks together make up the mask list corresponding to the original dialogue corpus. For example, with the first mask value set to 0, a mask list of the same length as the sentence is initialized with all elements 0: if the original sentence is "I want to modify my account password", whose length is 10 (evidently counted in characters of the original-language sentence), the generated mask list is [0,0,0,0,0,0,0,0,0,0].
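  • The segmentation and initialization steps can be sketched as follows. The punctuation set and function names are assumptions. The sample sentence 我想修改我的账户密码 is also an assumption: it is a plausible ten-character original for the translated example, which the source gives only in English together with the length 10.

```python
import re
from typing import List

FIRST_MASK_VALUE = 0  # the "first mask value" used in the example above


def split_sentences(text: str) -> List[str]:
    """Split on basic sentence-ending punctuation and drop empty pieces."""
    parts = re.split(r"[.!?。！？]+", text)
    return [p.strip() for p in parts if p.strip()]


def init_mask_list(text: str) -> List[List[int]]:
    """Build one all-zero mask per sentence, as long as the sentence."""
    return [[FIRST_MASK_VALUE] * len(s) for s in split_sentences(text)]
```

  • With the assumed sample, `init_mask_list("我想修改我的账户密码。")` yields a single mask of ten zeros, matching the list in the example.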
  • The preset selection rule is used to choose a position in each mask segment, and the mask holds the first mask value at the selected position. The first mask value at the selected position is then replaced with the second mask value; the position itself does not change after the replacement, which gives the numerically transformed mask list.
  • In practice, the human-machine dialogue system uses the preset selection rule to determine the selected position in each mask segment of the initialized mask list. The selection rule here follows normal reading order and selects two characters at a time from left to right; the first selection takes the first two mask elements of the sentence, giving the first mask values [0,0] at the selected position. The preset second mask value then replaces the first mask values at the selected position to obtain the numerically transformed mask list. For example, with the second mask value set to 1, the 0 elements at the selected characters' positions are changed to 1, and the list for the original sentence in the preceding example becomes [1,1,0,0,0,0,0,0,0,0].
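  • The left-to-right selection rule can be sketched as a generator that yields one adjusted mask per selection. The non-overlapping stride (the window advances by its own width) is an assumption; the text says only that two characters are selected at a time, from left to right, until all words have been replaced.

```python
from typing import Iterator, List

FIRST_MASK_VALUE = 0
SECOND_MASK_VALUE = 1


def adjusted_masks(length: int, window: int = 2) -> Iterator[List[int]]:
    """Yield one adjusted mask per window position, moving left to right."""
    for start in range(0, length, window):
        mask = [FIRST_MASK_VALUE] * length
        # Replace the first mask value with the second inside the window only;
        # every other position keeps its initialized value.
        for i in range(start, min(start + window, length)):
            mask[i] = SECOND_MASK_VALUE
        yield mask
```

  • For a ten-character sentence the first yielded mask is `[1,1,0,0,0,0,0,0,0,0]`, and five masks are produced before every position has been selected once.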
  • In this embodiment, the original dialogue corpus is split into sentences and the string length of each sentence is computed. The preset first element value, repeated to each string length, forms the mask for each sentence, and these masks make up the mask list corresponding to the original dialogue corpus. The preset selection rule then determines the selected position in each mask segment, and the preset second element value replaces the first element values at the selected positions, giving the adjusted mask list.
  • The present application uses a mask list to process the sentences of the original dialogue corpus. The first element value is used to construct the mask, and the preset selection rule then replaces the selected first element values with the second element value to obtain the adjusted mask list. This avoids the random insertion, deletion, swapping, and similar operations that the prior art applies to the original dialogue corpus, so the processing follows regular mathematical rules and can yield more corpus from outside the training set.
  • the third embodiment of the method for updating the intent recognition model in the embodiment of the present application includes:
  • the system uses a first-mask-value identification function, which takes the preset first mask value as its identifier and traverses each mask, identifying the first mask values in the numerically transformed mask list together with their positions. Based on the correspondence between the mask list and the sentences of the original dialogue corpus, the characters of each sentence that sit at the same positions as the first mask values are then selected.
  • following the order of the first mask value positions, the characters selected from each sentence are combined in order to obtain a corresponding new sentence.
  • the new sentences are spliced together in the same way that the sentences of the original dialogue corpus are combined, giving the auxiliary dialogue corpus corresponding to the original dialogue corpus. The intent recognition model to be trained then performs intent recognition on the auxiliary dialogue corpus and identifies the second intent category of each sentence in it.
  • the human-computer dialogue system uses the first-mask-value identification function to locate the first mask values of each mask in the numerically transformed mask list, and selects from the original dialogue corpus the characters at the same positions. For example, for the list [1,1,0,0,0,0,0,0,0,0,0] obtained in the previous embodiment, the first mask values at the third through tenth positions and their positions are identified. Then, for the original dialogue corpus "I want to modify my account password", the characters whose mask element is 0 are selected and spliced in the reading order of the original sentence to obtain the auxiliary dialogue corpus corresponding to the original dialogue corpus.
  • a new sentence can be formed by selecting and splicing the original dialogue material.
  • in this embodiment, the position of the first element value is determined for each mask in the numerically transformed mask list, and the characters at those positions are selected from each sentence of the original corpus. The characters selected from each sentence are combined in the order of the first element value positions to obtain new sentences, and the new sentences are spliced to obtain the auxiliary dialogue corpus corresponding to the original dialogue corpus, on which intent recognition is performed to obtain the second intent category.
  • by determining the element positions in the adjusted mask list, the present application maps those positions back to characters of the original dialogue sentences and combines them into new sentences, on which intent recognition is performed to obtain the second intent category. Because the adjusted mask list is converted directly into the corresponding characters of the original sentences, the intent recognition of the auxiliary dialogue corpus is simple to carry out and the desired intent recognition result is obtained faster.
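
The selection and splicing step above can be illustrated with a short sketch. The function name `build_auxiliary` and the sample string are hypothetical; the only behavior taken from the text is that characters whose mask element still holds the first value (0) are kept, in their original order.

```python
from typing import List


def build_auxiliary(sentences: List[str],
                    mask_list: List[List[int]],
                    first_value: int = 0) -> List[str]:
    """Keep, in reading order, the characters whose mask element equals first_value."""
    new_sentences = []
    for sentence, mask in zip(sentences, mask_list):
        kept = [ch for ch, m in zip(sentence, mask) if m == first_value]
        new_sentences.append("".join(kept))
    return new_sentences


# Hypothetical eight-character sentence; the first two characters are masked out.
aux = build_auxiliary(["XYchange"], [[1, 1, 0, 0, 0, 0, 0, 0]])
```

Each new sentence would then be passed to the intent recognition model to obtain its second intent category.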
  • the fourth embodiment of the method for updating the intent recognition model in the embodiment of the present application includes:
  • the two intent categories obtained are compared to judge whether the intent categories of the corresponding sentences are the same. If they are the same, the auxiliary dialogue corpus from this round of intent recognition is used as the initial dialogue corpus for the new round.
  • if they are different, the original dialogue corpus of the current round is used as the initial dialogue corpus for the new round.
  • the next round of mask list value transformation, intent recognition, and difference degree detection is performed on the initial dialogue corpus, and processing stops once the initial dialogue corpus satisfies the preset exit condition, at which point a new first detection result is obtained.
  • the preset exit condition is met when every character in each piece of corpus in the original dialogue corpus has been traversed and selected by the preset selection rule.
  • the human-computer dialogue system judges whether the first intent category and the second intent category are the same. If they are the same, the auxiliary dialogue corpus is used as the initial dialogue corpus for the new round; if they differ, the initial dialogue corpus of this round is carried over as the initial dialogue corpus for the next round. For example, with the auxiliary dialogue corpus "change my account password" obtained in the previous embodiment and the initial dialogue corpus "I want to change my account password", the intent labels of the two should be the same and their probabilities very close, so the first two elements of the mask list are kept as 1. An element of 1 in the mask list means that the corresponding character contributes little to the semantics of the original sentence. The auxiliary dialogue corpus "change my account password" is therefore used as the new initial dialogue corpus for the next round; had the intent categories differed, the initial dialogue corpus "I want to change my account password" would have been used instead. The next round of mask value conversion, intent recognition, and difference degree comparison is then performed on the new initial dialogue corpus, and the cycle repeats until every character in the initial dialogue corpus has been traversed and replaced. For example, when the new initial dialogue corpus is "Change my account password", the second and third characters are selected for the next cycle, and once the two characters of "password" have been replaced the exit condition is satisfied and a new first detection result is obtained.
  • according to the first detection result, it is judged whether the first intent category of each piece of corpus in the original dialogue corpus and the second intent category of the corresponding auxiliary dialogue corpus are the same in degree of difference. Here, the two are considered the same when, after intent recognition of the two corresponding corpora, their intent categories are identical and both probabilities are high (generally greater than 80%).
  • the preset first difference condition is evaluated by comparing the labels and probabilities obtained from intent recognition of the original dialogue corpus with those obtained from intent recognition of the auxiliary dialogue corpus, and it is met when the two differ substantially. If a sentence's second intent category differs from its first intent category, the sentence corresponding to that second intent category is determined to satisfy the preset first difference condition and is taken as candidate out-of-set corpus.
  • based on the first detection result, the human-computer dialogue system compares the first intent category of each piece of the original dialogue corpus with the second intent category of the processed auxiliary dialogue corpus, judging whether the two belong to the same category with a relatively high probability. If the two intent categories are not the same, the sentence corresponding to the second intent category is determined to satisfy the preset first difference condition and is taken as candidate out-of-set corpus.
  • for the candidate out-of-set corpus obtained in step 408, the intent recognition model to be trained performs intent recognition on each sentence, producing an intent label and its probability, so that the third intent category of each sentence in the candidate out-of-set corpus is obtained.
  • difference degree detection is performed between the first intent category, obtained by intent recognition on the original dialogue corpus, and the third intent category, obtained by intent recognition on the candidate out-of-set corpus, using the label and probability that intent recognition assigns to each piece of corpus. This gives the second detection result, according to which sentences satisfying the preset second difference condition are selected from the candidate out-of-set corpus as the final out-of-set corpus. The preset second difference condition is evaluated by comparing the labels and probabilities obtained from intent recognition of the original dialogue corpus with those obtained from intent recognition of the candidate out-of-set corpus, and it is met when the probability difference between the two is large.
  • the target human-computer dialogue system compares the labels and probabilities of the first intent category, obtained from the original dialogue corpus, with those of the third intent category, obtained from the candidate out-of-set corpus, and takes the outcome of this comparison as the second detection result. According to the second detection result, the label and probability of each candidate sentence are compared with the label and probability of the corresponding sentence in the original dialogue corpus, and the sentences whose difference is large enough to satisfy the preset second difference condition are selected from the candidate out-of-set corpus as the final out-of-set corpus.
  • in this embodiment, if the detection result shows the same intent category, the auxiliary dialogue corpus is used as the new initial dialogue corpus; if the detection result shows different intent categories, the initial dialogue corpus is kept as the new initial dialogue corpus. The preset selection rule then drives the cycle of mask transformation, intent recognition, and difference detection, which yields the first detection result. Based on the first detection result, it is judged whether the first and second intent categories are the same, and the sentences whose second intent category differs are taken as the candidate out-of-set corpus. Intent recognition, difference detection, and the difference condition judgment on the candidate out-of-set corpus then give the final out-of-set corpus.
  • the present application can identify all the out-of-set corpus that can be composed from the characters of the original training dialogue corpus, so as to obtain the candidate out-of-set corpus. This method obtains as much out-of-set corpus as possible and avoids the additional comparative experiments that existing methods need in order to implement rejection by setting a confidence threshold. It thereby shortens the time needed to bring the intent recognition model online and to optimize it afterwards, and yields an intent recognition model with higher intent recognition accuracy.
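
The round-by-round loop of this embodiment can be sketched as below. This is only one reading of the text: the sketch assumes non-overlapping two-character windows over the original positions, compares every round against the label of the original sentence, and keeps a window masked only when the intent label is unchanged. The `classify` callable stands in for the intent recognition model and `toy_classify` is a hypothetical stand-in, not a real model.

```python
from typing import Callable, List, Tuple

Classifier = Callable[[str], Tuple[str, float]]  # returns (label, probability)


def collect_candidates(sentence: str, classify: Classifier, width: int = 2) -> List[str]:
    """Mask successive windows and collect auxiliary sentences whose label changes."""
    first_label, _ = classify(sentence)
    mask = [0] * len(sentence)
    candidates = []
    for start in range(0, len(sentence), width):
        trial = list(mask)
        for i in range(start, min(start + width, len(sentence))):
            trial[i] = 1
        auxiliary = "".join(ch for ch, m in zip(sentence, trial) if m == 0)
        if not auxiliary:
            continue  # nothing left to classify
        label, _ = classify(auxiliary)
        if label == first_label:
            mask = trial  # same intent: auxiliary becomes the new initial sentence
        else:
            candidates.append(auxiliary)  # different intent: candidate out-of-set
    return candidates


def toy_classify(text: str) -> Tuple[str, float]:
    """Hypothetical classifier: 'password' intent if the text contains 'pw'."""
    return ("password", 0.9) if "pw" in text else ("other", 0.6)


found = collect_candidates("xxpwyy", toy_classify)
```

The loop ends once every window has been visited, which corresponds to the exit condition that all characters have been traversed by the selection rule.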
  • the fifth embodiment of the method for updating the intent recognition model in the embodiment of the present application includes:
  • Detect the degree of difference between the first intent category and the second intent category to obtain a first detection result, and sequentially determine, according to the first detection result, whether each first intent category and each corresponding second intent category are the same;
  • Detect the degree of difference between the first intent category and the third intent category to obtain a second detection result, and judge, according to the second detection result, whether the degree of difference between each first intent category and each corresponding third intent category is greater than the preset difference degree threshold;
  • the second detection result is obtained by detecting the degree of difference between the first intent category and the third intent category. According to the second detection result, it is then judged, for each first intent category and the third intent category of the corresponding candidate out-of-set corpus, whether the degree of difference between the intent categories is greater than the preset difference degree threshold.
  • the difference condition is evaluated from the labels and probabilities obtained from intent recognition of the original dialogue corpus and those obtained from intent recognition of the candidate out-of-set corpus: it is judged whether the intent categories of the two corresponding corpora differ and whether the degree of difference is greater than the preset difference degree threshold (here the threshold is generally set to 80%, so the condition holds when the two intent labels differ and the probability value is relatively large), and the outcome is taken as the difference detection result.
  • according to the second detection result, the human-computer dialogue system compares the first intent category, obtained by intent recognition on each piece of corpus in the original dialogue corpus, with the third intent category, obtained by intent recognition on the candidate out-of-set corpus derived from that piece of original corpus, and judges whether the difference between the candidate's intent category and the original corpus's intent category exceeds the preset difference degree threshold. If it does, the candidate out-of-set corpus corresponding to these differing third intent categories is selected as the final out-of-set corpus for model training.
  • the sentences satisfying the preset second difference condition are selected as the final out-of-set corpus.
  • selecting the candidate out-of-set corpus and then applying the second difference condition avoids the problem that out-of-set corpus constructed by traditional data-augmentation corpus synthesis methods may become entangled with the normal training corpus. Using the mask list together with the existing intent recognition model ensures that the generated corpus is genuinely out-of-set, so that the trained intent recognition model gains the ability to reject inputs while the impact on its recognition of normal corpus is reduced.
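
The second difference condition can be sketched as a simple filter. The text leaves the exact difference measure open, so this sketch adopts one possible reading as an assumption: a candidate is kept when its label differs from the first intent category and the model's probability for that label exceeds the threshold (0.8, following the 80% mentioned above). The name `select_final` and the `classify` callable are hypothetical.

```python
from typing import Callable, List, Tuple

Classifier = Callable[[str], Tuple[str, float]]  # returns (label, probability)


def select_final(first_label: str,
                 candidates: List[str],
                 classify: Classifier,
                 threshold: float = 0.8) -> List[str]:
    """Keep candidates whose third intent category differs with high probability."""
    final = []
    for sentence in candidates:
        third_label, probability = classify(sentence)
        if third_label != first_label and probability > threshold:
            final.append(sentence)
    return final
```

A different difference measure, such as the gap between the two probabilities, could be substituted in the condition without changing the overall structure.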
  • An embodiment of the device for updating the intent recognition model in the embodiments of the present application includes: a corpus acquisition module 601, used to obtain the original dialogue corpus and to identify, through the preset intent recognition model, the first intent category of each sentence in the original dialogue corpus;
  • a mask construction module 602, used to initialize the mask list corresponding to the original dialogue corpus and to adjust a group of element values in the mask list according to the preset selection rule to obtain the adjusted mask list;
  • a second intent module 603, used to construct the auxiliary dialogue corpus corresponding to the original dialogue corpus based on the adjusted mask list and to identify, through the intent recognition model, the second intent category of each sentence in the auxiliary dialogue corpus;
  • the final out-of-set module 604 is used to detect the degree of difference between the first intent category and the second intent category to obtain the first detection result , and based on the first detection result, select sentences satisfying
  • the present application uses a mask list, a computer data processing technique, to process the sentences of the original dialogue corpus: a mask list is constructed and the relevant values are replaced to obtain candidate out-of-set corpus that meets the first difference condition. Because the corpus is processed according to a preset mathematical rule, the generated candidate out-of-set corpus is more likely to fall under out-of-set intents. This method avoids the additional comparative experiments needed to implement rejection by setting a confidence threshold and shortens the cycle for launching and optimizing the intent recognition model. At the same time, it gives the trained intent recognition model the ability to reject inputs, reduces the impact on the recognition of normal corpus, and improves the update accuracy of the intent recognition model.
  • another embodiment of the device for updating the intent recognition model in the embodiments of the present application includes: a corpus acquisition module 601, used to obtain the original dialogue corpus and to identify, through the preset intent recognition model, the first intent category of each sentence in the original dialogue corpus;
  • a mask construction module 602, used to initialize the mask list corresponding to the original dialogue corpus and to adjust a set of element values in the mask list according to the preset selection rule to obtain an adjusted mask list;
  • a second intent module 603, used to construct the auxiliary dialogue corpus corresponding to the original dialogue corpus based on the adjusted mask list and to identify, through the intent recognition model, the second intent category of each sentence in the auxiliary dialogue corpus;
  • a final out-of-set module 604, used to detect the degree of difference between the first intent category and the second intent category to obtain the first detection result and, based on the first detection result, to select the sentences satisfying the preset difference condition from the auxiliary dialogue corpus as the final out-of-set corpus;
  • the corpus training module 605 is used to mark the final out-of-
  • the mask construction module 602 includes: a character calculation unit 6021, used to split the original dialogue corpus into a plurality of sentences and to calculate the string length of each sentence; a mask combination unit 6022, used to repeat the preset first element value as many times as each string length to form the mask corresponding to each sentence; and a corpus correspondence unit 6023, used to construct the mask list corresponding to the original dialogue corpus from the masks.
  • the mask construction module 602 also includes: an element selection unit 6024, used to determine, according to the preset selection rule, the adjustment position corresponding to each mask in the mask list; and an element replacement unit 6025, used to replace the first element value at the adjustment position with the preset second element value to obtain the adjusted mask list.
  • the second intent module 603 includes: a word selection unit 6031, used to determine the position of the first element value of each mask in the numerically transformed mask list and to select, from each sentence in the original corpus, the characters at the same positions as the first element values; a sequential combination unit 6032, used to combine, in the order of the first element value positions, the characters selected from each sentence to obtain the corresponding new sentences; and a sentence splicing unit 6033, used to splice the new sentences to obtain the auxiliary dialogue corpus corresponding to the original dialogue corpus.
  • the final out-of-set module 604 is further used to: if the first detection result is that the first intent category and the second intent category are the same, use the auxiliary dialogue corpus as the initial dialogue corpus; if the first detection result is that the first intent category and the second intent category are different, use the original dialogue corpus as the initial dialogue corpus; and perform the next round of mask list value transformation, intent recognition, and difference degree detection on the initial dialogue corpus until the initial dialogue corpus meets the preset exit condition, obtaining a new first detection result.
  • the final out-of-set module 604 includes: a difference judging unit 6041, configured to sequentially judge, according to the first detection result, whether each first intent category and each corresponding second intent category are the same; a candidate selection unit 6042, used to determine, if they are not the same, that the sentence corresponding to the second intent category that differs from the first intent category satisfies the preset first difference condition and to take it as candidate out-of-set corpus; a candidate identification unit 6043, used to identify, through the intent recognition model, the third intent category of each sentence in the candidate out-of-set corpus; and a final selection unit 6044, used to detect the degree of difference between the first intent category and the third intent category to obtain the second detection result and, according to the second detection result, to select from the candidate out-of-set corpus the sentences satisfying the preset second difference condition as the final out-of-set corpus.
  • the final selection unit 6044 is further used to judge, according to the second detection result, whether the degree of difference between each first intent category and each corresponding third intent category is greater than the preset difference degree threshold and, if it is greater, to determine that the sentences corresponding to the third intent categories whose degree of difference exceeds the preset threshold satisfy the second difference condition and to take them as the final out-of-set corpus.
  • this embodiment describes in detail the specific functions of each module and the unit structure of some modules. The mask list is obtained by processing the original dialogue corpus with the mask elements, and the cycle of element replacement, intent recognition, and difference detection is then repeated until the characters of every sentence in the original dialogue corpus have been processed, giving the candidate out-of-set corpus. Intent recognition and the difference condition judgment are then applied to the candidate out-of-set corpus to obtain the final out-of-set corpus, which is trained together with the original dialogue corpus using basic machine learning methods to obtain an intent recognition model with a rejection function. This speeds up model training, avoids corpus entanglement, and yields an intent recognition model that recognizes the intent of normal corpus more efficiently.
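
Preparing the retraining data can be sketched as follows. The rejection label name `out_of_set`, the function `build_training_set`, and the rule of skipping any out-of-set sentence that also appears in the original corpus are assumptions made for illustration; the text only states that the final out-of-set corpus is marked and trained together with the original dialogue corpus.

```python
from typing import List, Tuple

OUT_OF_SET = "out_of_set"  # hypothetical name for the rejection label


def build_training_set(original: List[Tuple[str, str]],
                       final_out_of_set: List[str],
                       out_label: str = OUT_OF_SET) -> List[Tuple[str, str]]:
    """Merge labeled original sentences with out-of-set sentences under one label."""
    seen = {text for text, _ in original}
    merged = list(original)
    for text in final_out_of_set:
        if text not in seen:  # avoid relabeling text that is already in-set
            merged.append((text, out_label))
            seen.add(text)
    return merged
```

The merged list can then be fed to whichever training routine the intent recognition model uses.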
  • FIG. 8 is a schematic structural diagram of an intention recognition model updating device provided by an embodiment of the present application.
  • the intent recognition model updating device 800 may vary considerably with configuration or performance, and may include one or more central processing units (CPU) 810 (for example, one or more processors), a memory 820, and one or more storage media 830 (for example, one or more mass storage devices) storing application programs 833 or data 832.
  • the memory 820 and the storage medium 830 may be temporary storage or persistent storage.
  • the program stored in the storage medium 830 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the intent recognition model update device 800 .
  • the processor 810 may be configured to communicate with the storage medium 830 , and execute a series of instruction operations in the storage medium 830 on the intent recognition model update device 800 .
  • the intent recognition model update device 800 may also include one or more power sources 840, one or more wired or wireless network interfaces 850, one or more input and output interfaces 860, and/or one or more operating systems 831, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc.
  • the present application also provides an intent recognition model update device, which is a computer device including a memory and a processor; computer-readable instructions are stored in the memory and, when executed by the processor, cause the processor to execute the steps of the intent recognition model update method in the above embodiments.
  • the present application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium and, when run on a computer, cause the computer to execute the steps of the method for updating the intent recognition model.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.
  • the application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
  • This application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including storage devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

A method, apparatus, and device for updating an intent recognition model, and a readable medium, which relate to the field of artificial intelligence. The method improves the accuracy of an intent recognition model recognizing corpus intent. The method comprises: obtaining an original dialog corpus, and using an intent model for recognition to obtain a first intent category; initializing the corpus into a mask list, and replacing an element value in the mask list according to a selection rule; constructing an auxiliary dialogue corpus corresponding to the original dialogue corpus, and performing recognition by means of a training model to obtain a second intent category; detecting the degree of difference between the first and second intent categories, obtaining a first detection result, and performing selection to obtain an alternative outside corpus; by means of a training model, recognizing the alternative outside corpus to obtain a third intent category; detecting the degree of difference between the first and third intent categories to obtain a second detection result, and performing selection to obtain a final outside corpus; and retraining said corpus and the original dialogue corpus to obtain an intent recognition model.

Description

Method, Apparatus, and Device for Updating an Intent Recognition Model, and Readable Medium
This application claims priority to Chinese patent application No. 202111095912.8, filed with the China Patent Office on September 18, 2021 and entitled "Method, Apparatus, and Device for Updating an Intent Recognition Model, and Readable Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence and discloses a method, apparatus, and device for updating an intent recognition model, and a readable medium.
Background
With the continuous development and innovation of computer technology, artificial intelligence has gradually been applied across industries, and related intelligent products and applications have permeated every aspect of daily life, greatly improving how people work and live. Human-computer dialogue is an important research area of artificial intelligence. It studies how to enable computers to understand and use human natural language so that humans and computers can communicate in natural language, allowing computers to take over part of human mental labor, extend the capabilities of the human brain, and reduce human workload. In daily life, dialogue scenarios are complex and varied, so a computer must accurately recognize the customer's intent during a dialogue and understand the customer's needs in order to conduct the dialogue well and meet the customer's real needs.
"Intent recognition" refers to determining the intent category to which a piece of user input expressing a query need belongs. The inventors realized that current intent recognition technology is mainly applied in search engines, human-computer dialogue systems, and the like. In human-computer dialogue, an intent recognition model is built to recognize the customer's intent. Because everyday human-computer dialogue is disturbed by environmental noise, a large amount of corpus is produced that does not belong to any existing intent category. If the intent recognition model cannot correctly identify such corpus, user experience suffers considerably, and in severe cases there may be a risk of leaking user privacy. The existing solution generates out-of-set corpus through data augmentation, typically random insertion, deletion, and swapping, and uses the out-of-set corpus to train the rejection capability of the intent recognition model. However, such data augmentation cannot guarantee that the generated corpus actually belongs to the out-of-set category, and it can cause corpus entanglement in the training corpus, which degrades the trained model's recognition of normal corpus. In other words, existing intent recognition model updates suffer from low accuracy.
Summary of the Invention
The present application provides a method, apparatus, and device for updating an intent recognition model, and a readable medium, which are used to improve the accuracy with which the intent recognition model recognizes the intent of a corpus.
A first aspect of the present application provides a method for updating an intent recognition model, the method comprising: obtaining an original dialogue corpus, and recognizing, by means of a preset intent recognition model, a first intent category of each sentence in the original dialogue corpus; initializing a mask list corresponding to the original dialogue corpus, and adjusting a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list; constructing, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus, and recognizing, by means of the intent recognition model, a second intent category of each sentence in the auxiliary dialogue corpus; detecting the degree of difference between the first intent category and the second intent category to obtain a first detection result, and selecting, based on the first detection result, sentences that satisfy a preset difference condition from the auxiliary dialogue corpus as a final out-of-set corpus; and labeling the final out-of-set corpus as an out-of-set intent, and training the intent recognition model with the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
A second aspect of the present application provides a device for updating an intent recognition model, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps: obtaining an original dialogue corpus, and recognizing, by means of a preset intent recognition model, a first intent category of each sentence in the original dialogue corpus; initializing a mask list corresponding to the original dialogue corpus, and adjusting a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list; constructing, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus, and recognizing, by means of the intent recognition model, a second intent category of each sentence in the auxiliary dialogue corpus; detecting the degree of difference between the first intent category and the second intent category to obtain a first detection result, and selecting, based on the first detection result, sentences that satisfy a preset difference condition from the auxiliary dialogue corpus as a final out-of-set corpus; and labeling the final out-of-set corpus as an out-of-set intent, and training the intent recognition model with the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
A third aspect of the present application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps: obtaining an original dialogue corpus, and recognizing, by means of a preset intent recognition model, a first intent category of each sentence in the original dialogue corpus; initializing a mask list corresponding to the original dialogue corpus, and adjusting a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list; constructing, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus, and recognizing, by means of the intent recognition model, a second intent category of each sentence in the auxiliary dialogue corpus; detecting the degree of difference between the first intent category and the second intent category to obtain a first detection result, and selecting, based on the first detection result, sentences that satisfy a preset difference condition from the auxiliary dialogue corpus as a final out-of-set corpus; and labeling the final out-of-set corpus as an out-of-set intent, and training the intent recognition model with the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
A fourth aspect of the present application provides an apparatus for updating an intent recognition model, the apparatus comprising: a corpus acquisition module, configured to obtain an original dialogue corpus and recognize, by means of a preset intent recognition model, a first intent category of each sentence in the original dialogue corpus; a mask construction module, configured to initialize a mask list corresponding to the original dialogue corpus and adjust a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list; a second intent module, configured to construct, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus and recognize, by means of the intent recognition model, a second intent category of each sentence in the auxiliary dialogue corpus; a final out-of-set module, configured to detect the degree of difference between the first intent category and the second intent category to obtain a first detection result and select, based on the first detection result, sentences that satisfy a preset difference condition from the auxiliary dialogue corpus as a final out-of-set corpus; and a corpus training module, configured to label the final out-of-set corpus as an out-of-set intent and train the intent recognition model with the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
In the technical solution provided by the present application, a mask list is used to process the sentences of the original dialogue corpus: the mask list is constructed and its values are replaced to obtain a candidate out-of-set corpus that satisfies the first difference condition. Because the corpus is processed according to a preset mathematical rule, the generated candidate out-of-set corpus is more likely to belong to the out-of-set intent category. In addition, recognizing the candidate out-of-set corpus and comparing it against the second difference condition further avoids corpus entanglement in the training corpus. The final out-of-set corpus and the original dialogue corpus are then used for retraining to obtain a new intent recognition model. This approach eliminates the additional comparison experiments otherwise needed to implement rejection by tuning a confidence threshold, and shortens the cycle for launching and optimizing the intent recognition model. It gives the trained model a rejection capability while limiting the impact on recognition of normal corpus, thereby improving the accuracy of intent recognition model updates.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a first embodiment of the method for updating an intent recognition model of the present application;
FIG. 2 is a schematic diagram of a second embodiment of the method for updating an intent recognition model of the present application;
FIG. 3 is a schematic diagram of a third embodiment of the method for updating an intent recognition model of the present application;
FIG. 4 is a schematic diagram of a fourth embodiment of the method for updating an intent recognition model of the present application;
FIG. 5 is a schematic diagram of a fifth embodiment of the method for updating an intent recognition model of the present application;
FIG. 6 is a schematic diagram of one embodiment of the apparatus for updating an intent recognition model of the present application;
FIG. 7 is a schematic diagram of another embodiment of the apparatus for updating an intent recognition model of the present application;
FIG. 8 is a schematic diagram of one embodiment of the device for updating an intent recognition model of the present application.
Detailed Description
The embodiments of the present application provide a method, apparatus, and device for updating an intent recognition model, and a readable medium. Whereas prior-art data augmentation relies on random insertion, deletion, and swapping, the embodiments of the present application use a mask list to process the sentences of the original dialogue corpus: the mask list is constructed and its values are replaced to obtain a candidate out-of-set corpus that satisfies the first difference condition. Because the corpus is processed according to a preset mathematical rule, the generated candidate out-of-set corpus is more likely to belong to the out-of-set intent category. This approach eliminates the additional comparison experiments otherwise needed to implement rejection by tuning a confidence threshold, and shortens the cycle for launching and optimizing the intent recognition model. It gives the trained model a rejection capability while limiting the impact on recognition of normal corpus, thereby improving the accuracy of intent recognition model updates.
The terms "first", "second", "third", "fourth", and the like (if any) in the specification, the claims, and the above drawings of the present application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that terms so used are interchangeable where appropriate, so that the embodiments described herein can be practiced in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
For ease of understanding, the specific flow of the embodiments of the present application is described below. Referring to FIG. 1, a first embodiment of the method for updating an intent recognition model in the embodiments of the present application includes:
101. Obtain an original dialogue corpus, and recognize, by means of a preset intent recognition model, a first intent category of each sentence in the original dialogue corpus;
It can be understood that the execution subject of the present application may be an apparatus for updating an intent recognition model, or may be a terminal or a server, which is not specifically limited here. The embodiments of the present application are described with a server as the execution subject.
The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, methods, technology, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
In this embodiment, the original dialogue corpus is generally collected and summarized by business personnel of the relevant industry from the questions frequently asked in that industry and the business it frequently handles; it is the basis for building the intent recognition model. The preset intent recognition model may be FastText (a fast text classifier), textCNN (a text classification algorithm), or any classification model based on a pre-trained language model. The original dialogue corpus is obtained, and the preset intent recognition model recognizes the first intent category of each sentence in it. Recognition works because the preset intent recognition model is essentially a chain of mathematical operations that outputs a probability score for each intent; the label with the highest probability is selected, and that label and its probability are recorded and output.
In practice, the target human-computer dialogue system obtains the original dialogue corpus collected by the relevant business personnel and uses the preset intent recognition model to recognize, for each sentence in the corpus, its intent label and probability, where the probability represents how likely the sentence is to belong to that category. This yields the first intent category of each sentence in the original dialogue corpus.
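As a non-limiting illustration, the selection of the highest-probability label and its probability in step 101 might be sketched as follows. The keyword table `TOY_KEYWORDS`, the intent names, and the functions `intent_probabilities` and `recognize_intent` are hypothetical stand-ins introduced only for this example; in practice the scores would come from a trained classifier such as FastText or textCNN.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical keyword lists standing in for a trained intent classifier.
TOY_KEYWORDS: Dict[str, List[str]] = {
    "check_balance": ["balance", "account"],
    "report_loss": ["lost", "card"],
}


def intent_probabilities(sentence: str) -> Dict[str, float]:
    """Return a softmax distribution over the toy intents for one sentence."""
    tokens = sentence.lower().split()
    # Raw score per intent: how many of its keywords occur in the sentence.
    scores = {
        intent: float(sum(tokens.count(word) for word in words))
        for intent, words in TOY_KEYWORDS.items()
    }
    total = sum(math.exp(score) for score in scores.values())
    return {intent: math.exp(score) / total for intent, score in scores.items()}


def recognize_intent(sentence: str) -> Tuple[str, float]:
    """Return the most probable label and its probability (the intent category)."""
    probabilities = intent_probabilities(sentence)
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


# First intent category of each sentence in a small original dialogue corpus.
original_corpus = ["what is my account balance", "i lost my card"]
first_intents = [recognize_intent(sentence) for sentence in original_corpus]
```

Each entry of `first_intents` is a (label, probability) pair, which is the form that the later difference detection compares.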
102. Initialize a mask list corresponding to the original dialogue corpus, and adjust a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list;
In this embodiment, the mask is a string of binary code; in practice other character elements may be used instead. Because computers fundamentally operate on binary code, this solution uses a mask list, which simplifies value changes and computer processing. The preset selection rule depends on the type of business dialogue that the intent recognition model is to be trained for. Depending on the length of the original dialogue corpus and the difficulty of processing the dialogue, two or more mask elements may be adjusted at a time, and the adjustment rule need only follow some mathematical regularity, for example adjusting in the order in which the corpus is read. The obtained original dialogue corpus is initialized into a mask list, and elements of the mask list are adjusted and replaced according to the preset selection rule to obtain the adjusted mask list.
In practice, for each sentence of the obtained original dialogue corpus, a mask list of the same length as the sentence is first initialized with every element set to 0. The first two characters of the sentence are then selected and the mask elements at their positions are changed to 1. The characters of the original sentence whose mask elements are 0 and the characters whose mask elements were changed to 1 are then arranged in their original order to form a new sentence, giving the adjusted mask list.
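The mask initialization and adjustment of step 102 can be sketched, under stated assumptions, as follows. The function names `init_mask` and `adjust_mask` and the parameters `start` and `width` are hypothetical; the default `width` of 2 mirrors the "first two characters" example above, and other selection rules are equally possible.

```python
from typing import List


def init_mask(tokens: List[str]) -> List[int]:
    """Initialize a mask list of the same length as the sentence, all zeros."""
    return [0] * len(tokens)


def adjust_mask(mask: List[int], start: int, width: int = 2) -> List[int]:
    """Return a copy of the mask with `width` elements from `start` set to 1.

    Assumption: the preset selection rule picks a contiguous group of
    positions; the window is clipped at the end of the sentence.
    """
    adjusted = list(mask)
    for position in range(start, min(start + width, len(adjusted))):
        adjusted[position] = 1
    return adjusted


tokens = ["please", "check", "my", "account", "balance"]
mask = init_mask(tokens)             # [0, 0, 0, 0, 0]
adjusted = adjust_mask(mask, start=0)  # [1, 1, 0, 0, 0]
```

Returning a copy rather than modifying the mask in place keeps the all-zero list available for the next adjustment when the procedure is repeated over the sentence.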
103. Construct, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus, and recognize, by means of the intent recognition model, a second intent category of each sentence in the auxiliary dialogue corpus;
In this embodiment, the adjusted mask list is used to determine the first element value corresponding to each piece of corpus, where the first element value is the mask element value assigned when the mask list was initialized from the original dialogue corpus. After the first element values and their positions are determined, they are replaced with the corresponding words of the original dialogue corpus, which are arranged in the original order to obtain the auxiliary dialogue corpus. The auxiliary dialogue corpus is then passed through the intent recognition model, which recognizes the label and corresponding probability of each sentence to obtain the second intent category.
In practice, the target human-computer dialogue system takes the mask list after its values have been changed, first identifies the elements that still hold the initialized value (the first mask elements) and their positions, then replaces each first mask element with the word at the corresponding position of the original dialogue corpus, and arranges those words in the order of the original dialogue corpus to obtain the auxiliary dialogue corpus. The auxiliary dialogue corpus is then passed through the intent recognition model currently being trained, which recognizes the label and corresponding probability of each sentence, giving the second intent category of the auxiliary dialogue corpus.
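One possible reading of step 103 is that the auxiliary sentence keeps, in their original order, only the words whose mask element still holds the initialized value 0, so that the words at positions set to 1 are left out. The sketch below follows that reading, which is an assumption of this example rather than a statement of the claimed method. The names `build_auxiliary` and `sliding_auxiliaries`, and the non-overlapping window of `width` words, are hypothetical.

```python
from typing import Iterator, List


def build_auxiliary(tokens: List[str], mask: List[int]) -> List[str]:
    """Keep the words whose mask element is 0, preserving the original order."""
    if len(tokens) != len(mask):
        raise ValueError("mask and sentence must have the same length")
    return [token for token, flag in zip(tokens, mask) if flag == 0]


def sliding_auxiliaries(tokens: List[str], width: int = 2) -> Iterator[List[str]]:
    """Yield one auxiliary sentence per window of `width` masked positions.

    Assumption: the selection rule moves through the sentence in reading
    order, in non-overlapping windows, until every word has been masked once.
    """
    for start in range(0, len(tokens), width):
        mask = [1 if start <= i < start + width else 0 for i in range(len(tokens))]
        yield build_auxiliary(tokens, mask)


tokens = ["please", "check", "my", "account", "balance"]
auxiliary_corpus = list(sliding_auxiliaries(tokens))
```

Each auxiliary sentence would then be passed to the intent recognition model to obtain its second intent category.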
104. Detect the degree of difference between the first intent category and the second intent category to obtain a first detection result, and select, based on the first detection result, sentences that satisfy a preset difference condition from the auxiliary dialogue corpus as a final out-of-set corpus;
In this embodiment, the preset difference condition comprises a first difference condition and a second difference condition. Based on the first intent category and the second intent category, the label and corresponding probability obtained by intent recognition for each sentence are used to detect the degree of difference. Difference detection here means comparing whether the labels recognized for the original dialogue corpus and the auxiliary dialogue corpus differ or, if they are the same, comparing their probabilities: generally, the gap between the label probability of the auxiliary sentence and the baseline label probability of the original sentence must exceed a threshold (set according to the probability characteristics of the intent recognition model, and generally small). The mask list change, intent recognition, and difference detection are repeated in a loop until a preset loop condition is met, yielding the first detection result. Based on the first detection result, sentences satisfying the preset first difference condition are selected from the auxiliary dialogue corpus as the candidate out-of-set corpus. Category recognition and difference detection are then performed on the candidate out-of-set corpus to obtain the second detection result, and sentences satisfying the preset second difference condition are selected from the candidate out-of-set corpus as the final out-of-set corpus.
In practice, the target human-computer dialogue system detects, based on the first intent category and the second intent category, the degree of difference between the intent category and corresponding probability of each pair of sentences, checking whether the difference exceeds a preset value. Mask list transformation, intent recognition, and difference detection are repeated until the exit condition is met, namely that every word of the corpus has been processed, which yields the first detection result. Using the preset first difference condition, sentences whose intent recognition probability differs substantially from that of the original are selected from each dialogue corpus as the candidate out-of-set corpus. The intent recognition model then recognizes a third intent category for each sentence of the candidate out-of-set corpus; the degree of difference between the first intent category and the third intent category is detected to obtain the second detection result; and, according to the second detection result, sentences satisfying the preset second difference condition are selected from the candidate out-of-set corpus as the final out-of-set corpus.
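The difference detection described for step 104 might be sketched as below. This is a minimal sketch under stated assumptions: the condition is taken to hold when the labels differ or, for equal labels, when the probabilities differ by more than a threshold. The threshold value 0.1, the type alias `Prediction`, and the function names are hypothetical; the source only says the threshold is set from the model's probability characteristics.

```python
from typing import List, Tuple

Prediction = Tuple[str, float]  # (intent label, probability of that label)


def meets_difference_condition(
    reference: Prediction, other: Prediction, threshold: float = 0.1
) -> bool:
    """True if the labels differ, or they match but the probability gap is large."""
    reference_label, reference_prob = reference
    other_label, other_prob = other
    if reference_label != other_label:
        return True
    return abs(reference_prob - other_prob) > threshold


def select_candidates(
    reference: Prediction,
    auxiliaries: List[Tuple[str, Prediction]],
    threshold: float = 0.1,
) -> List[str]:
    """Select auxiliary sentences whose prediction differs enough from the original."""
    return [
        sentence
        for sentence, prediction in auxiliaries
        if meets_difference_condition(reference, prediction, threshold)
    ]


# First intent category of the original sentence, then (sentence, second
# intent category) pairs for its auxiliary sentences; values are made up.
first_intent = ("check_balance", 0.90)
auxiliaries = [
    ("my account balance", ("check_balance", 0.88)),
    ("please check", ("check_balance", 0.55)),
    ("please my", ("report_loss", 0.60)),
]
candidates = select_candidates(first_intent, auxiliaries)
```

The same comparison could be reused for the second pass, with the third intent category of each candidate sentence in place of the second intent category and a separately chosen threshold for the second difference condition.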
105、将最终集外语料标注为集外意图,并采用原始对话语料和最终集外语料,对意图识别 模型进行训练,得到新的意图识别模型。105. Mark the final out-of-set corpus as out-of-set intent, and use the original dialogue corpus and the final out-of-set corpus to train the intent recognition model to obtain a new intent recognition model.
本实施例中,根据步骤104所得的最终集外语料,将其标注为本模型的集外意图,进而将原始对话语料和处理所得最终集外语料合并为新的训练语料,基于机器学习方法对新的训练语料进行意图识别训练,获得新的意图识别模型,进而得到具有拒识功能的意图识别模型,这里的基础机器学习方法是指运用得到的新的训练语料数据和以往的意图识别模型训练经验,利用人工智能等技术对模型进行不断训练,以此来提高的意图识别模型的识别准确率。In this embodiment, according to the final out-of-set corpus obtained in step 104, it is marked as the out-of-set intention of this model, and then the original dialogue data and the final out-of-set corpus obtained from the processing are combined into new training corpus, which is based on machine learning methods Carry out intent recognition training with new training corpus, obtain a new intent recognition model, and then obtain an intent recognition model with a recognition rejection function. The basic machine learning method here refers to using the new training corpus data obtained and the previous intent recognition model training. Experience, using artificial intelligence and other technologies to continuously train the model, so as to improve the recognition accuracy of the intent recognition model.
In practice, the target human-machine dialogue system labels the final out-of-set corpus with the out-of-set intent of the model being trained, merges the original dialogue corpus with the final out-of-set corpus to form a new training corpus, and trains the intent recognition model repeatedly on that corpus using a machine learning method, obtaining a new intent recognition model with rejection capability.
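The labeling and merging in step 105 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the label name `out_of_set`, the function `build_training_set`, and the `(text, label)` pair representation are all assumptions made for the example, and the actual training call is omitted because the source does not name a model or framework.

```python
from typing import List, Tuple

# Hypothetical label name for the out-of-set intent; the source does not fix one.
OOS_LABEL = "out_of_set"

Example = Tuple[str, str]  # (sentence, intent label)


def build_training_set(original: List[Example], final_oos: List[str]) -> List[Example]:
    """Merge the labeled original corpus with out-of-set sentences.

    Every sentence in `final_oos` receives the out-of-set label; the
    original examples are kept unchanged and come first.
    """
    merged = list(original)
    merged.extend((text, OOS_LABEL) for text in final_oos)
    return merged
```

The merged list would then be passed to whatever training routine the intent recognition model uses.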
In this embodiment, once the new intent recognition model has been trained, the corpus to be recognized is passed to it. The new model checks the corpus to be recognized against the out-of-set corpus it was trained on, identifies the out-of-set utterances that are unrelated to any real intent, and obtains the utterances that carry a real intent together with their intent categories. The recognized intents are then returned to the target human-machine dialogue system for display.
In the embodiments of the present application, whereas prior-art data augmentation relies on random insertion, deletion, and swapping, the present application processes the sentences of the original dialogue corpus with a mask list: a mask list is constructed and its values are replaced to obtain candidate out-of-set corpus that satisfies the first difference condition. Because the corpus is processed according to a preset, deterministic rule, the generated candidate out-of-set corpus fits the out-of-set intent category more closely. The method removes the extra comparison experiments otherwise needed to implement rejection by tuning a confidence threshold, and so shortens the cycle for deploying and optimizing the intent recognition model. It also gives the trained model rejection capability while reducing the impact on recognition of normal corpus, which improves the accuracy of the intent recognition model update.
Referring to Figure 2, a second embodiment of the method for updating the intent recognition model in the embodiments of the present application includes:
201. Obtain the original dialogue corpus, and identify the first intent category of each sentence in the original dialogue corpus using a preset intent recognition model;
202. Split the original dialogue corpus into sentences, and calculate the string length of each sentence;
In this embodiment, the original dialogue corpus is split into sentences by recognizing its basic punctuation marks, which yields a plurality of sentences. A preset string function then calculates the string length of each sentence, giving a length value for every sentence.
203. For each sentence, combine preset first element values, as many as the sentence's string length, into the mask corresponding to that sentence;
In this embodiment, once the string length of each sentence is known, the preset first element value is repeated that many times to form the mask corresponding to each sentence.
204. Use the masks to construct the mask list corresponding to the original dialogue corpus;
In this embodiment, the masks are used to construct the mask list corresponding to the original dialogue corpus.
In practice, the human-machine dialogue system applies a preset sentence-splitting function to the original dialogue corpus received from external input. The function recognizes a configured set of basic punctuation marks (such as the full stop and the exclamation mark) and splits the corpus at them, producing a plurality of sentences. A string function then counts the characters in each sentence to obtain its string length, and the preset first element value is repeated that many times to form the mask for each sentence; together the masks make up the mask list of the original dialogue corpus. For example, with the first element value set to 0, a mask of the same length as the sentence is initialized with every element equal to 0. If the original sentence is "我想修改我的账户密码" ("I want to change my account password"), whose length is 10 characters, the corresponding mask is the 10-element list [0,0,0,0,0,0,0,0,0,0].
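Steps 202 to 204 can be sketched as follows. The delimiter set `SENTENCE_END` and the function names are assumptions made for illustration; the source only says that basic punctuation marks such as the full stop and the exclamation mark are used.

```python
import re
from typing import List

# Assumed delimiter set: full-width and ASCII sentence-ending marks.
SENTENCE_END = "。！？.!?"


def split_sentences(corpus: str) -> List[str]:
    """Split a dialogue corpus at sentence-ending punctuation, dropping the marks."""
    parts = re.split(f"[{re.escape(SENTENCE_END)}]", corpus)
    return [p.strip() for p in parts if p.strip()]


def init_mask_list(sentences: List[str], first_value: int = 0) -> List[List[int]]:
    """Build one mask per sentence: the first element value repeated len(sentence) times."""
    return [[first_value] * len(s) for s in sentences]
```

Here `len` counts Unicode code points, so each Chinese character contributes one mask element, which matches the 10-element mask in the example above.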
205. According to a preset selection rule, determine the adjustment positions in each mask of the mask list;
In this embodiment, the preset selection rule is applied to the initialized mask list: the positions to be adjusted are first determined in each mask, and the first element values at those positions are then selected.
206. Replace the first element values at the adjustment positions with a preset second element value to obtain the adjusted mask list;
In this embodiment, the first element values at the selected positions of each mask are replaced with the second element value. The positions themselves do not change, so the result is the numerically transformed mask list.
In practice, the human-machine dialogue system applies the preset selection rule to the initialized mask list to determine the selected positions in each mask and picks the mask elements at those positions. For example, the rule may follow normal reading order and select two characters at a time from left to right, so the first pass selects the first two mask elements of the sentence and obtains the first element values [0,0] at those positions. The preset second element value then replaces the first element values at the selected positions, giving the numerically transformed mask list. With the second element value set to 1, the mask elements of the selected characters change from 0 to 1, and the mask of the example sentence from the preceding embodiment becomes [1,1,0,0,0,0,0,0,0,0].
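Steps 205 and 206 can be sketched as follows, assuming the left-to-right, two-characters-at-a-time rule given in the example. The function names and the `start` and `width` parameters are hypothetical.

```python
from typing import List


def select_positions(mask: List[int], start: int, width: int = 2) -> List[int]:
    """Return the indices of the next `width` elements, clipped to the mask length."""
    return list(range(start, min(start + width, len(mask))))


def adjust_mask(mask: List[int], positions: List[int], second_value: int = 1) -> List[int]:
    """Return a copy of `mask` with the second element value at the given positions."""
    adjusted = list(mask)
    for i in positions:
        adjusted[i] = second_value
    return adjusted
```

Returning a copy rather than mutating the input keeps the previous round's mask available in case the adjustment has to be discarded.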
207. Based on the adjusted mask list, construct the auxiliary dialogue corpus corresponding to the original dialogue corpus, and identify the second intent category of each sentence in the auxiliary dialogue corpus using the intent recognition model;
208. Detect the degree of difference between the first intent category and the second intent category to obtain a first detection result, and, based on the first detection result, select from the auxiliary dialogue corpus the sentences that satisfy the preset difference condition as the final out-of-set corpus;
209. Label the final out-of-set corpus with the out-of-set intent, and train the intent recognition model on the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
In this embodiment of the present application, the original dialogue corpus is split into sentences and the string length of each sentence is calculated. The preset first element value, repeated to match each string length, forms the mask for each sentence, and the masks make up the mask list of the original dialogue corpus. Based on the preset selection rule, the positions to adjust in each mask are determined, the first element values at those positions are selected and replaced with the preset second element value, and the adjusted mask list is obtained. Compared with the prior art, which applies random insertion, deletion, and swapping to the original dialogue corpus, the present application processes the sentences through a mask list and a preset selection rule. The processing therefore follows a regular, deterministic procedure and yields out-of-set training corpus that better meets requirements.
Referring to Figure 3, a third embodiment of the method for updating the intent recognition model in the embodiments of the present application includes:
301. Obtain the original dialogue corpus, and identify the first intent category of each sentence in the original dialogue corpus using a preset intent recognition model;
302. Initialize the mask list corresponding to the original dialogue corpus, and adjust a group of element values in the mask list according to a preset selection rule to obtain the adjusted mask list;
303. Determine the positions of the first element value in each mask of the numerically transformed mask list, and select from each sentence of the original corpus the characters at those positions;
In this embodiment, the system uses a first-mask-value identification function, which traverses the whole mask with the preset first mask value as the identifying marker and so finds every first mask value and its position in the mask list. The first mask values and their positions are determined for each mask in the numerically transformed mask list, and, using the correspondence between the initialized mask list and each sentence of the original dialogue corpus, the characters of each sentence at the positions of the first mask value are obtained.
304. Combine the characters selected from each sentence in the order of the first element value positions to obtain a corresponding new sentence;
In this embodiment, the characters of each sentence that sit at the positions of the first mask value are combined in the original order of those positions, giving the corresponding new sentence.
305. Concatenate the new sentences to obtain the auxiliary dialogue corpus corresponding to the original dialogue corpus;
In this embodiment, the new sentences are concatenated in the same arrangement as the sentences of the original dialogue corpus to obtain the corresponding auxiliary dialogue corpus. The auxiliary dialogue corpus is then passed through the intent recognition model being trained, which identifies the second intent category of each of its sentences.
In practice, the human-machine dialogue system uses the first-mask-value identification function to find the positions of the first mask value in each mask of the numerically transformed mask list, and selects the characters at those positions from each sentence of the original dialogue corpus. Taking the mask [1,1,0,0,0,0,0,0,0,0] from the preceding embodiment, the function identifies the first mask values at the 3rd through 10th positions. With the original sentence "我想修改我的账户密码" ("I want to change my account password"), the characters whose mask element is 0 are selected and joined in their original reading order to form a new sentence, here "修改我的账户密码" ("change my account password"), which becomes the auxiliary dialogue corpus for the original sentence. The auxiliary dialogue corpus is then passed through the intent recognition model being trained to identify the intent category of each sentence, which gives the second intent category.
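Steps 303 to 305 amount to keeping the characters whose mask element still holds the first element value. A minimal sketch, with the function name `build_auxiliary` assumed for illustration:

```python
from typing import List


def build_auxiliary(sentence: str, mask: List[int], first_value: int = 0) -> str:
    """Keep, in order, the characters whose mask element equals the first element value."""
    if len(sentence) != len(mask):
        raise ValueError("mask and sentence must have the same length")
    return "".join(ch for ch, m in zip(sentence, mask) if m == first_value)
```

Applied to the example sentence and the mask [1,1,0,0,0,0,0,0,0,0], this drops the first two characters and keeps the remaining eight.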
306. Detect the degree of difference between the first intent category and the second intent category to obtain a first detection result, and, based on the first detection result, select from the auxiliary dialogue corpus the sentences that satisfy the preset difference condition as the final out-of-set corpus;
307. Label the final out-of-set corpus with the out-of-set intent, and train the intent recognition model on the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
In this embodiment of the present application, the positions of the first element value are determined in each mask of the adjusted mask list, the characters at those positions are selected from each sentence of the original corpus and combined in position order into a new sentence, and the new sentences are concatenated into the auxiliary dialogue corpus corresponding to the original dialogue corpus. Intent recognition on the auxiliary dialogue corpus then gives the second intent category. Compared with the prior art, the present application maps the element positions of the adjusted mask list back to the characters of the original sentences and runs intent recognition on the recombined sentences. The processing is simple to implement and produces the required intent recognition result more quickly.
Referring to Figure 4, a fourth embodiment of the method for updating the intent recognition model in the embodiments of the present application includes:
401. Obtain the original dialogue corpus, and identify the first intent category of each sentence in the original dialogue corpus using a preset intent recognition model;
402. Initialize the mask list corresponding to the original dialogue corpus, and adjust a group of element values in the mask list according to a preset selection rule to obtain the adjusted mask list;
403. Based on the adjusted mask list, construct the auxiliary dialogue corpus corresponding to the original dialogue corpus, and identify the second intent category of each sentence in the auxiliary dialogue corpus using the intent recognition model;
404. If the first detection result is that the first intent category and the second intent category are the same, use the auxiliary dialogue corpus as the initial dialogue corpus;
In this embodiment, the two intent categories are compared to determine whether the corresponding sentences fall in the same intent category. If the first detection result is that the first and second intent categories are the same, the auxiliary dialogue corpus from this round of intent recognition becomes the initial dialogue corpus for the next round.
405. If the first detection result is that the first intent category and the second intent category are different, use the original dialogue corpus as the initial dialogue corpus;
In this embodiment, if the first detection result is that the first and second intent categories are different, the original dialogue corpus of this round becomes the initial dialogue corpus for the next round.
406. Perform the next round of mask-list value transformation, intent recognition, and difference detection on the initial dialogue corpus, stopping when the initial dialogue corpus satisfies the preset exit condition, to obtain a new first detection result;
In this embodiment, the next round of mask-list value transformation, intent recognition, and difference detection is performed on the initial dialogue corpus, and processing stops when the initial dialogue corpus satisfies the preset exit condition, which gives a new first detection result. The preset exit condition is met when the preset selection rule has traversed and selected every word of every sentence in the original dialogue corpus.
In practice, the human-machine dialogue system checks whether the first and second intent categories are the same. If they are, the auxiliary dialogue corpus of this round becomes the new initial dialogue corpus for the next round; if they differ, the initial dialogue corpus of this round is carried over as the new initial dialogue corpus. In the earlier example, the auxiliary sentence is "修改我的账户密码" ("change my account password") and the initial sentence is "我想修改我的账户密码" ("I want to change my account password"). The two should receive the same intent label with very similar probabilities, so the first two elements of the mask stay at 1. An element equal to 1 in the mask can be read as meaning that the corresponding character contributes little to the meaning of the original sentence. The auxiliary sentence "修改我的账户密码" is therefore used as the new initial dialogue corpus for the next round. Had the two intent categories differed, the initial sentence "我想修改我的账户密码" would have been carried over instead. The new initial dialogue corpus then goes through the next cycle of mask value transformation, intent recognition, and difference comparison, and the cycle repeats until every word of the initial dialogue corpus has been traversed and replaced. For example, with "修改我的账户密码" as the new initial dialogue corpus, the second and third characters are selected for the next pass, and once the two characters "密码" ("password") have been replaced the exit condition is met and the new first detection result is obtained.
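The loop in steps 404 to 406 can be sketched as below. This is one possible reading, not the applicant's implementation: it assumes non-overlapping two-character windows taken over the original sentence, keeps a masked window only when the label is unchanged, and records the auxiliary sentence as a candidate when the label changes. The `Classifier` interface and `toy_classify` are hypothetical stand-ins for the intent recognition model.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: text -> (intent label, probability).
Classifier = Callable[[str], Tuple[str, float]]


def mask_search(sentence: str, classify: Classifier,
                width: int = 2) -> Tuple[List[int], List[str]]:
    """Mask successive windows; keep a window masked only if the label is unchanged."""
    first_label, _ = classify(sentence)
    mask = [0] * len(sentence)
    candidates: List[str] = []
    for start in range(0, len(sentence), width):  # exit once every character was tried
        trial = list(mask)
        for i in range(start, min(start + width, len(sentence))):
            trial[i] = 1
        auxiliary = "".join(ch for ch, m in zip(sentence, trial) if m == 0)
        if not auxiliary:
            continue  # nothing left to classify
        label, _ = classify(auxiliary)
        if label == first_label:
            mask = trial  # characters contribute little; keep them masked
        else:
            candidates.append(auxiliary)  # label changed; revert and record
    return mask, candidates


def toy_classify(text: str) -> Tuple[str, float]:
    """Toy stand-in for the model: needs both keywords for the in-set intent."""
    hit = "修改" in text and "密码" in text
    return ("change_password", 0.9) if hit else ("other", 0.6)
```

With the toy classifier, the example sentence ends with only "修改" and "密码" unmasked, which matches the intuition that masked characters carry little of the intent.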
407. According to the first detection result, determine in turn whether each first intent category is the same as the corresponding second intent category;
In this embodiment, the first detection result is used to check, sentence by sentence, whether the first intent category of each sentence in the original dialogue corpus matches the second intent category of the corresponding sentence in the auxiliary dialogue corpus. Two corresponding sentences count as the same when, after intent recognition, their intent categories are identical and both probabilities are high (generally above 80%).
408. If they are not the same, determine that the sentence whose second intent category differs from the first intent category satisfies the preset first difference condition, and take it as candidate out-of-set corpus;
In this embodiment, the preset first difference condition compares the label and probability obtained by intent recognition on the original dialogue corpus with the label and probability obtained on the auxiliary dialogue corpus, and is satisfied when the two probabilities differ by a large margin. If the second intent category of a sentence differs from the first intent category, that sentence is determined to satisfy the preset first difference condition and is taken as candidate out-of-set corpus.
In practice, the human-machine dialogue system uses the first detection result to compare, in turn, the first intent category of each sentence in the original dialogue corpus with the second intent category of the corresponding sentence in the auxiliary dialogue corpus, checking whether the categories are the same and the probabilities are high. Where the second intent category of a sentence differs from the first, that sentence is determined to satisfy the preset first difference condition and is taken as candidate out-of-set corpus.
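The checks in steps 407 and 408 can be sketched as follows. The 0.8 cut-off mirrors the "generally above 80%" remark in the text; the names `same_degree` and `pick_candidates` and the `(label, probability)` pair representation are assumptions made for the example.

```python
from typing import List, Tuple

Prediction = Tuple[str, float]  # (intent label, probability)


def same_degree(first: Prediction, second: Prediction, min_prob: float = 0.8) -> bool:
    """True when both labels match and both probabilities exceed `min_prob`."""
    return first[0] == second[0] and first[1] > min_prob and second[1] > min_prob


def pick_candidates(first: Prediction,
                    auxiliaries: List[Tuple[str, Prediction]]) -> List[str]:
    """Return auxiliary sentences whose label differs from the original's label."""
    return [text for text, pred in auxiliaries if pred[0] != first[0]]
```

The candidate filter here uses only the label mismatch described in step 408; a probability-gap criterion could be added on top if the first difference condition is configured that way.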
409. Identify the third intent category of each sentence in the candidate out-of-set corpus using the intent recognition model;
In this embodiment, the intent recognition model being trained is applied to each sentence of the candidate out-of-set corpus obtained in step 408 to recognize its intent label and the corresponding probability, which gives the third intent category of each sentence in the candidate out-of-set corpus.
410. Detect the degree of difference between the first intent category and the third intent category to obtain a second detection result, and, according to the second detection result, select from the candidate out-of-set corpus the sentences that satisfy the preset second difference condition as the final out-of-set corpus;
In this embodiment, the first intent category obtained from the original dialogue corpus and the third intent category obtained from the candidate out-of-set corpus are compared using the label and probability recognized for each sentence, which gives the second detection result. Based on the second detection result, sentences satisfying the preset second difference condition are selected from the candidate out-of-set corpus as the final out-of-set corpus. The preset second difference condition compares the label and probability recognized for the original dialogue corpus with the label and probability recognized for the candidate out-of-set corpus, and is satisfied when the two probabilities differ by a large margin.
In practice, the target human-machine dialogue system takes the first intent category obtained from the original dialogue corpus and the third intent category obtained from the candidate out-of-set corpus, and compares the label and probability of each pair of corresponding sentences. The result of this comparison is the second detection result. The candidate out-of-set sentences whose probability differs from that of the corresponding original sentence by a large margin satisfy the preset second difference condition and are selected as the final out-of-set corpus.
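The source does not state how the probability gap in step 410 is computed or how large it must be. The sketch below assumes one plausible reading: the model returns a full label distribution, the gap is the drop in probability of the original sentence's top label, and a hypothetical threshold `min_gap` decides what counts as large.

```python
from typing import Dict, List, Tuple

Distribution = Dict[str, float]  # intent label -> probability


def probability_gap(orig: Distribution, cand: Distribution) -> float:
    """Drop in probability of the original's top label when scored on the candidate."""
    label = max(orig, key=orig.get)
    return orig[label] - cand.get(label, 0.0)


def select_final(orig: Distribution,
                 candidates: List[Tuple[str, Distribution]],
                 min_gap: float = 0.5) -> List[str]:
    """Keep candidates whose gap meets the assumed threshold `min_gap`."""
    return [text for text, dist in candidates
            if probability_gap(orig, dist) >= min_gap]
```

The value of `min_gap` would need to be tuned on held-out data; 0.5 is only a placeholder.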
411. Label the final out-of-set corpus with the out-of-set intent, and train the intent recognition model on the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
In this embodiment of the present application, degree-of-difference detection is performed on the first and second intent categories. If the two are the same, the auxiliary dialogue corpus becomes the new initial dialogue corpus; if they differ, the initial dialogue corpus is retained as the new initial dialogue corpus. Mask transformation, intent recognition, and difference detection are then repeated according to the preset selection rule. Once every word in a sentence has been processed, the first detection result is obtained. Based on the first detection result, the first and second intent categories are compared, and the sentences whose second intent category differs from the first are taken as the candidate out-of-set corpus. Intent recognition, difference detection, and a difference-condition check on the candidate out-of-set corpus then yield the final out-of-set corpus. Compared with the prior art, the present application can examine all out-of-set sentences that can be composed from the words of the original training dialogue corpus, and so obtains as much out-of-set corpus as possible. It dispenses with the additional comparison experiments that existing methods need in order to implement rejection by setting a confidence threshold, shortens the time required to bring the intent recognition model online and to optimize it afterwards, and yields an intent recognition model with higher recognition accuracy.
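The loop described above can be sketched for a single sentence as follows. This is only an illustrative reading, not the applicant's implementation: the left-to-right selection rule, the exit condition (every position visited), and the names `collect_candidates` and `toy_classify` are assumptions, and the keyword-based classifier merely stands in for the preset intent recognition model.

```python
from typing import Callable


def toy_classify(text: str) -> str:
    """Hypothetical stand-in for the preset intent recognition model."""
    if "余额" in text:
        return "check_balance"
    if "转账" in text:
        return "transfer"
    return "other"


def collect_candidates(sentence: str, classify: Callable[[str], str]) -> list[str]:
    """Delete one character at a time and collect sentences whose intent changes."""
    first_intent = classify(sentence)   # first intent category
    current = sentence                  # the current "initial" sentence
    candidates: list[str] = []
    position = 0
    while position < len(current):      # assumed exit condition
        auxiliary = current[:position] + current[position + 1:]
        if not auxiliary:               # nothing left to classify
            break
        second_intent = classify(auxiliary)  # second intent category
        if second_intent == first_intent:
            # Same intent: the auxiliary sentence becomes the new initial one,
            # so the same index now points at the next character.
            current = auxiliary
        else:
            # Different intent: keep the initial sentence, record a candidate.
            candidates.append(auxiliary)
            position += 1
    return candidates
```

With the toy classifier, `collect_candidates("查余额", toy_classify)` first drops "查" without changing the intent, then records the two shorter strings whose intent differs from that of the original sentence.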
Referring to FIG. 5, a fifth embodiment of the intent recognition model updating method in the embodiments of the present application includes:
501. Obtain an original dialogue corpus, and identify the first intent category of each sentence in the original dialogue corpus by means of a preset intent recognition model;
502. Initialize a mask list corresponding to the original dialogue corpus, and adjust a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list;
503. Based on the adjusted mask list, construct an auxiliary dialogue corpus corresponding to the original dialogue corpus, and identify the second intent category of each sentence in the auxiliary dialogue corpus by means of the intent recognition model;
504. Perform degree-of-difference detection on the first intent category and the second intent category to obtain a first detection result, and, according to the first detection result, determine in turn whether each first intent category and its corresponding second intent category are the same;
505. If they are not the same, determine that the sentence corresponding to the second intent category that differs from the first intent category satisfies the preset first difference condition, and take it as candidate out-of-set corpus;
506. Identify the third intent category of each sentence in the candidate out-of-set corpus by means of the intent recognition model;
507. Perform degree-of-difference detection on the first intent category and the third intent category to obtain a second detection result, and, according to the second detection result, determine whether the degree of difference between each first intent category and its corresponding third intent category is greater than a preset difference threshold;
In this embodiment, degree-of-difference detection on the first and third intent categories yields the second detection result. According to the second detection result, each first intent category of the original dialogue corpus is compared with the corresponding third intent category of the candidate out-of-set corpus to determine whether the degree of difference between the intent categories is greater than the preset difference threshold.
508. If it is greater, determine that the sentence corresponding to the third intent category whose degree of difference exceeds the preset difference threshold satisfies the second difference condition, and take it as final out-of-set corpus;
In this embodiment, if the degree of difference between the intent categories of the two corresponding sentences is judged to be greater than the preset difference threshold, the sentence corresponding to the third intent category is determined to satisfy the second difference condition and is taken as final out-of-set corpus. Here, the preset second difference condition uses the label and probability obtained by intent recognition on the original dialogue corpus, together with the label and probability obtained by intent recognition on the candidate out-of-set corpus, to judge whether the two intent categories differ and whether the preset difference threshold is exceeded. The threshold is generally set to 80%, so that the difference detection result identifies sentences whose intent labels differ and whose probability value is large.
In practical applications, the human-machine dialogue system uses the second detection result to compare the first intent category, obtained by intent recognition on each sentence of the original dialogue corpus, with the third intent category, obtained by intent recognition on the corresponding processed sentence in the candidate out-of-set corpus. It judges whether the difference between the candidate's intent category and that of the corresponding original sentence exceeds the preset difference threshold, that is, whether the two intent categories differ and the corresponding probability is above the threshold. If so, the candidate out-of-set sentences corresponding to these differing third intent categories are selected as the final out-of-set corpus for model training.
509. Label the final out-of-set corpus with the out-of-set intent, and train the intent recognition model on the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
In this embodiment of the present application, degree-of-difference detection is performed between the third intent category, obtained by intent recognition on the candidate out-of-set corpus, and the first intent category, obtained from the original dialogue corpus. Based on the resulting second detection result, sentences satisfying the preset second difference condition are selected from the candidate out-of-set corpus as the final out-of-set corpus. Compared with the prior art, this second round of selection on the candidate out-of-set corpus avoids the problem that out-of-set corpus built by conventional data-augmentation synthesis may be entangled with the normal training corpus. Using the mask list together with the existing intent recognition model ensures that the generated corpus is genuinely out-of-set, so that the trained intent recognition model gains the ability to reject inputs while the impact on recognition of normal corpus is reduced.
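Steps 507 and 508 can be illustrated with a small filter. This is a sketch under stated assumptions rather than the claimed implementation: the source does not say exactly which probability is compared against the threshold, so the code assumes the probability of the third intent category; `Prediction` and `select_final_out_of_set` are hypothetical names, and the 0.8 default follows the 80% figure mentioned in the text.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    label: str          # intent label returned by the model
    probability: float  # probability assigned to that label


def select_final_out_of_set(
    candidates: list[str],
    first_predictions: list[Prediction],
    third_predictions: list[Prediction],
    threshold: float = 0.8,  # "generally set to 80%" in the description
) -> list[str]:
    """Keep candidates whose label differs from the original sentence's label
    and whose own probability exceeds the threshold (assumed interpretation)."""
    final = []
    for sentence, first, third in zip(candidates, first_predictions, third_predictions):
        if third.label != first.label and third.probability > threshold:
            final.append(sentence)
    return final
```

Candidates that keep the original label, or that change label only with low confidence, are discarded; the remainder would be labelled with the out-of-set intent before retraining.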
The intent recognition model updating method in the embodiments of the present application has been described above; the intent recognition model updating apparatus is described below. Referring to FIG. 6, one embodiment of the intent recognition model updating apparatus includes: a corpus acquisition module 601, configured to obtain an original dialogue corpus and identify the first intent category of each sentence in the original dialogue corpus by means of a preset intent recognition model; a mask construction module 602, configured to initialize a mask list corresponding to the original dialogue corpus and adjust a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list; a second intent module 603, configured to construct, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus and identify the second intent category of each sentence in the auxiliary dialogue corpus by means of the intent recognition model; a final out-of-set module 604, configured to perform degree-of-difference detection on the first intent category and the second intent category to obtain a first detection result and, based on the first detection result, select from the auxiliary dialogue corpus the sentences satisfying the preset difference condition as the final out-of-set corpus; and a corpus training module 605, configured to label the final out-of-set corpus with the out-of-set intent and train the intent recognition model on the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
In this embodiment of the present application, whereas prior-art data augmentation relies on random insertion, deletion, swapping, and similar operations, the present application uses a mask list to process the sentences of the original dialogue corpus: a mask list is constructed and its values are replaced to obtain candidate out-of-set corpus that satisfies the first difference condition. Because the corpus is processed according to a preset rule rather than at random, the generated candidates fit the out-of-set intent category more closely. The method dispenses with the additional comparison experiments needed to implement rejection by setting a confidence threshold and shortens the cycle for bringing the intent recognition model online and optimizing it. The trained model gains the ability to reject inputs while the impact on recognition of normal corpus is reduced, which improves the accuracy of the intent recognition model update.
Referring to FIG. 7, another embodiment of the intent recognition model updating apparatus in the embodiments of the present application includes: a corpus acquisition module 601, configured to obtain an original dialogue corpus and identify the first intent category of each sentence in the original dialogue corpus by means of a preset intent recognition model; a mask construction module 602, configured to initialize a mask list corresponding to the original dialogue corpus and adjust a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list; a second intent module 603, configured to construct, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus and identify the second intent category of each sentence in the auxiliary dialogue corpus by means of the intent recognition model; a final out-of-set module 604, configured to perform degree-of-difference detection on the first intent category and the second intent category to obtain a first detection result and, based on the first detection result, select from the auxiliary dialogue corpus the sentences satisfying the preset difference condition as the final out-of-set corpus; and a corpus training module 605, configured to label the final out-of-set corpus with the out-of-set intent and train the intent recognition model on the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
Specifically, the mask construction module 602 includes: a character calculation unit 6021, configured to split the original dialogue corpus into multiple sentences and calculate the string length of each sentence; a mask combination unit 6022, configured to combine, for each sentence, as many copies of a preset first element value as the sentence's string length into the mask corresponding to that sentence; and a corpus correspondence unit 6023, configured to construct from these masks the mask list corresponding to the original dialogue corpus.
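A minimal sketch of this initialization, assuming that the preset first element value is `1` and that sentences are split on common Chinese and English sentence-ending punctuation (neither choice is fixed by the source; the function names are illustrative):

```python
import re

FIRST_VALUE = 1  # assumed "preset first element value": the character is kept


def split_sentences(corpus: str) -> list[str]:
    """Split a corpus on sentence-ending punctuation and drop empty pieces."""
    parts = re.split(r"[。!?!?\n]+", corpus)
    return [p.strip() for p in parts if p.strip()]


def init_mask_list(corpus: str) -> list[list[int]]:
    """Build one mask per sentence, filled with FIRST_VALUE and as long as the sentence."""
    return [[FIRST_VALUE] * len(sentence) for sentence in split_sentences(corpus)]
```

Each mask has one element per character, so later steps can switch individual positions on or off without touching the sentence itself.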
Specifically, the mask construction module 602 further includes: an element selection unit 6024, configured to determine, according to the preset selection rule, the adjustment position for each mask in the mask list; and an element replacement unit 6025, configured to replace the first element value at the adjustment position with a preset second element value to obtain the adjusted mask list.
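The replacement step might look as follows. The preset selection rule is not spelled out in this passage, so the adjustment positions are simply passed in; the second element value `0` and both function names are assumptions made for illustration.

```python
SECOND_VALUE = 0  # assumed "preset second element value": the character is dropped


def adjust_mask(mask: list[int], position: int) -> list[int]:
    """Return a copy of `mask` with the element at `position` set to SECOND_VALUE."""
    adjusted = list(mask)
    adjusted[position] = SECOND_VALUE
    return adjusted


def adjust_mask_list(mask_list: list[list[int]], positions: list[int]) -> list[list[int]]:
    """Apply one adjustment position per mask (positions come from the selection rule)."""
    return [adjust_mask(mask, pos) for mask, pos in zip(mask_list, positions)]
```

Returning copies rather than mutating in place keeps the previous mask available if the adjustment has to be discarded in a later round.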
Specifically, the second intent module 603 includes: a word selection unit 6031, configured to determine the positions of the first element value in each mask of the adjusted mask list, and to select from each sentence of the original corpus the characters at those positions; a sequential combination unit 6032, configured to combine the characters selected for each sentence in the order of the first-element-value positions to obtain a corresponding new sentence; and a sentence splicing unit 6033, configured to splice the new sentences together to obtain the auxiliary dialogue corpus corresponding to the original dialogue corpus.
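As an illustration of how a mask turns an original sentence into an auxiliary one, the sketch below keeps the characters whose mask element equals the first element value (assumed here to be `1`) and joins the resulting sentences with a separator. The separator and the function names are assumptions, not details taken from the source.

```python
def apply_mask(sentence: str, mask: list[int], keep_value: int = 1) -> str:
    """Keep, in order, the characters whose mask element equals `keep_value`."""
    if len(sentence) != len(mask):
        raise ValueError("mask and sentence must have the same length")
    return "".join(ch for ch, m in zip(sentence, mask) if m == keep_value)


def build_auxiliary_corpus(
    sentences: list[str], mask_list: list[list[int]], separator: str = "。"
) -> str:
    """Apply each mask to its sentence and splice the new sentences together."""
    return separator.join(apply_mask(s, m) for s, m in zip(sentences, mask_list))
```

For example, applying the mask `[1, 1, 0, 0]` to "查询余额" keeps only the first two characters.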
Specifically, the final out-of-set module 604 is further configured to: if the first detection result is that the first intent category and the second intent category are the same, take the auxiliary dialogue corpus as the initial dialogue corpus; if the first detection result is that the first intent category and the second intent category differ, take the original dialogue corpus as the initial dialogue corpus; and perform the next round of mask-list value transformation, intent recognition, and degree-of-difference detection on the initial dialogue corpus, stopping when the initial dialogue corpus satisfies a preset exit condition, to obtain a new first detection result.
Specifically, the final out-of-set module 604 includes: a difference judging unit 6041, configured to determine in turn, according to the first detection result, whether each first intent category and its corresponding second intent category are the same; a candidate selection unit 6042, configured to determine, if they are not the same, that the sentence corresponding to the second intent category that differs from the first intent category satisfies the preset first difference condition, and to take it as candidate out-of-set corpus; a candidate recognition unit 6043, configured to identify the third intent category of each sentence in the candidate out-of-set corpus by means of the intent recognition model; and a final selection unit 6044, configured to perform degree-of-difference detection on the first intent category and the third intent category to obtain a second detection result and, according to the second detection result, select from the candidate out-of-set corpus the sentences satisfying the preset second difference condition as the final out-of-set corpus.
Specifically, the final selection unit 6044 is configured to: determine, according to the second detection result, whether the degree of difference between each first intent category and its corresponding third intent category is greater than a preset difference threshold; and, if it is greater, determine that the sentence corresponding to the third intent category whose degree of difference exceeds the preset difference threshold satisfies the second difference condition, and take it as final out-of-set corpus.
Building on the previous embodiment, this embodiment describes in detail the specific functions of each module and the unit composition of some of the modules. With this apparatus, the original dialogue corpus is processed with mask elements to obtain a mask list, and element replacement, intent recognition, and difference detection are repeated until the words in every sentence of the original dialogue corpus have been processed, yielding the candidate out-of-set corpus. Intent recognition and a difference-condition check on the candidate out-of-set corpus then yield the final out-of-set corpus, which is used together with the original dialogue corpus in ordinary machine-learning training to obtain an intent recognition model with a rejection capability. This not only speeds up model training but also avoids corpus entanglement, giving an intent recognition model that recognizes the intent of normal corpus more efficiently.
FIG. 6 and FIG. 7 above describe the intent recognition model updating apparatus in the embodiments of the present application in detail from the perspective of modular functional entities. The intent recognition model updating device in the embodiments of the present application is described in detail below from the perspective of hardware processing.
FIG. 8 is a schematic structural diagram of an intent recognition model updating device provided by an embodiment of the present application. The intent recognition model updating device 800 may vary considerably with configuration or performance, and may include one or more central processing units (CPU) 810 (for example, one or more processors), a memory 820, and one or more storage media 830 (for example, one or more mass storage devices) storing application programs 833 or data 832. The memory 820 and the storage medium 830 may provide transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown in the figure), each of which may include a series of instruction operations for the intent recognition model updating device 800. Further, the processor 810 may be configured to communicate with the storage medium 830 and to execute, on the intent recognition model updating device 800, the series of instruction operations in the storage medium 830.
The intent recognition model updating device 800 may also include one or more power supplies 840, one or more wired or wireless network interfaces 850, one or more input/output interfaces 860, and/or one or more operating systems 831, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will understand that the device structure shown in FIG. 8 does not limit the intent recognition model updating device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present application also provides an intent recognition model updating device. The device includes a memory and a processor; the memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the intent recognition model updating method in the above embodiments.
The present application also provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to perform the steps of the intent recognition model updating method.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The present application may be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices. The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A method for updating an intent recognition model, wherein the method comprises:
    obtaining an original dialogue corpus, and identifying a first intent category of each sentence in the original dialogue corpus by means of a preset intent recognition model;
    initializing a mask list corresponding to the original dialogue corpus, and adjusting a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list;
    constructing, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus, and identifying a second intent category of each sentence in the auxiliary dialogue corpus by means of the intent recognition model;
    performing degree-of-difference detection on the first intent category and the second intent category to obtain a first detection result, and, based on the first detection result, selecting from the auxiliary dialogue corpus sentences that satisfy a preset difference condition as a final out-of-set corpus;
    labelling the final out-of-set corpus with an out-of-set intent, and training the intent recognition model with the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
  2. The method for updating an intent recognition model according to claim 1, wherein the initializing a mask list corresponding to the original dialogue corpus comprises:
    splitting the original dialogue corpus into multiple sentences, and calculating the string length of each sentence of the original dialogue corpus;
    combining, for each sentence, as many copies of a preset first element value as the string length of that sentence into a mask corresponding to the sentence;
    constructing the mask list corresponding to the original dialogue corpus from the masks.
  3. The method for updating an intent recognition model according to claim 2, wherein the adjusting a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list comprises:
    determining, according to the preset selection rule, an adjustment position for each mask in the mask list;
    replacing the first element value at the adjustment position with a preset second element value to obtain the adjusted mask list.
  4. The intent recognition model updating method according to claim 3, wherein constructing, based on the adjusted mask list, the auxiliary dialogue corpus corresponding to the original dialogue corpus comprises:
    determining the positions of the first element values in each mask of the numerically transformed mask list, and selecting, from each sentence of the original corpus, the single characters located at the same positions as the first element values;
    combining, for each sentence, the selected characters in the order of the first element value positions to obtain a corresponding new sentence;
    concatenating the new sentences to obtain the auxiliary dialogue corpus corresponding to the original dialogue corpus.
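The construction in claim 4 amounts to keeping the characters whose mask entry still holds the first element value. A sketch, assuming that value is `1` and using the hypothetical name `build_auxiliary_corpus`:

```python
FIRST_VALUE = 1  # assumed "preset first element value"


def build_auxiliary_corpus(sentences, masks):
    """Return the new sentences and their concatenation."""
    new_sentences = []
    for sentence, mask in zip(sentences, masks):
        # Keep characters at first-value positions, preserving their order
        kept = [ch for ch, v in zip(sentence, mask) if v == FIRST_VALUE]
        new_sentences.append("".join(kept))
    return new_sentences, "".join(new_sentences)
```

Returning both the per-sentence list and the joined string keeps the sentence boundaries available for the later per-sentence intent recognition.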
  5. The intent recognition model updating method according to claim 1, further comprising, after performing difference-degree detection on the first intent category and the second intent category to obtain the first detection result:
    if the first detection result indicates that the first intent category and the second intent category are the same, taking the auxiliary dialogue corpus as an initial dialogue corpus;
    if the first detection result indicates that the first intent category and the second intent category are different, taking the original dialogue corpus as the initial dialogue corpus;
    performing, on the initial dialogue corpus, a next round of the corresponding mask-list numerical transformation, intent recognition, and difference-degree detection, stopping when the initial dialogue corpus satisfies a preset exit condition, to obtain a new first detection result.
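One possible reading of the rounds in claim 5, sketched per sentence. The exit condition is not specified in the claim, so a round limit (`max_rounds`) is assumed here, and `classify` and `pick_position` are hypothetical callables standing in for the intent recognition model and the selection rule.

```python
def iterate_rounds(sentence, classify, pick_position, max_rounds=10):
    """Repeatedly drop one character and compare intents with the first pass."""
    first_intent = classify(sentence)
    current = sentence   # text that seeds the next round
    mismatches = []      # shortened texts whose intent changed
    for round_index in range(max_rounds):  # assumed exit condition: round limit
        if len(current) <= 1:
            break
        pos = pick_position(current, round_index)
        candidate = current[:pos] + current[pos + 1:]
        if classify(candidate) == first_intent:
            current = candidate           # same intent: continue from the shortened text
        else:
            mismatches.append(candidate)  # different intent: retry from the unshortened text
    return current, mismatches
```

Under this reading, `current` shrinks only while the model's prediction is stable, and `mismatches` collects the shortened texts that changed the prediction.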
  6. The intent recognition model updating method according to claim 1, wherein the difference condition comprises a first difference condition and a second difference condition, and selecting, based on the first detection result, sentences satisfying the preset difference condition from the auxiliary dialogue corpus as the final out-of-set corpus comprises:
    determining in turn, according to the first detection result, whether each first intent category is the same as its corresponding second intent category;
    if not, determining that the sentence corresponding to the second intent category that differs from the first intent category satisfies the preset first difference condition, and taking that sentence as candidate out-of-set corpus;
    recognizing, by the intent recognition model, a third intent category of each sentence in the candidate out-of-set corpus;
    performing difference-degree detection on the first intent category and the third intent category to obtain a second detection result, and selecting, according to the second detection result, sentences satisfying the preset second difference condition from the candidate out-of-set corpus as the final out-of-set corpus.
  7. The intent recognition model updating method according to claim 6, wherein selecting, according to the second detection result, sentences satisfying the preset second difference condition from the candidate out-of-set corpus as the final out-of-set corpus comprises:
    determining, according to the second detection result, whether the degree of difference between each first intent category and its corresponding third intent category is greater than a preset difference-degree threshold;
    if so, determining that the sentence corresponding to the third intent category whose degree of difference exceeds the preset difference-degree threshold satisfies the second difference condition, and taking that sentence as the final out-of-set corpus.
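The two-stage selection of claims 6 and 7 can be sketched as below. The claims do not define how the degree of difference is measured, so this sketch assumes the model exposes a probability per intent (the hypothetical `predict_proba`) and takes the difference to be one minus the probability the third pass assigns to the first intent category. The threshold value is likewise an assumption.

```python
def select_out_of_set(triples, predict_proba, threshold=0.5):
    """triples: (first_intent, second_intent, auxiliary_sentence) per sentence."""
    # Stage 1 (first difference condition): the predicted category changed
    candidates = [(first, s) for first, second, s in triples if second != first]

    final = []
    for first, sentence in candidates:
        probs = predict_proba(sentence)  # third recognition pass
        # Assumed difference measure: how little probability remains on the first intent
        difference = 1.0 - probs.get(first, 0.0)
        # Stage 2 (second difference condition): difference exceeds the threshold
        if difference > threshold:
            final.append(sentence)
    return final
```

The second stage discards candidates that changed label only narrowly, which is presumably why the claims separate the two conditions.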
  8. An intent recognition model updating device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    obtaining an original dialogue corpus, and recognizing, by a preset intent recognition model, a first intent category of each sentence in the original dialogue corpus;
    initializing a mask list corresponding to the original dialogue corpus, and adjusting a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list;
    constructing, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus, and recognizing, by the intent recognition model, a second intent category of each sentence in the auxiliary dialogue corpus;
    performing difference-degree detection on the first intent category and the second intent category to obtain a first detection result, and selecting, based on the first detection result, sentences satisfying a preset difference condition from the auxiliary dialogue corpus as a final out-of-set corpus;
    labeling the final out-of-set corpus as out-of-set intent, and training the intent recognition model with the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
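The final labeling and retraining step can be sketched as assembling one training set. The label string `out_of_set`, the pairing of original sentences with their existing intent labels, and the `train` callable are all assumptions for illustration; the claim names none of them.

```python
OUT_OF_SET_LABEL = "out_of_set"  # hypothetical label name for the out-of-set intent


def build_training_set(original, final_out_of_set):
    """original: (sentence, intent) pairs; final_out_of_set: selected sentences."""
    # Every selected sentence receives the single out-of-set label
    labeled = [(s, OUT_OF_SET_LABEL) for s in final_out_of_set if s]
    return list(original) + labeled


def update_model(train, original, final_out_of_set):
    # `train` is a hypothetical callable that fits a model on (text, label) pairs
    return train(build_training_set(original, final_out_of_set))
```

Empty strings are skipped because a sentence whose characters were all removed carries no training signal.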
  9. The intent recognition model updating device according to claim 8, wherein initializing the mask list corresponding to the original dialogue corpus comprises:
    splitting the original dialogue corpus into sentences to obtain a plurality of sentences, and calculating the string length of each sentence of the original dialogue corpus;
    combining, for each sentence, a number of preset first element values equal to that sentence's string length into a mask corresponding to the sentence;
    constructing, from the masks, the mask list corresponding to the original dialogue corpus.
  10. The intent recognition model updating device according to claim 9, wherein adjusting a group of element values in the mask list according to the preset selection rule to obtain the adjusted mask list comprises:
    determining, according to the preset selection rule, the adjustment position corresponding to each mask in the mask list;
    replacing the first element value at each adjustment position with a preset second element value to obtain the adjusted mask list.
  11. The intent recognition model updating device according to claim 10, wherein constructing, based on the adjusted mask list, the auxiliary dialogue corpus corresponding to the original dialogue corpus comprises:
    determining the positions of the first element values in each mask of the numerically transformed mask list, and selecting, from each sentence of the original corpus, the single characters located at the same positions as the first element values;
    combining, for each sentence, the selected characters in the order of the first element value positions to obtain a corresponding new sentence;
    concatenating the new sentences to obtain the auxiliary dialogue corpus corresponding to the original dialogue corpus.
  12. The intent recognition model updating device according to claim 8, further comprising, after performing difference-degree detection on the first intent category and the second intent category to obtain the first detection result:
    if the first detection result indicates that the first intent category and the second intent category are the same, taking the auxiliary dialogue corpus as an initial dialogue corpus;
    if the first detection result indicates that the first intent category and the second intent category are different, taking the original dialogue corpus as the initial dialogue corpus;
    performing, on the initial dialogue corpus, a next round of the corresponding mask-list numerical transformation, intent recognition, and difference-degree detection, stopping when the initial dialogue corpus satisfies a preset exit condition, to obtain a new first detection result.
  13. The intent recognition model updating device according to claim 8, wherein the difference condition comprises a first difference condition and a second difference condition, and selecting, based on the first detection result, sentences satisfying the preset difference condition from the auxiliary dialogue corpus as the final out-of-set corpus comprises:
    determining in turn, according to the first detection result, whether each first intent category is the same as its corresponding second intent category;
    if not, determining that the sentence corresponding to the second intent category that differs from the first intent category satisfies the preset first difference condition, and taking that sentence as candidate out-of-set corpus;
    recognizing, by the intent recognition model, a third intent category of each sentence in the candidate out-of-set corpus;
    performing difference-degree detection on the first intent category and the third intent category to obtain a second detection result, and selecting, according to the second detection result, sentences satisfying the preset second difference condition from the candidate out-of-set corpus as the final out-of-set corpus.
  14. The intent recognition model updating device according to claim 13, wherein selecting, according to the second detection result, sentences satisfying the preset second difference condition from the candidate out-of-set corpus as the final out-of-set corpus comprises:
    determining, according to the second detection result, whether the degree of difference between each first intent category and its corresponding third intent category is greater than a preset difference-degree threshold;
    if so, determining that the sentence corresponding to the third intent category whose degree of difference exceeds the preset difference-degree threshold satisfies the second difference condition, and taking that sentence as the final out-of-set corpus.
  15. A computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
    obtaining an original dialogue corpus, and recognizing, by a preset intent recognition model, a first intent category of each sentence in the original dialogue corpus;
    initializing a mask list corresponding to the original dialogue corpus, and adjusting a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list;
    constructing, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus, and recognizing, by the intent recognition model, a second intent category of each sentence in the auxiliary dialogue corpus;
    performing difference-degree detection on the first intent category and the second intent category to obtain a first detection result, and selecting, based on the first detection result, sentences satisfying a preset difference condition from the auxiliary dialogue corpus as a final out-of-set corpus;
    labeling the final out-of-set corpus as out-of-set intent, and training the intent recognition model with the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
  16. The computer-readable storage medium according to claim 15, wherein initializing the mask list corresponding to the original dialogue corpus comprises:
    splitting the original dialogue corpus into sentences to obtain a plurality of sentences, and calculating the string length of each sentence of the original dialogue corpus;
    combining, for each sentence, a number of preset first element values equal to that sentence's string length into a mask corresponding to the sentence;
    constructing, from the masks, the mask list corresponding to the original dialogue corpus.
  17. The computer-readable storage medium according to claim 16, wherein adjusting a group of element values in the mask list according to the preset selection rule to obtain the adjusted mask list comprises:
    determining, according to the preset selection rule, the adjustment position corresponding to each mask in the mask list;
    replacing the first element value at each adjustment position with a preset second element value to obtain the adjusted mask list.
  18. The computer-readable storage medium according to claim 17, wherein constructing, based on the adjusted mask list, the auxiliary dialogue corpus corresponding to the original dialogue corpus comprises:
    determining the positions of the first element values in each mask of the numerically transformed mask list, and selecting, from each sentence of the original corpus, the single characters located at the same positions as the first element values;
    combining, for each sentence, the selected characters in the order of the first element value positions to obtain a corresponding new sentence;
    concatenating the new sentences to obtain the auxiliary dialogue corpus corresponding to the original dialogue corpus.
  19. The computer-readable storage medium according to claim 15, further comprising, after performing difference-degree detection on the first intent category and the second intent category to obtain the first detection result:
    if the first detection result indicates that the first intent category and the second intent category are the same, taking the auxiliary dialogue corpus as an initial dialogue corpus;
    if the first detection result indicates that the first intent category and the second intent category are different, taking the original dialogue corpus as the initial dialogue corpus;
    performing, on the initial dialogue corpus, a next round of the corresponding mask-list numerical transformation, intent recognition, and difference-degree detection, stopping when the initial dialogue corpus satisfies a preset exit condition, to obtain a new first detection result.
  20. An intent recognition model updating apparatus, comprising:
    a corpus acquisition module, configured to obtain an original dialogue corpus and to recognize, by a preset intent recognition model, a first intent category of each sentence in the original dialogue corpus;
    a mask construction module, configured to initialize a mask list corresponding to the original dialogue corpus and to adjust a group of element values in the mask list according to a preset selection rule to obtain an adjusted mask list;
    a second intent module, configured to construct, based on the adjusted mask list, an auxiliary dialogue corpus corresponding to the original dialogue corpus, and to recognize, by the intent recognition model, a second intent category of each sentence in the auxiliary dialogue corpus;
    a final out-of-set module, configured to perform difference-degree detection on the first intent category and the second intent category to obtain a first detection result, and to select, based on the first detection result, sentences satisfying a preset difference condition from the auxiliary dialogue corpus as a final out-of-set corpus;
    a corpus training module, configured to label the final out-of-set corpus as out-of-set intent, and to train the intent recognition model with the original dialogue corpus and the final out-of-set corpus to obtain a new intent recognition model.
PCT/CN2022/071694 2021-09-18 2022-01-13 Method, apparatus, and device for updating intent recognition model, and readable medium WO2023040153A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111095912.8A CN113792540B (en) 2021-09-18 2021-09-18 Method for updating intention recognition model and related equipment
CN202111095912.8 2021-09-18

Publications (1)

Publication Number Publication Date
WO2023040153A1 (en) 2023-03-23

Family

ID=78878897

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071694 WO2023040153A1 (en) 2021-09-18 2022-01-13 Method, apparatus, and device for updating intent recognition model, and readable medium

Country Status (2)

Country Link
CN (1) CN113792540B (en)
WO (1) WO2023040153A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792540B (en) * 2021-09-18 2024-03-22 平安科技(深圳)有限公司 Method for updating intention recognition model and related equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180232435A1 (en) * 2017-02-13 2018-08-16 Kabushiki Kaisha Toshiba Dialogue system, a dialogue method and a method of adapting a dialogue system
CN110070852A (en) * 2019-04-26 2019-07-30 平安科技(深圳)有限公司 Synthesize method, apparatus, equipment and the storage medium of Chinese speech
CN111611366A (en) * 2020-05-20 2020-09-01 北京百度网讯科技有限公司 Intention recognition optimization processing method, device, equipment and storage medium
CN112131890A (en) * 2020-09-15 2020-12-25 北京慧辰资道资讯股份有限公司 Method, device and equipment for constructing intelligent recognition model of conversation intention
CN112417127A (en) * 2020-12-02 2021-02-26 网易(杭州)网络有限公司 Method, device, equipment and medium for training conversation model and generating conversation
CN112686051A (en) * 2020-12-22 2021-04-20 科大讯飞股份有限公司 Semantic recognition model training method, recognition method, electronic device, and storage medium
CN113792540A (en) * 2021-09-18 2021-12-14 平安科技(深圳)有限公司 Intention recognition model updating method and related equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800306B (en) * 2019-01-10 2023-10-17 深圳Tcl新技术有限公司 Intention analysis method, device, display terminal and computer readable storage medium
CN111552821B (en) * 2020-05-14 2022-03-01 北京华宇元典信息服务有限公司 Legal intention searching method, legal intention searching device and electronic equipment

Also Published As

Publication number Publication date
CN113792540B (en) 2024-03-22
CN113792540A (en) 2021-12-14

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22868534

Country of ref document: EP

Kind code of ref document: A1