CN114706943A - Intention recognition method, apparatus, device and medium - Google Patents

Intention recognition method, apparatus, device and medium

Info

Publication number
CN114706943A
Authority
CN
China
Prior art keywords
sample
training
character
characters
compressed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210262775.0A
Other languages
Chinese (zh)
Inventor
汪硕芃
张林箭
宋有伟
张聪
吕唐杰
范长杰
胡志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202210262775.0A priority Critical patent/CN114706943A/en
Publication of CN114706943A publication Critical patent/CN114706943A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/3329Natural language query formulation or dialogue systems
    • G06F16/3344Query execution using natural language analysis
    • G06F16/35Clustering; Classification
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Character Discrimination (AREA)

Abstract

The application discloses an intention recognition method, apparatus, device and medium, wherein the method comprises the following steps: acquiring a training sample set, wherein the training sample set comprises a plurality of training samples, each training sample is a character combination abstracting an intention, and each training sample at least comprises attention characters in an ordered arrangement; when the total number of deduplicated characters in the training sample set is greater than or equal to a first threshold, compressing each training sample according to the occurrence counts of its characters to obtain a compressed sample set; and inputting the compressed sample set into a pre-constructed mask language model for training, and outputting an intention recognition result, wherein the mask language model fills each input compressed sample into a preset position in a language prompt template to be trained, and is then trained. By applying a uniform preprocessing transformation to the training samples, the embodiments of the application effectively improve the convergence rate of the mask language model.

Description

Intention recognition method, apparatus, device and medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a medium for intent recognition.
Background
Intention recognition refers to the classification of a user's search requirements. Its applications include search engines, dialogue systems, the intelligent internet of things, robots, and the like. In these fields, the search information input by the user may be non-standard, diverse in input mode, or even expressed in non-standard natural language. Training an intention recognition model therefore typically requires learning from a large number of samples.
In a dialogue system, the intention recognition task can be regarded as a typical text classification task: given a user input, it determines whether the input belongs to a preset intention category. Classifying user-input text requires a large amount of training data to achieve a good effect. However, when a dialogue system creates a new dialogue task, a large amount of standard data is not yet available, and each intention often has only a few or a few dozen samples. In this situation, building an intention classification model from a small number of samples has become a new direction in the development of natural language processing technology.
Disclosure of Invention
In view of the above-mentioned drawbacks or deficiencies in the prior art, it is desirable to provide a method, an apparatus, a device, and a medium for intention recognition to solve the problems of slow convergence speed and poor prediction effect of the pre-training process of the existing language model.
In a first aspect, an embodiment of the present invention provides an intent recognition method, where the method includes:
acquiring a training sample set, wherein the training sample set comprises a plurality of training samples, each training sample is a character combination of an abstract intention, and the training samples at least comprise attention characters in ordered arrangement;
when the total number of characters in the de-duplicated training sample set is greater than or equal to a first threshold value, compressing each training sample according to the number of times of occurrence of the characters to obtain a compressed sample set;
and inputting the compressed sample set into a pre-constructed mask language model for training, and outputting to obtain an intention recognition result, wherein the mask language model is used for filling the input compressed sample into a preset position in a language prompt template to be trained and then training.
Optionally, compressing each training sample according to the occurrence counts of the characters to obtain a compressed sample set includes: acquiring a word frequency list corresponding to the training sample set, wherein the word frequency list comprises the attention characters contained in the training sample set and the occurrence count of each attention character in the training sample set; performing character deduplication processing on each training sample in the training sample set to obtain a deduplicated sample set, wherein the deduplicated sample set comprises a plurality of deduplicated samples, and the deduplicated samples correspond to the training samples one to one; and compressing each deduplicated sample in the deduplicated sample set according to the word frequency list to obtain a compressed sample set, wherein the compressed sample set comprises a plurality of compressed samples, and the compressed samples correspond to the deduplicated samples one to one.
Optionally, performing character deduplication processing on each training sample in the training sample set to obtain a deduplication sample set, including: for each training sample, when the training sample is determined to contain the non-attention character, replacing the non-attention character with a preset character to obtain a replacement sample corresponding to the training sample; carrying out character de-duplication processing on the replacement sample corresponding to the training sample to obtain a de-duplication sample corresponding to the replacement sample; adding the de-duplicated samples corresponding to the replacement samples to the set of de-duplicated samples; when the training sample is determined not to contain the non-concerned characters, carrying out character de-duplication processing on the training sample to obtain a de-duplication sample corresponding to the training sample; the de-weight samples corresponding to the training samples are added to the set of de-weight samples.
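The replacement and deduplication steps above can be sketched as follows; the function name, the placeholder character `"_"`, and the head-placement of the preset character are illustrative assumptions drawn from the optional steps in this section, not an implementation fixed by the description:

```python
def dedup_sample(sample, attention, placeholder="_"):
    """Replace non-attention characters with a preset placeholder character,
    then drop repeated characters while preserving first-occurrence order."""
    replaced = "".join(c if c in attention else placeholder for c in sample)
    seen, out = set(), []
    for c in replaced:
        if c not in seen:       # keep only the first occurrence of each character
            seen.add(c)
            out.append(c)
    result = "".join(out)
    # per a later optional step, a contained preset character is moved to the head
    if placeholder in result:
        result = placeholder + result.replace(placeholder, "")
    return result
```

Applied to the earlier example, `dedup_sample("(某地)询问询问", set("某地询问"))` maps both brackets to one placeholder and collapses the repeated characters.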
Optionally, compressing each deduplicated sample in the deduplicated sample set according to the word frequency list to obtain a compressed sample set includes: determining, for each deduplicated sample, the character length of the deduplicated sample; when the character length is greater than or equal to a second threshold, screening the deduplicated sample according to the word frequency list to obtain a screened sample corresponding to the deduplicated sample; compressing the screened sample corresponding to the deduplicated sample to obtain a compressed sample corresponding to the screened sample; and adding the compressed sample corresponding to the screened sample to the compressed sample set.
Optionally, the method further comprises: for each deduplicated sample, when its character length is less than the second threshold, determining the deduplicated sample to be a compressed sample.
Optionally, screening the deduplicated sample according to the word frequency list to obtain a screened sample corresponding to the deduplicated sample includes: determining a candidate character set starting from the first attention character of the word frequency list according to the list's order, wherein the candidate character set comprises a consecutive run of a first-threshold number of attention characters, and the word frequency list is sorted from high to low by the occurrence count of each attention character in the training sample set; and screening the attention characters contained in the deduplicated sample against the candidate character set to obtain a screened sample corresponding to the deduplicated sample.
Optionally, compressing the screened sample corresponding to the deduplicated sample to obtain a compressed sample corresponding to the screened sample includes: performing sliding-window processing on the screened sample with an interception window to obtain an intercepted sample corresponding to the screened sample, wherein the size of the interception window is the second threshold, the step length of each slide of the interception window is a preset number of characters, and each intercepted sample comprises a second-threshold number of characters; when it is determined that the intercepted sample belongs to the compressed samples already contained in the compressed sample set, sliding the interception window by the step length and returning to the sliding-window step, until an intercepted sample is obtained that does not belong to the compressed samples already contained in the compressed sample set; and when it is determined that the intercepted sample does not belong to the compressed samples already contained in the compressed sample set, determining the intercepted sample as the compressed sample corresponding to the screened sample.
Optionally, performing sliding-window processing on the screened sample with the interception window to obtain the intercepted sample corresponding to the screened sample includes: when it is determined that the number of characters of the screened sample covered by the interception window is smaller than the size of the interception window, appending, following the character order of the screened sample, a difference number of completion characters to the tail of the covered characters, and extracting the covered characters together with the appended completion characters as the intercepted sample corresponding to the screened sample, wherein the difference number is the difference between the size of the interception window and the number of characters of the screened sample covered by the interception window; and when it is determined that the number of characters of the screened sample covered by the interception window is greater than or equal to the size of the interception window, extracting the characters covered by the interception window from the screened sample, in the character order of the screened sample, as the intercepted sample corresponding to the screened sample.
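The windowing, tail padding, and collision handling described in the two optional steps above can be sketched as follows; the function names, the pad character, and the first-non-colliding-window policy are illustrative assumptions:

```python
def truncate_samples(screened, window, step=1, pad="_"):
    """Slide a fixed-size interception window over a screened sample,
    yielding candidate intercepted samples; a tail shorter than the
    window is padded with completion characters."""
    i = 0
    while i < len(screened):
        chunk = screened[i:i + window]
        if len(chunk) < window:                    # tail shorter than the window:
            chunk += pad * (window - len(chunk))   # append completion characters
        yield chunk
        i += step

def compress(screened, window, compressed_set):
    """Return the first intercepted sample not already present in the
    compressed sample set (sliding by the step until one is found)."""
    for chunk in truncate_samples(screened, window):
        if chunk not in compressed_set:
            return chunk
    # every window collides: fall back to the first window
    return next(truncate_samples(screened, window))
```

For example, with a window of 3 over `"abcde"` and `"abc"` already in the compressed set, the window slides once and `"bcd"` is taken instead.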
Optionally, if the deduplicated sample contains the preset character, the preset character is placed at the head of the deduplicated sample.
Optionally, obtaining the word frequency list corresponding to the training sample set includes: counting the occurrence count of each attention character in the training sample set; and sorting the attention characters in the training sample set by occurrence count to obtain the word frequency list corresponding to the training sample set.
In a second aspect, an embodiment of the present invention further provides an apparatus for intention identification, where the apparatus includes:
the training sample acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a plurality of training samples, each training sample is a character combination of an abstract intention, and the training samples at least comprise attention characters in ordered arrangement;
the sample compression processing module is used for compressing each training sample according to the occurrence times of the characters to obtain a compressed sample set when the total number of the characters in the de-duplicated training sample set is greater than or equal to a first threshold;
and the model training module is used for inputting the compressed sample set into a pre-constructed mask language model for training and outputting to obtain an intention recognition result, and the mask language model is used for filling the input compressed sample into a preset position in a language prompt template to be trained and then training.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the method described in the embodiments of the present invention is implemented.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method described in the embodiment of the present invention.
The technical scheme provided by the invention has the following beneficial effects:
the invention provides a method, a device, equipment and a medium for identifying intentions, wherein the method comprises the steps of obtaining a training sample set; then, when the total number of characters in the de-duplicated training sample set is greater than or equal to a first threshold value, compressing each training sample according to the occurrence frequency of the characters to obtain a compressed sample set; and finally, inputting the compressed sample set into a mask language model which is constructed in advance for training, and outputting to obtain an intention recognition result. The method effectively improves the convergence rate of the mask language model by acquiring the training samples and carrying out uniform preprocessing transformation on the training samples.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 shows a flow diagram of a method of intent recognition proposed by an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a method of intent recognition according to an embodiment of the present invention;
FIG. 3 illustrates a flow diagram of a method of intent recognition as set forth in a further embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating an apparatus for intention recognition provided by an embodiment of the present invention;
fig. 5 shows a schematic structural diagram of an electronic device provided in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
With the continuous development of artificial intelligence technology, the intent-expressing text of a user in a chat system generally suffers from problems such as colloquial phrasing and wide-ranging content, and the amount of chat text is limited. To address these problems, the industry has proposed converting the original query task into a mask language modeling task by inserting the user-input sample into a preset template of natural-language prompt information. For example, PET (Pattern-Exploiting Training) is a semi-supervised training procedure that splices the input into a task description, i.e., adds a [mask] tag at a certain position of the task description, thereby converting the input into a cloze (fill-in-the-blank) task. Although applying PET to intention recognition can reduce the dependence on labeled data, the query information input by the user is of variable length, so the convergence of the PET training process is slow, or convergence may not be achieved at all. It can be seen that the prediction effect of directly applying the PET model to intention recognition in each scenario is not very good.
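A PET-style prompt construction can be sketched as follows; the template wording, the number of [MASK] slots, and the label-word verbalizer are illustrative assumptions, not the patent's actual templates:

```python
# Illustrative template and verbalizer for a PET-style cloze task.
TEMPLATE = "{sample} The intent of the previous sentence is [MASK][MASK]."

VERBALIZER = {                 # hypothetical intent label -> two-character word
    "ask_price": "价格",
    "ask_location": "位置",
}

def build_prompt(sample):
    """Fill the user sample into the preset position of the prompt template."""
    return TEMPLATE.format(sample=sample)

prompt = build_prompt("(certain goods/services) ask price")
# a masked language model is then trained so that, at the [MASK] positions,
# it predicts the verbalizer word of the gold intent label
```

The classification task thus becomes a fill-in-the-blank task over the [MASK] slots, which is why the masked language model can be trained on it directly.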
The invention provides an intention recognition method which can accelerate the convergence speed of PET training in the intention recognition training process and effectively improve the efficiency of template construction.
In order to more clearly understand the inventive concept, the intention recognition method proposed by the present invention is described below with reference to fig. 1 to 5.
Referring to fig. 1, fig. 1 is a flowchart illustrating an intention identification method according to an embodiment of the present invention, which may be implemented by an intention identification apparatus configured in an electronic device. The method comprises the following steps:
step 101, a training sample set is obtained, the training sample set includes a plurality of training samples, each training sample is a character combination of an abstract intention, and the training samples at least include attention characters in an ordered arrangement.
And 102, when the total number of the characters in the de-duplicated training sample set is greater than or equal to a first threshold value, compressing each training sample according to the number of times of occurrence of the characters to obtain a compressed sample set.
And 103, inputting the compressed sample set into a pre-constructed mask language model for training, and outputting to obtain an intention recognition result, wherein the mask language model is used for filling the input compressed sample into a preset position in a language prompt template to be trained and then training.
In the above steps, a training sample is a character combination obtained by performing intention abstraction processing on input data. For example, the input data may be several variants such as: What do you call yourself? What are you called? What is your name?
Performing intention abstraction processing on these inputs yields a single character combination, i.e., a training sample: What name are you called?
A training sample can be understood as a character combination, which may include, but is not limited to, Chinese characters, punctuation marks, English characters, and the like. A training sample comprises attention characters and non-attention characters in an ordered arrangement, for example the training samples { What name are you called? } and { (certain sight/location/item) query location }. The ordered arrangement of Chinese characters and punctuation marks expresses the original intent of the input data.
Similarly, the intention abstraction processing is performed on the plurality of input data to obtain a plurality of training samples. A plurality of training samples are combined into a training sample set. The set of training samples may be, for example:
{ (certain attraction/location/item) query location; asking (for some good/service) for a price; seek to recommend sights/places of play (a location); seeking to recommend gourmet food (at a location); (certain goods/services) inquiry purchase means; (certain sight/certain location/certain item) query feature; (certain sight/certain location) query activity; asking the show (at a certain sight/location); (certain sight/certain location) query history; (certain sight/certain location/certain item) ask person in charge; (certain business/certain location) query culture; (someone/thing) asking nationality, (someone/thing) asking age; (at a location) asking for specialty; }.
Wherein, the query location (of a certain sight/location/item) can be referred to as a training sample.
The total number of characters in the deduplicated training sample set refers to the number of distinct Chinese characters remaining after deduplication. The training sample set is first deduplicated, and then the total number of characters in the deduplicated set is counted. Suppose the training sample set is { (certain sight/location/item) query location; (certain goods/services) ask price; seek recommendation of sights/places to play (at a location); seek recommendation of gourmet food (at a location); (certain goods/services) query purchase means; (certain sight/certain location/certain item) query feature; (certain sight/certain location) query activity; ask about the show (at a certain sight/location); (certain sight/certain location) query history; (certain sight/certain location/certain item) ask for the person in charge; (certain business/certain location) query culture; (someone/thing) ask nationality; (someone/thing) ask age; (at a location) ask for specialties }. After merging and deduplication, the character list corresponding to the training sample set is: { certain, some, inquiry, ground, landscape, thing, article, person, service, quest, ask, recommend, special, location, position, price, grade, tour, play, beauty, food, buy, formula, color, activity, action, show, calendar, history, item, eye, burden, responsibility, enterprise, business, literary property, chemistry, country, nationality, year, age, birth }, where each entry is the gloss of a single Chinese character. This merged, deduplicated list contains more than 20 characters.
The number of occurrences of a character refers to the number of occurrences of each character of interest in the training sample set.
The first threshold is an ideal threshold condition determined empirically over multiple rounds of intention recognition training. The first threshold may be 2-30 characters. Optionally, the first threshold is 20 characters.
A compressed sample is a character combination obtained by compressing a training sample: its character length is a fixed value, and the sum of the occurrence counts of its attention characters is maximal. The compression process includes, but is not limited to, truncating the training sample according to a preset character length and the semantics of the training sample.
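The fixed-length, frequency-maximizing property just described can be sketched as a search over contiguous windows; this is a hypothetical helper, since the description does not fix the exact search procedure:

```python
def best_window(chars, freq, k):
    """Among all length-k contiguous windows of `chars`, return the one whose
    attention characters have the largest total corpus frequency.
    `freq` maps attention characters to their occurrence counts; characters
    absent from `freq` contribute 0."""
    if len(chars) <= k:
        return chars                          # already within the fixed length
    windows = (chars[i:i + k] for i in range(len(chars) - k + 1))
    return max(windows, key=lambda w: sum(freq.get(c, 0) for c in w))
```

For instance, with counts {a: 1, b: 5, c: 5, d: 1} and k = 2, the window "bc" (total 10) is preferred over "ab" and "cd" (total 6 each).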
An attention character is a character that appears in a training sample and characterizes the intention. For example, for the training sample { (certain sight/location/item) query location }, the attention characters include certain, sight, point, ground, object, item, query, question, position, and location, while the non-attention characters are "(", ")", and "/". Optionally, attention characters include, but are not limited to, Chinese characters, English words, and the like.
The mask language model is a model that fills an input sample into a preset position in a language prompt template to be trained, and is then trained. The language prompt template may be, for example, "the next sentence asks about [mask]" or "the intent of the next sentence is [mask]".
The mask language model may be a pre-training task of the BERT model (Bidirectional Encoder Representations from Transformers). Such pre-training tasks include, but are not limited to, the masked language model (MLM). BERT is a language model that trains deep bidirectional representations by jointly conditioning on the bidirectional Transformer encoders in all layers. The input representation of the BERT model may represent a single text sentence or a pair of texts, e.g., [question, answer], in one token sequence. The purpose of pre-training is to construct a language model, and BERT uses a bidirectional encoder: when the pre-trained language model handles a downstream task, it needs the language information both to the left and to the right of a given word. The MLM randomly masks some words from the input, which allows the left and right contexts to be fused, thereby realizing a bidirectional Transformer representation. That is, the MLM randomly masks part of the words in a sentence and then predicts the masked words from the context. Because this fuses bidirectional textual information, a deep bidirectional Transformer model can be pre-trained.
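The masking step of the MLM objective can be sketched minimally as follows; this simplification omits the 80/10/10 replacement scheme of the original BERT recipe, and the masking probability is a parameter:

```python
import random

def mlm_mask(tokens, mask_token="[MASK]", p=0.15, rng=None):
    """Randomly mask a fraction of tokens and record the originals as labels,
    as in BERT-style masked-language-model pretraining (simplified)."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            masked.append(mask_token)   # hide the token from the model
            labels.append(tok)          # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)         # position ignored in the loss
    return masked, labels
```

The model then predicts each masked token from both its left and right context, which is what makes the learned Transformer representation bidirectional.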
According to the intention recognition method provided by the embodiment of the present invention, when the language model is pre-trained, the training sample set is subjected to a uniform, word-frequency-based preprocessing transformation. This reduces the size of the candidate vocabulary and shortens the length that needs to be predicted, thereby effectively improving the convergence speed of the PET model.
Referring to fig. 2, fig. 2 is a flowchart illustrating an intention recognition method according to an embodiment of the invention, which can be implemented by an intention recognition apparatus configured in an electronic device. The method comprises the following steps:
step 201, a training sample set is obtained, where the training sample set includes a plurality of training samples, each training sample is a character combination of an abstract intention, and the training samples at least include attention characters in an ordered arrangement.
Step 202, when the total number of characters in the training sample set after de-duplication is greater than or equal to a first threshold, acquiring a word frequency list corresponding to the training sample set, where the word frequency list includes the attention characters included in the training sample set and the occurrence frequency of each attention character in the training sample set.
Step 203, performing character deduplication processing on each training sample in the training sample set to obtain a deduplication sample set, where the deduplication sample set includes multiple deduplication samples, and the deduplication samples correspond to the training samples one to one.
And 204, respectively compressing each de-duplicated sample in the de-duplicated sample set according to the word frequency list to obtain a compressed sample set, wherein the compressed sample set comprises a plurality of compressed samples, and the compressed samples correspond to the de-duplicated samples one to one.
Step 205, inputting the compressed sample set into a pre-constructed mask language model for training, and outputting to obtain an intention recognition result, wherein the mask language model is used for training after filling the input compressed sample into a preset position in a language prompt template to be trained.
In step 206, when the total number of characters in the training sample set after deduplication is smaller than the first threshold, no special processing is required.
Optionally, obtaining a word frequency list corresponding to the training sample set may include the following steps:
counting the number of occurrences of each attention character in the training sample set;
and sorting the attention characters in the training sample set by the number of occurrences of each attention character to obtain the word frequency list corresponding to the training sample set.
Optionally, counting the number of occurrences of each attention character in the training sample set may include:
performing word segmentation processing on each training sample to obtain the individual attention characters;
and traversing the training sample set and cumulatively counting each attention character to obtain the number of occurrences corresponding to each attention character.
Assume the training sample set is { (certain sight/certain location/certain item) query position; (certain goods/service) query price; (a location) seek recommended sights/places to play; (a location) seek recommended food; (certain goods/service) query purchase means; (certain sight/certain location/certain item) query feature; (certain sight/certain location) query activity; (certain sight/certain location) query show; (certain sight/certain location) query history; (certain sight/certain location/certain item) query person in charge; (certain business/certain location) query culture; (someone/thing) query nationality; (someone/thing) query age; (a location) query specialties }. The attention characters of the training sample set are Chinese characters.
Only the word frequencies of the Chinese characters are sorted in the training sample set, and a word frequency list is obtained as follows:
{'some': 22, 'point': 17, 'query': 12, 'ask': 12, 'ground': 11, 'scene': 7, 'object': 6, 'item': 4, 'person': 3, 'service': 2, 'seek': 2, 'request': 2, 'recommend': 2, 'special': 2, 'position': 1, 'price': 1, …, 'national': 1, 'book': 1, 'year': 1, 'age': 1, 'birth': 1}
The sorting is optionally performed in order of word frequency from high to low, or from low to high.
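The counting and sorting steps above can be sketched as follows. This is a minimal illustration, assuming attention characters are CJK characters; the two sample strings are toy intent names introduced here, not taken from the patent.

```python
from collections import Counter

def is_attention_char(ch: str) -> bool:
    # Assumption: attention characters are Chinese (CJK unified ideograph) characters.
    return "\u4e00" <= ch <= "\u9fff"

def build_word_frequency_list(training_samples):
    """Count every attention character across the sample set, then sort by count descending."""
    counts = Counter(
        ch for sample in training_samples for ch in sample if is_attention_char(ch)
    )
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Two illustrative intent names: "query position of a sight", "query specialty of a location".
freq = build_word_frequency_list(["某景点询问位置", "某地点询问特产"])
```

Sorting from high to low (as here) or low to high both work; only the relative ordering of characters is used downstream.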
Optionally, performing character deduplication processing on each training sample in the training sample set, including:
for each training sample, when the training sample is determined to contain the non-attention character, replacing the non-attention character with a preset character to obtain a replacement sample corresponding to the training sample;
carrying out character deduplication processing on the replacement sample corresponding to the training sample to obtain a deduplication sample corresponding to the replacement sample;
adding the de-duplicated samples corresponding to the replacement samples to the set of de-duplicated samples;
when it is determined that the training sample does not contain non-attention characters, performing character de-duplication processing on the training sample to obtain the de-duplicated sample corresponding to the training sample;
and adding the de-duplicated sample corresponding to the training sample to the de-duplicated sample set.
Non-attention characters are the characters other than attention characters in each training sample. Non-attention characters include, but are not limited to, English characters and special symbols. An English character may be a single letter.
The non-Chinese characters in the training sample set are replaced by preset characters; for example, they are replaced with the [UNK] symbol.
A preset character is a character or character combination specified in advance to replace non-attention characters. Preset characters include, but are not limited to, the [UNK] representation.
For example, after the [ UNK ] symbol is used for substitution processing on the training sample set, the result after substitution is obtained:
{[UNK] certain sight [UNK] certain location [UNK] certain item [UNK] query position, [UNK] certain goods [UNK] service [UNK] query price, [UNK] certain location [UNK] seek recommended sights [UNK] play place, [UNK] certain location [UNK] seek recommended food, [UNK] certain goods [UNK] service [UNK] query purchase means, [UNK] certain sight [UNK] certain location [UNK] certain item [UNK] query feature, [UNK] certain sight [UNK] certain location [UNK] query activity, [UNK] certain sight [UNK] certain location [UNK] query show, [UNK] certain sight [UNK] certain location [UNK] query history, [UNK] certain sight [UNK] certain location [UNK] certain item [UNK] query person in charge, [UNK] certain business [UNK] certain location [UNK] query culture, [UNK] someone [UNK] thing [UNK] query nationality, [UNK] someone [UNK] thing [UNK] query age, [UNK] certain location [UNK] query specialties}.
Character de-duplication processing is performed on the replaced result to obtain the de-duplicated sample corresponding to each replacement sample. For example, the replacement sample is {[UNK] certain sight [UNK] location [UNK] item [UNK] query position}, and the de-duplicated sample corresponding to this replacement sample is {[UNK] certain sight location item query position}.
Adding the replaced result to a de-duplication sample set with an empty initial state, and finally obtaining the de-duplication sample set as follows:
{[UNK] certain sight location item query position, [UNK] certain goods service query price, [UNK] certain location seek recommended sight play, [UNK] certain location seek recommended food, [UNK] certain goods service query purchase means, [UNK] certain sight location item query feature, [UNK] certain sight location query activity, [UNK] certain sight location query show, [UNK] certain sight location query history, [UNK] certain sight location item query person in charge, [UNK] certain business location query culture, [UNK] someone thing query nationality, [UNK] someone thing query age, [UNK] certain location query specialties}.
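A minimal sketch of the replacement and in-sample de-duplication steps above. Treating every run of non-CJK characters as one non-attention span, and treating [UNK] as a single indivisible token, are assumptions made here for illustration.

```python
import re

UNK = "[UNK]"

def replace_non_attention(sample: str) -> str:
    # Assumption: any run of non-CJK characters is a non-attention span,
    # replaced by the single preset token [UNK].
    return re.sub(r"[^\u4e00-\u9fff]+", UNK, sample)

def dedupe_characters(sample: str) -> str:
    """Keep the first occurrence of each character, treating [UNK] as one unit."""
    tokens = re.findall(r"\[UNK\]|.", sample)  # [UNK] first, so it is not split up
    seen, out = set(), []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            out.append(tok)
    return "".join(out)

# Toy intent name: "(certain sight/certain position/certain item) query position".
s = replace_non_attention("(某景点/某位置/某物品)询问位置")
# Every bracket and slash collapses to [UNK]; de-duplication then keeps one
# [UNK] at its first position and drops every repeated character.
d = dedupe_characters(s)
```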
Optionally, if the de-duplicated sample contains a preset character, the preset character is set at the head of the de-duplicated sample.
If the special symbols in the input data were left unprocessed, different special symbols would have no substantive influence on intention recognition, but their presence would occupy character positions, so the PET prediction length could not be effectively shortened, which degrades the training effect and slows convergence.
By setting the preset character at the head of the de-duplicated sample while preserving the character arrangement of the training sample, the invention retains the fact that the input data carried special symbols and at the same time effectively solves the PET prediction length problem.
Optionally, each of the de-duplicated samples in the de-duplicated sample set is compressed according to the word frequency list, so as to obtain a compressed sample set, where the compressed sample set includes a plurality of compressed samples, and the compressed samples correspond to the de-duplicated samples one to one. And inputting the compressed sample set into a mask language model which is constructed in advance for training, and outputting to obtain an intention recognition result.
According to the intention recognition method above, the training sample set is de-duplicated and compressed with the help of the word frequency list, which reduces the size of the candidate vocabulary and effectively improves the model convergence speed.
Referring to fig. 3, fig. 3 is a flowchart illustrating an intention recognition method according to another embodiment of the invention, which can be implemented by an intention recognition apparatus configured in an electronic device. The method comprises the following steps:
Step 301, a training sample set is obtained, where the training sample set includes a plurality of training samples, each training sample being a character combination abstracting an intention, and each training sample at least including attention characters in an ordered arrangement.
Step 302, when the total number of characters in the training sample set after de-duplication is greater than or equal to a first threshold, acquiring a word frequency list corresponding to the training sample set. The word frequency list comprises the attention characters contained in the training sample set and the occurrence number of each attention character in the training sample set.
Step 303, performing character deduplication processing on each training sample in the training sample set to obtain a deduplication sample set.
Step 304, for each de-duplicated sample, determine the character length of the de-duplicated sample.
Step 305, when the character length is greater than or equal to a second threshold, screen the de-duplicated sample according to the word frequency list to obtain a screened sample corresponding to the de-duplicated sample.
Step 306, compress the screened sample corresponding to the de-duplicated sample to obtain a compressed sample corresponding to the screened sample.
Step 307, add the compressed sample corresponding to the screened sample to the compressed sample set.
Step 308, for each de-duplicated sample whose character length is smaller than the second threshold, determine the de-duplicated sample as a compressed sample and add it to the compressed sample set.
Step 309, input the compressed sample set into a pre-constructed masked language model for training, and output an intention recognition result, where the masked language model fills each input compressed sample into a preset position in a language prompt template to be trained and then performs training.
In step 310, when the total number of characters in the training sample set after deduplication is smaller than the first threshold, no special processing is required.
In the above steps, the character length of a de-duplicated sample refers to the number of characters contained in the character combination obtained after character de-duplication processing. For example, for the de-duplicated sample {[UNK] certain sight location item query position}, the corresponding character length is 11.
The second threshold is a maximum length constraint preset for the de-duplicated samples, and may be a value related to the characteristics of the intentions to be identified. The second threshold may be denoted max_length, for example, and is preferably 5.
Optionally, screening the de-duplicated sample according to the word frequency list to obtain the screened sample corresponding to the de-duplicated sample includes:
determining a candidate character set starting from the first attention character of the word frequency list according to the ordering of the list, where the candidate character set includes a first-threshold number of consecutive attention characters, and the word frequency list is sorted from high to low by the occurrence count of the attention characters in the training sample set;
and screening the attention characters contained in the de-duplicated sample according to the candidate character set to obtain the screened sample corresponding to the de-duplicated sample.
On the basis of the word frequency list, if the value of the first threshold is 20, 20 consecutive attention characters are extracted starting from the first attention character of the word frequency list, and the resulting character set is the candidate character set. The candidate character set is, for example, {'some': 22, 'point': 17, 'query': 12, 'ask': 12, 'ground': 11, 'scene': 7, 'object': 6, 'article': 4, 'person': 3, 'service': 2, 'seek': 2, 'ask': 2, 'recommend': 2, 'special': 2, 'position': 1, 'set': 1, 'price': 1, 'check': 1}. The candidate character set contains the high-frequency attention characters appearing in the training samples. Screening each de-duplicated sample with these high-frequency attention characters further reduces the data processing time.
Screening the attention characters contained in the de-duplicated sample according to the candidate character set may include:
comparing each attention character contained in the de-duplicated sample with the attention characters contained in the candidate character set according to the character arrangement order of the de-duplicated sample;
when it is determined that an attention character contained in the de-duplicated sample belongs to the candidate character set, retaining that attention character;
and when it is determined that an attention character contained in the de-duplicated sample does not belong to the candidate character set, deleting that attention character.
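The retain-or-delete decision above amounts to a membership filter. In this sketch the hypothesis that the preset [UNK] token is always retained (since it marks that special symbols existed) and the toy candidate set are assumptions, not fixed by the patent.

```python
def screen_sample(dedup_tokens, candidate_chars, preset_tokens=frozenset({"[UNK]"})):
    """Keep a token if it is a preset character or an attention character in the
    candidate character set; delete every other token."""
    return [t for t in dedup_tokens if t in preset_tokens or t in candidate_chars]

# Toy candidate set: the first-threshold most frequent attention characters.
candidates = {"某", "景", "点", "地", "询", "问", "特"}
# De-duplicated sample "[UNK] certain sight location item query feature" as tokens.
screened = screen_sample(
    ["[UNK]", "某", "景", "点", "地", "物", "品", "询", "问", "特", "色"], candidates
)
# "物", "品", "色" fall outside the candidate set and are deleted.
```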
For example, the de-duplicated sample is {[UNK] certain sight location item query feature}. Comparing each Chinese character in the de-duplicated sample with the candidate character set, and deleting the characters outside it, yields the screened sample corresponding to the de-duplicated sample.
Optionally, compressing the screened sample corresponding to the de-duplicated sample to obtain the compressed sample corresponding to the screened sample includes:
performing sliding-window processing on the screened sample by using an intercepting window to obtain an intercepted sample corresponding to the screened sample, where the size of the intercepting window is the second threshold, the step length of each slide of the intercepting window is a preset number of characters, and each intercepted sample includes a second-threshold number of characters;
when it is determined that the intercepted sample belongs to the compressed samples already contained in the compressed sample set, sliding the intercepting window by the step length and returning to the sliding-window step, until an intercepted sample is obtained that does not belong to the compressed samples already contained in the compressed sample set;
and when it is determined that the intercepted sample does not belong to the compressed samples already contained in the compressed sample set, determining the intercepted sample as the compressed sample corresponding to the screened sample.
In the above steps, the intercepting window is used to extract a plurality of consecutive characters from the screened sample. The size of the intercepting window determines the number of characters extracted from the screened sample; for example, an intercepting window of size 5 indicates that 5 consecutive characters are extracted from the screened sample each time. The position of a character is determined by the character arrangement order of the screened sample. For example, in the screened sample {[UNK] certain sight location item query feature}, the first character in the arrangement order is [UNK], the second character is "certain", and the last character is "feature".
An intercepted sample is the result obtained each time the intercepting window performs sliding-window processing on the screened sample. For example, with an intercepting window of size 5, the first sliding-window pass over the above screened sample yields the first intercepted sample {[UNK] certain sight location}. The step length is the number of characters the window moves each time; it may take the value of 1 character.
Optionally, the initial state of the compressed sample set is empty.
When the current screened sample is the first sample to be processed, the compressed sample set is empty and contains no compressed samples. The intercepted sample then cannot belong to any compressed sample already contained in the compressed sample set, so the intercepted sample is determined to be a compressed sample and added to the compressed sample set.
When the current screened sample is not the first sample to be processed and the compressed sample set is non-empty, the intercepted sample obtained from the current screened sample is compared with the compressed samples contained in the compressed sample set: if an identical compressed sample exists, the intercepted sample is determined to belong to the compressed sample set; otherwise it is determined not to belong to it.
When the intercepted sample is determined to belong to the compressed samples contained in the compressed sample set, sliding-window processing is performed on the current screened sample again to obtain a new intercepted sample, which is again compared against the compressed sample set, until an intercepted sample is obtained that does not belong to the compressed samples contained in the compressed sample set. The next screened sample is then processed, until the last screened sample in the screened sample set has been handled.
According to the above method, each screened sample is compressed to obtain its corresponding compressed sample; the character lengths of the compressed samples are all the same and each compressed sample is unique, which ensures the richness of the samples used for model training.
Optionally, performing sliding window processing on the screened sample by using an interception window to obtain an intercepted sample corresponding to the screened sample, including:
when it is determined that the number of characters of the screened sample under the intercepting window is smaller than the size of the intercepting window, setting a difference number of completion characters at the tail of those characters according to the character arrangement order of the screened sample, and extracting the characters under the intercepting window together with the appended completion characters as the intercepted sample corresponding to the screened sample, where the difference number is determined by the difference between the size of the intercepting window and the number of characters remaining to be intercepted in the screened sample;
and when it is determined that the number of characters of the screened sample under the intercepting window is greater than or equal to the size of the intercepting window, extracting the characters under the intercepting window from the screened sample as the intercepted sample corresponding to the screened sample according to the character arrangement order of the screened sample.
For example, the screened sample is {[UNK] certain sight location item query feature} and the size of the intercepting window is 5. After several sliding-window passes, if fewer than 5 characters of the screened sample remain under the intercepting window, completion characters are appended after them according to the character arrangement order of the screened sample. Completion characters are symbols or character combinations used to pad an intercepted sample when it contains fewer than the second-threshold number of characters; they include, but are not limited to, the [PAD] form.
Through the processing, the character combination with optimal semantics and optimal processing length can be obtained, and the convergence efficiency of the PET model is effectively improved.
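The truncation, padding, and uniqueness logic of steps 305 to 308 might be sketched as below. The window size max_length = 5 and the step of one character follow the example in the text; representing samples as token lists (with [UNK] and [PAD] as single tokens) is an assumption made here.

```python
PAD = "[PAD]"

def compress(screened_tokens, compressed_set, max_length=5, step=1):
    """Slide an intercepting window of max_length tokens over the screened sample.
    Pad the tail with [PAD] when fewer than max_length tokens remain, and keep
    sliding while the candidate already exists in the compressed sample set."""
    start = 0
    while start < len(screened_tokens):
        window = screened_tokens[start:start + max_length]
        window += [PAD] * (max_length - len(window))  # completion characters
        candidate = "".join(window)
        if candidate not in compressed_set:
            compressed_set.add(candidate)
            return candidate
        start += step
    return None  # every window was a duplicate

compressed = set()
a = compress(["[UNK]", "某", "地", "点", "寻", "求", "推", "荐", "美", "食"], compressed)
b = compress(["[UNK]", "某", "地", "点", "寻", "求", "推", "荐"], compressed)  # first window duplicates a, so the window slides right
c = compress(["点", "询"], compressed)  # short sample gets [PAD] completion
```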
Further, the compressed sample set is input into the pre-constructed masked language model for training, and the intention recognition result is output. Assume the pre-built masked language model may include one or more language prompt templates, such as: "the following sentence is asking about [mask]"; "the following phrase intends to express [mask]"; "the intent of the following is to ask about [mask]".
After the compressed sample set is obtained, each compressed sample is spliced with the selected language prompt template and then fed into the PET training framework to complete training of the PET model.
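As an illustration of the splicing step, a compressed sample can be substituted into a prompt template that reserves a [MASK] slot for the label to be predicted. The template wording and the {x} placeholder below are assumptions made for this sketch, not the patent's fixed format.

```python
def fill_template(compressed_sample: str,
                  template: str = "下面这句话是在询问[MASK]。{x}") -> str:
    # {x} marks the preset position that receives the compressed sample;
    # [MASK] is the token the masked language model is trained to predict.
    return template.format(x=compressed_sample)

pet_input = fill_template("[UNK]某地点寻")
```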
The intention recognition method provided by the embodiment of the invention performs unified preprocessing on the training samples, shortens the prediction length, and effectively improves the convergence speed of the masked language model.
The PET template determined based on the intention recognition method provided by the embodiment of the invention can be suitable for a plurality of classification related tasks, such as emotion recognition, relevance discrimination and the like. However, different tasks generally require different templates, so that different training samples can be obtained according to different application scenarios to perform pre-training processing respectively, thereby obtaining different PET templates.
In the following, text recognition in a dialog system is taken as an example, and it is assumed that intention data input by a user is acquired in the dialog system, and a unified preprocessing modification is performed on an intention name (i.e., a training sample). The intention name refers to a character combination for indicating a certain intention. The intention name is obtained by abstracting a certain intention. The intention may be expressed in various forms of text, voice, and the like.
a. Preprocessing decision module. After the full amount of intention name data is obtained, the characters of all intention names are de-duplicated, and the total number of characters in the current full intention set is calculated.
For example, there is a list of intent names as follows: (certain sight spot/location/item) inquiry position, (certain item/service) inquiry price, (certain location) inquiry about sight spot/playing place, (certain location) inquiry about food, (certain item/service) inquiry about purchasing mode, (certain sight spot/certain location/certain item) inquiry about feature, (certain sight spot/certain location) inquiry about activity, (certain sight spot/certain location) inquiry about show, (certain sight spot/certain location) inquiry about history, (certain sight spot/certain location/certain item) inquiry about responsible person, (certain business/certain location) inquiry about culture, (certain person/thing) inquiry about nationality, (certain person/thing) inquiry about age, and (certain location) inquiry about specialty.
And carrying out duplication elimination on the Chinese characters in the intention name list, wherein the obtained duplication elimination number list is as follows: one, some, inquiry, land, scenery, object, article, person, service, search, ask, recommend, special, location, position, price, game, play, beauty, food, purchase, buy, prescription, formula, color, activity, action, show, calendar, history, item, object, burden, responsibility, enterprise, business, literary, chemical, national, nationality, year, age, birth.
If the total number of words corresponding to the deduplicated intention name list is greater than or equal to 20, the following steps (namely, triggering intention name pre-training) are required, otherwise, special processing is not required.
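The decision in the preprocessing module, triggering pre-training only when the de-duplicated character count reaches the first threshold, amounts to a set-size check. A sketch, with the threshold of 20 taken from the example above; for simplicity this counts every distinct character, whereas the example counts Chinese characters.

```python
def needs_pretraining(intent_names, first_threshold=20):
    """Return True when the number of distinct characters across all
    intent names reaches the first threshold."""
    distinct_chars = {ch for name in intent_names for ch in name}
    return len(distinct_chars) >= first_threshold

# A tiny intent list stays below the threshold, so no special processing is needed.
small = needs_pretraining(["询问位置", "询问价格"])
```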
b. When the total number of words corresponding to the deduplicated intention name list is greater than or equal to 20 characters, the following processing is carried out on the intention name:
i. First, the occurrences of the Chinese characters appearing in the intention name list are counted, and the Chinese characters are sorted from high to low by occurrence count, giving the word frequency ordered list corresponding to the intention name list.
The word frequency ordered list corresponding to the list of intention names is:
{'some': 22, 'point': 17, 'query': 12, 'question': 12, 'ground': 11, 'scene': 7, 'object': 6, 'product': 4, 'person': 3, 'service': 2, 'seek': 2, 'ask': 2, 'push': 2, 'recommend': 2, 'special': 2, 'position': 2, 'place': 1, 'price': 1, 'vessel': 1, 'game': 1, 'play': 1, 'food': 1, 'purchase': 1, 'prescription': 1, 'move': 1, …, 'national': 1, 'book': 1, 'year': 1, 'age': 1, 'birth': 1}
ii. Replace special symbols and non-Chinese symbols appearing in the intention names with the [UNK] symbol. Considering that intention names are usually filled in freely by people, special symbols or English letters may appear in them. These may be uniformly replaced with the special symbol [UNK]. For example, the replacement is performed for each intention name in the intention name list mentioned in the previous step:
(certain sight/location/item) query position -> [UNK] certain sight [UNK] location [UNK] item [UNK] query position,
(certain goods/services) ask for price -> [UNK] certain goods [UNK] services [UNK] ask for price,
(a location) seek to recommend sights/playground -> [UNK] a location [UNK] seek to recommend sights [UNK] playground,
(a location) seek to recommend food -> [UNK] a location [UNK] seek to recommend food,
(certain article/service) query purchase mode -> [UNK] certain article [UNK] service [UNK] query purchase mode,
(certain sight/certain location/certain item) query feature -> [UNK] certain sight [UNK] certain location [UNK] certain item [UNK] query feature,
(certain sight/certain location) query activity -> [UNK] certain sight [UNK] certain location [UNK] query activity,
(certain sight/certain location) query show -> [UNK] certain sight [UNK] certain location [UNK] query show,
(certain sight/certain location) query history -> [UNK] certain sight [UNK] certain location [UNK] query history,
(certain sight/certain location/certain item) query person in charge -> [UNK] certain sight [UNK] certain location [UNK] certain item [UNK] query person in charge,
(certain business/certain location) query culture -> [UNK] certain business [UNK] certain location [UNK] query culture,
(someone/thing) query nationality -> [UNK] someone [UNK] thing [UNK] query nationality,
(someone/thing) query age -> [UNK] someone [UNK] thing [UNK] query age,
(a location) query specialty -> [UNK] a location [UNK] query specialty
iii. Perform internal character de-duplication processing on the intention names after replacement. For example, character de-duplication is performed for each of the replaced intention names above:
[UNK] certain sight [UNK] location [UNK] item [UNK] query position -> [UNK] certain sight location item query position,
[UNK] certain item [UNK] service [UNK] ask price -> [UNK] certain item service ask price,
[UNK] certain location [UNK] seeks recommended sight [UNK] play ground -> [UNK] certain location seeks recommended sight play,
[UNK] certain location [UNK] seeking to recommend food -> [UNK] certain location seeking to recommend food,
[UNK] certain article [UNK] service [UNK] query purchase mode -> [UNK] certain article service query purchase mode,
[UNK] certain sight [UNK] certain location [UNK] certain article [UNK] query feature -> [UNK] certain sight location article query feature,
[UNK] certain sight [UNK] certain location [UNK] query activity -> [UNK] certain sight location query activity,
[UNK] certain sight [UNK] certain location [UNK] query show -> [UNK] certain sight location query show,
[UNK] certain sight [UNK] certain location [UNK] query history -> [UNK] certain sight location query history,
[UNK] certain sight [UNK] certain location [UNK] certain item [UNK] query person in charge -> [UNK] certain sight location item query person in charge,
[UNK] certain enterprise [UNK] certain location [UNK] query culture -> [UNK] certain enterprise location query culture,
[UNK] someone [UNK] thing [UNK] query nationality -> [UNK] someone thing query nationality,
[UNK] someone [UNK] thing [UNK] query age -> [UNK] someone thing query age,
[UNK] certain location [UNK] query specialty -> [UNK] certain location query specialty
iv. Perform constrained-length extraction on each intention name after character de-duplication to obtain the truncated intention names.
Because the character length of each de-duplicated intention name is still too long, which is not conducive to downstream training, preferred characters are obtained from the word frequency list to screen each de-duplicated intention name, and then each screened intention name is truncated with a truncation window of max_length = 5 to shorten the prediction length, thereby ensuring the convergence speed of the PET model.
For example, the first 20 characters of the word frequency list are obtained, and then the respective intention names subjected to the character de-duplication processing and the first 20 characters of the word frequency list are screened, so that each intention name subjected to the character de-duplication processing only contains characters belonging to the range of the 20 characters.
Then, a maximum length constraint called max_length is set for each intention name. If, after character de-duplication and screening, the character length of the current intention name is smaller than max_length, the processed intention name is directly added to the intention name training set (i.e., the aforementioned compressed sample set). If the character length of the current intention name is greater than or equal to max_length, the current intention name is truncated using the truncation window.
For example, the first 20 characters in the word frequency list are as follows:
{'some': 22, 'point': 17, 'query': 12, 'ground': 11, 'scene': 7, 'object': 6, 'article': 4, 'person': 3, 'service': 2, 'search': 2, 'request': 2, 'push': 2, 'recommendation': 2, 'special': 2, 'position': 1, 'setting': 1, 'price': 1, 'grid': 1}
After each de-duplicated intention name has been screened, whether sliding-window processing with the intercepting window is required is judged according to the screened result. For example,
comparing "[UNK] some place seeks to recommend food" with the first 20 characters of the word frequency list gives the screened result "[UNK] some place seeks to recommend" (i.e., a screened sample). The intercepting window then performs the first sliding-window pass to obtain "[UNK] some place seeks":
[UNK] some place seeks to recommend food -> [UNK] some place seeks to recommend -> [UNK] some place seeks.
In the same way, the other de-duplicated intention names are processed one by one.
[UNK] some place seek recommend food -> [UNK] some place seek recommend -> [UNK] some place seek;
[UNK] some place seek recommend scene play -> [UNK] some place seek recommend -> some place seek recommend; # since "[UNK] some place seek" already occurred, the window slides one position to the right
[UNK] some place query special product -> [UNK] some place query;
[UNK] some enterprise place query culture -> [UNK] some place query -> some place query; # as above: the characters of "enterprise" and "culture" are not among the first 20 characters of the frequency list, and "[UNK] some place query" already occurred, so the window slides one position to the right
[UNK] some article service query purchase mode -> [UNK] some article service query -> [UNK] some article service;
[UNK] some article service query price -> some article service; # sliding window, shifted one position to the right
[UNK] some scenic spot local object query characteristic -> [UNK] some scenic spot query;
[UNK] some scenic spot local object query position -> [UNK] some scenic spot local object query -> some scenic spot local object; # sliding window, shifted one position to the right
[UNK] some scenic spot query activity -> [UNK] some scenic spot query -> some scenic spot query; # sliding window, shifted one position to the right
[UNK] some scenic spot query performance -> scenic spot query performance; # sliding window, shifted two positions to the right
[UNK] some scenic spot query history -> [UNK] some scenic spot query -> spot query [PAD]; # after the sliding window shifts two positions to the right, fewer than 5 characters remain, so the [PAD] symbol completes the sample
[UNK] some scenic spot site query person in charge -> [UNK] some scenic spot site query;
[UNK] some person query age -> [UNK] some person query;
[UNK] some person query nationality -> [UNK] some person query -> some person query; # sliding window, shifted one position to the right
After this processing, intention names of different character lengths and different meanings are compressed to the same character length, which shortens the length to be predicted in the continued product of per-character probabilities and improves the model's convergence speed.
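The screening and sliding-window compression steps above can be sketched as follows. This is a minimal reconstruction under stated assumptions: single letters stand in for Chinese intention-name characters, the window size (max_length) is 5, the sliding step is one character, and all function and variable names are ours, not the patent's.

```python
def screen(tokens, candidate_chars):
    """Keep only [UNK] and the characters that appear in the top-N
    candidate set taken from the character frequency list."""
    return [t for t in tokens if t == "[UNK]" or t in candidate_chars]

def compress(tokens, window, compressed_set):
    """Take a window-sized excerpt; if it already exists in the
    compressed sample set, slide the window one step to the right and
    retry; pad with [PAD] when fewer than `window` characters remain."""
    start = 0
    while True:
        chunk = tokens[start:start + window]
        chunk = chunk + ["[PAD]"] * (window - len(chunk))
        key = tuple(chunk)
        if key not in compressed_set or start >= len(tokens):
            compressed_set.add(key)
            return chunk
        start += 1  # sliding window, shift one position to the right

top_chars = set("spkr")   # pretend top-20 candidate characters
compressed = set()
first  = ["[UNK]", "s", "p", "k", "r", "f"]  # e.g. "[UNK] some place seek recommend food"
second = ["[UNK]", "s", "p", "k", "r", "g"]  # same prefix, different tail
a = compress(screen(first, top_chars), 5, compressed)
b = compress(screen(second, top_chars), 5, compressed)
print(a)  # ['[UNK]', 's', 'p', 'k', 'r']
print(b)  # duplicate detected, window slid right: ['s', 'p', 'k', 'r', '[PAD]']
```

The second sample screens down to the same five characters as the first, so the window slides one position and the tail is completed with [PAD], mirroring the worked examples above.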
The abbreviated representations of the intention names obtained above (i.e., the compressed samples described earlier) are used as the training targets of the language model. Several templates suitable for PET learning were designed and tested, and the best template was determined through training. In experiments we found that a fixed template of the form "The following sentence is asking about" achieves relatively good results on the intent recognition task.
Fixed language prompt template              Intent recognition accuracy
"The following sentence is asking about"    95.02%
"The following is intended to"              94.13%
"The intention of the following is to ask"  92.446%
Table 1
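The fixed-template PET setup can be sketched as below. The English template wording and all names here are our illustrative assumptions (the patent's templates are Chinese): at training time the label slot of the template holds the compressed intention name, while at inference it holds [MASK] tokens that the masked language model must recover.

```python
MASK = "[MASK]"
TEMPLATE = "The following sentence is asking about {label}. {query}"

def build_pet_pair(query, compressed_label_chars):
    """Return (training text, inference text) for one sample: the
    compressed intention name fills the preset position at train time,
    and the same number of mask tokens fills it at inference time."""
    label = "".join(compressed_label_chars)
    train_text = TEMPLATE.format(label=label, query=query)
    infer_text = TEMPLATE.format(label=MASK * len(compressed_label_chars),
                                 query=query)
    return train_text, infer_text

train_text, infer_text = build_pet_pair(
    "any recommended food around here?", ["a", "b", "c"])
print(infer_text)
# The following sentence is asking about [MASK][MASK][MASK]. any recommended food around here?
```

Because every compressed label has the same fixed length, the number of mask positions is constant, which is what keeps the prediction length short during the continued product.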
The prompt template processing method provided by the embodiment of the invention performs uniform preprocessing and transformation on the training samples, shortens the length to be predicted in the continued product of per-character probabilities, and effectively improves the convergence speed of the mask language model.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Optionally, referring to fig. 4, fig. 4 is a schematic structural diagram of an apparatus for intention identification according to an embodiment of the present invention. The apparatus may be configured in an electronic device, the apparatus comprising:
a training sample obtaining module 401, configured to obtain a training sample set, where the training sample set includes multiple training samples, each training sample is a character combination of an abstract intention, and the training samples at least include attention characters in an ordered arrangement;
a sample compression processing module 402, configured to, when the total number of characters in the training sample set after deduplication is greater than or equal to a first threshold, perform compression processing on each training sample according to the number of occurrences of the characters to obtain a compressed sample set;
the model training module 403 is configured to input the compressed sample set into a pre-constructed mask language model for training and to output an intention recognition result, where the mask language model fills each input compressed sample into a preset position of a language prompt template to be trained and then performs training.
Optionally, the sample compression processing module 402 further includes:
the word frequency list acquisition submodule is used for acquiring a word frequency list corresponding to the training sample set, and the word frequency list comprises concerned characters contained in the training sample set and the occurrence frequency of each concerned character in the training sample set;
the character de-duplication processing submodule is used for respectively carrying out character de-duplication processing on each training sample in the training sample set to obtain a de-duplication sample set, the de-duplication sample set comprises a plurality of de-duplication samples, and the de-duplication samples correspond to the training samples one to one;
the compression processing submodule is used for respectively compressing each de-duplicated sample in the de-duplicated sample set according to the word frequency list to obtain a compressed sample set, wherein the compressed sample set comprises a plurality of compressed samples, and the compressed samples correspond to the de-duplicated samples one to one;
optionally, the word frequency list obtaining sub-module is further configured to:
respectively counting the occurrence times of each concerned character in the training sample set;
and sequencing the attention characters in the training sample set according to the occurrence frequency of each attention character to obtain a character frequency list corresponding to the training sample set.
Optionally, the deduplication processing sub-module is further configured to:
for each training sample, when the training sample is determined to contain the non-attention character, replacing the non-attention character with a preset character to obtain a replacement sample corresponding to the training sample;
carrying out character deduplication processing on the replacement sample corresponding to the training sample to obtain a deduplication sample corresponding to the replacement sample;
adding the de-duplicated samples corresponding to the replacement samples to the set of de-duplicated samples;
when the training sample is determined not to contain a non-attention character, performing character de-duplication processing on the training sample to obtain a de-duplicated sample corresponding to the training sample;
the de-duplicated sample corresponding to the training sample is added to the set of de-duplicated samples.
Optionally, the compression processing sub-module is further configured to:
determining, for each de-duplicated sample, the character length of the de-duplicated sample;
when the character length is greater than or equal to a second threshold, screening the de-duplicated sample according to the character frequency list to obtain a screened sample corresponding to the de-duplicated sample;
compressing the screened sample corresponding to the de-duplicated sample to obtain a compressed sample corresponding to the screened sample;
adding the compressed sample corresponding to the screened sample to the compressed sample set.
Optionally, the compression processing sub-module is further configured to: for each of the de-duplicated samples, when the character length is less than a second threshold, the de-duplicated sample is determined to be a compressed sample.
Optionally, the compression processing sub-module is further configured to:
determining a candidate character set starting from the first attention character of the word frequency list according to the arrangement order of the word frequency list, wherein the candidate character set comprises a consecutive first-threshold number of attention characters, and the word frequency list is sorted from high to low by the number of occurrences of the attention characters in the training sample set;
and screening the concerned characters contained in the duplicate removal sample according to the candidate character set to obtain a screening sample corresponding to the duplicate removal sample.
Optionally, the compression processing sub-module is further configured to:
performing sliding window processing on the screened sample by using an intercepting window to obtain an intercepted sample corresponding to the screened sample, wherein the size of the intercepting window is the second threshold, the step length of each sliding of the intercepting window is a preset number of characters, and each intercepted sample comprises the second threshold number of characters;
when the intercepted sample is determined to belong to the compressed sample contained in the compressed sample set, sliding the intercepted window according to the step length, returning to the step of performing sliding window processing on the screened sample by using the intercepted window to obtain the intercepted sample corresponding to the screened sample until the intercepted sample is determined not to belong to the compressed sample contained in the compressed sample set;
and when determining that the intercepted sample does not belong to the compressed samples already contained in the compressed sample set, determining the intercepted sample as the compressed sample corresponding to the screened sample.
Optionally, the compression processing sub-module is further configured to:
when the number of characters contained in the screening sample corresponding to the intercepting window is determined to be smaller than the size of the intercepting window, according to the character arrangement sequence of the screening sample, setting completion characters with difference numbers at the tail parts of the characters contained in the screening sample corresponding to the intercepting window, extracting the characters contained in the screening sample corresponding to the intercepting window and the set completion characters as the intercepting sample corresponding to the screening sample, wherein the difference numbers are determined by the difference between the size of the intercepting window and the number of characters to be intercepted in the screening sample corresponding to the intercepting window;
and when the number of characters contained in the screening sample corresponding to the intercepting window is determined to be larger than or equal to the size of the intercepting window, extracting the characters corresponding to the intercepting window from the screening sample as the intercepting sample corresponding to the screening sample according to the character arrangement sequence of the screening sample.
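In isolation, the interception step of this sub-module can be sketched as follows. The [PAD] completion character follows the worked example earlier in the text; the function name and the sample tokens are our assumptions.

```python
def intercept(screened, start, window, pad="[PAD]"):
    """Extract a window-sized excerpt beginning at `start`; when fewer
    than `window` characters remain, append [PAD] completion characters
    so that every intercepted sample has exactly `window` characters."""
    chunk = screened[start:start + window]
    if len(chunk) < window:
        chunk = chunk + [pad] * (window - len(chunk))
    return chunk

tokens = ["s", "p", "q", "h"]   # a screened sample of 4 characters
print(intercept(tokens, 0, 5))  # ['s', 'p', 'q', 'h', '[PAD]']
print(intercept(tokens, 2, 5))  # ['q', 'h', '[PAD]', '[PAD]', '[PAD]']
```

The second call shows the case where the window has already slid two positions to the right and only two characters remain to be intercepted, so three [PAD] characters complete the sample.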
Optionally, the apparatus further includes a position setting module configured to place the preset character at the leading position of the de-duplicated sample if the de-duplicated sample contains the preset character.
According to the prompt template processing apparatus provided by the embodiment of the invention, a training sample set is obtained during model pre-training, uniform preprocessing and transformation are then performed on the training samples, the length to be predicted in the continued product of per-character probabilities is shortened, and the convergence speed of the mask language model is effectively improved.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device may be a terminal such as a smartphone (e.g., an Android or iOS phone), a tablet computer, a palmtop computer, another Mobile Internet Device (MID), or a desktop computer. Fig. 5 does not limit the structure of the electronic device. As shown in fig. 5, the electronic device includes at least a memory 501 and a processor 502; in practice it may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 5.
In particular, the processes described above with reference to the flowcharts of fig. 1-3 may be implemented as computer software programs, according to embodiments provided by the present invention. For example, embodiments provided herein include a computer program product comprising a computer program embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program performs the above-described functions defined in the system of the present invention when executed by a Central Processing Unit (CPU).
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and computer program products according to various embodiments of the present invention. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments provided by the present invention may be implemented by software, or may be implemented by hardware. The described units or modules may also be provided in a processor, and may be described as: a processor includes a training sample acquisition module, a sample compression processing module, and a model training module. Where the names of such units or modules do not in some cases constitute a limitation of the unit or module itself, for example, a training sample acquisition module may also be described as a "module for acquiring a set of training samples".
As another aspect, embodiments of the present invention further provide a computer-readable storage medium, which may be included in the electronic device described in the foregoing embodiments; or may be separate and not incorporated into the electronic device. The computer readable storage medium stores one or more programs which, when executed by one or more processors, perform the methods described in the present application for intent recognition.
The foregoing description is only exemplary of the preferred embodiments of the invention and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features and the technical features (but not limited to) having similar functions disclosed in the present invention are mutually replaced to form the technical solution.

Claims (13)

1. A method of intent recognition, the method comprising:
acquiring a training sample set, wherein the training sample set comprises a plurality of training samples, each training sample is a character combination of an abstract intention, and the training samples at least comprise attention characters in ordered arrangement;
when the total number of the characters of the training sample set after the duplication removal is larger than or equal to a first threshold value, compressing each training sample according to the number of times of occurrence of the characters to obtain a compressed sample set;
and inputting the compressed sample set into a pre-constructed mask language model for training, and outputting to obtain an intention recognition result, wherein the mask language model is used for filling the input compressed sample into a preset position in a language prompt template to be trained and then training.
2. The method according to claim 1, wherein the compressing each training sample according to the number of occurrences of the character to obtain a compressed sample set comprises:
acquiring a word frequency list corresponding to the training sample set, wherein the word frequency list comprises attention characters contained in the training sample set and the occurrence times of each attention character in the training sample set;
performing character deduplication processing on each training sample in the training sample set to obtain a deduplication sample set, where the deduplication sample set includes multiple deduplication samples, and the deduplication samples correspond to the training samples one to one;
and compressing each de-duplicated sample in the de-duplicated sample set according to the word frequency list to obtain the compressed sample set, wherein the compressed sample set comprises a plurality of compressed samples, and the compressed samples correspond to the de-duplicated samples one to one.
3. The method of claim 2, wherein the performing character deduplication processing on each training sample in the training sample set to obtain a deduplication sample set comprises:
for each training sample, when the training sample is determined to contain a non-attention character, replacing the non-attention character with a preset character to obtain a replacement sample corresponding to the training sample;
carrying out character deduplication processing on the replacement sample corresponding to the training sample to obtain a deduplication sample corresponding to the replacement sample;
adding a de-duplicated sample corresponding to the replacement sample to the set of de-duplicated samples;
when the training sample is determined not to contain a non-attention character, performing character de-duplication processing on the training sample to obtain a de-duplicated sample corresponding to the training sample;
adding the de-duplicated sample corresponding to the training sample to the set of de-duplicated samples.
4. The method of claim 2, wherein the compressing each of the de-duplicated samples in the set of de-duplicated samples according to the word-frequency list to obtain the set of compressed samples comprises:
for each of the de-duplication samples, determining a character length of the de-duplication sample;
when the character length is larger than or equal to a second threshold value, screening the duplicate removal sample according to the character frequency list to obtain a screening sample corresponding to the duplicate removal sample;
compressing the screened sample corresponding to the de-duplicated sample to obtain a compressed sample corresponding to the screened sample;
adding a compressed sample corresponding to the screened sample to the set of compressed samples.
5. The method of claim 4, further comprising:
for each of the de-duplicated samples, determining the de-duplicated sample as a compressed sample when the character length is less than a second threshold.
6. The method of claim 4, wherein the filtering the de-duplicated samples according to the word frequency list to obtain filtered samples corresponding to the de-duplicated samples comprises:
according to the arrangement sequence of the word frequency list, determining a candidate character set from the first concerned character of the word frequency list, wherein the candidate character set comprises a first threshold number of concerned characters which are continuous, and the word frequency list is sorted according to the sequence of the occurrence times of the concerned characters in the training sample set from high to low;
and screening the concerned characters contained in the duplication removing sample according to the candidate character set to obtain a screening sample corresponding to the duplication removing sample.
7. The method according to claim 4, wherein the compressing the screening sample corresponding to the de-duplicated sample to obtain a compressed sample corresponding to the screening sample comprises:
performing sliding window processing on the screened sample by using an intercepting window to obtain an intercepting sample corresponding to the screened sample, wherein the size of the intercepting window is the second threshold, the step length of each sliding of the intercepting window is a preset number of characters, and each intercepting sample comprises the second threshold number of characters;
when the intercepted sample is determined to belong to the compressed sample contained in the compressed sample set, sliding the intercepted window according to the step length, returning to the step of performing sliding window processing on the screened sample by using the intercepted window to obtain the intercepted sample corresponding to the screened sample until the intercepted sample is determined not to belong to the compressed sample contained in the compressed sample set;
when the intercepted sample is determined not to belong to the compressed sample already contained in the compressed sample set, determining the intercepted sample as the compressed sample corresponding to the screening sample.
8. The method of claim 7, wherein the performing a sliding window process on the filtered sample by using the capture window to obtain a capture sample corresponding to the filtered sample comprises:
when the number of characters contained in the screening sample corresponding to the intercepting window is determined to be smaller than the size of the intercepting window, according to the character arrangement sequence of the screening sample, setting a difference number of completion characters at the tail part of the characters contained in the screening sample corresponding to the intercepting window, and extracting the characters contained in the screening sample corresponding to the intercepting window and the set completion characters as the intercepting sample corresponding to the screening sample, wherein the difference number is determined by the difference between the size of the intercepting window and the number of characters to be intercepted in the screening sample corresponding to the intercepting window;
and when the number of characters contained in the screening sample corresponding to the intercepting window is determined to be larger than or equal to the size of the intercepting window, extracting characters corresponding to the intercepting window from the screening sample as the intercepting sample corresponding to the screening sample according to the character arrangement sequence of the screening sample.
9. The method according to any of claims 2-6, wherein if the de-duplicated sample contains the preset character, the preset character is set at the leading position of the de-duplicated sample.
10. The method of claim 2, wherein obtaining the word frequency list corresponding to the training sample set comprises:
respectively counting the occurrence times of each attention character in the training sample set;
and sequencing the attention characters in the training sample set according to the occurrence frequency of each attention character to obtain a character frequency list corresponding to the training sample set.
11. An apparatus for intent recognition, the apparatus comprising:
a training sample acquisition module, configured to acquire a training sample set, where the training sample set includes multiple training samples, each training sample is a character combination of an abstract intention, and the training samples at least include attention characters arranged in order;
the sample compression processing module is used for compressing each training sample according to the occurrence times of characters to obtain a compressed sample set when the total number of characters of the training sample set after the duplication removal is greater than or equal to a first threshold value, wherein the compressed sample set comprises a plurality of compressed samples;
and the model training module is used for inputting the compressed sample set into a pre-constructed mask language model for training and outputting to obtain an intention recognition result, and the mask language model is used for filling the input compressed sample into a preset position in a language prompt template to be trained and then training.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-10 when executing the program.
13. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-10.
CN202210262775.0A 2022-03-17 2022-03-17 Intention recognition method, apparatus, device and medium Pending CN114706943A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210262775.0A CN114706943A (en) 2022-03-17 2022-03-17 Intention recognition method, apparatus, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210262775.0A CN114706943A (en) 2022-03-17 2022-03-17 Intention recognition method, apparatus, device and medium

Publications (1)

Publication Number Publication Date
CN114706943A true CN114706943A (en) 2022-07-05

Family

ID=82168344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210262775.0A Pending CN114706943A (en) 2022-03-17 2022-03-17 Intention recognition method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN114706943A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035510A (en) * 2022-08-11 2022-09-09 深圳前海环融联易信息科技服务有限公司 Text recognition model training method, text recognition device, and medium


Similar Documents

Publication Publication Date Title
CN109522553B (en) Named entity identification method and device
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN112131881B (en) Information extraction method and device, electronic equipment and storage medium
US10915756B2 (en) Method and apparatus for determining (raw) video materials for news
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN108228567B (en) Method and device for extracting short names of organizations
CN111221936B (en) Information matching method and device, electronic equipment and storage medium
CN111241285A (en) Method, device, equipment and storage medium for identifying question answer types
CN112364664B (en) Training of intention recognition model, intention recognition method, device and storage medium
CN110955766A (en) Method and system for automatically expanding intelligent customer service standard problem pairs
CN112188311B (en) Method and apparatus for determining video material of news
CN115329176A (en) Search request processing method and device, computer equipment and storage medium
CN113239668B (en) Keyword intelligent extraction method and device, computer equipment and storage medium
CN114706943A (en) Intention recognition method, apparatus, device and medium
CN115952854B (en) Training method of text desensitization model, text desensitization method and application
CN111159370A (en) Short-session new problem generation method, storage medium and man-machine interaction device
CN115840808A (en) Scientific and technological project consultation method, device, server and computer-readable storage medium
CN116306974A (en) Model training method and device of question-answering system, electronic equipment and storage medium
CN111522957B (en) Training method and system for phrase segmentation model
CN115730051A (en) Text processing method and device, electronic equipment and storage medium
CN110502741B (en) Chinese text recognition method and device
CN115168544A (en) Information extraction method, electronic device and storage medium
CN113869049A (en) Fact extraction method and device with legal attribute based on legal consultation problem
CN113705194A (en) Extraction method and electronic equipment for short
CN111723188A (en) Sentence display method and electronic equipment based on artificial intelligence for question-answering system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination