CN113345426B - Voice intention recognition method and device and readable storage medium - Google Patents

Voice intention recognition method and device and readable storage medium Download PDF

Info

Publication number
CN113345426B
Authority
CN
China
Prior art keywords
training
data
intention
model
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110616990.1A
Other languages
Chinese (zh)
Other versions
CN113345426A (en)
Inventor
张勇
刘升平
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202110616990.1A
Publication of CN113345426A
Application granted
Publication of CN113345426B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Abstract

The invention provides a voice intention recognition method, a device and a readable storage medium, which relate to the field of voice intention recognition. The voice intention recognition method comprises the following steps: collecting first training data; inputting the first training data into a training model to complete forward calculation of the first training data; calculating a first loss function of the training model according to the forward calculation result of the first training data; obtaining countermeasure sample data according to the first loss function; acquiring second training data, wherein the second training data at least comprises the countermeasure sample data and the first training data; inputting the second training data into the training model, and determining parameters of the training model to obtain a target model; and determining an intention classification of the voice data to be intention classified according to the target model. The method and the device are used for accurately classifying voice intentions so as to improve the user experience.

Description

Voice intention recognition method and device and readable storage medium
[ technical field ]
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for recognizing a speech intention, and a readable storage medium.
[ background of the invention ]
With the rapid development of artificial intelligence technology and the wide use of artificial intelligence technology in life, voice interaction becomes an important bridge for communication between people and machines. The voice intention recognition technology is one of the key technologies for realizing voice interaction.
Intention recognition performs semantic understanding on text through Natural Language Processing (NLP) technology to obtain keyword information, and recognizes the user's speech intention based on the keyword information. At present, intention recognition cannot distinguish intentions according to context, tone, and the like, and because of this poor recognition capability it may even recognize an intention opposite to the original intention.
Therefore, how to accurately recognize the intention of the voice is one of the technical difficulties in the field.
[ summary of the invention ]
In view of the above, embodiments of the present invention provide a method, an apparatus, and a readable storage medium for recognizing a speech intention, which are used to accurately recognize a speech intention.
One aspect of the present invention provides a speech intention recognition method, including:
collecting first training data;
inputting the first training data to a training model to complete a forward calculation of the first training data;
calculating a first loss function of the training model according to a forward data calculation result of the first training data;
obtaining countermeasure sample data according to the first loss function;
acquiring second training data, wherein the second training data at least comprises the countermeasure sample data and the first training data;
inputting the second training data into the training model, and determining parameters of the training model to obtain a target model;
and determining an intention classification of the voice data to be intention classified according to the target model.
Optionally, the determining an intention classification of the speech data to be intention classified according to the target model includes:
performing intention recognition on each piece of speech text, and, when the recognized intention differs from the corresponding original intention label, marking the speech text and recording it in a negative classification intention data set;
or performing intention recognition on each piece of speech text, and, when the recognized intention is the same as the corresponding original intention label, marking the speech text and recording it in a positive classification intention data set.
Optionally, the inputting the first training data into a training model to complete forward calculation of the first training data includes:

ŷ_i = f(x_i; θ)

wherein ŷ_i represents the result of the forward calculation, θ represents the parameters of the training model, f represents the forward function of the training model, x_i represents each piece of speech text, and y_i represents the original intention label corresponding to each piece of speech text.
Optionally, the calculating a first loss function of the training model according to the forward data calculation result of the first training data includes:

loss_i = J(f(x_i; θ), y_i)

wherein J represents the cross-entropy function and loss_i represents the first loss function.
Optionally, the obtaining countermeasure sample data according to the first loss function includes:
calculating a first gradient of the training model according to the first loss function;
and calculating the countermeasure sample data according to the first gradient.
Optionally, the calculating a first gradient of the training model according to the first loss function includes:

grad_i = ∂loss_i / ∂x_i

wherein grad_i represents the gradient of the training model with respect to the input x_i.
Optionally, the obtaining the countermeasure sample data according to the first gradient includes:

X_adv = x_i + ε · sign(grad_i)

wherein sign(grad_i) represents the sign function applied to the gradient, and ε represents the coefficient of the sign function, with 0 ≤ ε ≤ 1.
Optionally, the inputting the second training data into the training model, and determining parameters of the training model to obtain a target model includes:
calculating a second loss function of the training model from the second training data;
calculating a second gradient of the training model according to the second loss function;
and determining parameters of the training model according to the second gradient, and taking the training model with the determined parameters as the target model.
Optionally, the calculating a second loss function of the training model according to the second training data includes calculating a loss composed of two parts [formula image not reproduced]: the first part is a cross-entropy loss function over the second training data, and the second part is the intention-correct ratio in the second training data, wherein 1 ≤ i ≤ n and n is a natural number.
The present invention in its second aspect provides a speech intention recognition apparatus comprising:
the acquisition module is used for acquiring first training data;
the calculation module is used for inputting the first training data into a training model so as to complete forward calculation of the first training data;
the calculation module is further configured to calculate a first loss function of the training model according to a forward data calculation result of the first training data;
the calculation module is also used for obtaining countermeasure sample data according to the first loss function;
the acquisition module is further used for acquiring second training data, wherein the second training data at least comprises the countermeasure sample data and the first training data;
the calculation module is further used for inputting the second training data into the training model, and determining parameters of the training model to obtain a target model;
and the intention classification module is used for determining the intention classification of the voice data to be intention classified according to the target model.
A third aspect of the present invention provides a speech intent recognition apparatus comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured for execution by the processor, the programs comprising instructions for performing any of the steps of the first aspect of the present invention.
A fourth aspect of the present invention provides a computer-readable storage medium storing a computer program, which is executed by a processor to implement the speech intention recognition method according to any one of the first aspect of the present invention.
Any one of the above technical solutions has the following beneficial effects:
In an embodiment of the present invention, first training data is collected, and the first training data may be understood to include speech intention data. The first training data is input into a training model; in this embodiment, the training model may be any algorithm model for speech intention recognition, including but not limited to a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN). Forward calculation of the first training data is completed, a first loss function of the training model is calculated according to the forward calculation result of the first training data, countermeasure sample data is obtained according to the first loss function, and the countermeasure training of the training model is then completed; the robustness of the training model is enhanced by means of countermeasure training and contrastive learning. Second training data is acquired, the second training data comprising at least the countermeasure sample data and the first training data. The training model is trained with the second training data as speech intention data to determine the parameters of the training model; when the parameters of the training model reach their optimum, they are recorded and fixed, and the training model at that point is the target model. Further, the speech data to be intention classified is intention classified using the target model. The method improves the intention recognition capability in such scenarios and improves the actual experience of the user. Meanwhile, the scheme can be embedded in any type of deep-learning speech intention classification algorithm, so its application range is wide.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
FIG. 1 is a flowchart illustrating a voice intent recognition method according to an embodiment of the present invention;
fig. 2 is another flow chart of the voice intention recognition method according to the embodiment of the present invention.
[ detailed description ]
For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects, meaning that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
It should be noted that the terms "upper", "lower", "left", "right", and the like used in the description of the embodiments of the present invention are used in the angle shown in the drawings, and should not be construed as limiting the embodiments of the present invention. In addition, in this context, it will also be understood that when an element is referred to as being "on" or "under" another element, it can be directly formed on "or" under "the other element or be indirectly formed on" or "under" the other element through intervening elements.
The present invention provides a voice intention recognition method, as shown in fig. 1, which is a flow diagram of the voice intention recognition method provided by the embodiment of the present invention, and the voice intention recognition method includes:
s11, collecting first training data;
in this embodiment, the first training data may be represented by the following formula:
Figure BDA0003098055410000061
wherein x is i Representing each piece of phonetic text, y i And representing an original intention label corresponding to each voice text, wherein i is more than or equal to 1 and less than or equal to n, and n is a natural number.
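For concreteness, the first training data can be pictured as a list of (speech text, original intention label) pairs. The texts and label names below are invented for illustration and do not come from the patent:

```python
# First training data X = {(x_i, y_i) | 1 <= i <= n}; contents are hypothetical.
first_training_data = [
    ("turn on the living room light", "device_control"),
    ("what will the weather be tomorrow", "weather_query"),
    ("play some relaxing music", "media_play"),
]
```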
S12, inputting the first training data into a training model to complete forward calculation of the first training data;
Illustratively, the training model in the present embodiment may be a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN).
S13, calculating a first loss function of the training model according to a forward data calculation result of the first training data;
S14, obtaining countermeasure sample data according to the first loss function;
S15, acquiring second training data, wherein the second training data at least comprises the countermeasure sample data and the first training data;
s16, inputting the second training data into the training model, and determining parameters of the training model to obtain a target model;
and S17, determining the intention classification of the voice data to be intention classified according to the target model.
It should be noted that, in this embodiment, speech intention recognition may be divided into two parts. The first part may be understood as a training stage, i.e., the training of the training model. The second part may be understood as a testing stage: the model obtained after the parameters of the training model are determined may be referred to as the target model, and in this stage the target model performs intention recognition on the data to be subjected to speech intention recognition and classifies it according to the recognition result.
In an embodiment of the present invention, first training data is collected, and the first training data may be understood to include speech intention data. The first training data is input into a training model; in this embodiment, the training model may be any algorithm model for speech intention recognition, including but not limited to a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN). Forward calculation of the first training data is completed, a first loss function of the training model is calculated according to the forward calculation result of the first training data, countermeasure sample data is obtained according to the first loss function, and the countermeasure training of the training model is then completed; this embodiment enhances the robustness of the training model by means of countermeasure training and contrastive learning. Second training data is acquired, the second training data comprising at least the countermeasure sample data and the first training data. The training model is trained with the second training data as speech intention data to determine the parameters of the training model; when the parameters of the training model reach their optimum, they are recorded and fixed, and the training model at that point is the target model. Further, the speech data to be intention classified is intention classified using the target model. The method improves the intention recognition capability in such scenarios and improves the actual experience of the user. Meanwhile, the scheme can be embedded in any type of deep-learning speech intention classification algorithm, so its application range is wide.
In the training data for speech intention recognition, negative classification intention label data is conventionally added to the intention data in a certain proportion in order to increase the accuracy of the training model, typically with positive classification intention label data and negative classification intention label data kept at a fixed ratio. When the proportion of negative classification intention data is large and the corresponding intention space is large, the recognition capability for positive classification intentions is poor, and a negative classification intention label is often recognized as a positive classification intention label. Therefore, subsequent manual checking is required to guarantee recognition quality.
In this embodiment, the robustness of the training model is enhanced in the training stage through countermeasure training and contrastive learning on countermeasure samples. Meanwhile, because no negative classification intention data is added to the speech intention training data, the possibility of a negative classification intention label being recognized as positive is avoided, which improves the speech intention recognition capability and thus the user experience. In addition, avoiding negative classification intention data compresses the storage space needed for the speech data to be trained and speeds up training on the one hand, and removes the need for subsequent manual checking on the other, while the recognition capability of speech intention recognition is still guaranteed.
Optionally, in step S12, inputting the first training data into a training model to complete forward calculation of the first training data specifically includes:

ŷ_i = f(x_i; θ)

wherein ŷ_i represents the result of the forward calculation, θ represents the parameters of the training model, f represents the forward function of the training model, x_i represents each piece of speech text, and y_i represents the original intention label corresponding to each piece of speech text.
Further, the calculating a first loss function of the training model according to the forward data calculation result of the first training data includes:

loss_i = J(f(x_i; θ), y_i)

wherein J represents the cross-entropy function and loss_i represents the first loss function.
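For illustration only, the forward calculation and the first loss function can be sketched in PyTorch as follows. The network structure, layer sizes, and all variable names are hypothetical choices, since the patent only requires some CNN- or RNN-style classifier f(·; θ) and the cross-entropy function J:

```python
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Hypothetical stand-in for the training model f(.; theta)."""
    def __init__(self, vocab_size=30000, embed_dim=64, hidden=128, num_intents=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_intents)

    def forward(self, token_ids):
        emb = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        _, h = self.encoder(emb)             # final hidden state: (1, batch, hidden)
        return self.head(h.squeeze(0))       # intent logits: (batch, num_intents)

model = IntentClassifier()
criterion = nn.CrossEntropyLoss()            # the cross-entropy function J

x_i = torch.randint(0, 30000, (8, 16))       # a toy batch of tokenized speech texts
y_i = torch.randint(0, 10, (8,))             # original intention labels y_i

logits = model(x_i)                          # forward calculation f(x_i; theta)
loss_i = criterion(logits, y_i)              # first loss: J(f(x_i; theta), y_i)
```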
Still further, as shown in fig. 2, which is another flow chart of the voice intention recognition method according to the embodiment of the present invention, the obtaining countermeasure sample data according to the first loss function includes:
s141, calculating a first gradient of the training model according to the first loss function;
and S142, calculating according to the first gradient to obtain the countermeasure sample data.
Optionally, step S141, calculating a first gradient of the training model according to the first loss function, specifically includes:

grad_i = ∂loss_i / ∂x_i

wherein grad_i represents the gradient of the training model with respect to the input x_i.
Optionally, the obtaining the countermeasure sample data by calculating according to the first gradient includes:

X_adv = x_i + ε · sign(grad_i)

In this embodiment, sign(grad_i) = 1 when grad_i > 0 and sign(grad_i) = -1 when grad_i < 0, wherein sign(grad_i) represents the sign function applied to the gradient and ε represents the coefficient of the sign function, with 0 ≤ ε ≤ 1.
Further, the countermeasure sample data X_adv and the first training data X form new training data, namely second training data X_ADA, where X_ADA = X ∪ X_adv. The second training data may be used as new speech intention training data to train the training model.
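A minimal sketch of forming X_ADA = X ∪ X_adv, assuming, as is usual in adversarial training although the patent does not state it explicitly, that each countermeasure sample inherits the intention label of the sample from which it was derived:

```python
# Second training data in embedding space: original plus countermeasure samples.
second_emb = torch.cat([emb.detach(), emb_adv], dim=0)  # X_ADA = X ∪ X_adv
second_y = torch.cat([y_i, y_i], dim=0)                 # labels carried over from x_i
```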
In this embodiment, the second training data includes the countermeasure sample data, so that the robustness of the training model is enhanced through countermeasure training and contrastive learning.
In this embodiment, using the second training data X_ADA as the new speech intention training data to train the model may, for example, include:

setting the number of training batches of the training model as b, wherein each batch contains at least one speech text and the original intention label corresponding to that speech text;

performing intention recognition on the training data (x_i, y_i) of a certain intention within a batch, which yields recognized intentions y_k with y_k = y_i and recognized intentions y_m with y_m ≠ y_i;

if y_k is the same as the original intention label y_i, the sample belongs to the positive classification intention data set;

if y_m is not the same as the original intention label y_i, the sample belongs to the negative classification intention data set.
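As a sketch of this marking step, reusing the hypothetical model and batch from the earlier snippets; the names positive_set and negative_set are illustrative, since the original set symbols appear only as unreproduced formula images:

```python
with torch.no_grad():
    preds = model(x_i).argmax(dim=-1)        # recognized intentions for the batch

pos_mask = preds.eq(y_i)                     # y_k == y_i: positive classification intention set
neg_mask = ~pos_mask                         # y_m != y_i: negative classification intention set
positive_set = [(x.tolist(), int(y)) for x, y in zip(x_i[pos_mask], y_i[pos_mask])]
negative_set = [(x.tolist(), int(y)) for x, y in zip(x_i[neg_mask], y_i[neg_mask])]
```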
Further, the determining an intention classification of the speech data to be intention classified according to the target model includes:

performing intention recognition on each piece of speech text, and, when the recognized intention differs from the corresponding original intention label, marking the speech text and recording it in the negative classification intention data set;

or performing intention recognition on each piece of speech text, and, when the recognized intention is the same as the corresponding original intention label, marking the speech text and recording it in the positive classification intention data set.
By classifying the speech data according to intention, the speech can be further classified according to voice, intonation, conversation relationship, conversation content, conversation scene, and the like, so as to improve recognition accuracy.
In another embodiment, the inputting the second training data into the training model, and determining parameters of the training model to obtain the target model, includes:
calculating a second loss function of the training model from the second training data;
calculating a second gradient of the training model according to the second loss function;
and determining parameters of the training model according to the second gradient, and taking the training model with the determined parameters as the target model.
It will be appreciated that, after the second loss function and the second gradient are calculated, the parameters of the training model may be modified accordingly; once training converges, the modified parameter values remain unchanged, and the training model using the modified (optimized) parameter values may be understood as the target model.
Further, the calculating a second loss function of the training model according to the second training data includes calculating a loss composed of two parts over the second training data X_ADA [formula image not reproduced]: the first part is a cross-entropy loss function over the second training data, and the second part is the intention-correct ratio in the second training data, wherein 1 ≤ i ≤ n and n is a natural number.
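Since the exact formula survives only as an unreproduced image, the sketch below follows just the stated two-part structure: a cross-entropy term over the second training data plus the intention-correct ratio. How the two parts are combined is an assumption; here the ratio is computed without gradients, so the cross-entropy term alone drives the second gradient and the parameter update:

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(emb_batch, labels):
    """One update on the second training data X_ADA (illustrative only)."""
    optimizer.zero_grad()
    _, h = model.encoder(emb_batch)
    logits = model.head(h.squeeze(0))
    ce = criterion(logits, labels)                        # part 1: cross-entropy over X_ADA
    with torch.no_grad():                                 # part 2: intention-correct ratio
        ratio = logits.argmax(dim=-1).eq(labels).float().mean()
    ce.backward()                                         # second gradient
    optimizer.step()                                      # parameters moved toward their optimum
    return ce.item(), ratio.item()

ce_val, ratio_val = training_step(second_emb, second_y)
```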
In this embodiment, the inventors found through experiments that when the highest score output by the target model is lower than 0.8, the speech text input into the target model is considered inconsistent with its corresponding original intention label; otherwise, the two are considered consistent. Note that the outputs of the target model range between 0 and 1.
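A hedged sketch of this 0.8 decision rule at test time; softmax is assumed here as the mechanism that keeps the target model's outputs between 0 and 1, which the patent implies but does not name:

```python
def check_consistency(logits, threshold=0.8):
    """Treat a prediction as consistent with the original intention label
    only when the model's highest score reaches the threshold (sketch)."""
    probs = torch.softmax(logits, dim=-1)    # scores lie in [0, 1]
    top_score, intent = probs.max(dim=-1)
    return intent, top_score >= threshold

intents, consistent = check_consistency(model(x_i))
```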
In another embodiment of the present invention, there is provided a speech intention recognition apparatus including:
the acquisition module is used for acquiring first training data;
the calculation module is used for inputting the first training data into a training model so as to complete forward calculation of the first training data;
the calculation module is further used for calculating a first loss function of the training model according to a forward data calculation result of the first training data;
the calculation module is also used for obtaining countermeasure sample data according to the first loss function;
the acquisition module is further configured to acquire second training data, where the second training data at least includes the countermeasure sample data and the first training data;
the calculation module is further used for inputting the second training data into the training model, and determining parameters of the training model to obtain a target model;
and the intention classification module is used for determining the intention classification of the voice data to be intention classified according to the target model.
The voice intention recognition device of the embodiment of the present invention can implement any step of the voice intention recognition method described above, and therefore includes all the beneficial effects of that method: first training data is collected, which may be understood to include speech intention data. The first training data is input into a training model; in this embodiment, the training model may be any algorithm model for speech intention recognition, including but not limited to a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN). Forward calculation of the first training data is completed, a first loss function of the training model is calculated according to the forward calculation result of the first training data, countermeasure sample data is obtained according to the first loss function, and the countermeasure training of the training model is then completed; the robustness of the training model is enhanced by means of countermeasure training and contrastive learning. Second training data is acquired, the second training data comprising at least the countermeasure sample data and the first training data. The training model is trained with the second training data as speech intention data to determine the parameters of the training model; when the parameters of the training model reach their optimum, they are recorded and fixed, and the training model at that point is the target model. Further, the speech data to be intention classified is intention classified using the target model. The method improves the intention recognition capability in such scenarios and improves the actual experience of the user. Meanwhile, the scheme can be embedded in any type of deep-learning speech intention classification algorithm, so its application range is wide.
In this embodiment, the robustness of the training model is enhanced in the training stage through countermeasure training and contrastive learning on countermeasure samples. Meanwhile, because no negative classification intention data is added to the speech intention training data, the possibility of a negative classification intention label being recognized as positive is avoided, which improves the speech intention recognition capability and thus the user experience. In addition, avoiding negative classification intention data compresses the storage space needed for the speech data to be trained and speeds up training on the one hand, and removes the need for subsequent manual checking on the other, while the recognition capability of speech intention recognition is still guaranteed.
In another embodiment of the present invention, a speech intent recognition apparatus is provided that includes a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing any of the steps of the speech intent recognition method of the present invention.
The voice intention recognition device of the embodiment of the present invention can implement any step of the voice intention recognition method described above, and therefore includes all the beneficial effects of that method: first training data is collected, which may be understood to include speech intention data. The first training data is input into a training model; in this embodiment, the training model may be any algorithm model for speech intention recognition, including but not limited to a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN). Forward calculation of the first training data is completed, a first loss function of the training model is calculated according to the forward calculation result of the first training data, countermeasure sample data is obtained according to the first loss function, and the countermeasure training of the training model is then completed; the robustness of the training model is enhanced by means of countermeasure training and contrastive learning. Second training data is acquired, the second training data comprising at least the countermeasure sample data and the first training data. The training model is trained with the second training data as speech intention data to determine the parameters of the training model; when the parameters of the training model reach their optimum, they are recorded and fixed, and the training model at that point is the target model. Further, the speech data to be intention classified is intention classified using the target model. The method improves the intention recognition capability in such scenarios and improves the actual experience of the user. Meanwhile, the scheme can be embedded in any type of deep-learning speech intention classification algorithm, so its application range is wide.
In this embodiment, the robustness of the training model is enhanced in the training stage through countermeasure training and contrastive learning on countermeasure samples. Meanwhile, because no negative classification intention data is added to the speech intention training data, the possibility of a negative classification intention label being recognized as positive is avoided, which improves the speech intention recognition capability and thus the user experience. In addition, avoiding negative classification intention data compresses the storage space needed for the speech data to be trained and speeds up training on the one hand, and removes the need for subsequent manual checking on the other, while the recognition capability of speech intention recognition is still guaranteed.
In another embodiment of the present invention, a computer-readable storage medium is provided, and the computer-readable storage medium stores a computer program, which is executed by a processor to implement any one of the voice intention recognition methods described in the present invention.
In the embodiment of the present invention, since the computer-readable storage medium of this embodiment can implement any step of the above method, it includes all the beneficial effects of the above method: first training data is collected, which may be understood to include speech intention data. The first training data is input into a training model; in this embodiment, the training model may be any algorithm model for speech intention recognition, including but not limited to a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN). Forward calculation of the first training data is completed, a first loss function of the training model is calculated according to the forward calculation result of the first training data, countermeasure sample data is obtained according to the first loss function, and the countermeasure training of the training model is then completed; the robustness of the training model is enhanced by means of countermeasure training and contrastive learning. Second training data is acquired, the second training data comprising at least the countermeasure sample data and the first training data. The training model is trained with the second training data as speech intention data to determine the parameters of the training model; when the parameters of the training model reach their optimum, they are recorded and fixed, and the training model at that point is the target model. Further, the speech data to be intention classified is intention classified using the target model. The method improves the intention recognition capability in such scenarios and improves the actual experience of the user. Meanwhile, the scheme can be embedded in any type of deep-learning speech intention classification algorithm, so its application range is wide.
In this embodiment, the robustness of the training model is enhanced in the training stage through countermeasure training and contrastive learning on countermeasure samples. Meanwhile, because no negative classification intention data is added to the speech intention training data, the possibility of a negative classification intention label being recognized as positive is avoided, which improves the speech intention recognition capability and thus the user experience. In addition, avoiding negative classification intention data compresses the storage space needed for the speech data to be trained and speeds up training on the one hand, and removes the need for subsequent manual checking on the other, while the recognition capability of speech intention recognition is still guaranteed.
The methods and apparatuses in the embodiments of the present disclosure may be implemented in terminal devices, which may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like.
An electronic device may include a processing means (e.g., a central processing unit, a graphics processor, etc.) that may perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) or a program loaded from a storage means into a random access memory (RAM). The RAM also stores various programs and data necessary for the operation of the electronic device. The processing means, the ROM, and the RAM are connected to each other by a bus, and an input/output (I/O) interface is also connected to the bus.
Generally, the following devices may be connected to the I/O interface: input devices including, for example, touch screens, touch pads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, and the like; output devices including, for example, liquid Crystal Displays (LCDs), speakers, vibrators, and the like; storage devices including, for example, magnetic tape, hard disk, etc.; and a communication device. The communication means may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data.
In particular, the processes described above with reference to the flow diagrams may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means, or installed from a storage means, or installed from a ROM. The computer program, when executed by a processing device, performs the functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may be separate and not incorporated into the electronic device.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases constitute a limitation of the module itself; for example, the acquisition module may also be described as a "module for acquiring speech data to be processed".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some interfaces, indirect coupling or communication connection between devices or units, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A speech intent recognition method, comprising:
collecting first training data;
inputting the first training data to a training model to complete forward calculation of the first training data;
calculating a first loss function of the training model according to a forward data calculation result of the first training data;
obtaining countermeasure sample data according to the first loss function;
acquiring second training data, wherein the second training data at least comprises the countermeasure sample data and the first training data;
inputting the second training data into the training model, and determining parameters of the training model to obtain a target model;
determining an intention classification of the voice data to be intention classified according to the target model;
wherein the determining an intention classification of the speech data to be intention classified according to the target model comprises: performing intention recognition on each piece of speech text, and, when the recognized intention differs from the corresponding original intention label, marking the speech text and recording it in a negative classification intention data set; or performing intention recognition on each piece of the speech text, and, when the recognized intention is the same as the corresponding original intention label, marking the speech text and recording it in the positive classification intention data set;
adding negative classification intention label data into the intention data in a certain proportion in the training data for speech intention recognition, wherein a speech text that does not match its corresponding intention label is taken as a negative classification intention label.
2. The speech intention recognition method according to claim 1, wherein the inputting the first training data into a training model to complete forward calculation of the first training data comprises:

ŷ_i = f(x_i; θ)

wherein ŷ_i represents the result of the forward calculation, θ represents the parameters of the training model, f represents the forward function of the training model, x_i represents each piece of speech text, and y_i represents the original intention label corresponding to each piece of speech text.
3. The speech intention recognition method according to claim 1, wherein the calculating a first loss function of the training model according to the forward data calculation result of the first training data comprises:

loss_i = J(f(x_i; θ), y_i)

wherein J represents the cross-entropy function and loss_i represents the first loss function.
4. The method according to claim 1, wherein obtaining countermeasure sample data according to the first loss function comprises:
calculating a first gradient of the training model according to the first loss function;
and calculating the countermeasure sample data according to the first gradient.
5. The speech intention recognition method according to claim 4, wherein the calculating a first gradient of the training model according to the first loss function comprises:

grad_i = ∂loss_i / ∂x_i

wherein grad_i represents the gradient of the training model with respect to the input x_i.
6. The speech intention recognition method according to claim 4, wherein the calculating the countermeasure sample data according to the first gradient comprises:

X_adv = x_i + ε · sign(grad_i)

wherein sign(grad_i) represents the sign function applied to the gradient, and ε represents the coefficient of the sign function, with 0 ≤ ε ≤ 1.
7. The method for recognizing speech intention according to claim 1, wherein the inputting the second training data into the training model, and determining parameters of the training model to obtain the target model comprises:
calculating a second loss function of the training model from the second training data;
calculating a second gradient of the training model according to the second loss function;
and determining parameters of the training model according to the second gradient, and taking the training model with the determined parameters as the target model.
8. A speech intent recognition apparatus, comprising:
the acquisition module is used for acquiring first training data;
the calculation module is used for inputting the first training data into a training model so as to complete forward calculation of the first training data;
the calculation module is further used for calculating a first loss function of the training model according to a forward data calculation result of the first training data;
the calculation module is also used for obtaining countermeasure sample data according to the first loss function;
the acquisition module is further used for acquiring second training data, wherein the second training data at least comprises the countermeasure sample data and the first training data;
the calculation module is further used for inputting the second training data into the training model, and determining parameters of the training model to obtain a target model;
the intention classification module is used for determining the intention classification of the voice data to be intention classified according to the target model;
wherein the intention classification module is further configured to perform the following:
performing intention recognition on each piece of speech text, and, when the recognized intention differs from the corresponding original intention label, marking the speech text and recording it in a negative classification intention data set;
or performing intention recognition on each piece of the speech text, and, when the recognized intention is the same as the corresponding original intention label, marking the speech text and recording it in the positive classification intention data set;
wherein the speech intention recognition apparatus is further configured to: add negative classification intention label data into the intention data in a certain proportion in the training data for speech intention recognition, wherein a speech text that does not match its corresponding intention label is taken as a negative classification intention label.
9. A speech intent recognition apparatus comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which is executed by a processor to implement the speech intention recognition method of any one of claims 1-7.
CN202110616990.1A 2021-06-02 2021-06-02 Voice intention recognition method and device and readable storage medium Active CN113345426B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110616990.1A CN113345426B (en) 2021-06-02 2021-06-02 Voice intention recognition method and device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110616990.1A CN113345426B (en) 2021-06-02 2021-06-02 Voice intention recognition method and device and readable storage medium

Publications (2)

Publication Number Publication Date
CN113345426A CN113345426A (en) 2021-09-03
CN113345426B (en) 2023-02-28

Family

ID=77472829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110616990.1A Active CN113345426B (en) 2021-06-02 2021-06-02 Voice intention recognition method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN113345426B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10453444B2 (en) * 2017-07-27 2019-10-22 Microsoft Technology Licensing, Llc Intent and slot detection for digital assistants

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109712609A (en) * 2019-01-08 2019-05-03 华南理工大学 A method of it solving keyword and identifies imbalanced training sets
CN110457701A (en) * 2019-08-08 2019-11-15 南京邮电大学 Dual training method based on interpretation confrontation text
CN112860870A (en) * 2021-03-16 2021-05-28 云知声智能科技股份有限公司 Noise data identification method and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Influence of different utilization methods of training corpora on neural machine translation models; Kuang Shaohui et al.; Journal of Chinese Information Processing; 2018-08-15 (No. 08); full text *

Also Published As

Publication number Publication date
CN113345426A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN113470619B (en) Speech recognition method, device, medium and equipment
CN110826567B (en) Optical character recognition method, device, equipment and storage medium
CN112364860A (en) Training method and device of character recognition model and electronic equipment
CN112883968B (en) Image character recognition method, device, medium and electronic equipment
CN113765928B (en) Internet of things intrusion detection method, equipment and medium
CN113313064A (en) Character recognition method and device, readable medium and electronic equipment
CN112883967B (en) Image character recognition method, device, medium and electronic equipment
CN113378586B (en) Speech translation method, translation model training method, device, medium, and apparatus
CN115270717A (en) Method, device, equipment and medium for detecting vertical position
CN113408507B (en) Named entity identification method and device based on resume file and electronic equipment
CN112883966B (en) Image character recognition method, device, medium and electronic equipment
WO2023130925A1 (en) Font recognition method and apparatus, readable medium, and electronic device
CN111949837A (en) Information processing method, information processing apparatus, electronic device, and storage medium
CN113345426B (en) Voice intention recognition method and device and readable storage medium
CN116244431A (en) Text classification method, device, medium and electronic equipment
CN112685996B (en) Text punctuation prediction method and device, readable medium and electronic equipment
CN111460214B (en) Classification model training method, audio classification method, device, medium and equipment
CN114429629A (en) Image processing method and device, readable storage medium and electronic equipment
CN112669816A (en) Model training method, speech recognition method, device, medium and equipment
CN111797263A (en) Image label generation method, device, equipment and computer readable medium
CN111680754A (en) Image classification method and device, electronic equipment and computer-readable storage medium
CN111581455A (en) Text generation model generation method and device and electronic equipment
CN111797932B (en) Image classification method, apparatus, device and computer readable medium
CN114625876B (en) Method for generating author characteristic model, method and device for processing author information
CN112346630B (en) State determination method, device, equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant