CN114691839A - Intent and slot recognition method - Google Patents

Intent and slot recognition method

Info

Publication number
CN114691839A
Authority
CN
China
Prior art keywords
vector
text sequence
intent
electronic device
slot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011623049.4A
Other languages
Chinese (zh)
Inventor
祝官文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202011623049.4A
Publication of CN114691839A
Legal status: Pending

Classifications

    • G06F16/3329 — Physics; Computing; Electric digital data processing; Information retrieval of unstructured textual data; Querying; Query formulation; Natural language query formulation or dialogue systems
    • G06F16/35 — Information retrieval of unstructured textual data; Clustering; Classification
    • G06F18/22 — Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
    • G06F40/35 — Handling natural language data; Semantic analysis; Discourse or dialogue representation


Abstract

The present application provides an intent and slot recognition method. An electronic device recognizes a text sequence from a speech signal and extracts a sentence vector and word vectors from the text sequence. The electronic device determines the similarity between the sentence vector, together with the word vectors after multi-head attention computation, and a plurality of preset intent labels, and extracts one or more intent labels of the text sequence based on this similarity. The electronic device then determines the similarity between the word vectors and the slot label templates corresponding to each of the one or more intent labels, and extracts the slot labels corresponding to each of the one or more intent labels. In this way, the slot encoding information of the text sequence is consulted when recognizing the intent of the text sequence, and the intent label encoding information is consulted when the electronic device extracts the slots of the text sequence, which improves the accuracy with which the electronic device recognizes multiple intents of a text sequence and extracts its slots.

Description

Intent and slot recognition method
Technical Field
The present application relates to the field of artificial intelligence, and in particular to an intent and slot recognition method.
Background
With the continuous development of and breakthroughs in artificial intelligence, human-computer interaction occurs ever more frequently in people's daily work and life. Voice interaction is one of the most convenient ways to interact, and human-machine dialogue systems are widely used in intelligent electronic devices such as mobile phones, televisions, and vehicles. In human-computer interaction, the key question is how the electronic device understands the user's intent.
Because human language is complex and ambiguous, a single sentence often expresses multiple intents, whether several genuine intents in one sentence or several candidate intents arising from ambiguity. Current human-machine dialogue systems, however, can output only one intent, which may differ from the intent the user actually expressed. Current methods are therefore inaccurate when recognizing multi-intent sentences.
Disclosure of Invention
The present application provides an intent and slot recognition method that consults the slot encoding information of a text sequence when recognizing the intent of the text sequence, embodying the constraint that the slot encoding information imposes on intent recognition, and consults the intent label encoding information when the electronic device extracts the slots of the text sequence, embodying the constraint that the intent imposes on slot extraction. This improves the accuracy with which the electronic device recognizes multiple intents of a text sequence and extracts the slots.
In a first aspect, the present application provides an intent and slot recognition method, comprising: an electronic device receives a speech signal input by a user; the electronic device recognizes a text sequence from the speech signal; the electronic device extracts a sentence vector and word vectors from the text sequence; the electronic device determines the similarity between the sentence vector and a plurality of preset intent labels and the similarity between the word vectors after multi-head attention computation and the plurality of preset intent labels; the electronic device extracts one or more intent labels of the text sequence based on these similarities; the electronic device determines the similarity between the word vectors and the slot label templates corresponding to each of the one or more intent labels; the electronic device extracts, from the text sequence, the slot labels corresponding to each of the one or more intent labels based on this similarity; and the electronic device executes the instructions corresponding to the one or more intent labels according to their respective slot labels.
The method consults the slot encoding information of the text sequence when recognizing its intent, embodying the constraint of the slot encoding information on intent recognition, and consults the intent label encoding information when extracting the slots of the text sequence, embodying the constraint of the intent on slot extraction. The accuracy with which the electronic device recognizes multiple intents of a text sequence and extracts its slots is thereby improved.
With reference to the first aspect, in one possible implementation, before the electronic device determines the similarity between the sentence vector and the plurality of preset intent labels and the similarity between the word vectors after multi-head attention computation and the plurality of preset intent labels, the method further comprises: the electronic device splices the sentence vector with the word vectors after multi-head self-attention computation to obtain a first text sequence vector. Determining the similarities then specifically comprises: the electronic device computes the vector distances between the first text sequence vector and the plurality of preset intent labels to obtain an intent probability vector of the text sequence.
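For ease of understanding, a minimal sketch of this intent-scoring step follows, assuming PyTorch and invented dimensions; it illustrates the described computation under stated assumptions and is not the application's implementation. The mean-pooling step and the scaled dot product are illustrative choices the application does not prescribe; realizing the "splicing" as addition corresponds to one implementation described below.

```python
# Illustrative sketch only: all dimensions, tensors, and names are assumptions.
import torch
import torch.nn as nn

d_model, n_heads, n_intents, seq_len = 256, 8, 12, 10    # hypothetical sizes

self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
intent_labels = torch.randn(n_intents, d_model)           # preset intent label embeddings

word_vecs = torch.randn(1, seq_len, d_model)              # word vectors of the text sequence
sent_vec = torch.randn(1, d_model)                        # sentence vector of the text sequence

attended, _ = self_attn(word_vecs, word_vecs, word_vecs)  # multi-head self-attention
pooled = attended.mean(dim=1)                             # collapse the word dimension
first_seq_vec = pooled + sent_vec                         # "splicing" realized as addition

# Vector distance between the first text sequence vector and each preset
# intent label, here a scaled dot product.
intent_dists = first_seq_vec @ intent_labels.T / d_model ** 0.5
```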
With reference to the first aspect, in one possible implementation, after the electronic device computes the vector distances between the first text sequence vector and the plurality of preset intent labels and before the intent probability vector of the text sequence is obtained, the method further comprises: the electronic device normalizes the vector distances between the first text sequence vector and the plurality of preset intent labels to obtain the intent probability vector of the text sequence.
With reference to the first aspect, in one possible implementation, extracting one or more intent labels of the text sequence based on the similarities specifically comprises: the electronic device outputs the intent labels corresponding to the one or more intent probabilities in the intent probability vector that are greater than a first preset probability, thereby determining the one or more intent labels of the text sequence.
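Continuing the sketch above (the threshold, label names, and the sigmoid choice are assumptions): normalizing the distances yields the intent probability vector, and every intent label whose probability exceeds the first preset probability is output, so one sentence may yield several intent labels.

```python
# Continuation of the sketch; threshold and label names are hypothetical.
FIRST_PRESET_PROB = 0.5
intent_names = [f"intent_{i}" for i in range(n_intents)]

# Sigmoid gives independent per-intent probabilities, which suits multi-intent
# output (a softmax would force the intents to compete for one winner).
intent_probs = torch.sigmoid(intent_dists)

selected_intents = [(name, float(p))
                    for name, p in zip(intent_names, intent_probs[0])
                    if p > FIRST_PRESET_PROB]
```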
With reference to the first aspect, in one possible implementation, the one or more intent labels are all of the intent labels of the text sequence within a preset domain.
With reference to the first aspect, in one possible implementation, determining the similarity between the word vectors and the slot label templates corresponding to each of the one or more intent labels specifically comprises: the electronic device performs multi-head attention computation on the word vectors and the intent label encoding vectors to obtain a second text sequence vector; the electronic device splices the second text sequence vector, the word vectors, and the intent probability vector to obtain a third text sequence vector; and the electronic device computes the vector distances between the third text sequence vector and the slot label templates corresponding to each of the one or more intent labels to obtain the similarity between the word vectors and the slot label templates corresponding to each of the one or more intent labels.
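A minimal continuation of the same sketch for this step, again under invented dimensions: the word vectors attend to the intent label encoding vectors, the result is combined with the word vectors and the intent probability vector, and the combination is scored against the slot label templates of one intent. Folding the intent probability vector in via the label embeddings is an assumption made so the dimensions match; the application itself only specifies splicing.

```python
# Continuation of the sketch; the additions realize the "splicing" described below.
n_slot_tags = 9                                           # hypothetical slot tag count
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
slot_templates = torch.randn(n_slot_tags, d_model)        # slot label templates of one intent

intent_enc = intent_labels.unsqueeze(0)                   # intent label encoding vectors
second_seq, _ = cross_attn(word_vecs, intent_enc, intent_enc)  # words attend to intents

intent_ctx = intent_probs @ intent_labels                 # fold in the intent probability vector
third_seq = second_seq + word_vecs + intent_ctx.unsqueeze(1)

slot_dists = third_seq @ slot_templates.T / d_model ** 0.5     # per-word slot distances
```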
With reference to the first aspect, in one possible implementation, after the electronic device computes the vector distances between the third text sequence vector and the slot label templates corresponding to each of the one or more intent labels, the method further comprises: the electronic device normalizes these vector distances to obtain a slot probability vector of the text sequence. After the electronic device obtains the slot probability vector of the text sequence, the method further comprises: the electronic device outputs the slot labels corresponding to the one or more slot probabilities in the slot probability vector that are greater than a second preset probability, thereby determining the one or more slot labels corresponding to the one or more intent labels of the text sequence.
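Completing the sketch (the tag names and threshold are invented): the distances are normalized into the slot probability vector, and the slot labels above the second preset probability are output.

```python
# Continuation of the sketch; IOB tag names are purely illustrative.
SECOND_PRESET_PROB = 0.5
slot_tag_names = ["O", "B-Date", "I-Date", "B-Location", "I-Location",
                  "B-Song", "I-Song", "B-Singer", "I-Singer"]

slot_probs = torch.softmax(slot_dists, dim=-1)            # normalize per word
best = slot_probs.argmax(dim=-1)[0]                       # best tag index per word
slot_tags = [slot_tag_names[int(i)]
             if float(slot_probs[0, w, i]) > SECOND_PRESET_PROB else "O"
             for w, i in enumerate(best)]
```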
With reference to the first aspect, in one possible implementation, splicing the sentence vector with the word vectors after multi-head self-attention computation to obtain a first text sequence vector specifically comprises: the electronic device adds the sentence vector to the word vectors after multi-head self-attention computation to obtain the first text sequence vector.
With reference to the first aspect, in one possible implementation, splicing the second text sequence vector, the word vectors, and the intent probability vector to obtain a third text sequence vector specifically comprises: the electronic device adds the second text sequence vector, the word vectors, and the intent probability vector to obtain the third text sequence vector.
In a second aspect, the present application provides an electronic device comprising one or more processors and one or more memories; the one or more memories are coupled to the one or more processors and are configured to store computer program code comprising computer instructions; the one or more processors invoke the computer instructions to cause the electronic device to perform the intent and slot recognition method described in any possible implementation of the first aspect.
In a third aspect, the present application provides a computer-readable storage medium comprising instructions that, when run on an electronic device, cause the electronic device to perform the intent and slot recognition method described in any possible implementation of the first aspect.
The method consults the slot encoding information of the text sequence when recognizing its intent, embodying the constraint of the slot encoding information on intent recognition, and consults the intent label encoding information when extracting the slots of the text sequence, embodying the constraint of the intent on slot extraction. The accuracy with which the electronic device recognizes multiple intents of a text sequence and extracts its slots is thereby improved.
Drawings
Fig. 1 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present application;
Fig. 2 is a block diagram of the software structure of an electronic device 100 according to an embodiment of the present application;
Fig. 3 is a schematic diagram of the relationship between an intent and its slots according to an embodiment of the present application;
Fig. 4 is a flowchart of an intent and slot classification method according to an embodiment of the present application;
Fig. 5 is a framework diagram of a machine-translation multi-label intent classifier according to an embodiment of the present application;
Fig. 6 is a schematic diagram of the slot labels of a text sequence under different intents according to an embodiment of the present application;
Fig. 7 is a schematic diagram of the slot labels of another text sequence under different intents according to an embodiment of the present application;
Fig. 8 is a schematic diagram of the slot labels of another text sequence under different intents according to an embodiment of the present application;
Fig. 9 is a framework diagram of an intent and slot recognition method according to an embodiment of the present application;
Fig. 10 is a flowchart of an intent and slot recognition method according to an embodiment of the present application;
Fig. 11 is a schematic diagram of the slot labels of a text sequence according to an embodiment of the present application;
Figs. 12-14 are a set of UI diagrams according to an embodiment of the present application;
Figs. 15-16 are another set of UI diagrams according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise specified, "/" means "or"; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, in the description of the embodiments of the present application, "a plurality of" means two or more.
In the following, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, unless otherwise stated, "a plurality of" means two or more.
Fig. 1 shows a schematic structural diagram of an electronic device 100.
The following describes an embodiment by taking the electronic device 100 as an example. It should be understood that the electronic device 100 shown in fig. 1 is merely an example; the electronic device 100 may have more or fewer components than shown in fig. 1, may combine two or more components, or may have a different configuration of components. The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software, including one or more signal-processing and/or application-specific integrated circuits.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identity module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units; for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be independent devices or may be integrated in one or more processors.
The controller can be the neural center and command center of the electronic device 100. The controller can generate operation control signals according to instruction operation codes and timing signals to control instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use those instructions or data again, it can call them directly from this memory, avoiding repeated accesses, reducing the waiting time of the processor 110, and thus improving system efficiency.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
The I2C interface is a bi-directional synchronous serial bus that includes a serial data line (SDA) and a Serial Clock Line (SCL). In some embodiments, processor 110 may include multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180K, the charger, the flash, the camera 193, etc. through different I2C bus interfaces, respectively. For example: the processor 110 may be coupled to the touch sensor 180K via an I2C interface, such that the processor 110 and the touch sensor 180K communicate via an I2C bus interface to implement the touch functionality of the electronic device 100.
The I2S interface may be used for audio communication. In some embodiments, processor 110 may include multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may communicate audio signals to the wireless communication module 160 via the I2S interface, enabling answering of calls via a bluetooth headset.
The PCM interface may also be used for audio communication, sampling, quantizing and encoding analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled by a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface, so as to implement a function of answering a call through a bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communications. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is generally used to connect the processor 110 with the wireless communication module 160. For example: the processor 110 communicates with a bluetooth module in the wireless communication module 160 through a UART interface to implement a bluetooth function. In some embodiments, the audio module 170 may transmit the audio signal to the wireless communication module 160 through a UART interface, so as to realize the function of playing music through a bluetooth headset.
MIPI interfaces may be used to connect processor 110 with peripheral devices such as display screen 194, camera 193, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a Display Serial Interface (DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the capture functionality of electronic device 100. The processor 110 and the display screen 194 communicate through the DSI interface to implement the display function of the electronic device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, to transmit data between the electronic device 100 and a peripheral device, or to connect earphones and play audio through the earphones. The interface may also be used to connect other electronic devices, such as AR devices.
It should be understood that the connection relationship between the modules according to the embodiment of the present invention is only illustrative and is not limited to the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive charging input from a wired charger via the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive a wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be used to monitor parameters such as battery capacity, battery cycle count, battery state of health (leakage, impedance), etc. In other embodiments, the power management module 141 may be disposed in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution including 2G/3G/4G/5G wireless communication applied to the electronic device 100. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like. The mobile communication module 150 may receive the electromagnetic wave from the antenna 1, filter, amplify, etc. the received electromagnetic wave, and transmit the electromagnetic wave to the modem processor for demodulation. The mobile communication module 150 may also amplify the signal modulated by the modem processor, and convert the signal into electromagnetic wave through the antenna 1 to radiate the electromagnetic wave. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating a low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then passes the demodulated low frequency baseband signal to a baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, etc.) or displays an image or video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional modules, independent of the processor 110.
The wireless communication module 160 may provide a solution for wireless communication applied to the electronic device 100, including Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), bluetooth (bluetooth, BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering processing on electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, perform frequency modulation and amplification on the signal, and convert the signal into electromagnetic waves through the antenna 2 to radiate the electromagnetic waves.
In some embodiments, the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150 and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with networks and other devices through wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a BeiDou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or a satellite based augmentation system (SBAS).
The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a mini-LED, a micro-LED, a micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The electronic device 100 may implement a shooting function through the ISP, the camera 193, the video codec, the GPU, the display 194, the application processor, and the like.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, the electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used to process digital signals, and can process other digital signals in addition to digital image signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to perform a Fourier transform or the like on the frequency bin energy.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor that processes input information quickly by drawing on the structure of biological neural networks, for example the transfer mode between neurons of the human brain, and can also continuously learn by itself. Applications such as intelligent cognition of the electronic device 100 can be implemented through the NPU, for example: image recognition, face recognition, speech recognition, and text understanding.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The storage data area may store data (such as audio data, phone book, etc.) created during use of the electronic device 100, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
The electronic device 100 may implement audio functions via the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into analog audio signals for output, and also used to convert analog audio inputs into digital audio signals. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also called a "horn", is used to convert the audio electrical signal into an acoustic signal. The electronic apparatus 100 can listen to music through the speaker 170A or listen to a handsfree call.
The receiver 170B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal. When the electronic apparatus 100 receives a call or voice information, it can receive voice by placing the receiver 170B close to the ear of the person.
The microphone 170C, also referred to as a "mic", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can input a sound signal to the microphone 170C by moving the mouth close to the microphone 170C while speaking. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C to implement a noise-reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording, and the like.
The headphone interface 170D is used to connect a wired headphone. The headset interface 170D may be the USB interface 130, or may be a 3.5mm open mobile electronic device platform (OMTP) standard interface, a cellular telecommunications industry association (cellular telecommunications industry association of the USA, CTIA) standard interface.
The pressure sensor 180A is used to sense a pressure signal and can convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. There are many types of pressure sensor 180A, such as resistive, inductive, and capacitive pressure sensors. A capacitive pressure sensor may comprise at least two parallel plates of electrically conductive material; when a force acts on the pressure sensor 180A, the capacitance between the electrodes changes, and the electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation acts on the display screen 194, the electronic device 100 detects the intensity of the touch operation through the pressure sensor 180A, and may also calculate the touch position from the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch position but with different intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than a first pressure threshold acts on the short message application icon, an instruction to view the short message is executed; when a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the short message application icon, an instruction to create a new short message is executed.
The gyro sensor 180B may be used to determine the motion attitude of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., the x, y, and z axes) may be determined by gyroscope sensor 180B. The gyro sensor 180B may be used for photographing anti-shake. Illustratively, when the shutter is pressed, the gyro sensor 180B detects a shake angle of the electronic device 100, calculates a distance to be compensated for the lens module according to the shake angle, and allows the lens to counteract the shake of the electronic device 100 through a reverse movement, thereby achieving anti-shake. The gyroscope sensor 180B may also be used for navigation, somatosensory gaming scenes.
The air pressure sensor 180C is used to measure air pressure. In some embodiments, electronic device 100 calculates altitude, aiding in positioning and navigation, from barometric pressure values measured by barometric pressure sensor 180C.
The magnetic sensor 180D includes a Hall sensor. The electronic device 100 may detect the opening and closing of a flip leather case using the magnetic sensor 180D. In some embodiments, when the electronic device 100 is a flip phone, the electronic device 100 may detect the opening and closing of the flip cover according to the magnetic sensor 180D, and then set features such as automatic unlocking upon flip-open according to the detected open or closed state of the leather case or of the flip cover.
The acceleration sensor 180E may detect the magnitude of the acceleration of the electronic device 100 in various directions (typically three axes), and can detect the magnitude and direction of gravity when the electronic device 100 is stationary. It can also be used to recognize the posture of the electronic device, and is applied to landscape/portrait switching, pedometers, and other applications.
The distance sensor 180F is used to measure distance. The electronic device 100 may measure distance by infrared or laser. In some embodiments, in a shooting scene, the electronic device 100 may use the distance sensor 180F to measure distance to achieve fast focusing.
The proximity light sensor 180G may include, for example, a light-emitting diode (LED) and a light detector such as a photodiode. The light-emitting diode may be an infrared light-emitting diode. The electronic device 100 emits infrared light outward through the light-emitting diode and uses the photodiode to detect infrared light reflected from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object near the electronic device 100; when insufficient reflected light is detected, the electronic device 100 may determine that there is no object near it. The electronic device 100 can use the proximity light sensor 180G to detect that the user is holding the electronic device 100 close to the ear for a call, so as to automatically turn off the screen to save power. The proximity light sensor 180G may also be used in holster mode and pocket mode to automatically unlock and lock the screen.
The ambient light sensor 180L is used to sense the ambient light level. Electronic device 100 may adaptively adjust the brightness of display screen 194 based on the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust the white balance when taking a picture. The ambient light sensor 180L may also cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in a pocket to prevent accidental touches.
The fingerprint sensor 180H is used to collect fingerprints. The electronic device 100 may use the collected fingerprint characteristics to implement fingerprint unlocking, application-lock access, fingerprint photographing, fingerprint-based call answering, and the like.
The temperature sensor 180J is used to detect temperature. In some embodiments, electronic device 100 implements a temperature processing strategy using the temperature detected by temperature sensor 180J. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold, the electronic device 100 performs a reduction in performance of a processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection. In other embodiments, the electronic device 100 heats the battery 142 when the temperature is below another threshold to avoid the low temperature causing the electronic device 100 to shut down abnormally. In other embodiments, when the temperature is lower than a further threshold, the electronic device 100 performs boosting on the output voltage of the battery 142 to avoid abnormal shutdown due to low temperature.
The touch sensor 180K is also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is used to detect a touch operation acting thereon or nearby. The touch sensor can communicate the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may be disposed on a surface of the electronic device 100, different from the position of the display screen 194.
The bone conduction sensor 180M may acquire a vibration signal. In some embodiments, the bone conduction sensor 180M may acquire a vibration signal of the human vocal part vibrating the bone mass. The bone conduction sensor 180M may also contact the human pulse to receive the blood pressure pulsation signal. In some embodiments, the bone conduction sensor 180M may also be disposed in a headset, integrated into a bone conduction headset. The audio module 170 may analyze a voice signal based on the vibration signal of the bone mass vibrated by the sound part acquired by the bone conduction sensor 180M, so as to implement a voice function. The application processor can analyze heart rate information based on the blood pressure beating signal acquired by the bone conduction sensor 180M, so as to realize the heart rate detection function.
The keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys. The electronic device 100 may receive key inputs and generate key signal inputs related to user settings and function control of the electronic device 100.
The motor 191 may generate vibration alerts, and may be used for incoming-call vibration alerts as well as touch vibration feedback. For example, touch operations on different applications (such as photographing and audio playing) may correspond to different vibration feedback effects, and touch operations on different areas of the display screen 194 may likewise correspond to different vibration feedback effects. Different application scenarios (such as time reminders, received messages, alarm clocks, and games) may also correspond to different vibration feedback effects. The touch vibration feedback effect may also be customized.
Indicator 192 may be an indicator light that may be used to indicate a state of charge, a change in charge, or a message, missed call, notification, etc.
The SIM card interface 195 is used to connect a SIM card. A SIM card can be brought into and out of contact with the electronic device 100 by being inserted into or pulled out of the SIM card interface 195. The electronic device 100 may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 195 may support a Nano SIM card, a Micro SIM card, a SIM card, and the like. Multiple cards can be inserted into the same SIM card interface 195 at the same time; the types of the cards may be the same or different. The SIM card interface 195 may also be compatible with different types of SIM cards and with external memory cards. The electronic device 100 interacts with the network through the SIM card to implement functions such as calling and data communication. In some embodiments, the electronic device 100 employs an eSIM, namely an embedded SIM card. The eSIM card can be embedded in the electronic device 100 and cannot be separated from the electronic device 100.
The software system of the electronic device 100 may employ a hierarchical architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present invention uses an Android system with a layered architecture as an example to exemplarily illustrate a software structure of the electronic device 100.
Fig. 2 is a block diagram of a software configuration of the electronic apparatus 100 according to the embodiment of the present invention.
The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into four layers, an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom.
The application layer may include a series of application packages.
As shown in fig. 2, the application package may include applications such as camera, gallery, calendar, phone call, map, navigation, WLAN, bluetooth, music, video, short message, etc.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 2, the application framework layers may include a window manager, content provider, view system, phone manager, resource manager, notification manager, and the like.
The window manager is used to manage window programs. The window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide communication functions of the electronic device 100. Such as management of call status (including on, off, etc.).
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The notification manager enables an application to display notification information in the status bar and can be used to convey notification-type messages, which disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify download completion, message alerts, and so on. The notification manager may also present notifications in the form of a chart or scroll-bar text in the top status bar of the system, such as notifications of applications running in the background, or notifications that appear on the screen in the form of a dialog window, for example text prompts in the status bar, alert sounds, device vibration, or a flashing indicator light.
The Android Runtime comprises a core library and a virtual machine. The Android runtime is responsible for scheduling and managing an Android system.
The core library comprises two parts: one part is the functions that the java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object life-cycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules. For example: surface managers (surface managers), Media Libraries (Media Libraries), three-dimensional graphics processing Libraries (e.g., OpenGL ES), 2D graphics engines (e.g., SGL), and the like.
The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications.
The media library supports playback and recording of audio and video in a variety of commonly used formats, as well as still image files. The media library may support a variety of audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The kernel layer comprises at least a display driver, a camera driver, an audio driver, and a sensor driver.
The following describes exemplary workflow of the software and hardware of the electronic device 100 in connection with capturing a photo scene.
When the touch sensor 180K receives a touch operation, a corresponding hardware interrupt is issued to the kernel layer. The kernel layer processes the touch operation into a raw input event (including touch coordinates, a time stamp of the touch operation, and other information). The raw input event is stored at the kernel layer. The application framework layer obtains the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking the touch operation being a tap operation and the control corresponding to the tap operation being the control of the camera application icon as an example, the camera application calls the interface of the application framework layer to start the camera application, which then starts the camera driver by calling the kernel layer, and captures a still image or video through the camera 193.
Next, some terms and related technologies referred to in the present application are explained for ease of understanding by those skilled in the art.
One, intention and slot position
(I), intention and slot definition:
Intent: what the electronic device identifies as the user's actual or potential need. Fundamentally, an intent is a classifier, which classifies the user's need into some previously defined type.

The intent and the slot together constitute a "user action". The electronic device cannot directly understand natural language, so intent recognition serves to map natural language into a machine-understandable, structured semantic representation.

Intent recognition is also called SUC (spoken utterance classification); as the name implies, the natural-language utterance input by the user is classified into categories, and the resulting categories correspond to user intents. For example, the intent of "how is the weather today" is "ask about the weather". Intent recognition can therefore naturally be seen as a typical classification problem. For example, the classification and definition of intents may refer to the ISO 24617-2 standard, which contains 56 detailed definitions. The definition of intents depends strongly on the positioning of the system itself and the knowledge base it has; that is, the definition of intents has very strong domain relevance. It is to be understood that in the embodiments of the present application, the classification and definition of intents is not limited to the ISO 24617-2 standard.

Slot: a parameter that accompanies an intent. An intent may correspond to several slots; for example, when inquiring about a bus route, necessary parameters such as the departure place, the destination, and the time need to be given. These parameters are the slots corresponding to the intention "inquire about a bus route".

For example, the main goal of the semantic slot filling task is to extract the values of the predefined semantic slots of a semantic frame from an input sentence, on the premise that the semantic frame is known for a specific domain or a specific intent. The semantic slot filling task can be converted into a sequence labeling task, namely labeling each token as the beginning (begin), the continuation (inside), or outside (outside) of a certain semantic slot using the IOB labeling scheme. For a system to work properly, the intents and slots are designed first. The intents and slots let the system know which specific task should be performed and give the types of the parameters needed to perform the task.
Taking the specific requirement "inquire about the weather" as an example, the design of intents and slots in a task-oriented dialog system is introduced:

an example of user input is: "how is the weather today in Shanghai";

user intent definition: ask about the weather, Ask_Weather;

slot definitions: a first slot: time, Date; a second slot: location, Location.
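Written out as data, the above design might look as follows; this is a minimal sketch in Python, and the word-level tokens and the exact tag strings are illustrative assumptions:

    # Intent and slot design for the "Ask_Weather" example (names illustrative).
    tokens = ["today", "Shanghai", "weather", "how"]   # simplified word tokens
    intent = "Ask_Weather"                             # sentence-level intent label
    iob_tags = ["B-Date", "B-Location", "O", "O"]      # one IOB tag per token

    # Reading off the tags fills the two predefined slots of the frame.
    slots = {"Date": "today", "Location": "Shanghai"}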
Fig. 3 is a schematic diagram of an intent and slot relationship in an embodiment of the present application. As shown in fig. 3 (a), in this example, two necessary slots are defined for the "ask about the weather" task, namely "time" and "location". For a single task, the above definition can meet the task requirement. However, in a real business environment, a system is often required to handle several tasks simultaneously; for example, a weather application should be able to answer both the question "ask about the weather" and the question "ask about the temperature".

For the complex situation in which the same system handles multiple tasks, one optimization strategy is to define a higher-level domain, for example attributing the "ask about the weather" intention and the "ask about the temperature" intention to the "weather" domain. In this case, a domain can be simply understood as a set of intents. The advantage of defining domains and performing domain identification first is that the knowledge range can be constrained to the domain, reducing the search space for subsequent intent recognition and slot filling. In addition, for each domain, with specific knowledge and characteristics related to its tasks, the effect of Natural Language Understanding (NLU) can be improved remarkably. Accordingly, the example of fig. 3 (a) is modified by adding the "weather" domain:
an example of user input is:
"how much today's Shanghai weather;
"how much temperature is in the present in Shanghai";
domain definition: weather, Weather;
user intent definitions:

1. ask about the weather, Ask_Weather;

2. ask about the temperature, Ask_Temperature;

slot definitions: a first slot: time, Date;

a second slot: location, Location.
The intents and slots corresponding to the modified "ask about the weather" requirement are shown in fig. 3 (b).
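The domain-intent-slot hierarchy above can be sketched as a nested mapping; the structure below is illustrative, not a definition from the embodiments:

    # Domain -> intents -> required slots, for the modified "weather" example.
    SCHEMA = {
        "Weather": {
            "Ask_Weather":     ["Date", "Location"],
            "Ask_Temperature": ["Date", "Location"],
        },
    }

    def intents_for(domain):
        # Domain identification first constrains the search space for
        # subsequent intent recognition and slot filling.
        return list(SCHEMA.get(domain, {}))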
(II), intent recognition and slot filling: after the intents and slots are defined, the user intent and the slot values corresponding to the respective slots can be identified from the user input.

The goal of intent recognition is to recognize the user intent from the input. A single task can simply be modeled as a binary classification problem; for example, the "ask about the weather" intent can be modeled at recognition time as a binary "ask about the weather" or "not ask about the weather" classification. When a system is required to handle multiple tasks, the system needs to be able to discriminate between the various intents, in which case the binary classification problem turns into a multi-class classification problem. The task of slot filling is to extract information from the input and fill the slots defined in advance. For example, in fig. 3, the intents and the corresponding slots have been defined; for the user input "how is the weather today in Shanghai", the system should be able to extract "today" and "Shanghai" and fill them into the "time" and "location" slots, respectively.
Second, the BERT (Bidirectional Encoder Representations from Transformers) model: the BERT model is the encoder of a bidirectional Transformer, where the Transformer is an architecture that relies entirely on self-attention to compute representations of its input and output. BERT uses a masked language model to realize the bidirectionality of the language model, and demonstrates the importance of bidirectionality to language representation pre-training. The BERT model is a truly bidirectional language model, in which each word can simultaneously use the context information on both of its sides. BERT aims to pre-train deep bidirectional representations by jointly conditioning on the context in all layers. Therefore, the pre-trained BERT representation can be fine-tuned with just one additional output layer, and is suitable for building state-of-the-art models for a wide range of tasks.

A fully connected layer is added to the BERT model and training is carried out; after training, the BERT model without the fully connected layer can be used to perform various natural language processing tasks (including sequence labeling tasks, classification tasks, sentence relation judgment, and generative tasks).
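As an illustration only (the embodiments do not prescribe a library or framework), such a fine-tuning setup might be sketched with PyTorch and the Hugging Face transformers package; the model name and the number of labels are assumptions:

    import torch.nn as nn
    from transformers import AutoModel

    class BertWithHead(nn.Module):
        # Pre-trained BERT encoder plus one fully connected output layer;
        # the whole stack is fine-tuned for a downstream task.
        def __init__(self, num_labels, name="bert-base-chinese"):
            super().__init__()
            self.bert = AutoModel.from_pretrained(name)
            self.head = nn.Linear(self.bert.config.hidden_size, num_labels)

        def forward(self, input_ids, attention_mask):
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            # [CLS] vector for classification; for sequence labeling the
            # per-token vectors out.last_hidden_state would be used instead.
            return self.head(out.last_hidden_state[:, 0])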
Third, an Artificial Intelligence (AI) model is a machine learning model, which is essentially a mathematical model comprising a large number of parameters and mathematical formulas (or mathematical rules). The aim is to learn a mathematical expression that captures the correlation between an input value x and an output value y; the mathematical expression that captures the correlation between x and y is the trained AI model. Generally, an AI model obtained by training an initial AI model with historical data (i.e., pairs of x and y) can be used to obtain a new y from a new x, thereby implementing predictive analysis; this process is also referred to as inference.
Fig. 4 is a flowchart illustrating a method of intent and slot classification, as shown in fig. 4.

The method is mainly based on neural network models organized in a hierarchical structure, and performs intent and slot classification by fusing a plurality of small models.

First, the method uses a domain discrimination model to classify the domain of the sentence to be understood and obtain the domain tag of each word.

That is, the sentence to be understood is the input of the domain discrimination model, and the domain discrimination model outputs the domain tag of each word in the sentence to be understood. Illustratively, the domain tags may be navigation, telephone, radio, weather, command control, and the like. Illustratively, if there are M words in the sentence to be understood, the domain discrimination model outputs the domain tags of the M words. A domain tag may include: the name of the domain classification to which the word belongs, and first position information of the word, where the first position information of the word is the position of the word among all the words meeting a first condition, the first condition being that a word in the sentence to be understood belongs to the same domain classification as this word.

Then, the intent and slot classification system performs intent classification using the intent discrimination model corresponding to the domain tag, to obtain the intent tag of each word.

It is understood that one or more intent tags are included under each domain tag. Illustratively, the intent tags under the navigation domain may include search, location, navigation, route, and the like; the intent tags under the radio domain may include play, favorites, and the like.

Assuming that N domain classifications are preset in the intent and slot classification system, each belonging to a different domain, each domain classification corresponds to one intent discrimination model, so there are N intent discrimination models. The words with the same domain tag in the sentence to be understood are taken as the input of the corresponding intent discrimination model, and the intent discrimination model outputs the intent tag of each word; in total, the intent tags of the M words are output. An intent tag may include: the name of the intent classification to which the word belongs, and second position information of the word, where the second position information of the word is the position of the word among all the words meeting a second condition, the second condition being that a word in the sentence to be understood belongs to the same domain classification and the same intent classification as this word.

Finally, the intent and slot classification system determines the slot tags of the sentence to be understood according to the domain tag and the intent tag of each word in the sentence.

Because different types of slot tags have different characteristics, the intent and slot classification system adopts different extraction modes in order to better extract the slot tags.
Method one: the grammar rule approach, which is suitable for extracting slot tags that cannot be enumerated but have strong regularity, such as time and amount; these slots are extracted by parsing with an ABNF grammar, i.e., regular expressions.

Method two: the dictionary approach, which is suitable for extracting enumerable slot tags. Illustratively, for city names, the dictionary is list(city) = {Beijing, Shanghai, Guangzhou, Shenzhen, Hangzhou, ...}.
Both of the above approaches can refer to the prior art and are not described further herein.
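For concreteness, methods one and two might look as follows in Python; the regular expressions and the city list are illustrative assumptions, not definitions from the prior art being described:

    import re

    # Method one: grammar rules / regular expressions, for slots that cannot
    # be enumerated but are strongly regular, such as time and amount.
    TIME_RE = re.compile(r"\b\d{1,2}:\d{2}\b")
    AMOUNT_RE = re.compile(r"\b\d+(?:\.\d+)?\s?(?:yuan|USD)\b")

    def extract_regex_slots(text):
        return {"time": TIME_RE.findall(text),
                "amount": AMOUNT_RE.findall(text)}

    # Method two: a dictionary, for enumerable slots such as city names.
    CITY_DICT = {"Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou"}

    def extract_city(tokens):
        return [t for t in tokens if t in CITY_DICT]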
Method three: for slot tags that can be neither enumerated nor captured by strong regularity, a slot information discrimination model is adopted.

The slot information discrimination model can be, but is not limited to, an "embedding + bidirectional LSTM + CRF" neural network model, and each domain classification corresponds to one slot information discrimination model. The inputs of the slot information discrimination model are the domain tag and the intent tag of each word in the sentence to be understood, and the model outputs the slot tags of the sentence to be understood.
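A minimal sketch of the "embedding + bidirectional LSTM + CRF" model in PyTorch is given below; for brevity only the emission scores are computed, and the CRF layer over them is assumed rather than implemented:

    import torch.nn as nn

    class SlotInfoModel(nn.Module):
        # embedding + bidirectional LSTM; a CRF layer over the emitted
        # per-tag scores is assumed but omitted here for brevity.
        def __init__(self, vocab_size, num_tags, emb_dim=128, hidden=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True,
                                batch_first=True)
            self.out = nn.Linear(hidden, num_tags)

        def forward(self, token_ids):              # [batch, seq_len]
            h, _ = self.lstm(self.emb(token_ids))  # [batch, seq_len, hidden]
            return self.out(h)                     # per-token tag scores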
In the above technical solution, a plurality of simple classification models are used to model the problem from different dimensions, and a plurality of prediction results are fused to obtain the final intent classification result.

However, on the one hand, the method considers the intent discrimination task and the slot discrimination task separately, so in this scheme the relationship between intents and slots is weak and the generalization performance is poor; on the other hand, as services increase, the intent and slot classification system needs more and more domain classifications and intent discrimination models, and the models then need to be retrained.
As shown in FIG. 5, FIG. 5 is a block diagram of a machine-translation-based multi-label intent classifier. The multi-label intent classifier includes a logical tree, which is composed of a root node 100 and intent nodes 200 at the first, second, third, and even deeper levels derived from the root node into a tree structure. Each intent node corresponds to an intent tag 400, and an intent tag 400 is marked onto an intent node during the process of filling the intent slots.

Specifically, the multi-label intent classifier recognizes the voice information input by the user as text information 300, and then vectorizes the text information 300 into a text vector; based on the text vector, the multi-label intent classifier outputs intent tags 400, marks them onto the corresponding intent nodes 200 of the logical tree, and finds the corresponding intent node sequence under the control of the logical tree. The marked nodes form a search tree within the logical tree, i.e., the search tree is a subset of the logical tree, and there may be one or more search trees. When there is one intent tag, the multi-label intent classifier directly searches a database according to the intent tag sequence and outputs commodity recommendation information; when there are multiple intent tags, the multi-label intent classifier judges whether the intent node path corresponding to the intent tags is unique; if so, it can search the database according to the intent tags and output commodity recommendation information; otherwise, it outputs inquiry information to the user, the user's reply is again recognized as text information, and the judgment continues until the requirement is met and commodity recommendation information can be output.

This method finds an intent tag sequence under the control of a logical tree based on a machine-translation multi-label intent classifier. On the one hand, the scheme only supports intent classification and cannot support slot filling; on the other hand, the scheme depends on the intent logical tree, does not support large-scale concurrent operation, and is not suitable for large-scale intent and slot classification; in addition, the application range of the scheme is narrow: it is only suitable for the commodity recommendation field, and the cost of migrating it to other fields at scale is high.
The following embodiments of the present application provide an intent and slot recognition method. The method can be applied to the field of human-computer interaction, and specifically comprises the following steps. The electronic device recognizes the voice of the user as a text sequence, and obtains a sentence vector and word vectors of the text sequence through a coding model. Then, the electronic device performs self-attention calculation on the word vectors to obtain the contextual relationship between each word and the remaining words in the text sequence. The electronic device splices the sentence vector and the word vectors after self-attention calculation to obtain a first text sequence vector. The electronic device calculates the similarity between the first text sequence vector and the intent label coding vectors to obtain the intent probabilities of the text sequence, and takes the intent labels whose intent probability is greater than a preset threshold as the intent labels of the text sequence. For slot extraction, the electronic device performs multi-head attention calculation on the word vectors and the intent label coding vectors to obtain a second text sequence vector. Then, the electronic device splices the second text sequence vector and the intent probability vector to obtain a third text sequence vector. The electronic device calculates the similarity between the third text sequence vector and the slot label coding vectors to obtain the slot probabilities of the text sequence, and takes the slot labels whose slot probability is greater than a preset threshold as the slot labels of the text sequence.

Therefore, when the electronic device performs intent recognition according to the first text sequence vector, the first text sequence vector fuses the slot-related coding information of each word in the text sequence, embodying the constraint of slot information on the intent recognition of the text sequence; meanwhile, when the electronic device performs slot extraction according to the second text sequence vector, the second text sequence vector fuses the intent label coding information for each word in the text sequence, embodying the constraint of intent label information on the slots of the text sequence. This improves the accuracy with which the electronic device recognizes intents and slots.

Similarly, the method is applicable not only to multi-intent slot extraction but also to single-intent slot extraction; the specific implementation is the same as that of multi-intent slot extraction on a text sequence by the electronic device, and is not repeated herein.
As shown in table 1, table 1 exemplarily shows a plurality of intentions under the domain classification and a preset slot corresponding to each intention.
TABLE 1
Domain            Intention           Preset slot tags
Search            Search food         location (location), food name (cate)
Search            Search place        location (location)
Play              Play music          music name (music_name)
Play              Play video          video name (video_name)
Play              Play e-book         e-book name (voice_name)
Command control   Open                name (name)
Command control   Close               name (name)
Reservation       Book hotel          date (day), location (location), number (number), hotel name (hotel)
Reservation       Book air ticket     date (day), departure place (starting), destination (destination), ticket name (ticket)
Call              Call                name (name)
For example, when the electronic device recognizes that the domain of the text sequence is search, the intentions corresponding to the "search" domain may include "search food" and "search place". When the intention is "search food", the slots extracted by the electronic device for the text sequence include a location (location) and a food name (cate). When the intention is "search place", the slot extracted by the electronic device for the text sequence includes a location (location).

When the electronic device recognizes that the domain of the text sequence is play, the intentions corresponding to the "play" domain may include "play music", "play e-book", and "play video". When the intention is "play music", the slot extracted by the electronic device for the text sequence includes a music name (music_name). When the intention is "play video", the slot extracted by the electronic device for the text sequence includes a video name (video_name). When the intention is "play e-book", the slot extracted by the electronic device for the text sequence includes an e-book name (voice_name).

When the electronic device recognizes that the domain of the text sequence is command control, the intentions corresponding to the "command control" domain may include "open" and "close". When the intention is "open", the slot extracted by the electronic device for the text sequence includes a name (name). When the intention is "close", the slot extracted by the electronic device for the text sequence includes a name (name).

When the electronic device recognizes that the domain of the text sequence is reservation, the intentions corresponding to the "reservation" domain may include "book hotel" and "book air ticket". When the intention is "book hotel", the slots extracted by the electronic device for the text sequence include a date (day), a location (location), a number (number), and a hotel name (hotel). When the intention is "book air ticket", the slots extracted by the electronic device for the text sequence include a date (day), a departure place (starting), a destination (destination), and a ticket name (ticket).

When the electronic device recognizes that the domain of the text sequence is call, the intention corresponding to the "call" domain may include "call". The slot extracted by the electronic device for the text sequence includes a name (name).
Table 1 only exemplarily shows the intentions under some of the preset domain classifications and the preset slot tags corresponding to each intention; more domain classifications may further be included, which are not listed one by one herein.

It can be understood that the intentions under the preset domain classifications shown in Table 1 and the preset slot tags corresponding to each intention are obtained through training on sample data. The electronic device can identify the domain of the input text data and one or more intent labels under that domain, and perform slot extraction on the text data under each intent label.

It is to be understood that each domain may include one or more intents, and the slot tags under each intent are also preset.
When the text sequence is multi-intent, the electronic device can identify the domain of the text sequence and identify a plurality of intent labels of the text sequence under that domain; the electronic device will then perform slot extraction on the text sequence under each intent label.

When the text sequence is single-intent, the electronic device can identify the domain of the text sequence and identify the intent label of the text sequence under that domain; the electronic device will then perform slot extraction on the text sequence under that intent label.
Single-intent slot extraction, followed by several cases where the text sequence contains multiple intents, will be exemplified below.
For single-intent slot extraction, the text sequence may be, for example, "buy a ticket to Shanghai for tomorrow". The electronic device may recognize that the intention of the text sequence is "buy ticket". The preset slots under the "buy ticket" intention may include a departure place, a destination, and a time. The electronic device uses a trained model or neural network to perform slot extraction on the text sequence, and the extracted slot labels may be the destination "Shanghai" and the time "tomorrow".

In some embodiments, a text sequence may contain multiple intents because the user expresses multiple intents, because of entity ambiguity in the text sequence, or because the same sentence pattern is ambiguous. When the electronic device performs multi-intent slot recognition with a single-intent model, it outputs only the slot label corresponding to the intent with the highest probability, or outputs no result at all. However, when the electronic device outputs the slot corresponding to the intent with the highest probability, that intent is not necessarily the intent expressed by the user; therefore, a deviation occurs when a single-intent model is used to extract intents and slots from a text sequence containing multiple intents.
Case one: the text sequence is ambiguous because of its sentence pattern, so the text sequence contains multiple intents.

Illustratively, the text sequence is "search for Kendeji near Xinjiekou" (Kendeji being the KFC brand name; Xinjiekou a place name).

As shown in Table 1, when the electronic device recognizes that the domain of the text sequence is search, the intentions corresponding to the search domain may be "search food", "search place", and the like.

When the intention is to search for food, the slots of the intention may include a location (location) and a food name (cate). When the intention is to search for a place, the slot of the intention may include a location (location).

The electronic device recognizes that the domain of the text sequence is search. When the electronic device recognizes the intention of the text sequence "search for Kendeji near Xinjiekou", it recognizes two intentions: one in which the electronic device 100 treats "Kendeji near Xinjiekou" as a place under the "search place" intention, and one in which the electronic device treats "Kendeji" as food under the "search food" intention.
As shown in fig. 6, fig. 6 illustrates a schematic diagram of slot tags of the text sequence under different intentions.
When the electronic device recognizes that the intention of the text sequence is "search place (search_location)", the electronic device takes "Kendeji near Xinjiekou" as the location slot. Under the "search_location" intention, the labeling result of the text sequence, one tag per character, is: "search-O" (two characters, each O), "new-B-location", "street-I-location", "mouth-I-location", "attached-I-location", "near-I-location", "of-I-location", "Ken-I-location", "de-I-location", "ji-I-location".

When the electronic device recognizes that the intention of the text sequence is "search food (search_cate)", the electronic device takes "Xinjiekou" as the place slot and "Kendeji" as the food slot. Under the "search_cate" intention, the labeling result of the text sequence is: "search-O" (two characters, each O), "new-B-location", "street-I-location", "mouth-I-location", "attached-O", "near-O", "of-O", "Ken-B-cate", "de-I-cate", "ji-I-cate".
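The labeling results above can be turned back into slot values with a small IOB decoder. The following sketch assumes the per-character tags of this example, with "location" and "cate" as the assumed tag names:

    def decode_iob(chars, tags):
        # Collect B-/I- runs into slot values, e.g. B-cate I-cate -> "cate".
        slots, cur_tag, cur_chars = {}, None, []
        for ch, tag in zip(chars, tags):
            if tag.startswith("B-"):
                if cur_tag:
                    slots[cur_tag] = "".join(cur_chars)
                cur_tag, cur_chars = tag[2:], [ch]
            elif tag.startswith("I-") and cur_tag == tag[2:]:
                cur_chars.append(ch)
            else:
                if cur_tag:
                    slots[cur_tag] = "".join(cur_chars)
                cur_tag, cur_chars = None, []
        if cur_tag:
            slots[cur_tag] = "".join(cur_chars)
        return slots

    # The "search_cate" labeling of the example sentence, in the original
    # Chinese characters ("search" spans the two leading characters).
    chars = list("搜索新街口附近的肯德基")
    tags = ["O", "O", "B-location", "I-location", "I-location",
            "O", "O", "O", "B-cate", "I-cate", "I-cate"]
    assert decode_iob(chars, tags) == {"location": "新街口", "cate": "肯德基"}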
Case two: the text sequence contains multiple intents because of entity ambiguity.

Illustratively, the text sequence is "play Hello, Old Times".

As shown in Table 1, when the electronic device recognizes that the domain of the text sequence is play, the intentions corresponding to the play domain may be "play music", "play video", "play e-book", and the like.

The domain of the text sequence is play. When the electronic device recognizes the intention of the text sequence "play Hello, Old Times", it recognizes three intentions: first, the electronic device recognizes "Hello, Old Times" as the name of a song, so it recognizes "play Hello, Old Times" as the "play music" intention; second, the electronic device recognizes "Hello, Old Times" as a video name, so it recognizes "play Hello, Old Times" as the "play video" intention; third, the electronic device recognizes "Hello, Old Times" as the name of an e-book, so it recognizes "play Hello, Old Times" as the "play e-book" intention.
As shown in fig. 7, fig. 7 exemplarily shows a schematic diagram of the slot tag of the text sequence under different intentions.
When the electronic device recognizes that the intention of the text sequence is "play music (play_music)", the electronic device marks "Hello, Old Times" as the music name slot. Under the "play_music" intention, the labeling result of the text sequence, one tag per character, is: "play-O" (two characters, each O), "you-B-music", "good-I-music", "old-I-music", "time-I-music", "light-I-music". The slot label extracted by the electronic device is music name (music_name): "Hello, Old Times".

When the electronic device recognizes that the intention of the text sequence is "play video (play_video)", the electronic device marks "Hello, Old Times" as the video name slot. Under the "play_video" intention, the labeling result of the text sequence is: "play-O" (two characters, each O), "you-B-video", "good-I-video", "old-I-video", "time-I-video", "light-I-video". The slot label extracted by the electronic device is video name (video_name): "Hello, Old Times".

When the electronic device recognizes that the intention of the text sequence is "play e-book (play_voice)", the electronic device marks "Hello, Old Times" as the e-book name slot. Under the "play_voice" intention, the labeling result of the text sequence is: "play-O" (two characters, each O), "you-B-voice", "good-I-voice", "old-I-voice", "time-I-voice", "light-I-voice". The slot label extracted by the electronic device is e-book name (voice_name): "Hello, Old Times".
Case three: the utterance expressed by the user, i.e., the text sequence, contains a plurality of intentions.
Illustratively, the text sequence is "I want to book a day ticket to Beijing and book a hotel room near the overseas beach".
As shown in Table 1, when the electronic device recognizes that the domain of the text sequence is reservation, the intentions corresponding to the reservation domain may be "book air ticket", "book hotel", and the like.

When the domain of the text sequence is reservation and the electronic device performs slot extraction on the text sequence in the reservation domain, the reservation domain has two corresponding intentions, "book air ticket" and "book hotel". The electronic device will recognize two intentions for the text sequence: one in which the electronic device recognizes the "book air ticket" intention, and one in which it recognizes the "book hotel" intention.
As shown in fig. 8, fig. 8 illustrates a schematic diagram of the slot tag of the text sequence under different intentions.
When the electronic device recognizes that the intention of the text sequence is "book air ticket (book_ticket)", the electronic device marks the date slot, the destination slot, and the ticket name slot in the text sequence "I want to book an air ticket to Shanghai for tomorrow and book a hotel room near the Bund". Under the "book_ticket" intention, the labeling result marks "tomorrow" ("bright-B-day", "day-I-day") as the date slot, "Shanghai" ("up-B-destination", "sea-I-destination") as the destination slot, and "air ticket" ("machine-B-ticket", "ticket-I-ticket") as the ticket name slot, with every other character, including the whole of clause two, labeled "O".

When the electronic device recognizes that the intention of clause one ("I want to book an air ticket to Shanghai for tomorrow") is "book hotel (book_hotel)", the labeling result marks only "tomorrow" ("bright-B-day", "day-I-day") as a date slot and "Shanghai" ("up-B-location", "sea-I-location") as a location slot, with the remaining characters labeled "O". Because this labeling result contains no hotel name slot tag, the electronic device discards the "book_hotel" intention for clause one and takes "book_ticket" as the intention tag of clause one.

When the electronic device recognizes that the intention of clause two is "book hotel (book_hotel)", the electronic device marks the location slot, the hotel name slot, and the number slot in the text sequence. Under the "book_hotel" intention, the labeling result labels every character of clause one "O" and marks "the Bund in Shanghai" ("up-B-location", "sea-I-location", "out-I-location", "beach-I-location") as the location slot, "hotel" ("wine-B-hotel", "shop-I-hotel") as the hotel name slot, and "one room" ("one-B-number", "room-I-number") as the number slot.
The above embodiments exemplarily illustrate cases where a text sequence contains multiple intents in some scenarios; the electronic device can recognize the multiple intents in the text sequence and perform slot extraction on the text sequence under each intention.

The following describes the embodiments of the present application, taking multi-intent slot extraction as an example.

As shown in fig. 9, fig. 9 is a framework diagram of an intent and slot recognition method according to an embodiment of the present application.
The method includes intent identification and slot extraction.
For intent recognition, the method can be divided into the following steps:

Step one, the electronic device collects the user's voice, converts the voice into a text sequence, and represents the text sequence as a sentence vector and word vectors through a BERT embedding model.

Illustratively, the text sequence may be "search for Kendeji near Xinjiekou".

The electronic device inputs the text sequence (sentence) into the BERT embedding model, and the model outputs the sentence vector (CLS) and the word vectors (sequence output) of the text sequence.

Before the electronic device inputs the text sequence into the BERT embedding model, the text sequence needs to be preprocessed. The preprocessing includes putting the text sequence into the correct input format of the BERT embedding model.

The format of the input text sequence is [CLS] followed by each character of the sentence: [CLS] "搜" "索" "新" "街" "口" "附" "近" "的" "肯" "德" "基".

The BERT embedding model will output the sentence vector (CLS) and the word vectors (sequence output) corresponding to the text sequence. The sentence vector (CLS) is a semantic representation of the text sequence; the word vectors (sequence output) are vector representations of each word in the text sequence.
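As an illustration only (the embodiments do not prescribe a library), step one might be sketched with the Hugging Face transformers package; the model name "bert-base-chinese" is an assumption, and the tokenizer prepends [CLS] automatically (it also appends [SEP], which the description above omits):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-chinese")   # model name assumed
    bert = AutoModel.from_pretrained("bert-base-chinese")

    inputs = tok("搜索新街口附近的肯德基", return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)

    sequence_output = out.last_hidden_state    # word vectors, [1, seq_len, 768]
    cls_vector = out.last_hidden_state[:, 0]   # sentence vector (CLS), [1, 768]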
Step two, the electronic device performs vector splicing (add operation) on the sentence vector (CLS) and the word vectors (sequence output) to obtain a first text sequence vector.
Firstly, the electronic device performs multi-head self-attention calculation on a word vector (sequence output) to obtain an attention value of each word in a text sequence.
The electronic equipment carries out vector splicing (add operation) on the word vector subjected to multi-head self-attention calculation and a sentence vector (CLS) to obtain a first text sequence vector.
The first text sequence vector is a composite representation of a sentence vector and a word vector of the text sequence. The first text sequence vector can represent the semantic representation of the text sequence and can also represent the attention value of each word in the text sequence.
Step three, the electronic device performs vector calculation on the first text sequence vector and the intent label coding vectors to obtain intent regression vectors (intent logits), where the intent regression vectors represent the similarity between the text sequence and each intent label.

The electronic device calculates the similarity between the text sequence and each intent label by calculating the vector distance between the text sequence and each intent label.

The intent label coding vectors are preset. For example, assume that the electronic device can identify 700 intent labels; these 700 intent labels are represented by the intent label coding vectors.

Step four, the electronic device normalizes the intent regression vectors (intent logits) to obtain intent probability vectors (intent probs).

The intent probability vector represents the probability that the text sequence contains each intent label. The electronic device normalizing the intent regression vectors (intent logits) means converting the vector distance between the text sequence and each intent label into the probability of the text sequence mapping to each intent label, thereby obtaining the intent probability vector.

Step five, the electronic device outputs the intent labels (intent labels) whose intent probability is greater than a preset threshold (for example, 0.5).
In this way, when the electronic device performs intent recognition on the text sequence through the first text sequence vector, the first text sequence vector takes into account both the sentence vector (CLS) and the word vectors (sequence output); when performing intent recognition on the text sequence, the electronic device thus references the word-level, slot-related coding information of the text sequence (i.e., the word vectors (sequence output)), which improves the accuracy of multi-intent recognition on text sequences by the electronic device.
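Steps two to five might be sketched end to end as follows (PyTorch assumed; the number of attention heads, the use of a sigmoid as the normalization, and the max-pooling over sequence positions to obtain one probability per intent label are assumptions of this sketch):

    import torch
    import torch.nn as nn

    hidden, num_intents = 768, 700
    self_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
    intent_label_emb = nn.Parameter(torch.randn(num_intents, hidden))  # preset labels

    def recognize_intents(cls_vec, word_vecs, threshold=0.5):
        # word_vecs: [1, seq_len, 768]; cls_vec: [1, 768].
        # Step two: multi-head self-attention over the word vectors, then
        # splice (add) the sentence vector -> first text sequence vector.
        attn_out, _ = self_attn(word_vecs, word_vecs, word_vecs)
        first = attn_out + cls_vec.unsqueeze(1)

        # Step three: similarity between the first text sequence vector and
        # the intent label coding vectors.
        intent_logits = first @ intent_label_emb.T          # [1, seq_len, 700]

        # Step four: normalize into one probability per intent label.
        intent_probs = torch.sigmoid(intent_logits).amax(dim=1)   # [1, 700]

        # Step five: output the intent labels above the preset threshold.
        return intent_probs, intent_probs > threshold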
For slot extraction, the method can be divided into the following steps:

Step one, the electronic device performs multi-head attention calculation on the word vectors (sequence output) and the intent label coding vectors (intent label embedding) to obtain a second text sequence vector, where the second text sequence vector represents the similarity of each word in the text sequence to each intent label.

Specifically, the electronic device performs multi-head attention calculation between the word vectors (sequence output) and the intent label coding vectors (intent label embedding); that is, the electronic device performs vector calculation between each word in the text sequence and each intent label to obtain the similarity between each word and each intent label, i.e., the vector distance between each word in the text sequence and each intent label.
Step two, the electronic device performs vector splicing (add operation) on the second text sequence vector, the word vectors (sequence output), and the intent probability vector to obtain a third text sequence vector.

The third text sequence vector is a composite representation of the second text sequence vector, the word vectors, and the intent probability vector of the text sequence. The third text sequence vector can reflect the similarity of each word in the text sequence to each intent label, and can also reflect the probability that the text sequence contains each intent label.

Step three, the electronic device performs vector calculation on the third text sequence vector and the slot label coding vectors to obtain slot regression vectors (slot logits), where the slot regression vectors represent the similarity between each word in the text sequence and each slot label.
The electronic equipment calculates the similarity between each word in the text sequence and each slot position label, namely calculates the vector distance between each word in the text sequence and each slot position label.
The slot label coding vectors are preset. For example, assuming that the electronic device can identify 300 slot labels, these 300 slot labels are represented by the slot label coding vectors.

Step four, the electronic device normalizes the slot regression vectors (slot logits) to obtain slot probability vectors (slot probs).
The slot probability vector represents the probability of each word in the text sequence to each slot tag. The electronic device normalizes slot regression vectors (slot locations), that is, the vector distance between each word in the text sequence and each slot tag is converted into the probability from each word in the text sequence to each slot tag, so that a slot probability vector is obtained.
Step five, the electronic device outputs the slot labels (slot labels) whose slot probability is greater than a preset threshold (for example, 0.5).

In this way, when the electronic device performs slot extraction on the text sequence through the third text sequence vector, the third text sequence vector takes into account both the similarity between each word in the text sequence and each intent label and the intent information of the text sequence; since the electronic device references the intent information of the text sequence when extracting slots, the accuracy of multi-intent slot extraction on text sequences by the electronic device is improved.
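Continuing the sketch above (hidden, num_intents, and intent_label_emb are redefined so the fragment stands alone; the linear projection that expands the 700-dimensional intent probability vector to the hidden size is an assumption, corresponding to the matrix expansion described in S1006 below):

    import torch
    import torch.nn as nn

    hidden, num_intents, num_slots = 768, 700, 300
    intent_label_emb = nn.Parameter(torch.randn(num_intents, hidden))
    slot_label_emb = nn.Parameter(torch.randn(num_slots, hidden))
    cross_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
    probs_proj = nn.Linear(num_intents, hidden)  # expand [*, 700] -> [*, 768]

    def extract_slots(word_vecs, intent_probs, threshold=0.5):
        # Step one: multi-head attention with the word vectors as queries and
        # the intent label coding vectors as keys/values -> second vector.
        labels = intent_label_emb.unsqueeze(0)              # [1, 700, 768]
        second, _ = cross_attn(word_vecs, labels, labels)   # [1, seq_len, 768]

        # Step two: splice (add) second vector, word vectors and intent probs.
        third = second + word_vecs + probs_proj(intent_probs).unsqueeze(1)

        # Steps three/four: similarity to the slot label coding vectors,
        # normalized into per-word, per-slot-label probabilities.
        slot_probs = torch.sigmoid(third @ slot_label_emb.T)  # [1, seq_len, 300]

        # Step five: per word, keep the slot labels above the threshold.
        return slot_probs, slot_probs > threshold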
As shown in fig. 10, fig. 10 is a flowchart of an intent and slot recognition method according to an embodiment of the present application.

S1001, the electronic device inputs the text sequence into the BERT embedding model to obtain a sentence vector and word vectors.

Before the electronic device inputs the text sequence into the BERT embedding model, the electronic device recognizes the collected voice of the user and converts the voice into a text sequence.

Illustratively, the text sequence may be "search for Kendeji near Xinjiekou".
Before the electronic device inputs the text sequence into the BERT embedding model, the text sequence needs to be adjusted into the correct format to be used as the input of the BERT embedding model.

The format of the input text sequence is [CLS] followed by each character of the sentence: [CLS] "搜" "索" "新" "街" "口" "附" "近" "的" "肯" "德" "基".

The BERT embedding model will output the sentence vector (CLS) and the word vectors (sequence output) corresponding to the text sequence. The sentence vector (CLS) is a semantic representation of the text sequence, i.e., it reflects the domain to which the text sequence belongs. Illustratively, as shown in Table 1, the sentence vector (CLS) may belong to the search domain, the play domain, the command control domain, the reservation domain, the call domain, and so on, and the electronic device will perform intent recognition and slot extraction on the text sequence according to the domain to which the sentence vector (CLS) belongs. The word vectors (sequence output) are vector representations of each word in the text sequence.

Illustratively, for the text sequence "search for Kendeji near Xinjiekou", the domain to which the sentence vector (CLS) of the text sequence output by the electronic device belongs is the search domain. The electronic device will perform intent recognition on the text sequence in the search domain.

Illustratively, for the text sequence "play Hello, Old Times", the domain to which the sentence vector (CLS) of the text sequence output by the electronic device belongs is the play domain. The electronic device will perform intent recognition on the text sequence in the play domain.

Specifically, the electronic device outputs the sentence vector (CLS) and the word vectors (sequence output) according to the text sequence; the sentence vector (CLS) may be represented as a matrix [1, hidden], and the word vectors (sequence output) may be represented as a matrix [seq_len, hidden].

Here, hidden represents the hidden layer dimension of the BERT model. For example, the BERT model may be the BERT-base model, in which case hidden = 768. seq_len represents the length of the text data; if there are p words in the text data, then seq_len = p.

Illustratively, when the input of the BERT embedding model is the text sequence [CLS] "搜" "索" "新" "街" "口" "附" "近" "的" "肯" "德" "基", the text sequence has 11 words, so seq_len = 11. The sentence vector (CLS) can be represented as a matrix [1, 768], and the word vectors (sequence output) can be represented as a matrix [11, 768].
S1002, the electronic device performs multi-head self-attention calculation on the word vectors to obtain the attention value of each word in the text sequence.

Specifically, the electronic device calculates the similarity between the first word vector in the text sequence and each of the remaining word vectors in the text sequence to obtain weights; then, the electronic device normalizes the weights of the first word vector with respect to the remaining words in the text sequence; finally, the electronic device computes the weighted sum of the word vectors in the text sequence using the normalized weights, to obtain the attention value of the first word vector in the text sequence. Similarly, self-attention calculation is performed on all the word vectors in the text sequence in the above manner, to obtain the attention value of each word vector in the text sequence. Illustratively, the attention values of the word vectors in the text sequence may be represented by a matrix [11, 768].
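The per-word computation described above can be written out for a single attention head as follows (a schematic sketch; the scaling by the square root of the hidden dimension is an assumption carried over from standard self-attention):

    import torch
    import torch.nn.functional as F

    def self_attention(word_vecs):
        # word_vecs: [seq_len, hidden], e.g. [11, 768].
        scores = word_vecs @ word_vecs.T              # pairwise similarity weights
        scores = scores / word_vecs.shape[-1] ** 0.5  # scale for stability
        weights = F.softmax(scores, dim=-1)           # normalize per word
        return weights @ word_vecs                    # attention values, [11, 768]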
S1003, the electronic equipment obtains a first text sequence vector according to the sentence vector and the word vector after multi-head self-attention calculation.
The electronic equipment carries out vector splicing (add operation) on the word vector subjected to multi-head self-attention calculation and a sentence vector (CLS) to obtain a first text sequence vector.
The first text sequence vector is a composite representation of a sentence vector and a word vector of the text sequence. The first text sequence vector can represent the semantic representation of the text sequence and the attention value of each word in the text sequence.
Specifically, the electronic device may perform vector concatenation (add operation) on the word vector subjected to multi-head self-attention calculation and the sentence vector (CLS) in any one of the following manners.
In the first mode, the electronic device performs vector addition calculation on the word vector subjected to multi-head self-attention calculation and a sentence vector (CLS) to obtain a first text sequence vector.
First, the electronic device needs to expand the sentence vector (CLS) into a matrix of the same size as the word vector matrix after performing the multi-headed self-attention calculation. Illustratively, the electronic device expands the sentence vector (CLS) from a matrix [1, 768] to a matrix [11, 768 ].
And then, the electronic equipment adds the sentence vector (CLS) matrix and the word vector matrix subjected to multi-head self-attention calculation to obtain a first text sequence vector. Illustratively, the first text sequence matrix may be represented as a matrix [11, 768 ].
And in the second mode, the electronic equipment splices the word vector subjected to multi-head self-attention calculation with a sentence vector (CLS) to obtain a first text sequence vector.
The electronic equipment directly splices the word vector subjected to multi-head self-attention calculation behind or in front of a sentence vector (CLS) to obtain a first text sequence vector. The method and the device do not limit the splicing sequence of the word vector and the sentence vector after multi-head attention calculation.
First, the electronic device needs to expand the sentence vector (CLS) into a matrix of the same size as the word vector matrix after performing multi-headed self-attention calculation. Illustratively, the electronic device expands the sentence vector (CLS) from a matrix [1, 768] to a matrix [11, 768 ].
The word vector after multi-headed self-attention calculation can be represented as a matrix [11, 768], and the sentence vector (CLS) can be represented as a matrix [11, 768 ]. The electronic device performs vector splicing on the word vector subjected to multi-head self-attention calculation and a sentence vector (CLS) to obtain a first text sequence vector, and the first text sequence vector can be represented as a matrix [22, 768 ].
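For illustration, the two splicing modes can be checked with the shapes used above; this is a sketch in PyTorch (a library assumption), not a prescribed implementation:

    import torch

    word_attn = torch.randn(11, 768)   # word vectors after self-attention
    cls = torch.randn(1, 768)          # sentence vector (CLS)

    # Mode one: expand CLS to [11, 768], then add element-wise.
    first_add = word_attn + cls.expand(11, 768)                     # [11, 768]

    # Mode two: expand CLS, then concatenate along the sequence axis.
    first_cat = torch.cat([word_attn, cls.expand(11, 768)], dim=0)  # [22, 768]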
S1004, the electronic equipment calculates the similarity between the first text sequence vector and the intention label coding vector to obtain the intention probability of the text sequence.
That the electronic device calculates the similarity between the first text sequence vector and the intent label coding vectors to obtain the intent probabilities of the text sequence can also be understood as the electronic device determining both the similarity between the sentence vector and the plurality of preset intent labels and the similarity between the word vectors after multi-head self-attention calculation and the plurality of preset intent labels.

The intent label coding vectors are vector representations of the preset intent labels. Illustratively, there may be 700 preset intent labels, such as "book air ticket", "book hotel", "play music", and "play video". Each intent label corresponds to a one-hot vector, which can be represented as a matrix [1, 768]; the 700 preset intent labels can then be represented as a matrix [700, 768].
The electronic device calculates the similarity between the first text sequence vector and the intent tag encoding vector by calculating the vector distance between the text sequence and each of the intent tags.
Specifically, the electronic device performs matrix operation on a transposed matrix of the first text sequence vector and the intention label coding vector to obtain a vector distance between the text sequence and each intention label.
First, for example, when the intention tag encoding vector is represented as a matrix [700, 768], the electronic device transposes the intention tag encoding vector matrix [700, 768] to obtain an intention tag encoding vector matrix [768, 700 ].
Then, the electronic device performs matrix operation on the first text sequence vector matrix [11, 768] and the intent label coding vector matrix [768, 700] to obtain intent regression vectors (intent logits), which can be represented as the matrix [11, 700].

The intent regression vector represents the similarity between the first text sequence vector and the intent label coding vectors, i.e., the vector distance between the text sequence and each intent label.

The electronic device normalizes the intent regression vectors (intent logits) to obtain an intent probability vector (intent probs). The intent probability vector may be represented as a matrix [11, 700].
The intent probability vector represents the probability that the text sequence contains each of the intent tags. The electronic device normalizes intent regression vectors (intent locations), that is, converts a vector distance between the text sequence and each intent tag into a probability from the text sequence to each intent tag, to obtain an intent probability vector.
The electronic device outputs an intention tag with an intention probability greater than a preset threshold (e.g., 0.5). The intention tags with the intention probability larger than a preset threshold (for example, 0.5) are the intention tags contained in the text sequence recognized by the electronic equipment.
It is understood that each preset domain corresponds to a plurality of intent labels, and the slots under each intent label are preset. When the electronic device performs multi-intent slot extraction on the text sequence, the electronic device already knows the domain of the text sequence, and takes all the intent labels output for the text sequence in that domain as the intent labels of the text sequence recognized by the electronic device.

Illustratively, for the text sequence "search for Kendeji near Xinjiekou": the text sequence belongs to the search domain; the intent probabilities output by the electronic device for the intent labels "search place" and "search food" of the text sequence in the search domain are both greater than 0.5, and the intent probabilities corresponding to the other preset intent labels, such as "play music", "play video", and "play e-book", are all less than 0.5. The electronic device takes the intent labels "search place" and "search food" as the intent labels of the text sequence.

Illustratively, for the text sequence "play Hello, Old Times": the text sequence belongs to the play domain; the intent probabilities output by the electronic device for the intent labels "play music", "play video", and "play e-book" of the text sequence in the play domain are all greater than 0.5, and the intent probabilities corresponding to the other preset intent labels, such as "search place" and "search food", are all less than 0.5. The electronic device takes the intent labels "play music", "play video", and "play e-book" as the intent labels of the text sequence.
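In tensor form, the computation of S1004 described above amounts to one matrix multiplication followed by normalization and thresholding; the sketch below is schematic (PyTorch assumed, and the max-pooling over the 11 positions used to reach one probability per label is an assumption):

    import torch

    first = torch.randn(11, 768)             # first text sequence vector
    intent_labels = torch.randn(700, 768)    # 700 preset intent label codings

    intent_logits = first @ intent_labels.T  # [11, 768] @ [768, 700] -> [11, 700]
    intent_probs = torch.sigmoid(intent_logits)        # normalize into (0, 1)

    # Keep the intent labels whose probability exceeds 0.5 at any position.
    kept = (intent_probs.amax(dim=0) > 0.5).nonzero(as_tuple=True)[0]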
S1005, the electronic device performs multi-head attention calculation on the word vector and the intention label encoding vector (intent label encoding) to obtain a second text sequence vector.
The intent tag encoding vector is a vector representation of a preset intent tag. Illustratively, when the preset intention tag is 700, for example, "reserve air ticket", "reserve hotel", "play music", "play video", and the like. Each intention tag corresponds to a one-hot vector, which can be represented as a matrix [1, 768], and then 700 preset intention tags can be represented as a matrix [700, 768 ].
Specifically, the electronic device calculates the similarity between the first word vector in the text sequence and the intent label coding vectors (intent label embedding) to obtain weights; then, the electronic device normalizes the weights between the first word vector and the intent label coding vectors; finally, the electronic device computes the weighted sum of the intent label coding vectors using the normalized weights, to obtain the similarity between the first word in the text sequence and all the intent labels. Similarly, multi-head attention calculation is performed on all the words in the text sequence in the above manner to obtain the second text sequence vector, which represents the similarity between each word vector in the text sequence and all the intent label vectors. Illustratively, the second text sequence vector may be represented by the matrix [11, 768].
And S1006, the electronic equipment carries out vector splicing on the second text sequence vector, the word vector and the intention probability vector to obtain a third text sequence vector.
And the electronic equipment carries out vector splicing (add operation) on the second text sequence vector, the word vector and the intention probability vector to obtain a third text sequence vector.
The third text sequence vector is a composite representation of the second text sequence vector, the word vectors, and the intent probability vector of the text sequence. The third text sequence vector can represent the similarity between each word in the text sequence and each intent label, and can also represent the probability that the text sequence contains each intent label.

Specifically, the electronic device may perform vector splicing (add operation) on the second text sequence vector, the word vectors, and the intent probability vector in either of the following ways.
And in the first mode, the electronic equipment carries out vector addition calculation on the second text sequence vector, the word vector and the intention probability vector to obtain a third text sequence vector.
The second text sequence vector may be represented by a matrix [11, 768], the word vector may be represented by a matrix [11, 768], and the intention probability vector may be represented by a matrix [11, 700 ].
First, the electronic device expands the intention probability vector matrix from [11, 700] to [11, 768 ].
The electronic device adds the second text sequence vector matrix, the word vector matrix, and the expanded intent probability vector matrix to obtain the third text sequence vector. Illustratively, the third text sequence vector may be represented as a matrix [11, 768].
And secondly, the electronic equipment carries out vector splicing on the second text sequence vector, the word vector and the intention probability vector to obtain a third text sequence vector.
Illustratively, the electronic device splices the intention probability vector, the second text sequence vector, directly behind or in front of the word vector, resulting in a third text sequence vector. The order of splicing the intention probability vector, the second text sequence vector and the word vector is not limited.
The second text sequence vector may be represented by a matrix [11, 768], the word vector may be represented by a matrix [11, 768], and the intention probability vector may be represented by a matrix [11, 700 ]. First, the electronic device needs to extend the intention probability vector into a matrix of the same size as the second text sequence vector. Illustratively, the electronic device expands the intention probability vector from a matrix [11, 700] to a matrix [11, 768 ].
The electronic device concatenates the second text sequence vector, the word vector, and the expanded intention probability vector to obtain the third text sequence vector, which may be represented as the matrix [33, 768]. The splicing order of the second text sequence vector, the word vector, and the intention probability vector is not limited.
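The second manner can be sketched the same way; concatenation along the row axis is one reading that reproduces the [33, 768] shape, and the block order below is arbitrary, as the text itself notes.

    import numpy as np

    second_text_seq = np.random.randn(11, 768)
    word_vecs = np.random.randn(11, 768)
    intent_probs = np.pad(np.random.rand(11, 700), ((0, 0), (0, 68)))  # expanded to [11, 768]

    # row-wise concatenation of the three blocks, in any order
    third_text_seq = np.concatenate([second_text_seq, word_vecs, intent_probs], axis=0)
    print(third_text_seq.shape)  # (33, 768)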
And S1007, the electronic device calculates the similarity between the third text sequence vector and the slot tag vector to obtain the slot probability of the text sequence.
The slot tag vector is a vector representation of a preset slot tag. Illustratively, the number of preset slot tags may be 300, for example, "time (day)", "destination (destination)", "location (location)", and the like. Each slot tag corresponds to a one-hot vector, which may be represented as a matrix [1, 768]; the 300 preset slot tags may then be represented as a matrix [300, 768].
The electronic equipment calculates the similarity between the third text sequence vector and the slot label coding vector, namely calculates the vector distance between each word in the text sequence and each slot label.
Specifically, the electronic device performs matrix operation on the third text sequence vector and the transposed matrix of the slot tag coding vector to obtain a vector distance between each word in the text sequence and each slot tag.
First, for example, when the slot tag encoding vector is represented as a matrix [300, 768], the electronic device transposes the slot tag encoding vector matrix [300, 768] to obtain the slot tag encoding vector matrix [768, 300 ].
Then, the electronic device performs a matrix operation on the third text sequence vector matrix [11, 768] and the transposed slot tag encoding vector matrix [768, 300] to obtain a slot regression vector (slot logits), which may be represented as the matrix [11, 300].
The slot regression vector represents the similarity between the third text sequence vector and the slot tag encoding vectors, that is, the vector distance between each word in the text sequence and each slot tag.
The electronic device normalizes the slot regression vector (slot logits) to obtain a slot probability vector (slot probs). The slot probability vector may be represented as the matrix [11, 300].
The slot probability vector represents the probability of each word in the text sequence belonging to each slot tag. That is, by normalizing the slot regression vector (slot logits), the electronic device converts the vector distance between each word vector in the text sequence and each slot tag vector into the probability of each word mapping to each slot tag, thereby obtaining the slot probability vector.
The electronic device outputs the slot tags whose slot probability is greater than a preset threshold (for example, 0.5). These slot tags are the slot tags that the electronic device recognizes as contained in the text sequence.
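Step S1007 can be sketched end to end as follows. The per-word softmax is an assumption, since the application only says "normalize"; a per-label sigmoid would equally yield probabilities comparable against the 0.5 threshold.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    third_text_seq = np.random.randn(11, 768)   # third text sequence vector
    slot_label_enc = np.random.randn(300, 768)  # slot tag encoding vectors

    slot_logits = third_text_seq @ slot_label_enc.T  # vector distances, [11, 300]
    slot_probs = softmax(slot_logits)                # slot probability vector, [11, 300]

    # output the slot tags whose probability exceeds the preset threshold (e.g. 0.5)
    word_idx, tag_idx = np.where(slot_probs > 0.5)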
Illustratively, "search for Kendeki near New street" for text sequence. The electronic device will output the intent tags "search for location" and "search for food" according to the search field. Under the intention tag ' search place ', the slot probability corresponding to the slot position tag ' position (location) ' output by the electronic equipment is greater than a preset threshold value (for example, 0.5), and the slot position probabilities corresponding to other slot position tags such as the slot position tag ' cate name ' and/or ' music name ' (music _ name) ' output by the electronic equipment are smaller than the preset threshold value (for example, 0.5); under the intention tag of 'searching for gourmet food', the slot probabilities corresponding to the slot tags of 'position (location)' and 'gourmet food name' (place) 'output by the electronic equipment are both greater than a preset threshold value (for example, 0.5), and the slot probabilities corresponding to other slot tags of' time (day) 'and/or' music name '(music _ name)' output by the electronic equipment are both less than the preset threshold value (for example, 0.5).
As shown in fig. 6, fig. 6 illustrates the slot extraction results when the electronic device identifies the multiple intentions of the text sequence "search for KFC near Xinjiekou".
When the electronic device recognizes that the intention of the text sequence is "search for location (search_location)", the electronic device takes "KFC near Xinjiekou" as the location slot. Under the intention "search_location", the labeling result of the text sequence is "search-O", "Xinjiekou-B-Location", "near-I-Location", "KFC-I-Location".
When the electronic device recognizes that the intention of the text sequence is "search for food (search_cate)", the electronic device takes "Xinjiekou" as the location slot and "KFC" as the food slot. Under the intention "search_cate", the labeling result of the text sequence is "search-O", "Xinjiekou-B-Location", "near-O", "KFC-B-cate". After the labeling result of the text sequence is obtained, the electronic device extracts the slot tags according to the labeling result: location (location) - Xinjiekou, and food name (cate) - KFC.
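Extracting slot tags from a BIO labeling result, as in the two examples above, can be sketched as follows; the tokenization and tag names are illustrative and not taken from this application.

    def extract_slots(tokens, bio_tags):
        # collect (slot_label, text) spans from a BIO-labeled token sequence
        slots, span, label = [], [], None
        for tok, tag in zip(tokens, bio_tags):
            if tag.startswith("B-"):             # a new span begins
                if span:
                    slots.append((label, " ".join(span)))
                span, label = [tok], tag[2:]
            elif tag.startswith("I-") and span:  # the current span continues
                span.append(tok)
            else:                                # "O" closes any open span
                if span:
                    slots.append((label, " ".join(span)))
                span, label = [], None
        if span:
            slots.append((label, " ".join(span)))
        return slots

    tokens = ["search", "Xinjiekou", "near", "KFC"]
    tags = ["O", "B-Location", "O", "B-cate"]
    print(extract_slots(tokens, tags))  # [('Location', 'Xinjiekou'), ('cate', 'KFC')]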
Illustratively, "Play your old time" for a text sequence. The electronic device will output the intention tags "play music," "play video," and "play e-book" according to the intention to play. Under the intention tag 'music playing', the slot probability that the slot output slot tag is the corresponding slot of the 'music name (music _ name)' is larger than a preset threshold (for example, 0.5), and the slot probabilities that other slot tags such as the 'cate name' (cate) 'and/or the' music name '(music _ name)' and the like are all smaller than the preset threshold (for example, 0.5) are output by the electronic equipment; under the intention tag of 'playing video', the slot probability that the slot output slot tag is the corresponding slot of 'video _ name' is greater than a preset threshold (for example, 0.5), and the slot probabilities that other slot tags such as 'cate name' and/or 'music name' and the like are output by the electronic equipment are smaller than the preset threshold (for example, 0.5); under the intention tag "play the electronic book", the slot probability that the slot output slot tag is the slot corresponding to the "electronic book name" (voice _ name) "is greater than a preset threshold (for example, 0.5), and the slot probabilities corresponding to other slot tags such as the slot tag" cate "and/or" music name "(music _ name)" will be all less than the preset threshold (for example, 0.5) by the electronic device.
As shown in fig. 7, fig. 7 exemplarily shows the slot tags of the text sequence "Play Hello Old Times" under different intentions.
When the electronic device recognizes that the intention of the text sequence is "play music (play_music)", the electronic device marks "Hello Old Times" as the music name slot. Under the intention "play_music", the labeling result of the text sequence is "play-O", "Hello-B-music", "Old-I-music", "Times-I-music". After the labeling result of the text sequence is obtained, the electronic device extracts the slot tag according to the labeling result: music name (music_name) - Hello Old Times.
When the electronic device recognizes that the intention of the text sequence is "play video (play_video)", the electronic device marks "Hello Old Times" as the video name slot. Under the intention "play_video", the labeling result of the text sequence is "play-O", "Hello-B-video", "Old-I-video", "Times-I-video". After the labeling result of the text sequence is obtained, the electronic device extracts the slot tag according to the labeling result: video name (video_name) - Hello Old Times.
When the electronic device recognizes that the intention of the text sequence is "play e-book (play_voice)", the electronic device marks "Hello Old Times" as the e-book name slot. Under the intention "play_voice", the labeling result of the text sequence is "play-O", "Hello-B-voice", "Old-I-voice", "Times-I-voice". After the labeling result of the text sequence is obtained, the electronic device extracts the slot tag according to the labeling result: e-book name (voice_name) - Hello Old Times.
The above embodiments exemplarily show that, when a text sequence contains a plurality of intentions, the electronic device can recognize the plurality of intentions in the text sequence and perform slot extraction on the text sequence according to the preset slots under each intention.
The method can also be applied to single-intent slot extraction: when a text sequence contains only one intention, the electronic device can recognize that intention and extract the slots of the text sequence according to the preset slots under the intention. The principle of single-intent slot extraction is the same as that of multi-intent slot extraction; for details, reference may be made to the embodiment of multi-intent slot extraction, which is not described herein again.
As shown in fig. 11, fig. 11 illustrates the slot extraction result when the electronic device recognizes the single-intent text sequence "Call Dad".
When the electronic device recognizes that the intention tag of the text sequence is "call (call)", the electronic device marks "Dad" as the name slot. Under the intention tag "call", the labeling result of the text sequence is "call-O", "Dad-B-name". After the labeling result of the text sequence is obtained, the electronic device extracts the slot tag according to the labeling result: name (name) - Dad.
An intention slot identification method provided by the application in a man-machine conversation scene will be described below.
In a human-machine conversation application scenario, the electronic device converts the user's speech into a text sequence and identifies one or more intentions in the text sequence. When the text sequence contains only one intention, the electronic device executes the instruction corresponding to that intention; when the text sequence contains a plurality of intentions, the electronic device presents the instruction corresponding to each of the plurality of intentions to the user, and the user decides which intention's instruction should be executed. In this way, the method improves the accuracy of human-computer interaction and the user experience.
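A minimal sketch of this dialogue decision, assuming the one or more intentions have already been recognized; the function name and message strings are illustrative only.

    def handle_recognized_intents(intents):
        # intents: list of (intent_label, instruction) pairs extracted from the text sequence
        if len(intents) == 1:
            _, instruction = intents[0]  # a single intention is executed directly
            return "execute: " + instruction
        # multiple intentions: show each candidate and let the user choose
        options = ", ".join(label for label, _ in intents)
        return "prompt: found several possibilities (" + options + "); which should be executed?"

    print(handle_recognized_intents([("play_music", "play the song"),
                                     ("play_video", "play the video"),
                                     ("play_voice", "play the e-book")]))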
The multi-intent slot identification method provided by the embodiment of the present application is described below with reference to an application scenario.
Fig. 12-13 illustrate schematic diagrams of a human-machine dialog scenario.
As shown in fig. 12, the electronic device may capture the user's speech and recognize the speech as a text sequence.
As shown in fig. 13, the electronic device recognizes the collected user speech as the text sequence "Play Hello Old Times". As can be seen from the above embodiments, the text sequence "Play Hello Old Times" may carry three intentions, namely "play music", "play video", and "play e-book".
As shown in fig. 13, when the electronic device recognizes the instruction "Play Hello Old Times", the electronic device displays the prompt message "Found "Hello Old Times". Would you like to play the music, the video, or the e-book?" In this way, the electronic device prompts the user with all the intentions that the text sequence may contain, so that the user can select the intention to be executed.
As shown in fig. 13, after the electronic device displays the prompt message, the electronic device may capture the user's voice and recognize the voice as a text sequence, which may be "play music".
The electronic device recognizes the user instruction, and in response to the user instruction "play the music", the electronic device displays the prompt message "OK, playing the music "Hello Old Times" for you" and executes the instruction to play the music "Hello Old Times".
As shown in fig. 14, in response to the instruction to play the music "Hello Old Times", the electronic device displays a music playing user interface 1401 as shown in fig. 14. The music playing user interface 1401 includes a music name icon 1402, a play/pause control 1403, a previous control 1404, a next control 1405, and a like control 1406.
The single-intent slot identification method is described below with reference to an application scenario.
As shown in fig. 15, the electronic device recognizes the collected user speech as the text sequence "Call Dad". As can be seen from the above embodiment, the text sequence "Call Dad" contains only one intention, namely "call".
The electronic device recognizes the user instruction, and in response to the user instruction "Call Dad", the electronic device displays the prompt message "OK, calling Dad" and executes the instruction to call Dad.
As shown in fig. 16, in response to the instruction to "call dad", the electronic device displays a call user interface 1600 as shown in fig. 16. The call user interface 1600 includes a mute control 1601, a dial-up keypad control 1602, an audio control 1603, an add call control 1604, a video call control 1605, an address book control 1606, and an end call control 1607.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (11)

1. An intention slot identification method, the method comprising:
the electronic equipment receives a voice signal input by a user;
the electronic equipment identifies a text sequence in the voice signal;
the electronic equipment extracts sentence vectors and word vectors from the text sequence;
the electronic equipment determines the similarity between the sentence vector and a plurality of preset intention labels and the similarity between the word vector and the preset intention labels after multi-head attention calculation;
extracting one or more intention labels in the text sequence by the electronic equipment based on the similarity between the sentence vector and the preset intention labels and the similarity between the word vector and the preset intention labels after multi-head attention calculation;
the electronic equipment determines the similarity of the word vector and the slot position label template corresponding to the one or more intention labels respectively;
the electronic equipment extracts the slot position labels corresponding to the one or more intention labels from the text sequence based on the similarity between the word vector and the slot position label template corresponding to the one or more intention labels;
and the electronic equipment executes the instruction corresponding to the one or more intention labels according to the slot position labels corresponding to the one or more intention labels respectively.
2. The method of claim 1, wherein before the electronic device determines the similarity between the sentence vector and the plurality of preset intent tags and the similarity between the word vector after performing the multi-head attention calculation and the plurality of preset intent tags, the method further comprises:
the electronic equipment splices the sentence vector and the word vector subjected to multi-head self-attention calculation to obtain a first text sequence vector;
the electronic device determines similarity between the sentence vector and a plurality of preset intention labels and similarity between the word vector and the preset intention labels after multi-head attention calculation, and specifically includes:
and the electronic equipment calculates the vector distance between the first text sequence vector and the preset intention labels to obtain an intention probability vector of the text sequence.
3. The method of claim 2, wherein after the electronic device calculates vector distances of the first text sequence vector from the plurality of preset intent tags, prior to the deriving the intent probability vector for the text sequence, the method further comprises:
the electronic equipment normalizes the vector distances between the first text sequence vector and the preset intention labels to obtain the intention probability vector of the text sequence.
4. The method according to claim 3, wherein the electronic device extracts one or more intention tags in the text sequence based on the similarity between the sentence vector and the preset intention tags and the similarity between the word vector and the preset intention tags after multi-head attention calculation, and specifically comprises:
and the electronic equipment outputs the intention labels corresponding to one or more intention probabilities which are greater than a first preset probability in the intention probability vector, and determines the one or more intention labels in the text sequence.
5. The method of claim 4, wherein the one or more intent tags are all intent tags of the text sequence in a predetermined domain.
6. The method of claim 3, wherein the determining, by the electronic device, the similarity of the word vector and the slot tag template corresponding to each of the one or more intent tags comprises:
the electronic equipment performs multi-head attention calculation on the word vector and the intention label coding vector to obtain a second text sequence vector;
the electronic equipment splices the second text sequence vector, the word vector and the intention probability vector to obtain a third text sequence vector;
and the electronic equipment calculates the vector distance between the third text sequence vector and the slot label template corresponding to the one or more intention labels respectively to obtain the similarity between the word vector and the slot label template corresponding to the one or more intention labels respectively.
7. The method of claim 6, wherein after the electronic device calculates a vector distance of the third text sequence vector from the slot tag template to which the one or more intent tags each correspond, the method further comprises:
the electronic equipment normalizes the vector distance between the third text sequence vector and the slot position label template corresponding to the one or more intention labels respectively to obtain a slot position probability vector of the text sequence;
after the electronic device obtains the slot probability vector for the text sequence, the method further includes:
and the electronic equipment outputs slot position tags corresponding to one or more slot position probabilities which are greater than a second set probability in the slot position probability vector, and determines one or more slot position tags corresponding to one or more intention tags in the text sequence.
8. The method according to claim 2, wherein the electronic device concatenates the sentence vector and the word vector after performing multi-head self-attention calculation to obtain a first text sequence vector, and specifically includes:
and the electronic equipment adds the sentence vector and the word vector subjected to multi-head self-attention calculation to obtain the first text sequence vector.
9. The method according to claim 6, wherein the electronic device concatenates the second text sequence vector, the word vector, and the intention probability vector to obtain a third text sequence vector, specifically comprising:
the electronic device adds the second text sequence vector, the word vector, and the intention probability vector to obtain the third text sequence vector.
10. An electronic device comprising one or more processors, one or more memories; the one or more memories coupled with the one or more processors for storing computer program code, the computer program code comprising computer instructions, the one or more processors to invoke the computer instructions to cause the electronic device to perform the method of any of claims 1-9.
11. A computer-readable storage medium comprising instructions that, when executed on an electronic device, cause the electronic device to perform the method of any of claims 1-9.
CN202011623049.4A 2020-12-30 2020-12-30 Intention slot position identification method Pending CN114691839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011623049.4A CN114691839A (en) 2020-12-30 2020-12-30 Intention slot position identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011623049.4A CN114691839A (en) 2020-12-30 2020-12-30 Intention slot position identification method

Publications (1)

Publication Number Publication Date
CN114691839A true CN114691839A (en) 2022-07-01

Family

ID=82134764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011623049.4A Pending CN114691839A (en) 2020-12-30 2020-12-30 Intention slot position identification method

Country Status (1)

Country Link
CN (1) CN114691839A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115688743A (en) * 2023-01-03 2023-02-03 荣耀终端有限公司 Short message parsing method and related electronic equipment
CN115688743B (en) * 2023-01-03 2023-05-26 荣耀终端有限公司 Short message analysis method and related electronic equipment
CN116227629A (en) * 2023-05-10 2023-06-06 荣耀终端有限公司 Information analysis method, model training method, device and electronic equipment
CN116227629B (en) * 2023-05-10 2023-10-20 荣耀终端有限公司 Information analysis method, model training method, device and electronic equipment
CN117034957A (en) * 2023-06-30 2023-11-10 海信集团控股股份有限公司 Semantic understanding method and device
CN117034957B (en) * 2023-06-30 2024-05-31 海信集团控股股份有限公司 Semantic understanding method and device integrating large models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination