CN111986653B - Voice intention recognition method, device and equipment - Google Patents


Info

Publication number
CN111986653B
Authority
CN
China
Prior art keywords
pinyin
phoneme
sample
vector
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010785605.1A
Other languages
Chinese (zh)
Other versions
CN111986653A (en)
Inventor
陈展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202010785605.1A priority Critical patent/CN111986653B/en
Publication of CN111986653A publication Critical patent/CN111986653A/en
Priority to PCT/CN2021/110134 priority patent/WO2022028378A1/en
Application granted granted Critical
Publication of CN111986653B publication Critical patent/CN111986653B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a voice intention recognition method, device, and equipment. The method comprises the following steps: determining a phoneme set to be recognized from the speech to be recognized; obtaining the phoneme vector to be recognized corresponding to that phoneme set; and inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs the voice intention corresponding to that phoneme vector. The target network model records the mapping relationship between phoneme vectors and voice intentions. Because recognition operates on phonemes rather than on converted text, this technical scheme accurately recognizes the user's voice intention and effectively improves the accuracy of voice intention recognition.

Description

Voice intention recognition method, device and equipment
Technical Field
The present application relates to the field of voice interaction, and in particular, to a method, apparatus, and device for recognizing voice intention.
Background
With the rapid development of artificial intelligence technology and its wide use in daily life, voice interaction has become an important bridge for communication between people and machines. For a robot system to converse with a user and complete a specific task, one core technology is recognition of voice intention: after the user inputs speech to be recognized, the robot system judges the user's voice intention from that speech.
In the related art, voice intention recognition comprises a speech recognition stage and an intention recognition stage. In the speech recognition stage, the speech to be recognized is converted into text by automatic speech recognition (ASR). In the intention recognition stage, the text is semantically understood by natural language processing (NLP) to obtain keyword information, and the user's voice intention is recognized from that keyword information.
In this text-based approach, the accuracy of intention recognition depends heavily on the accuracy of speech-to-text conversion; when that conversion is inaccurate, voice intention recognition becomes unreliable and the user's intention cannot be recognized correctly. For example, the speech may contain the word "tree" (shù), but the converted text may read "number" (shù, a homophone), leading to erroneous recognition of the voice intention.
Disclosure of Invention
The application provides a voice intention recognition method, which comprises the following steps:
Determining a phoneme set to be recognized according to the voice to be recognized;
Obtaining a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
Inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
In one possible implementation, the phoneme set to be recognized includes a plurality of phonemes to be recognized, and the obtaining a phoneme vector to be recognized corresponding to the phoneme set to be recognized includes:
determining, for each phoneme to be recognized, a phoneme characteristic value corresponding to that phoneme; and acquiring, based on the phoneme characteristic value corresponding to each phoneme to be recognized, a phoneme vector to be recognized corresponding to the phoneme set to be recognized, wherein the phoneme vector to be recognized comprises the phoneme characteristic value corresponding to each phoneme to be recognized.
In one possible implementation manner, before the inputting the phoneme vector to be recognized into the trained target network model, so that the target network model outputs the voice intention corresponding to the phoneme vector to be recognized, the method further includes: acquiring sample voice and sample intention corresponding to the sample voice;
Determining a sample phoneme set according to the sample speech;
acquiring a sample phoneme vector corresponding to the sample phoneme set;
and inputting the sample phoneme vector and the sample intention to an initial network model, and training the initial network model through the sample phoneme vector and the sample intention to obtain the target network model.
In one possible implementation, the sample phoneme set includes a plurality of sample phonemes, and the obtaining a sample phoneme vector corresponding to the sample phoneme set includes:
determining, for each sample phoneme, a phoneme characteristic value corresponding to that sample phoneme;
and acquiring, based on the phoneme characteristic value corresponding to each sample phoneme, a sample phoneme vector corresponding to the sample phoneme set, wherein the sample phoneme vector comprises the phoneme characteristic value corresponding to each sample phoneme.
The application provides a voice intention recognition method, which comprises the following steps:
Determining a pinyin set to be recognized according to the voice to be recognized;
acquiring a pinyin vector to be recognized corresponding to the pinyin set to be recognized;
inputting the pinyin vector to be recognized to a trained target network model, so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
In one possible implementation manner, the pinyin set to be recognized includes a plurality of pinyins to be recognized, and the obtaining the pinyin vector to be recognized corresponding to the pinyin set to be recognized includes:
determining, for each pinyin to be recognized, a pinyin characteristic value corresponding to that pinyin; and acquiring, based on the pinyin characteristic value corresponding to each pinyin to be recognized, a pinyin vector to be recognized corresponding to the pinyin set to be recognized, wherein the pinyin vector to be recognized comprises the pinyin characteristic value corresponding to each pinyin to be recognized.
In one possible implementation manner, before the inputting the pinyin vector to be recognized into the trained target network model, so that the target network model outputs the voice intent corresponding to the pinyin vector to be recognized, the method further includes: acquiring sample voice and sample intention corresponding to the sample voice;
determining a sample pinyin set according to the sample speech;
Acquiring a sample pinyin vector corresponding to the sample pinyin set;
And inputting the sample pinyin vector and the sample intention to an initial network model, and training the initial network model through the sample pinyin vector and the sample intention to obtain the target network model.
In one possible implementation, the sample pinyin set includes a plurality of sample pinyin, and the obtaining a sample pinyin vector corresponding to the sample pinyin set includes:
determining a pinyin characteristic value corresponding to each sample pinyin;
And acquiring a sample pinyin vector corresponding to the sample pinyin set based on the pinyin characteristic value corresponding to each sample pinyin, wherein the sample pinyin vector comprises the pinyin characteristic value corresponding to each sample pinyin.
The present application provides a voice intention recognition apparatus, the apparatus comprising:
The determining module is used for determining a phoneme set to be recognized according to the voice to be recognized;
the acquisition module is used for acquiring a phoneme vector to be identified, which corresponds to the phoneme set to be identified;
The processing module is used for inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
The present application provides a voice intention recognition apparatus, the apparatus comprising:
the determining module is used for determining a pinyin set to be recognized according to the voice to be recognized;
the acquisition module is used for acquiring the pinyin vectors to be identified corresponding to the pinyin sets to be identified;
The processing module is used for inputting the pinyin vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
The present application provides a voice intention recognition apparatus including: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is configured to execute machine-executable instructions to perform the steps of:
Determining a phoneme set to be recognized according to the voice to be recognized;
Obtaining a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
Inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention;
or determining a pinyin set to be recognized according to the voice to be recognized;
acquiring a pinyin vector to be recognized corresponding to the pinyin set to be recognized;
inputting the pinyin vector to be recognized to a trained target network model, so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
As can be seen from the above technical solutions, the embodiments of the present application recognize voice intention from the phonemes to be recognized rather than from converted text, so recognition accuracy no longer depends on the accuracy of speech-to-text conversion. Because phonemes are the minimum speech units divided according to the natural attributes of speech (they are analyzed from pronunciation actions, and one action forms one phoneme), the phonemes to be recognized can be determined from the speech with high accuracy. Voice intention recognition is therefore accurate and reliable, the user's voice intention can be correctly recognized, and no large language-model algorithm library for speech recognition is required, which greatly reduces the demands on performance and memory.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the following description briefly introduces the drawings used in the embodiments. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art may obtain other drawings from them.
FIG. 1 is a flow chart of a method of speech intent recognition in one embodiment of the present application;
FIG. 2 is a flow chart of a method of speech intent recognition in one embodiment of the present application;
FIG. 3 is a flow chart of a method of speech intent recognition in one embodiment of the present application;
FIG. 4 is a flow chart of a method of speech intent recognition in one embodiment of the present application;
FIG. 5A is a schematic diagram of a voice intent recognition device in accordance with one embodiment of the present application;
FIG. 5B is a schematic diagram of a voice intent recognition device in accordance with one embodiment of the present application;
fig. 6 is a hardware configuration diagram of a voice intention recognition apparatus in one embodiment of the present application.
Detailed Description
The terminology used in the embodiments of the application is for describing particular embodiments only and is not intended to limit the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present application to describe various information, the information should not be limited by these terms, which are only used to distinguish one type of information from another. For example, first information may also be referred to as second information and, similarly, second information may be referred to as first information without departing from the scope of the application. Furthermore, depending on the context, the word "if" may be interpreted as "when," "upon," or "in response to determining."
Before describing the technical scheme of the application, concepts related to the embodiments of the application are described.
Machine learning: machine learning is a way to implement artificial intelligence; it studies how computers simulate or implement human learning behavior to obtain new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Deep learning is a subclass of machine learning that models specific real-world problems with mathematical models in order to solve similar problems in the field. Neural networks are an implementation of deep learning; for ease of description, the structure and function of network models are described herein by taking neural networks as an example, and other subclasses of machine learning are similar.
Neural network: the neural network includes, but is not limited to, a Convolutional Neural Network (CNN), a cyclic neural network (RNN), a fully connected network, etc., and the structural units of the neural network may include, but are not limited to, a convolutional layer (Conv), a pooling layer (Pool), an excitation layer, a fully connected layer (FC), etc.
In practical applications, the neural network may be constructed by combining one or more convolution layers, one or more pooling layers, one or more excitation layers, and one or more fully-connected layers according to different requirements.
In a convolutional layer, the input data features are enhanced by a convolution operation with a convolution kernel, which may be an m x n matrix. Convolving the layer's input data features with the kernel yields the layer's output data features; the convolution operation is essentially a filtering process.
In a pooling layer, operations such as taking the maximum, minimum, or average are performed on the input data features (for example, the output of a convolutional layer). The pooling layer subsamples the input features by exploiting local correlation, reducing the amount of processing while keeping the features invariant; the pooling operation is essentially a downsampling process.
In an excitation layer, the input data features are mapped by an activation function (for example, a nonlinear function) to introduce a nonlinear factor, so that the neural network gains expressive power through nonlinear combinations.
The activation function may include, but is not limited to, the ReLU (Rectified Linear Unit) function, which sets features less than 0 to 0 and leaves features greater than 0 unchanged.
In a fully-connected layer, all data features input to the layer are fully connected to obtain a feature vector, which may comprise a plurality of data features.
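The four layer types above can be sketched as follows. This is a minimal NumPy illustration of convolution, pooling, excitation, and full connection on a toy 1-D feature sequence; the shapes, kernel, and weights are assumptions for demonstration, not the patent's model:

```python
import numpy as np

def conv1d(x, kernel):
    """Convolution layer: slide the kernel over x (a filtering process)."""
    n = len(x) - len(kernel) + 1
    return np.array([np.dot(x[i:i + len(kernel)], kernel) for i in range(n)])

def max_pool(x, size=2):
    """Pooling layer: subsample by taking the maximum of each window."""
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, size)])

def relu(x):
    """Excitation layer: ReLU zeroes features below 0 and keeps the rest."""
    return np.maximum(x, 0)

def fully_connected(x, weights, bias):
    """Fully-connected layer: combine all input features into one vector."""
    return weights @ x + bias

x = np.array([1.0, -2.0, 3.0, -4.0, 5.0, -6.0])
h = relu(max_pool(conv1d(x, np.array([1.0, -1.0]))))         # -> [3., 7.]
out = fully_connected(h, np.ones((2, len(h))), np.zeros(2))  # -> [10., 10.]
```

Stacking such layers in different orders and quantities is exactly the construction described in the preceding paragraphs.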
Network model: a model constructed using a machine learning algorithm (e.g., a deep learning algorithm), such as a model constructed using a neural network, i.e., the network model may be composed of one or more convolutional layers, one or more pooling layers, one or more excitation layers, and one or more fully-connected layers. For convenience of distinction, the untrained network model is referred to as the initial network model, and the trained network model is referred to as the target network model.
In the training process of the initial network model, the sample data is utilized to train all network parameters in the initial network model, such as convolution layer parameters (such as convolution kernel parameters), pooling layer parameters, excitation layer parameters, full connection layer parameters and the like, which are not limited. By training each network parameter in the initial network model, the initial network model is fitted with the mapping relation between the input and the output. After the initial network model training is completed, the initial network model which has completed training is the target network model, and the voice intention is recognized through the target network model.
Phonemes: phonemes are the minimum speech units divided according to the natural attributes of speech; they are analyzed from the pronunciation actions within a syllable, and one action constitutes one phoneme. For example, the Chinese syllable "a" has a single phoneme (a), "ai" has two phonemes (a and i), and "dai" has three phonemes (d, a, and i). Likewise, the Chinese word for tree, "shumu," has five phonemes (s, h, u, m, and u).
Pinyin: a pinyin combines one or more phonemes into a composite sound. For example, "dai" has three phonemes (d, a, and i) that make up a single pinyin (dai); "shumu" has five phonemes (s, h, u, m, and u) that make up two pinyins (shu and mu).
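The phoneme/pinyin relationship above can be sketched directly from the patent's own examples, where each pinyin syllable is decomposed letter by letter ("shumu" yields s, h, u, m, u). A minimal decomposition under that assumption:

```python
def pinyin_to_phonemes(syllables):
    """Flatten a list of pinyin syllables into the phoneme list,
    one phoneme per letter, as in the patent's examples."""
    return [ch for syllable in syllables for ch in syllable]

# The examples from the text:
assert pinyin_to_phonemes(["dai"]) == ["d", "a", "i"]
assert pinyin_to_phonemes(["shu", "mu"]) == ["s", "h", "u", "m", "u"]
```

Going the other direction (grouping phonemes back into pinyin syllables) requires a syllable inventory and is not specified here.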
In the related art, voice intention recognition comprises a speech recognition stage and an intention recognition stage. In the speech recognition stage, the speech to be recognized is converted into text by automatic speech recognition. In the intention recognition stage, the text is semantically understood by natural language processing to obtain keywords, and the user's voice intention is recognized from those keywords. In this text-based approach, accuracy depends on the accuracy of speech-to-text conversion; because that conversion is error-prone, the accuracy of voice intention recognition is low and the user's voice intention cannot be recognized reliably.
In view of the above findings, in the embodiment of the present application, the voice intent is recognized based on the phonemes to be recognized, instead of recognizing the voice intent based on the text, so that the accuracy of converting the voice into the text does not need to be relied on.
The technical scheme of the embodiment of the application is described below with reference to specific embodiments.
The embodiment of the application provides a voice intention recognition method that can be applied to human-machine interaction scenarios, mainly for controlling equipment according to voice intention. The method can be applied to any device that needs to be controlled by voice intention, such as an access control device, a screen-casting device, an IP camera (IPC), a server, an intelligent terminal, a robot system, or an air-conditioning device, without limitation.
The embodiment of the application involves a training process for an initial network model and a recognition process based on a target network model. In the training process, the initial network model is trained to obtain the trained target network model; in the recognition process, the voice intention is recognized based on the target network model. The two processes may run on the same device or on different devices. For example, device A may train the initial network model, obtain the target network model, and recognize voice intention with it. Alternatively, device A1 may train the initial network model to obtain the target network model, which is then deployed to device A2, and device A2 recognizes voice intention based on the target network model.
Referring to FIG. 1, for the training process of the initial network model, an embodiment of the present application provides a voice intention recognition method that implements the training of the initial network model. The method includes:
step 101, obtaining a sample voice and a sample intention corresponding to the sample voice.
For example, a large number of sample voices may be obtained from historical data, and/or sample voices input by users may be received; the acquisition manner is not limited. A sample voice is the sound made while speaking: if the user says "turn on the air conditioner," the sample voice is "turn on the air conditioner."
For each sample voice, the corresponding voice intention may be obtained; for ease of distinction, it is called the sample intention (i.e., the sample voice intention). For example, if the sample voice is "turn the air conditioner on," the sample intention may be "turn the air conditioner on."
Step 102, determining a sample phoneme set according to the sample speech.
For example, for each sample voice, a sample phoneme set may be determined from that voice, and the set may include a plurality of sample phonemes. Determining the sample phonemes is the process of identifying each phoneme from the sample voice; for ease of distinction, each identified phoneme is called a sample phoneme. The identification process is not limited, as long as the plurality of sample phonemes can be identified from the sample voice.
For example, for the sample voice "turn the air conditioner on" (ba kong tiao da kai), the sample phoneme set may include the sample phonemes "b, a, k, o, n, g, t, i, a, o, d, a, k, a, i".
Step 103, obtaining a sample phoneme vector corresponding to the sample phoneme set.
For each sample phoneme in the sample phoneme set, a phoneme feature value corresponding to that sample phoneme is determined, and the sample phoneme vector corresponding to the set is obtained from these values; the sample phoneme vector comprises the phoneme feature value corresponding to each sample phoneme.
For example, a mapping relationship between each phoneme and its phoneme feature value is maintained in advance. Assuming there are 50 phonemes in total, mappings may be maintained between phoneme 1 and phoneme feature value 1, between phoneme 2 and phoneme feature value 2, and so on, up to phoneme 50 and phoneme feature value 50.
On this basis, in step 103, the phoneme feature value corresponding to each sample phoneme in the sample phoneme set can be obtained by querying this mapping relationship, and the phoneme feature values of all sample phonemes in the set are combined to obtain the sample phoneme vector.
For example, for the sample phoneme set "b, a, k, o, n, g, t, i, a, o, d, a, k, a, i", the sample phoneme vector is a 15-dimensional feature vector that sequentially includes the phoneme feature values corresponding to "b", "a", "k", "o", "n", "g", "t", "i", "a", "o", "d", "a", "k", "a", and "i".
In one possible implementation, all phonemes may be ordered, and assuming that there are 50 total phonemes, the sequence numbers of the 50 phonemes are respectively 1-50, and for each phoneme, the phoneme feature value may be a 50-bit numerical value. Assuming that the sequence number of the phoneme is M, in the phoneme feature value corresponding to the phoneme, the value of the mth bit is a first value, and the values of the other bits except the mth bit are second values. For example, the phoneme feature value corresponding to the phoneme with the sequence number 1, the 1 st bit value is the first value, and the 2 nd to 50 th bit values are the second values; the phoneme characteristic value corresponding to the phoneme with the sequence number of 2, the value of the 2 nd bit is the first value, the values of the 1 st bit and the 3 rd to 50 th bits are the second values, and so on.
In summary, for the sample phoneme set "b, a, k, o, n, g, t, i, a, o, d, a, k, a, i", the sample phoneme vector may be a 15×50-dimensional feature vector with 15 rows and 50 columns, each row being the phoneme feature value of one phoneme.
In the above embodiment, the first value and the second value may be empirically configured, and are not limited thereto, for example, the first value is 1, the second value is 0, or the first value is 0, the second value is 1, or the first value is 255, the second value is 0, or the first value is 0, and the second value is 255.
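The feature-value scheme above can be sketched as follows, assuming 50 ordered phonemes and choosing first value 1 and second value 0 (i.e., one-hot rows). The phoneme ordering below is hypothetical; the patent does not fix an inventory:

```python
import numpy as np

# Hypothetical inventory: 26 letters plus placeholders to reach 50 phonemes.
INVENTORY = [chr(ord("a") + i) for i in range(26)] + [f"ph{i}" for i in range(24)]
INDEX = {p: i for i, p in enumerate(INVENTORY)}  # phoneme -> sequence number

def phoneme_feature(phoneme, first=1, second=0):
    """50-bit feature value: bit M is `first`, all other bits `second`."""
    vec = np.full(len(INVENTORY), second)
    vec[INDEX[phoneme]] = first
    return vec

def phoneme_set_to_vector(phonemes):
    """Stack one feature value per phoneme: a len(phonemes) x 50 matrix."""
    return np.stack([phoneme_feature(p) for p in phonemes])

sample_set = list("bakongtiaodakai")  # phoneme set for "turn the air conditioner on"
sample_vector = phoneme_set_to_vector(sample_set)  # 15 rows, 50 columns
```

Swapping `first`/`second` to 0/1 or 255/0 reproduces the other configurations mentioned above without changing the structure.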
Step 104, inputting the sample phoneme vector and the sample intention corresponding to the sample phoneme vector to the initial network model, so as to train the initial network model through the sample phoneme vector and the sample intention, and obtain a trained target network model. For example, since the initial network model is trained using the sample phoneme vector and the sample intent (i.e., sample voice intent), a trained target network model is obtained, and thus the target network model may be used to record the mapping relationship of the phoneme vector and the voice intent.
Referring to the above embodiment, a large number of sample voices can be acquired. For each sample voice, the sample intention corresponding to the sample voice is acquired, and the sample phoneme vector corresponding to the sample phoneme set of the sample voice is obtained; that is, the sample phoneme vector corresponding to the sample voice and the sample intention (which participates in training as the label information of the sample phoneme vector) are obtained. Based on this, a large number of sample phoneme vectors and the sample intention (i.e., label information) corresponding to each sample phoneme vector can be input to the initial network model, so that each network parameter in the initial network model is trained using the sample phoneme vectors and the sample intentions; the training process is not limited. After the training is completed, the initial network model that has completed training is the target network model.
For example, a large number of sample phoneme vectors and sample intentions may be input to a first network layer of the initial network model, the first network layer processes the data to obtain output data of the first network layer, the output data of the first network layer is input to a second network layer of the initial network model, and so on until the data is input to a last network layer of the initial network model, the last network layer processes the data to obtain output data, and the output data is recorded as a target feature vector.
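The layer-by-layer processing above can be sketched as follows. The patent does not fix a concrete architecture, so everything here is an illustrative assumption: two dense layers, a ReLU activation, a softmax output, a flattened 15×50 input, and 3 candidate intents.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_relu(x, w, b):
    # One network layer: affine transform followed by a ReLU activation (an assumption).
    return np.maximum(x @ w + b, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy shapes: flattened 15x50 one-hot input -> 64 hidden units -> 3 intents.
w1, b1 = rng.normal(size=(750, 64)), np.zeros(64)  # first network layer
w2, b2 = rng.normal(size=(64, 3)), np.zeros(3)     # last network layer

x = np.zeros(750)                             # a flattened sample phoneme vector
h = dense_relu(x, w1, b1)                     # output data of the first network layer
target_feature_vector = softmax(h @ w2 + b2)  # output data of the last network layer
```

The `target_feature_vector` here plays the role of the output of the last network layer described above, against which the loss is later computed.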
After the target feature vector is obtained, it is determined whether the initial network model has converged based on the target feature vector. If the initial network model is converged, determining the converged initial network model as a trained target network model, and completing the training process of the initial network model. And if the initial network model is not converged, adjusting network parameters of the initial network model which is not converged to obtain an adjusted initial network model.
Based on the adjusted initial network model, a large number of sample phoneme vectors and sample intentions can be input to the adjusted initial network model, so that the adjusted initial network model is retrained, and detailed training processes are referred to the above embodiments and are not repeated here. And so on until the initial network model has converged and determining the converged initial network model as the trained target network model.
In the above embodiment, determining whether the initial network model has converged based on the target feature vector may include, but is not limited to, the following: a loss function is constructed in advance; the loss function is not limited herein and can be set empirically. After the target feature vector is obtained, a loss value of the loss function may be determined according to the target feature vector; for example, the target feature vector may be substituted into the loss function to obtain the loss value. After the loss value is obtained, whether the initial network model has converged is determined according to the loss value of the loss function.
For example, whether the initial network model has converged may be determined based on a single loss value: a loss value 1 is obtained based on the target feature vector; if loss value 1 is not greater than a threshold, it is determined that the initial network model has converged, and if loss value 1 is greater than the threshold, it is determined that the initial network model has not converged.
Whether the initial network model has converged can also be determined according to multiple loss values from multiple iterations. For example, in each iteration, the initial network model of the previous iteration is adjusted to obtain an adjusted initial network model, and each iteration yields a loss value. A variation amplitude curve of the loss values is then determined. If it is determined from the curve that the variation amplitude of the loss values has stabilized (i.e., the loss values of several consecutive iterations are unchanged or change only slightly) and the loss value of the last iteration is not greater than a threshold, it is determined that the initial network model of the last iteration has converged. Otherwise, it is determined that the initial network model of the last iteration has not converged, the next iteration is continued to obtain its loss value, and the variation amplitude curve of the loss values is re-determined.
In practical applications, other ways of determining whether the initial network model has converged may be used, without limitation. For example, if the iteration number reaches a preset number threshold, determining that the initial network model has converged; for another example, if the iteration duration reaches a preset duration threshold, it is determined that the initial network model has converged.
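The convergence criteria listed above (loss threshold, stable variation amplitude over recent iterations, and fallback limits on iteration count and training duration) can be combined in a sketch like the following; all constants are illustrative, not values from the patent.

```python
def has_converged(losses, *, threshold=0.01, window=5, eps=1e-4,
                  iteration=0, max_iterations=10000,
                  elapsed_s=0.0, max_duration_s=3600.0):
    """Decide convergence from the history of per-iteration loss values."""
    # Fallback criteria: preset iteration-count or duration threshold reached.
    if iteration >= max_iterations or elapsed_s >= max_duration_s:
        return True
    # Not enough iterations yet to judge the variation amplitude.
    if len(losses) < window:
        return False
    recent = losses[-window:]
    # Variation amplitude is stable: recent losses unchanged or nearly so.
    stable = max(recent) - min(recent) <= eps
    # Converged only if also the last loss is not greater than the threshold.
    return stable and recent[-1] <= threshold
```

In a training loop, this check would run once per iteration after the loss value is computed, and network parameters would be adjusted whenever it returns `False`.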
In summary, the initial network model may be trained by the sample phoneme vector and the sample intention corresponding to the sample phoneme vector, so as to obtain a trained target network model.
Referring to fig. 2, for a recognition process based on a target network model, a method for recognizing a voice intention is provided in an embodiment of the present application, where the method can implement recognition of a voice intention, and the method includes:
in step 201, a set of phonemes to be recognized is determined from the speech to be recognized.
For example, after the speech to be recognized is obtained, a set of phonemes to be recognized may be determined according to the speech to be recognized. The process of determining the phonemes to be recognized according to the speech to be recognized is a process of recognizing each phoneme from the speech to be recognized; for ease of distinction, each recognized phoneme is referred to as a phoneme to be recognized. Thus, a plurality of phonemes to be recognized may be recognized from the speech to be recognized, and the recognition process is not limited as long as a plurality of phonemes to be recognized can be recognized. For example, for the speech to be recognized "turn on the air conditioner", the set of phonemes to be recognized may include the following phonemes to be recognized: "k, a, i, k, o, n, g, t, i, a, o".
Step 202, obtaining a phoneme vector to be recognized corresponding to the phoneme set to be recognized. For each phoneme to be identified in the set of phonemes to be identified, a phoneme feature value corresponding to the phoneme to be identified is determined, and a phoneme vector to be identified corresponding to the set of phonemes to be identified is obtained based on the phoneme feature value corresponding to each phoneme to be identified, wherein the phoneme vector to be identified comprises the phoneme feature value corresponding to each phoneme to be identified.
For example, a mapping relationship between each phoneme in all phonemes and the phoneme feature value is maintained in advance, and assuming that there are 50 phonemes in total, a mapping relationship between the phoneme 1 and the phoneme feature value 1, a mapping relationship between the phoneme 2 and the phoneme feature value 2, a mapping relationship between the phoneme 50 and the phoneme feature value 50 may be maintained.
In step 202, for each phoneme to be identified in the set of phonemes to be identified, a phoneme feature value corresponding to the phoneme to be identified can be obtained by querying the mapping relation, and the phoneme feature values corresponding to each phoneme to be identified in the set of phonemes to be identified are combined to obtain the phoneme vector to be identified.
In one possible implementation, all phonemes may be ordered. Assuming that there are 50 phonemes in total, the sequence numbers of the 50 phonemes are 1-50, and for each phoneme, the phoneme feature value may be a 50-bit value. Assuming that the sequence number of the phoneme is M, then in the phoneme feature value corresponding to the phoneme, the value of the M-th bit is a first value, and the values of the bits other than the M-th bit are second values. For example, in the phoneme feature value corresponding to the phoneme with sequence number 1, the value of the 1st bit is the first value, and the values of the 2nd to 50th bits are the second values; in the phoneme feature value corresponding to the phoneme with sequence number 2, the value of the 2nd bit is the first value, and the values of the 1st bit and the 3rd to 50th bits are the second values, and so on.
Step 203, inputting the phoneme vector to be recognized into the trained target network model, so that the target network model outputs the voice intention corresponding to the phoneme vector to be recognized. For example, the target network model is used to record a mapping relationship between a phoneme vector and a voice intention, and after the phoneme vector to be recognized is input to the target network model, the target network model may output the voice intention corresponding to the phoneme vector to be recognized.
For example, the phoneme vector to be identified may be input to a first network layer of the target network model, the first network layer processes the phoneme vector to be identified to obtain output data of the first network layer, the output data of the first network layer is input to a second network layer of the target network model, and so on until the data is input to a last network layer of the target network model, the last network layer processes the data to obtain output data, and the output data is recorded as the target feature vector.
Because the target network model is used for recording the mapping relation between the phoneme vector and the voice intention, after the target feature vector is obtained, the mapping relation can be queried based on the target feature vector to obtain the voice intention corresponding to the target feature vector, the voice intention can be the voice intention corresponding to the phoneme vector to be recognized, and the target network model can output the voice intention corresponding to the phoneme vector to be recognized.
After the voice intention corresponding to the phoneme vector to be recognized is obtained, the device can be controlled based on the voice intention, and the control mode is not limited; for example, when the voice intention is "turn on the air conditioner", the air conditioner is turned on.
In one possible implementation, when the target network model outputs the voice intent corresponding to the phoneme vector to be recognized, a probability value (e.g., a probability value between 0 and 1, which may also be referred to as a confidence) corresponding to the voice intent may also be output, for example, the target network model may output the voice intent 1 and the probability value 1 of the voice intent 1 (e.g., 0.8), the voice intent 2 and the probability value 2 of the voice intent 2 (e.g., 0.1), the voice intent 3 and the probability value 3 of the voice intent 3 (e.g., 0.08), and so on.
Based on the above output data, the voice intention with the largest probability value may be taken as the voice intention corresponding to the phoneme vector to be recognized; for example, voice intention 1, which has the largest probability value, may be taken as the voice intention corresponding to the phoneme vector to be recognized. Alternatively, the voice intention with the largest probability value is determined first, and it is then determined whether the probability value of that voice intention (i.e., the largest probability value) is greater than a preset probability threshold: if so, that voice intention is taken as the voice intention corresponding to the phoneme vector to be recognized; otherwise, it is determined that no voice intention corresponds to the phoneme vector to be recognized.
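The selection rule above can be sketched as follows: take the intent with the largest probability value, and optionally reject it when that probability does not exceed a preset threshold. The intent names, scores, and the 0.5/0.9 thresholds are illustrative assumptions.

```python
def select_intent(scores, threshold=None):
    """scores: mapping from voice intention to probability value (confidence).

    Returns the voice intention with the largest probability value, or None
    when a threshold is given and the largest probability does not exceed it.
    """
    intent = max(scores, key=scores.get)
    if threshold is not None and scores[intent] <= threshold:
        return None  # no voice intention is taken for this input
    return intent

# Example output data of the model, mirroring the probabilities above.
scores = {"voice intent 1": 0.8, "voice intent 2": 0.1, "voice intent 3": 0.08}
```

For instance, `select_intent(scores)` picks "voice intent 1", while `select_intent(scores, threshold=0.9)` rejects it because 0.8 is not greater than the threshold.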
As can be seen from the above technical solutions, in the embodiments of the present application, the voice intention is recognized based on the phonemes to be recognized, rather than based on text, so there is no need to rely on the accuracy of converting voice into text. Since the accuracy of determining the phonemes to be recognized based on the voice to be recognized is high, the accuracy of recognizing the voice intention is high, the voice intention of the user can be accurately recognized, and the accuracy of voice intention recognition is effectively improved.
For example, the user utters the speech to be recognized "I want to see a photo with a tree", and the terminal device (such as an IPC, a smartphone, etc.) determines that the phonemes of the speech to be recognized are "w, o, x, i, a, n, g, k, a, n, y, o, u, s, h, u, m, u, d, e, z, h, a, o, p, i, a, n"; that is, the phonemes corresponding to "tree" are "s, h, u, m, u". The voice intention is determined based on these phonemes, so there is no need to resolve from the speech whether "shu mu" means "number" or "tree" (homophones in Chinese), which avoids determining the voice intention from the wrong homophone. Intention recognition is therefore more reliable, a large speech-recognition language-model algorithm library is not needed, and performance and memory usage are greatly optimized.
In another implementation manner of the embodiment of the application, the voice intention is recognized based on the pinyin to be recognized, instead of the text, so that the accuracy of converting the voice into the text is not required to be relied on.
The technical scheme of the embodiment of the application is described below with reference to specific embodiments.
The embodiment of the application provides a voice intention recognition method which can be applied to a man-machine interaction application scene and is mainly used for controlling equipment according to voice intention. The method can be applied to any device that needs to be controlled according to voice intention, such as an access control device, a screen throwing device, an IPC (internet protocol Camera), a server, an intelligent terminal, a robot system, an air conditioning device, and the like, without limitation.
The embodiment of the application can relate to a training process of an initial network model and an identification process based on a target network model. In the training process of the initial network model, the initial network model can be trained to obtain a trained target network model. In the recognition process based on the target network model, the voice intent may be recognized based on the target network model. The training process of the initial network model and the recognition process based on the target network model can be implemented in the same device or in different devices.
Referring to fig. 3, for the training process of the initial network model, a method for identifying a voice intention is provided in an embodiment of the present application, where the method may implement training of the initial network model, and the method includes:
Step 301, a sample voice and a sample intention corresponding to the sample voice are obtained.
For example, a large number of sample voices may be obtained from the history data, and/or a large number of sample voices input by the user may be received, and the obtaining manner is not limited, and the sample voices represent sounds made while speaking. For example, the sound made when speaking is "turn on air conditioner", and the sample speech is "turn on air conditioner".
For each sample voice, a voice intent corresponding to the sample voice may be obtained, and for convenience of distinction, the voice intent corresponding to the sample voice may be referred to as a sample intent (i.e., a sample voice intent). For example, if the sample voice is "turn air conditioner on", the sample intent may be "turn air conditioner on".
Step 302, determining a sample pinyin collection based on the sample speech.
For example, for each sample voice, a sample pinyin set may be determined according to the sample voice, where the sample pinyin set may include a plurality of sample pinyins. The process of determining the sample pinyins according to the sample voice is a process of identifying each pinyin from the sample voice; for ease of distinction, each identified pinyin is referred to as a sample pinyin. Thus, a plurality of sample pinyins may be identified from the sample voice, and the identification process is not limited as long as a plurality of sample pinyins can be identified.
For example, for sample speech "turn air conditioner on", the sample pinyin collection may include the following sample pinyins "ba", "kong", "tiao", "da", "kai".
Step 303, obtaining a sample pinyin vector corresponding to the sample pinyin set.
For each sample pinyin in the sample pinyin set, determining a pinyin feature value corresponding to the sample pinyin, and obtaining a sample pinyin vector corresponding to the sample pinyin set based on the pinyin feature value corresponding to each sample pinyin, where the sample pinyin vector includes the pinyin feature value corresponding to each sample pinyin.
For example, the mapping relationship between each pinyin in all pinyins and the pinyin feature value is maintained in advance, and if there are 400 pinyins in total, the mapping relationship between the pinyin 1 and the pinyin feature value 1, the mapping relationship between the pinyin 2 and the pinyin feature value 2, …, and the mapping relationship between the pinyin 400 and the pinyin feature value 400 can be maintained.
Based on this, in step 303, for each sample pinyin in the sample pinyin set, a pinyin feature value corresponding to the sample pinyin may be obtained by querying the mapping relationship, and the pinyin feature value corresponding to each sample pinyin in the sample pinyin set is combined to obtain the sample pinyin vector.
For example, for the sample pinyin set "ba", "kong", "tiao", "da", "kai", the sample pinyin vector may be a 5-dimensional feature vector, and the feature vector may sequentially include the pinyin feature value corresponding to "ba", the pinyin feature value corresponding to "kong", the pinyin feature value corresponding to "tiao", the pinyin feature value corresponding to "da", and the pinyin feature value corresponding to "kai".
In one possible implementation, all pinyins may be ordered. Assuming that there are 400 pinyins in total, the sequence numbers of the 400 pinyins are 1-400, and for each pinyin, the corresponding pinyin feature value may be a 400-bit value. Assuming that the sequence number of the pinyin is N, then in the pinyin feature value corresponding to the pinyin, the value of the N-th bit is a first value, and the values of the bits other than the N-th bit are second values. For example, in the pinyin feature value corresponding to the pinyin with sequence number 1, the value of the 1st bit is the first value, and the values of the 2nd to 400th bits are the second values; in the pinyin feature value corresponding to the pinyin with sequence number 2, the value of the 2nd bit is the first value, and the values of the 1st bit and the 3rd to 400th bits are the second values, and so on.
In summary, for the sample pinyin set "ba", "kong", "tiao", "da" and "kai", the sample pinyin vector may be a 5×400-dimensional feature vector, where the feature vector includes 5 rows and 400 columns, and each row represents the pinyin feature value corresponding to one pinyin, which is not described in detail herein.
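The pinyin-to-feature-value mapping maintained in advance and the assembly of the sample pinyin vector can be sketched as follows. The toy 5-pinyin table stands in for the full 400-pinyin table, with first value 1 and second value 0 assumed.

```python
# Toy pinyin table; the sequence number of a pinyin is its index + 1.
PINYINS = ["ba", "da", "kai", "kong", "tiao"]

# Mapping relationship between each pinyin and its (one-hot) pinyin feature value,
# maintained in advance as described above.
FEATURE_MAP = {
    p: [1 if j == i else 0 for j in range(len(PINYINS))]
    for i, p in enumerate(PINYINS)
}

def pinyin_vector(pinyins):
    """Query the mapping for each pinyin and combine the results row by row."""
    return [FEATURE_MAP[p] for p in pinyins]

matrix = pinyin_vector(["ba", "kong", "tiao", "da", "kai"])
# 5 rows x 5 columns here; 5 rows x 400 columns with the full pinyin table
```

The same lookup-and-combine step serves both the sample pinyin set during training and the pinyin set to be recognized during inference.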
Step 304, inputting the sample pinyin vector and the sample intent corresponding to the sample pinyin vector to the initial network model, so as to train the initial network model through the sample pinyin vector and the sample intent, and obtain a trained target network model. For example, the initial network model is trained by using the sample pinyin vector and the sample intent (i.e., the sample voice intent) to obtain a trained target network model, so that the target network model can be used to record the mapping relationship between the pinyin vector and the voice intent.
Referring to the above embodiment, a large number of sample voices can be obtained, and for each sample voice, a sample intent corresponding to the sample voice is obtained, and a sample pinyin vector corresponding to a sample pinyin set corresponding to the sample voice, that is, a sample pinyin vector corresponding to the sample voice and a sample intent (as label information of the sample pinyin vector participating in training) are obtained. Based on the above, a large number of sample pinyin vectors and sample intentions (i.e., label information) corresponding to each sample pinyin vector can be input into the initial network model, so that each network parameter in the initial network model is trained by using the sample pinyin vectors and the sample intentions, and the training process is not limited. After the initial network model training is completed, the initial network model that has completed training is the target network model.
For example, a large number of sample pinyin vectors and sample intent may be input to a first network layer of the initial network model, the first network layer processes the data to obtain output data for the first network layer, the output data for the first network layer is input to a second network layer of the initial network model, and so on until the data is input to a last network layer of the initial network model, the last network layer processes the data to obtain output data, and the output data is recorded as the target feature vector.
After the target feature vector is obtained, it is determined whether the initial network model has converged based on the target feature vector. If the initial network model is converged, determining the converged initial network model as a trained target network model, and completing the training process of the initial network model. And if the initial network model is not converged, adjusting network parameters of the initial network model which is not converged to obtain an adjusted initial network model.
Based on the adjusted initial network model, a large number of sample pinyin vectors and sample intentions can be input to the adjusted initial network model, so that the adjusted initial network model is retrained, and detailed training processes are referred to the above embodiments and are not repeated here. And so on until the initial network model has converged and determining the converged initial network model as the trained target network model.
In the above embodiment, determining whether the initial network model has converged based on the target feature vector may include, but is not limited to: the loss function is constructed in advance, and is not limited and can be empirically set. After the target feature vector is obtained, a loss value of the loss function may be determined according to the target feature vector, for example, the target feature vector may be substituted into the loss function to obtain the loss value of the loss function. After obtaining the loss value of the loss function, determining whether the initial network model is converged according to the loss value of the loss function.
In practical applications, other ways of determining whether the initial network model has converged may be used, without limitation. For example, if the iteration number reaches a preset number threshold, determining that the initial network model has converged; for another example, if the iteration duration reaches a preset duration threshold, it is determined that the initial network model has converged.
In summary, the initial network model may be trained by the sample pinyin vector and the sample intent corresponding to the sample pinyin vector, so as to obtain a trained target network model.
Referring to fig. 4, for a recognition process based on a target network model, a method for recognizing a voice intention is provided in an embodiment of the present application, where the method can implement recognition of a voice intention, and the method includes:
Step 401, determining a pinyin set to be recognized according to the voice to be recognized.
For example, after the voice to be recognized is obtained, a set of pinyin to be recognized may be determined according to the voice to be recognized, where the set of pinyin to be recognized may include a plurality of pinyin to be recognized, and the process of determining the pinyin to be recognized according to the voice to be recognized is a process of recognizing each pinyin from the voice to be recognized. For example, for the voice to be recognized to "turn on the air conditioner", the set of pinyin to be recognized may include the following pinyin to be recognized "kai", "kong", "tiao".
Step 402, obtaining a pinyin vector to be recognized corresponding to the pinyin set to be recognized. For each pinyin to be identified in the set of pinyin to be identified, a pinyin characteristic value corresponding to the pinyin to be identified is determined, and based on the pinyin characteristic value corresponding to each pinyin to be identified, a pinyin vector to be identified corresponding to the set of pinyin to be identified is obtained, wherein the pinyin vector to be identified comprises the pinyin characteristic value corresponding to each pinyin to be identified.
For example, the mapping relationship between each pinyin in all pinyins and the pinyin feature value is maintained in advance, and if there are 400 pinyins in total, the mapping relationship between the pinyin 1 and the pinyin feature value 1, the mapping relationship between the pinyin 2 and the pinyin feature value 2, …, and the mapping relationship between the pinyin 400 and the pinyin feature value 400 can be maintained.
In step 402, for each pinyin to be identified in the pinyin set to be identified, a pinyin feature value corresponding to the pinyin to be identified may be obtained by querying the mapping relationship, and the pinyin feature values corresponding to each pinyin to be identified in the pinyin set to be identified may be combined to obtain the pinyin vector to be identified.
In one possible implementation, all pinyins may be ordered. Assuming that there are 400 pinyins in total, the sequence numbers of the 400 pinyins are 1-400, and for each pinyin, the corresponding pinyin feature value may be a 400-bit value. Assuming that the sequence number of the pinyin is N, then in the pinyin feature value corresponding to the pinyin, the value of the N-th bit is a first value, and the values of the bits other than the N-th bit are second values. For example, in the pinyin feature value corresponding to the pinyin with sequence number 1, the value of the 1st bit is the first value, and the values of the 2nd to 400th bits are the second values; in the pinyin feature value corresponding to the pinyin with sequence number 2, the value of the 2nd bit is the first value, and the values of the 1st bit and the 3rd to 400th bits are the second values, and so on.
Step 403, inputting the pinyin vector to be recognized to the trained target network model, so that the target network model outputs the voice intention corresponding to the pinyin vector to be recognized. The target network model is used for recording mapping relation between pinyin vectors and voice intents, and after the pinyin vectors to be recognized are input into the target network model, the target network model can output the voice intents corresponding to the pinyin vectors to be recognized.
For example, the pinyin vector to be identified may be input to a first network layer of the target network model, the first network layer processes the pinyin vector to be identified to obtain output data of the first network layer, the output data of the first network layer is input to a second network layer of the target network model, and so on until the data is input to a last network layer of the target network model, the last network layer processes the data to obtain output data, and the output data is recorded as the target feature vector.
Because the target network model is used for recording the mapping relation between the pinyin vector and the voice intention, after the target feature vector is obtained, the mapping relation can be queried based on the target feature vector to obtain the voice intention corresponding to the target feature vector, the voice intention can be the voice intention corresponding to the pinyin vector to be recognized, and the target network model can output the voice intention corresponding to the pinyin vector to be recognized.
After the voice intention corresponding to the pinyin vector to be recognized is obtained, the device can be controlled based on the voice intention, and the control mode is not limited; for example, when the voice intention is "turn on the air conditioner", the air conditioner is turned on.
In one possible implementation, when the target network model outputs the voice intent corresponding to the pinyin vector to be recognized, a probability value (e.g., a probability value between 0-1, which may also be referred to as a confidence) corresponding to the voice intent may also be output, for example, the target network model may output the voice intent 1 and the probability value 1 of the voice intent 1 (e.g., 0.8), the voice intent 2 and the probability value 2 of the voice intent 2 (e.g., 0.1), the voice intent 3 and the probability value 3 of the voice intent 3 (e.g., 0.08), and so on.
Based on the above output data, the voice intention with the largest probability value may be taken as the voice intention corresponding to the pinyin vector to be recognized; for example, voice intention 1, which has the largest probability value, may be taken as the voice intention corresponding to the pinyin vector to be recognized. Alternatively, the voice intention with the largest probability value is determined first, and it is then determined whether the probability value of that voice intention (i.e., the largest probability value) is greater than a preset probability threshold: if so, that voice intention is taken as the voice intention corresponding to the pinyin vector to be recognized; otherwise, it is determined that no voice intention corresponds to the pinyin vector to be recognized.
As can be seen from the above technical solutions, in the embodiments of the present application, the voice intention is recognized based on the pinyin to be recognized, rather than based on text, so there is no need to rely on the accuracy of converting voice into text. Since the accuracy of determining the pinyin to be recognized based on the voice to be recognized is high, the accuracy of recognizing the voice intention is high, so the voice intention of the user can be accurately recognized and the accuracy of voice intention recognition is effectively improved. For example, the user utters the speech to be recognized "I want to see a photo with a tree", and the terminal device (such as an IPC, a smartphone, etc.) determines that the pinyin of the speech to be recognized is "wo, xiang, kan, you, shu, mu, de, zhao, pian"; that is, the pinyin corresponding to "tree" is "shu, mu". The voice intention is determined based on the pinyin, so there is no need to resolve from the speech whether "shu, mu" means "number" or "tree" (homophones in Chinese), which avoids determining the voice intention from the wrong homophone. Intention recognition is therefore more reliable, a large speech-recognition language-model algorithm library is not needed, and performance and memory usage are greatly optimized.
Based on the same application concept as the above method, an embodiment of the present application provides a voice intention recognition device, as shown in fig. 5A, which is a schematic structural diagram of the device, where the device may include:
a determining module 511 for determining a set of phonemes to be recognized from the speech to be recognized;
an obtaining module 512, configured to obtain a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
a processing module 513, configured to input the phoneme vector to be recognized to a trained target network model, so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
In one possible implementation, the phoneme set to be recognized includes a plurality of phonemes to be recognized, and the obtaining module 512 is specifically configured to, when obtaining the phoneme vector to be recognized corresponding to the phoneme set to be recognized:
Determining a phoneme characteristic value corresponding to each phoneme to be recognized; and acquiring a phoneme vector to be recognized corresponding to the phoneme set to be recognized based on the phoneme characteristic value corresponding to each phoneme to be recognized, wherein the phoneme vector to be recognized comprises the phoneme characteristic value corresponding to each phoneme to be recognized.
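The mapping from phonemes to characteristic values and the assembly of the phoneme vector can be sketched as below. The phoneme inventory, the use of indices as characteristic values, the padding value, and the fixed vector length are all assumptions for illustration; the patent does not prescribe specific values:

```python
# Hypothetical phoneme inventory; a real system would cover the full
# phoneme set of the target language.
PHONEME_INVENTORY = ["sil", "w", "o", "x", "i", "a", "ng", "k", "sh", "u"]
PHONEME_TO_VALUE = {p: i for i, p in enumerate(PHONEME_INVENTORY)}

def phoneme_vector(phonemes, length=8, pad=0):
    """Map each phoneme to its characteristic value, then pad/truncate to a
    fixed length so the vector can be fed to the network model."""
    values = [PHONEME_TO_VALUE[p] for p in phonemes]
    return values[:length] + [pad] * max(0, length - len(values))

print(phoneme_vector(["w", "o", "x", "i", "a", "ng"]))
# [1, 2, 3, 4, 5, 6, 0, 0]
```

The same construction applies verbatim to the pinyin variant, with a pinyin inventory in place of the phoneme inventory.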
In one possible implementation, the determining module 511 is further configured to: acquiring sample voice and sample intention corresponding to the sample voice; determining a sample phoneme set according to the sample speech; the acquisition module 512 is further configured to: acquiring a sample phoneme vector corresponding to the sample phoneme set; the processing module 513 is further configured to: and inputting the sample phoneme vector and the sample intention to an initial network model, and training the initial network model through the sample phoneme vector and the sample intention to obtain the target network model.
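The training flow above (sample phoneme vectors plus sample intents fed to an initial model, yielding the target network model) can be sketched with a deliberately simple stand-in. A real implementation would train a neural network; this toy "model" merely records the vector-to-intent mapping and predicts by nearest prototype, which is enough to show the data flow. All names and sample values are assumptions:

```python
def train(initial_model, samples):
    """samples: list of (sample_phoneme_vector, sample_intent) pairs.
    Returns the trained 'target network model' (here, a prototype store)."""
    for vector, intent in samples:
        initial_model.setdefault(intent, []).append(vector)
    return initial_model

def predict(model, vector):
    """Predict the voice intention whose stored vector is closest."""
    def dist(v, w):
        return sum((a - b) ** 2 for a, b in zip(v, w))
    candidates = ((intent, dist(vector, v))
                  for intent, vectors in model.items() for v in vectors)
    return min(candidates, key=lambda t: t[1])[0]

model = train({}, [([1, 2, 3], "show_photos"), ([7, 8, 9], "play_music")])
print(predict(model, [1, 2, 4]))  # show_photos
```

Swapping the prototype store for a trainable network (e.g. an embedding layer followed by a classifier) keeps the same interface: vectors and intents in, target model out.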
In one possible implementation, the sample phoneme set includes a plurality of sample phonemes, and the obtaining module 512 is specifically configured to, when obtaining a sample phoneme vector corresponding to the sample phoneme set:
Determining a phoneme characteristic value corresponding to each sample phoneme aiming at each sample phoneme;
And acquiring a sample phoneme vector corresponding to the sample phoneme set based on the phoneme characteristic value corresponding to each sample phoneme, wherein the sample phoneme vector comprises the phoneme characteristic value corresponding to each sample phoneme.
Based on the same application concept as the above method, an embodiment of the present application provides a voice intention recognition device, as shown in fig. 5B, which is a schematic structural diagram of the device, where the device may include:
a determining module 521, configured to determine a pinyin set to be recognized according to the speech to be recognized;
an obtaining module 522, configured to obtain a pinyin vector to be identified corresponding to the pinyin set to be identified;
A processing module 523, configured to input the pinyin vector to be recognized to a trained target network model, so that the target network model outputs a voice intent corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
In one possible implementation, the pinyin set to be identified includes a plurality of pinyins to be identified, and the obtaining module 522 is specifically configured to, when obtaining the pinyin vector to be identified corresponding to the pinyin set to be identified: determine a pinyin characteristic value corresponding to each pinyin to be identified; and obtain, based on the pinyin characteristic value corresponding to each pinyin to be identified, a pinyin vector to be identified corresponding to the pinyin set to be identified, wherein the pinyin vector to be identified includes the pinyin characteristic value corresponding to each pinyin to be identified.
In one possible implementation, the determining module 521 is further configured to: acquiring sample voice and sample intention corresponding to the sample voice; determining a sample pinyin set according to the sample speech; the acquisition module 522 is further configured to: acquiring a sample pinyin vector corresponding to the sample pinyin set; the processing module 523 is further configured to: and inputting the sample pinyin vector and the sample intention to an initial network model, and training the initial network model through the sample pinyin vector and the sample intention to obtain the target network model.
In one possible implementation, the sample pinyin set includes a plurality of sample pinyins, and the acquiring module 522 is specifically configured to, when acquiring the sample pinyin vector corresponding to the sample pinyin set:
determining a pinyin characteristic value corresponding to each sample pinyin;
And acquiring a sample pinyin vector corresponding to the sample pinyin set based on the pinyin characteristic value corresponding to each sample pinyin, wherein the sample pinyin vector comprises the pinyin characteristic value corresponding to each sample pinyin.
Based on the same application concept as the above method, an embodiment of the present application provides a voice intention recognition apparatus, as shown in fig. 6, including: a processor 61 and a machine-readable storage medium 62, the machine-readable storage medium 62 storing machine-executable instructions executable by the processor 61; the processor 61 is configured to execute machine executable instructions to implement the following steps:
Determining a phoneme set to be recognized according to the voice to be recognized;
Obtaining a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
Inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention;
or determining a pinyin set to be recognized according to the voice to be recognized;
acquiring a pinyin vector to be recognized corresponding to the pinyin set to be recognized;
inputting the pinyin vector to be recognized to a trained target network model, so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
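The processor steps above form a three-stage pipeline: speech to pinyin (or phoneme) set, set to vector, vector to voice intention. A minimal sketch, assuming hypothetical helper callables for each stage (the patent does not fix their implementations):

```python
def recognize_intent(speech, to_pinyin_set, to_pinyin_vector, target_model):
    """Run the three steps executed by the processor, in order."""
    pinyin_set = to_pinyin_set(speech)            # step 1: speech -> pinyin set
    pinyin_vector = to_pinyin_vector(pinyin_set)  # step 2: set -> vector
    return target_model(pinyin_vector)            # step 3: vector -> intent

# Toy stand-ins, purely to exercise the pipeline shape:
intent = recognize_intent(
    "wo xiang kan zhao pian",
    to_pinyin_set=str.split,                      # pretend ASR output
    to_pinyin_vector=lambda s: [len(p) for p in s],
    target_model=lambda v: "show_photos" if sum(v) > 10 else "unknown",
)
print(intent)  # show_photos
```

The phoneme variant of the pipeline is identical in shape; only the stage implementations differ.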
Based on the same application concept as the above method, the embodiment of the present application further provides a machine-readable storage medium, where a number of computer instructions are stored, where the computer instructions can implement the voice intent recognition method disclosed in the above example of the present application when the computer instructions are executed by a processor.
Wherein the machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information, such as executable instructions, data, or the like. For example, a machine-readable storage medium may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state disk, any type of storage disk (e.g., an optical disk, a DVD, etc.), or a similar storage medium, or a combination thereof.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Moreover, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (9)

1. A method of speech intent recognition, the method comprising:
Determining a phoneme set to be recognized according to the voice to be recognized; the set of phonemes to be identified includes a plurality of phonemes to be identified; the phonemes to be recognized are determined according to the voice to be recognized;
Determining a phoneme characteristic value corresponding to each phoneme to be recognized; based on the phoneme characteristic value corresponding to each phoneme to be identified, obtaining a phoneme vector to be identified corresponding to the phoneme set to be identified, wherein the phoneme vector to be identified comprises a phoneme characteristic value corresponding to each phoneme to be identified;
Inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
2. The method of claim 1, wherein,
Before the phoneme vector to be recognized is input to the trained target network model so that the target network model outputs the voice intention corresponding to the phoneme vector to be recognized, the method further comprises:
Acquiring sample voice and sample intention corresponding to the sample voice;
Determining a sample phoneme set according to the sample speech;
acquiring a sample phoneme vector corresponding to the sample phoneme set;
and inputting the sample phoneme vector and the sample intention to an initial network model, and training the initial network model through the sample phoneme vector and the sample intention to obtain the target network model.
3. The method of claim 2, wherein the sample phoneme set includes a plurality of sample phonemes, and the obtaining of a sample phoneme vector corresponding to the sample phoneme set comprises:
Determining a phoneme characteristic value corresponding to each sample phoneme aiming at each sample phoneme;
And acquiring a sample phoneme vector corresponding to the sample phoneme set based on the phoneme characteristic value corresponding to each sample phoneme, wherein the sample phoneme vector comprises the phoneme characteristic value corresponding to each sample phoneme.
4. A method of speech intent recognition, the method comprising:
Determining a pinyin set to be recognized according to the voice to be recognized; the pinyin set to be identified comprises a plurality of pinyin to be identified; the pinyin to be identified is determined according to the speech to be identified;
Determining a pinyin characteristic value corresponding to each pinyin to be identified; based on the pinyin characteristic value corresponding to each pinyin to be identified, acquiring a pinyin vector to be identified corresponding to the pinyin to be identified set, wherein the pinyin vector to be identified comprises the pinyin characteristic value corresponding to each pinyin to be identified;
inputting the pinyin vector to be recognized to a trained target network model, so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
5. The method of claim 4, wherein,
Before the pinyin vector to be recognized is input to the trained target network model so that the target network model outputs the voice intention corresponding to the pinyin vector to be recognized, the method further comprises:
Acquiring sample voice and sample intention corresponding to the sample voice;
determining a sample pinyin set according to the sample speech;
Acquiring a sample pinyin vector corresponding to the sample pinyin set;
And inputting the sample pinyin vector and the sample intention to an initial network model, and training the initial network model through the sample pinyin vector and the sample intention to obtain the target network model.
6. The method of claim 5, wherein the sample pinyin set includes a plurality of sample pinyins, and the obtaining of a sample pinyin vector corresponding to the sample pinyin set comprises:
determining a pinyin characteristic value corresponding to each sample pinyin;
And acquiring a sample pinyin vector corresponding to the sample pinyin set based on the pinyin characteristic value corresponding to each sample pinyin, wherein the sample pinyin vector comprises the pinyin characteristic value corresponding to each sample pinyin.
7. A voice intent recognition device, the device comprising:
The determining module is used for determining a phoneme set to be recognized according to the voice to be recognized; the phoneme set to be recognized comprises a plurality of phonemes to be recognized; the phonemes to be recognized are determined according to the voice to be recognized;
The acquisition module is used for determining a phoneme characteristic value corresponding to each phoneme to be identified; based on the phoneme characteristic value corresponding to each phoneme to be identified, obtaining a phoneme vector to be identified corresponding to the phoneme set to be identified, wherein the phoneme vector to be identified comprises a phoneme characteristic value corresponding to each phoneme to be identified;
The processing module is used for inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
8. A voice intent recognition device, the device comprising:
The determining module is used for determining a pinyin set to be recognized according to the voice to be recognized; the pinyin set to be identified comprises a plurality of pinyin to be identified; the pinyin to be identified is determined according to the speech to be identified;
The acquisition module is used for determining a pinyin characteristic value corresponding to each pinyin to be identified; based on the pinyin characteristic value corresponding to each pinyin to be identified, acquiring a pinyin vector to be identified corresponding to the pinyin to be identified set, wherein the pinyin vector to be identified comprises the pinyin characteristic value corresponding to each pinyin to be identified;
The processing module is used for inputting the pinyin vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
9. A voice intent recognition device, comprising: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is configured to execute machine-executable instructions to perform the steps of:
Determining a phoneme set to be recognized according to the voice to be recognized; the set of phonemes to be identified includes a plurality of phonemes to be identified; the phonemes to be recognized are determined according to the voice to be recognized;
Determining a phoneme characteristic value corresponding to each phoneme to be recognized; based on the phoneme characteristic value corresponding to each phoneme to be identified, obtaining a phoneme vector to be identified corresponding to the phoneme set to be identified, wherein the phoneme vector to be identified comprises a phoneme characteristic value corresponding to each phoneme to be identified;
Inputting the phoneme vector to be recognized into a trained target network model so that the target network model outputs a voice intention corresponding to the phoneme vector to be recognized;
The target network model is used for recording the mapping relation between the phoneme vector and the voice intention;
or determining a pinyin set to be recognized according to the voice to be recognized; the pinyin set to be identified comprises a plurality of pinyin to be identified; the pinyin to be identified is determined according to the speech to be identified;
Determining a pinyin characteristic value corresponding to each pinyin to be identified; based on the pinyin characteristic value corresponding to each pinyin to be identified, acquiring a pinyin vector to be identified corresponding to the pinyin to be identified set, wherein the pinyin vector to be identified comprises the pinyin characteristic value corresponding to each pinyin to be identified;
inputting the pinyin vector to be recognized to a trained target network model, so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
CN202010785605.1A 2020-08-06 2020-08-06 Voice intention recognition method, device and equipment Active CN111986653B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010785605.1A CN111986653B (en) 2020-08-06 2020-08-06 Voice intention recognition method, device and equipment
PCT/CN2021/110134 WO2022028378A1 (en) 2020-08-06 2021-08-02 Voice intention recognition method, apparatus and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010785605.1A CN111986653B (en) 2020-08-06 2020-08-06 Voice intention recognition method, device and equipment

Publications (2)

Publication Number Publication Date
CN111986653A CN111986653A (en) 2020-11-24
CN111986653B true CN111986653B (en) 2024-06-25

Family

ID=73444526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010785605.1A Active CN111986653B (en) 2020-08-06 2020-08-06 Voice intention recognition method, device and equipment

Country Status (2)

Country Link
CN (1) CN111986653B (en)
WO (1) WO2022028378A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986653B (en) * 2020-08-06 2024-06-25 杭州海康威视数字技术股份有限公司 Voice intention recognition method, device and equipment
CN113836945B (en) * 2021-09-23 2024-04-16 平安科技(深圳)有限公司 Intention recognition method, device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081219A (en) * 2020-01-19 2020-04-28 南京硅基智能科技有限公司 End-to-end voice intention recognition method

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08227410A (en) * 1994-12-22 1996-09-03 Just Syst Corp Learning method of neural network, neural network, and speech recognition device utilizing neural network
US6408271B1 (en) * 1999-09-24 2002-06-18 Nortel Networks Limited Method and apparatus for generating phrasal transcriptions
CN107357875B (en) * 2017-07-04 2021-09-10 北京奇艺世纪科技有限公司 Voice search method and device and electronic equipment
CN109754789B (en) * 2017-11-07 2021-06-08 北京国双科技有限公司 Method and device for recognizing voice phonemes
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN110767214A (en) * 2018-07-27 2020-02-07 杭州海康威视数字技术股份有限公司 Speech recognition method and device and speech recognition system
CN110808050B (en) * 2018-08-03 2024-04-30 蔚来(安徽)控股有限公司 Speech recognition method and intelligent device
CN110931000B (en) * 2018-09-20 2022-08-02 杭州海康威视数字技术股份有限公司 Method and device for speech recognition
CN109829153A (en) * 2019-01-04 2019-05-31 平安科技(深圳)有限公司 Intension recognizing method, device, equipment and medium based on convolutional neural networks
KR20200091738A (en) * 2019-01-23 2020-07-31 주식회사 케이티 Server, method and computer program for detecting keyword
CN110223680B (en) * 2019-05-21 2021-06-29 腾讯科技(深圳)有限公司 Voice processing method, voice recognition device, voice recognition system and electronic equipment
CN110349567B (en) * 2019-08-12 2022-09-13 腾讯科技(深圳)有限公司 Speech signal recognition method and device, storage medium and electronic device
KR102321798B1 (en) * 2019-08-15 2021-11-05 엘지전자 주식회사 Deeplearing method for voice recognition model and voice recognition device based on artifical neural network
CN110610707B (en) * 2019-09-20 2022-04-22 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium
CN110674314B (en) * 2019-09-27 2022-06-28 北京百度网讯科技有限公司 Sentence recognition method and device
CN111243603B (en) * 2020-01-09 2022-12-06 厦门快商通科技股份有限公司 Voiceprint recognition method, system, mobile terminal and storage medium
CN111274797A (en) * 2020-01-13 2020-06-12 平安国际智慧城市科技股份有限公司 Intention recognition method, device and equipment for terminal and storage medium
CN111986653B (en) * 2020-08-06 2024-06-25 杭州海康威视数字技术股份有限公司 Voice intention recognition method, device and equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081219A (en) * 2020-01-19 2020-04-28 南京硅基智能科技有限公司 End-to-end voice intention recognition method

Also Published As

Publication number Publication date
CN111986653A (en) 2020-11-24
WO2022028378A1 (en) 2022-02-10

Similar Documents

Publication Publication Date Title
US10032463B1 (en) Speech processing with learned representation of user interaction history
CN107134279B (en) Voice awakening method, device, terminal and storage medium
Zeng et al. Effective combination of DenseNet and BiLSTM for keyword spotting
Sak et al. Fast and accurate recurrent neural network acoustic models for speech recognition
US9842585B2 (en) Multilingual deep neural network
CN107358951A (en) A kind of voice awakening method, device and electronic equipment
US9653093B1 (en) Generative modeling of speech using neural networks
WO2017099936A1 (en) System and methods for adapting neural network acoustic models
EP3948852A1 (en) Contextual biasing for speech recognition
CN108711421A (en) A kind of voice recognition acoustic model method for building up and device and electronic equipment
CN109754789B (en) Method and device for recognizing voice phonemes
CN109523014B (en) News comment automatic generation method and system based on generative confrontation network model
CN111986653B (en) Voice intention recognition method, device and equipment
CN111179917B (en) Speech recognition model training method, system, mobile terminal and storage medium
US20170228643A1 (en) Augmenting Neural Networks With Hierarchical External Memory
CN114830139A (en) Training models using model-provided candidate actions
CN110069611B (en) Topic-enhanced chat robot reply generation method and device
US20230096805A1 (en) Contrastive Siamese Network for Semi-supervised Speech Recognition
WO2021135457A1 (en) Recurrent neural network-based emotion recognition method, apparatus, and storage medium
CN112509560B (en) Voice recognition self-adaption method and system based on cache language model
CN109637527A (en) The semantic analytic method and system of conversation sentence
US20220399013A1 (en) Response method, terminal, and storage medium
CN113590078A (en) Virtual image synthesis method and device, computing equipment and storage medium
KR102120751B1 (en) Method and computer readable recording medium for providing answers based on hybrid hierarchical conversation flow model with conversation management model using machine learning
Chen et al. Sequence-to-sequence modelling for categorical speech emotion recognition using recurrent neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant