CN116978362A - Training and predicting method, device, equipment and storage medium for slot prediction model - Google Patents

Training and predicting method, device, equipment and storage medium for slot prediction model Download PDF

Info

Publication number
CN116978362A
CN116978362A CN202310033015.7A
Authority
CN
China
Prior art keywords
audio
feature
initial
prediction
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310033015.7A
Other languages
Chinese (zh)
Inventor
林炳怀
王丽园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310033015.7A priority Critical patent/CN116978362A/en
Publication of CN116978362A publication Critical patent/CN116978362A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the field of computers, in particular to the field of artificial intelligence, and provides a training and prediction method, apparatus, device and storage medium for a slot prediction model. In each training iteration, based on the relationship between prosodic information in the sample audio and slot prediction on the sample text, a prosody prediction task is used as an auxiliary training task: prosodic information in the speech is mined, and the effective prosodic information is explicitly extracted to assist the training of the slot prediction task in the field of spoken language understanding. This effectively improves the performance of the slot prediction model on spoken language understanding, and at the same time, for samples from a new domain, the training method provided by the embodiment of the application also enables the model to achieve a good generalization effect.

Description

Training and predicting method, device, equipment and storage medium for slot prediction model
Technical Field
The application relates to the field of computers, in particular to the field of artificial intelligence, and provides a training and predicting method, device and equipment for a slot prediction model and a storage medium.
Background
Spoken language understanding (Spoken Language Understanding, SLU) is a core component of a human-machine spoken dialogue system, and its performance has a decisive impact on the system. At present, a deep neural network based on text modeling is often used to perform the slot prediction task. The specific process is as follows: text representations are extracted from the text converted from the audio, and then slot prediction is performed based on the extracted text representations to obtain a slot prediction result used for generating an operation instruction.
However, in the related art, the slot prediction model performs slot prediction only on the text and does not have the capability of mining prosodic information in the audio, so effective pronunciation characteristics cannot be extracted. As a result, the human-machine spoken dialogue system cannot correctly understand the semantic information that the sentence itself intends to express, and thus cannot provide the corresponding service.
In view of this, the embodiment of the application provides a new training method for a slot prediction model.
Disclosure of Invention
The embodiment of the application provides a training and prediction method, apparatus, device and storage medium for a slot prediction model, which are used to solve the problem of low model prediction accuracy caused by the model's inability to mine prosodic information in audio.
In a first aspect, an embodiment of the present application provides a training method for a slot prediction model, including:
training the slot prediction model to be trained based on a plurality of sample pairs in a training set in a loop-iteration manner until a trained target slot prediction model is output, wherein each iteration comprises the following steps:
extracting features of each audio frame contained in sample audio in a read sample pair to obtain corresponding initial audio features, and extracting features of each word contained in sample text in the sample pair to obtain corresponding initial text features;
for each obtained initial text feature, the following operations are respectively executed: performing attention interaction on each obtained initial audio feature and one initial text feature to obtain a weighted audio feature representing the attention degree of each initial audio feature to the one initial text feature;
based on each weighted audio feature and each initial text feature, obtaining a multi-modal fusion feature;
and performing slot prediction and prosody prediction on the one sample pair respectively based on the multi-modal fusion feature, and adjusting model parameters of the slot prediction model based on the obtained slot prediction result and prosody prediction result.
In a second aspect, an embodiment of the present application further provides a method for predicting a slot, including:
extracting features of each audio frame contained in the acquired audio information to obtain corresponding initial audio features, and extracting features of each word contained in text information identified based on the audio information to obtain corresponding initial text features;
for each obtained initial text feature, the following operations are respectively executed: performing attention interaction on each obtained initial audio feature and one initial text feature to obtain a weighted audio feature representing the attention degree of each initial audio feature to the one initial text feature;
based on each weighted audio feature and each initial text feature, a multi-mode fusion feature is obtained, and a slot prediction result of the audio information is obtained by carrying out slot prediction on the multi-mode fusion feature.
In a third aspect, an embodiment of the present application further provides a training device for a slot prediction model, including:
the model training unit is used for training the slot prediction model to be trained based on a plurality of sample pairs in a training set in a loop-iteration manner until a trained target slot prediction model is output, wherein each iteration comprises:
The coding unit is used for extracting the characteristics of each audio frame contained in the sample audio in the read sample pair to obtain corresponding initial audio characteristics, and extracting the characteristics of each word contained in the sample text in the sample pair to obtain corresponding initial text characteristics;
an attention unit for performing the following operations for each of the obtained initial text features, respectively: performing attention interaction on each obtained initial audio feature and one initial text feature to obtain a weighted audio feature representing the attention degree of each initial audio feature to the one initial text feature;
the fusion unit is used for obtaining multi-mode fusion characteristics based on each weighted audio characteristic and each initial text characteristic;
and the parameter adjustment unit is used for performing slot prediction and prosody prediction on the one sample pair respectively based on the multi-modal fusion feature, and adjusting model parameters of the slot prediction model based on the obtained slot prediction result and prosody prediction result.
Optionally, the parameter adjusting unit is configured to:
determining the slot classification loss incurred by the slot prediction model in the training process based on the obtained slot prediction result and the corresponding actual slot result;
determining the prosody prediction loss incurred by the slot prediction model in the training process based on the obtained prosody prediction result and the corresponding actual prosody result;
and determining the total model loss incurred in the training process of the model based on the slot classification loss and the prosody prediction loss, and adjusting the model parameters of the slot prediction model based on the total model loss.
Optionally, the fusion unit performs any one of the following modes:
splicing the weighted audio features with the corresponding initial text features in sequence to obtain multi-mode fusion features fused with the audio features and the text features;
and splicing the weighted audio feature set containing the weighted audio features with the initial text feature set containing the initial text features to obtain the multi-mode fusion feature fusing the audio features and the text features.
Optionally, the encoding unit is configured to:
extracting the characteristics of each audio frame contained in the sample audio in the read sample pair to obtain corresponding low-dimensional audio characteristics;
for each obtained low-dimensional audio feature, the following operations are performed: performing attention interaction on each obtained low-dimensional audio feature and one low-dimensional audio feature to obtain a contextual audio feature representing the attention degree of each low-dimensional audio feature to the one low-dimensional audio feature;
And carrying out linear processing on the contextual audio features of each audio frame to obtain corresponding initial audio features.
Optionally, the encoding unit is configured to:
multiplying each obtained low-dimensional audio feature by the one low-dimensional audio feature respectively to obtain the context weight of each low-dimensional audio feature to the one low-dimensional audio feature;
and carrying out weighted summation on each low-dimensional audio feature and the corresponding context weight to obtain the context audio feature which characterizes the attention degree of each low-dimensional audio feature to the one low-dimensional audio feature.
In a fourth aspect, an embodiment of the present application further provides a slot prediction apparatus, including:
the encoding unit is used for extracting the characteristics of each audio frame contained in the acquired audio information to obtain corresponding initial audio characteristics, and extracting the characteristics of each word contained in the text information identified based on the audio information to obtain corresponding initial text characteristics;
an attention unit for performing the following operations for each of the obtained initial text features, respectively: performing attention interaction on each obtained initial audio feature and one initial text feature to obtain a weighted audio feature representing the attention degree of each initial audio feature to the one initial text feature;
The slot prediction unit is used for obtaining a multi-modal fusion feature based on each weighted audio feature and each initial text feature, and performing slot prediction on the multi-modal fusion feature to obtain a slot prediction result of the audio information.
In a fifth aspect, an embodiment of the present application further provides a computer device, including a processor and a memory, where the memory stores program code, and when the program code is executed by the processor, causes the processor to execute any one of the training method and the slot prediction method of the slot prediction model.
In a sixth aspect, embodiments of the present application also provide a computer readable storage medium comprising program code which, when run on a computer device, causes the computer device to perform the steps of any one of the above training methods of the slot prediction model and the slot prediction method.
In a seventh aspect, an embodiment of the present application further provides a computer program product, including computer instructions, where the computer instructions are executed by a processor to perform any one of the training method and the slot prediction method of the slot prediction model.
The application has the following beneficial effects:
the embodiment of the application provides a training and predicting method, a device, equipment and a storage medium of a slot prediction model, wherein the method comprises the following steps: and performing cyclic iterative training on the to-be-trained slot prediction model by using a plurality of sample pairs in the training set until a trained target slot prediction model is output, wherein each iteration comprises:
extracting features of each audio frame contained in sample audio in a read sample pair to obtain corresponding initial audio features, and extracting features of each word contained in sample text in the sample pair to obtain corresponding initial text features;
for each obtained initial text feature, the following operations are respectively executed: performing attention interaction on each obtained initial audio feature and an initial text feature to obtain weighted audio features representing the attention degree of each initial audio feature to the initial text feature;
based on each weighted audio feature and each initial text feature, obtaining a multi-modal fusion feature;
based on the multi-modal fusion feature, slot prediction and prosody prediction are performed on the sample pair respectively, and model parameters of the slot prediction model are adjusted based on the obtained slot prediction result and prosody prediction result.
For the slot prediction task, the application considers the relationship between prosodic information in speech and slot prediction, and proposes a method for fusing prosodic information into the spoken language understanding task.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic diagram of a slot prediction result output by a conventional slot prediction model;
FIG. 2 is an alternative schematic diagram of an application scenario in an embodiment of the present application;
FIG. 3A is a schematic diagram of a structure of a slot prediction model in the model training stage according to an embodiment of the present application;
FIG. 3B is a schematic diagram of the wav2vec2.0 model in the model training stage according to the embodiment of the present application;
FIG. 3C is a schematic structural diagram of a CPC model according to an embodiment of the present application;
FIG. 3D is a schematic flow chart of a training slot prediction model according to an embodiment of the present application;
FIG. 3E is a logic diagram of a training slot prediction model according to an embodiment of the present application;
FIG. 3F is a schematic diagram of a model structure of wav2vec2.0 in a model application stage according to an embodiment of the present application;
FIG. 3G is a schematic flow chart of the embodiment of the application using wav2vec2.0 extracted features;
FIG. 3H is a schematic diagram of a logic diagram for generating an attention matrix according to an embodiment of the present application;
FIG. 3I is a schematic diagram of a logic diagram for generating an attention matrix according to an embodiment of the present application;
FIG. 3J is a schematic diagram of logic for generating a contextual audio feature according to an embodiment of the present application;
FIG. 3K is a logic diagram of a splicing weighted audio feature and an initial text feature according to an embodiment of the present application;
FIG. 3L is a logic diagram of another exemplary embodiment of a stitching weighted audio feature and an initial text feature;
FIG. 3M is a schematic diagram illustrating a relationship between slot prediction and reread tags according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a process for performing a slot prediction using a target slot prediction model according to an embodiment of the present application;
FIG. 5A is a schematic flow chart of semantic parsing performed by a voice assistant client according to an embodiment of the present application;
FIG. 5B is a schematic diagram illustrating the semantic parsing performed by a voice assistant client according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a training device for a slot prediction model according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a slot prediction apparatus according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a hardware configuration of a computer device to which embodiments of the present application are applied;
fig. 9 is a schematic diagram of a hardware composition structure of another computer device to which the embodiment of the present application is applied.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the technical solutions of the present application, but not all embodiments. All other embodiments, based on the embodiments described in the present document, which can be obtained by a person skilled in the art without any creative effort, are within the scope of protection of the technical solutions of the present application.
Some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
1. Artificial intelligence (Artificial Intelligence, AI):
artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, electromechanical integration, and the like; the artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
With the research and progress of artificial intelligence technology, artificial intelligence has been studied and applied in many fields, such as smart homes, intelligent customer service, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, robots, and smart healthcare. It is believed that with the development of technology, artificial intelligence will be applied in more fields and deliver increasingly important value.
2. Machine learning:
machine learning is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance.
Machine learning is the core of artificial intelligence and is the fundamental way for computers to have intelligence, and is applied in various fields of artificial intelligence, including deep learning, reinforcement learning, transfer learning, induction learning, teaching learning and other technologies.
3. Speech Technology: the key technologies of speech technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice is becoming one of the most promising modes of human-computer interaction.
4. Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use daily, so it is closely related to linguistic research. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
5. Pre-training is a concept from transfer learning and refers to the process of first training a model on a large amount of data. Fine-tuning refers to the process of applying the pre-trained model to a training set with a small number of training samples and adapting its parameters to that training set.
In fact, in the field of convolutional neural networks, constructing training sets on the order of tens of millions of samples is very difficult. Because the training set itself contains only a small number of training samples, overfitting easily occurs during training if high model performance is pursued.
In order to solve the overfitting problem, a model is usually pre-trained on a large training set, and its model parameters are then fine-tuned with the small training set. This saves training time and computing resources, achieves a better training effect, and gives the model good usability.
6. Bidirectional Encoder Representation from Transformers (BERT) is a pre-trained language representation model. It emphasizes that pre-training is no longer performed, as before, with a conventional unidirectional language model or by shallowly concatenating two unidirectional language models, but with a new masked language model (MLM), so that deep bidirectional language representations can be generated.
Among other things, the BERT model has several advantages:
(1) BERT is a pre-trained model: when it is used in a specific scenario, there is no need to train the model with a large corpus, which saves training time and improves training efficiency and the generalization capability of the model;
(2) BERT is an end-to-end model: the network structure does not need to be adjusted, and only an output layer for the specific downstream task needs to be added at the end;
(3) BERT is built on internal Transformer modules, so it can be parallelized quickly, the network depth can be increased, the capability of deep neural network models can be fully exploited, and model accuracy is improved;
(4) BERT is a bidirectional model trained with both left and right context, and it performs better than other pre-trained models such as ELMo and GPT.
7. The attention mechanism stems from the study of human vision. In cognitive science, due to bottlenecks in information processing, humans selectively focus on a portion of all available information while ignoring the rest. This mechanism is commonly referred to as the attention mechanism. Different parts of the human retina have different information processing capabilities, i.e. acuity, and only the fovea has the strongest acuity. In order to make reasonable use of limited visual information processing resources, a human needs to select a specific part of the visual field and concentrate on it. For example, when reading, people typically attend to and process only a small number of the words to be read. In summary, the attention mechanism has two main aspects: deciding which part of the input needs to be focused on, and allocating the limited information processing resources to the important parts.
The so-called attention mechanism is to calculate the attention weight of each audio frame in the encoding process of the sample audio through some operation, and calculate the implicit vector representation of the whole audio through a weighted summation mode.
However, when audio information of a current audio frame is encoded using a self-attention mechanism, attention is excessively focused on its own position, and thus a multi-head attention mechanism is proposed to solve this problem. The multi-head attention mechanism is to build different projection information in a plurality of different projection spaces, perform different projections on input matrixes, obtain a plurality of output matrixes, and splice the output matrixes together.
The following briefly describes the design concept of the embodiment of the present application:
the SLU is used as a core component of a man-machine spoken dialogue system, and the performance quality of the SLU has a decisive influence on the system.
The SLU includes a speech recognition task and a slot prediction task. The speech recognition task recognizes the audio data as corresponding text content, and the slot prediction task predicts the slot position (such as no slot O, slot starting position B, slot middle position I, and the like) and the slot type (such as person name, object name, color, house, and the like) of each word in the text content.
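As a purely illustrative aid (the sentence and tags below are hypothetical examples, not taken from the application), the combination of slot position and slot type can be sketched as follows:

```python
# Illustration of BIO-style slot labels: slot position (O = no slot,
# B = slot start, I = slot middle) combined with a slot type.
# The sentence and tags are hypothetical examples.
words = ["hello", "play", "remember", "me"]
slot_labels = ["O", "B-action", "B-name", "I-name"]

for word, label in zip(words, slot_labels):
    if label == "O":
        print(f"{word:>10s} -> no slot")
    else:
        position, slot_type = label.split("-")
        print(f"{word:>10s} -> position={position}, type={slot_type}")
```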
At present, a deep neural network based on text modeling is often used for performing a slot prediction task, and the specific process is as follows: text representations are extracted from the text of the audio conversion, and then, based on the extracted text representations, slot prediction is performed to obtain a slot prediction result for generating an operation instruction.
However, in the related art, the slot prediction model only performs slot prediction on the text, and does not have the capability of mining prosodic information in audio, so that effective pronunciation characteristics cannot be extracted, and the prediction performance of the model is affected.
For example: and carrying out slot prediction on the sentence 'Xiaoming Tianshang variety program' to obtain a slot prediction result shown in fig. 1. The method comprises the steps of judging that a slot position prediction result of a bright word in the tomorrow is I-name, wherein the slot position of the bright word is represented as a slot position middle position, and the slot position type is a name type. Because the position and the type of the slot are predicted incorrectly, the man-machine spoken dialog system cannot understand the semantic information which the sentence itself wants to express correctly, and thus cannot provide corresponding services.
In view of this, the embodiment of the application provides a training method, a training device, training equipment and a storage medium of a slot prediction model, wherein the method comprises the following steps:
And performing cyclic iterative training on the to-be-trained slot prediction model by using a plurality of sample pairs in the training set until a trained target slot prediction model is output, wherein each iteration comprises:
extracting features of each audio frame contained in sample audio in a read sample pair to obtain corresponding initial audio features, and extracting features of each word contained in sample text in the sample pair to obtain corresponding initial text features;
for each obtained initial text feature, the following operations are respectively executed: performing attention interaction on each obtained initial audio feature and an initial text feature to obtain weighted audio features representing the attention degree of each initial audio feature to the initial text feature;
based on each weighted audio feature and each initial text feature, obtaining a multi-modal fusion feature;
based on the multi-mode fusion characteristics, the sample pair is subjected to groove position prediction and rhythm prediction respectively, and model parameters of a groove position prediction model are adjusted based on the obtained groove position prediction result and rhythm prediction result.
For the slot prediction task, the application considers the relationship between prosodic information in speech and slot prediction, and proposes a method for fusing prosodic information into the spoken language understanding task.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and not for limitation of the present application, and embodiments of the present application and features of the embodiments may be combined with each other without conflict.
The embodiment of the application can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent transportation, auxiliary driving and the like.
Fig. 2 shows one application scenario in which two terminal devices 210 are included with one server 230, and each terminal device establishes a communication connection with the server 230 through a wired network or a wireless network.
The terminal device 210 in the embodiment of the present application is a user terminal used by a user. User terminals include, but are not limited to, cell phones, computers, intelligent voice interaction devices, intelligent home appliances, vehicle terminals, aircraft, and the like.
The server 230 in the embodiment of the present application may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and an artificial intelligent platform.
The terminal device 210 has a voice control client 220 installed therein, which controls downstream applications to perform corresponding operations through voice commands input by the user. A speech recognition model and a slot prediction model are deployed in the server 230: the speech recognition model recognizes the audio data in a voice command as text content, and the slot prediction model performs the slot prediction task, i.e., predicts the slot position (e.g., no slot O, slot starting position B, slot middle position I, etc.) and the slot type (e.g., person name, object name, color, house, etc.) of each word in the text content. The specific process of voice control is as follows:
the terminal device 210 initiates the local voice control client 220 in response to a user-triggered wake-up instruction. The voice control client 220 collects voice commands input by the user by calling an audio collector of the terminal device 210, and transmits the collected voice commands to the server 230 through a communication channel established between the terminal device 210 and the server 230.
The server 230 inputs the voice command into the voice recognition model to perform voice recognition, obtains corresponding text content, and inputs the text content into the slot prediction model to perform a slot prediction task, so as to predict the slot position and the slot type of each word in the text content.
The voice control client 220 generates a corresponding control instruction based on the slot prediction result fed back by the server, and sends the control instruction to the downstream application program to control the downstream application program to execute a corresponding operation.
However, in the related art, the slot prediction model performs slot prediction only on the text, does not have the capability of mining prosodic information in audio, and cannot extract effective pronunciation characteristics, which affects the prediction performance of the model. To solve the problem of low model performance caused by the inability to mine prosodic information in audio, the embodiment of the present application provides a new training method for a slot prediction model: a prosody prediction task is constructed to deeply mine prosodic information in the audio and obtain a corresponding prosody prediction result, and the prosody prediction task is then used to assist the training of the slot prediction task, so as to improve model performance.
Thus, in the model training phase, as shown in FIG. 3A, the slot prediction model includes the following: the audio encoder is used for extracting features of sample audio, the text encoder is used for extracting features of sample text, the attention mechanism module is used for mining the context relation between the audio and the text, the slot position prediction module is used for predicting the slot position and the slot type of each word in text content, and the prosody prediction module is used for mining prosody information in the audio.
The audio encoder is a self-supervision pre-training acoustic model based on a large amount of unlabeled data, such as vq-wav2vec, wav2vec2.0 and other acoustic models. By inputting the original audio data into an audio encoder, an audio representation of each audio frame can be obtained.
Compared with other acoustic models, wav2vec2.0 can itself act as an ASR model, so its audio features do not need to be fed into a separate downstream ASR model for training; wav2vec2.0 provides an end-to-end model architecture obtained by combining the Gumbel-softmax quantization module of vq-wav2vec with BERT and then further pre-training and fine-tuning. Therefore, the embodiment of the present application adopts wav2vec2.0 as a preferred implementation.
As shown in FIGS. 3B-3C, the network structure of wav2vec2.0 is very similar to that of Contrastive Predictive Coding (CPC): both consist of an encoder and an autoregressive network, and the input of both is a one-dimensional audio signal. The differences are: wav2vec2.0 replaces the recurrent neural network (Recurrent Neural Network, RNN) with a Transformer, upgrading the unidirectional information transmission mechanism of the original RNN to a global attention mechanism, so that the output of the model at time t+k depends not only on historical information but also on global information; meanwhile, wav2vec2.0 introduces a product quantization operation that collapses the infinite feature representation space into a finite discrete space, making the features more robust and insensitive to small perturbations.
The text encoder is also a self-supervising pre-trained text model based on a large amount of unlabeled data, whereby word representations of each word in the text content are obtained by inputting the text content into the text encoder. In an alternative embodiment, the BERT is used as a text encoder to extract characteristics of the input text content.
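As a concrete illustration of the two encoders, the following sketch extracts frame-level audio features with a pre-trained wav2vec2.0 checkpoint and word-level text features with a pre-trained BERT checkpoint using the HuggingFace transformers library; the library and the specific checkpoint names are assumptions made for illustration and are not specified by the application:

```python
import torch
from transformers import (BertModel, BertTokenizer,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

# Assumed public checkpoints; the application does not name specific ones.
audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
audio_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
text_encoder = BertModel.from_pretrained("bert-base-chinese")
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

waveform = torch.randn(16000)  # one second of dummy 16 kHz audio standing in for sample audio
audio_inputs = audio_fe(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # Initial audio features: one vector per audio frame.
    initial_audio = audio_encoder(**audio_inputs).last_hidden_state   # (1, N_frames, 768)

text_inputs = tokenizer("公司组织了一次爬山活动", return_tensors="pt")
with torch.no_grad():
    # Initial text features: one vector per token.
    initial_text = text_encoder(**text_inputs).last_hidden_state      # (1, M_tokens, 768)

print(initial_audio.shape, initial_text.shape)
```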
Next, please refer to the flow chart shown in fig. 3D and the logic diagram shown in fig. 3E, and further describe a training method of the new slot detection model according to the embodiment of the present application.
S301: a sample pair is read from the training set.
Before training can begin, a training set needs to be constructed. The specific construction process is as follows:
Firstly, a large amount of text content and audio data are obtained from the network, public data sets, and the like; invalid data containing many filler words and interjections, as well as repeated redundant data, are removed through data cleaning, yielding a plurality of sample texts and a plurality of sample audios.
For each sample text, the following operations are performed: a sample is recited manually, corresponding sample audio is generated, and the sample text and the generated sample audio are taken as a sample pair.
However, manual recitation is time-consuming and labor-consuming, and is difficult to meet the audio generation requirement of a large-scale text, and in order to improve the audio generation efficiency, an AI-based dubbing model can be adopted to perform AI dubbing on each input sample text, so that corresponding sample audio is obtained.
For each sample audio, the following operations are performed: the sample audio is manually transcribed, a corresponding sample text is generated, and the sample audio and the corresponding sample text are taken as a sample pair.
Similarly, manual recognition is time-consuming and labor-consuming, accuracy is difficult to guarantee, and in order to improve processing efficiency, a voice recognition model based on an AI technology can be adopted to recognize each input sample audio frequency, so that a corresponding sample text is obtained.
And finally, forming a training set for training the slot prediction model by using the obtained plurality of sample pairs.
S302: and extracting the characteristics of each audio frame contained in the sample audio in the read sample pair to obtain corresponding initial audio characteristics, and extracting the characteristics of each word contained in the sample text in the sample pair to obtain corresponding initial text characteristics.
The model structure of the audio encoder wav2vec2.0 used in the embodiment of the present application is shown in FIG. 3F. It consists of a feature encoding layer for extracting low-dimensional audio features and a Transformer layer for mining the context relationship between the low-dimensional audio features.
The feature encoding layer is composed of multiple convolutional neural networks, each of which comprises a temporal convolution, a layer normalization, and an activation function. The Transformer layer mainly consists of the attention mechanism and a feed-forward neural network.
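A minimal sketch of one convolutional block of the feature encoding layer is given below; the kernel size, stride, channel count, and choice of GELU activation are illustrative assumptions rather than values specified by the application:

```python
import torch
from torch import nn

class ConvFeatureBlock(nn.Module):
    """One block of the feature encoding layer: temporal convolution,
    layer normalization, and an activation function (GELU assumed)."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 10, stride: int = 5):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=kernel, stride=stride)
        self.norm = nn.LayerNorm(out_ch)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) raw or partially encoded audio
        x = self.conv(x)
        x = x.transpose(1, 2)       # normalize over the channel dimension
        x = self.act(self.norm(x))
        return x.transpose(1, 2)    # back to (batch, channels, time)

block = ConvFeatureBlock(in_ch=1, out_ch=512)
low_dim = block(torch.randn(1, 1, 16000))  # low-dimensional audio features per frame
print(low_dim.shape)
```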
As shown in fig. 3G, the process of extracting the initial audio features using wav2vec2.0 is as follows:
S3021: and extracting the characteristics of each audio frame contained in the sample audio in the read sample pair to obtain the corresponding low-dimensional audio characteristics.
S3022: for each obtained low-dimensional audio feature, the following operations are performed: and performing attention interaction on each obtained low-dimensional audio feature and one low-dimensional audio feature to obtain a context audio feature which characterizes the attention degree of each low-dimensional audio feature to the low-dimensional audio feature.
The output of the feature encoding layer is fed into the Transformer layer, through which attention interaction operations are performed separately for each obtained low-dimensional audio feature to obtain the corresponding contextual audio feature.
To learn richer representations, the attention mechanism introduces three weight matrices: a query weight W_Q, a key weight W_K, and a value weight W_V. As shown in FIG. 3H, the low-dimensional audio feature set is multiplied by these three weight matrices respectively to generate a query matrix (Query, Q), a key matrix (Key, K), and a value matrix (Value, V).
The contextual audio features of the respective low-dimensional audio features are computed using the attention mechanism formula Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where d_k is the vector dimension of Q and K, and dividing by sqrt(d_k) is a scaling operation that brings the attention matrix closer to a standard normal distribution.
The process of obtaining a contextual audio feature is as follows: multiplying each obtained low-dimensional audio feature by one low-dimensional audio feature respectively to obtain the context weight of each low-dimensional audio feature to the low-dimensional audio feature; and then carrying out weighted summation on each low-dimensional audio feature and the corresponding context weight to obtain the context audio feature for representing the attention degree of each low-dimensional audio feature to the low-dimensional audio feature.
Specifically, as shown in FIG. 3I, the low-dimensional audio feature of the first audio frame (row c1 of Q) is multiplied by the low-dimensional audio feature of the first audio frame (column c1 of the transposed matrix K^T) to obtain the context weight of the first audio frame's low-dimensional audio feature with respect to itself.
Then, the low-dimensional audio feature of the first audio frame (row c1 of Q) is multiplied in turn by the low-dimensional audio features of the 2nd to 6th audio frames (columns c2 to c6 of the transposed matrix K^T) to obtain the corresponding context weights, which form the first row of the attention matrix. The context weights in the first row of the attention matrix indicate which low-dimensional audio features are highly correlated with the low-dimensional audio feature of the first audio frame.
As shown in FIG. 3J, each row of V is the mathematical representation of one low-dimensional audio feature. Therefore, the low-dimensional audio features (columns c1 to c3 of V) are weighted and summed with the corresponding context weights (the first row of the attention matrix), so that the contextual audio feature of the first low-dimensional audio feature contains audio information from all audio frames.
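The computation illustrated in FIGS. 3H-3J corresponds to the following sketch of scaled dot-product self-attention over the low-dimensional audio features (matrix sizes are illustrative assumptions):

```python
import math
import torch

def self_attention(low_dim_audio: torch.Tensor,
                   W_Q: torch.Tensor, W_K: torch.Tensor, W_V: torch.Tensor) -> torch.Tensor:
    """Contextual audio features: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    Q = low_dim_audio @ W_Q                      # query matrix
    K = low_dim_audio @ W_K                      # key matrix
    V = low_dim_audio @ W_V                      # value matrix
    d_k = Q.size(-1)
    # Each row of the attention matrix holds the context weights of one audio
    # frame with respect to every audio frame (e.g. row 1 for the first frame).
    attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    # Weighted summation over V: every contextual audio feature mixes
    # information from all audio frames.
    return attn @ V

frames, dim = 6, 512                              # illustrative sizes
feats = torch.randn(frames, dim)                  # low-dimensional audio features
W_Q, W_K, W_V = (torch.randn(dim, dim) * dim ** -0.5 for _ in range(3))
contextual = self_attention(feats, W_Q, W_K, W_V)
print(contextual.shape)                           # (6, 512)
```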
S3023: and carrying out linear processing on the contextual audio features of each audio frame to obtain corresponding initial audio features.
And inputting the obtained contextual audio features into a feedforward neural network, and obtaining corresponding initial audio features through linear processing of the network.
After the operation of the audio encoder is known, the operation of the text encoder is continued.
The sample text is input into the text encoder, and the encoder extracts features to obtain the initial text feature of each word in the sample text. Here, "a word" refers to a word composed of at least one character, such as "mountain" or "water".
For example, the sentence "the company organized a mountain-climbing activity" is input into the text encoder, which outputs the initial text features shown in Table 1.
TABLE 1
company: [0 1 1 1 0 1]
organized: [1 1 1 1 0 1]
le (了, particle): [0 0 1 1 0 1]
once: [0 1 0 1 0 1]
mountain climbing: [0 1 0 1 1 1]
activity: [0 1 1 0 0 1]
S303: for each obtained initial text feature, the following operations are respectively executed: and performing attention interaction on each obtained initial audio feature and one initial text feature to obtain weighted audio features which characterize the attention degree of each initial audio feature to the initial text feature.
In step 303, the initial text feature set is multiplied by the query weights to generate a query matrix, and the initial audio feature set is multiplied by two other weights to generate a key matrix and a value matrix, respectively.
The process of obtaining a weighted audio feature is as follows: multiplying each obtained initial audio feature by an initial text feature to obtain the attention weight of each initial audio feature to the initial text feature; and then carrying out weighted summation on each initial audio feature and the corresponding attention weight to obtain weighted audio features representing the attention degree of each initial audio feature to the initial text feature.
For example, let the initial text feature set of the sample text be {t_1, t_2, ..., t_M} and the initial audio feature set be {a_1, a_2, ..., a_N}. Attention interaction is performed based on the attention mechanism, so that the whole-sentence text is finally represented by the weighted audio features {h_1, h_2, ..., h_M}, where the j-th weighted audio feature is computed as h_j = sum_i alpha_ij * a_i, with alpha_ij = softmax_i(t_j · a_i) denoting the attention weight of the i-th initial audio feature with respect to the j-th initial text feature.
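A sketch of the attention interaction in step S303, under the assumption that queries are derived from the initial text features and keys/values from the initial audio features, with illustrative dimensions:

```python
import math
import torch

def text_audio_attention(text_feats: torch.Tensor, audio_feats: torch.Tensor,
                         W_Q: torch.Tensor, W_K: torch.Tensor, W_V: torch.Tensor) -> torch.Tensor:
    """For each initial text feature, compute a weighted audio feature that
    characterizes how much attention each initial audio feature pays to it."""
    Q = text_feats @ W_Q          # (M_words, d) queries from the text modality
    K = audio_feats @ W_K         # (N_frames, d) keys from the audio modality
    V = audio_feats @ W_V         # (N_frames, d) values from the audio modality
    attn = torch.softmax(Q @ K.T / math.sqrt(Q.size(-1)), dim=-1)  # (M, N) attention weights
    return attn @ V               # (M_words, d) one weighted audio feature per word

M, N, d = 8, 50, 768              # illustrative: 8 words, 50 audio frames
weighted_audio = text_audio_attention(
    torch.randn(M, d), torch.randn(N, d),
    *(torch.randn(d, d) * d ** -0.5 for _ in range(3)))
print(weighted_audio.shape)       # torch.Size([8, 768])
```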
S304: based on each weighted audio feature and each initial text feature, a multimodal fusion feature is obtained.
In order to obtain a multi-modal fusion feature fused with audio features and text features, the embodiment of the application provides the following two splicing modes:
(1) As shown in fig. 3K, each weighted audio feature is spliced with each corresponding initial text feature in sequence to obtain a multi-mode fusion feature fused with the audio feature and the text feature;
(2) As shown in fig. 3L, the weighted audio feature set including each weighted audio feature is spliced with the initial text feature set including each initial text feature to obtain a multi-modal fusion feature in which the audio feature and the text feature are fused.
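Both splicing modes reduce to tensor concatenation; a minimal sketch with assumed feature sizes:

```python
import torch

M, d_a, d_t = 8, 768, 768                 # illustrative: 8 words, audio/text feature sizes
weighted_audio = torch.randn(M, d_a)      # one weighted audio feature per word
initial_text = torch.randn(M, d_t)        # one initial text feature per word

# Mode (1): splice each weighted audio feature with its corresponding initial
# text feature, giving one fused vector per word.
fused_per_word = torch.cat([weighted_audio, initial_text], dim=-1)   # (M, d_a + d_t)

# Mode (2): splice the whole weighted audio feature set with the whole initial
# text feature set along the sequence dimension.
fused_sequence = torch.cat([weighted_audio, initial_text], dim=0)    # (2*M, d)

print(fused_per_word.shape, fused_sequence.shape)
```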
S305: based on the multi-mode fusion characteristics, the sample pair is subjected to groove position prediction and rhythm prediction respectively, and model parameters of a groove position prediction model are adjusted based on the obtained groove position prediction result and rhythm prediction result.
FIG. 3M shows the relationship between slot prediction and re-reading (stress) labels; it can be seen that the slots B-room and I-room are re-read (stressed). Considering the relationship between the two, and in order to explicitly mine prosodic information from the speech, a prosody prediction task is constructed on top of the attention mechanism to assist the training of the slot prediction task, so as to improve model performance.
Therefore, in step 305, the multi-modal fusion feature is first subjected to slot prediction, and a slot prediction result of the sample text in the sample pair is determined, where the slot prediction result includes slot prediction labels of each word in the sample text; and performing prosody prediction on the multi-modal fusion characteristics, and determining a reread prediction result of the sample text, wherein the reread prediction result comprises reread labels of words in the sample text corresponding to each audio frame of the sample audio in the sample pair.
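The slot prediction and prosody prediction can be sketched as two classification heads over the multi-modal fusion feature; the head structure and label counts below are illustrative assumptions:

```python
import torch
from torch import nn

class SlotAndProsodyHeads(nn.Module):
    """Slot prediction head (per word) and prosody/re-reading prediction head,
    both fed by the multi-modal fusion feature."""
    def __init__(self, fused_dim: int, num_slot_labels: int, num_prosody_labels: int = 2):
        super().__init__()
        self.slot_head = nn.Linear(fused_dim, num_slot_labels)        # e.g. O / B-name / I-name ...
        self.prosody_head = nn.Linear(fused_dim, num_prosody_labels)  # e.g. re-read / not re-read

    def forward(self, fused: torch.Tensor):
        return self.slot_head(fused), self.prosody_head(fused)

heads = SlotAndProsodyHeads(fused_dim=1536, num_slot_labels=10)
slot_logits, prosody_logits = heads(torch.randn(8, 1536))   # 8 fused word-level features
print(slot_logits.shape, prosody_logits.shape)
```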
A slot classification loss L_SF incurred by the slot prediction model in the training process is determined based on the obtained slot prediction result and the corresponding actual slot result; a prosody prediction loss L_PD incurred by the slot prediction model in the training process is determined based on the obtained prosody prediction result and the corresponding actual prosody result.
Finally, as shown in Equation 1, the total model loss L incurred in the training process is determined based on the slot classification loss and the prosody prediction loss, and the model parameters of the slot prediction model are adjusted based on the total model loss. Here, α is a manually set hyperparameter whose value ranges between 0 and 1.
L = α × L_SF + (1 − α) × L_PD    (Equation 1)
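Equation 1 corresponds to the following computation; the use of cross-entropy losses and the concrete value of α are illustrative assumptions:

```python
import torch
from torch import nn

ce = nn.CrossEntropyLoss()
alpha = 0.7                                   # manually set hyperparameter, 0 < alpha < 1

slot_logits = torch.randn(8, 10)              # 8 words, 10 possible slot labels
slot_targets = torch.randint(0, 10, (8,))     # ground-truth slot labels
prosody_logits = torch.randn(8, 2)            # re-read / not re-read
prosody_targets = torch.randint(0, 2, (8,))   # ground-truth re-read labels

L_SF = ce(slot_logits, slot_targets)          # slot classification loss
L_PD = ce(prosody_logits, prosody_targets)    # prosody prediction loss
L = alpha * L_SF + (1 - alpha) * L_PD         # total model loss (Equation 1)
print(float(L))
```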
S306: it is judged whether model training is complete; if so, the trained target slot prediction model is output; otherwise, the process returns to step S301.
In the model training stage, when the total loss of the current iteration does not exceed a set loss value, or the number of iteration rounds reaches the set number of rounds, model training is determined to be finished and the trained target slot prediction model is output; otherwise, the process returns to step S301 and the next round of iterative training continues.
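The loop-iteration training with this stopping criterion can be sketched as follows; the optimizer, learning rate, loss threshold, and maximum number of rounds are illustrative assumptions:

```python
import torch

def train(model, train_pairs, compute_total_loss, max_rounds=10, loss_threshold=0.05):
    """Iterate over the sample pairs until the total loss of a round no longer
    exceeds the set loss value, or the number of rounds reaches the set number."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(max_rounds):
        total = 0.0
        for sample_audio, sample_text, slot_labels, reread_labels in train_pairs:
            optimizer.zero_grad()
            loss = compute_total_loss(model, sample_audio, sample_text,
                                      slot_labels, reread_labels)  # Equation 1
            loss.backward()
            optimizer.step()
            total += float(loss)
        if total <= loss_threshold:           # stopping criterion on total loss
            break
    return model                              # trained target slot prediction model
```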
For the slot prediction task, the application considers the relationship between prosodic information in speech and slot prediction, and proposes a method for fusing prosodic information into the spoken language understanding task.
In the model application stage, the trained target slot prediction model comprises an audio encoder, a text encoder, an attention mechanism module and a slot prediction module. As shown in fig. 4, the target slot prediction model is used to perform the slot prediction, which includes the following three steps, and since most of the steps performed by the target slot prediction model are the same as those performed during the model training phase, the steps are not described in detail herein.
S401: extracting features of each audio frame contained in the acquired audio information to obtain corresponding initial audio features, and extracting features of each word contained in the text information identified based on the audio information to obtain corresponding initial text features;
S402: for each obtained initial text feature, the following operations are respectively executed: performing attention interaction on each obtained initial audio feature and an initial text feature to obtain weighted audio features representing the attention degree of each initial audio feature to the initial text feature;
S403: based on each weighted audio feature and each initial text feature, a multi-modal fusion feature is obtained, and a slot prediction result of the audio information is obtained by performing slot prediction on the multi-modal fusion feature.
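At the application stage, steps S401-S403 chain together as sketched below; the function names are placeholders for the modules described above, not an API defined by the application:

```python
import torch

def predict_slots(audio_waveform, recognized_text,
                  audio_encoder, text_encoder, cross_attention, fuse, slot_head):
    """End-to-end sketch of the slot prediction stage (S401-S403)."""
    with torch.no_grad():
        initial_audio = audio_encoder(audio_waveform)                   # S401: per-frame features
        initial_text = text_encoder(recognized_text)                    # S401: per-word features
        weighted_audio = cross_attention(initial_text, initial_audio)   # S402: attention interaction
        fused = fuse(weighted_audio, initial_text)                      # S403: multi-modal fusion
        slot_logits = slot_head(fused)
    return slot_logits.argmax(dim=-1)                                   # slot prediction result
```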
The target slot prediction model can be applied to a voice assistant scenario. As shown in FIGS. 5A-5B, the voice assistant client generates a corresponding control instruction based on the parsed slot prediction result and passes the instruction to the downstream application to execute the corresponding operation.
S501: the voice assistant client responds to a recording operation triggered by the user and records the user's voice instruction, namely "Hello, please play the song remember me", with the recording device;
S502: after completing the audio recording, the voice assistant client inputs the voice instruction into the speech recognition model for speech recognition, obtaining the corresponding text content "Hello, please play the song remember me";
S503: the voice instruction and the text content are respectively input into the target slot prediction model, and the corresponding slot prediction results O, B-action, B-name, I-name and B-type are obtained;
S504: the voice assistant client generates a control instruction "play with the music player" based on the slot prediction result, and sends the control instruction to the music player to control it to play the corresponding song.
Based on the same inventive concept as the method embodiment, the embodiment of the application also provides a training device of the slot position prediction model. As shown in fig. 6, the training apparatus 600 of the slot prediction model may include:
the model training unit 601 is configured to train a to-be-trained slot prediction model based on a plurality of sample pairs in a training set by adopting a loop iteration mode until a trained target slot prediction model is output, where each iteration includes:
the encoding unit 602 is configured to perform feature extraction on each audio frame included in the sample audio in the read sample pair to obtain a corresponding initial audio feature, and perform feature extraction on each word included in the sample text in the sample pair to obtain a corresponding initial text feature;
An attention unit 603, configured to perform the following operations for each obtained initial text feature: performing attention interaction on each obtained initial audio feature and an initial text feature to obtain weighted audio features representing the attention degree of each initial audio feature to the initial text feature;
a fusion unit 604, configured to obtain a multi-modal fusion feature based on each weighted audio feature and each initial text feature;
and a parameter adjustment unit 605, configured to perform slot prediction and prosody prediction on the one sample pair respectively based on the multi-modal fusion feature, and to adjust the model parameters of the slot prediction model based on the obtained slot prediction result and prosody prediction result.
Optionally, the attention unit 603 is configured to:
multiplying each obtained initial audio feature by an initial text feature to obtain the attention weight of each initial audio feature to the initial text feature;
and carrying out weighted summation on each initial audio feature and the corresponding attention weight to obtain weighted audio features representing the attention degree of each initial audio feature to one initial text feature.
Optionally, the parameter adjusting unit 605 is configured to:
carrying out slot prediction on the multi-modal fusion feature, and determining a slot prediction result of the sample text in the sample pair, wherein the slot prediction result comprises slot prediction labels of all words in the sample text;
and performing prosody prediction on the multi-modal fusion feature to determine a stress prediction result of the sample text, wherein the stress prediction result comprises stress labels of the words in the sample text corresponding to each audio frame of the sample audio in the sample pair.
Optionally, the parameter adjusting unit 605 is configured to:
determining the slot classification loss generated by the slot prediction model in the training process based on the obtained slot prediction result and the corresponding slot actual result;
determining the prosody prediction loss generated by the slot prediction model in the training process based on the obtained prosody prediction result and the corresponding prosody actual result;
and determining the total model loss generated by the model in the training process based on the slot classification loss and the prosody prediction loss, and adjusting the model parameters of the slot prediction model based on the total model loss.
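For illustration only, the combination of the slot classification loss and the prosody prediction loss into the total model loss may be sketched in Python (PyTorch) as follows; cross-entropy and the weighting coefficient prosody_weight are assumptions of this sketch, since the embodiments only require that the two losses be combined.

import torch
import torch.nn.functional as F

def total_training_loss(slot_logits: torch.Tensor, slot_labels: torch.Tensor,
                        prosody_logits: torch.Tensor, prosody_labels: torch.Tensor,
                        prosody_weight: float = 0.5) -> torch.Tensor:
    # slot_logits / prosody_logits: [batch, num_words, num_tags] head outputs
    # slot_labels / prosody_labels: [batch, num_words] actual label indices
    slot_loss = F.cross_entropy(slot_logits.flatten(0, 1), slot_labels.flatten())
    prosody_loss = F.cross_entropy(prosody_logits.flatten(0, 1), prosody_labels.flatten())
    # Total model loss used to adjust the model parameters.
    return slot_loss + prosody_weight * prosody_loss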
Optionally, the fusion unit 604 performs any one of the following ways:
splicing each weighted audio feature with each corresponding initial text feature in sequence to obtain a multi-mode fusion feature fused with the audio feature and the text feature;
And splicing the weighted audio feature set containing each weighted audio feature with the initial text feature set containing each initial text feature to obtain the multi-mode fusion feature fusing the audio feature and the text feature.
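For illustration only, the two splicing modes of the fusion unit 604 may be sketched in Python (PyTorch) as follows; the tensor shapes are assumptions of this sketch.

import torch

def fuse_by_word(weighted_audio: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    # First mode: each weighted audio feature is spliced with the corresponding
    # initial text feature: [num_words, dim] and [num_words, dim] -> [num_words, 2*dim].
    return torch.cat([weighted_audio, text_feats], dim=-1)

def fuse_by_set(weighted_audio: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    # Second mode: the set of weighted audio features is spliced with the set of
    # initial text features along the sequence axis: -> [2*num_words, dim].
    return torch.cat([weighted_audio, text_feats], dim=0)

The first mode keeps one fused vector per word, which is convenient for per-word slot labelling; the second mode keeps the two modalities as one longer sequence.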
Optionally, the encoding unit 602 is configured to:
extracting the characteristics of each audio frame contained in the sample audio in the read sample pair to obtain corresponding low-dimensional audio characteristics;
for each obtained low-dimensional audio feature, the following operations are performed: performing attention interaction on each obtained low-dimensional audio feature and one low-dimensional audio feature to obtain a context audio feature for representing the attention degree of each low-dimensional audio feature to one low-dimensional audio feature;
and carrying out linear processing on the contextual audio features of each audio frame to obtain corresponding initial audio features.
Optionally, the encoding unit 602 is configured to:
multiplying each obtained low-dimensional audio feature by one low-dimensional audio feature respectively to obtain the context weight of each low-dimensional audio feature to one low-dimensional audio feature;
and carrying out weighted summation on each low-dimensional audio feature and the corresponding context weight to obtain the context audio feature which characterizes the attention degree of each low-dimensional audio feature to one low-dimensional audio feature.
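For illustration only, the audio branch of the encoding unit 602 (self-attention over the low-dimensional frame features followed by linear processing) may be sketched in Python (PyTorch) as follows; the single-head, unscaled attention form and the dimensions are assumptions of this sketch.

import torch
from torch import nn

class AudioFrameEncoder(nn.Module):
    def __init__(self, low_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(low_dim, out_dim)

    def forward(self, low_feats: torch.Tensor) -> torch.Tensor:
        # low_feats: [num_frames, low_dim] low-dimensional audio features
        # Context weight of every low-dimensional feature for every other one.
        weights = torch.softmax(low_feats @ low_feats.T, dim=-1)  # [frames, frames]
        # Contextual audio features obtained by weighted summation.
        context = weights @ low_feats
        # Linear processing yields the initial audio feature of each frame.
        return self.linear(context)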
Based on the same inventive concept as the above method embodiment, an embodiment of the present application further provides a slot prediction apparatus. As shown in fig. 7, the slot prediction apparatus 700 may include:
the encoding unit 701 is configured to perform feature extraction on each audio frame included in the obtained audio information to obtain a corresponding initial audio feature, and perform feature extraction on each word included in the text information identified based on the audio information to obtain a corresponding initial text feature;
an attention unit 702, configured to perform the following operations, for each obtained initial text feature: performing attention interaction on each obtained initial audio feature and an initial text feature to obtain weighted audio features representing the attention degree of each initial audio feature to the initial text feature;
the slot prediction unit 703 is configured to obtain a multi-mode fusion feature based on each weighted audio feature and each initial text feature, and perform slot prediction on the multi-mode fusion feature to obtain a slot prediction result of the audio information.
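For illustration only, the inference flow of the slot prediction apparatus 700 after feature extraction may be sketched in Python (PyTorch) as follows; the softmax normalisation, the per-word splicing mode and the slot_head argument (assumed to be a linear layer mapping 2*dim to the number of slot tags) are assumptions of this sketch.

import torch

def predict_slots(audio_feats: torch.Tensor, text_feats: torch.Tensor,
                  slot_head: torch.nn.Module) -> torch.Tensor:
    # audio_feats: [num_frames, dim] initial audio features (encoding unit 701)
    # text_feats:  [num_words, dim]  initial text features (encoding unit 701)
    # Attention unit 702: one weighted audio feature per word.
    weights = torch.softmax(audio_feats @ text_feats.T, dim=0)   # [frames, words]
    weighted_audio = weights.T @ audio_feats                     # [words, dim]
    # Slot prediction unit 703: splice the features and predict per-word slot labels.
    fused = torch.cat([weighted_audio, text_feats], dim=-1)      # [words, 2*dim]
    return slot_head(fused).argmax(dim=-1)                       # slot prediction result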
For convenience of description, the above parts are described as being divided into modules (or units) by function. Of course, when implementing the present application, the functions of the modules (or units) may be implemented in one or more pieces of software or hardware.
Having described the training method, the prediction method and the apparatus of the slot prediction model according to exemplary embodiments of the present application, a computer device according to another exemplary embodiment of the present application is described next.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the application may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit", "module", or "system".
Based on the same inventive concept as the above-mentioned method embodiment, a computer device is also provided in the embodiment of the present application. In one embodiment, the computer device may be a server, such as server 230 shown in FIG. 2. In this embodiment, the structure of the computer device 800 is shown in fig. 8, and may include at least a memory 801, a communication module 803, and at least one processor 802.
A memory 801 for storing a computer program for execution by the processor 802. The memory 801 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a program required for running an instant communication function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 801 may be a volatile memory, such as a random-access memory (RAM); the memory 801 may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 801 may be any other medium that can be used to carry or store a desired computer program in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 801 may also be a combination of the above memories.
The processor 802 may include one or more central processing units (CPU), digital processing units, or the like. The processor 802 is configured to implement either the training method or the slot prediction method of the slot prediction model when invoking the computer program stored in the memory 801.
The communication module 803 is used for communicating with a terminal device and other servers.
The specific connection medium between the memory 801, the communication module 803, and the processor 802 is not limited in the embodiments of the present application. In fig. 8, the memory 801 and the processor 802 are connected by a bus 804, which is depicted by a bold line; the connections between the other components are merely illustrative and not limiting. The bus 804 may be classified as an address bus, a data bus, a control bus, or the like. For ease of description, only one bold line is depicted in fig. 8, but this does not mean that there is only one bus or only one type of bus.
The memory 801 includes a computer storage medium in which computer-executable instructions are stored; the computer-executable instructions are used to implement the training method and the slot prediction method of the slot prediction model according to the embodiments of the present application. The processor 802 is configured to perform either of these methods, as shown in fig. 3D and fig. 4.
In another embodiment, the computer device may also be other computer devices, such as the physical terminal device 210 shown in FIG. 2. In this embodiment, the structure of the computer device may include, as shown in fig. 9: communication component 910, memory 920, display unit 930, camera 940, sensor 950, audio circuit 960, bluetooth module 990, processor 980, and so forth.
The communication component 910 is configured to communicate with a server. In some embodiments, the communication component 910 may include a wireless fidelity (WiFi) module; the WiFi module belongs to short-range wireless transmission technologies, and the electronic device may help the object send and receive information through the WiFi module.
Memory 920 may be used to store software programs and data. The processor 980 performs various functions and data processing by running software programs or data stored in the memory 920. Memory 920 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. The memory 920 stores an operating system that enables the terminal device 210 to operate. The memory 920 may store the operating system and various application programs, and may also store a computer program for executing either the training method or the slot prediction method of the slot prediction model according to the embodiments of the present application.
The display unit 930 may be used to display information input by an object or information provided to the object, as well as a graphical user interface (GUI) of the various menus of the terminal device 210. In particular, the display unit 930 may include a display 932 provided on the front surface of the terminal device 210. The display 932 may be configured in the form of a liquid crystal display, light-emitting diodes, or the like. The display unit 930 may be used to display a slot prediction interface, a model training interface, and the like in the embodiments of the present application.
The display unit 930 may also be used to receive input digital or character information and to generate signal inputs related to object settings and function control of the physical terminal device 210. In particular, the display unit 930 may include a touch screen 931 disposed on the front surface of the terminal device 210, which may collect touch operations performed by the object on or near it, such as clicking a button or dragging a scroll box.
The touch screen 931 may cover the display screen 932, or the touch screen 931 may be integrated with the display screen 932 to implement the input and output functions of the physical terminal device 210, and the integrated touch screen may be simply referred to as a touch screen. The display unit 930 may display the application program and the corresponding operation steps in the present application.
The camera 940 may be used to capture still images, and the object may post the images captured by the camera 940 through the application. The number of cameras 940 may be one or more. The object generates an optical image through the lens that is projected onto the photosensitive element. The photosensitive element may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to the processor 980 for conversion into a digital image signal.
The physical terminal device may further comprise at least one sensor 950, such as an acceleration sensor 951, a distance sensor 952, a fingerprint sensor 953, a temperature sensor 954. The terminal device may also be configured with other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, light sensors, motion sensors, and the like.
The audio circuit 960, speaker 961, and microphone 962 may provide an audio interface between the object and the terminal device 210. The audio circuit 960 may convert received audio data into an electrical signal and transmit it to the speaker 961, which converts it into a sound signal for output. The physical terminal device 210 may also be configured with a volume button for adjusting the volume of the sound signal. On the other hand, the microphone 962 converts collected sound signals into electrical signals, which are received by the audio circuit 960 and converted into audio data; the audio data are then output to the communication component 910 for transmission to, for example, another physical terminal device 210, or output to the memory 920 for further processing.
The bluetooth module 990 is used for exchanging information with other bluetooth devices having a bluetooth module through a bluetooth protocol. For example, the physical terminal device may establish a bluetooth connection with a wearable electronic device (e.g., a smart watch) that also has a bluetooth module through the bluetooth module 990, so as to perform data interaction.
The processor 980 is the control center of the physical terminal device; it connects the various parts of the entire terminal using various interfaces and lines, and performs the various functions of the terminal device and processes data by running or executing software programs stored in the memory 920 and calling data stored in the memory 920. In some embodiments, the processor 980 may include one or more processing units; the processor 980 may also integrate an application processor, which mainly handles the operating system, user interfaces, applications, and the like, and a baseband processor, which mainly handles wireless communication. It will be appreciated that the baseband processor may also not be integrated into the processor 980. The processor 980 in the present application may run the operating system, applications, user-interface display, and touch response, as well as the training method and the slot prediction method of the slot prediction model in the embodiments of the present application. In addition, the processor 980 is coupled to the display unit 930.
It should be noted that the specific embodiments of the present application involve object-related data, such as the data used by the slot prediction model. When the above embodiments of the present application are applied to specific products or technologies, the object's permission or consent needs to be obtained, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
In some possible embodiments, aspects of the training method of the slot prediction model provided by the present application may also be implemented in the form of a program product, which includes a computer program for causing a computer device to perform the steps of any one of the training method and the slot prediction method of the slot prediction model according to the various exemplary embodiments of the present application described above when the program product is run on the computer device, for example, the computer device may perform the steps as shown in fig. 3D and fig. 4.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may take the form of a portable compact disc read only memory (CD-ROM) and comprise a computer program and may be run on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave in which a readable computer program is embodied. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
A computer program embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer programs for performing the operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer program may execute entirely on the user's computer device, partly on the user's computer device, as a stand-alone software package, partly on the user's computer device and partly on a remote computer device or entirely on the remote computer device. In the case of remote computer devices, the remote computer device may be connected to the user computer device through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer device (for example, through the Internet using an Internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in that particular order, or that all of the illustrated operations must be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having a computer-usable computer program embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus produce means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (15)

1. A training method for a slot prediction model, characterized by comprising the following steps:
training a to-be-trained slot prediction model based on a plurality of sample pairs in a training set by adopting a loop-iteration mode until a trained target slot prediction model is output, wherein each iteration comprises the following steps:
extracting features of each audio frame contained in sample audio in a read sample pair to obtain corresponding initial audio features, and extracting features of each word contained in sample text in the sample pair to obtain corresponding initial text features;
for each obtained initial text feature, the following operations are respectively executed: performing attention interaction on each obtained initial audio feature and one initial text feature to obtain a weighted audio feature representing the attention degree of each initial audio feature to the one initial text feature;
obtaining a multi-modal fusion feature based on each weighted audio feature and each initial text feature;
and performing slot prediction and prosody prediction on the one sample pair based on the multi-modal fusion feature, respectively, and adjusting model parameters of the slot prediction model based on the obtained slot prediction result and prosody prediction result.
2. The method of claim 1, wherein said performing an attention interaction of each of the obtained initial audio features with one initial text feature to obtain a weighted audio feature characterizing the attention of each of the initial audio features to the one initial text feature comprises:
multiplying each obtained initial audio feature by the one initial text feature respectively to obtain the attention weight of each initial audio feature to the one initial text feature;
and carrying out weighted summation on each initial audio feature and the corresponding attention weight to obtain weighted audio features which characterize the attention degree of each initial audio feature to the initial text feature.
3. The method of claim 1, wherein the performing slot prediction and prosody prediction on the one sample pair based on the multi-modal fusion feature, respectively, comprises:
performing slot prediction on the multi-modal fusion feature, and determining a slot prediction result of the sample text in the sample pair, wherein the slot prediction result comprises slot prediction labels of all words in the sample text;
and performing prosody prediction on the multi-modal fusion feature, and determining a stress prediction result of the sample text, wherein the stress prediction result comprises stress labels of the words in the sample text corresponding to the audio frames of the sample audio in the sample pair.
4. The method of claim 3, wherein adjusting model parameters of the slot prediction model based on the obtained slot prediction result and prosody prediction result comprises:
determining the slot classification loss generated by the slot prediction model in the training process based on the obtained slot prediction result and the corresponding slot actual result;
determining the prosody prediction loss generated by the slot prediction model in the training process based on the obtained prosody prediction result and the corresponding prosody actual result;
and determining the total model loss generated by the model in the training process based on the slot classification loss and the prosody prediction loss, and adjusting the model parameters of the slot prediction model based on the total model loss.
5. The method of claim 1, wherein the obtaining a multi-modal fusion feature based on each weighted audio feature and each initial text feature comprises any one of:
splicing the weighted audio features with the corresponding initial text features in sequence to obtain multi-mode fusion features fused with the audio features and the text features;
and splicing the weighted audio feature set containing the weighted audio features with the initial text feature set containing the initial text features to obtain the multi-mode fusion feature fusing the audio features and the text features.
6. The method of claim 1, wherein the feature extraction of each audio frame included in the sample audio in the read sample pair to obtain the corresponding initial audio feature comprises:
extracting the characteristics of each audio frame contained in the sample audio in the read sample pair to obtain corresponding low-dimensional audio characteristics;
for each obtained low-dimensional audio feature, the following operations are performed: performing attention interaction on each obtained low-dimensional audio feature and one low-dimensional audio feature to obtain a contextual audio feature representing the attention degree of each low-dimensional audio feature to the one low-dimensional audio feature;
And carrying out linear processing on the contextual audio features of each audio frame to obtain corresponding initial audio features.
7. The method of claim 6, wherein the performing the attention interaction of each of the obtained low-dimensional audio features with one of the low-dimensional audio features to obtain contextual audio features that characterize the degree of attention of each of the low-dimensional audio features to the one of the low-dimensional audio features comprises:
multiplying each obtained low-dimensional audio feature by the one low-dimensional audio feature respectively to obtain the context weight of each low-dimensional audio feature to the one low-dimensional audio feature;
and carrying out weighted summation on each low-dimensional audio feature and the corresponding context weight to obtain the context audio feature which characterizes the attention degree of each low-dimensional audio feature to the one low-dimensional audio feature.
8. A method of slot prediction, comprising:
extracting features of each audio frame contained in the acquired audio information to obtain corresponding initial audio features, and extracting features of each word contained in text information identified based on the audio information to obtain corresponding initial text features;
For each obtained initial text feature, the following operations are respectively executed: performing attention interaction on each obtained initial audio feature and one initial text feature to obtain a weighted audio feature representing the attention degree of each initial audio feature to the one initial text feature;
based on each weighted audio feature and each initial text feature, a multi-mode fusion feature is obtained, and a slot prediction result of the audio information is obtained by carrying out slot prediction on the multi-mode fusion feature.
9. A training device for a slot prediction model, comprising:
the model training unit is used for training a to-be-trained slot prediction model based on a plurality of sample pairs in a training set in a loop-iteration manner until a trained target slot prediction model is output, wherein each iteration comprises:
the coding unit is used for extracting the characteristics of each audio frame contained in the sample audio in the read sample pair to obtain corresponding initial audio characteristics, and extracting the characteristics of each word contained in the sample text in the sample pair to obtain corresponding initial text characteristics;
an attention unit for performing the following operations for each of the obtained initial text features, respectively: performing attention interaction on each obtained initial audio feature and one initial text feature to obtain a weighted audio feature representing the attention degree of each initial audio feature to the one initial text feature;
The fusion unit is used for obtaining multi-mode fusion characteristics based on each weighted audio characteristic and each initial text characteristic;
and the parameter adjustment unit is used for respectively performing slot prediction and prosody prediction on the one sample pair based on the multi-modal fusion feature, and adjusting the model parameters of the slot prediction model based on the obtained slot prediction result and prosody prediction result.
10. The apparatus of claim 9, wherein the attention unit is to:
multiplying each obtained initial audio feature by the one initial text feature respectively to obtain the attention weight of each initial audio feature to the one initial text feature;
and carrying out weighted summation on each initial audio feature and the corresponding attention weight to obtain weighted audio features which characterize the attention degree of each initial audio feature to the initial text feature.
11. The apparatus of claim 9, wherein the parameter adjustment unit is to:
performing slot prediction on the multi-modal fusion feature, and determining a slot prediction result of the sample text in the sample pair, wherein the slot prediction result comprises slot prediction labels of all words in the sample text;
and performing prosody prediction on the multi-modal fusion feature, and determining a stress prediction result of the sample text, wherein the stress prediction result comprises stress labels of the words in the sample text corresponding to the audio frames of the sample audio in the sample pair.
12. A slot prediction apparatus, comprising:
the encoding unit is used for extracting the characteristics of each audio frame contained in the acquired audio information to obtain corresponding initial audio characteristics, and extracting the characteristics of each word contained in the text information identified based on the audio information to obtain corresponding initial text characteristics;
an attention unit for performing the following operations for each of the obtained initial text features, respectively: performing attention interaction on each obtained initial audio feature and one initial text feature to obtain a weighted audio feature representing the attention degree of each initial audio feature to the one initial text feature;
and the slot prediction unit is used for obtaining a multi-modal fusion feature based on each weighted audio feature and each initial text feature, and performing slot prediction on the multi-modal fusion feature to obtain a slot prediction result of the audio information.
13. A computer device comprising a processor and a memory, wherein the memory stores program code that, when executed by the processor, causes the processor to perform the steps of the method of any one of claims 1 to 7 or the steps of the method of claim 8.
14. A computer readable storage medium, characterized in that it comprises a program code for causing a computer device to perform the steps of the method of any one of claims 1 to 7 or the steps of the method of claim 8 when said program code is run on the computer device.
15. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 7 or the steps of the method of claim 8.