WO2024023946A1 - Speech processing device, speech processing method, and speech processing program - Google Patents

Speech processing device, speech processing method, and speech processing program

Info

Publication number
WO2024023946A1
Authority
WO
WIPO (PCT)
Prior art keywords
loss function
model
learning
context
becomes smaller
Prior art date
Application number
PCT/JP2022/028843
Other languages
English (en)
Japanese (ja)
Inventor
智大 田中
亮 増村
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2022/028843 priority Critical patent/WO2024023946A1/fr
Publication of WO2024023946A1 publication Critical patent/WO2024023946A1/fr

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks

Definitions

  • the present invention relates to an audio processing device, an audio processing method, and an audio processing program.
  • the latter task refers to a task that uses speech as input, such as speech recognition.
  • self-supervised learning parameters are learned so that a context representation can be obtained from speech that takes into account previous and subsequent input.
  • Transformer is known as a neural network that can acquire a context representation (see, for example, Non-Patent Document 2).
  • the conventional technology has a problem in that the accuracy of tasks subsequent to self-supervised learning may decrease.
  • the self-supervised learning model for speech may overfit the learning data of the self-supervised learning.
  • a mismatch occurs between the self-supervised learning model and the data used in the subsequent task, and an effective representation for the subsequent task cannot be obtained.
  • a speech processing device includes a loss function calculation unit that calculates a first loss function, which becomes smaller as the vector obtained by quantizing speech features with a model approaches the context representation acquired by the model from those features, and a second loss function, which becomes smaller as the accuracy with which the model identifies meta information of the speech based on the context representation increases, and an updating unit that updates the parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger.
  • FIG. 1 is a diagram showing an example of the configuration of a first learning device.
  • FIG. 2 is a diagram showing an example of the configuration of the second learning device.
  • FIG. 3 is a diagram showing a configuration example of an estimation device.
  • FIG. 4 is a flowchart showing the overall flow of the learning process.
  • FIG. 5 is a flowchart showing the flow of self-supervised learning processing.
  • FIG. 6 is a flowchart showing the flow of relearning processing.
  • FIG. 7 is a flowchart showing the flow of inference processing.
  • FIG. 8 is a diagram showing an example of a computer that executes a learning program.
  • the model is, for example, a neural network and includes a speech encoder, a context network, a quantization network, a classification network and an additional network. Details of each network will be described later.
  • the additional network is a neural network for calculating the final output in the latter task described above.
  • Tasks include classification tasks, generation tasks, prediction tasks, and the like.
  • Tasks targeting speech include speech recognition, which obtains text from speech; speech classification, which classifies speech into predetermined types (e.g. speaker attributes, emotions); speaker identification, which identifies the speaker of the speech; and so on.
  • the process of optimizing model parameters to improve task accuracy is called learning process.
  • the process of actually executing a task using one or more models including additional networks that have been trained through the learning process is called inference process.
  • the learning process of this embodiment is comprised of two steps: self-supervised learning process and relearning process.
  • the speech encoder and context network are called a self-supervised learning model.
  • Self-supervised learning models can be used for several different tasks targeting speech.
  • additional networks are models specialized for specific tasks.
  • a self-supervised learning model is trained.
  • additional network learning is performed using the self-supervised learning model that has been trained in the self-supervised learning process.
  • the first learning device 10 performs the self-supervised learning process, the second learning device 20 performs the relearning process, and the inference device 50 performs the inference process.
  • the first learning device 10, the second learning device 20, and the inference device 50 may be realized by different computers, or may be realized by one computer.
  • the first learning device 10 is an example of a speech processing device. Moreover, any one or more of the first learning device 10, the second learning device 20, and the inference device 50 can function as a speech processing device.
  • FIG. 1 is a diagram showing an example of the configuration of a first learning device.
  • a set of pairs of an acoustic feature sequence X and a classification label l of meta information, {(X 1 , l 1 ), ..., (X M , l M )} with each l m ∈ {l 1 , ..., l L }, is input to the first learning device 10 as learning data.
  • the classification label l is the correct label.
  • M is the number of pairs of audio feature series and classification labels included in the learning data, and is an integer of 1 or more.
  • l l denotes the l-th type of classification label (l = 1, ..., L).
  • L is the number of types of classification labels prepared, and is an integer of 2 or more.
  • meta information is information representing the domain of audio (call center conversation audio, online conference audio, reading audio, etc.), language, gender, etc.
  • the acoustic features are, for example, log Mel filter bank coefficients (FBANK).
  • acoustic features are not limited to log Mel filter bank coefficients; other examples include MFCCs (Mel-frequency cepstral coefficients), ΔMFCC (the first derivative of MFCC), ΔΔMFCC (the second derivative of MFCC), logarithmic power, Δ logarithmic power (the first derivative of logarithmic power), and the like.
  • the acoustic feature amount may be a sample of raw speech.
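  • as a rough illustration only (not part of this disclosure), the following Python sketch computes log Mel filter bank (FBANK-like) and MFCC/ΔMFCC features with librosa; the sampling rate, filter counts, and window settings are assumptions.

```python
# Illustrative sketch, not the patent's implementation: FBANK-like and MFCC features.
import numpy as np
import librosa

def acoustic_features(wav_path, n_mels=80, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)            # raw speech samples
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         n_fft=400, hop_length=160)
    fbank = np.log(mel + 1e-10)                          # log Mel filter bank coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d_mfcc = librosa.feature.delta(mfcc)                 # first derivative of MFCC
    dd_mfcc = librosa.feature.delta(mfcc, order=2)       # second derivative of MFCC
    return fbank.T, np.vstack([mfcc, d_mfcc, dd_mfcc]).T # shape: (frames, dims)
```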
  • classification label may be represented by an L-dimensional 1-hot vector.
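  • for illustration only, an L-dimensional one-hot representation of the classification label could be built as follows (PyTorch is assumed here purely as an example):

```python
import torch
import torch.nn.functional as F

L = 4                      # number of prepared label types (illustrative value)
label_index = 2            # index of the correct meta-information label
one_hot = F.one_hot(torch.tensor(label_index), num_classes=L).float()
print(one_hot)             # tensor([0., 0., 1., 0.])
```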
  • the first learning device 10 includes a speech encoder unit 11, a context network unit 12, a quantization network unit 13, a classification network unit 14, a classification learning loss calculation unit 15, a context representation learning loss calculation unit 16, a learning parameter updating unit 17, and model information 10a.
  • the model information 10a consists of the parameters of the model used by the first learning device 10. Parameters include neural network weights and biases. In the learning process, the model information 10a is updated as appropriate.
  • I is the sequence length of the acoustic feature, and is an integer of 1 or more.
  • T is the sequence length of the voice intermediate feature vector sequence, and is an integer of 1 or more.
  • the audio encoder unit 11 calculates the audio intermediate feature vector sequence Z as shown in equation (1).
  • SpeechEncoder() is a function realized by a neural network, for example a convolutional neural network.
  • ⁇ se1 is a parameter of the audio encoder and can be learned. ⁇ se1 is read from the model information 10a.
  • the context network unit 12 applies masking to the intermediate feature vector sequence Z, which is the output of the audio encoder unit 11, as shown in equation (2).
  • Masking() is a function that performs masking in the time direction.
  • ContextNetwork() (context network) is a function realized by a neural network, for example the Transformer described in Non-Patent Document 2.
  • ⁇ se2 is a parameter of the context network and can be learned. ⁇ se2 is read from the model information 10a.
  • QuantizationNetwork() is a function realized by a neural network, and is composed of, for example, a fully connected neural network and a Gumbel softmax function.
  • the Gumbel softmax function is a differentiable function for propagating the output of a classifier (for example, a fully connected neural network) to a subsequent network.
  • the Gumbel softmax function is described in Reference 1, for example.
  • ⁇ qn is a parameter of the quantization network and can be learned. ⁇ qn is read from the model information 10a.
  • the number of dimensions of the probability sequence O is L.
  • Each element of the probability series O corresponds to each element ⁇ l 1 ,...,l L ⁇ of the classification label l.
  • GRL( ) is a function representing a gradient reversal layer (see, for example, Reference 1); it inverts the sign of the gradient during the backward pass of error backpropagation.
  • ClassNetwork() (classification network) is a function realized by a neural network, and is composed of, for example, a fully connected neural network and a softmax function.
  • θ cn is a parameter of the classification network and can be learned. θ cn is read from the model information 10a.
  • the classification learning loss calculation unit 15 calculates the classification learning loss L class for the classification label l as shown in equation (7).
  • ClassLoss( ) is a function that calculates the loss for identifying the classification label l, for example, cross entropy loss.
  • the context expression learning loss calculation unit 16 calculates a loss L context (context expression learning loss) for learning a context expression as shown in equation (8).
  • ContextLoss() is a function that calculates loss for learning context expressions, for example, Contrastive loss.
  • Sim() in equation (8) is a function that calculates the similarity between two vectors, and is, for example, a cosine similarity.
  • Q with a tilde above it represents the set of negative examples of the quantized vectors.
  • the temperature parameter in equation (8) is set in advance.
  • a pair (positive example) of an element of the quantized expression vector sequence Q and an element of the corresponding context expression vector sequence is used.
  • q t and c t are a pair of corresponding elements.
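  • a simplified sketch of a contrastive context representation learning loss in the spirit of equation (8): each context vector c t should be more similar (cosine similarity scaled by a temperature) to its own quantized vector q t than to negatives drawn from other time steps. The negative sampling scheme and temperature value are assumptions, and for brevity a sampled distractor may occasionally coincide with the positive.

```python
import torch
import torch.nn.functional as F

def context_loss(c, q, n_negatives=10, kappa=0.1):
    """c, q: (batch, T, dim) context and quantized representation sequences."""
    B, T, _ = c.shape
    loss = 0.0
    for t in range(T):
        pos = F.cosine_similarity(c[:, t], q[:, t], dim=-1) / kappa           # (B,)
        neg_steps = torch.randint(0, T, (n_negatives,))                        # distractor time steps
        negs = torch.stack([F.cosine_similarity(c[:, t], q[:, j], dim=-1)
                            for j in neg_steps], dim=-1) / kappa               # (B, n_negatives)
        logits = torch.cat([pos.unsqueeze(-1), negs], dim=-1)                  # positive in column 0
        targets = torch.zeros(B, dtype=torch.long, device=c.device)
        loss = loss + F.cross_entropy(logits, targets)
    return loss / T
```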
  • the learning parameter updating unit 17 updates the parameters of the model based on the classification learning loss L class and the context expression learning loss L context .
  • the learning parameter updating unit 17 updates the parameters for each mini-batch.
  • the learning parameter updating unit 17 updates the parameters using equations (9), (10), and (11).
  • equations (9) to (11) involve a learning rate, a weight on the context representation learning loss, and a weight on the classification learning loss.
  • the weight on the classification learning loss is made larger as learning progresses (as updates are repeated in mini-batch units), which adjusts the influence of that loss.
  • because the function GRL( ) is introduced in the calculation of the classification network unit 14 in equation (5), the sign of the weighted classification term in equation (11) is inverted.
  • as a result of equation (11), the parameters are updated so that the quantized vector obtained from the input approaches the acquired context representation, while the classification network learns to identify the meta information and the speech encoder and context network receive reversed gradients so that the meta information becomes harder to identify from the context representation.
  • in other words, the learning parameter updating unit 17 calculates the first loss function (the context representation learning loss L context ), which becomes smaller as the quantized vector approaches the context representation, and the second loss function (the classification learning loss L class ), which becomes smaller as the accuracy with which the model identifies the speech meta information based on the context representation increases. The learning parameter updating unit 17 then updates the parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger (equation (11)).
  • the learning parameter update section 17 corresponds to a loss function calculation section and an update section.
  • the loss function calculated by the update unit (the third loss function) is obtained by subtracting, from a first term that is the first loss function, a second term that is the second loss function multiplied by a weight that increases as the parameter updates are repeated, as shown in the sketch below.
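  • how the two losses might be combined in one mini-batch update is sketched here; names such as model.encoder, the linear weight schedule, and the use of the context_loss sketch above are assumptions rather than the disclosed implementation. Because the GRL sits inside the classification network, minimizing the summed loss trains the classifier normally while the speech encoder and context network receive reversed gradients, which corresponds to the sign inversion described for equation (11).

```python
import torch
import torch.nn.functional as F

def training_step(batch, model, class_network, optimizer, step, total_steps, alpha_max=1.0):
    """One mini-batch update combining the context representation loss and the classification loss."""
    x, meta_label = batch                                   # acoustic features, meta-information label
    z = model.encoder(x)                                    # equation (1)
    c, _ = model.context(z)                                 # equation (2), with masking applied
    q = model.quantizer(z)                                  # quantized representation vectors
    l_context = context_loss(c, q)                          # first loss function (sketch above)
    l_class = F.cross_entropy(class_network(c), meta_label) # second loss function
    alpha = alpha_max * min(1.0, step / total_steps)        # weight grows as updates are repeated
    loss = l_context + alpha * l_class                      # GRL flips the gradient of the second term upstream
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```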
  • this configuration corresponds to an adversarial neural network (ANN).
  • the ANN performs learning so that the context network does not identify meta information regarding audio. This allows the context network to obtain a universal representation without overfitting the learning data.
  • the context network will operate robustly even for speech in an unknown domain.
  • the meta information includes the domain of the voice, the characteristics of the voice (language, etc.), the attributes of the speaker of the voice (gender, age), etc., and is information different from the content of the utterance expressed in text or the like. Moreover, the content of the utterance may be rephrased as the content of information transmitted by voice.
  • the process is further repeated using the updated parameters. Furthermore, when a predetermined condition (for example, the number of repetitions) is satisfied, the iterative process ends.
  • the context network unit 12 is an example of a context expression calculation unit that inputs voice features into a model and calculates a context expression.
  • the classification network unit 14 is an example of a meta-information label calculation unit that inputs a context expression into a model and calculates a label that specifies audio meta-information.
  • the quantization network unit 13 is an example of a quantization vector calculation unit that inputs a voice feature amount into a model and calculates a quantized vector.
  • it can be said that the learning parameter update unit 17 calculates the first loss function so that it becomes smaller as the vector calculated by the quantization vector calculation unit approaches the context representation calculated by the context representation calculation unit, and calculates the second loss function so that it becomes smaller as the label calculation accuracy of the meta-information label calculation unit increases.
  • FIG. 2 is a diagram showing an example of the configuration of the second learning device.
  • the second learning device 20 uses the parameters updated by the first learning device 10 to perform learning of a relearning model for performing tasks related to speech.
  • the model in the first learning device 10 is a model that combines a speech encoder, a context network, a quantization network, and a classification network.
  • the relearning model is a model that combines a speech encoder, a context network, and an additional network.
  • a set of an acoustic feature sequence X and a subsequent task label l' is input to the second learning device 20 as learning data.
  • the subsequent task label l' is the correct label.
  • the subsequent task label l' corresponds to information according to the task, and does not need to indicate meta information.
  • the subsequent task label l' is the text corresponding to the speech. Further, the subsequent task label l' may indicate meta information like the classification label l. Note that the processing unit of the text corresponding to the voice in the subsequent task label l' may be a phoneme, a character, or a word.
  • the second learning device 20 includes a speech encoder section 21, a context network section 22, an additional network section 23, a subsequent task learning loss calculation section 24, a learning parameter update section 25, and model information 20a.
  • the model information 20a is parameters of a model trained by the first learning device 10.
  • the model information 20a includes at least parameters ⁇ se1 and ⁇ se2 . Furthermore, the model information 20a includes a parameter ⁇ add of an additional network depending on the task.
  • the audio encoder unit 21 calculates the audio intermediate feature vector sequence Z as shown in equation (1).
  • ⁇ se1 is a parameter of the audio encoder that has been updated by the first learning device 10, and is read from the model information 20a.
  • the context network unit 22 converts the intermediate feature vector sequence Z, which is the output of the audio encoder unit 21, into a context expression C as shown in equation (12). However, unlike the context network unit 12, the context network unit 22 does not perform masking.
  • ⁇ se2 is a parameter of the context network that has been updated by the first learning device 10, and is read from the model information 10a.
  • the additional network unit 23 calculates a probability sequence P (sequence of predicted probabilities) for the subsequent task label from the context expression vector sequence C that is the output of the context network unit 22, as shown in equation (13).
  • ClassNetwork() (classification network) in equation (13) is different from the classification network of the first learning device 10, and is trained in the second learning device 20.
  • the classification network of the second learning device 20 is a function realized by a neural network, and is composed of, for example, a bidirectional LSTM and a softmax function.
  • ⁇ add is a parameter of the classification network of the subsequent task and can be learned.
  • ⁇ addn is read from the model information 20a.
  • the subsequent task learning loss calculation unit 24 calculates the subsequent task learning loss L down for the subsequent task label l' as shown in equation (14).
  • Loss( ) is a function that calculates the loss of the subsequent task (for example, classification loss), for example, cross-entropy loss. Note that Loss( ) is changed as appropriate depending on the type of subsequent task (classification task, generation task, prediction task, etc.).
  • the learning parameter updating unit 25 updates the parameters of the model based on the loss L down of the subsequent task.
  • the learning parameter update unit 25 may fix some parameters and update other parameters. For example, the learning parameter updating unit 25 updates the parameter ⁇ add without updating the parameters ⁇ se1 and ⁇ se2 .
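  • a sketch of the relearning setup under the assumption that θ se1 and θ se2 are fixed and only the additional network parameter θ add is updated; the bidirectional-LSTM head and the optimizer choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdditionalNetwork(nn.Module):
    """Task-specific head on top of the context representations (e.g. bidirectional LSTM + linear)."""
    def __init__(self, dim=512, n_classes=100):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.out = nn.Linear(dim, n_classes)

    def forward(self, c):                       # c: (batch, T, dim)
        h, _ = self.lstm(c)                     # (batch, T, dim) after concatenating directions
        return self.out(h)                      # per-frame logits for the subsequent task

def build_finetune_optimizer(encoder, context, additional, lr=1e-4):
    for p in encoder.parameters():              # fix the speech encoder parameters (theta_se1)
        p.requires_grad_(False)
    for p in context.parameters():              # fix the context network parameters (theta_se2)
        p.requires_grad_(False)
    return torch.optim.Adam(additional.parameters(), lr=lr)  # update theta_add only
```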
  • the calculation of the loss L down of the subsequent task is performed for each mini-batch. Therefore, the learning parameter updating unit 25 updates the parameters for each mini-batch.
  • the process is further repeated using the updated parameters. Furthermore, when a predetermined condition (for example, the number of repetitions) is satisfied, the iterative process ends.
  • FIG. 3 is a diagram showing a configuration example of an estimation device.
  • the inference device 50 uses the relearning model to execute a task.
  • the acoustic feature sequence X is input to the inference device 50 as input data.
  • the inference device 50 estimates a label corresponding to the acoustic feature sequence X.
  • the inference device 50 includes a speech encoder section 51, a context network section 52, an additional network section 53, and model information 50a.
  • the model information 50a is the parameters of each model learned by the first learning device 10 and the second learning device 20.
  • the model information 50a includes a learned speech encoder parameter ⁇ se1 and a learned context network parameter ⁇ se2 . Furthermore, the model information 50a includes the learned additional network parameter ⁇ add .
  • the context network unit 52 converts the intermediate feature vector sequence Z, which is the output of the audio encoder unit 51, into a context representation C.
  • the additional network unit 53 calculates a probability sequence P (sequence of predicted probabilities) for the subsequent task label from the context representation vector sequence C that is the output of the context network unit 52.
  • the additional network unit 53 outputs classification results based on the probability sequence P.
  • the additional network unit 53 may output the probability sequence P, or may output information specifying the subsequent task label corresponding to the element with the largest value among the elements of the probability sequence P.
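  • a small sketch of this output step: given the probability sequence P computed by the additional network, either the probabilities themselves or the index of the largest element per frame is returned (function and argument names are illustrative).

```python
import torch

def decode(prob_sequence: torch.Tensor, return_probs: bool = False):
    """prob_sequence: (T, L') probabilities over subsequent-task labels for each frame."""
    if return_probs:
        return prob_sequence                    # output the probability sequence P as is
    return prob_sequence.argmax(dim=-1)         # label index with the largest value per frame
```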
  • FIG. 4 is a flowchart showing the overall flow of the learning process. As shown in FIG. 4, first, the first learning device 10 performs preliminary learning of a speech encoder, a context network, a quantization network, and a classification network (step S1).
  • the second learning device 20 uses the learned speech encoder and context network to learn an additional network (step S2). At this time, it is also possible to relearn the audio encoder and context network.
  • FIG. 5 is a flowchart showing the flow of self-supervised learning processing.
  • the self-supervised learning process corresponds to the process of step S1 in FIG.
  • the first learning device 10 inputs the acoustic feature sequence to the audio encoder and calculates the intermediate expression vector sequence (step S101).
  • the first learning device 10 applies masking to the intermediate expression vector sequence, inputs it to the context network, and calculates a context expression vector sequence (step S102).
  • the first learning device 10 inputs the intermediate representation vector sequence to the quantization network and calculates a quantized representation vector sequence (step S103).
  • the first learning device 10 applies GRL to the context expression vector sequence, inputs it to the classification network, and calculates a probability sequence for the classification label of the meta information (step S104).
  • the first learning device 10 calculates the classification learning loss based on the calculated probability sequence and the correct classification label of the meta information (step S105).
  • the first learning device 10 calculates a context expression learning loss based on the context expression vector sequence and the quantized expression vector sequence (step S106).
  • the first learning device 10 updates the parameters of the audio encoder, context network, quantization network, and classification network based on the classification learning loss and the context representation learning loss (step S107).
  • the first learning device 10 then determines whether the end condition is satisfied (step S108). If the end condition is satisfied (step S108, Yes), the first learning device 10 terminates the process. If the end condition is not satisfied (step S108, No), the first learning device 10 returns to step S101 and repeats the process using the model with the updated parameters.
  • the termination conditions include, for example, that the process has been repeated a certain number of times, that the amount of parameter updates has converged, etc.
  • FIG. 6 is a flowchart showing the flow of the relearning process.
  • the relearning process corresponds to the process of step S2 in FIG.
  • the second learning device 20 first inputs the acoustic feature sequence to the audio encoder and calculates the intermediate expression vector sequence (step S201).
  • the second learning device 20 inputs the intermediate expression vector sequence to the context network and calculates the context expression vector sequence (step S202).
  • the second learning device 20 inputs the context expression vector sequence to the additional network and calculates a probability sequence for the classification label according to the task (step S203).
  • the second learning device 20 calculates additional learning loss based on the calculated probability sequence and the correct classification label according to the task (step S204).
  • the second learning device 20 updates the parameters of the additional network based on the additional learning loss (step S205). At this time, it is also possible to relearn the audio encoder and context network.
  • the second learning device 20 then determines whether the end condition is satisfied (step S206). If the end condition is satisfied (step S206, Yes), the second learning device 20 terminates the process. If the end condition is not satisfied (step S206, No), the second learning device 20 returns to step S201 and repeats the process using the model with the updated parameters.
  • the termination conditions include, for example, that the process has been repeated a certain number of times, that the amount of parameter updates has converged, etc.
  • FIG. 7 is a flowchart showing the flow of inference processing.
  • the inference device 50 inputs the acoustic feature sequence to the audio encoder and calculates the intermediate representation vector sequence (step S501).
  • the inference device 50 inputs the intermediate representation vector sequence to the context network and calculates the context expression vector sequence (step S502).
  • the inference device 50 inputs the context expression vector sequence to the additional network and calculates a probability sequence for the classification label according to the task (step S503).
  • the inference device 50 outputs a classification result based on the calculated probability series (step S504).
  • as described above, the first learning device 10 calculates a first loss function that becomes smaller as the vector obtained by quantizing the speech features with the model approaches the context representation that the model acquires from those features, and a second loss function that becomes smaller as the accuracy with which the model identifies the speech meta information based on the context representation increases.
  • the first learning device 10 updates the parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger. This prevents the context network from overfitting to the learning data, and prevents the accuracy of tasks subsequent to self-supervised learning from decreasing.
  • furthermore, the first learning device 10 calculates a third loss function by subtracting, from a first term that is the first loss function, a second term that is the second loss function weighted with a weight that increases as parameter updates are repeated, and updates the model parameters so that the third loss function becomes smaller.
  • the first learning device 10 uses the updated parameters to learn a relearning model that executes tasks related to speech. Thereby, subsequent tasks by the additional network can be executed with high accuracy.
  • the first learning device 10 inputs the speech features into the model and calculates a context representation, inputs the context representation into the model and calculates a label that identifies the meta information of the speech, and inputs the speech features into the model and calculates a quantized vector.
  • the first learning device 10 calculates the first loss function so that it becomes smaller as the calculated vector approaches the calculated context representation, and calculates the second loss function so that it becomes smaller as the calculation accuracy of the label increases. Thereby, the first learning device 10 can consistently perform calculations using the model and update the parameters.
  • the speech processing device thus provides a specific improvement over the conventional machine learning method described in Non-Patent Document 1, and represents an improvement in the technical field of speech-related tasks that use machine learning models.
  • each component of each device shown in the drawings is functionally conceptual, and does not necessarily need to be physically configured as shown in the drawings.
  • the specific form of distributing and integrating each device is not limited to what is shown in the drawings; all or part of each device may be functionally or physically distributed or integrated in arbitrary units depending on various loads and usage conditions.
  • each processing function performed by each device may be realized, in whole or in part, by a CPU (Central Processing Unit) and a program that is analyzed and executed by the CPU, or as hardware using wired logic. Note that the program may be executed not only by a CPU but also by another processor such as a GPU.
  • the speech processing device can be implemented by installing, on a desired computer, a program that executes the above processing as packaged software or online software. For example, by causing an information processing device to execute the above program, the information processing device can be made to function as a speech processing device.
  • the information processing device referred to here includes a desktop or notebook personal computer.
  • information processing devices include mobile communication terminals such as smartphones, mobile phones, and PHSs (Personal Handyphone Systems), as well as slate terminals such as PDAs (Personal Digital Assistants).
  • the audio processing device can also be implemented as a learning server device that uses a terminal device used by a user as a client and provides services related to the above-mentioned learning processing to the client.
  • a learning server device is implemented as a server device that provides a learning service that takes learning data as input and outputs parameters of a trained model.
  • the learning server device may be implemented as a Web server, or may be implemented as a cloud that provides services related to the above-mentioned learning processing by outsourcing.
  • FIG. 8 is a diagram showing an example of a computer that executes a learning program.
  • Computer 1000 includes, for example, a memory 1010 and a CPU 1020.
  • the computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These parts are connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012.
  • the ROM 1011 stores, for example, a boot program such as BIOS (Basic Input Output System).
  • Hard disk drive interface 1030 is connected to hard disk drive 1090.
  • Disk drive interface 1040 is connected to disk drive 1100.
  • Serial port interface 1050 is connected to, for example, mouse 1110 and keyboard 1120.
  • Video adapter 1060 is connected to display 1130, for example.
  • the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, a program that defines each process of the learning device 5 is implemented as a program module 1093 in which computer-executable code is written.
  • Program module 1093 is stored in hard disk drive 1090, for example.
  • a program module 1093 for executing processing similar to the functional configuration of the learning device 5 is stored in the hard disk drive 1090.
  • the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
  • the setting data used in the processing of the embodiment described above is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. Then, the CPU 1020 reads out the program module 1093 and program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 as necessary, and executes the processing of the embodiment described above.
  • program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in a removable storage medium, for example, and read by the CPU 1020 via the disk drive 1100 or the like.
  • the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). The program module 1093 and program data 1094 may then be read by the CPU 1020 from another computer via the network interface 1070.
  • a speech processing device comprising a processor that calculates a first loss function, which becomes smaller as a vector obtained by quantizing speech features with a model approaches the context representation acquired by the model from the features, and a second loss function, which becomes smaller as the accuracy with which the model identifies meta information of the speech based on the context representation increases, and that updates parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger.
  • a speech processing device that calculates a third loss function by subtracting, from a first term that is the first loss function, a second term weighted with a weight that increases as the number of repeated parameter updates increases, and updates the parameters of the model so that the third loss function becomes smaller.
  • a speech processing device that uses the updated parameters to train a relearning model for performing tasks related to speech.
  • a speech processing device that inputs the speech features into the model and calculates a context representation, inputs the context representation into the model and calculates a label identifying meta information of the speech, and inputs the speech features into the model and calculates a quantized vector, wherein the first loss function is calculated so that it becomes smaller as the calculated vector approaches the calculated context representation, and the second loss function is calculated so that it becomes smaller as the calculation accuracy of the label increases.
  • a speech processing device that executes the task using the relearning model.
  • a non-transitory storage medium storing a program executable by a computer to perform audio processing,
  • the audio processing includes calculating a first loss function that becomes smaller as a vector obtained by quantizing speech features with a model approaches the context representation acquired by the model from the features, and a second loss function that becomes smaller as the accuracy with which the model identifies meta information of the speech based on the context representation increases, and updating the parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger.
  • An inference device comprising: an inference unit that performs inference processing regarding speech using a re-learning model that is trained using parameters of the model that has been trained through pre-learning processing.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

A first learning device (10) calculates a first loss function that becomes smaller as a vector obtained by the model quantizing the speech features becomes closer to a context representation that the model has acquired from the features, and a second loss function that becomes smaller as the accuracy with which the model identifies meta information of the speech on the basis of the context representation becomes higher. The first learning device (10) updates the parameters of the model such that the first loss function becomes smaller and the second loss function becomes larger.
PCT/JP2022/028843 2022-07-26 2022-07-26 Dispositif, procédé et programme de traitement de la parole WO2024023946A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/028843 WO2024023946A1 (fr) 2022-07-26 2022-07-26 Dispositif, procédé et programme de traitement de la parole

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/028843 WO2024023946A1 (fr) 2022-07-26 2022-07-26 Dispositif, procédé et programme de traitement de la parole

Publications (1)

Publication Number Publication Date
WO2024023946A1 true WO2024023946A1 (fr) 2024-02-01

Family

ID=89705831

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/028843 WO2024023946A1 (fr) 2022-07-26 2022-07-26 Dispositif, procédé et programme de traitement de la parole

Country Status (1)

Country Link
WO (1) WO2024023946A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951213A (zh) * 2021-02-09 2021-06-11 中国科学院自动化研究所 端到端的在线语音检测与识别方法、系统及设备
CN113327595A (zh) * 2021-06-16 2021-08-31 北京语言大学 发音偏误检测方法、装置及存储介质
WO2022044243A1 (fr) * 2020-08-28 2022-03-03 日本電信電話株式会社 Dispositif de formation, dispositif d'interférence, procédés associés et programme


Similar Documents

Publication Publication Date Title
US10643602B2 (en) Adversarial teacher-student learning for unsupervised domain adaptation
JP6637078B2 (ja) 音響モデル学習装置、音響モデル学習方法及びプログラム
JP6712642B2 (ja) モデル学習装置、その方法、及びプログラム
JP6222821B2 (ja) 誤り修正モデル学習装置、及びプログラム
CN111602148A (zh) 正则化神经网络架构搜索
CN111930914A (zh) 问题生成方法和装置、电子设备以及计算机可读存储介质
CN109308316B (zh) 一种基于主题聚类的自适应对话生成系统
JP7329393B2 (ja) 音声信号処理装置、音声信号処理方法、音声信号処理プログラム、学習装置、学習方法及び学習プログラム
CN115803806A (zh) 用于训练双模式机器学习言语识别模型的系统和方法
WO2023134067A1 (fr) Procédé et appareil d'entraînement de modèle de classification de parole, dispositif et support de stockage
JP7212596B2 (ja) 学習装置、学習方法および学習プログラム
Silva et al. Intelligent genetic fuzzy inference system for speech recognition: An approach from low order feature based on discrete cosine transform
JP7112348B2 (ja) 信号処理装置、信号処理方法及び信号処理プログラム
CN115066690A (zh) 搜索归一化-激活层架构
WO2024023946A1 (fr) Dispositif, procédé et programme de traitement de la parole
Long et al. Domain adaptation of lattice-free MMI based TDNN models for speech recognition
JP2018031812A (ja) 音声データ処理装置、音声データ処理方法および音声データ処理プログラム
JP2021039216A (ja) 音声認識装置、音声認識方法及び音声認識プログラム
WO2020162240A1 (fr) Dispositif de calcul de score de modèle linguistique, dispositif de création de modèle linguistique, procédés, programme et support d'enregistrement associés
Shinozaki et al. Automated development of dnn based spoken language systems using evolutionary algorithms
JP7170594B2 (ja) 同一事象に対して時系列に発生した異なるメディアデータを統合した学習モデルを構築するプログラム、装置及び方法
WO2020044755A1 (fr) Dispositif de reconnaissance vocale, procédé de reconnaissance vocale et programme
Ratajczak et al. Virtual Adversarial Training Applied to Neural Higher-Order Factors for Phone Classification.
JP2021039218A (ja) 学習装置、学習方法及び学習プログラム
CN112951270A (zh) 语音流利度检测的方法、装置和电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22953041

Country of ref document: EP

Kind code of ref document: A1