WO2019019667A1 - Speech processing method and apparatus, storage medium and processor - Google Patents

Speech processing method and apparatus, storage medium and processor

Info

Publication number
WO2019019667A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
model
vector
preset
processing
Prior art date
Application number
PCT/CN2018/079848
Other languages
French (fr)
Chinese (zh)
Inventor
刘若鹏
陈�峰
Original Assignee
深圳光启合众科技有限公司
深圳光启创新技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳光启合众科技有限公司, 深圳光启创新技术有限公司
Publication of WO2019019667A1 publication Critical patent/WO2019019667A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 2015/0631 - Creating reference templates; Clustering
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/26 - Speech to text systems

Definitions

  • the present invention relates to the field of data processing, and in particular to a speech processing method and apparatus, a storage medium, and a processor.
  • natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods for enabling effective communication between users and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field therefore involves natural language, the language people use every day, so it is closely related to the study of linguistics, although with important differences.
  • commonly used natural language processing methods include the conditional random field (CRF), the hidden Markov model (HMM), the recurrent neural network (RNN) model, and the long short-term memory (LSTM) model. However, improving processing accuracy requires increasing the model depth, which results in high processing complexity and low processing efficiency.
  • embodiments of the invention provide a speech processing method and apparatus, a storage medium, and a processor, so as to at least solve the technical problem of the low processing efficiency of speech processing methods in the prior art.
  • a speech processing method includes: acquiring speech vectors at a plurality of moments within a preset time period; processing the speech vectors at the plurality of moments with a preset speech model to obtain a plurality of pieces of text information corresponding to those speech vectors, wherein the preset speech model processes the speech vectors at the plurality of moments based on pre-stored parameter vectors for the plurality of moments; and outputting the plurality of pieces of text information.
  • the preset speech model includes a speech processing model and a parameter matrix, wherein the parameter matrix pre-stores the parameter vectors for the plurality of moments, and the speech processing model processes the speech vectors at the plurality of moments based on those parameter vectors to obtain the corresponding pieces of text information (a rough structural sketch is given below).
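As a rough illustration of that composition (a minimal sketch only; the class, method, and variable names below are hypothetical and not taken from the document, and the controller is left abstract), the preset speech model can be viewed as a controller paired with an external parameter matrix that supports read and write operations:

```python
import numpy as np

class PresetSpeechModel:
    """Hypothetical sketch: a controller (speech processing model) plus an
    external parameter matrix that pre-stores one parameter vector per moment."""

    def __init__(self, controller, num_moments, vector_size):
        self.controller = controller                        # e.g. an LSTM-based model
        self.memory = np.zeros((num_moments, vector_size))  # parameter matrix (memory matrix)

    def read(self, t):
        # read operation: fetch the parameter vector stored for moment t
        return self.memory[t]

    def write(self, t, vector):
        # write operation: store an updated parameter vector for moment t
        self.memory[t] = vector
```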
  • processing the speech vectors at the plurality of moments with the preset speech model to obtain the corresponding pieces of text information comprises: acquiring first parameter vectors for the plurality of moments from the parameter matrix by a read operation; modifying the speech processing model with the first parameter vectors to obtain a modified speech processing model; and processing the speech vectors at the plurality of moments with the modified speech processing model to obtain the plurality of pieces of text information.
  • while the speech vectors at the plurality of moments are being processed with the modified speech processing model to obtain the plurality of pieces of text information, the method further includes: obtaining second parameter vectors for the plurality of moments with the modified speech processing model; and writing the second parameter vectors for the plurality of moments into the parameter matrix by a write operation.
  • obtaining the second parameter vectors for the plurality of moments with the modified speech processing model includes: updating the first parameter vectors for the plurality of moments with the modified speech processing model to obtain the second parameter vectors for the plurality of moments.
  • before the speech vectors at the plurality of moments are processed with the preset speech model to obtain the corresponding pieces of text information, the method further includes: establishing an initial preset model, the initial preset model including a speech processing model and an initial parameter matrix; acquiring training data, the training data including a plurality of training speech vectors and the text information corresponding to each training speech vector; and training the initial preset model with the training data to obtain the preset speech model.
  • training the initial preset model with the training data to obtain the preset speech model includes: inputting the training data into the speech processing model to obtain preset parameter vectors; and writing the preset parameter vectors into the initial parameter matrix by a write operation to obtain the parameter matrix.
  • the speech processing model is an LSTM model
  • the parameter matrix is a memory matrix
  • the preset time period is determined according to the processing capability of the preset voice model.
  • a speech processing apparatus includes: a first acquiring module configured to acquire speech vectors at a plurality of moments within a preset time period; a processing module configured to process the speech vectors at the plurality of moments with a preset speech model to obtain a plurality of pieces of text information corresponding to those speech vectors, wherein the preset speech model processes the speech vectors at the plurality of moments based on pre-stored parameter vectors for the plurality of moments; and an output module configured to output the plurality of pieces of text information.
  • the preset speech model includes a speech processing model and a parameter matrix, wherein the parameter matrix pre-stores the parameter vectors for the plurality of moments, and the speech processing model processes the speech vectors at the plurality of moments based on those parameter vectors to obtain the corresponding pieces of text information.
  • the processing module includes: an obtaining submodule configured to acquire the first parameter vectors for the plurality of moments from the parameter matrix by a read operation; a correction submodule configured to modify the speech processing model with the first parameter vectors to obtain a modified speech processing model; and a first processing submodule configured to process the speech vectors at the plurality of moments with the modified speech processing model to obtain the plurality of pieces of text information.
  • the processing module further includes: a second processing submodule configured to obtain the second parameter vectors for the plurality of moments with the modified speech processing model; and a first storage submodule configured to write the second parameter vectors for the plurality of moments into the parameter matrix by a write operation.
  • the second processing sub-module is further configured to update the first parameter vector of the multiple moments by using the modified speech processing model to obtain a second parameter vector of the multiple moments.
  • the foregoing apparatus further includes: an establishing module, configured to establish an initial preset model, where the initial preset model includes: a voice processing model and an initial parameter matrix; and a second acquiring module, configured to acquire training data, where the training data includes: a plurality of training speech vectors, and text information corresponding to each training speech vector; and a training module, configured to train the initial preset model according to the training data to obtain a preset speech model.
  • the training module includes: a third processing submodule configured to input the training data into the speech processing model to obtain preset parameter vectors; and a second storage submodule configured to write the preset parameter vectors into the initial parameter matrix by a write operation to obtain the preset speech model.
  • the speech processing model is an LSTM model
  • the parameter matrix is a memory matrix
  • the foregoing apparatus further includes: a determining module, configured to determine a preset time period according to a processing capability of the preset voice model.
  • a storage medium includes a stored program, wherein when the program runs, the device on which the storage medium resides is controlled to execute the speech processing method of the above embodiment.
  • a processor is configured to run a program, wherein the program, when run, executes the speech processing method of the above embodiment.
  • speech vectors at a plurality of moments within a preset time period are acquired, the speech vectors at the plurality of moments are processed with a preset speech model to obtain a plurality of pieces of text information corresponding to those speech vectors, and the plurality of pieces of text information is output, thereby achieving natural language processing.
  • the solution provided by the above embodiments of the present invention can achieve the effects of improving processing efficiency, improving processing accuracy, reducing processing complexity, and reducing processing time.
  • FIG. 1 is a flow chart of a voice processing method according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of an optional preset speech model according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a repeating module of an optional speech processing model in accordance with an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a voice processing device in accordance with an embodiment of the present invention.
  • an embodiment of a speech processing method is provided. It should be noted that the steps illustrated in the flowchart of the figures may be performed in a computer system such as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one described herein.
  • FIG. 1 is a flowchart of a voice processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
  • Step S102 Acquire a speech vector of a plurality of times in a preset time period.
  • the preset time period is determined according to the processing capability of the preset voice model.
  • the preset time period may be set according to the processing capability of the actual speech processing model, and the plurality of moments may be a plurality of equally spaced sampling moments; for example, if the preset time period is 100 s and the sampling interval is 10 s, speech vectors at 10 moments can be acquired within the 100 s (see the small example below).
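As a trivial illustration of that example (the period and interval are the numbers quoted above; everything else is an assumption):

```python
preset_period_s = 100    # preset time period from the example above
sample_interval_s = 10   # sampling interval from the example above

# equally spaced sampling moments within the preset time period
sampling_moments = list(range(0, preset_period_s, sample_interval_s))
print(len(sampling_moments))  # 10 moments, so 10 speech vectors are acquired
```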
  • Step S104: processing the speech vectors at the plurality of moments with the preset speech model to obtain a plurality of pieces of text information corresponding to the speech vectors at the plurality of moments, wherein the preset speech model processes the speech vectors at the plurality of moments based on the pre-stored parameter vectors for the plurality of moments.
  • the preset speech model includes: a speech processing model and a parameter matrix, wherein the parameter matrix pre-stores the parameter vectors for the plurality of moments, and the speech processing model processes the speech vectors at the plurality of moments based on those parameter vectors to obtain the plurality of pieces of text information corresponding to the speech vectors at the plurality of moments.
  • the voice processing model is an LSTM model
  • the parameter matrix is a memory matrix
  • the foregoing preset speech model may be a neural Turing machine. As shown in FIG. 2, the neural Turing machine includes two components: a controller (i.e. the speech processing model described above) and a memory matrix (i.e. the parameter matrix described above). The memory matrix is an external storage matrix that stores the parameter vectors the speech processing model needs for speech processing, and the controller can read parameter vectors from, and write parameter vectors to, the memory matrix. The speech processing model may be an LSTM model, a special type of RNN that can learn long-term dependency information; the LSTM is deliberately designed to avoid the long-term dependency problem. Like other RNNs, the LSTM has a chain of repeating neural network modules, but unlike a single neural network layer, each repeating module has a different structure: as shown in FIG. 3, it can be composed of an input gate, a forget gate, and an output gate that interact in a particular way, thereby addressing the vanishing-gradient and exploding-gradient problems of RNNs. The standard gate equations are given below for reference.
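For reference, the usual formulation of such a gated repeating module is the standard LSTM cell below; the document itself does not spell these equations out, and the symbols follow common convention rather than the patent's notation.

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) &&\text{(forget gate)}\\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) &&\text{(input gate)}\\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right) &&\text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &&\text{(cell state)}\\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) &&\text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t) &&\text{(hidden state / output)}
\end{aligned}
```

Likewise, the read operation of a neural Turing machine is conventionally a weighted sum over memory rows, $r_t = \sum_i w_t(i)\,M_t(i)$ with $\sum_i w_t(i) = 1$ (Graves et al., 2014); the document only states that the controller reads and writes parameter vectors in the memory matrix, so this is background rather than the claimed mechanism.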
  • step S106 a plurality of text information is output.
  • in an optional solution, natural speech data at a plurality of sampling moments within the preset time period may be acquired according to the temporal characteristics of natural speech, yielding the speech vectors at the plurality of moments within the preset time period; a pre-trained neural Turing machine is then obtained and used to recognize the speech vectors at the plurality of moments, obtain the corresponding text information, and output the recognized text information.
  • speech vectors at a plurality of moments within a preset time period are acquired, the speech vectors at the plurality of moments are processed with a preset speech model to obtain a plurality of pieces of text information corresponding to those speech vectors, and the plurality of pieces of text information is output, thereby achieving natural language processing.
  • the solution provided by the above embodiments of the present invention can achieve the effects of improving processing efficiency, improving processing accuracy, reducing processing complexity, and reducing processing time.
  • step S104 the voice vector of the multiple moments is processed by using the preset voice model to obtain a plurality of text information corresponding to the voice vectors of the multiple moments, including:
  • Step S1040 Acquire a first parameter vector of a plurality of moments from the parameter matrix according to the read operation.
  • as shown in FIG. 2, the neural Turing machine may include a read head and a write head: a read operation through the read head reads the W parameters of the LSTM model from the memory matrix, and a write operation through the write head writes new W parameters into the memory matrix.
  • step S1042 the speech processing model is corrected by using the first parameter vector at a plurality of times to obtain a modified speech processing model.
  • step S1044 the speech vector of the plurality of times is processed by the corrected speech processing model to obtain a plurality of text information.
  • in an optional solution, after the speech vectors at the plurality of moments are acquired, for the natural speech processing at each moment the W parameter vector can be read from the memory matrix through the read head and input into the LSTM model; the LSTM model is modified accordingly, and the modified LSTM model is obtained. The speech vector can then be input to the modified LSTM model as the input vector, yielding the output vector of the LSTM model, i.e. the text information of that speech vector. After the speech vectors at all of the plurality of moments have been processed, the plurality of pieces of text information corresponding to them is obtained.
  • while, in step S1044, the speech vectors at the plurality of moments are processed with the modified speech processing model to obtain the plurality of pieces of text information, the method further includes:
  • step S1046 the second parameter vector of the plurality of times is obtained by using the modified speech processing model.
  • step S1046 using the modified speech processing model, obtaining a second parameter vector at multiple moments, including:
  • Step S10462 The first parameter vector of the plurality of times is updated by using the modified speech processing model to obtain a second parameter vector of the plurality of times.
  • Step S1048 writing a second parameter vector of the plurality of times into the parameter matrix according to the writing operation.
  • in an optional solution, for the natural speech processing at each moment, processing the speech vector with the LSTM model yields not only the text information of the speech vector but also a new W parameter vector; the new W parameter vector is written into the memory matrix through the write head and serves as the W parameter vector for the next moment, as in the sketch below.
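A minimal sketch of this per-moment read, modify, process, write cycle, under the assumption that "modifying" the LSTM model means loading the W parameter vector read from the memory matrix into it (all function and variable names are hypothetical):

```python
def recognize(speech_vectors, lstm, memory, read_head, write_head):
    """Hypothetical cycle: read W from the memory matrix, modify the LSTM with it,
    process the speech vector, then write the new W back for the next moment."""
    texts = []
    for t, x_t in enumerate(speech_vectors):
        w_t = read_head.read(memory, t)       # read operation: first parameter vector
        lstm.load_parameters(w_t)             # modify the speech processing model
        text_t, w_new = lstm.process(x_t)     # text info and updated (second) parameter vector
        write_head.write(memory, t, w_new)    # write operation: W parameter vector for the next moment
        texts.append(text_t)
    return texts
```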
  • before, in step S104, the speech vectors at the plurality of moments are processed with the preset speech model to obtain the plurality of pieces of text information corresponding to them, the method further includes:
  • Step S108 establishing an initial preset model, where the initial preset model comprises: a voice processing model and an initial parameter matrix.
  • Step S110 acquiring training data, wherein the training data comprises: a plurality of training speech vectors, and text information corresponding to each training speech vector.
  • Step S112 training the initial preset model according to the training data to obtain a preset voice model.
  • in an optional solution, the LSTM model in the neural Turing machine can be pre-established according to actual processing needs and the W parameter vectors in the memory matrix set to initial values; the pre-established neural Turing machine is then trained on the training data to obtain a neural Turing machine with high accuracy.
  • in step S112, training the initial preset model with the training data to obtain the preset speech model includes:
  • step S1122 the training data is input into the speech processing model to obtain a preset parameter vector.
  • Step S1124 The preset parameter vector is written into the initial parameter matrix by a write operation to obtain a parameter matrix.
  • in an optional solution, the plurality of training speech vectors in the training data may be used as the input vectors and the text information corresponding to each training speech vector as the output vector; training yields the preset W parameter vectors of the LSTM model, which are written into the memory matrix through the write head, thereby obtaining a neural Turing machine with high accuracy. A rough sketch of this training step follows.
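A sketch of this training step under the same assumptions as the earlier snippets; the loss, optimizer, and number of passes are not specified in the document and are chosen here purely for illustration:

```python
def train(initial_model, training_pairs, epochs=10):
    """Hypothetical training sketch: training_pairs holds
    (training_speech_vector, text_info) pairs; the preset W parameter vectors
    produced by the controller are written into the initial parameter matrix."""
    for _ in range(epochs):
        for t, (speech_vec, text_info) in enumerate(training_pairs):
            predicted, w_preset = initial_model.controller.process(speech_vec)
            initial_model.controller.update(predicted, text_info)  # e.g. a gradient step on some loss
            initial_model.write(t, w_preset)   # write operation into the parameter matrix
    return initial_model  # the trained model serves as the preset speech model
```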
  • an embodiment of a speech processing apparatus is provided.
  • FIG. 4 is a schematic diagram of a voice processing device according to an embodiment of the present invention. As shown in FIG. 4, the device includes:
  • the first obtaining module 41 is configured to acquire a voice vector at multiple moments in a preset time period.
  • the apparatus further includes: a determining module, configured to determine a preset time period according to a processing capability of the preset voice model.
  • the foregoing preset time period may be set according to the processing capability of the model, and the plurality of moments may be a plurality of equally spaced sampling moments; for example, if the preset time period is 100 s and the sampling interval is 10 s, speech vectors at 10 moments can be acquired within the 100 s.
  • the processing module 43 is configured to process the speech vectors at the plurality of moments with the preset speech model to obtain a plurality of pieces of text information corresponding to the speech vectors at the plurality of moments, wherein the preset speech model processes the speech vectors at the plurality of moments based on the pre-stored parameter vectors for the plurality of moments.
  • the preset speech model includes: a speech processing model and a parameter matrix, wherein the parameter matrix pre-stores the parameter vectors for the plurality of moments, and the speech processing model processes the speech vectors at the plurality of moments based on those parameter vectors to obtain the plurality of pieces of text information corresponding to the speech vectors at the plurality of moments.
  • the voice processing model is an LSTM model
  • the parameter matrix is a memory matrix
  • the foregoing preset speech model may be a neural Turing machine. As shown in FIG. 2, the neural Turing machine includes two components: a controller (i.e. the speech processing model described above) and a memory matrix (i.e. the parameter matrix described above). The memory matrix is an external storage matrix that stores the parameter vectors the speech processing model needs for speech processing, and the controller can read parameter vectors from, and write parameter vectors to, the memory matrix. The speech processing model may be an LSTM model, a special type of RNN that can learn long-term dependency information; the LSTM is deliberately designed to avoid the long-term dependency problem. Like other RNNs, the LSTM has a chain of repeating neural network modules, but unlike a single neural network layer, each repeating module has a different structure: as shown in FIG. 3, it can be composed of an input gate, a forget gate, and an output gate that interact in a particular way, thereby addressing the vanishing-gradient and exploding-gradient problems of RNNs.
  • the output module 45 is configured to output a plurality of text information.
  • in an optional solution, natural speech data at a plurality of sampling moments within the preset time period may be acquired according to the temporal characteristics of natural speech, yielding the speech vectors at the plurality of moments within the preset time period; a pre-trained neural Turing machine is then obtained and used to recognize the speech vectors at the plurality of moments, obtain the corresponding text information, and output the recognized text information.
  • speech vectors at a plurality of moments within a preset time period are acquired, the speech vectors at the plurality of moments are processed with a preset speech model to obtain a plurality of pieces of text information corresponding to those speech vectors, and the plurality of pieces of text information is output, thereby achieving natural language processing.
  • the solution provided by the above embodiments of the present invention can achieve the effects of improving processing efficiency, improving processing accuracy, reducing processing complexity, and reducing processing time.
  • the processing module 43 includes:
  • the obtaining submodule is configured to obtain the first parameter vector of the plurality of moments from the parameter matrix according to the read operation.
  • as shown in FIG. 2, the neural Turing machine may include a read head and a write head: a read operation through the read head reads the W parameters of the LSTM model from the memory matrix, and a write operation through the write head writes new W parameters into the memory matrix.
  • the correction submodule is configured to correct the speech processing model by using the first parameter vector at multiple moments to obtain a modified speech processing model.
  • the first processing sub-module is configured to process the speech vector of the plurality of moments by using the modified speech processing model to obtain a plurality of text information.
  • in an optional solution, after the speech vectors at the plurality of moments are acquired, for the natural speech processing at each moment the W parameter vector can be read from the memory matrix through the read head and input into the LSTM model; the LSTM model is modified accordingly, and the modified LSTM model is obtained. The speech vector can then be input to the modified LSTM model as the input vector, yielding the output vector of the LSTM model, i.e. the text information of that speech vector. After the speech vectors at all of the plurality of moments have been processed, the plurality of pieces of text information corresponding to them is obtained.
  • the processing module 43 further includes:
  • the second processing sub-module is configured to obtain the second parameter vector of the multiple moments by using the modified speech processing model.
  • the second processing sub-module is further configured to update the first parameter vector of the multiple moments by using the modified speech processing model to obtain the second parameter vector of the multiple moments.
  • the first storage submodule is configured to write the second parameter vector of the multiple moments into the parameter matrix according to the write operation.
  • in an optional solution, for the natural speech processing at each moment, processing the speech vector with the LSTM model yields not only the text information of the speech vector but also a new W parameter vector; the new W parameter vector is written into the memory matrix through the write head and serves as the W parameter vector for the next moment.
  • the device further includes:
  • an establishing module is configured to establish an initial preset model, the initial preset model including: a speech processing model and an initial parameter matrix.
  • a second acquiring module configured to acquire training data, where the training data includes: a plurality of training speech vectors, and text information corresponding to each training speech vector.
  • the training module is configured to train the initial preset model according to the training data to obtain a preset voice model.
  • in an optional solution, the LSTM model in the neural Turing machine can be pre-established according to actual processing needs and the W parameter vectors in the memory matrix set to initial values; the pre-established neural Turing machine is then trained on the training data to obtain a neural Turing machine with high accuracy.
  • the training module includes:
  • the third processing sub-module is configured to input the training data into the speech processing model to obtain a preset parameter vector.
  • the second storage submodule is configured to write a preset parameter vector into the initial parameter matrix by a write operation to obtain a parameter matrix.
  • in an optional solution, the plurality of training speech vectors in the training data may be used as the input vectors and the text information corresponding to each training speech vector as the output vector; training yields the preset W parameter vectors of the LSTM model, which are written into the memory matrix through the write head, thereby obtaining a neural Turing machine with high accuracy.
  • an embodiment of a storage medium is provided; the storage medium includes a stored program, wherein when the program runs, the device on which the storage medium resides is controlled to execute the speech processing method of Embodiment 1 above.
  • an embodiment of a processor is provided; the processor is configured to run a program, wherein the program, when run, executes the speech processing method of Embodiment 1 above.
  • the disclosed technical contents may be implemented in other manners.
  • the device embodiments described above are only schematic.
  • for example, the division into units may be a division by logical function; in an actual implementation there may be another division manner, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be in an electrical or other form.
  • the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
  • the part of the technical solution of the present invention that is essential or that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present invention.
  • the foregoing storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed are a speech processing method and apparatus, a storage medium and a processor. The method comprises: acquiring speech vectors at a plurality of moments in a pre-set time period (S102); processing, by means of a pre-set speech model, the speech vectors at the plurality of moments to obtain a plurality of pieces of text information corresponding to the speech vectors at the plurality of moments (S104), wherein the pre-set speech model processes, based on pre-stored parameter vectors at the plurality of moments, the speech vectors at the plurality of moments; and outputting the plurality of pieces of text information (S106). By means of the present invention, the technical problem in the prior art that a speech processing method has a low processing efficiency is solved.

Description

语音处理方法及装置、存储介质及处理器Voice processing method and device, storage medium and processor 技术领域Technical field
本发明涉及数据处理领域,具体而言,涉及一种语音处理方法及装置、存储介质及处理器。The present invention relates to the field of data processing, and in particular to a voice processing method and apparatus, a storage medium, and a processor.
背景技术Background technique
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现用户与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,所以它与语言学的研究有着密切的联系,但又有重要的区别。Natural language processing is an important direction in the field of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between users and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, the language that people use every day, so it is closely related to the study of linguistics, but there are important differences.
目前常用的自然语言处理方法有:条件随机场CRF,隐马尔科夫模型HMM,递归神经网络模型RNN和长短期记忆模型LSTM等,但是,为了提高处理精度,需要增加模型深度,导致处理复杂度高,处理效率低。The commonly used natural language processing methods are: conditional random field CRF, hidden Markov model HMM, recurrent neural network model RNN and long-term and short-term memory model LSTM, etc. However, in order to improve processing accuracy, it is necessary to increase the model depth, resulting in processing complexity. High, low processing efficiency.
针对现有技术中的语音处理方法的处理效率低的问题,目前尚未提出有效的解决方案。In view of the low processing efficiency of the voice processing method in the prior art, an effective solution has not been proposed yet.
发明内容Summary of the invention
本发明实施例提供了一种语音处理方法及装置、存储介质及处理器,以至少解决现有技术中的语音处理方法的处理效率低的技术问题。The embodiment of the invention provides a voice processing method and device, a storage medium and a processor, so as to at least solve the technical problem of low processing efficiency of the voice processing method in the prior art.
根据本发明实施例的一个方面,提供了一种语音处理方法,包括:获取预设时间段内多个时刻的语音向量;利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,其中,预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理;输出多个文本信息。According to an aspect of the embodiments of the present invention, a voice processing method includes: acquiring a voice vector at a plurality of times in a preset time period; and processing a voice vector at multiple times by using a preset voice model to obtain and The plurality of text information corresponding to the speech vector of the moment, wherein the preset speech model processes the speech vectors of the plurality of moments based on the parameter vectors of the plurality of pre-stored moments; and outputs the plurality of text information.
进一步地,预设语音模型包括:语音处理模型和参数矩阵,参数矩阵用于预先存储多个时刻的参数向量,语音处理模型用于基于多个时刻的参数向量对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息。Further, the preset speech model includes: a speech processing model and a parameter matrix, wherein the parameter matrix is used to pre-store parameter vectors of a plurality of moments, and the speech processing model is configured to process the speech vectors of the plurality of moments based on the parameter vectors of the plurality of moments. , obtaining a plurality of text information corresponding to the speech vectors of the plurality of times.
进一步地,利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,包括:根据读操作从参数矩阵中获取多个时刻的第一参数向量;利用多个时刻的第一参数向量对语音处理模型进行修正,得到修正后 的语音处理模型;利用修正后的语音处理模型对多个时刻的语音向量进行处理,得到多个文本信息。Further, processing the speech vector of the plurality of moments by using the preset speech model to obtain a plurality of text information corresponding to the speech vectors of the plurality of moments, comprising: acquiring the first plurality of moments from the parameter matrix according to the reading operation The parameter vector is obtained by modifying the speech processing model by using the first parameter vector at a plurality of times to obtain a modified speech processing model; and processing the speech vector at a plurality of times by using the modified speech processing model to obtain a plurality of text information.
进一步地,在利用修正后的语音处理模型对多个时刻的语音向量进行处理,得到多个文本信息的同时,上述方法还包括:利用修正后的语音处理模型,得到多个时刻的第二参数向量;根据写操作将多个时刻的第二参数向量写入参数矩阵。Further, while processing the speech vector of the plurality of times by using the modified speech processing model to obtain a plurality of text information, the method further includes: obtaining the second parameter of the plurality of times by using the modified speech processing model Vector; writes a second parameter vector of multiple moments to the parameter matrix according to a write operation.
进一步地,利用修正后的语音处理模型,得到多个时刻的第二参数向量,包括:利用修正后的语音处理模型对多个时刻的第一参数向量进行更新,得到多个时刻的第二参数向量。Further, the second parameter vector of the plurality of times is obtained by using the modified speech processing model, including: updating the first parameter vector of the plurality of times by using the modified speech processing model to obtain the second parameter of the plurality of times vector.
进一步地,在利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息之前,上述方法还包括:建立初始预设模型,初始预设模型包括:语音处理模型和初始参数矩阵;获取训练数据,其中,训练数据包括:多个训练语音向量,以及每个训练语音向量相对应的文本信息;根据训练数据对初始预设模型进行训练,得到预设语音模型。Further, before processing the speech vector of the plurality of moments by using the preset speech model to obtain a plurality of text information corresponding to the speech vectors of the plurality of moments, the method further includes: establishing an initial preset model, and initializing the preset The model includes: a speech processing model and an initial parameter matrix; acquiring training data, wherein the training data includes: a plurality of training speech vectors, and text information corresponding to each training speech vector; and training the initial preset model according to the training data, Get the default speech model.
进一步地,根据训练数据对初始预设模型进行训练,得到预设语音模型包括:将训练数据输入语音处理模型,得到预设参数向量;通过写操作将预设参数向量写入初始参数矩阵,得到参数矩阵。Further, the initial preset model is trained according to the training data, and the preset voice model is obtained by: inputting the training data into the voice processing model to obtain a preset parameter vector; and writing the preset parameter vector into the initial parameter matrix by using a write operation, Parameter matrix.
进一步地,语音处理模型为LSTM模型,参数矩阵为记忆矩阵。Further, the speech processing model is an LSTM model, and the parameter matrix is a memory matrix.
进一步地,根据预设语音模型的处理能力,确定预设时间段。Further, the preset time period is determined according to the processing capability of the preset voice model.
根据本发明实施例的另一方面,还提供了一种语音处理装置,包括:第一获取模块,用于获取预设时间段内多个时刻的语音向量;处理模块,用于利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,其中,预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理;输出模块,用于输出多个文本信息。According to another aspect of the present invention, a voice processing apparatus is further provided, including: a first acquiring module, configured to acquire a voice vector of a plurality of times in a preset time period; and a processing module, configured to use the preset voice The model processes the speech vectors at multiple moments to obtain a plurality of text information corresponding to the speech vectors of the plurality of moments, wherein the preset speech model is based on the pre-stored parameter vectors of the plurality of moments to the speech vectors of the plurality of moments Processing; output module for outputting multiple text information.
进一步地,预设语音模型包括:语音处理模型和参数矩阵,参数矩阵用于预先存储多个时刻的参数向量,语音处理模型用于基于多个时刻的参数向量对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息。Further, the preset speech model includes: a speech processing model and a parameter matrix, wherein the parameter matrix is used to pre-store parameter vectors of a plurality of moments, and the speech processing model is configured to process the speech vectors of the plurality of moments based on the parameter vectors of the plurality of moments. , obtaining a plurality of text information corresponding to the speech vectors of the plurality of times.
进一步地,处理模块包括:获取子模块,用于根据读操作从参数矩阵中获取多个时刻的第一参数向量;修正子模块,用于利用多个时刻的第一参数向量对语音处理模型进行修正,得到修正后的语音处理模型;第一处理子模块,用于利用修正后的语音处理模型对多个时刻的语音向量进行处理,得到多个文本信息。Further, the processing module includes: an obtaining submodule, configured to acquire a first parameter vector of the plurality of moments from the parameter matrix according to the reading operation; and a correction submodule, configured to perform the speech processing model by using the first parameter vector of the multiple moments Correction, the corrected speech processing model is obtained; the first processing sub-module is configured to process the speech vectors of the plurality of moments by using the modified speech processing model to obtain a plurality of text information.
进一步地,处理模块还包括:第二处理子模块,用于利用修正后的语音处理模型, 得到多个时刻的第二参数向量;第一存储子模块,用于根据写操作将多个时刻的第二参数向量写入参数矩阵。Further, the processing module further includes: a second processing submodule, configured to obtain a second parameter vector of the plurality of moments by using the modified speech processing model; and the first storage submodule configured to use the plurality of moments according to the writing operation The second parameter vector is written to the parameter matrix.
进一步地,所述第二处理子模块还用于利用修正后的语音处理模型对多个时刻的第一参数向量进行更新,得到多个时刻的第二参数向量。Further, the second processing sub-module is further configured to update the first parameter vector of the multiple moments by using the modified speech processing model to obtain a second parameter vector of the multiple moments.
进一步地,上述装置还包括:建立模块,用于建立初始预设模型,初始预设模型包括:语音处理模型和初始参数矩阵;第二获取模块,用于获取训练数据,其中,训练数据包括:多个训练语音向量,以及每个训练语音向量相对应的文本信息;训练模块,用于根据训练数据对初始预设模型进行训练,得到预设语音模型。Further, the foregoing apparatus further includes: an establishing module, configured to establish an initial preset model, where the initial preset model includes: a voice processing model and an initial parameter matrix; and a second acquiring module, configured to acquire training data, where the training data includes: a plurality of training speech vectors, and text information corresponding to each training speech vector; and a training module, configured to train the initial preset model according to the training data to obtain a preset speech model.
进一步地,训练模块包括:第三处理子模块,用于将训练数据输入语音处理模型,得到预设参数向量;第二存储子模块,用于通过写操作将预设参数向量写入初始参数矩阵,得到预设语音模型。Further, the training module includes: a third processing sub-module, configured to input training data into the speech processing model to obtain a preset parameter vector; and a second storage sub-module configured to write the preset parameter vector into the initial parameter matrix by using a write operation. , get the default speech model.
进一步地,语音处理模型为LSTM模型,参数矩阵为记忆矩阵。Further, the speech processing model is an LSTM model, and the parameter matrix is a memory matrix.
进一步地,上述装置还包括:确定模块,用于根据预设语音模型的处理能力,确定预设时间段。Further, the foregoing apparatus further includes: a determining module, configured to determine a preset time period according to a processing capability of the preset voice model.
根据本发明实施例的另一方面,还提供了一种存储介质,存储介质包括存储的程序,其中,在程序运行时控制存储介质所在设备执行上述实施例中的语音处理方法。According to another aspect of the embodiments of the present invention, there is also provided a storage medium comprising a stored program, wherein the device in which the storage medium is located is controlled to execute the voice processing method in the above embodiment while the program is running.
根据本发明实施例的另一方面,还提供了一种处理器,处理器用于运行程序,其中,程序运行时执行上述实施例中的语音处理方法。According to another aspect of an embodiment of the present invention, there is further provided a processor for executing a program, wherein the program is executed to execute the voice processing method in the above embodiment.
在本发明实施例中,获取预设时间段内多个时刻的语音向量,利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,输出多个文本信息,从而实现自然语言处理。容易注意到的是,由于获取到的是预设时间段内多个时刻的语音向量,并且预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理,从而实现利用自然语音的时序性特征,结合神经图灵机的记忆矩阵和LSTM模型,对自然语音进行处理,进而解决了现有技术中的语音处理方法的处理效率低的技术问题。因此,通过本发明上述实施例提供的方案,可以达到提高处理效率、提高处理准确度、降低处理复杂度、减少处理时间的效果。In the embodiment of the present invention, a speech vector of a plurality of times in a preset time period is acquired, and a speech vector of a plurality of times is processed by using a preset speech model to obtain a plurality of text information corresponding to the speech vectors of the multiple moments. , output multiple text information to achieve natural language processing. It is easy to notice that since the speech vectors of the plurality of times in the preset time period are acquired, and the preset speech model processes the speech vectors of the plurality of times based on the parameter vectors of the plurality of times stored in advance, thereby realizing By using the temporal characteristics of natural speech, combined with the memory matrix and LSTM model of the neural Turing machine, the natural speech is processed, and the technical problem of low processing efficiency of the speech processing method in the prior art is solved. Therefore, the solution provided by the above embodiments of the present invention can achieve the effects of improving processing efficiency, improving processing accuracy, reducing processing complexity, and reducing processing time.
附图说明DRAWINGS
此处所说明的附图用来提供对本发明的进一步理解,构成本申请的一部分,本发明的示意性实施例及其说明用于解释本发明,并不构成对本发明的不当限定。在附图中:The drawings described herein are intended to provide a further understanding of the invention, and are intended to be a part of the invention. In the drawing:
图1是根据本发明实施例的一种语音处理方法的流程图;1 is a flow chart of a voice processing method according to an embodiment of the present invention;
图2是根据本发明实施例的一种可选的预设语音模型的示意图;2 is a schematic diagram of an optional preset speech model according to an embodiment of the present invention;
图3是根据本发明实施例的一种可选的语音处理模型的重复模块的示意图;以及3 is a schematic diagram of a repeating module of an optional speech processing model in accordance with an embodiment of the present invention;
图4是根据本发明实施例的一种语音处理装置的示意图。4 is a schematic diagram of a voice processing device in accordance with an embodiment of the present invention.
具体实施方式Detailed ways
为了使本技术领域的人员更好地理解本发明方案,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分的实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本发明保护的范围。The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is an embodiment of the invention, but not all of the embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts shall fall within the scope of the present invention.
需要说明的是,本发明的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本发明的实施例能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。It is to be understood that the terms "first", "second", and the like in the specification and claims of the present invention are used to distinguish similar objects, and are not necessarily used to describe a particular order or order. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments of the invention described herein can be implemented in a sequence other than those illustrated or described herein. In addition, the terms "comprises" and "comprises" and "the" and "the" are intended to cover a non-exclusive inclusion, for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to Those steps or units may include other steps or units not explicitly listed or inherent to such processes, methods, products or devices.
实施例1Example 1
根据本发明实施例,提供了一种语音处理方法的实施例,需要说明的是,在附图的流程图示出的步骤可以在诸如一组计算机可执行指令的计算机系统中执行,并且,虽然在流程图中示出了逻辑顺序,但是在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤。In accordance with an embodiment of the present invention, an embodiment of a speech processing method is provided, it being noted that the steps illustrated in the flowchart of the figures may be performed in a computer system such as a set of computer executable instructions, and, although The logical order is shown in the flowcharts, but in some cases the steps shown or described may be performed in a different order than the ones described herein.
图1是根据本发明实施例的一种语音处理方法的流程图,如图1所示,该方法包括如下步骤:FIG. 1 is a flowchart of a voice processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
步骤S102,获取预设时间段内多个时刻的语音向量。Step S102: Acquire a speech vector of a plurality of times in a preset time period.
可选地,在本发明上述实施例中,根据预设语音模型的处理能力,确定预设时间段。Optionally, in the foregoing embodiment of the present invention, the preset time period is determined according to the processing capability of the preset voice model.
具体地,上述的预设时间段可以根据实际语音处理模型的处理能力进行设定,上述的多个时刻可以是间隔相等的多个采样时刻,例如,预设时间段为100s,采样间隔 为10s,则在100s内,可以获取到10个时刻的语音向量。Specifically, the preset time period may be set according to the processing capability of the actual voice processing model, and the multiple times may be multiple sampling times with equal intervals, for example, the preset time period is 100s, and the sampling interval is 10s. , within 100s, you can get the speech vector of 10 moments.
步骤S104,利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,其中,预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理。Step S104: processing the speech vector of the plurality of moments by using the preset speech model to obtain a plurality of text information corresponding to the speech vectors of the plurality of moments, wherein the preset speech model is based on the parameter vectors of the plurality of times stored in advance The speech vectors at multiple times are processed.
可选地,在本发明上述实施例中,预设语音模型包括:语音处理模型和参数矩阵,参数矩阵用于预先存储多个时刻的参数向量,语音处理模型用于基于多个时刻的参数向量对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息。Optionally, in the foregoing embodiment of the present invention, the preset voice model includes: a voice processing model and a parameter matrix, wherein the parameter matrix is used to pre-store parameter vectors of multiple moments, and the voice processing model is used for parameter vectors based on multiple moments. The speech vectors at a plurality of times are processed to obtain a plurality of text information corresponding to the speech vectors of the plurality of times.
可选地,在本发明上述实施例中,语音处理模型为LSTM模型,参数矩阵为记忆矩阵。Optionally, in the foregoing embodiment of the present invention, the voice processing model is an LSTM model, and the parameter matrix is a memory matrix.
具体地,上述的预设语音模型可以是神经图灵机,如图2所示,神经图灵机包括两个组成部分:控制器(即上述的语音处理模型)和记忆矩阵(即上述的参数矩阵),记忆矩阵为外部的存储矩阵,存储有语音处理模型进行语音处理所需要的参数向量,控制器可以对记忆矩阵中的参数向量进行读取和写入;上述的语音处理模型可以是LSTM模型,是一种RNN中特殊的类型,可以学习长期依赖信息,LSTM通过刻意的设计来避免长期依赖问题,具体地,LSTM与其他RNN一样,具有一种重复神经网络模块的链式的形式,但是,与单一神经网络层不同,重复的模块拥有一个不同的结构,如图3所示,可以由输入门,忘记门,输出门构成,并且以一种非常特殊的方式进行交互,从而解决了RNN的梯度消失和梯度爆炸的问题。Specifically, the foregoing preset speech model may be a neuroturing machine. As shown in FIG. 2, the neuroturing machine includes two components: a controller (ie, the above-described speech processing model) and a memory matrix (ie, the parameter matrix described above). The memory matrix is an external storage matrix, and stores a parameter vector required by the speech processing model for speech processing. The controller can read and write the parameter vector in the memory matrix; the above speech processing model can be an LSTM model. It is a special type of RNN that can learn long-term dependency information. LSTM avoids long-term dependency problems by deliberate design. Specifically, LSTM has the same chain form of repeated neural network modules as other RNNs. Unlike a single neural network layer, a repetitive module has a different structure, as shown in Figure 3, which can be composed of input gates, forgotten gates, output gates, and interacts in a very special way, thus solving the RNN Gradient disappearance and gradient explosion problems.
步骤S106,输出多个文本信息。In step S106, a plurality of text information is output.
在一种可选的方案中,可以根据自然语音的时序性特征,获取预设时间段内的多个采样时刻的自然语音数据,得到预设时间段内多个时刻的语音向量,获取预先训练好的神经图灵机,利用神经图灵机对多个时刻的语音向量进行识别,得到对应的文本信息,并输出识别出的文本信息。In an optional solution, the natural voice data of the plurality of sampling moments in the preset time period may be acquired according to the time-series feature of the natural voice, and the voice vector of the multiple time points in the preset time period is obtained, and the pre-training is obtained. A good neural Turing machine uses a neural Turing machine to recognize speech vectors at multiple times, obtain corresponding text information, and output the recognized text information.
根据本发明上述实施例,获取预设时间段内多个时刻的语音向量,利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,输出多个文本信息,从而实现自然语言处理。容易注意到的是,由于获取到的是预设时间段内多个时刻的语音向量,并且预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理,从而实现利用自然语音的时序性特征,结合神经图灵机的记忆矩阵和LSTM模型,对自然语音进行处理,进而解决了现有技术中的语音处理方法的处理效率低的技术问题。因此,通过本发明上述实施例提供的方案,可以达到提高处理效率、提高处理准确度、降低处理复杂度、减少处理时间的效 果。According to the foregoing embodiment of the present invention, a speech vector of a plurality of time instants in a preset time period is acquired, and a speech vector of a plurality of time instants is processed by using a preset speech model to obtain a plurality of text information corresponding to the speech vectors of the plurality of time instants. , output multiple text information to achieve natural language processing. It is easy to notice that since the speech vectors of the plurality of times in the preset time period are acquired, and the preset speech model processes the speech vectors of the plurality of times based on the parameter vectors of the plurality of times stored in advance, thereby realizing By using the temporal characteristics of natural speech, combined with the memory matrix and LSTM model of the neural Turing machine, the natural speech is processed, and the technical problem of low processing efficiency of the speech processing method in the prior art is solved. Therefore, the solution provided by the above embodiments of the present invention can achieve the effects of improving processing efficiency, improving processing accuracy, reducing processing complexity, and reducing processing time.
可选地,在本发明上述实施例中,步骤S104,利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,包括:Optionally, in the foregoing embodiment of the present invention, in step S104, the voice vector of the multiple moments is processed by using the preset voice model to obtain a plurality of text information corresponding to the voice vectors of the multiple moments, including:
步骤S1040,根据读操作从参数矩阵中获取多个时刻的第一参数向量。Step S1040: Acquire a first parameter vector of a plurality of moments from the parameter matrix according to the read operation.
具体地,如图2所示,神经图灵机可以包括:读头和写头,通过读头进行读操作可以从记忆矩阵中读取到LSTM模型中的W参数,通过写头进行写操作可以将新的W参数写入记忆矩阵中。Specifically, as shown in FIG. 2, the neural Turing machine may include: a read head and a write head, and the read operation may be read from the memory matrix by the read head to read the W parameter in the LSTM model, and the write operation may be performed by the write head. The new W parameters are written to the memory matrix.
步骤S1042,利用多个时刻的第一参数向量对语音处理模型进行修正,得到修正后的语音处理模型。In step S1042, the speech processing model is corrected by using the first parameter vector at a plurality of times to obtain a modified speech processing model.
步骤S1044,利用修正后的语音处理模型对多个时刻的语音向量进行处理,得到多个文本信息。In step S1044, the speech vector of the plurality of times is processed by the corrected speech processing model to obtain a plurality of text information.
在一种可选的方案中,在获取到多个时刻的语音向量之后,针对每个时刻的自然语音处理过程,可以通过读头从记忆矩阵中读取W参数向量,将W参数向量输入LSTM模型,对LSTM模型进行修正,得到修正后的LSTM模型,可以将语音向量作为输入向量,输入至修正后的LSTM模型,从而得到LSTM模型的输出向量,即语音向量的文本信息,在所有多个时刻的语音向量完成处理之后,得到多个时刻的语音向量相对应的多个文本信息。In an optional solution, after acquiring the speech vectors at multiple times, for the natural speech processing process at each moment, the W parameter vector can be read from the memory matrix by the read head, and the W parameter vector is input into the LSTM. The model, the LSTM model is modified, and the modified LSTM model is obtained. The speech vector can be input as an input vector to the modified LSTM model, thereby obtaining the output vector of the LSTM model, that is, the text information of the speech vector, in all of the plurality. After the speech vector of the time is completed, a plurality of pieces of text information corresponding to the speech vectors of the plurality of times are obtained.
可选地,在本发明上述实施例中,在步骤S1044,利用修正后的语音处理模型对多个时刻的语音向量进行处理,得到多个文本信息的同时,该方法还包括:Optionally, in the foregoing embodiment of the present invention, in step S1044, the voice vector of the plurality of times is processed by using the modified voice processing model to obtain a plurality of text information, and the method further includes:
步骤S1046,利用修正后的语音处理模型,得到多个时刻的第二参数向量。In step S1046, the second parameter vector of the plurality of times is obtained by using the modified speech processing model.
可选地,在本发明上述实施例中,步骤S1046,利用修正后的语音处理模型,得到多个时刻的第二参数向量,包括:Optionally, in the foregoing embodiment of the present invention, in step S1046, using the modified speech processing model, obtaining a second parameter vector at multiple moments, including:
步骤S10462,利用修正后的语音处理模型对多个时刻的第一参数向量进行更新,得到多个时刻的第二参数向量。Step S10462: The first parameter vector of the plurality of times is updated by using the modified speech processing model to obtain a second parameter vector of the plurality of times.
步骤S1048,根据写操作将多个时刻的第二参数向量写入参数矩阵。Step S1048, writing a second parameter vector of the plurality of times into the parameter matrix according to the writing operation.
在一种可选的方案中,针对每个时刻的自然语音处理过程,在利用LSTM模型对语音向量进行处理的过程中,不仅可以得到语音向量的文本信息,还可以得到新的W参数向量,通过写头将新的W参数向量写入记忆矩阵,作为下一个时刻的W参数向量。In an optional solution, for the natural speech processing process at each moment, in the process of processing the speech vector by using the LSTM model, not only the text information of the speech vector but also the new W parameter vector can be obtained. The new W parameter vector is written to the memory matrix by the write head as the W parameter vector for the next moment.
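Correspondingly, a non-limiting sketch of the write operation is given below: the updated W parameter vector produced at the current time instant is blended into the memory matrix by a write head in the usual erase-then-add manner of a neural Turing machine, so that it becomes the W parameter vector read at the next time instant. The identifiers (write_head, erase, add) and the dimensions are illustrative assumptions only.

```python
# Illustrative sketch only; all names and dimensions are hypothetical.
import numpy as np

def write_head(memory_matrix: np.ndarray, weights: np.ndarray,
               erase: np.ndarray, add: np.ndarray) -> np.ndarray:
    """NTM-style write: each memory row is partially erased, then the new content is added."""
    M = memory_matrix * (1.0 - np.outer(weights, erase))   # erase step
    return M + np.outer(weights, add)                      # add step

rows, cols = 8, 160
memory = np.random.randn(rows, cols)
write_weights = np.full(rows, 1.0 / rows)    # uniform addressing, purely illustrative
new_w = np.random.randn(cols)                # stands in for the updated W parameter vector
memory = write_head(memory, write_weights, erase=np.ones(cols), add=new_w)
```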
可选地，在本发明上述实施例中，在步骤S104，利用预设语音模型对多个时刻的语音向量进行处理，得到与多个时刻的语音向量相对应的多个文本信息之前，该方法还包括：Optionally, in the above embodiments of the present invention, before step S104 of processing the speech vectors at the plurality of time instants by using the preset speech model to obtain the plurality of pieces of text information corresponding to the speech vectors at the plurality of time instants, the method further includes:
步骤S108,建立初始预设模型,初始预设模型包括:语音处理模型和初始参数矩阵。Step S108, establishing an initial preset model, where the initial preset model comprises: a voice processing model and an initial parameter matrix.
步骤S110,获取训练数据,其中,训练数据包括:多个训练语音向量,以及每个训练语音向量相对应的文本信息。Step S110, acquiring training data, wherein the training data comprises: a plurality of training speech vectors, and text information corresponding to each training speech vector.
步骤S112,根据训练数据对初始预设模型进行训练,得到预设语音模型。Step S112, training the initial preset model according to the training data to obtain a preset voice model.
在一种可选的方案中，可以根据实际处理需要，预先建立神经图灵机中的LSTM模型，并将记忆矩阵中的W参数向量置为初始值，然后根据训练数据对预先建立的神经图灵机进行训练，得到准确度较高的神经图灵机。In an optional solution, the LSTM model in the neural Turing machine may be built in advance according to actual processing needs, the W parameter vectors in the memory matrix may be set to initial values, and the pre-built neural Turing machine may then be trained with the training data to obtain a neural Turing machine with high accuracy.
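A minimal sketch of building such an initial preset model is shown below, assuming a Python/NumPy representation in which an LSTM-style controller and a memory matrix initialised to a fixed value are held together in one hypothetical container class; the class name, dimensions and initial value are illustrative assumptions and not part of the specification.

```python
# Illustrative sketch only; all names and dimensions are hypothetical.
import numpy as np

class InitialPresetModel:
    """Hypothetical stand-in for the pre-built neural-Turing-machine model."""
    def __init__(self, rows: int, hidden: int, input_dim: int, init_value: float = 0.0):
        cols = 4 * hidden * (input_dim + hidden)                 # room for the four gate matrices
        self.memory_matrix = np.full((rows, cols), init_value)   # initial parameter matrix
        self.controller_w = np.random.randn(cols) * 0.01         # LSTM-style controller parameters

model = InitialPresetModel(rows=8, hidden=4, input_dim=6)
```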
可选地，在本发明上述实施例中，步骤S112，根据训练数据对初始预设模型进行训练，得到预设语音模型包括：Optionally, in the above embodiments of the present invention, step S112 of training the initial preset model according to the training data to obtain the preset speech model includes:
步骤S1122,将训练数据输入语音处理模型,得到预设参数向量。In step S1122, the training data is input into the speech processing model to obtain a preset parameter vector.
步骤S1124,通过写操作将预设参数向量写入初始参数矩阵,得到参数矩阵。Step S1124: The preset parameter vector is written into the initial parameter matrix by a write operation to obtain a parameter matrix.
在一种可选的方案中，为了得到准确度较高的神经图灵机，可以将训练数据中的多个训练语音向量作为输入向量，每个训练语音向量对应的文本信息作为输出向量，输入至LSTM模型中，得到LSTM模型的预设W参数向量，并通过写头将预设W参数向量写入记忆矩阵，从而得到准确度较高的神经图灵机。In an optional solution, in order to obtain a neural Turing machine with high accuracy, the plurality of training speech vectors in the training data may be used as input vectors and the text information corresponding to each training speech vector as the output vector, and they are input into the LSTM model to obtain the preset W parameter vectors of the LSTM model; the preset W parameter vectors are then written into the memory matrix by the write head, thereby obtaining a neural Turing machine with high accuracy.
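A deliberately simplified, non-limiting sketch of this training step follows. It replaces the LSTM with a trivial linear stand-in so that the flow — feed training speech vectors and their target text encodings through the model, obtain preset W parameter vectors, and write them into the parameter matrix — can be shown in a few lines; the loss, the update rule and all identifiers are illustrative assumptions only.

```python
# Illustrative sketch only; the real controller would be an LSTM, not a linear map.
import numpy as np

def train(memory: np.ndarray, speech_vecs: np.ndarray, targets: np.ndarray,
          lr: float = 0.1) -> np.ndarray:
    """Toy training loop: nudge a single parameter row toward a least-squares fit."""
    w = memory[0].copy()
    for x, y in zip(speech_vecs, targets):
        pred = w[: x.size] @ x                 # trivial linear stand-in for the LSTM output
        grad = (pred - y) * x
        w[: x.size] -= lr * grad               # update the preset W parameter vector
    memory[0] = w                              # write operation into the parameter matrix
    return memory

memory = np.zeros((8, 16))                     # initial parameter matrix (toy size)
xs = np.random.randn(20, 6)                    # training speech vectors (toy)
ys = np.random.randn(20)                       # text labels encoded as scalars (toy)
memory = train(memory, xs, ys)
```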
实施例2Example 2
根据本发明实施例,提供了一种语音处理装置的实施例。In accordance with an embodiment of the present invention, an embodiment of a speech processing apparatus is provided.
图4是根据本发明实施例的一种语音处理装置的示意图，如图4所示，该装置包括：FIG. 4 is a schematic diagram of a speech processing apparatus according to an embodiment of the present invention. As shown in FIG. 4, the apparatus includes:
第一获取模块41,用于获取预设时间段内多个时刻的语音向量。The first obtaining module 41 is configured to acquire a voice vector at multiple moments in a preset time period.
可选地,在本发明上述实施例中,该装置还包括:确定模块,用于根据预设语音模型的处理能力,确定预设时间段。Optionally, in the foregoing embodiment of the present invention, the apparatus further includes: a determining module, configured to determine a preset time period according to a processing capability of the preset voice model.
具体地，上述的预设时间段可以根据模型的处理能力进行设定，上述的多个时刻可以是间隔相等的多个采样时刻，例如，预设时间段为100s，采样间隔为10s，则在100s内，可以获取到10个时刻的语音向量。Specifically, the above preset time period may be set according to the processing capability of the model, and the above plurality of time instants may be a plurality of equally spaced sampling instants. For example, if the preset time period is 100 s and the sampling interval is 10 s, speech vectors at 10 time instants can be acquired within the 100 s.
处理模块43，用于利用预设语音模型对多个时刻的语音向量进行处理，得到与多个时刻的语音向量相对应的多个文本信息，其中，预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理。The processing module 43 is configured to process the speech vectors at the plurality of time instants by using the preset speech model to obtain a plurality of pieces of text information corresponding to the speech vectors at the plurality of time instants, wherein the preset speech model processes the speech vectors at the plurality of time instants based on pre-stored parameter vectors for the plurality of time instants.
可选地，在本发明上述实施例中，预设语音模型包括：语音处理模型和参数矩阵，参数矩阵用于预先存储多个时刻的参数向量，语音处理模型用于基于多个时刻的参数向量对多个时刻的语音向量进行处理，得到与多个时刻的语音向量相对应的多个文本信息。Optionally, in the above embodiments of the present invention, the preset speech model includes a speech processing model and a parameter matrix; the parameter matrix is configured to pre-store the parameter vectors for the plurality of time instants, and the speech processing model is configured to process the speech vectors at the plurality of time instants based on the parameter vectors for the plurality of time instants to obtain the plurality of pieces of text information corresponding to the speech vectors at the plurality of time instants.
可选地,在本发明上述实施例中,语音处理模型为LSTM模型,参数矩阵为记忆矩阵。Optionally, in the foregoing embodiment of the present invention, the voice processing model is an LSTM model, and the parameter matrix is a memory matrix.
具体地，上述的预设语音模型可以是神经图灵机，如图2所示，神经图灵机包括两个组成部分：控制器（即上述的语音处理模型）和记忆矩阵（即上述的参数矩阵），记忆矩阵为外部的存储矩阵，存储有语音处理模型进行语音处理所需要的参数向量，控制器可以对记忆矩阵中的参数向量进行读取和写入；上述的语音处理模型可以是LSTM模型，是一种RNN中特殊的类型，可以学习长期依赖信息，LSTM通过刻意的设计来避免长期依赖问题，具体地，LSTM与其他RNN一样，具有一种重复神经网络模块的链式的形式，但是，与单一神经网络层不同，重复的模块拥有一个不同的结构，如图3所示，可以由输入门，忘记门，输出门构成，并且以一种非常特殊的方式进行交互，从而解决了RNN的梯度消失和梯度爆炸的问题。Specifically, the above preset speech model may be a neural Turing machine. As shown in FIG. 2, the neural Turing machine includes two components: a controller (i.e., the above speech processing model) and a memory matrix (i.e., the above parameter matrix). The memory matrix is an external storage matrix that stores the parameter vectors required by the speech processing model for speech processing, and the controller can read parameter vectors from, and write parameter vectors into, the memory matrix. The above speech processing model may be an LSTM model, a special type of RNN that can learn long-term dependency information; the LSTM avoids the long-term dependency problem by deliberate design. Specifically, like other RNNs, the LSTM has the form of a chain of repeating neural network modules, but unlike a single neural-network layer, each repeating module has a different structure: as shown in FIG. 3, it may consist of an input gate, a forget gate and an output gate that interact in a very particular way, thereby solving the vanishing-gradient and exploding-gradient problems of RNNs.
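For reference, a standard LSTM cell of the kind referred to above can be written as follows (general background assumed by the editor, not recited verbatim in the specification), where x_t is the speech vector at time instant t, h_{t-1} the previous output, σ the logistic function and ⊙ element-wise multiplication; in the formulation assumed here, the weight matrices W_i, W_f, W_o, W_c correspond to the W parameters read from the memory matrix:

    i_t = σ(W_i · [h_{t-1}, x_t] + b_i)        (input gate)
    f_t = σ(W_f · [h_{t-1}, x_t] + b_f)        (forget gate)
    o_t = σ(W_o · [h_{t-1}, x_t] + b_o)        (output gate)
    c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)     (candidate cell state)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
    h_t = o_t ⊙ tanh(c_t)

Because the cell state c_t is updated additively rather than through repeated multiplication, gradients can flow across many time instants, which is why this structure mitigates the vanishing-gradient and exploding-gradient problems mentioned above.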
输出模块45,用于输出多个文本信息。The output module 45 is configured to output a plurality of text information.
在一种可选的方案中，可以根据自然语音的时序性特征，获取预设时间段内的多个采样时刻的自然语音数据，得到预设时间段内多个时刻的语音向量，获取预先训练好的神经图灵机，利用神经图灵机对多个时刻的语音向量进行识别，得到对应的文本信息，并输出识别出的文本信息。In an optional solution, natural speech data at a plurality of sampling instants within the preset time period may be acquired according to the temporal characteristics of natural speech to obtain the speech vectors at the plurality of time instants within the preset time period; a pre-trained neural Turing machine is then acquired, the speech vectors at the plurality of time instants are recognized by the neural Turing machine to obtain the corresponding text information, and the recognized text information is output.
根据本发明上述实施例，获取预设时间段内多个时刻的语音向量，利用预设语音模型对多个时刻的语音向量进行处理，得到与多个时刻的语音向量相对应的多个文本信息，输出多个文本信息，从而实现自然语言处理。容易注意到的是，由于获取到的是预设时间段内多个时刻的语音向量，并且预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理，从而实现利用自然语音的时序性特征，结合神经图灵机的记忆矩阵和LSTM模型，对自然语音进行处理，进而解决了现有技术中的语音处理方法的处理效率低的技术问题。因此，通过本发明上述实施例提供的方案，可以达到提高处理效率、提高处理准确度、降低处理复杂度、减少处理时间的效果。According to the above embodiments of the present invention, speech vectors at a plurality of time instants within a preset time period are acquired, the speech vectors at the plurality of time instants are processed by using a preset speech model to obtain a plurality of pieces of text information corresponding to the speech vectors at the plurality of time instants, and the plurality of pieces of text information are output, thereby realizing natural language processing. It is easy to notice that, since the speech vectors acquired are those at the plurality of time instants within the preset time period, and the preset speech model processes them based on pre-stored parameter vectors for the plurality of time instants, the temporal characteristics of natural speech are exploited and natural speech is processed by combining the memory matrix of a neural Turing machine with an LSTM model, which solves the technical problem of low processing efficiency of the speech processing methods in the prior art. Therefore, the solution provided by the above embodiments of the present invention achieves the effects of improving processing efficiency, improving processing accuracy, reducing processing complexity and reducing processing time.
可选地,在本发明上述实施例中,处理模块43包括:Optionally, in the foregoing embodiment of the present invention, the processing module 43 includes:
获取子模块,用于根据读操作从参数矩阵中获取多个时刻的第一参数向量。The obtaining submodule is configured to obtain the first parameter vector of the plurality of moments from the parameter matrix according to the read operation.
具体地，如图2所示，神经图灵机可以包括：读头和写头，通过读头进行读操作可以从记忆矩阵中读取到LSTM模型中的W参数，通过写头进行写操作可以将新的W参数写入记忆矩阵中。Specifically, as shown in FIG. 2, the neural Turing machine may include a read head and a write head. A read operation performed by the read head reads the W parameters of the LSTM model from the memory matrix, and a write operation performed by the write head writes new W parameters into the memory matrix.
修正子模块,用于利用多个时刻的第一参数向量对语音处理模型进行修正,得到修正后的语音处理模型。The correction submodule is configured to correct the speech processing model by using the first parameter vector at multiple moments to obtain a modified speech processing model.
第一处理子模块,用于利用修正后的语音处理模型对多个时刻的语音向量进行处理,得到多个文本信息。The first processing sub-module is configured to process the speech vector of the plurality of moments by using the modified speech processing model to obtain a plurality of text information.
在一种可选的方案中，在获取到多个时刻的语音向量之后，针对每个时刻的自然语音处理过程，可以通过读头从记忆矩阵中读取W参数向量，将W参数向量输入LSTM模型，对LSTM模型进行修正，得到修正后的LSTM模型，可以将语音向量作为输入向量，输入至修正后的LSTM模型，从而得到LSTM模型的输出向量，即语音向量的文本信息，在所有多个时刻的语音向量完成处理之后，得到多个时刻的语音向量相对应的多个文本信息。In an optional solution, after the speech vectors at the plurality of time instants are acquired, for the natural speech processing at each time instant, the W parameter vector may be read from the memory matrix by the read head and input into the LSTM model so as to modify the LSTM model and obtain a modified LSTM model. The speech vector may then be input, as the input vector, into the modified LSTM model to obtain the output vector of the LSTM model, that is, the text information of the speech vector. After the speech vectors at all of the plurality of time instants have been processed, a plurality of pieces of text information corresponding to the speech vectors at the plurality of time instants are obtained.
可选地,在本发明上述实施例中,处理模块43还包括:Optionally, in the foregoing embodiment of the present invention, the processing module 43 further includes:
第二处理子模块,用于利用修正后的语音处理模型,得到多个时刻的第二参数向量。The second processing sub-module is configured to obtain the second parameter vector of the multiple moments by using the modified speech processing model.
可选地,在本发明上述实施例中,第二处理子模块还用于利用修正后的语音处理模型对多个时刻的第一参数向量进行更新,得到多个时刻的第二参数向量。Optionally, in the foregoing embodiment of the present invention, the second processing sub-module is further configured to update the first parameter vector of the multiple moments by using the modified speech processing model to obtain the second parameter vector of the multiple moments.
第一存储子模块,用于根据写操作将多个时刻的第二参数向量写入参数矩阵。The first storage submodule is configured to write the second parameter vector of the multiple moments into the parameter matrix according to the write operation.
在一种可选的方案中,针对每个时刻的自然语音处理过程,在利用LSTM模型对语音向量进行处理的过程中,不仅可以得到语音向量的文本信息,还可以得到新的W参数向量,通过写头将新的W参数向量写入记忆矩阵,作为下一个时刻的W参数向量。In an optional solution, for the natural speech processing process at each moment, in the process of processing the speech vector by using the LSTM model, not only the text information of the speech vector but also the new W parameter vector can be obtained. The new W parameter vector is written to the memory matrix by the write head as the W parameter vector for the next moment.
可选地,在本发明上述实施例中,该装置还包括:Optionally, in the foregoing embodiment of the present invention, the device further includes:
建立模块,用于建立初始预设模型,初始预设模型包括:语音处理模型和初始参数矩阵。The module is established to establish an initial preset model, and the initial preset model includes: a voice processing model and an initial parameter matrix.
第二获取模块，用于获取训练数据，其中，训练数据包括：多个训练语音向量，以及每个训练语音向量相对应的文本信息。The second acquiring module is configured to acquire training data, wherein the training data includes a plurality of training speech vectors and the text information corresponding to each training speech vector.
训练模块,用于根据训练数据对初始预设模型进行训练,得到预设语音模型。The training module is configured to train the initial preset model according to the training data to obtain a preset voice model.
在一种可选的方案中，可以根据实际处理需要，预先建立神经图灵机中的LSTM模型，并将记忆矩阵中的W参数向量置为初始值，然后根据训练数据对预先建立的神经图灵机进行训练，得到准确度较高的神经图灵机。In an optional solution, the LSTM model in the neural Turing machine may be built in advance according to actual processing needs, the W parameter vectors in the memory matrix may be set to initial values, and the pre-built neural Turing machine may then be trained with the training data to obtain a neural Turing machine with high accuracy.
可选地,在本发明上述实施例中,训练模块包括:Optionally, in the foregoing embodiment of the present invention, the training module includes:
第三处理子模块,用于将训练数据输入语音处理模型,得到预设参数向量。The third processing sub-module is configured to input the training data into the speech processing model to obtain a preset parameter vector.
第二存储子模块,用于通过写操作将预设参数向量写入初始参数矩阵,得到参数矩阵。The second storage submodule is configured to write a preset parameter vector into the initial parameter matrix by a write operation to obtain a parameter matrix.
在一种可选的方案中，为了得到准确度较高的神经图灵机，可以将训练数据中的多个训练语音向量作为输入向量，每个训练语音向量对应的文本信息作为输出向量，输入至LSTM模型中，得到LSTM模型的预设W参数向量，并通过写头将预设W参数向量写入记忆矩阵，从而得到准确度较高的神经图灵机。In an optional solution, in order to obtain a neural Turing machine with high accuracy, the plurality of training speech vectors in the training data may be used as input vectors and the text information corresponding to each training speech vector as the output vector, and they are input into the LSTM model to obtain the preset W parameter vectors of the LSTM model; the preset W parameter vectors are then written into the memory matrix by the write head, thereby obtaining a neural Turing machine with high accuracy.
实施例3Example 3
根据本发明实施例,提供了一种存储介质的实施例,存储介质包括存储的程序,其中,在程序运行时控制存储介质所在设备执行上述实施例1中的语音处理方法。According to an embodiment of the present invention, an embodiment of a storage medium is provided. The storage medium includes a stored program, wherein the device in which the storage medium is located is controlled to execute the voice processing method in Embodiment 1 above when the program is running.
实施例4Example 4
根据本发明实施例,提供了一种处理器的实施例,处理器用于运行程序,其中,程序运行时执行上述实施例1中的语音处理方法。According to an embodiment of the present invention, there is provided an embodiment of a processor for executing a program, wherein the program is executed to execute the voice processing method in Embodiment 1 above.
上述本发明实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the embodiments of the present invention are merely for the description, and do not represent the advantages and disadvantages of the embodiments.
在本发明的上述实施例中，对各个实施例的描述都各有侧重，某个实施例中没有详述的部分，可以参见其他实施例的相关描述。In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
在本申请所提供的几个实施例中，应该理解到，所揭露的技术内容，可根据其它的方式实现。其中，以上所描述的装置实施例仅仅是示意性的，例如所述单元的划分，可以为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是根据一些接口，单元或模块的间接耦合或通信连接，可以是电性或其它的形式。In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a division of logical functions, and there may be other division manners in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备（可为个人计算机、服务器或者网络设备等）执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、只读存储器（ROM，Read-Only Memory）、随机存取存储器（RAM，Random Access Memory）、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
以上所述仅是本发明的优选实施方式，应当指出，对于本技术领域的普通技术人员来说，在不脱离本发明原理的前提下，还可以做出若干改进和润饰，这些改进和润饰也应视为本发明的保护范围。The above description is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can also make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (12)

  1. 一种语音处理方法,其特征在于,包括:A voice processing method, comprising:
    获取预设时间段内多个时刻的语音向量;Obtaining a speech vector at multiple times in a preset time period;
    利用预设语音模型对所述多个时刻的语音向量进行处理,得到与所述多个时刻的语音向量相对应的多个文本信息,其中,所述预设语音模型基于预先存储的多个时刻的参数向量对所述多个时刻的语音向量进行处理;Processing the speech vector of the plurality of moments by using a preset speech model to obtain a plurality of text information corresponding to the speech vectors of the plurality of moments, wherein the preset speech model is based on a plurality of pre-stored moments The parameter vector processes the speech vector of the plurality of moments;
    输出所述多个文本信息。Outputting the plurality of text information.
  2. 根据权利要求1所述的方法，其特征在于，所述预设语音模型包括：语音处理模型和参数矩阵，所述参数矩阵用于预先存储所述多个时刻的参数向量，所述语音处理模型用于基于所述多个时刻的参数向量对所述多个时刻的语音向量进行处理，得到与所述多个时刻的语音向量相对应的多个文本信息。The method according to claim 1, wherein the preset speech model comprises a speech processing model and a parameter matrix, the parameter matrix is configured to pre-store the parameter vectors of the plurality of time instants, and the speech processing model is configured to process the speech vectors of the plurality of time instants based on the parameter vectors of the plurality of time instants to obtain the plurality of pieces of text information corresponding to the speech vectors of the plurality of time instants.
  3. 根据权利要求2所述的方法，其特征在于，利用预设语音模型对所述多个时刻的语音向量进行处理，得到与所述多个时刻的语音向量相对应的多个文本信息，包括：The method according to claim 2, wherein processing the speech vectors of the plurality of time instants by using the preset speech model to obtain the plurality of pieces of text information corresponding to the speech vectors of the plurality of time instants comprises:
    根据读操作从所述参数矩阵中获取所述多个时刻的第一参数向量;Acquiring the first parameter vector of the plurality of moments from the parameter matrix according to a read operation;
    利用所述多个时刻的第一参数向量对所述语音处理模型进行修正,得到修正后的语音处理模型;Correcting the speech processing model by using the first parameter vector of the plurality of moments to obtain a modified speech processing model;
    利用所述修正后的语音处理模型对所述多个时刻的语音向量进行处理,得到所述多个文本信息。The speech vector of the plurality of times is processed by the modified speech processing model to obtain the plurality of text information.
  4. 根据权利要求3所述的方法，其特征在于，在利用所述修正后的语音处理模型对所述多个时刻的语音向量进行处理，得到所述多个文本信息的同时，所述方法还包括：The method according to claim 3, wherein, while the speech vectors of the plurality of time instants are processed by using the modified speech processing model to obtain the plurality of pieces of text information, the method further comprises:
    利用所述修正后的语音处理模型,得到所述多个时刻的第二参数向量;Using the modified speech processing model, obtaining a second parameter vector of the plurality of moments;
    根据写操作将所述多个时刻的第二参数向量写入所述参数矩阵。The second parameter vector of the plurality of times is written to the parameter matrix according to a write operation.
  5. 根据权利要求4所述的方法,其特征在于,利用所述修正后的语音处理模型,得到所述多个时刻的第二参数向量,包括:The method according to claim 4, wherein the second parameter vector of the plurality of moments is obtained by using the modified speech processing model, including:
    利用所述修正后的语音处理模型对所述多个时刻的第一参数向量进行更新,得到所述多个时刻的第二参数向量。And updating the first parameter vector of the plurality of times by using the modified speech processing model to obtain a second parameter vector of the plurality of times.
  6. 根据权利要求2所述的方法，其特征在于，在利用预设语音模型对所述多个时刻的语音向量进行处理，得到与所述多个时刻的语音向量相对应的多个文本信息之前，所述方法还包括：The method according to claim 2, wherein, before the speech vectors of the plurality of time instants are processed by using the preset speech model to obtain the plurality of pieces of text information corresponding to the speech vectors of the plurality of time instants, the method further comprises:
    建立初始预设模型,所述初始预设模型包括:所述语音处理模型和初始参数矩阵;Establishing an initial preset model, the initial preset model comprising: the voice processing model and an initial parameter matrix;
    获取训练数据,其中,所述训练数据包括:多个训练语音向量,以及每个训练语音向量相对应的文本信息;Obtaining training data, wherein the training data includes: a plurality of training speech vectors, and text information corresponding to each training speech vector;
    根据所述训练数据对所述初始预设模型进行训练,得到所述预设语音模型。And training the initial preset model according to the training data to obtain the preset voice model.
  7. 根据权利要求6所述的方法，其特征在于，根据所述训练数据对所述初始预设模型进行训练，得到所述预设语音模型包括：The method according to claim 6, wherein training the initial preset model according to the training data to obtain the preset speech model comprises:
    将所述训练数据输入所述语音处理模型,得到预设参数向量;Inputting the training data into the voice processing model to obtain a preset parameter vector;
    通过写操作将所述预设参数向量写入所述初始参数矩阵,得到所述参数矩阵。The parameter matrix is obtained by writing the preset parameter vector into the initial parameter matrix by a write operation.
  8. 根据权利要求2至7中任意一项所述的方法,其特征在于,所述语音处理模型为LSTM模型,所述参数矩阵为记忆矩阵。The method according to any one of claims 2 to 7, wherein the speech processing model is an LSTM model and the parameter matrix is a memory matrix.
  9. 根据权利要求1所述的方法,其特征在于,根据所述预设语音模型的处理能力,确定所述预设时间段。The method according to claim 1, wherein the preset time period is determined according to a processing capability of the preset voice model.
  10. 一种语音处理装置,其特征在于,包括:A voice processing device, comprising:
    第一获取模块,用于获取预设时间段内多个时刻的语音向量;a first acquiring module, configured to acquire a voice vector at multiple moments in a preset time period;
    处理模块，用于利用预设语音模型对所述多个时刻的语音向量进行处理，得到与所述多个时刻的语音向量相对应的多个文本信息，其中，所述预设语音模型基于预先存储的多个时刻的参数向量对所述多个时刻的语音向量进行处理；a processing module, configured to process the speech vectors of the plurality of time instants by using a preset speech model to obtain a plurality of pieces of text information corresponding to the speech vectors of the plurality of time instants, wherein the preset speech model processes the speech vectors of the plurality of time instants based on pre-stored parameter vectors of the plurality of time instants;
    输出模块,用于输出所述多个文本信息。And an output module, configured to output the plurality of text information.
  11. 一种存储介质,其特征在于,所述存储介质包括存储的程序,其中,在所述程序运行时控制所述存储介质所在设备执行权利要求1至9中任意一项所述的语音处理方法。A storage medium, characterized in that the storage medium comprises a stored program, wherein the device in which the storage medium is located is controlled to perform the voice processing method according to any one of claims 1 to 9 while the program is running.
  12. 一种处理器,其特征在于,所述处理器用于运行程序,其中,所述程序运行时执行权利要求1至9中任意一项所述的语音处理方法。A processor, wherein the processor is configured to execute a program, wherein the program is executed to perform the voice processing method according to any one of claims 1 to 9.
PCT/CN2018/079848 2017-07-28 2018-03-21 Speech processing method and apparatus, storage medium and processor WO2019019667A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710633042.2 2017-07-28
CN201710633042.2A CN109308896B (en) 2017-07-28 2017-07-28 Voice processing method and device, storage medium and processor

Publications (1)

Publication Number Publication Date
WO2019019667A1 true WO2019019667A1 (en) 2019-01-31

Family

ID=65040955

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/079848 WO2019019667A1 (en) 2017-07-28 2018-03-21 Speech processing method and apparatus, storage medium and processor

Country Status (2)

Country Link
CN (1) CN109308896B (en)
WO (1) WO2019019667A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836270A (en) * 2021-09-28 2021-12-24 深圳格隆汇信息科技有限公司 Big data processing method and related product


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6021403A (en) * 1996-07-19 2000-02-01 Microsoft Corporation Intelligent user assistance facility
AU2002253416A1 (en) * 2002-03-27 2003-10-08 Nokia Corporation Pattern recognition
ATE466361T1 (en) * 2006-08-11 2010-05-15 Harman Becker Automotive Sys LANGUAGE RECOGNITION USING A STATISTICAL LANGUAGE MODEL USING SQUARE ROOT SMOOTHING
EP2734997A4 (en) * 2011-07-20 2015-05-20 Tata Consultancy Services Ltd A method and system for detecting boundary of coarticulated units from isolated speech
CN105070300A (en) * 2015-08-12 2015-11-18 东南大学 Voice emotion characteristic selection method based on speaker standardization change

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010204391A (en) * 2009-03-03 2010-09-16 Nippon Telegr & Teleph Corp <Ntt> Voice signal modeling method, signal recognition device and method, parameter learning device and method, and feature value generating device, method, and program
US9378729B1 (en) * 2013-03-12 2016-06-28 Amazon Technologies, Inc. Maximum likelihood channel normalization
CN105989839A (en) * 2015-06-03 2016-10-05 乐视致新电子科技(天津)有限公司 Speech recognition method and speech recognition device
CN106257583A (en) * 2015-06-17 2016-12-28 大众汽车有限公司 Speech recognition system and the method being used for running speech recognition system
CN106157950A (en) * 2016-09-29 2016-11-23 合肥华凌股份有限公司 Speech control system and awakening method, Rouser and household electrical appliances, coprocessor

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489630A (en) * 2019-09-12 2021-03-12 武汉Tcl集团工业研究院有限公司 Voice recognition method and device
CN113095559A (en) * 2021-04-02 2021-07-09 京东数科海益信息科技有限公司 Hatching time prediction method, device, equipment and storage medium
CN113095559B (en) * 2021-04-02 2024-04-09 京东科技信息技术有限公司 Method, device, equipment and storage medium for predicting hatching time

Also Published As

Publication number Publication date
CN109308896A (en) 2019-02-05
CN109308896B (en) 2022-04-15


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18839088

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.05.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18839088

Country of ref document: EP

Kind code of ref document: A1