WO2019019667A1 - Speech processing method and apparatus, storage medium and processor - Google Patents

Speech processing method and apparatus, storage medium and processor

Info

Publication number
WO2019019667A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
model
vector
preset
processing
Prior art date
Application number
PCT/CN2018/079848
Other languages
French (fr)
Chinese (zh)
Inventor
刘若鹏
陈�峰
Original Assignee
深圳光启合众科技有限公司
深圳光启创新技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳光启合众科技有限公司, 深圳光启创新技术有限公司
Publication of WO2019019667A1 publication Critical patent/WO2019019667A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 2015/0631 - Creating reference templates; Clustering
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/26 - Speech to text systems

Definitions

  • the present invention relates to the field of data processing, and in particular to a speech processing method and apparatus, a storage medium, and a processor.
  • natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods for enabling effective communication between users and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field therefore involves natural language, the language people use every day, so it is closely related to the study of linguistics, although with important differences.
  • commonly used natural language processing methods include the conditional random field (CRF), the hidden Markov model (HMM), the recurrent neural network (RNN) model, and the long short-term memory (LSTM) model. However, improving processing accuracy requires increasing the model depth, which results in high processing complexity and low processing efficiency.
  • embodiments of the invention provide a speech processing method and apparatus, a storage medium, and a processor, so as to at least solve the technical problem of the low processing efficiency of speech processing methods in the prior art.
  • a speech processing method includes: acquiring speech vectors at a plurality of moments within a preset time period; processing the speech vectors at the plurality of moments with a preset speech model to obtain a plurality of pieces of text information corresponding to those speech vectors, wherein the preset speech model processes the speech vectors at the plurality of moments based on pre-stored parameter vectors for the plurality of moments; and outputting the plurality of pieces of text information.
  • the preset speech model includes a speech processing model and a parameter matrix, wherein the parameter matrix pre-stores the parameter vectors for the plurality of moments, and the speech processing model processes the speech vectors at the plurality of moments based on those parameter vectors to obtain the corresponding pieces of text information (a rough structural sketch is given below).
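As a rough illustration of that composition (a minimal sketch only; the class, method, and variable names below are hypothetical and not taken from the document, and the controller is left abstract), the preset speech model can be viewed as a controller paired with an external parameter matrix that supports read and write operations:

```python
import numpy as np

class PresetSpeechModel:
    """Hypothetical sketch: a controller (speech processing model) plus an
    external parameter matrix that pre-stores one parameter vector per moment."""

    def __init__(self, controller, num_moments, vector_size):
        self.controller = controller                        # e.g. an LSTM-based model
        self.memory = np.zeros((num_moments, vector_size))  # parameter matrix (memory matrix)

    def read(self, t):
        # read operation: fetch the parameter vector stored for moment t
        return self.memory[t]

    def write(self, t, vector):
        # write operation: store an updated parameter vector for moment t
        self.memory[t] = vector
```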
  • processing the speech vectors at the plurality of moments with the preset speech model to obtain the corresponding pieces of text information comprises: acquiring first parameter vectors for the plurality of moments from the parameter matrix by a read operation; modifying the speech processing model with the first parameter vectors to obtain a modified speech processing model; and processing the speech vectors at the plurality of moments with the modified speech processing model to obtain the plurality of pieces of text information.
  • while the speech vectors at the plurality of moments are being processed with the modified speech processing model to obtain the plurality of pieces of text information, the method further includes: obtaining second parameter vectors for the plurality of moments with the modified speech processing model; and writing the second parameter vectors for the plurality of moments into the parameter matrix by a write operation.
  • obtaining the second parameter vectors for the plurality of moments with the modified speech processing model includes: updating the first parameter vectors for the plurality of moments with the modified speech processing model to obtain the second parameter vectors for the plurality of moments.
  • before the speech vectors at the plurality of moments are processed with the preset speech model to obtain the corresponding pieces of text information, the method further includes: establishing an initial preset model, the initial preset model including a speech processing model and an initial parameter matrix; acquiring training data, the training data including a plurality of training speech vectors and the text information corresponding to each training speech vector; and training the initial preset model with the training data to obtain the preset speech model.
  • training the initial preset model with the training data to obtain the preset speech model includes: inputting the training data into the speech processing model to obtain preset parameter vectors; and writing the preset parameter vectors into the initial parameter matrix by a write operation to obtain the parameter matrix.
  • the speech processing model is an LSTM model
  • the parameter matrix is a memory matrix
  • the preset time period is determined according to the processing capability of the preset voice model.
  • a speech processing apparatus includes: a first acquiring module configured to acquire speech vectors at a plurality of moments within a preset time period; a processing module configured to process the speech vectors at the plurality of moments with a preset speech model to obtain a plurality of pieces of text information corresponding to those speech vectors, wherein the preset speech model processes the speech vectors at the plurality of moments based on pre-stored parameter vectors for the plurality of moments; and an output module configured to output the plurality of pieces of text information.
  • the preset speech model includes a speech processing model and a parameter matrix, wherein the parameter matrix pre-stores the parameter vectors for the plurality of moments, and the speech processing model processes the speech vectors at the plurality of moments based on those parameter vectors to obtain the corresponding pieces of text information.
  • the processing module includes: an obtaining submodule configured to acquire the first parameter vectors for the plurality of moments from the parameter matrix by a read operation; a correction submodule configured to modify the speech processing model with the first parameter vectors to obtain a modified speech processing model; and a first processing submodule configured to process the speech vectors at the plurality of moments with the modified speech processing model to obtain the plurality of pieces of text information.
  • the processing module further includes: a second processing submodule configured to obtain the second parameter vectors for the plurality of moments with the modified speech processing model; and a first storage submodule configured to write the second parameter vectors for the plurality of moments into the parameter matrix by a write operation.
  • the second processing sub-module is further configured to update the first parameter vector of the multiple moments by using the modified speech processing model to obtain a second parameter vector of the multiple moments.
  • the foregoing apparatus further includes: an establishing module, configured to establish an initial preset model, where the initial preset model includes: a voice processing model and an initial parameter matrix; and a second acquiring module, configured to acquire training data, where the training data includes: a plurality of training speech vectors, and text information corresponding to each training speech vector; and a training module, configured to train the initial preset model according to the training data to obtain a preset speech model.
  • the training module includes: a third processing submodule configured to input the training data into the speech processing model to obtain preset parameter vectors; and a second storage submodule configured to write the preset parameter vectors into the initial parameter matrix by a write operation to obtain the preset speech model.
  • the speech processing model is an LSTM model
  • the parameter matrix is a memory matrix
  • the foregoing apparatus further includes: a determining module, configured to determine a preset time period according to a processing capability of the preset voice model.
  • a storage medium includes a stored program, wherein when the program runs, the device on which the storage medium resides is controlled to execute the speech processing method of the above embodiment.
  • a processor is configured to run a program, wherein the program, when run, executes the speech processing method of the above embodiment.
  • speech vectors at a plurality of moments within a preset time period are acquired, the speech vectors at the plurality of moments are processed with a preset speech model to obtain a plurality of pieces of text information corresponding to those speech vectors, and the plurality of pieces of text information is output, thereby achieving natural language processing.
  • the solution provided by the above embodiments of the present invention can achieve the effects of improving processing efficiency, improving processing accuracy, reducing processing complexity, and reducing processing time.
  • FIG. 1 is a flow chart of a voice processing method according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of an optional preset speech model according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a repeating module of an optional speech processing model in accordance with an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a voice processing device in accordance with an embodiment of the present invention.
  • an embodiment of a speech processing method is provided. It should be noted that the steps illustrated in the flowchart of the figures may be performed in a computer system such as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one described herein.
  • FIG. 1 is a flowchart of a voice processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
  • Step S102 Acquire a speech vector of a plurality of times in a preset time period.
  • the preset time period is determined according to the processing capability of the preset voice model.
  • the preset time period may be set according to the processing capability of the actual speech processing model, and the plurality of moments may be a plurality of equally spaced sampling moments; for example, if the preset time period is 100 s and the sampling interval is 10 s, speech vectors at 10 moments can be acquired within the 100 s (see the small example below).
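As a trivial illustration of that example (the period and interval are the numbers quoted above; everything else is an assumption):

```python
preset_period_s = 100    # preset time period from the example above
sample_interval_s = 10   # sampling interval from the example above

# equally spaced sampling moments within the preset time period
sampling_moments = list(range(0, preset_period_s, sample_interval_s))
print(len(sampling_moments))  # 10 moments, so 10 speech vectors are acquired
```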
  • Step S104: processing the speech vectors at the plurality of moments with the preset speech model to obtain a plurality of pieces of text information corresponding to the speech vectors at the plurality of moments, wherein the preset speech model processes the speech vectors at the plurality of moments based on the pre-stored parameter vectors for the plurality of moments.
  • the preset speech model includes: a speech processing model and a parameter matrix, wherein the parameter matrix pre-stores the parameter vectors for the plurality of moments, and the speech processing model processes the speech vectors at the plurality of moments based on those parameter vectors to obtain the plurality of pieces of text information corresponding to the speech vectors at the plurality of moments.
  • the voice processing model is an LSTM model
  • the parameter matrix is a memory matrix
  • the foregoing preset speech model may be a neural Turing machine. As shown in FIG. 2, the neural Turing machine includes two components: a controller (i.e. the speech processing model described above) and a memory matrix (i.e. the parameter matrix described above). The memory matrix is an external storage matrix that stores the parameter vectors the speech processing model needs for speech processing, and the controller can read parameter vectors from, and write parameter vectors to, the memory matrix. The speech processing model may be an LSTM model, a special type of RNN that can learn long-term dependency information; the LSTM is deliberately designed to avoid the long-term dependency problem. Like other RNNs, the LSTM has a chain of repeating neural network modules, but unlike a single neural network layer, each repeating module has a different structure: as shown in FIG. 3, it can be composed of an input gate, a forget gate, and an output gate that interact in a particular way, thereby addressing the vanishing-gradient and exploding-gradient problems of RNNs. The standard gate equations are given below for reference.
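For reference, the usual formulation of such a gated repeating module is the standard LSTM cell below; the document itself does not spell these equations out, and the symbols follow common convention rather than the patent's notation.

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) &&\text{(forget gate)}\\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) &&\text{(input gate)}\\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right) &&\text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &&\text{(cell state)}\\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) &&\text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t) &&\text{(hidden state / output)}
\end{aligned}
```

Likewise, the read operation of a neural Turing machine is conventionally a weighted sum over memory rows, $r_t = \sum_i w_t(i)\,M_t(i)$ with $\sum_i w_t(i) = 1$ (Graves et al., 2014); the document only states that the controller reads and writes parameter vectors in the memory matrix, so this is background rather than the claimed mechanism.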
  • step S106 a plurality of text information is output.
  • in an optional solution, natural speech data at a plurality of sampling moments within the preset time period may be acquired according to the temporal characteristics of natural speech, yielding the speech vectors at the plurality of moments within the preset time period; a pre-trained neural Turing machine is then obtained and used to recognize the speech vectors at the plurality of moments, obtain the corresponding text information, and output the recognized text information.
  • speech vectors at a plurality of moments within a preset time period are acquired, the speech vectors at the plurality of moments are processed with a preset speech model to obtain a plurality of pieces of text information corresponding to those speech vectors, and the plurality of pieces of text information is output, thereby achieving natural language processing.
  • the solution provided by the above embodiments of the present invention can achieve the effects of improving processing efficiency, improving processing accuracy, reducing processing complexity, and reducing processing time.
  • step S104 the voice vector of the multiple moments is processed by using the preset voice model to obtain a plurality of text information corresponding to the voice vectors of the multiple moments, including:
  • Step S1040 Acquire a first parameter vector of a plurality of moments from the parameter matrix according to the read operation.
  • as shown in FIG. 2, the neural Turing machine may include a read head and a write head: a read operation through the read head reads the W parameters of the LSTM model from the memory matrix, and a write operation through the write head writes new W parameters into the memory matrix.
  • step S1042 the speech processing model is corrected by using the first parameter vector at a plurality of times to obtain a modified speech processing model.
  • step S1044 the speech vector of the plurality of times is processed by the corrected speech processing model to obtain a plurality of text information.
  • in an optional solution, after the speech vectors at the plurality of moments are acquired, for the natural speech processing at each moment the W parameter vector can be read from the memory matrix through the read head and input into the LSTM model; the LSTM model is modified accordingly, and the modified LSTM model is obtained. The speech vector can then be input to the modified LSTM model as the input vector, yielding the output vector of the LSTM model, i.e. the text information of that speech vector. After the speech vectors at all of the plurality of moments have been processed, the plurality of pieces of text information corresponding to them is obtained.
  • while, in step S1044, the speech vectors at the plurality of moments are processed with the modified speech processing model to obtain the plurality of pieces of text information, the method further includes:
  • step S1046 the second parameter vector of the plurality of times is obtained by using the modified speech processing model.
  • step S1046 using the modified speech processing model, obtaining a second parameter vector at multiple moments, including:
  • Step S10462 The first parameter vector of the plurality of times is updated by using the modified speech processing model to obtain a second parameter vector of the plurality of times.
  • Step S1048 writing a second parameter vector of the plurality of times into the parameter matrix according to the writing operation.
  • in an optional solution, for the natural speech processing at each moment, processing the speech vector with the LSTM model yields not only the text information of the speech vector but also a new W parameter vector; the new W parameter vector is written into the memory matrix through the write head and serves as the W parameter vector for the next moment, as in the sketch below.
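A minimal sketch of this per-moment read, modify, process, write cycle, under the assumption that "modifying" the LSTM model means loading the W parameter vector read from the memory matrix into it (all function and variable names are hypothetical):

```python
def recognize(speech_vectors, lstm, memory, read_head, write_head):
    """Hypothetical cycle: read W from the memory matrix, modify the LSTM with it,
    process the speech vector, then write the new W back for the next moment."""
    texts = []
    for t, x_t in enumerate(speech_vectors):
        w_t = read_head.read(memory, t)       # read operation: first parameter vector
        lstm.load_parameters(w_t)             # modify the speech processing model
        text_t, w_new = lstm.process(x_t)     # text info and updated (second) parameter vector
        write_head.write(memory, t, w_new)    # write operation: W parameter vector for the next moment
        texts.append(text_t)
    return texts
```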
  • before, in step S104, the speech vectors at the plurality of moments are processed with the preset speech model to obtain the plurality of pieces of text information corresponding to them, the method further includes:
  • Step S108 establishing an initial preset model, where the initial preset model comprises: a voice processing model and an initial parameter matrix.
  • Step S110 acquiring training data, wherein the training data comprises: a plurality of training speech vectors, and text information corresponding to each training speech vector.
  • Step S112 training the initial preset model according to the training data to obtain a preset voice model.
  • in an optional solution, the LSTM model in the neural Turing machine can be pre-established according to actual processing needs and the W parameter vectors in the memory matrix set to initial values; the pre-established neural Turing machine is then trained on the training data to obtain a neural Turing machine with high accuracy.
  • in step S112, training the initial preset model with the training data to obtain the preset speech model includes:
  • step S1122 the training data is input into the speech processing model to obtain a preset parameter vector.
  • Step S1124 The preset parameter vector is written into the initial parameter matrix by a write operation to obtain a parameter matrix.
  • in an optional solution, the plurality of training speech vectors in the training data may be used as the input vectors and the text information corresponding to each training speech vector as the output vector; training yields the preset W parameter vectors of the LSTM model, which are written into the memory matrix through the write head, thereby obtaining a neural Turing machine with high accuracy. A rough sketch of this training step follows.
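A sketch of this training step under the same assumptions as the earlier snippets; the loss, optimizer, and number of passes are not specified in the document and are chosen here purely for illustration:

```python
def train(initial_model, training_pairs, epochs=10):
    """Hypothetical training sketch: training_pairs holds
    (training_speech_vector, text_info) pairs; the preset W parameter vectors
    produced by the controller are written into the initial parameter matrix."""
    for _ in range(epochs):
        for t, (speech_vec, text_info) in enumerate(training_pairs):
            predicted, w_preset = initial_model.controller.process(speech_vec)
            initial_model.controller.update(predicted, text_info)  # e.g. a gradient step on some loss
            initial_model.write(t, w_preset)   # write operation into the parameter matrix
    return initial_model  # the trained model serves as the preset speech model
```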
  • an embodiment of a speech processing apparatus is provided.
  • FIG. 4 is a schematic diagram of a voice processing device according to an embodiment of the present invention. As shown in FIG. 4, the device includes:
  • the first obtaining module 41 is configured to acquire a voice vector at multiple moments in a preset time period.
  • the apparatus further includes: a determining module, configured to determine a preset time period according to a processing capability of the preset voice model.
  • the foregoing preset time period may be set according to the processing capability of the model, and the plurality of moments may be a plurality of equally spaced sampling moments; for example, if the preset time period is 100 s and the sampling interval is 10 s, speech vectors at 10 moments can be acquired within the 100 s.
  • the processing module 43 is configured to process the speech vectors at the plurality of moments with the preset speech model to obtain a plurality of pieces of text information corresponding to the speech vectors at the plurality of moments, wherein the preset speech model processes the speech vectors at the plurality of moments based on the pre-stored parameter vectors for the plurality of moments.
  • the preset speech model includes: a speech processing model and a parameter matrix, wherein the parameter matrix pre-stores the parameter vectors for the plurality of moments, and the speech processing model processes the speech vectors at the plurality of moments based on those parameter vectors to obtain the plurality of pieces of text information corresponding to the speech vectors at the plurality of moments.
  • the voice processing model is an LSTM model
  • the parameter matrix is a memory matrix
  • the foregoing preset speech model may be a neural Turing machine. As shown in FIG. 2, the neural Turing machine includes two components: a controller (i.e. the speech processing model described above) and a memory matrix (i.e. the parameter matrix described above). The memory matrix is an external storage matrix that stores the parameter vectors the speech processing model needs for speech processing, and the controller can read parameter vectors from, and write parameter vectors to, the memory matrix. The speech processing model may be an LSTM model, a special type of RNN that can learn long-term dependency information; the LSTM is deliberately designed to avoid the long-term dependency problem. Like other RNNs, the LSTM has a chain of repeating neural network modules, but unlike a single neural network layer, each repeating module has a different structure: as shown in FIG. 3, it can be composed of an input gate, a forget gate, and an output gate that interact in a particular way, thereby addressing the vanishing-gradient and exploding-gradient problems of RNNs.
  • the output module 45 is configured to output a plurality of text information.
  • in an optional solution, natural speech data at a plurality of sampling moments within the preset time period may be acquired according to the temporal characteristics of natural speech, yielding the speech vectors at the plurality of moments within the preset time period; a pre-trained neural Turing machine is then obtained and used to recognize the speech vectors at the plurality of moments, obtain the corresponding text information, and output the recognized text information.
  • speech vectors at a plurality of moments within a preset time period are acquired, the speech vectors at the plurality of moments are processed with a preset speech model to obtain a plurality of pieces of text information corresponding to those speech vectors, and the plurality of pieces of text information is output, thereby achieving natural language processing.
  • the solution provided by the above embodiments of the present invention can achieve the effects of improving processing efficiency, improving processing accuracy, reducing processing complexity, and reducing processing time.
  • the processing module 43 includes:
  • the obtaining submodule is configured to obtain the first parameter vector of the plurality of moments from the parameter matrix according to the read operation.
  • as shown in FIG. 2, the neural Turing machine may include a read head and a write head: a read operation through the read head reads the W parameters of the LSTM model from the memory matrix, and a write operation through the write head writes new W parameters into the memory matrix.
  • the correction submodule is configured to correct the speech processing model by using the first parameter vector at multiple moments to obtain a modified speech processing model.
  • the first processing sub-module is configured to process the speech vector of the plurality of moments by using the modified speech processing model to obtain a plurality of text information.
  • in an optional solution, after the speech vectors at the plurality of moments are acquired, for the natural speech processing at each moment the W parameter vector can be read from the memory matrix through the read head and input into the LSTM model; the LSTM model is modified accordingly, and the modified LSTM model is obtained. The speech vector can then be input to the modified LSTM model as the input vector, yielding the output vector of the LSTM model, i.e. the text information of that speech vector. After the speech vectors at all of the plurality of moments have been processed, the plurality of pieces of text information corresponding to them is obtained.
  • the processing module 43 further includes:
  • the second processing sub-module is configured to obtain the second parameter vector of the multiple moments by using the modified speech processing model.
  • the second processing sub-module is further configured to update the first parameter vector of the multiple moments by using the modified speech processing model to obtain the second parameter vector of the multiple moments.
  • the first storage submodule is configured to write the second parameter vector of the multiple moments into the parameter matrix according to the write operation.
  • in an optional solution, for the natural speech processing at each moment, processing the speech vector with the LSTM model yields not only the text information of the speech vector but also a new W parameter vector; the new W parameter vector is written into the memory matrix through the write head and serves as the W parameter vector for the next moment.
  • the device further includes:
  • an establishing module is configured to establish an initial preset model, the initial preset model including: a speech processing model and an initial parameter matrix.
  • a second acquiring module configured to acquire training data, where the training data includes: a plurality of training speech vectors, and text information corresponding to each training speech vector.
  • the training module is configured to train the initial preset model according to the training data to obtain a preset voice model.
  • in an optional solution, the LSTM model in the neural Turing machine can be pre-established according to actual processing needs and the W parameter vectors in the memory matrix set to initial values; the pre-established neural Turing machine is then trained on the training data to obtain a neural Turing machine with high accuracy.
  • the training module includes:
  • the third processing sub-module is configured to input the training data into the speech processing model to obtain a preset parameter vector.
  • the second storage submodule is configured to write a preset parameter vector into the initial parameter matrix by a write operation to obtain a parameter matrix.
  • in an optional solution, the plurality of training speech vectors in the training data may be used as the input vectors and the text information corresponding to each training speech vector as the output vector; training yields the preset W parameter vectors of the LSTM model, which are written into the memory matrix through the write head, thereby obtaining a neural Turing machine with high accuracy.
  • an embodiment of a storage medium is provided; the storage medium includes a stored program, wherein when the program runs, the device on which the storage medium resides is controlled to execute the speech processing method of Embodiment 1 above.
  • an embodiment of a processor is provided; the processor is configured to run a program, wherein the program, when run, executes the speech processing method of Embodiment 1 above.
  • the disclosed technical contents may be implemented in other manners.
  • the device embodiments described above are only schematic.
  • for example, the division into units may be a division by logical function; in an actual implementation there may be another division manner, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be in an electrical or other form.
  • the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
  • the part of the technical solution of the present invention that is essential or that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present invention.
  • the foregoing storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed are a speech processing method and apparatus, a storage medium and a processor. The method comprises: acquiring speech vectors at a plurality of moments in a pre-set time period (S102); processing, by means of a pre-set speech model, the speech vectors at the plurality of moments to obtain a plurality of pieces of text information corresponding to the speech vectors at the plurality of moments (S104), wherein the pre-set speech model processes, based on pre-stored parameter vectors at the plurality of moments, the speech vectors at the plurality of moments; and outputting the plurality of pieces of text information (S106). By means of the present invention, the technical problem in the prior art that a speech processing method has a low processing efficiency is solved.

Description

语音处理方法及装置、存储介质及处理器Voice processing method and device, storage medium and processor 技术领域Technical field
本发明涉及数据处理领域,具体而言,涉及一种语音处理方法及装置、存储介质及处理器。The present invention relates to the field of data processing, and in particular to a voice processing method and apparatus, a storage medium, and a processor.
背景技术Background technique
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现用户与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,所以它与语言学的研究有着密切的联系,但又有重要的区别。Natural language processing is an important direction in the field of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between users and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, the language that people use every day, so it is closely related to the study of linguistics, but there are important differences.
目前常用的自然语言处理方法有:条件随机场CRF,隐马尔科夫模型HMM,递归神经网络模型RNN和长短期记忆模型LSTM等,但是,为了提高处理精度,需要增加模型深度,导致处理复杂度高,处理效率低。The commonly used natural language processing methods are: conditional random field CRF, hidden Markov model HMM, recurrent neural network model RNN and long-term and short-term memory model LSTM, etc. However, in order to improve processing accuracy, it is necessary to increase the model depth, resulting in processing complexity. High, low processing efficiency.
针对现有技术中的语音处理方法的处理效率低的问题,目前尚未提出有效的解决方案。In view of the low processing efficiency of the voice processing method in the prior art, an effective solution has not been proposed yet.
发明内容Summary of the invention
本发明实施例提供了一种语音处理方法及装置、存储介质及处理器,以至少解决现有技术中的语音处理方法的处理效率低的技术问题。The embodiment of the invention provides a voice processing method and device, a storage medium and a processor, so as to at least solve the technical problem of low processing efficiency of the voice processing method in the prior art.
根据本发明实施例的一个方面,提供了一种语音处理方法,包括:获取预设时间段内多个时刻的语音向量;利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,其中,预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理;输出多个文本信息。According to an aspect of the embodiments of the present invention, a voice processing method includes: acquiring a voice vector at a plurality of times in a preset time period; and processing a voice vector at multiple times by using a preset voice model to obtain and The plurality of text information corresponding to the speech vector of the moment, wherein the preset speech model processes the speech vectors of the plurality of moments based on the parameter vectors of the plurality of pre-stored moments; and outputs the plurality of text information.
进一步地,预设语音模型包括:语音处理模型和参数矩阵,参数矩阵用于预先存储多个时刻的参数向量,语音处理模型用于基于多个时刻的参数向量对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息。Further, the preset speech model includes: a speech processing model and a parameter matrix, wherein the parameter matrix is used to pre-store parameter vectors of a plurality of moments, and the speech processing model is configured to process the speech vectors of the plurality of moments based on the parameter vectors of the plurality of moments. , obtaining a plurality of text information corresponding to the speech vectors of the plurality of times.
进一步地,利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,包括:根据读操作从参数矩阵中获取多个时刻的第一参数向量;利用多个时刻的第一参数向量对语音处理模型进行修正,得到修正后 的语音处理模型;利用修正后的语音处理模型对多个时刻的语音向量进行处理,得到多个文本信息。Further, processing the speech vector of the plurality of moments by using the preset speech model to obtain a plurality of text information corresponding to the speech vectors of the plurality of moments, comprising: acquiring the first plurality of moments from the parameter matrix according to the reading operation The parameter vector is obtained by modifying the speech processing model by using the first parameter vector at a plurality of times to obtain a modified speech processing model; and processing the speech vector at a plurality of times by using the modified speech processing model to obtain a plurality of text information.
进一步地,在利用修正后的语音处理模型对多个时刻的语音向量进行处理,得到多个文本信息的同时,上述方法还包括:利用修正后的语音处理模型,得到多个时刻的第二参数向量;根据写操作将多个时刻的第二参数向量写入参数矩阵。Further, while processing the speech vector of the plurality of times by using the modified speech processing model to obtain a plurality of text information, the method further includes: obtaining the second parameter of the plurality of times by using the modified speech processing model Vector; writes a second parameter vector of multiple moments to the parameter matrix according to a write operation.
进一步地,利用修正后的语音处理模型,得到多个时刻的第二参数向量,包括:利用修正后的语音处理模型对多个时刻的第一参数向量进行更新,得到多个时刻的第二参数向量。Further, the second parameter vector of the plurality of times is obtained by using the modified speech processing model, including: updating the first parameter vector of the plurality of times by using the modified speech processing model to obtain the second parameter of the plurality of times vector.
进一步地,在利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息之前,上述方法还包括:建立初始预设模型,初始预设模型包括:语音处理模型和初始参数矩阵;获取训练数据,其中,训练数据包括:多个训练语音向量,以及每个训练语音向量相对应的文本信息;根据训练数据对初始预设模型进行训练,得到预设语音模型。Further, before processing the speech vector of the plurality of moments by using the preset speech model to obtain a plurality of text information corresponding to the speech vectors of the plurality of moments, the method further includes: establishing an initial preset model, and initializing the preset The model includes: a speech processing model and an initial parameter matrix; acquiring training data, wherein the training data includes: a plurality of training speech vectors, and text information corresponding to each training speech vector; and training the initial preset model according to the training data, Get the default speech model.
进一步地,根据训练数据对初始预设模型进行训练,得到预设语音模型包括:将训练数据输入语音处理模型,得到预设参数向量;通过写操作将预设参数向量写入初始参数矩阵,得到参数矩阵。Further, the initial preset model is trained according to the training data, and the preset voice model is obtained by: inputting the training data into the voice processing model to obtain a preset parameter vector; and writing the preset parameter vector into the initial parameter matrix by using a write operation, Parameter matrix.
进一步地,语音处理模型为LSTM模型,参数矩阵为记忆矩阵。Further, the speech processing model is an LSTM model, and the parameter matrix is a memory matrix.
进一步地,根据预设语音模型的处理能力,确定预设时间段。Further, the preset time period is determined according to the processing capability of the preset voice model.
根据本发明实施例的另一方面,还提供了一种语音处理装置,包括:第一获取模块,用于获取预设时间段内多个时刻的语音向量;处理模块,用于利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,其中,预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理;输出模块,用于输出多个文本信息。According to another aspect of the present invention, a voice processing apparatus is further provided, including: a first acquiring module, configured to acquire a voice vector of a plurality of times in a preset time period; and a processing module, configured to use the preset voice The model processes the speech vectors at multiple moments to obtain a plurality of text information corresponding to the speech vectors of the plurality of moments, wherein the preset speech model is based on the pre-stored parameter vectors of the plurality of moments to the speech vectors of the plurality of moments Processing; output module for outputting multiple text information.
进一步地,预设语音模型包括:语音处理模型和参数矩阵,参数矩阵用于预先存储多个时刻的参数向量,语音处理模型用于基于多个时刻的参数向量对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息。Further, the preset speech model includes: a speech processing model and a parameter matrix, wherein the parameter matrix is used to pre-store parameter vectors of a plurality of moments, and the speech processing model is configured to process the speech vectors of the plurality of moments based on the parameter vectors of the plurality of moments. , obtaining a plurality of text information corresponding to the speech vectors of the plurality of times.
进一步地,处理模块包括:获取子模块,用于根据读操作从参数矩阵中获取多个时刻的第一参数向量;修正子模块,用于利用多个时刻的第一参数向量对语音处理模型进行修正,得到修正后的语音处理模型;第一处理子模块,用于利用修正后的语音处理模型对多个时刻的语音向量进行处理,得到多个文本信息。Further, the processing module includes: an obtaining submodule, configured to acquire a first parameter vector of the plurality of moments from the parameter matrix according to the reading operation; and a correction submodule, configured to perform the speech processing model by using the first parameter vector of the multiple moments Correction, the corrected speech processing model is obtained; the first processing sub-module is configured to process the speech vectors of the plurality of moments by using the modified speech processing model to obtain a plurality of text information.
进一步地,处理模块还包括:第二处理子模块,用于利用修正后的语音处理模型, 得到多个时刻的第二参数向量;第一存储子模块,用于根据写操作将多个时刻的第二参数向量写入参数矩阵。Further, the processing module further includes: a second processing submodule, configured to obtain a second parameter vector of the plurality of moments by using the modified speech processing model; and the first storage submodule configured to use the plurality of moments according to the writing operation The second parameter vector is written to the parameter matrix.
进一步地,所述第二处理子模块还用于利用修正后的语音处理模型对多个时刻的第一参数向量进行更新,得到多个时刻的第二参数向量。Further, the second processing sub-module is further configured to update the first parameter vector of the multiple moments by using the modified speech processing model to obtain a second parameter vector of the multiple moments.
进一步地,上述装置还包括:建立模块,用于建立初始预设模型,初始预设模型包括:语音处理模型和初始参数矩阵;第二获取模块,用于获取训练数据,其中,训练数据包括:多个训练语音向量,以及每个训练语音向量相对应的文本信息;训练模块,用于根据训练数据对初始预设模型进行训练,得到预设语音模型。Further, the foregoing apparatus further includes: an establishing module, configured to establish an initial preset model, where the initial preset model includes: a voice processing model and an initial parameter matrix; and a second acquiring module, configured to acquire training data, where the training data includes: a plurality of training speech vectors, and text information corresponding to each training speech vector; and a training module, configured to train the initial preset model according to the training data to obtain a preset speech model.
进一步地,训练模块包括:第三处理子模块,用于将训练数据输入语音处理模型,得到预设参数向量;第二存储子模块,用于通过写操作将预设参数向量写入初始参数矩阵,得到预设语音模型。Further, the training module includes: a third processing sub-module, configured to input training data into the speech processing model to obtain a preset parameter vector; and a second storage sub-module configured to write the preset parameter vector into the initial parameter matrix by using a write operation. , get the default speech model.
进一步地,语音处理模型为LSTM模型,参数矩阵为记忆矩阵。Further, the speech processing model is an LSTM model, and the parameter matrix is a memory matrix.
进一步地,上述装置还包括:确定模块,用于根据预设语音模型的处理能力,确定预设时间段。Further, the foregoing apparatus further includes: a determining module, configured to determine a preset time period according to a processing capability of the preset voice model.
根据本发明实施例的另一方面,还提供了一种存储介质,存储介质包括存储的程序,其中,在程序运行时控制存储介质所在设备执行上述实施例中的语音处理方法。According to another aspect of the embodiments of the present invention, there is also provided a storage medium comprising a stored program, wherein the device in which the storage medium is located is controlled to execute the voice processing method in the above embodiment while the program is running.
根据本发明实施例的另一方面,还提供了一种处理器,处理器用于运行程序,其中,程序运行时执行上述实施例中的语音处理方法。According to another aspect of an embodiment of the present invention, there is further provided a processor for executing a program, wherein the program is executed to execute the voice processing method in the above embodiment.
在本发明实施例中,获取预设时间段内多个时刻的语音向量,利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,输出多个文本信息,从而实现自然语言处理。容易注意到的是,由于获取到的是预设时间段内多个时刻的语音向量,并且预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理,从而实现利用自然语音的时序性特征,结合神经图灵机的记忆矩阵和LSTM模型,对自然语音进行处理,进而解决了现有技术中的语音处理方法的处理效率低的技术问题。因此,通过本发明上述实施例提供的方案,可以达到提高处理效率、提高处理准确度、降低处理复杂度、减少处理时间的效果。In the embodiment of the present invention, a speech vector of a plurality of times in a preset time period is acquired, and a speech vector of a plurality of times is processed by using a preset speech model to obtain a plurality of text information corresponding to the speech vectors of the multiple moments. , output multiple text information to achieve natural language processing. It is easy to notice that since the speech vectors of the plurality of times in the preset time period are acquired, and the preset speech model processes the speech vectors of the plurality of times based on the parameter vectors of the plurality of times stored in advance, thereby realizing By using the temporal characteristics of natural speech, combined with the memory matrix and LSTM model of the neural Turing machine, the natural speech is processed, and the technical problem of low processing efficiency of the speech processing method in the prior art is solved. Therefore, the solution provided by the above embodiments of the present invention can achieve the effects of improving processing efficiency, improving processing accuracy, reducing processing complexity, and reducing processing time.
附图说明DRAWINGS
此处所说明的附图用来提供对本发明的进一步理解,构成本申请的一部分,本发明的示意性实施例及其说明用于解释本发明,并不构成对本发明的不当限定。在附图中:The drawings described herein are intended to provide a further understanding of the invention, and are intended to be a part of the invention. In the drawing:
图1是根据本发明实施例的一种语音处理方法的流程图;1 is a flow chart of a voice processing method according to an embodiment of the present invention;
图2是根据本发明实施例的一种可选的预设语音模型的示意图;2 is a schematic diagram of an optional preset speech model according to an embodiment of the present invention;
图3是根据本发明实施例的一种可选的语音处理模型的重复模块的示意图;以及3 is a schematic diagram of a repeating module of an optional speech processing model in accordance with an embodiment of the present invention;
图4是根据本发明实施例的一种语音处理装置的示意图。4 is a schematic diagram of a voice processing device in accordance with an embodiment of the present invention.
具体实施方式Detailed ways
为了使本技术领域的人员更好地理解本发明方案,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分的实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本发明保护的范围。The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is an embodiment of the invention, but not all of the embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts shall fall within the scope of the present invention.
需要说明的是,本发明的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本发明的实施例能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。It is to be understood that the terms "first", "second", and the like in the specification and claims of the present invention are used to distinguish similar objects, and are not necessarily used to describe a particular order or order. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments of the invention described herein can be implemented in a sequence other than those illustrated or described herein. In addition, the terms "comprises" and "comprises" and "the" and "the" are intended to cover a non-exclusive inclusion, for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to Those steps or units may include other steps or units not explicitly listed or inherent to such processes, methods, products or devices.
实施例1Example 1
根据本发明实施例,提供了一种语音处理方法的实施例,需要说明的是,在附图的流程图示出的步骤可以在诸如一组计算机可执行指令的计算机系统中执行,并且,虽然在流程图中示出了逻辑顺序,但是在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤。In accordance with an embodiment of the present invention, an embodiment of a speech processing method is provided, it being noted that the steps illustrated in the flowchart of the figures may be performed in a computer system such as a set of computer executable instructions, and, although The logical order is shown in the flowcharts, but in some cases the steps shown or described may be performed in a different order than the ones described herein.
图1是根据本发明实施例的一种语音处理方法的流程图,如图1所示,该方法包括如下步骤:FIG. 1 is a flowchart of a voice processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
步骤S102,获取预设时间段内多个时刻的语音向量。Step S102: Acquire a speech vector of a plurality of times in a preset time period.
可选地,在本发明上述实施例中,根据预设语音模型的处理能力,确定预设时间段。Optionally, in the foregoing embodiment of the present invention, the preset time period is determined according to the processing capability of the preset voice model.
具体地,上述的预设时间段可以根据实际语音处理模型的处理能力进行设定,上述的多个时刻可以是间隔相等的多个采样时刻,例如,预设时间段为100s,采样间隔 为10s,则在100s内,可以获取到10个时刻的语音向量。Specifically, the preset time period may be set according to the processing capability of the actual voice processing model, and the multiple times may be multiple sampling times with equal intervals, for example, the preset time period is 100s, and the sampling interval is 10s. , within 100s, you can get the speech vector of 10 moments.
步骤S104,利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,其中,预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理。Step S104: processing the speech vector of the plurality of moments by using the preset speech model to obtain a plurality of text information corresponding to the speech vectors of the plurality of moments, wherein the preset speech model is based on the parameter vectors of the plurality of times stored in advance The speech vectors at multiple times are processed.
可选地,在本发明上述实施例中,预设语音模型包括:语音处理模型和参数矩阵,参数矩阵用于预先存储多个时刻的参数向量,语音处理模型用于基于多个时刻的参数向量对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息。Optionally, in the foregoing embodiment of the present invention, the preset voice model includes: a voice processing model and a parameter matrix, wherein the parameter matrix is used to pre-store parameter vectors of multiple moments, and the voice processing model is used for parameter vectors based on multiple moments. The speech vectors at a plurality of times are processed to obtain a plurality of text information corresponding to the speech vectors of the plurality of times.
可选地,在本发明上述实施例中,语音处理模型为LSTM模型,参数矩阵为记忆矩阵。Optionally, in the foregoing embodiment of the present invention, the voice processing model is an LSTM model, and the parameter matrix is a memory matrix.
具体地,上述的预设语音模型可以是神经图灵机,如图2所示,神经图灵机包括两个组成部分:控制器(即上述的语音处理模型)和记忆矩阵(即上述的参数矩阵),记忆矩阵为外部的存储矩阵,存储有语音处理模型进行语音处理所需要的参数向量,控制器可以对记忆矩阵中的参数向量进行读取和写入;上述的语音处理模型可以是LSTM模型,是一种RNN中特殊的类型,可以学习长期依赖信息,LSTM通过刻意的设计来避免长期依赖问题,具体地,LSTM与其他RNN一样,具有一种重复神经网络模块的链式的形式,但是,与单一神经网络层不同,重复的模块拥有一个不同的结构,如图3所示,可以由输入门,忘记门,输出门构成,并且以一种非常特殊的方式进行交互,从而解决了RNN的梯度消失和梯度爆炸的问题。Specifically, the foregoing preset speech model may be a neuroturing machine. As shown in FIG. 2, the neuroturing machine includes two components: a controller (ie, the above-described speech processing model) and a memory matrix (ie, the parameter matrix described above). The memory matrix is an external storage matrix, and stores a parameter vector required by the speech processing model for speech processing. The controller can read and write the parameter vector in the memory matrix; the above speech processing model can be an LSTM model. It is a special type of RNN that can learn long-term dependency information. LSTM avoids long-term dependency problems by deliberate design. Specifically, LSTM has the same chain form of repeated neural network modules as other RNNs. Unlike a single neural network layer, a repetitive module has a different structure, as shown in Figure 3, which can be composed of input gates, forgotten gates, output gates, and interacts in a very special way, thus solving the RNN Gradient disappearance and gradient explosion problems.
步骤S106,输出多个文本信息。In step S106, a plurality of text information is output.
在一种可选的方案中,可以根据自然语音的时序性特征,获取预设时间段内的多个采样时刻的自然语音数据,得到预设时间段内多个时刻的语音向量,获取预先训练好的神经图灵机,利用神经图灵机对多个时刻的语音向量进行识别,得到对应的文本信息,并输出识别出的文本信息。In an optional solution, the natural voice data of the plurality of sampling moments in the preset time period may be acquired according to the time-series feature of the natural voice, and the voice vector of the multiple time points in the preset time period is obtained, and the pre-training is obtained. A good neural Turing machine uses a neural Turing machine to recognize speech vectors at multiple times, obtain corresponding text information, and output the recognized text information.
根据本发明上述实施例,获取预设时间段内多个时刻的语音向量,利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,输出多个文本信息,从而实现自然语言处理。容易注意到的是,由于获取到的是预设时间段内多个时刻的语音向量,并且预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理,从而实现利用自然语音的时序性特征,结合神经图灵机的记忆矩阵和LSTM模型,对自然语音进行处理,进而解决了现有技术中的语音处理方法的处理效率低的技术问题。因此,通过本发明上述实施例提供的方案,可以达到提高处理效率、提高处理准确度、降低处理复杂度、减少处理时间的效 果。According to the foregoing embodiment of the present invention, a speech vector of a plurality of time instants in a preset time period is acquired, and a speech vector of a plurality of time instants is processed by using a preset speech model to obtain a plurality of text information corresponding to the speech vectors of the plurality of time instants. , output multiple text information to achieve natural language processing. It is easy to notice that since the speech vectors of the plurality of times in the preset time period are acquired, and the preset speech model processes the speech vectors of the plurality of times based on the parameter vectors of the plurality of times stored in advance, thereby realizing By using the temporal characteristics of natural speech, combined with the memory matrix and LSTM model of the neural Turing machine, the natural speech is processed, and the technical problem of low processing efficiency of the speech processing method in the prior art is solved. Therefore, the solution provided by the above embodiments of the present invention can achieve the effects of improving processing efficiency, improving processing accuracy, reducing processing complexity, and reducing processing time.
可选地,在本发明上述实施例中,步骤S104,利用预设语音模型对多个时刻的语音向量进行处理,得到与多个时刻的语音向量相对应的多个文本信息,包括:Optionally, in the foregoing embodiment of the present invention, in step S104, the voice vector of the multiple moments is processed by using the preset voice model to obtain a plurality of text information corresponding to the voice vectors of the multiple moments, including:
步骤S1040,根据读操作从参数矩阵中获取多个时刻的第一参数向量。Step S1040: Acquire a first parameter vector of a plurality of moments from the parameter matrix according to the read operation.
具体地,如图2所示,神经图灵机可以包括:读头和写头,通过读头进行读操作可以从记忆矩阵中读取到LSTM模型中的W参数,通过写头进行写操作可以将新的W参数写入记忆矩阵中。Specifically, as shown in FIG. 2, the neural Turing machine may include: a read head and a write head, and the read operation may be read from the memory matrix by the read head to read the W parameter in the LSTM model, and the write operation may be performed by the write head. The new W parameters are written to the memory matrix.
步骤S1042,利用多个时刻的第一参数向量对语音处理模型进行修正,得到修正后的语音处理模型。In step S1042, the speech processing model is corrected by using the first parameter vector at a plurality of times to obtain a modified speech processing model.
步骤S1044,利用修正后的语音处理模型对多个时刻的语音向量进行处理,得到多个文本信息。In step S1044, the speech vector of the plurality of times is processed by the corrected speech processing model to obtain a plurality of text information.
在一种可选的方案中,在获取到多个时刻的语音向量之后,针对每个时刻的自然语音处理过程,可以通过读头从记忆矩阵中读取W参数向量,将W参数向量输入LSTM模型,对LSTM模型进行修正,得到修正后的LSTM模型,可以将语音向量作为输入向量,输入至修正后的LSTM模型,从而得到LSTM模型的输出向量,即语音向量的文本信息,在所有多个时刻的语音向量完成处理之后,得到多个时刻的语音向量相对应的多个文本信息。In an optional solution, after acquiring the speech vectors at multiple times, for the natural speech processing process at each moment, the W parameter vector can be read from the memory matrix by the read head, and the W parameter vector is input into the LSTM. The model, the LSTM model is modified, and the modified LSTM model is obtained. The speech vector can be input as an input vector to the modified LSTM model, thereby obtaining the output vector of the LSTM model, that is, the text information of the speech vector, in all of the plurality. After the speech vector of the time is completed, a plurality of pieces of text information corresponding to the speech vectors of the plurality of times are obtained.
可选地,在本发明上述实施例中,在步骤S1044,利用修正后的语音处理模型对多个时刻的语音向量进行处理,得到多个文本信息的同时,该方法还包括:Optionally, in the foregoing embodiment of the present invention, in step S1044, the voice vector of the plurality of times is processed by using the modified voice processing model to obtain a plurality of text information, and the method further includes:
步骤S1046,利用修正后的语音处理模型,得到多个时刻的第二参数向量。In step S1046, the second parameter vector of the plurality of times is obtained by using the modified speech processing model.
可选地,在本发明上述实施例中,步骤S1046,利用修正后的语音处理模型,得到多个时刻的第二参数向量,包括:Optionally, in the foregoing embodiment of the present invention, in step S1046, using the modified speech processing model, obtaining a second parameter vector at multiple moments, including:
步骤S10462,利用修正后的语音处理模型对多个时刻的第一参数向量进行更新,得到多个时刻的第二参数向量。Step S10462: The first parameter vector of the plurality of times is updated by using the modified speech processing model to obtain a second parameter vector of the plurality of times.
步骤S1048,根据写操作将多个时刻的第二参数向量写入参数矩阵。Step S1048, writing a second parameter vector of the plurality of times into the parameter matrix according to the writing operation.
在一种可选的方案中,针对每个时刻的自然语音处理过程,在利用LSTM模型对语音向量进行处理的过程中,不仅可以得到语音向量的文本信息,还可以得到新的W参数向量,通过写头将新的W参数向量写入记忆矩阵,作为下一个时刻的W参数向量。In an optional solution, for the natural speech processing process at each moment, in the process of processing the speech vector by using the LSTM model, not only the text information of the speech vector but also the new W parameter vector can be obtained. The new W parameter vector is written to the memory matrix by the write head as the W parameter vector for the next moment.
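Correspondingly, a non-limiting sketch of the write operation is given below: the updated W parameter vector produced at the current time instant is blended into the memory matrix by a write head in the usual erase-then-add manner of a neural Turing machine, so that it becomes the W parameter vector read at the next time instant. The identifiers (write_head, erase, add) and the dimensions are illustrative assumptions only.

```python
# Illustrative sketch only; all names and dimensions are hypothetical.
import numpy as np

def write_head(memory_matrix: np.ndarray, weights: np.ndarray,
               erase: np.ndarray, add: np.ndarray) -> np.ndarray:
    """NTM-style write: each memory row is partially erased, then the new content is added."""
    M = memory_matrix * (1.0 - np.outer(weights, erase))   # erase step
    return M + np.outer(weights, add)                      # add step

rows, cols = 8, 160
memory = np.random.randn(rows, cols)
write_weights = np.full(rows, 1.0 / rows)    # uniform addressing, purely illustrative
new_w = np.random.randn(cols)                # stands in for the updated W parameter vector
memory = write_head(memory, write_weights, erase=np.ones(cols), add=new_w)
```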
可选地，在本发明上述实施例中，在步骤S104，利用预设语音模型对多个时刻的语音向量进行处理，得到与多个时刻的语音向量相对应的多个文本信息之前，该方法还包括：Optionally, in the above embodiments of the present invention, before step S104 of processing the speech vectors at the plurality of time instants by using the preset speech model to obtain the plurality of pieces of text information corresponding to the speech vectors at the plurality of time instants, the method further includes:
步骤S108,建立初始预设模型,初始预设模型包括:语音处理模型和初始参数矩阵。Step S108, establishing an initial preset model, where the initial preset model comprises: a voice processing model and an initial parameter matrix.
步骤S110,获取训练数据,其中,训练数据包括:多个训练语音向量,以及每个训练语音向量相对应的文本信息。Step S110, acquiring training data, wherein the training data comprises: a plurality of training speech vectors, and text information corresponding to each training speech vector.
步骤S112,根据训练数据对初始预设模型进行训练,得到预设语音模型。Step S112, training the initial preset model according to the training data to obtain a preset voice model.
在一种可选的方案中，可以根据实际处理需要，预先建立神经图灵机中的LSTM模型，并将记忆矩阵中的W参数向量置为初始值，然后根据训练数据对预先建立的神经图灵机进行训练，得到准确度较高的神经图灵机。In an optional solution, the LSTM model in the neural Turing machine may be built in advance according to actual processing needs, the W parameter vectors in the memory matrix may be set to initial values, and the pre-built neural Turing machine may then be trained with the training data to obtain a neural Turing machine with high accuracy.
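A minimal sketch of building such an initial preset model is shown below, assuming a Python/NumPy representation in which an LSTM-style controller and a memory matrix initialised to a fixed value are held together in one hypothetical container class; the class name, dimensions and initial value are illustrative assumptions and not part of the specification.

```python
# Illustrative sketch only; all names and dimensions are hypothetical.
import numpy as np

class InitialPresetModel:
    """Hypothetical stand-in for the pre-built neural-Turing-machine model."""
    def __init__(self, rows: int, hidden: int, input_dim: int, init_value: float = 0.0):
        cols = 4 * hidden * (input_dim + hidden)                 # room for the four gate matrices
        self.memory_matrix = np.full((rows, cols), init_value)   # initial parameter matrix
        self.controller_w = np.random.randn(cols) * 0.01         # LSTM-style controller parameters

model = InitialPresetModel(rows=8, hidden=4, input_dim=6)
```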
可选地，在本发明上述实施例中，步骤S112，根据训练数据对初始预设模型进行训练，得到预设语音模型包括：Optionally, in the above embodiments of the present invention, step S112 of training the initial preset model according to the training data to obtain the preset speech model includes:
步骤S1122,将训练数据输入语音处理模型,得到预设参数向量。In step S1122, the training data is input into the speech processing model to obtain a preset parameter vector.
步骤S1124,通过写操作将预设参数向量写入初始参数矩阵,得到参数矩阵。Step S1124: The preset parameter vector is written into the initial parameter matrix by a write operation to obtain a parameter matrix.
在一种可选的方案中，为了得到准确度较高的神经图灵机，可以将训练数据中的多个训练语音向量作为输入向量，每个训练语音向量对应的文本信息作为输出向量，输入至LSTM模型中，得到LSTM模型的预设W参数向量，并通过写头将预设W参数向量写入记忆矩阵，从而得到准确度较高的神经图灵机。In an optional solution, in order to obtain a neural Turing machine with high accuracy, the plurality of training speech vectors in the training data may be used as input vectors and the text information corresponding to each training speech vector as the output vector, and they are input into the LSTM model to obtain the preset W parameter vectors of the LSTM model; the preset W parameter vectors are then written into the memory matrix by the write head, thereby obtaining a neural Turing machine with high accuracy.
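A deliberately simplified, non-limiting sketch of this training step follows. It replaces the LSTM with a trivial linear stand-in so that the flow — feed training speech vectors and their target text encodings through the model, obtain preset W parameter vectors, and write them into the parameter matrix — can be shown in a few lines; the loss, the update rule and all identifiers are illustrative assumptions only.

```python
# Illustrative sketch only; the real controller would be an LSTM, not a linear map.
import numpy as np

def train(memory: np.ndarray, speech_vecs: np.ndarray, targets: np.ndarray,
          lr: float = 0.1) -> np.ndarray:
    """Toy training loop: nudge a single parameter row toward a least-squares fit."""
    w = memory[0].copy()
    for x, y in zip(speech_vecs, targets):
        pred = w[: x.size] @ x                 # trivial linear stand-in for the LSTM output
        grad = (pred - y) * x
        w[: x.size] -= lr * grad               # update the preset W parameter vector
    memory[0] = w                              # write operation into the parameter matrix
    return memory

memory = np.zeros((8, 16))                     # initial parameter matrix (toy size)
xs = np.random.randn(20, 6)                    # training speech vectors (toy)
ys = np.random.randn(20)                       # text labels encoded as scalars (toy)
memory = train(memory, xs, ys)
```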
实施例2Example 2
根据本发明实施例,提供了一种语音处理装置的实施例。In accordance with an embodiment of the present invention, an embodiment of a speech processing apparatus is provided.
图4是根据本发明实施例的一种语音处理装置的示意图，如图4所示，该装置包括：FIG. 4 is a schematic diagram of a speech processing apparatus according to an embodiment of the present invention. As shown in FIG. 4, the apparatus includes:
第一获取模块41,用于获取预设时间段内多个时刻的语音向量。The first obtaining module 41 is configured to acquire a voice vector at multiple moments in a preset time period.
可选地,在本发明上述实施例中,该装置还包括:确定模块,用于根据预设语音模型的处理能力,确定预设时间段。Optionally, in the foregoing embodiment of the present invention, the apparatus further includes: a determining module, configured to determine a preset time period according to a processing capability of the preset voice model.
具体地，上述的预设时间段可以根据模型的处理能力进行设定，上述的多个时刻可以是间隔相等的多个采样时刻，例如，预设时间段为100s，采样间隔为10s，则在100s内，可以获取到10个时刻的语音向量。Specifically, the above preset time period may be set according to the processing capability of the model, and the above plurality of time instants may be a plurality of equally spaced sampling instants. For example, if the preset time period is 100 s and the sampling interval is 10 s, speech vectors at 10 time instants can be acquired within the 100 s.
处理模块43，用于利用预设语音模型对多个时刻的语音向量进行处理，得到与多个时刻的语音向量相对应的多个文本信息，其中，预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理。The processing module 43 is configured to process the speech vectors at the plurality of time instants by using the preset speech model to obtain a plurality of pieces of text information corresponding to the speech vectors at the plurality of time instants, wherein the preset speech model processes the speech vectors at the plurality of time instants based on pre-stored parameter vectors for the plurality of time instants.
可选地，在本发明上述实施例中，预设语音模型包括：语音处理模型和参数矩阵，参数矩阵用于预先存储多个时刻的参数向量，语音处理模型用于基于多个时刻的参数向量对多个时刻的语音向量进行处理，得到与多个时刻的语音向量相对应的多个文本信息。Optionally, in the above embodiments of the present invention, the preset speech model includes a speech processing model and a parameter matrix; the parameter matrix is configured to pre-store the parameter vectors for the plurality of time instants, and the speech processing model is configured to process the speech vectors at the plurality of time instants based on the parameter vectors for the plurality of time instants to obtain the plurality of pieces of text information corresponding to the speech vectors at the plurality of time instants.
可选地,在本发明上述实施例中,语音处理模型为LSTM模型,参数矩阵为记忆矩阵。Optionally, in the foregoing embodiment of the present invention, the voice processing model is an LSTM model, and the parameter matrix is a memory matrix.
具体地，上述的预设语音模型可以是神经图灵机，如图2所示，神经图灵机包括两个组成部分：控制器（即上述的语音处理模型）和记忆矩阵（即上述的参数矩阵），记忆矩阵为外部的存储矩阵，存储有语音处理模型进行语音处理所需要的参数向量，控制器可以对记忆矩阵中的参数向量进行读取和写入；上述的语音处理模型可以是LSTM模型，是一种RNN中特殊的类型，可以学习长期依赖信息，LSTM通过刻意的设计来避免长期依赖问题，具体地，LSTM与其他RNN一样，具有一种重复神经网络模块的链式的形式，但是，与单一神经网络层不同，重复的模块拥有一个不同的结构，如图3所示，可以由输入门，忘记门，输出门构成，并且以一种非常特殊的方式进行交互，从而解决了RNN的梯度消失和梯度爆炸的问题。Specifically, the above preset speech model may be a neural Turing machine. As shown in FIG. 2, the neural Turing machine includes two components: a controller (i.e., the above speech processing model) and a memory matrix (i.e., the above parameter matrix). The memory matrix is an external storage matrix that stores the parameter vectors required by the speech processing model for speech processing, and the controller can read parameter vectors from, and write parameter vectors into, the memory matrix. The above speech processing model may be an LSTM model, a special type of RNN that can learn long-term dependency information; the LSTM avoids the long-term dependency problem by deliberate design. Specifically, like other RNNs, the LSTM has the form of a chain of repeating neural network modules, but unlike a single neural-network layer, each repeating module has a different structure: as shown in FIG. 3, it may consist of an input gate, a forget gate and an output gate that interact in a very particular way, thereby solving the vanishing-gradient and exploding-gradient problems of RNNs.
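For reference, a standard LSTM cell of the kind referred to above can be written as follows (general background assumed by the editor, not recited verbatim in the specification), where x_t is the speech vector at time instant t, h_{t-1} the previous output, σ the logistic function and ⊙ element-wise multiplication; in the formulation assumed here, the weight matrices W_i, W_f, W_o, W_c correspond to the W parameters read from the memory matrix:

    i_t = σ(W_i · [h_{t-1}, x_t] + b_i)        (input gate)
    f_t = σ(W_f · [h_{t-1}, x_t] + b_f)        (forget gate)
    o_t = σ(W_o · [h_{t-1}, x_t] + b_o)        (output gate)
    c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)     (candidate cell state)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
    h_t = o_t ⊙ tanh(c_t)

Because the cell state c_t is updated additively rather than through repeated multiplication, gradients can flow across many time instants, which is why this structure mitigates the vanishing-gradient and exploding-gradient problems mentioned above.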
输出模块45,用于输出多个文本信息。The output module 45 is configured to output a plurality of text information.
在一种可选的方案中，可以根据自然语音的时序性特征，获取预设时间段内的多个采样时刻的自然语音数据，得到预设时间段内多个时刻的语音向量，获取预先训练好的神经图灵机，利用神经图灵机对多个时刻的语音向量进行识别，得到对应的文本信息，并输出识别出的文本信息。In an optional solution, natural speech data at a plurality of sampling instants within the preset time period may be acquired according to the temporal characteristics of natural speech to obtain the speech vectors at the plurality of time instants within the preset time period; a pre-trained neural Turing machine is then acquired, the speech vectors at the plurality of time instants are recognized by the neural Turing machine to obtain the corresponding text information, and the recognized text information is output.
根据本发明上述实施例，获取预设时间段内多个时刻的语音向量，利用预设语音模型对多个时刻的语音向量进行处理，得到与多个时刻的语音向量相对应的多个文本信息，输出多个文本信息，从而实现自然语言处理。容易注意到的是，由于获取到的是预设时间段内多个时刻的语音向量，并且预设语音模型基于预先存储的多个时刻的参数向量对多个时刻的语音向量进行处理，从而实现利用自然语音的时序性特征，结合神经图灵机的记忆矩阵和LSTM模型，对自然语音进行处理，进而解决了现有技术中的语音处理方法的处理效率低的技术问题。因此，通过本发明上述实施例提供的方案，可以达到提高处理效率、提高处理准确度、降低处理复杂度、减少处理时间的效果。According to the above embodiments of the present invention, speech vectors at a plurality of time instants within a preset time period are acquired, the speech vectors at the plurality of time instants are processed by using a preset speech model to obtain a plurality of pieces of text information corresponding to the speech vectors at the plurality of time instants, and the plurality of pieces of text information are output, thereby realizing natural language processing. It is easy to notice that, since the speech vectors acquired are those at the plurality of time instants within the preset time period, and the preset speech model processes them based on pre-stored parameter vectors for the plurality of time instants, the temporal characteristics of natural speech are exploited and natural speech is processed by combining the memory matrix of a neural Turing machine with an LSTM model, which solves the technical problem of low processing efficiency of the speech processing methods in the prior art. Therefore, the solution provided by the above embodiments of the present invention achieves the effects of improving processing efficiency, improving processing accuracy, reducing processing complexity and reducing processing time.
可选地,在本发明上述实施例中,处理模块43包括:Optionally, in the foregoing embodiment of the present invention, the processing module 43 includes:
获取子模块,用于根据读操作从参数矩阵中获取多个时刻的第一参数向量。The obtaining submodule is configured to obtain the first parameter vector of the plurality of moments from the parameter matrix according to the read operation.
具体地，如图2所示，神经图灵机可以包括：读头和写头，通过读头进行读操作可以从记忆矩阵中读取到LSTM模型中的W参数，通过写头进行写操作可以将新的W参数写入记忆矩阵中。Specifically, as shown in FIG. 2, the neural Turing machine may include a read head and a write head. A read operation performed by the read head reads the W parameters of the LSTM model from the memory matrix, and a write operation performed by the write head writes new W parameters into the memory matrix.
修正子模块,用于利用多个时刻的第一参数向量对语音处理模型进行修正,得到修正后的语音处理模型。The correction submodule is configured to correct the speech processing model by using the first parameter vector at multiple moments to obtain a modified speech processing model.
第一处理子模块,用于利用修正后的语音处理模型对多个时刻的语音向量进行处理,得到多个文本信息。The first processing sub-module is configured to process the speech vector of the plurality of moments by using the modified speech processing model to obtain a plurality of text information.
在一种可选的方案中，在获取到多个时刻的语音向量之后，针对每个时刻的自然语音处理过程，可以通过读头从记忆矩阵中读取W参数向量，将W参数向量输入LSTM模型，对LSTM模型进行修正，得到修正后的LSTM模型，可以将语音向量作为输入向量，输入至修正后的LSTM模型，从而得到LSTM模型的输出向量，即语音向量的文本信息，在所有多个时刻的语音向量完成处理之后，得到多个时刻的语音向量相对应的多个文本信息。In an optional solution, after the speech vectors at the plurality of time instants are acquired, for the natural speech processing at each time instant, the W parameter vector may be read from the memory matrix by the read head and input into the LSTM model so as to modify the LSTM model and obtain a modified LSTM model. The speech vector may then be input, as the input vector, into the modified LSTM model to obtain the output vector of the LSTM model, that is, the text information of the speech vector. After the speech vectors at all of the plurality of time instants have been processed, a plurality of pieces of text information corresponding to the speech vectors at the plurality of time instants are obtained.
可选地,在本发明上述实施例中,处理模块43还包括:Optionally, in the foregoing embodiment of the present invention, the processing module 43 further includes:
第二处理子模块,用于利用修正后的语音处理模型,得到多个时刻的第二参数向量。The second processing sub-module is configured to obtain the second parameter vector of the multiple moments by using the modified speech processing model.
可选地,在本发明上述实施例中,第二处理子模块还用于利用修正后的语音处理模型对多个时刻的第一参数向量进行更新,得到多个时刻的第二参数向量。Optionally, in the foregoing embodiment of the present invention, the second processing sub-module is further configured to update the first parameter vector of the multiple moments by using the modified speech processing model to obtain the second parameter vector of the multiple moments.
第一存储子模块,用于根据写操作将多个时刻的第二参数向量写入参数矩阵。The first storage submodule is configured to write the second parameter vector of the multiple moments into the parameter matrix according to the write operation.
在一种可选的方案中,针对每个时刻的自然语音处理过程,在利用LSTM模型对语音向量进行处理的过程中,不仅可以得到语音向量的文本信息,还可以得到新的W参数向量,通过写头将新的W参数向量写入记忆矩阵,作为下一个时刻的W参数向量。In an optional solution, for the natural speech processing process at each moment, in the process of processing the speech vector by using the LSTM model, not only the text information of the speech vector but also the new W parameter vector can be obtained. The new W parameter vector is written to the memory matrix by the write head as the W parameter vector for the next moment.
可选地,在本发明上述实施例中,该装置还包括:Optionally, in the foregoing embodiment of the present invention, the device further includes:
建立模块,用于建立初始预设模型,初始预设模型包括:语音处理模型和初始参数矩阵。The module is established to establish an initial preset model, and the initial preset model includes: a voice processing model and an initial parameter matrix.
第二获取模块，用于获取训练数据，其中，训练数据包括：多个训练语音向量，以及每个训练语音向量相对应的文本信息。The second acquiring module is configured to acquire training data, wherein the training data includes a plurality of training speech vectors and the text information corresponding to each training speech vector.
训练模块,用于根据训练数据对初始预设模型进行训练,得到预设语音模型。The training module is configured to train the initial preset model according to the training data to obtain a preset voice model.
在一种可选的方案中，可以根据实际处理需要，预先建立神经图灵机中的LSTM模型，并将记忆矩阵中的W参数向量置为初始值，然后根据训练数据对预先建立的神经图灵机进行训练，得到准确度较高的神经图灵机。In an optional solution, the LSTM model in the neural Turing machine may be built in advance according to actual processing needs, the W parameter vectors in the memory matrix may be set to initial values, and the pre-built neural Turing machine may then be trained with the training data to obtain a neural Turing machine with high accuracy.
可选地,在本发明上述实施例中,训练模块包括:Optionally, in the foregoing embodiment of the present invention, the training module includes:
第三处理子模块,用于将训练数据输入语音处理模型,得到预设参数向量。The third processing sub-module is configured to input the training data into the speech processing model to obtain a preset parameter vector.
第二存储子模块,用于通过写操作将预设参数向量写入初始参数矩阵,得到参数矩阵。The second storage submodule is configured to write a preset parameter vector into the initial parameter matrix by a write operation to obtain a parameter matrix.
在一种可选的方案中，为了得到准确度较高的神经图灵机，可以将训练数据中的多个训练语音向量作为输入向量，每个训练语音向量对应的文本信息作为输出向量，输入至LSTM模型中，得到LSTM模型的预设W参数向量，并通过写头将预设W参数向量写入记忆矩阵，从而得到准确度较高的神经图灵机。In an optional solution, in order to obtain a neural Turing machine with high accuracy, the plurality of training speech vectors in the training data may be used as input vectors and the text information corresponding to each training speech vector as the output vector, and they are input into the LSTM model to obtain the preset W parameter vectors of the LSTM model; the preset W parameter vectors are then written into the memory matrix by the write head, thereby obtaining a neural Turing machine with high accuracy.
实施例3Example 3
根据本发明实施例,提供了一种存储介质的实施例,存储介质包括存储的程序,其中,在程序运行时控制存储介质所在设备执行上述实施例1中的语音处理方法。According to an embodiment of the present invention, an embodiment of a storage medium is provided. The storage medium includes a stored program, wherein the device in which the storage medium is located is controlled to execute the voice processing method in Embodiment 1 above when the program is running.
实施例4Example 4
根据本发明实施例,提供了一种处理器的实施例,处理器用于运行程序,其中,程序运行时执行上述实施例1中的语音处理方法。According to an embodiment of the present invention, there is provided an embodiment of a processor for executing a program, wherein the program is executed to execute the voice processing method in Embodiment 1 above.
上述本发明实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the embodiments of the present invention are merely for the description, and do not represent the advantages and disadvantages of the embodiments.
在本发明的上述实施例中，对各个实施例的描述都各有侧重，某个实施例中没有详述的部分，可以参见其他实施例的相关描述。In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
在本申请所提供的几个实施例中，应该理解到，所揭露的技术内容，可根据其它的方式实现。其中，以上所描述的装置实施例仅仅是示意性的，例如所述单元的划分，可以为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是根据一些接口，单元或模块的间接耦合或通信连接，可以是电性或其它的形式。In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a division of logical functions, and there may be other division manners in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备（可为个人计算机、服务器或者网络设备等）执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、只读存储器（ROM，Read-Only Memory）、随机存取存储器（RAM，Random Access Memory）、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
以上所述仅是本发明的优选实施方式，应当指出，对于本技术领域的普通技术人员来说，在不脱离本发明原理的前提下，还可以做出若干改进和润饰，这些改进和润饰也应视为本发明的保护范围。The above description is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can also make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (12)

  1. 一种语音处理方法,其特征在于,包括:A voice processing method, comprising:
    获取预设时间段内多个时刻的语音向量;Obtaining a speech vector at multiple times in a preset time period;
    利用预设语音模型对所述多个时刻的语音向量进行处理,得到与所述多个时刻的语音向量相对应的多个文本信息,其中,所述预设语音模型基于预先存储的多个时刻的参数向量对所述多个时刻的语音向量进行处理;Processing the speech vector of the plurality of moments by using a preset speech model to obtain a plurality of text information corresponding to the speech vectors of the plurality of moments, wherein the preset speech model is based on a plurality of pre-stored moments The parameter vector processes the speech vector of the plurality of moments;
    输出所述多个文本信息。Outputting the plurality of text information.
  2. 根据权利要求1所述的方法，其特征在于，所述预设语音模型包括：语音处理模型和参数矩阵，所述参数矩阵用于预先存储所述多个时刻的参数向量，所述语音处理模型用于基于所述多个时刻的参数向量对所述多个时刻的语音向量进行处理，得到与所述多个时刻的语音向量相对应的多个文本信息。The method according to claim 1, wherein the preset speech model comprises a speech processing model and a parameter matrix, the parameter matrix is configured to pre-store the parameter vectors of the plurality of time instants, and the speech processing model is configured to process the speech vectors of the plurality of time instants based on the parameter vectors of the plurality of time instants to obtain the plurality of pieces of text information corresponding to the speech vectors of the plurality of time instants.
  3. 根据权利要求2所述的方法，其特征在于，利用预设语音模型对所述多个时刻的语音向量进行处理，得到与所述多个时刻的语音向量相对应的多个文本信息，包括：The method according to claim 2, wherein processing the speech vectors of the plurality of time instants by using the preset speech model to obtain the plurality of pieces of text information corresponding to the speech vectors of the plurality of time instants comprises:
    根据读操作从所述参数矩阵中获取所述多个时刻的第一参数向量;Acquiring the first parameter vector of the plurality of moments from the parameter matrix according to a read operation;
    利用所述多个时刻的第一参数向量对所述语音处理模型进行修正,得到修正后的语音处理模型;Correcting the speech processing model by using the first parameter vector of the plurality of moments to obtain a modified speech processing model;
    利用所述修正后的语音处理模型对所述多个时刻的语音向量进行处理,得到所述多个文本信息。The speech vector of the plurality of times is processed by the modified speech processing model to obtain the plurality of text information.
  4. 根据权利要求3所述的方法，其特征在于，在利用所述修正后的语音处理模型对所述多个时刻的语音向量进行处理，得到所述多个文本信息的同时，所述方法还包括：The method according to claim 3, wherein, while the speech vectors of the plurality of time instants are processed by using the modified speech processing model to obtain the plurality of pieces of text information, the method further comprises:
    利用所述修正后的语音处理模型,得到所述多个时刻的第二参数向量;Using the modified speech processing model, obtaining a second parameter vector of the plurality of moments;
    根据写操作将所述多个时刻的第二参数向量写入所述参数矩阵。The second parameter vector of the plurality of times is written to the parameter matrix according to a write operation.
  5. 根据权利要求4所述的方法,其特征在于,利用所述修正后的语音处理模型,得到所述多个时刻的第二参数向量,包括:The method according to claim 4, wherein the second parameter vector of the plurality of moments is obtained by using the modified speech processing model, including:
    利用所述修正后的语音处理模型对所述多个时刻的第一参数向量进行更新,得到所述多个时刻的第二参数向量。And updating the first parameter vector of the plurality of times by using the modified speech processing model to obtain a second parameter vector of the plurality of times.
  6. 根据权利要求2所述的方法，其特征在于，在利用预设语音模型对所述多个时刻的语音向量进行处理，得到与所述多个时刻的语音向量相对应的多个文本信息之前，所述方法还包括：The method according to claim 2, wherein, before the speech vectors of the plurality of time instants are processed by using the preset speech model to obtain the plurality of pieces of text information corresponding to the speech vectors of the plurality of time instants, the method further comprises:
    建立初始预设模型,所述初始预设模型包括:所述语音处理模型和初始参数矩阵;Establishing an initial preset model, the initial preset model comprising: the voice processing model and an initial parameter matrix;
    获取训练数据,其中,所述训练数据包括:多个训练语音向量,以及每个训练语音向量相对应的文本信息;Obtaining training data, wherein the training data includes: a plurality of training speech vectors, and text information corresponding to each training speech vector;
    根据所述训练数据对所述初始预设模型进行训练,得到所述预设语音模型。And training the initial preset model according to the training data to obtain the preset voice model.
  7. 根据权利要求6所述的方法，其特征在于，根据所述训练数据对所述初始预设模型进行训练，得到所述预设语音模型包括：The method according to claim 6, wherein training the initial preset model according to the training data to obtain the preset speech model comprises:
    将所述训练数据输入所述语音处理模型,得到预设参数向量;Inputting the training data into the voice processing model to obtain a preset parameter vector;
    通过写操作将所述预设参数向量写入所述初始参数矩阵,得到所述参数矩阵。The parameter matrix is obtained by writing the preset parameter vector into the initial parameter matrix by a write operation.
  8. 根据权利要求2至7中任意一项所述的方法,其特征在于,所述语音处理模型为LSTM模型,所述参数矩阵为记忆矩阵。The method according to any one of claims 2 to 7, wherein the speech processing model is an LSTM model and the parameter matrix is a memory matrix.
  9. 根据权利要求1所述的方法,其特征在于,根据所述预设语音模型的处理能力,确定所述预设时间段。The method according to claim 1, wherein the preset time period is determined according to a processing capability of the preset voice model.
  10. 一种语音处理装置,其特征在于,包括:A voice processing device, comprising:
    第一获取模块,用于获取预设时间段内多个时刻的语音向量;a first acquiring module, configured to acquire a voice vector at multiple moments in a preset time period;
    处理模块，用于利用预设语音模型对所述多个时刻的语音向量进行处理，得到与所述多个时刻的语音向量相对应的多个文本信息，其中，所述预设语音模型基于预先存储的多个时刻的参数向量对所述多个时刻的语音向量进行处理；a processing module, configured to process the speech vectors of the plurality of time instants by using a preset speech model to obtain a plurality of pieces of text information corresponding to the speech vectors of the plurality of time instants, wherein the preset speech model processes the speech vectors of the plurality of time instants based on pre-stored parameter vectors of the plurality of time instants;
    输出模块,用于输出所述多个文本信息。And an output module, configured to output the plurality of text information.
  11. 一种存储介质,其特征在于,所述存储介质包括存储的程序,其中,在所述程序运行时控制所述存储介质所在设备执行权利要求1至9中任意一项所述的语音处理方法。A storage medium, characterized in that the storage medium comprises a stored program, wherein the device in which the storage medium is located is controlled to perform the voice processing method according to any one of claims 1 to 9 while the program is running.
  12. 一种处理器,其特征在于,所述处理器用于运行程序,其中,所述程序运行时执行权利要求1至9中任意一项所述的语音处理方法。A processor, wherein the processor is configured to execute a program, wherein the program is executed to perform the voice processing method according to any one of claims 1 to 9.
PCT/CN2018/079848 2017-07-28 2018-03-21 Speech processing method and apparatus, storage medium and processor WO2019019667A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710633042.2 2017-07-28
CN201710633042.2A CN109308896B (en) 2017-07-28 2017-07-28 Voice processing method and device, storage medium and processor

Publications (1)

Publication Number Publication Date
WO2019019667A1 true WO2019019667A1 (en) 2019-01-31

Family

ID=65040955

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/079848 WO2019019667A1 (en) 2017-07-28 2018-03-21 Speech processing method and apparatus, storage medium and processor

Country Status (2)

Country Link
CN (1) CN109308896B (en)
WO (1) WO2019019667A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836270A (en) * 2021-09-28 2021-12-24 深圳格隆汇信息科技有限公司 Big data processing method and related product


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6021403A (en) * 1996-07-19 2000-02-01 Microsoft Corporation Intelligent user assistance facility
AU2002253416A1 (en) * 2002-03-27 2003-10-08 Nokia Corporation Pattern recognition
ATE466361T1 (en) * 2006-08-11 2010-05-15 Harman Becker Automotive Sys LANGUAGE RECOGNITION USING A STATISTICAL LANGUAGE MODEL USING SQUARE ROOT SMOOTHING
EP2734997A4 (en) * 2011-07-20 2015-05-20 Tata Consultancy Services Ltd A method and system for detecting boundary of coarticulated units from isolated speech
CN105070300A (en) * 2015-08-12 2015-11-18 东南大学 Voice emotion characteristic selection method based on speaker standardization change

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010204391A (en) * 2009-03-03 2010-09-16 Nippon Telegr & Teleph Corp <Ntt> Voice signal modeling method, signal recognition device and method, parameter learning device and method, and feature value generating device, method, and program
US9378729B1 (en) * 2013-03-12 2016-06-28 Amazon Technologies, Inc. Maximum likelihood channel normalization
CN105989839A (en) * 2015-06-03 2016-10-05 乐视致新电子科技(天津)有限公司 Speech recognition method and speech recognition device
CN106257583A (en) * 2015-06-17 2016-12-28 大众汽车有限公司 Speech recognition system and the method being used for running speech recognition system
CN106157950A (en) * 2016-09-29 2016-11-23 合肥华凌股份有限公司 Speech control system and awakening method, Rouser and household electrical appliances, coprocessor

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489630A (en) * 2019-09-12 2021-03-12 武汉Tcl集团工业研究院有限公司 Voice recognition method and device
CN113095559A (en) * 2021-04-02 2021-07-09 京东数科海益信息科技有限公司 Hatching time prediction method, device, equipment and storage medium
CN113095559B (en) * 2021-04-02 2024-04-09 京东科技信息技术有限公司 Method, device, equipment and storage medium for predicting hatching time

Also Published As

Publication number Publication date
CN109308896A (en) 2019-02-05
CN109308896B (en) 2022-04-15


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18839088

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.05.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18839088

Country of ref document: EP

Kind code of ref document: A1