CN110634470A - Intelligent voice processing method and device - Google Patents

Intelligent voice processing method and device

Info

Publication number
CN110634470A
Authority
CN
China
Prior art keywords
sentence
neural network
break
frames
rnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810575092.4A
Other languages
Chinese (zh)
Inventor
李鑫
孟通
韩冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xilinx Technology Beijing Ltd
Original Assignee
Beijing Shenjian Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenjian Intelligent Technology Co Ltd filed Critical Beijing Shenjian Intelligent Technology Co Ltd
Priority to CN201810575092.4A
Publication of CN110634470A
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Abstract

The invention discloses an intelligent voice processing method and device. The method comprises the following steps: performing framing processing on input speech to obtain sentence break parameters for a plurality of frames; inputting the sentence break parameters of the plurality of frames as feature values into a trained artificial neural network (ANN), the ANN comprising a recurrent neural network (RNN); and feeding the output features of the RNN into a fully-connected layer to judge whether each input frame is a sentence break point. Preferably, the RNN may be a long short-term memory (LSTM) neural network or a gated recurrent unit (GRU) neural network, and the ANN may also adopt a CNN-followed-by-RNN architecture. The scheme can reuse the neural network, or neural network computing platform, required for subsequent speech recognition to intelligently break long input sentences, thereby realizing an accurate sentence-breaking scheme matched with intelligent speech recognition.

Description

Intelligent voice processing method and device
Technical Field
The invention relates to speech processing, in particular to intelligent processing of speech based on a neural network.
Background
Speech recognition is a technique that sequentially maps the analog signal of speech onto a specific set of words. In recent years, artificial neural network (ANN) methods have far surpassed all traditional methods in the field of speech recognition and have become the mainstream of the entire industry. Among them, deep neural networks are very widely applied.
However, with rapid development in recent years, the scale of neural networks keeps growing; published state-of-the-art networks can reach hundreds of layers and hundreds of millions of connections, making them compute- and memory-intensive applications. Although heterogeneous neural network processors based on GPUs, FPGAs, or ASICs can greatly reduce the compute and power requirements of a neural network, hardware resource limits mean that such dedicated neural network inference accelerators restrict the length of a single input, that is, the maximum length of a sentence that can be processed at a time. Therefore, a suitable sentence-breaking method is required to process and recognize a long input sentence in segments.
Most traditional speech segmentation or endpoint detection methods are based on the short-time energy and short-time zero-crossing rate of speech, combined with thresholds and decision logic. However, such schemes do not take temporal correlation into account and usually cannot identify semantically meaningful sentence break points.
Therefore, a more accurate sentence break scheme suitable for use with neural network speech recognition systems is needed.
Disclosure of Invention
The invention provides an intelligent voice processing scheme that can intelligently break long input sentences by reusing the neural network, or neural network computing platform, required for subsequent speech recognition, thereby realizing a sentence-breaking scheme matched with intelligent speech recognition.
According to an aspect of the present invention, an intelligent speech processing method is provided, including: performing framing processing on input speech to obtain sentence break parameters for a plurality of frames; inputting the sentence break parameters of the plurality of frames as feature values into a trained artificial neural network (ANN), the ANN comprising a recurrent neural network (RNN); and feeding the output features of the RNN into a fully-connected layer to judge whether each input frame is a sentence break point. Preferably, the RNN may be a long short-term memory (LSTM) neural network or a gated recurrent unit (GRU) neural network. The parameters of the ANN and the fully-connected layer are trained based on a network model that includes a Softmax layer following the fully-connected layer.
Therefore, more accurate sentence break point judgment can be realized by exploiting the recurrent neural network's ability to extract global temporal features.
Preferably, inputting the sentence break parameters of the plurality of frames into the trained ANN comprises: inputting the sentence break parameters of the plurality of frames into a trained convolutional neural network (CNN); and feeding the output features of the CNN into the trained RNN. Introducing the CNN combines its local feature extraction capability with the RNN's extraction of global temporal features, enabling more accurate sentence break point judgment.
The sentence break parameters obtained for the plurality of frames may include at least one of: a normalized short-time energy for each of the plurality of frames; a short-time zero-crossing rate for each of the plurality of frames; a normalized short-time Fourier transform result for each of the plurality of frames; and a combination or weighted combination of any two or three of the above. This enables flexible, application-based parameter selection.
Feeding the output features of the RNN into the fully-connected layer to judge whether each input frame is a sentence break point may include: feeding the output features of the RNN into a fully-connected layer to judge whether each input frame is an initially selected sentence break point; and screening the initially selected sentence break points based on a predetermined rule to select the final sentence break points. This further optimizes the choice of sentence break points.
Screening the initially selected sentence break points based on a predetermined rule to select final sentence break points may include: clustering the initially selected sentence break points and selecting the cluster center of each silent segment as a screened final sentence break point. It may alternatively include selecting final sentence break points based on at least one of the following predetermined constraints: the distance between adjacent sentence break points is less than the maximum processable sentence length; and the number of final sentence break points is minimized.
According to another aspect of the present invention, an intelligent speech processing device is provided, comprising: a parameter extraction device for framing input speech to obtain sentence break parameters for a plurality of frames; a neural network computing device for inputting the sentence break parameters of the plurality of frames as feature values into a trained artificial neural network (ANN) comprising a recurrent neural network (RNN); and a sentence break point judgment device for feeding the output features of the RNN into the fully-connected layer to judge whether each input frame is a sentence break point.
Preferably, the RNN may be a long short-term memory (LSTM) neural network or a gated recurrent unit (GRU) neural network. The parameters of the ANN and the fully-connected layer may be trained based on a network model that includes a Softmax layer following the fully-connected layer.
Preferably, the neural network computing device may input sentence break parameters of the plurality of frames into a trained Convolutional Neural Network (CNN) and the trained RNN to obtain output features for input into the fully-connected layer.
Preferably, the sentence break parameters of the plurality of frames acquired by the parameter extraction means may include at least one of: a normalized short-time energy for each of the plurality of frames; a short-time zero-crossing rate for each of the plurality of frames; a normalized short-time Fourier transform result for each of the plurality of frames; and a combination or weighted combination of any two or three of the above.
Preferably, the sentence break point judgment means may include: an initial sentence break point judgment device for feeding the output features of the RNN into a fully-connected layer to judge whether each input frame is an initially selected sentence break point; and a final sentence break point screening device for screening the initially selected sentence break points based on a predetermined rule to select the final sentence break points.
The final sentence break point screening device may be further configured to cluster the initially selected sentence break points and select the cluster center of each silent segment as a screened final sentence break point. It may also be configured to select final sentence break points based on at least one of the following predetermined constraints: the distance between adjacent sentence break points is less than the maximum processable sentence length; and the number of final sentence break points is minimized.
According to yet another aspect of the present invention, a computing platform is proposed, comprising a high-parallelism computation module for neural network inference, wherein the computing platform is adapted to implement any of the speech processing methods described above. Preferably, the computing platform is implemented with an ASIC, FPGA, or GPU.
According to yet another aspect of the invention, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the intelligent speech processing method as described above.
According to still another aspect of the present invention, a non-transitory machine-readable storage medium is provided, having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the intelligent speech processing method described above.
The intelligent voice processing scheme of the invention can reuse the neural network, or neural network computing platform, required for subsequent speech recognition to intelligently break long input sentences, thereby realizing a sentence-breaking scheme matched with intelligent speech recognition. By introducing the RNN, the method exploits the recurrent neural network's extraction of global temporal features to achieve more accurate sentence break point judgment. Further, by introducing the CNN, its local feature extraction capability is combined with the RNN's global temporal feature extraction for still more accurate judgment. The scheme thus solves the sentence break problem in intelligent speech recognition while accounting for the influence of limited sentence length on recognition accuracy.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 shows a typical structure of a DNN.
FIG. 2 shows a schematic diagram of an LSTM neural network model.
Fig. 3 shows a schematic diagram of the windowing operation.
Fig. 4 shows an example of the relationship between the original waveform, the short-term energy and the short-term zero-crossing rate.
FIG. 5 shows an example of a time-frequency distribution graph after a short-time Fourier transform is performed on speech.
Fig. 6A and 6B show schematic diagrams of the conventional single-threshold and double-threshold methods, respectively.
FIG. 7 illustrates an intelligent speech processing method according to one embodiment of the present invention.
Fig. 8 shows an example of windowing speech signal frames.
FIG. 9 shows a schematic diagram illustrating neural network processing for determining breakpoints.
FIG. 10 shows a schematic diagram of an intelligent speech processing apparatus according to one embodiment of the invention.
FIG. 11 is a schematic diagram of a computing device that can be used to implement the intelligent speech processing method described above according to one embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The application provides an intelligent sentence-breaking scheme for neural network speech recognition. For a better understanding of the principles of the present invention, an artificial neural network, in particular a recurrent neural network, is first described along with a conventional speech segmentation scheme.
Basic concept of artificial neural networks
An artificial neural network (ANN) is a mathematical computation model that mimics the behavioral characteristics of biological neural networks and performs distributed parallel information processing. A neural network contains a large number of interconnected nodes, also called "neurons". Each neuron processes the weighted input values from neighboring neurons through a specific output function (also called an "activation function"). The strength of information transfer between neurons is defined by so-called "weights", which the algorithm continuously adjusts through learning.
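As a toy illustration of this computation, a single neuron simply applies an activation function to a weighted sum of its inputs (the tanh activation and all values below are arbitrary choices for illustration):

```python
import numpy as np

def neuron(inputs, weights, bias):
    # weighted sum of neighbouring neurons' outputs, passed through an activation
    return np.tanh(np.dot(weights, inputs) + bias)

y = neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.2]), 0.05)
```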
Early neural networks had only two layers, the input and output layers, and their practicality was greatly limited by the inability to handle complex logic. Deep neural networks (DNNs) greatly improve the ability of neural networks to handle complex logic by adding hidden layers between the input and output layers. Fig. 1 shows a schematic DNN model. A DNN in practical applications may have a much more complex and larger structure than shown in Fig. 1, but its basic structure remains the same.
Speech recognition is the sequential mapping of the analog signal of speech onto a specific set of words. In recent years, artificial neural network methods have far surpassed all traditional methods in the field of speech recognition and are becoming the mainstream of the entire industry. Among them, deep neural networks are very widely applied.
The recurrent neural network (RNN) is a common deep neural network model. Unlike traditional feed-forward neural networks, it introduces directed cycles, which allows it to model contextual dependencies between inputs. In speech recognition, the signal is strongly correlated over time: recognizing a word in a sentence, for example, depends closely on the word sequence preceding it. Therefore, recurrent neural networks are very widely applied in the field of speech recognition.
To address the problem of memorizing long-term information, Hochreiter & Schmidhuber proposed the long short-term memory (LSTM) model in 1997. The LSTM neural network is a kind of RNN that replaces the simple repeated module of a general RNN with a complex set of interacting connections. FIG. 2 shows a schematic diagram of an LSTM neural network model. LSTM networks also achieve very good results in speech recognition, but have higher computational complexity than ordinary RNNs.
Through a gating mechanism, the LSTM enables the recurrent neural network not only to memorize past information but also to selectively forget unimportant information, so as to model long-term context and other relationships. The LSTM also has a variant called the gated recurrent unit (GRU), which resolves the vanishing-gradient problem of the standard RNN while retaining long-term sequence information. The GRU has a simpler structure (two gates instead of three) and lower computational complexity than the LSTM.
RNNs, LSTMs, and GRUs are often used for timing-related problems. In addition to the unidirectional structures discussed above, each of these networks has a bidirectional counterpart; in a bidirectional LSTM, for example, the result of passing the sequence forward through an LSTM layer is concatenated or summed with the result of passing it backward through an LSTM layer. Bidirectional RNNs and GRUs follow the same principle.
With rapid development in recent years, the scale of neural networks keeps growing; published state-of-the-art networks can reach hundreds of layers and hundreds of millions of connections, making them compute- and memory-access-intensive. A traditional CPU executing neural network inference is limited by its parallel granularity and has low computational efficiency; in high-concurrency speech recognition scenarios, a GPU often has to wait for multi-channel data to form a batch, resulting in high processing latency. Compared with mainstream CPUs and GPUs, dedicated neural network inference accelerators built on high-performance hardware such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) offer high parallelism, high energy efficiency, low latency, and programmability, and have good market prospects.
However, due to hardware resource limits, an FPGA- or ASIC-based dedicated neural network inference accelerator may restrict the length of input data and, correspondingly, the maximum length of a sentence that can be processed at a time. Therefore, a suitable sentence-breaking method is required to process and recognize a long input sentence in segments.
Traditional speech signal processing and segmentation methods
Speech signals are typically analyzed in both the time and frequency domains. The energy of human speech is concentrated in the 300-3400 Hz range, and a continuous speech signal is usually sampled over a selected spectral range to obtain a discrete time-domain signal that a computer can process. Due to the characteristics of human vocalization, the speech signal is a non-stationary, time-varying signal; in signal processing it is generally assumed to be stationary over short intervals. This short-time property arises because producing a sound requires slow muscle movements of the mouth lasting tens of milliseconds, during which the signal can be considered stationary and time-invariant. For short-time processing, the speech signal must therefore be "framed", with a frame length of generally 20-50 ms. Frequency-domain analysis of a speech signal generally uses the Fourier transform; before a frame is taken out for transformation, a "windowing" operation is performed, i.e., multiplication with a window function, so that the amplitude of the frame tapers to 0 at both ends (the taper benefits the Fourier transform and improves spectral resolution). Fig. 3 shows a schematic diagram of the windowing operation: after a Hamming window is superimposed, the original signal becomes the windowed signal on the right.
The cost of windowing is that the portions at the two ends of a frame are attenuated and carry less weight than the central portion. As compensation, frames are not taken back-to-back; instead, adjacent frames overlap. The time difference between the start positions of two adjacent frames is called the frame shift, which is usually half the frame length or a fixed value such as 10 ms.
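As an illustration of the framing and windowing just described, the following is a minimal numpy sketch (assuming a mono signal sampled at 16 kHz and the 25 ms / 10 ms frame length and shift used later in Fig. 8; all parameter values are illustrative):

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_len_ms=25, frame_shift_ms=10):
    """Split a 1-D speech signal into overlapping, Hamming-windowed frames."""
    frame_len = int(sample_rate * frame_len_ms / 1000)
    frame_shift = int(sample_rate * frame_shift_ms / 1000)
    window = np.hamming(frame_len)  # tapers both ends of each frame toward 0
    # assumes len(signal) >= frame_len
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    return np.stack([
        signal[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])  # shape: (num_frames, frame_len)
```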
In addition, regardless of the speaker's language and pitch, speech can be classified acoustically into unvoiced and voiced sounds. Compared with silence, unvoiced and voiced sounds have distinctive characteristics in short-time energy and short-time zero-crossing rate. For a time-domain signal s(t), the short-time energy E(t) and short-time zero-crossing rate Z(t) are defined as follows:
$$E(t) = \sum_{\tau} \left[ s(\tau)\, w(t - \tau) \right]^2$$

$$Z(t) = \frac{1}{2} \sum_{\tau} \left| \operatorname{sgn}\left[ s(\tau) \right] - \operatorname{sgn}\left[ s(\tau - 1) \right] \right| \, w(t - \tau)$$

wherein

$$\operatorname{sgn}\left[ s(\tau) \right] = \begin{cases} 1, & s(\tau) \geq 0 \\ -1, & s(\tau) < 0 \end{cases}$$

and w denotes the analysis window (e.g., the Hamming window above).
Unvoiced segments have low energy and a high zero-crossing rate, with waveform characteristics similar to random noise; voiced segments have high energy, a low zero-crossing rate, and periodic waveforms; silence has very low energy and zero-crossing rate. Fig. 4 shows an example of the relationship between the original waveform, short-time energy, and short-time zero-crossing rate. It plots the normalized original waveform (black), short-time energy (gray line), and short-time zero-crossing rate (dash-dot line) of the English word "direction". The trend of the short-time energy curve basically follows the original waveform, the short-time zero-crossing rate (zcr) has an obvious peak during unvoiced sounds, and both quantities are small in silent segments, providing a basis for locating silent segments.
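Both quantities can be computed per frame directly from the framed signal. The following is a minimal sketch (frames as produced by a framing step like the one above; sgn is taken as 1 for non-negative samples, per the definition):

```python
import numpy as np

def short_time_energy(frames):
    # E(t): sum of squared windowed samples within each frame
    return np.sum(frames ** 2, axis=1)

def short_time_zcr(frames):
    # Z(t): half the count of sign changes between consecutive samples
    signs = np.where(frames >= 0, 1, -1)
    return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)
```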
In addition, short-time analysis of speech signals also commonly includes the short-time Fourier transform (STFT), which analyzes the short-time spectral information of the signal in each frame. FIG. 5 shows an example of the time-frequency distribution after a short-time Fourier transform of speech (e.g., the word "direction").
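Such a time-frequency distribution can be computed with an off-the-shelf STFT routine; the following is a minimal sketch using scipy, where the stand-in signal and the parameter values (400-sample frames with a 160-sample hop, i.e., 25 ms / 10 ms at 16 kHz) are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft

signal = np.random.randn(16000)  # stand-in for 1 s of 16 kHz speech
f, t, Zxx = stft(signal, fs=16000, nperseg=400, noverlap=240)  # 25 ms frames, 10 ms shift
spectrogram = np.abs(Zxx)        # magnitude over (frequency, time), cf. FIG. 5
```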
Most traditional speech segmentation or endpoint detection methods are based on the short-time energy and short-time zero-crossing rate of speech, combined with thresholds and decision logic. Fig. 6A and 6B show schematic diagrams of the conventional single-threshold and double-threshold methods, respectively.
The single-threshold method screens the sentence break parameters (including but not limited to normalized short-time energy, short-time zero-crossing rate, etc.) against a specified threshold and selects the points where the parameter value drops below the threshold as breakpoints (segmentation points). As shown in Fig. 6A, the curve represents the sentence break parameter obtained from the original waveform and the dotted line represents the threshold. Point A crosses the threshold (mathematically, the differences between the threshold and the parameter values of the points immediately to the left and right of A have opposite signs), so a breakpoint is obtained there. Checking the relation of all sentence break parameter values to the threshold yields all breakpoints.
The single-threshold method has its limitations, especially in the case shown in Fig. 6B (taking an upward crossing of the threshold as the example), which is why the double-threshold method was derived.
As shown in Fig. 6B, point A is a point we wish to select, while point B, although it also crosses the threshold, is not a desired point; its crossing may be caused by noise. The double-threshold method therefore sets a second threshold and selects a point, such as point A, only if the sentence break parameter also exceeds this second threshold.
Point A in Figs. 6A and 6B is a breakpoint corresponding to the start of speech. Similarly, a breakpoint corresponding to the end of speech (i.e., the sentence break parameter crossing the corresponding threshold downward, shown as point C in Fig. 6B) can be selected with the same single-threshold/double-threshold method. A sketch of the double-threshold logic follows.
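The following is a minimal sketch of this double-threshold decision (an illustration of the logic described above under stated assumptions, not the patent's exact rules; params is assumed to be a 1-D numpy array of per-frame sentence break parameter values):

```python
import numpy as np

def double_threshold_breakpoints(params, thr_low, thr_high):
    """Keep an upward crossing of thr_low only if the excursion that follows
    also exceeds thr_high (rejecting noise spikes such as point B)."""
    breakpoints, i, n = [], 0, len(params)
    while i < n - 1:
        if params[i] < thr_low <= params[i + 1]:       # upward crossing (point A)
            j = i + 1
            while j < n and params[j] >= thr_low:      # walk along the excursion
                j += 1
            if params[i + 1: j].max() >= thr_high:     # confirmed by high threshold
                breakpoints.append(i)                  # speech-start breakpoint
                if j < n:
                    breakpoints.append(j - 1)          # speech-end breakpoint (point C)
            i = j
        else:
            i += 1
    return breakpoints
```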
As can be seen from the above, conventional endpoint detection methods typically break sentences based on the local waveform neighborhood of speech parameters. Although they can detect silent segments fairly accurately, they do not fully consider the temporal correlation of speech information and therefore cannot accurately locate semantically independent speech segments, such as standalone phrases.
Intelligent sentence-breaking scheme of the invention
As described above, deep neural networks, particularly recurrent neural networks (RNNs) such as the LSTM, are widely used in the field of speech recognition due to their superior performance. However, because of hardware resource limits, the maximum sentence length that a neural network processor deployed in a small system can process at a time is limited, so a suitable sentence-breaking method is required to obtain reasonable break points and recognize a long input sentence segment by segment.
Because conventional speech segmentation based on speech parameters alone cannot consider correlation over large time spans (e.g., between two waveform peaks), it cannot achieve accurate phrase segmentation. In view of this, the present invention provides an intelligent speech processing scheme based on an artificial neural network that can intelligently divide a long sentence into several independent short sentences, providing the premise for subsequent intelligent speech processing.
FIG. 7 illustrates an intelligent speech processing method according to one embodiment of the present invention.
In step S710, the input speech is framed to obtain sentence break parameters for a plurality of frames. Here, the input original sentence speech undergoes conventional framing and windowing (cf. the framing sketch above). Fig. 8 shows an example of windowing speech signal frames: the signal may be processed with a Hamming window with a frame length of 25 ms and a frame shift of 10 ms.
For each frame of data obtained by framing, sentence break parameters can be computed. The sentence break parameter may be any parameter known in the art for characterizing speech frames. In one embodiment, the sentence break parameters obtained for the plurality of frames may include at least one of: a normalized short-time energy for each of the plurality of frames; a short-time zero-crossing rate for each of the plurality of frames; a normalized short-time Fourier transform result for each of the plurality of frames; and a combination or weighted combination of any two or three of the above.
Subsequently, in step S720, the sentence break parameters of the acquired plurality of frames are input as feature values into a trained Artificial Neural Network (ANN), the ANN including a Recurrent Neural Network (RNN).
Here, the sentence break parameters of the frames obtained in the previous step may be processed and arranged into a form the ANN can read before being fed into the trained network. For example, the normalized short-time energy, short-time zero-crossing rate, and short-time Fourier transform results of the frames, or any combination thereof, may be fed into the ANN as one-dimensional or multidimensional vectors in frame order. The ANN then processes these feature vectors to extract the desired features.
Because speech signals are strongly correlated over time, the temporal-correlation property of the RNN can be used to find correlations between adjacent voiced segments, so that sentence break points separating relatively independent short sentences can be located more accurately. The RNN used here may be a conventional RNN, a long short-term memory (LSTM) neural network, or a gated recurrent unit (GRU) neural network. Through its gating mechanism, the LSTM not only memorizes past information but also selectively forgets unimportant information so as to model longer context, enabling more accurate breakpoint determination at a correspondingly higher computational cost.
In one embodiment, the artificial neural network may further include a convolutional neural network (CNN). Step S720 may then include inputting the sentence break parameters of the plurality of frames into a trained CNN, and feeding the output features of the CNN into the trained RNN. The CNN's local information extraction yields more precise features, which are then passed to the RNN with its stronger ability to extract global temporal correlations, enabling more accurate breakpoint selection.
Subsequently, in step S730, the output features of the RNN are fed into the fully-connected layer to judge whether each input frame is a sentence break point. The fully-connected layer may be regarded as part of the artificial neural network, e.g., its output layer.
FIG. 9 shows a schematic diagram of the neural network processing for determining breakpoints. As shown in Fig. 9, the sentence break parameters of the frames obtained by the conventional processing are reasonably rearranged and fed into the artificial neural network model as feature values. Local features are first extracted by a CNN comprising several hidden layers; temporal correlations are then extracted by an RNN (preferably an LSTM or GRU) comprising several hidden layers; finally, the result is sent to the fully-connected layer to judge whether each input frame is a frame containing a sentence break point. As shown in the figure, the LSTM or GRU may be a unidirectional network or a structurally more complex bidirectional network, allowing higher-precision sentence break determination.
Fig. 9 also shows the Softmax layer used during network training. In other words, the parameters of the trained ANN and fully-connected layer used in the intelligent speech processing scheme of the present invention are trained based on a network model that includes a Softmax layer following the fully-connected layer. The Softmax layer assigns a probability to each class, and these probabilities must sum to 1.0; this makes the output well-defined and convenient for backpropagation training and fine-tuning. The Softmax layer is not necessary for neural network inference, however, because inference only requires a qualitative judgment of whether each frame contains a sentence break point and involves no quantitative adjustment.
An example of feature value generation, extraction, and sentence break point determination is described below with reference to Fig. 9. As stated above, sentence breaking may be based on various parameters, such as a concatenation of the normalized short-time energy, normalized short-time zero-crossing rate, and normalized short-time Fourier transform results, or a combination in which the normalized short-time energy and zero-crossing rate are weighted differently. For example, for speech of duration 1 s, assuming a frame length of 20 ms and a frame shift of 10 ms, 100 frames are obtained; from these 100 frames of raw data one computes a normalized short-time energy of dimension (100, 1), a normalized short-time zero-crossing rate of dimension (100, 1), and a normalized short-time Fourier transform result of dimension (100, n), where n depends on the STFT parameter settings.
As the sentence break parameter, one may sequentially concatenate the normalized short-time energy, the normalized short-time zero-crossing rate, and the normalized short-time Fourier transform result into a matrix of dimension (100, n+2); or combine the normalized short-time energy and normalized short-time zero-crossing rate with different weighting coefficients and concatenate the result with the normalized short-time Fourier transform result into a matrix of dimension (100, n+1). A sketch of the first variant follows.
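The following is a minimal sketch of assembling the concatenated (num_frames, n+2) variant from framed speech (the max-scaling normalization is an assumption; the patent does not fix a particular normalization):

```python
import numpy as np

def sentence_break_features(frames, n_fft=256):
    """Concatenate normalized short-time energy, zero-crossing rate, and
    per-frame magnitude spectrum into a (num_frames, n + 2) matrix."""
    energy = np.sum(frames ** 2, axis=1)
    signs = np.where(frames >= 0, 1, -1)
    zcr = 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # n = n_fft // 2 + 1 bins
    norm = lambda x: x / (np.abs(x).max() + 1e-8)        # max-scale to [0, 1]
    return np.concatenate(
        [norm(energy)[:, None], norm(zcr)[:, None], norm(spec)], axis=1
    )
```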
After the sentence break parameters are extracted, the input to the neural network has dimension (100, i), where 100 is the number of frames of the input speech, i.e., the time dimension of the input data, and i is the feature dimension. The CNN layer is designed so that the time dimension of its output remains 100; denote its output feature dimension by j (depending on the convolution kernel design; multiple kernels should be unrolled and concatenated). After the RNN layer, the time dimension of the output is still 100, with feature dimension k (depending on the designed size of the RNN layer). The fully-connected layer has dimension (k, 2), so its output is (100, 2); after the Softmax layer, this can be read as the probabilities of a binary result for each of the 100 frames, where label 1 denotes a segmentation point (breakpoint) and label 0 a non-segmentation point. A deep neural network for finding the breakpoints of input speech can thus be trained. During inference, sentence break parameters of the above (100, i) form can likewise be input, and the judgment of whether each of the 100 frames contains a sentence break point is obtained directly from a network that does not include the Softmax layer, as sketched below.
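The following PyTorch sketch mirrors the dimension walkthrough above: CNN and RNN layers that preserve the time dimension, a (k, 2) fully-connected layer, and a Softmax implied only at training time. The layer widths conv_dim and rnn_dim, the kernel size, and the use of a bidirectional LSTM are illustrative assumptions, not values fixed by the patent:

```python
import torch
import torch.nn as nn

class BreakpointNet(nn.Module):
    """Per-frame breakpoint classifier: CNN -> (bi)LSTM -> fully-connected."""
    def __init__(self, feat_dim, conv_dim=64, rnn_dim=128):
        super().__init__()
        # 1-D convolution over time; padding keeps the time dimension at 100
        self.cnn = nn.Conv1d(feat_dim, conv_dim, kernel_size=3, padding=1)
        self.rnn = nn.LSTM(conv_dim, rnn_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * rnn_dim, 2)    # (k, 2): break / no-break logits

    def forward(self, x):                      # x: (batch, time, feat_dim) = (1, 100, i)
        h = torch.relu(self.cnn(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.rnn(h)                     # (batch, 100, 2 * rnn_dim)
        return self.fc(h)                      # (batch, 100, 2) per-frame logits

logits = BreakpointNet(feat_dim=131)(torch.randn(1, 100, 131))  # e.g. i = n + 2 = 131
# training: per-frame cross-entropy over the 2 logits (Softmax implied);
# inference: argmax per frame, no Softmax layer needed
```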
In practical applications, the output probability values can be screened to judge whether the point corresponding to each frame of the original speech is a breakpoint. The number of breakpoints output at this step may deliberately be large, to ensure that the distance between adjacent breakpoints is smaller than the maximum length that the (length-limited) speech recognition can process; these breakpoints are therefore called initially selected breakpoints. Thus, step S730 may include feeding the output features of the RNN into a fully-connected layer to judge whether each input frame is an initially selected sentence break point, and screening the initially selected sentence break points based on a predetermined rule to select the final sentence break points.
In one embodiment, screening the initially selected sentence break points based on a predetermined rule to select final sentence break points may include: clustering the initially selected sentence break points and selecting the cluster center of each silent segment as a screened final sentence break point. In another embodiment, it comprises selecting final sentence break points based on at least one of the following predetermined constraints: the distance between adjacent sentence break points is less than the maximum processable sentence length; and the number of final sentence break points is minimized.
Since the physical meaning of an initially selected breakpoint is a point within a silent segment, and a silent segment usually lasts longer than one frame, the initially selected breakpoints can be clustered, for example with the k-means method, taking the cluster center of each silent segment as a screened breakpoint.
If the distance between adjacent screened breakpoints exceeds the longest processable sentence length, return to the clustering step above, modify the number of cluster centers, and re-cluster and re-screen, so as to ensure that every segment is shorter than the maximum processable sentence length. If the distances between adjacent screened breakpoints are all less than the longest processable sentence length, select the final breakpoints from the screened breakpoints such that a) the distance between final breakpoints also stays below the longest processable sentence length, and b) the number of final breakpoints is as small as possible, so that each clause after sentence breaking is as long as possible and the accuracy impact of segmented speech recognition is reduced. A sketch of this screening loop follows.
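The following is a minimal sketch of this screening loop using k-means from scikit-learn (the initial number of clusters and the k-means settings are assumptions; the inputs are the frame indices of the initially selected breakpoints):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_final_breakpoints(initial_points, total_frames, max_len, n_clusters=2):
    """Cluster initially selected break frames; each cluster centre is one
    candidate breakpoint. Add centres until every segment fits max_len."""
    pts = np.asarray(initial_points, dtype=float).reshape(-1, 1)
    while n_clusters <= len(initial_points):
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(pts)
        bps = np.sort(km.cluster_centers_.ravel().astype(int))
        seg_lens = np.diff(np.concatenate(([0], bps, [total_frames])))
        if seg_lens.max() < max_len:      # every clause is short enough
            return bps                    # fewest breakpoints meeting the limit
        n_clusters += 1                   # re-cluster with one more centre
    return np.unique(np.asarray(initial_points, dtype=int))
```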
FIG. 10 shows a schematic diagram of an intelligent speech processing apparatus according to one embodiment of the invention. As shown in the figure, the intelligent speech processing apparatus 1000 includes a parameter extraction device 1010, a neural network computing device 1020, and a sentence break point judgment device 1030. The parameter extraction device 1010 may be configured to frame the input speech to obtain sentence break parameters for a plurality of frames. The neural network computing device 1020 may be configured to input the sentence break parameters of the plurality of frames as feature values into a trained artificial neural network (ANN) comprising a recurrent neural network (RNN). The sentence break point judgment device 1030 is configured to feed the output features of the RNN into the fully-connected layer to judge whether each input frame is a sentence break point.
In one embodiment, the sentence break parameters of the plurality of frames acquired by the parameter extraction device 1010 may include at least one of the following: a normalized short-time energy for each of the plurality of frames; a short-time zero-crossing rate for each of the plurality of frames; a normalized short-time Fourier transform result for each of the plurality of frames; and a combination or weighted combination of any two or three of the above.
In one embodiment, the neural network computing device 1020 may input the sentence break parameters of the plurality of frames into a trained convolutional neural network (CNN) and the trained RNN to obtain output features for input into the fully-connected layer. The RNN may preferably be a long short-term memory (LSTM) neural network or a gated recurrent unit (GRU) neural network.
In one embodiment, the sentence break point judgment device 1030 may include an initial sentence break point judgment device for feeding the output features of the RNN into a fully-connected layer to judge whether each input frame is an initially selected sentence break point, and a final sentence break point screening device for screening the initially selected sentence break points based on a predetermined rule to select the final sentence break points. The final screening device may also cluster the initially selected sentence break points and select the cluster center of each silent segment as a screened final sentence break point. Preferably, it may further select final sentence break points based on at least one of the following predetermined constraints: the distance between adjacent sentence break points is less than the maximum processable sentence length; and the number of final sentence break points is minimized.
Likewise, the parameters of the ANN and the fully-connected layer are trained based on a network model that includes a Softmax layer following the fully-connected layer.
It should be noted that the intelligent speech processing scheme of the present invention may be incorporated into a speech processing scheme that uses an artificial neural network, in particular an RNN, to intelligently recognize speech. The RNN later used for speech recognition can thus be reused (for example, with different trained parameters), or the computing module for high-parallelism neural network computation can be shared, so that intelligent sentence breaking for speech recognition is realized without extra overhead while accounting for the limited sentence length and its influence on accuracy.
Thus, the invention also relates to a computing platform comprising a high-parallelism computation module for neural network inference, wherein the computing platform is used to implement the speech recognition method incorporating the sentence-breaking scheme above. Preferably, the computing platform is implemented with an ASIC, FPGA, or GPU. For example, the computing platform may be an SoC that includes a dedicated neural network processor, or a chip implemented with an ASIC or FPGA.
FIG. 11 is a schematic diagram of a computing device that can be used to implement the intelligent speech processing method described above according to one embodiment of the invention.
Referring to fig. 11, computing device 1100 includes memory 1110 and processor 1120.
The processor 1120 may be a multi-core processor or may include multiple processors. In some embodiments, the processor 1120 may comprise a general-purpose main processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, the processor 1120 may be implemented using custom circuitry, such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
The memory 1110 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions needed by the processor 1120 or other modules of the computer. The permanent storage may be a readable and writable storage device, and may be non-volatile, so that stored instructions and data are not lost even after the computer is powered off. In some embodiments, the permanent storage is a mass storage device (e.g., a magnetic or optical disk, or flash memory); in other embodiments, it may be a removable storage device (e.g., a floppy disk or optical drive). The system memory may be a readable and writable, volatile memory device, such as dynamic random access memory, which may store instructions and data that some or all of the processors require at runtime. In addition, the memory 1110 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) as well as magnetic and/or optical disks. In some embodiments, the memory 1110 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., SD card, mini-SD card, Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
Memory 1110 has stored thereon executable code that, when executed by processor 1120, causes processor 1120 to perform the intelligent speech processing method described above.
In actual use, the computing device 1100 described above may be a general-purpose computing device comprising a mass memory 1110 and a CPU 1120. The general-purpose computing device can be combined with a fixed-point computing platform dedicated to neural network computation and implemented at least in part by digital circuits, to achieve efficient neural network computation. In one embodiment, the neural network computing system of the present invention may be implemented in a system on chip (SoC) that includes a general-purpose processor, memory, and digital circuits.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the steps defined in the method of the invention described above.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (20)

1. An intelligent speech processing method, comprising:
performing framing processing on input voice to obtain sentence break parameters of a plurality of frames;
inputting sentence break parameters of the plurality of frames as feature values into a trained Artificial Neural Network (ANN), the ANN comprising a Recurrent Neural Network (RNN);
and feeding the output characteristics of the RNN into a fully-connected layer to judge whether each input frame is a sentence break point.
2. The method of claim 1, wherein inputting sentence break parameters for the plurality of frames into the trained ANN comprises:
inputting sentence break parameters for the plurality of frames into a trained Convolutional Neural Network (CNN); and
feeding output characteristics of the CNN into the trained RNN.
3. The method of claim 1, wherein the obtained sentence break parameters for the plurality of frames comprise at least one of:
a normalized short-time energy for each frame of the plurality of frames;
a short-time zero-crossing rate for each of the plurality of frames;
a normalized short-time Fourier transform result for each of the plurality of frames; and
a combination or weighted combination of any two or three of the above three.
4. The method of claim 1, wherein feeding the output characteristics of the RNN into a fully-connected layer to determine whether each input frame is a sentence break point comprises:
feeding the output characteristics of the RNN into a fully-connected layer to judge whether each input frame is an initially selected sentence break point; and
screening the initially selected sentence break points based on a predetermined rule to select final sentence break points.
5. The method of claim 4, wherein screening the initially selected sentence break points based on a predetermined rule to select final sentence break points comprises:
clustering the initially selected sentence break points to select the cluster center of each silent segment as a screened final sentence break point.
6. The method of claim 4, wherein screening the initially selected sentence break points based on a predetermined rule to select final sentence break points comprises:
selecting final sentence break points based on at least one of the following predetermined constraints:
the distance between adjacent sentence break points is less than the maximum processable sentence length; and
the number of final sentence break points is the minimum.
7. The method of claim 1, wherein the RNN is a long short-term memory (LSTM) neural network or a gated recurrent unit (GRU) neural network.
8. The method of claim 1, wherein the ANN and the parameters of the fully-connected layer are trained based on a network model that includes a Softmax layer following the fully-connected layer.
9. An intelligent speech processing device comprising:
the parameter extraction device is used for performing framing processing on the input voice to acquire sentence break parameters of a plurality of frames;
neural network computing means for inputting sentence break parameters of the plurality of frames as feature values into a trained Artificial Neural Network (ANN), the ANN including a Recurrent Neural Network (RNN);
and the sentence break point judgment device is used for feeding the output characteristics of the RNN into the fully-connected layer to judge whether each input frame is a sentence break point.
10. The apparatus of claim 9, wherein the neural network computing device inputs sentence break parameters for the plurality of frames into a trained Convolutional Neural Network (CNN) and the trained RNN to obtain output features for input into the fully-connected layer.
11. The apparatus according to claim 9, wherein the sentence-break parameters of the plurality of frames acquired by the parameter extraction means include at least one of:
a normalized short-time energy for each frame of the plurality of frames;
a short-time zero-crossing rate for each of the plurality of frames;
a normalized short-time Fourier transform result for each of the plurality of frames; and
a combination or weighted combination of any two or three of the above three.
12. The apparatus as claimed in claim 9, wherein said sentence break point judgment means comprises:
initial sentence break point judgment means for feeding the output characteristics of the RNN into a fully-connected layer to judge whether each input frame is an initially selected sentence break point; and
final sentence break point screening means for screening the initially selected sentence break points based on a predetermined rule to select final sentence break points.
13. The apparatus of claim 12, wherein said final sentence break point screening means is further for:
clustering the initially selected sentence break points to select the cluster center of each silent segment as a screened final sentence break point.
14. The apparatus of claim 12, wherein said final sentence break point screening means is further for:
selecting final sentence break points based on at least one of the following predetermined constraints:
the distance between adjacent sentence break points is less than the maximum processable sentence length; and
the number of final sentence break points is the minimum.
15. The apparatus of claim 9, wherein the RNN is a long short-term memory (LSTM) neural network or a gated recurrent unit (GRU) neural network.
16. The apparatus of claim 9, wherein the ANN and parameters of a fully-connected layer are trained based on a network model that includes a Softmax layer that follows the fully-connected layer.
17. A computing platform comprising a high-parallelism computation module for neural network inference, wherein the computing platform is configured to implement the intelligent speech processing method of any of claims 1-8.
18. The computing platform of claim 17, wherein the computing platform is implemented by an ASIC, FPGA, or GPU.
19. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-8.
20. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1-8.
CN201810575092.4A 2018-06-06 2018-06-06 Intelligent voice processing method and device Pending CN110634470A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810575092.4A CN110634470A (en) 2018-06-06 2018-06-06 Intelligent voice processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810575092.4A CN110634470A (en) 2018-06-06 2018-06-06 Intelligent voice processing method and device

Publications (1)

Publication Number Publication Date
CN110634470A 2019-12-31

Family

ID=68966150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810575092.4A Pending CN110634470A (en) 2018-06-06 2018-06-06 Intelligent voice processing method and device

Country Status (1)

Country Link
CN (1) CN110634470A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160099010A1 (en) * 2014-10-03 2016-04-07 Google Inc. Convolutional, long short-term memory, fully connected deep neural networks
US20170092297A1 (en) * 2015-09-24 2017-03-30 Google Inc. Voice Activity Detection
US20170103752A1 (en) * 2015-10-09 2017-04-13 Google Inc. Latency constraints for acoustic modeling
CN107492382A (en) * 2016-06-13 2017-12-19 阿里巴巴集团控股有限公司 Voiceprint extraction method and device based on a neural network
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity recognition method based on deep learning
CN107369439A (en) * 2017-07-31 2017-11-21 北京捷通华声科技股份有限公司 Voice wake-up method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEI Xiaodong et al., "Endpoint Detection Method for Noisy Speech Using Cepstral Features", Journal of Shanghai Jiao Tong University *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112837704A (en) * 2021-01-08 2021-05-25 昆明理工大学 Voice background noise identification method based on endpoint detection
CN117334198A (en) * 2023-09-14 2024-01-02 中关村科学城城市大脑股份有限公司 Speech signal processing method, device, electronic equipment and computer readable medium
CN117334198B (en) * 2023-09-14 2024-04-30 中关村科学城城市大脑股份有限公司 Speech signal processing method, device, electronic equipment and computer readable medium

Similar Documents

Publication Publication Date Title
Sehgal et al. A convolutional neural network smartphone app for real-time voice activity detection
CN109243491B (en) Method, system and storage medium for emotion recognition of speech in frequency spectrum
US11205444B2 (en) Utilizing bi-directional recurrent encoders with multi-hop attention for speech emotion recognition
US9251783B2 (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
Su Vocal melody extraction using patch-based CNN
Agarwal et al. Performance of deer hunting optimization based deep learning algorithm for speech emotion recognition
Langari et al. Efficient speech emotion recognition using modified feature extraction
Gupta et al. Pitch-synchronous single frequency filtering spectrogram for speech emotion recognition
CN113870863A (en) Voiceprint recognition method and device, storage medium and electronic equipment
CN110634470A (en) Intelligent voice processing method and device
Laghari et al. Robust speech emotion recognition for sindhi language based on deep convolutional neural network
Sunija et al. Comparative study of different classifiers for Malayalam dialect recognition system
Liu et al. Hierarchical component-attention based speaker turn embedding for emotion recognition
Yang Design of service robot based on user emotion recognition and environmental monitoring
Xu Intelligent automobile auxiliary propagation system based on speech recognition and AI driven feature extraction techniques
Rituerto-González et al. End-to-end recurrent denoising autoencoder embeddings for speaker identification
Leelavathi et al. Speech emotion recognition using LSTM
CN115171878A (en) Depression detection method based on BiGRU and BiLSTM
CN112951270B (en) Voice fluency detection method and device and electronic equipment
Liu et al. A New Speech Encoder Based on Dynamic Framing Approach.
Hamandouche Speech Detection for noisy audio files
CN116959421B (en) Method and device for processing audio data, audio data processing equipment and medium
Lei et al. Multilingual customized keyword spotting using similar-pair contrastive learning
WO2023108459A1 (en) Training and using a deep learning model for transcript topic segmentation
Sterling et al. End-to-end emotion recognition from speech with deep frame embeddings and neutral speech handling

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20200903

Address after: Unit 01-19, 10 / F, 101, 6 / F, building 5, yard 5, Anding Road, Chaoyang District, Beijing 100029

Applicant after: Xilinx Electronic Technology (Beijing) Co.,Ltd.

Address before: 100083, 17 floor, four building four, 1 Wang Zhuang Road, Haidian District, Beijing.

Applicant before: BEIJING DEEPHI INTELLIGENT TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20191231