Disclosure of Invention
In the process of implementing the invention, the inventor first considered the Speex codec in the Ogg format, because when clean speech is encoded, decoded, and then used directly for wake-up and speech recognition, the overall performance drops only slightly and remains basically usable. However, when the original speech signal is encoded, decoded, and then passed through front-end signal processing, the signal distortion is severe, and the performance of speech wake-up and speech recognition degrades sharply.
The ADPCM algorithm was tried next. Compared with DPCM, ADPCM is an adaptive differential coding technique whose advantage is its adaptive capability: it can adaptively change the step value, using small step values for small differences and large step values for large differences. In implementing the invention, the inventor found that ADPCM performs better in most cases, but an abrupt change in the signal, for example a frame lost during transmission, introduces a large nonlinear change into the signal.
In the invention, the audio is compressed with DPCM, a low-loss compression coding scheme. DPCM compression guarantees extremely low audio loss and causes only minimal spectral distortion; verification on a large number of test sets shows that its influence on speech wake-up and recognition is negligible. Moreover, the computational complexity of the algorithm is extremely low: on existing hardware such as the Cortex-M4 series, CPU occupancy does not exceed 10%. In addition, DPCM achieves a compression ratio of 2 to 4, which satisfies most transmission requirements.
The principle of DPCM is to exploit the correlation between signal sampling points: the difference between each sampling point and its neighbor is small, and this property enables compression. Typically, the value of the first sampling point is stored, and then the difference between each sampling point and the previous one is stored as the compressed data. For restoration, the first sample plus its difference yields the second sample, and adding each successive difference in turn yields the values of all sampling points, completing the speech restoration. Because adjacent sampling points differ only slightly, the differences can be stored with a small number of bits, which realizes the compression.
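The store-the-first-sample-then-differences scheme described above can be sketched as follows (an illustrative sketch; the function names are hypothetical and not from the patent):

```python
def dpcm_encode(samples):
    """Keep the first sample; store each later sample as a difference."""
    encoded = [samples[0]]
    for i in range(1, len(samples)):
        encoded.append(samples[i] - samples[i - 1])
    return encoded

def dpcm_decode(encoded):
    """Rebuild the signal by accumulating the differences onto the first value."""
    decoded = [encoded[0]]
    for diff in encoded[1:]:
        decoded.append(decoded[-1] + diff)
    return decoded

pcm = [1000, 1003, 1001, 1004, 1002]
assert dpcm_decode(dpcm_encode(pcm)) == pcm
```

Because adjacent sampling points are close in value, the stored differences are small and can be held in fewer bits than the raw samples.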
An embodiment of the present invention provides a speech encoding and decoding method and apparatus, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a speech encoding method, including:
predetermining a fixed step value for encoding a speech signal;
calculating the difference between the speech signal at the current moment and the predicted speech signal at the previous moment;
and encoding the difference based on the fixed step value to obtain an encoding result.
In a second aspect, an embodiment of the present invention provides a speech decoding method, including:
receiving an encoding result obtained by encoding the speech signal at the current moment using the speech encoding method of the embodiment of the invention;
decoding the encoding result based on the fixed step value to obtain a decoding difference value;
and adding the decoding difference value to the predicted speech signal at the previous moment to obtain a decoding result.
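The per-sample encode and decode operations of the first and second aspects can be sketched as follows, assuming that "coding based on the fixed step value" means uniform quantization of the difference by that step; the function names and the step value 64 are illustrative assumptions, not values from the patent:

```python
FIXED_STEP = 64  # hypothetical fixed step value agreed by both ends

def encode(current_sample, predicted_previous):
    """First aspect: take the difference, then quantize it uniformly by the fixed step."""
    diff = current_sample - predicted_previous
    return round(diff / FIXED_STEP)

def decode(code, predicted_previous):
    """Second aspect: rescale the code by the fixed step, then add the prediction."""
    return code * FIXED_STEP + predicted_previous

code = encode(1130, 1000)   # difference 130 quantizes to code 2
restored = decode(code, 1000)   # reconstructs 1128
```

Note the small reconstruction error (1128 instead of 1130): with a fixed step and rounding, the per-sample error is bounded by half the step value.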
In a third aspect, the present invention provides a speech signal encoding and decoding method, including:
the encoding step performed at the first terminal:
predetermining a fixed step value for encoding a speech signal;
calculating the difference between the speech signal at the current moment and the predicted speech signal at the previous moment;
encoding the difference based on the fixed step value to obtain an encoding result;
the decoding step performed at the second terminal:
receiving the encoding result;
decoding the encoding result based on the fixed step value to obtain a decoding difference value;
and adding the decoding difference value to the predicted speech signal at the previous moment to obtain a decoding result.
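Taken together, the encoding at the first terminal and the decoding at the second terminal can be illustrated as a stream-level round trip (a minimal sketch; the step value 16, the initial prediction of 0 shared by both ends, and the function names are assumptions for illustration):

```python
STEP = 16  # hypothetical fixed step value known to both terminals

def terminal1_encode(pcm):
    codes, predicted = [], 0  # both ends start from the same initial prediction
    for x in pcm:
        code = round((x - predicted) / STEP)
        codes.append(code)
        predicted += code * STEP  # track what the decoder will reconstruct
    return codes

def terminal2_decode(codes):
    out, predicted = [], 0
    for code in codes:
        predicted += code * STEP
        out.append(predicted)
    return out

pcm = [0, 40, 96, 80, 32]
restored = terminal2_decode(terminal1_encode(pcm))
# with uniform quantization, each restored sample is within STEP/2 of the original
assert all(abs(a - b) <= STEP // 2 for a, b in zip(pcm, restored))
```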
In a fourth aspect, the present invention provides a speech encoding apparatus, comprising: a fixed step value determining module, configured to determine in advance a fixed step value for encoding a speech signal; a difference calculating module, configured to calculate the difference between the speech signal at the current moment and the predicted speech signal at the previous moment; and an encoding logic module, configured to encode the difference based on the fixed step value to obtain an encoding result.
In a fifth aspect, the present invention provides a speech decoding apparatus, comprising: a signal receiving module, configured to receive an encoding result obtained by encoding the speech signal at the current moment using the speech encoding method of the embodiment of the invention; a decoding logic module, configured to decode the encoding result based on the fixed step value to obtain a decoding difference value; and an adder module, configured to add the decoding difference value to the predicted speech signal at the previous moment to obtain a decoding result.
In a sixth aspect, an embodiment of the present invention provides a storage medium, where one or more programs including execution instructions are stored, and the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the above-described speech signal encoding methods or speech signal decoding methods of the present invention.
In a seventh aspect, an electronic device is provided, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform any one of the speech signal encoding method or the speech signal decoding method of the present invention.
In an eighth aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a storage medium, and the computer program includes program instructions, which, when executed by a computer, cause the computer to execute any one of the speech signal encoding method or the speech signal decoding method described above.
The embodiments of the invention have the following beneficial effects: by adopting a fixed step value, the nonlinear change introduced into the original speech signal is reduced, so that the result of subsequent front-end signal processing (such as echo cancellation and beam forming) is not affected by amplified nonlinear change, which in turn avoids degrading back-end speech recognition.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
As shown in Fig. 1, an embodiment of the present invention provides a speech signal encoding method, including:
S11, predetermining a fixed step value for encoding the speech signal.
S12, calculating the difference between the speech signal at the current moment and the predicted speech signal at the previous moment; the speech signal at the current moment is PCM signal data based on the original speech signal.
S13, encoding the difference based on the fixed step value to obtain an encoding result.
In the embodiment of the invention, adopting a fixed step value reduces the nonlinear change of the original speech signal, so that the result of subsequent front-end signal processing (such as echo cancellation and beam forming) is not affected by amplified nonlinear change, thereby avoiding any influence on back-end speech recognition.
In some embodiments, the method further comprises: quantizing the difference with a preset quantization factor to compress it. In this case, encoding the difference based on the fixed step value to obtain an encoding result becomes: encoding the quantized difference based on the fixed step value to obtain the encoding result.
Since the difference may be too large to store directly, a quantization factor is typically set; for example, with a quantization factor of 100, a difference of 400 is mapped to 4, so the compressed data can be stored with fewer bits. However, the inventors further found that quantization errors easily arise in this case. The quantized differences are confined to a fixed range; for example, with 4 bits the code range is -8 to 7, which, multiplied by the quantization factor, bounds the representable difference range. Therefore, a difference that is too large or too small falls outside the range and can only be represented by the boundary value, and this causes a quantization error.
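The quantization factor and the boundary-value saturation described above can be illustrated as follows (a sketch assuming the example factor of 100 and a 4-bit code range of -8 to 7; the names are hypothetical):

```python
QUANT = 100                 # quantization factor from the example above
CODE_MIN, CODE_MAX = -8, 7  # range representable by a 4-bit signed code

def quantize(diff):
    """Map a difference to a small code; out-of-range differences saturate."""
    code = round(diff / QUANT)
    # a difference outside [-800, 700] can only be stored as the boundary
    # value, which is exactly the quantization error described in the text
    return max(CODE_MIN, min(CODE_MAX, code))

assert quantize(400) == 4      # the example from the text: 400 -> 4
assert quantize(2000) == 7     # saturated: restored as 700, not 2000
assert quantize(-1200) == -8   # saturated at the lower boundary
```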
If a quantization error occurs in one sample, all following samples are restored on top of the erroneous value, so the earlier error accumulates and affects all subsequent restoration. The inventors therefore propose the following scheme: compress the current sample against the last restored value, so that even if a quantization error occurs, only one sampling point is affected and the restoration of subsequent sampling points is unaffected.
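The proposed scheme, predicting each sample from the last restored value rather than from the original previous sample, can be sketched as follows; the step value, code range, and sample values are illustrative assumptions:

```python
STEP = 100
CODE_MIN, CODE_MAX = -8, 7

def quantize(diff):
    return max(CODE_MIN, min(CODE_MAX, round(diff / STEP)))

def encode_closed_loop(pcm):
    """Predict each sample from the locally *decoded* previous value, so a
    saturation error corrupts one sample instead of all that follow."""
    codes, predicted = [], 0
    for x in pcm:
        code = quantize(x - predicted)
        codes.append(code)
        predicted += code * STEP  # mirror exactly what the decoder rebuilds
    return codes

def decode(codes):
    out, predicted = [], 0
    for code in codes:
        predicted += code * STEP
        out.append(predicted)
    return out

# a sudden jump far beyond the code range saturates the quantizer...
pcm = [0, 2000, 2100, 2200]
restored = decode(encode_closed_loop(pcm))
# ...but the error shrinks sample by sample (1300, 700, 100) instead of
# accumulating, because prediction tracks the decoder's actual output
assert restored == [0, 700, 1400, 2100]
```

Because the encoder's predictor is the decoded value the receiver will actually hold, both ends stay synchronized even across a saturation event, which is the containment property the inventors describe.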
In view of the above problem, in an embodiment of the present invention, the predicted speech signal at the previous moment is obtained by decoding the encoding result.
Fig. 2 is a flowchart of a speech signal decoding method according to an embodiment of the present invention, the method including:
S21, receiving an encoding result obtained by encoding the speech signal at the current moment using the speech encoding method of any of the foregoing embodiments of the present invention;
S22, decoding the encoding result based on the fixed step value to obtain a decoding difference value;
and S23, adding the decoding difference value to the predicted speech signal at the previous moment to obtain a decoding result.
Fig. 3 is a flowchart of a speech signal encoding/decoding method according to an embodiment of the present invention, the method including:
S31, the encoding step performed at the first terminal:
S311, predetermining a fixed step value for encoding the speech signal;
S312, calculating the difference between the speech signal at the current moment and the predicted speech signal at the previous moment; the speech signal at the current moment is PCM signal data based on the original speech signal; the predicted speech signal at the previous moment is obtained by decoding the encoding result.
S313, encoding the difference based on the fixed step value to obtain an encoding result;
S32, the decoding step performed at the second terminal:
S321, receiving the encoding result;
S322, decoding the encoding result based on the fixed step value to obtain a decoding difference value;
and S323, adding the decoding difference value to the predicted speech signal at the previous moment to obtain a decoding result.
Fig. 4 is a schematic diagram of an embodiment of the speech signal codec algorithm of the present invention. The left end of the transmission link is device 1 and the right end is device 2; "device" is used in a broad sense and may refer to two different devices or to two CPUs.
X(n) is the original PCM signal, and X(n-1) is the predicted estimate of the previous frame; D(n) is the difference X(n) minus X(n-1); S(n) is the step value used to quantize the encoding; Y(n) is the output encoded data.
Step 1: DPCM encoding is first performed on the original PCM signal to obtain encoded data for transmission. The detailed encoding flow is as follows:
1) The stored prediction estimate of the previous frame is subtracted from the original signal to obtain the signal difference.
2) The step value calculation determines the quantization scale of the encoding. The algorithm supports both non-uniform and uniform quantization; to reduce nonlinear change of the original speech signal, uniform quantization is used here, and the step value is a fixed value S(n).
The step value is chosen from the characteristics of the speech signal and the desired compression ratio. The bit width of the speech signal is usually 16 bits; at a compression ratio of 2, the signal must be compressed to 8 bits, so the information must be quantized down to 8 bits accordingly. The step value of each quantization is obtained in this way.
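The text above only states that a 16-bit signal at a compression ratio of 2 must be quantized down to 8 bits; one plausible way to derive a fixed step value from those figures is sketched below. This formula is an assumption for illustration, not the patent's exact derivation:

```python
def fixed_step(signal_bits=16, compression_ratio=2):
    """Derive a fixed step so the worst-case difference fits the code width.

    The difference of two signed 16-bit samples spans 17 bits; an 8-bit
    code then needs a step of 2**(17 - 8) = 512 to cover the full range.
    (Hypothetical reading of the bit-width argument in the text.)
    """
    code_bits = signal_bits // compression_ratio   # e.g. 16 / 2 = 8 bits
    diff_bits = signal_bits + 1                    # worst-case difference width
    return 2 ** (diff_bits - code_bits)

assert fixed_step(16, 2) == 512
```

In practice a smaller step could be used if the signal statistics guarantee that differences never approach the worst case, trading headroom for lower quantization error.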
3) The difference from step 1) is quantized and encoded with the quantization scale obtained in step 2) to produce the final encoded value Y(n).
4) The steps above complete one encoding pass. The next frame needs the predicted value of the current frame, which must be obtained by running the decoding logic unit (this prediction serves the next frame X(n+1)). The decoding operation is the inverse of step 3) and performs the same quantization with the same quantization scale.
Step 2: device 2 receives the encoded data and performs the decoding operation, which mirrors the encoding operation:
1) The encoded data is sent to the step value calculator; as in encoding, uniform quantization is adopted, so the step value is constant.
2) The step value and the encoded data are sent together to the decoding logic unit to obtain the difference.
3) The difference is added to the prediction signal of the previous frame to obtain the decoded original data. The whole process is the inverse of encoding.
Step 3: the decoded data is sent to front-end signal processing, such as echo cancellation and beam forming, to obtain the processed speech signal.
Step 4: the speech signal obtained in step 3 is sent to the recognition engine or the wake-up engine.
The codec algorithm of the embodiments of the invention first guarantees low signal loss, so that signal processing is not affected. Because the reliability of signal processing is assured, the applicability of the method is also greatly expanded. For example, on some CPUs without DMA, data can be compressed before transmission, greatly reducing CPU occupancy. In addition, when a low-end chip communicates with a high-end chip, data can be compressed and transmitted to the high-end chip for signal processing, preserving the low cost and low power consumption of the low-end chip.
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
As shown in fig. 5, an embodiment of the present invention further provides a speech signal encoding apparatus 500, including:
a fixed step value determination module 510 for determining a fixed step value for encoding the speech signal in advance.
A difference calculation module 520, configured to calculate the difference between the speech signal at the current moment and the predicted speech signal at the previous moment; the speech signal at the current moment is PCM signal data based on the original speech signal, and the predicted speech signal at the previous moment is obtained by decoding the encoding result.
An encoding logic module 530, configured to encode the difference value based on the fixed step value to obtain an encoding result.
As shown in fig. 6, an embodiment of the present invention further provides a speech signal decoding apparatus 600, including:
the signal receiving module 610 is configured to receive an encoding result obtained by encoding the speech signal at the current time by using the speech encoding method according to any of the foregoing embodiments of the present invention.
A decoding logic module 620, configured to decode the encoding result based on the fixed step value to obtain a decoding difference value.
An adder module 630, configured to add the decoded difference to the predicted speech signal at the previous time to obtain a decoded result.
In some embodiments, the present invention provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, and the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the above-mentioned speech signal encoding methods or speech signal decoding methods of the present invention.
In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any one of the above speech signal encoding method or speech signal decoding method.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speech signal encoding method or a speech signal decoding method.
In some embodiments, an embodiment of the present invention further provides a storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the speech signal encoding method or the speech signal decoding method.
The speech signal encoding apparatus or the speech signal decoding apparatus according to the above-mentioned embodiment of the present invention may be configured to execute the speech signal encoding method or the speech signal decoding method according to the above-mentioned embodiment of the present invention, and accordingly achieve the technical effect achieved by the speech signal encoding method or the speech signal decoding method according to the above-mentioned embodiment of the present invention, which is not described herein again. In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).
Fig. 7 is a schematic diagram of a hardware structure of an electronic device for performing a speech signal encoding/decoding method according to another embodiment of the present application, and as shown in fig. 7, the electronic device includes:
one or more processors 710 and a memory 720, one processor 710 being illustrated in fig. 7.
The apparatus for performing the voice signal encoding/decoding method may further include: an input device 730 and an output device 740.
The processor 710, the memory 720, the input device 730, and the output device 740 may be connected by a bus or other means, such as the bus connection in fig. 7.
The memory 720, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the speech signal encoding/decoding method in the embodiments of the present application. The processor 710 executes various functional applications of the server and data processing, i.e., implements the voice signal encoding/decoding method of the above-described method embodiment, by executing nonvolatile software programs, instructions, and modules stored in the memory 720.
The memory 720 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the voice signal encoding/decoding method apparatus, and the like. Further, the memory 720 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 720 may optionally include a memory remotely located from the processor 710, and these remote memories may be connected to the speech signal encoding/decoding method apparatus through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 730 may receive input numeric or character information and generate signals related to user settings and function control of the voice signal encoding/decoding method device. The output device 740 may include a display device such as a display screen.
The one or more modules are stored in the memory 720 and, when executed by the one or more processors 710, perform a speech signal encoding/decoding method in any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: characterized by mobile communication capability and mainly aimed at providing voice and data communication. Such terminals include smart speakers, story machines, smart phones (e.g., the iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: belonging to the category of personal computers, with computing and processing functions, and generally featuring mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.
(3) Portable entertainment devices: devices that can display and play multimedia content, including audio and video players (e.g., the iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) Servers: similar in architecture to general-purpose computers, but with higher requirements for processing capability, stability, reliability, security, scalability, and manageability, since they must provide highly reliable services.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the essence of the above technical solutions, or the part contributing to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method of the embodiments or certain parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.