CN112071331A - Voice file repairing method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN112071331A
CN112071331A
Authority
CN
China
Prior art keywords
data
group
repair
frame
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010990031.1A
Other languages
Chinese (zh)
Other versions
CN112071331B (en)
Inventor
罗剑
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010990031.1A priority Critical patent/CN112071331B/en
Priority to PCT/CN2020/124898 priority patent/WO2021169356A1/en
Publication of CN112071331A publication Critical patent/CN112071331A/en
Application granted granted Critical
Publication of CN112071331B publication Critical patent/CN112071331B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L19/00 Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals using source filter models or psychoacoustic analysis
    • G10L19/02 Analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/04 Analysis-synthesis techniques using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The embodiments of this application belong to the field of artificial intelligence and relate to a voice file restoration method and related equipment, applicable to intelligent customer service or secure payment in smart hospitals, financial venues, and similar settings. The method comprises the following steps: extracting characteristic coefficients of the voice data from its frame signals; locating a missing frame of the voice data based on a preset detection model and the characteristic coefficients, and determining the group containing the missing frame as a first group of data; combining the first group of data, the second group of data, and the third group of data into a first repair group, and acquiring the hidden state parameters of a second repair group; inputting the hidden state parameters, the first group of data, the second group of data, and the third group of data into a preset first audio filling network to compute the repair spectrum corresponding to the missing frame; and processing the repair spectrum with a preset vocoder to obtain the repaired speech of the voice data. The application also relates to blockchain technology: the repair spectrum can be stored on a blockchain. The application thereby realizes the repair of damaged audio.

Description

Voice file repairing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for repairing a voice file, a computer device, and a storage medium.
Background
With the rapid development of artificial intelligence, intelligent speech processing has become pervasive in daily life: speech recognition is widely used in intelligent customer service systems, and speaker recognition in secure payment systems. In either case, a poor network connection or low-quality equipment can cause the voice signal to stall or drop frames, so speech restoration is key to solving such problems. Beyond that, speech restoration is also indispensable in the study of historical recordings and in film restoration.
To date, comparatively little work addresses speech restoration; traditional efforts focus instead on speech enhancement, such as dereverberation, noise reduction, and speech separation. Dereverberation removes the blurring that spatial reflections impose on a sound signal, for example echo in an open environment; noise reduction suppresses environmental noise; and speech separation suppresses the signals of other speakers. Such enhancement presupposes that the target signal is present but of poor quality. When the signal itself is lost, for example because of a poor network environment, enhancement cannot improve the audio file. Since the quality of the audio file directly determines the accuracy of intelligent speech processing, how to repair an audio file of poor quality is a technical problem awaiting a solution.
Disclosure of Invention
An embodiment of the present application aims to provide a method and an apparatus for repairing a voice file, a computer device, and a storage medium, and aims to solve the technical problem of repairing a damaged voice file.
In order to solve the above technical problem, an embodiment of the present application provides a method for repairing a voice file, which adopts the following technical solutions:
a voice file repairing method comprises the following steps:
dividing voice data into a plurality of groups of frame signals, and extracting characteristic coefficients of the voice data according to the frame signals;
positioning a missing frame of the voice data based on a preset detection model and the characteristic coefficient, and determining a group position of the missing frame in the voice data as a first group of data;
acquiring the preceding and following groups of data of the first group of data as a second group of data and a third group of data respectively, combining the first group of data, the second group of data and the third group of data into a first repair group, determining the previous repair group of the first repair group as a second repair group, and acquiring hidden state parameters of the second repair group;
inputting the hidden state parameters, the first group of data, the second group of data and the third group of data into a preset first audio filling network, and computing the repair spectrum corresponding to the missing frame;
and processing the repair spectrum based on a preset vocoder to obtain the repaired speech of the voice data.
In order to solve the above technical problem, an embodiment of the present application further provides a device for restoring a voice file, which adopts the following technical solutions:
the dividing module is used for dividing the voice data into a plurality of groups of frame signals and extracting the characteristic coefficient of the voice data according to the frame signals;
the positioning module is used for positioning the missing frame of the voice data based on a preset detection model and the characteristic coefficient and determining the group position of the missing frame in the voice data as a first group of data;
an obtaining module, configured to acquire the preceding and following groups of data of the first group of data as a second group of data and a third group of data respectively, combine the first group of data, the second group of data, and the third group of data into a first repair group, determine the previous repair group of the first repair group as a second repair group, and acquire the hidden state parameters of the second repair group;
the computing module is used for inputting the hidden state parameters, the first group of data, the second group of data and the third group of data into a preset first audio filling network and computing the repair spectrum corresponding to the missing frame;
and the processing module is used for processing the repair spectrum based on a preset vocoder to obtain the repaired speech of the voice data.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above voice file repairing method when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the above voice file repairing method are implemented.
According to the voice file restoration method and apparatus, computer device, and storage medium, the voice data is divided into a plurality of groups of frame signals, and the characteristic coefficients of the voice data are extracted from the frame signals; the missing frame of the voice data is located based on a preset detection model and the characteristic coefficients, so that the frame containing the missing voice signal can be found; when the missing frame is located, the group containing it is determined as the first group of data, the voice data having been divided into groups by frame count, with the group containing the current missing frame being the first group of data; because the output for a group is typically related to its adjacent groups, the preceding and following groups of the first group of data are obtained as the second group of data and the third group of data respectively; the first, second, and third groups of data are combined into a first repair group, and the previous repair group of the first repair group is determined as a second repair group; because the output of the current repair group is typically related to the state of the previous repair group, the hidden state parameters of the second repair group are obtained; the hidden state parameters and the first, second, and third groups of data are input into a preset first audio filling network, from which the repair spectrum corresponding to the missing frame is computed; and the repair spectrum is processed by a preset vocoder to obtain the final repaired speech of the voice data.
In this way, damaged voice data can be repaired quickly through the repair spectrum, which increases both repair speed and file repair quality. It also raises the recognition accuracy of speech recognition applications: a complete voice signal is more easily recognized correctly, thereby improving their recognition efficiency.
Drawings
In order to illustrate the solution of the present application more clearly, the drawings needed for describing its embodiments are briefly introduced below. The drawings described below depict only some embodiments of the present application; those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for voice file repair;
FIG. 3 is a basic framework diagram for audio repair;
FIG. 4 is a block diagram of an audio filling network;
FIG. 5 is a schematic block diagram of one embodiment of a speech file restoration apparatus according to the present application;
FIG. 6 is a schematic block diagram of one embodiment of a computer device according to the present application.
Reference numerals: the device 500 for repairing the voice file comprises a dividing module 501, a positioning module 502, an obtaining module 503, a calculating module 504 and a processing module 505.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present application and are not intended to limit it. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (MPEG Audio Layer III), MP4 players (MPEG Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the voice file repairing method provided in the embodiment of the present application is generally executed by a server/terminal, and accordingly, the voice file repairing apparatus is generally disposed in the server/terminal device.
It should be understood that the number of terminals, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continuing reference to FIG. 2, a flow diagram of one embodiment of a method for speech file repair in accordance with the present application is shown. The voice file repairing method comprises the following steps:
step S201, dividing voice data into a plurality of groups of frame signals, and extracting characteristic coefficients of the voice data according to the frame signals;
in this embodiment, when acquiring voice data, dividing the voice data into a plurality of groups of frame signals, where a frame signal is a voice signal in a frame unit; the dividing mode may be dividing the voice data according to a preset dividing duration, or dividing the voice data according to a preset dividing length. By dividing the voice data, a plurality of groups of frame signals corresponding to the voice data can be obtained through division. When the frame signal is obtained, extracting a feature coefficient of the speech data according to the frame signal, wherein the feature coefficient is a feature representation of the current speech data, and a mel-frequency cepstrum coefficient is taken as an example of the feature coefficient of the speech data. And calculating the Mel cepstrum coefficient of each frame signal of the voice data, and combining the Mel cepstrum coefficients of all the frame signals into a two-dimensional array, wherein the two-dimensional array is the characteristic coefficient of the current voice data.
Step S202, positioning the missing frame of the voice data based on a preset detection model and the characteristic coefficient, and determining the group position of the missing frame in the voice data as a first group of data;
in this embodiment, when the feature coefficient of the speech data is obtained, the missing frame of the current speech data is located according to the preset detection model and the feature coefficient, that is, the time point of the missing signal in the speech data is located, and the frame at the time point is the missing frame. The preset detection model is a preset sound detection model, and the obtained characteristic coefficients are calculated according to the preset detection model, so that the time point of the missing signal in the current voice data can be determined, namely the corresponding missing frame is found. Specifically, the characteristic coefficient is a two-dimensional array including a coefficient corresponding to each frame signal; sequentially inputting the coefficient of each frame signal into a preset detection model, so that an output result corresponding to the current voice data can be calculated, wherein the output result is a two-dimensional array, and the data respectively represent the absence and non-absence of the corresponding frame signal by 0 and 1; if the data in the output result is 0, the frame signal corresponding to 0 is a missing frame; if the data in the output result is 1, the frame signal corresponding to 1 is not a missing frame. When a missing frame is obtained, determining a group position in the voice data of the missing frame as a first group of data, wherein the group positions of different frame signals may be different, dividing the frame signals of the voice data into one group according to a preset number, for example, dividing five frame signals into one group of data; therefore, a plurality of groups of different data can be divided, and the group where the missing frame is located is the first group of data.
Step S203, obtaining the foreground and background data of the first group of data, respectively using the foreground and background data as a second group of data and a third group of data, combining the first group of data, the second group of data and the third group of data to form a first repair group, determining that the previous repair group of the first repair group is a second repair group, and obtaining the hidden state parameter of the second repair group;
in this embodiment, when the first group of data is obtained, the front and back groups of data of the first group of data are obtained, and the front group of data is determined as the second group of data, and the back group of data is determined as the third group of data. And combining the first group of data, the second group of data and the third group of data to form a first repair group, namely using the group where the missing frame is located and the front and rear groups of data as the first repair group. The previous repair group of the first repair group is a second repair group, and the second repair group includes group data where the previous missing frame of the missing frame is located, and group data before and after the previous missing frame. If the first repair group is (B)4、B5、B6) Wherein B is5The data of the first group where the missing frame in the first repair group is located is B4Then the second repair group is (B)3、B4、B5)。
When the second repair group is obtained, its hidden state parameters are acquired. The hidden state parameters are short-term memory parameters computed by a preset Long Short-Term Memory network (LSTM); since the output for the repair group at the current moment is generally related to the hidden state of the repair group at the previous moment, the hidden state parameters of the current repair group can be computed from those of the previous one. The hidden state parameters thereby ensure data consistency between repair groups.
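The sliding construction of repair groups can be sketched as follows. The group contents are placeholder labels (real code would hold each group's spectral frames), and the helper name `repair_group` is an assumption of this sketch, not the patent's terminology.

```python
def repair_group(groups: dict, center: int) -> tuple:
    """A repair group is (preceding group, group with the missing frame,
    following group); the previous repair group is the same window
    shifted back by one group."""
    return (groups[center - 1], groups[center], groups[center + 1])

groups = {i: f"B{i}" for i in range(1, 8)}  # placeholder group data B1..B7

first_repair = repair_group(groups, center=5)   # (B4, B5, B6)
second_repair = repair_group(groups, center=4)  # (B3, B4, B5)
```

In the method itself, the hidden state parameters produced by the LSTM while processing `second_repair` would then be carried forward when filling `first_repair`.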
Step S204, inputting the hidden state parameters, the first group of data, the second group of data and the third group of data into a preset first audio filling network, and calculating to obtain a repair frequency spectrum corresponding to the missing frame;
In this embodiment, when the hidden state parameters of the second repair group are obtained, they are sent together with the first, second, and third groups of data to the first audio filling network, and the repair spectrum of the current voice data is computed from it. The first audio filling network is a preset audio filling network comprising a plurality of convolutional layers and associated residual dense networks. Inputting the obtained hidden state parameters, first group of data, second group of data, and third group of data into the first audio filling network yields the repair spectrum corresponding to the current missing frame.
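The data flow through a filling network of this shape can be illustrated with a deliberately tiny NumPy sketch: concatenate the three groups along time, condition every time step on the hidden state, and apply convolutions with a residual connection. All shapes (5 frames per group, 32 bins, 16-dim hidden state) and the random, untrained weights are assumptions for illustration only; the patent's network is far larger, and this sketch shows only the wiring, not a usable model.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """'Same'-padded 1-D convolution over time; x: (T, C_in), w: (k, C_in, C_out)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                    for t in range(x.shape[0])])
    return out + b

# toy inputs: 5 frames per group, 32 spectral bins, 16-dim hidden state
second, first, third = (rng.normal(size=(5, 32)) for _ in range(3))
hidden = rng.normal(size=(16,))

x = np.concatenate([second, first, third], axis=0)         # (15, 32), time order
x = np.concatenate([x, np.tile(hidden, (15, 1))], axis=1)  # condition on hidden state
w1, b1 = rng.normal(size=(3, 48, 32)) * 0.1, np.zeros(32)
w2, b2 = rng.normal(size=(3, 32, 32)) * 0.1, np.zeros(32)
h = np.maximum(conv1d(x, w1, b1), 0)   # convolution + ReLU
repaired = conv1d(h, w2, b2) + h       # residual connection
repair_spectrum = repaired[5:10]       # slice aligned with the missing group
```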
It is emphasized that, to further ensure the privacy and security of the repair spectrum, the repair spectrum may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, each block containing the information of a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Step S205, processing the repair spectrum based on a preset vocoder to obtain the repaired speech of the voice data.
In this embodiment, when the repair spectrum is obtained, it is processed by a preset vocoder to obtain the corresponding speech signal, which is the repaired speech of the current voice data. Taking WaveNet as an example of the preset vocoder: WaveNet is a deep-learning speech generation model that models raw speech data directly, so when the repair spectrum is received and input into the speech generation model, the repaired speech corresponding to the repair spectrum is obtained. The repaired speech recovers the voice information missing from the current voice data.
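The patent names WaveNet as the vocoder; a full neural vocoder is too large to sketch here, so the spectrum-to-waveform step is illustrated instead with the classical Griffin-Lim iteration as a lightweight stand-in. The FFT size, hop, iteration count, and the use of a plain magnitude spectrogram in place of the repair spectrum are all assumptions of this sketch.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(S, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    out = np.zeros((S.shape[0] - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    frames = np.fft.irfft(S, n=n_fft, axis=1)
    for i, fr in enumerate(frames):          # overlap-add with window compensation
        out[i * hop:i * hop + n_fft] += fr * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=30, n_fft=512, hop=128):
    """Recover a waveform from a magnitude spectrogram by iterative
    phase estimation (Griffin-Lim): alternate ISTFT and phase re-projection."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)

# magnitude spectrogram of a 440 Hz tone as a stand-in "repair spectrum"
mag = np.abs(stft(np.sin(2 * np.pi * 440 * np.arange(8000) / 16000)))
wave = griffin_lim(mag)
```

A learned vocoder such as WaveNet replaces this whole iteration with a trained generative model, which is what gives the method its speech quality.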
In this embodiment, damaged voice data is repaired quickly, increasing repair speed and file repair quality; the speech recognition accuracy of speech recognition applications is also improved, since the complete voice signal is more easily recognized correctly, which in turn improves the recognition efficiency of those applications.
In some embodiments of the present application, the locating the missing frame of the speech data based on the preset detection model and the feature coefficient includes:
acquiring a preset detection model, wherein the preset detection model comprises a detection neural network and a fully connected layer, inputting the characteristic coefficients into the detection neural network, and computing a detection value;
and inputting the detection value into the fully connected layer, computing the output result, and locating the missing frame of the voice data according to the output result.
In this embodiment, the preset detection model is a preset sound detection model comprising a detection neural network and a fully connected layer. When the characteristic coefficients are obtained, the coefficients corresponding to each frame signal are input into the detection neural network in sequence, and the corresponding first and second detection parameters are computed as:

h_t, c_t = LSTM(h_{t-1}, c_{t-1}, x_t)

ŷ = σ(W · H + b)

where h_t and c_t are the detection parameters at the current moment, i.e. the first and second detection parameters respectively; x_t is the coefficient of the input frame signal; W is the weight parameter of the fully connected layer and b its bias parameter; H is the detection value output by the detection neural network; and ŷ is the output result of the fully connected layer, for example

ŷ = (1, 1, 0, 1, ..., 1)

where 0 indicates that the corresponding frame signal is a missing frame, and 1 indicates that it is not.
When the first and second detection parameters are obtained, the detection value is computed from them by the detection neural network and input into the fully connected layer, which computes the output result. Whether a frame signal is a missing frame is judged from the value of the output result: if the value is the preset value denoting a missing frame, the corresponding frame signal is determined to be a missing frame; if it is the preset value denoting a non-missing frame, the corresponding frame signal is determined not to be missing. The output result takes only these two values.
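The per-frame detection computation (one LSTM step per frame coefficient, followed by a fully connected layer and a threshold to the 0/1 flags) can be sketched in NumPy. The gate layout, hidden size, sigmoid output layer, and random untrained weights are assumptions of this sketch, so the flags it emits are structurally correct but not meaningful detections.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, params):
    """One step of h_t, c_t = LSTM(h_{t-1}, c_{t-1}, x_t) with standard gates."""
    z = params["W"] @ np.concatenate([h_prev, x_t]) + params["b"]
    H = h_prev.size
    i, f = sigmoid(z[:H]), sigmoid(z[H:2 * H])
    g, o = np.tanh(z[2 * H:3 * H]), sigmoid(z[3 * H:])
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

def detect(coeffs, hidden=8, seed=0):
    """Run each frame's coefficient vector through the LSTM, then through a
    fully connected layer; threshold to the 0/1 missing-frame flags."""
    rng = np.random.default_rng(seed)
    d = coeffs.shape[1]
    params = {"W": rng.normal(size=(4 * hidden, hidden + d)) * 0.1,
              "b": np.zeros(4 * hidden)}
    W_fc, b_fc = rng.normal(size=(1, hidden)) * 0.1, np.zeros(1)
    h, c = np.zeros(hidden), np.zeros(hidden)
    flags = []
    for x_t in coeffs:                        # x_t: per-frame characteristic coefficients
        h, c = lstm_step(h, c, x_t, params)   # detection parameters at this moment
        flags.append(int(sigmoid(W_fc @ h + b_fc)[0] > 0.5))
    return np.array(flags)

flags = detect(np.random.default_rng(1).normal(size=(10, 4)))  # untrained: illustrative only
```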
In this embodiment, the missing frame is located quickly and accurately according to the preset detection model, further improving the efficiency and accuracy of repairing the missing frame.
In some embodiments of the present application, the obtaining of the preset detection model includes:
acquiring an original file in a preset corpus, segmenting the original file into a plurality of frame data, randomly extracting a preset number of sub-frame data from all the frame data, replacing a signal in a preset time period in the sub-frame data by Gaussian white noise to obtain replaced sub-frame data, and combining the replaced sub-frame data and the frame data which is not replaced to form a training data set;
and training a basic detection model according to the training data set, wherein the basic detection model which is successfully trained is a preset detection model of the voice data.
In this embodiment, before the preset detection model is obtained, the basic detection model must be trained in advance; the trained basic model is the preset detection model. Specifically, an original file in a predetermined corpus is obtained, where the corpus may be a common corpus. A voice file in the predetermined corpus that has not undergone any processing is used as the original file, and the original file is segmented into a plurality of frame data, for example with 20 s as one frame. When the frame data are obtained, a preset number of sub-frame data are randomly extracted from all the frame data, a signal within a preset time period in each sub-frame datum is replaced by Gaussian white noise to obtain replaced sub-frame data, and the replaced sub-frame data and the frame data that were not replaced are combined into a training data set. For example, an original file is segmented into 10 frame data; 5 sub-frame data are randomly extracted from the 10 frame data; in each of the 5 sub-frame data, a randomly chosen 0.1 s of the voice signal is replaced by Gaussian white noise to simulate the loss of the voice signal, yielding the replaced sub-frame data; and the replaced sub-frame data and the remaining 5 frame data that were not replaced are combined to form the training data set.
And when the training data set is obtained, putting the training data set into a basic detection model for training. After the basic detection model is trained according to the training data set, randomly selecting a plurality of verification frame data as verification data, wherein the verification frame data comprises the frame data of the missing signals and the frame data of the non-missing signals, and verifying the trained basic detection model according to the verification data. And if the verification accuracy of the trained basic detection model on the verification data reaches the preset accuracy, determining that the basic detection model is successfully trained, wherein the successfully trained basic detection model is the preset detection model.
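The training-set construction above can be sketched in a few lines of numpy (the sample rate, function name, and 0/1 labelling are illustrative assumptions; the patent's 20 s frames, 5 corrupted sub-frames, and 0.1 s Gaussian-noise gap are kept):

```python
import numpy as np

def build_training_set(signal, sample_rate, frame_seconds=20.0,
                       n_corrupt=5, gap_seconds=0.1, seed=0):
    """Split an original file into frame data and corrupt a random subset
    by replacing a random 0.1 s span with Gaussian white noise."""
    rng = np.random.default_rng(seed)
    frame_len = int(frame_seconds * sample_rate)
    n_frames = len(signal) // frame_len
    frames = [signal[i * frame_len:(i + 1) * frame_len].copy()
              for i in range(n_frames)]
    corrupt_idx = rng.choice(n_frames, size=n_corrupt, replace=False)
    gap_len = int(gap_seconds * sample_rate)
    labels = []
    for idx, frame in enumerate(frames):
        if idx in corrupt_idx:
            start = rng.integers(0, len(frame) - gap_len)
            # simulate signal loss with Gaussian white noise
            frame[start:start + gap_len] = rng.standard_normal(gap_len)
            labels.append(0)   # missing-signal frame
        else:
            labels.append(1)   # intact frame
    return frames, labels

# Toy original file: 200 s of audio at 1 kHz -> 10 frames of 20 s each.
sr = 1000
audio = np.sin(np.linspace(0, 2000 * np.pi, 200 * sr))
frames, labels = build_training_set(audio, sr)
```

The resulting frames and labels form the training data set; held-out frames of both kinds would then serve as the verification data described below.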
In this embodiment, the training of the preset detection model is realized, so that the missing frame in the speech data can be accurately positioned through the preset detection model, and the accuracy of the missing frame positioning is further improved.
In some embodiments of the present application, the obtaining the hidden state parameter of the second repair group includes:
determining that the previous repairing group of the second repairing group is a third repairing group, and acquiring the cell state of the third repairing group;
and calculating hidden state parameters of the second repairing group according to the cell state and a preset long-short term memory network.
In this embodiment, when calculating the hidden state parameter of the second repair group, the repair group preceding the second repair group must first be obtained; this preceding group is the third repair group. The cell state of the third repair group, C_{t-1}, is input into the long short-term memory network, which computes the cell state C_t of the current second repair group. The cell state is the transmission path for top-layer information in the long short-term memory network, allowing information to be carried along the sequence, so the current cell state is related to the cell state at the previous moment. Once the cell state of the second repair group is obtained, the hidden state parameter of the current second repair group can be obtained through the long short-term memory network. Specifically, the hidden state parameter is denoted H_t, and the corresponding calculation formula is

H_t, C_t = LSTM(H_{t-1}, C_{t-1}, I_{t-1})

wherein C_t is the cell state of the current second repair group, C_{t-1} is the cell state of the third repair group, and I_{t-1} is the memory value of the third repair group.
In this embodiment, the hidden state parameter is calculated, so that the repair spectrum can be accurately calculated by using the hidden state parameter, and the accuracy of repairing the voice data is further improved.
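A compact numpy sketch of how the hidden state and cell state propagate across successive repair groups via a standard LSTM cell (the stacked weight-matrix layout and dimensions are illustrative assumptions; the gate structure with σ and tanh matches the middle region of FIG. 3 described later):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(h_prev, c_prev, x, W):
    """H_t, C_t = LSTM(H_{t-1}, C_{t-1}, I_{t-1}); gates come from one
    stacked weight matrix W of shape (4*hidden, hidden + input_dim)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x])
    f, i, o = (sigmoid(z[k * hidden:(k + 1) * hidden]) for k in range(3))
    g = np.tanh(z[3 * hidden:])
    c = f * c_prev + i * g        # cell state: the top-layer information path
    h = o * np.tanh(c)            # hidden state parameter of this repair group
    return h, c

# Carry (H, C) across a sequence of repair groups: the state computed for
# one group feeds the next, keeping the groups' data consistent.
rng = np.random.default_rng(1)
hidden, dim = 4, 16
W = rng.standard_normal((4 * hidden, hidden + dim)) * 0.1
h = np.zeros(hidden)
c = np.zeros(hidden)
for repair_spectrum in rng.standard_normal((3, dim)):   # stand-in for I_{t-1}
    h, c = lstm_cell(h, c, repair_spectrum, W)
```

Because the output gate is sigmoid-bounded and the cell state passes through tanh, every component of the hidden state stays strictly inside (-1, 1).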
In some embodiments of the present application, the inputting the hidden state parameter, the first group of data, the second group of data, and the third group of data into a preset first audio padding network, and the calculating to obtain the repair spectrum corresponding to the missing frame includes:
inputting the first group of data and the second group of data into a preset second audio filling network, calculating to obtain a first intermediate variable, inputting the second group of data and the third group of data into the second audio filling network, and calculating to obtain a second intermediate variable;
inputting the hidden state parameter, the first intermediate variable and the second intermediate variable into a preset first audio filling network, and calculating to obtain a repair frequency spectrum corresponding to the missing frame, wherein the first audio filling network and the second audio filling network have the same structure and different parameters.
In this embodiment, once the hidden state parameter and the first, second, and third groups of data are obtained, the first and second groups of data are input into a second audio filling network, which has the same structure as the first audio filling network but different parameters. The second audio filling network computes a first intermediate variable from the first and second groups of data, and a second intermediate variable from the second and third groups of data. When the first and second intermediate variables are obtained, they are input, together with the hidden state parameter, into the first audio filling network, from which the repair spectrum corresponding to the current missing frame is calculated. The calculation formula of the repair spectrum is as follows:
I_i = F_2(I_{i-1,i}, I_{i,i+1}, H_{t-1})

wherein I_i is the repair spectrum, F_2 is the first audio filling network, I_{i-1,i} is the first intermediate variable, I_{i,i+1} is the second intermediate variable, and H_{t-1} is the hidden state parameter of the second repair group.
FIG. 3 is a basic framework diagram of audio repair. As shown in FIG. 3, in the left triangular region, B_1, B_2, B_3 form the first repair group corresponding to the first group of data B_2 in which the current missing frame is located; F_2 is the first audio filling network, F_1 is the second audio filling network, I_{1-2} is the first intermediate variable of the first repair group, I_{2-3} is the second intermediate variable of the first repair group, I_2 is the repair spectrum of the first repair group, H_{t-1} is the hidden state parameter of the previous (i.e. second) repair group, and H_t is the hidden state parameter of the current (i.e. first) repair group. In the middle region, LSTM is the preset long short-term memory network, C_{t-1} is the cell state of the second repair group, C_t is the cell state of the first repair group, σ is the sigmoid activation function with output between 0 and 1, and Tanh is the hyperbolic tangent function with output between -1 and 1. In the right triangular region, B_2, B_3, B_4 form the repair group corresponding to the group data B_3 of the next missing frame, I_{2-3} is the first intermediate variable of the repair group corresponding to B_3, I_{3-4} is its second intermediate variable, and I_3 is the repair spectrum of the repair group corresponding to B_3.
In this embodiment, the calculation of the repair spectrum is realized, so that the repair spectrum can be accurately computed through the audio filling network, further improving the accuracy and quality of the voice data repair.
In some embodiments of the present application, the second audio padding network includes a first convolutional layer, a second convolutional layer, and a residual dense network, the inputting the first set of data and the second set of data into the second audio padding network includes:
inputting the first set of data and the second set of data into the first convolution layer to calculate to obtain a first parameter value;
and inputting the first parameter value to the residual error dense network, calculating to obtain a second parameter value, and inputting the second parameter value to the second convolution layer to obtain a first intermediate variable.
In this embodiment, the second audio filling network comprises a first convolutional layer consisting of a convolution with a kernel of 5 followed by a convolution with a kernel of 3, a second convolutional layer consisting of a convolution with a kernel of 1 followed by a convolution with a kernel of 3, and a residual dense network comprising convolutional layers, an activation layer, and a connection layer. As shown in fig. 4, fig. 4 is a structural diagram of the audio filling network; both the first and second audio filling networks have this structure. In the diagram, "convolution, 5 × 5" denotes the convolutional layer with a kernel of 5, "convolution, 3 × 3" the convolutional layer with a kernel of 3, and "convolution, 1 × 1" the convolutional layer with a kernel of 1. The structure of the residual dense network is shown on the right side of fig. 4 and comprises convolutional layers, an activation layer, and a connection layer, where the activation layer is a linear rectification function (ReLU).
When the first and second groups of data are obtained, they are input into the first convolutional layer of the second audio filling network and passed in turn through the kernel-5 and kernel-3 convolutions to obtain a first parameter value. The first parameter value is input into the residual dense network to compute a second parameter value, which is then passed through the kernel-1 and kernel-3 convolutions of the second convolutional layer to obtain the first intermediate variable. The calculation formulas of the first and second intermediate variables are as follows:
I_{i-1,i} = F_1(B_{i-1}, B_i)

wherein I_{i-1,i} is the first (or second) intermediate variable, F_1 is the second audio filling network, and B_{i-1} and B_i are two adjacent groups of data. The second intermediate variable is calculated in the same way as the first; only the inputs differ: for the second intermediate variable, the inputs are the second and third groups of data.
In the embodiment, the calculation of the intermediate variable is realized, and the processing speed and the data accuracy of the voice data are improved in a convolution mode, so that the repair frequency spectrum calculated according to the intermediate variable is more accurate.
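A 1-D numpy sketch of the filling-network pipeline just described: first convolutional layer (kernels 5 then 3), residual dense block, second convolutional layer (kernels 1 then 3). Random kernels stand in for trained parameters, the two input groups are merged by summation rather than channel concatenation, and mean fusion stands in for the connection layer, so this is a structural illustration under stated assumptions, not the patent's network:

```python
import numpy as np

def conv(x, kernel):
    """'Same'-padded 1-D convolution standing in for a conv layer."""
    return np.convolve(x, kernel, mode="same")

def residual_dense_block(x, kernels):
    """Residual dense network sketch: each conv+ReLU layer's output joins
    the feature list, which is fused and added back to the input."""
    features = [x]
    for k in kernels:
        y = np.maximum(conv(features[-1], k), 0.0)   # conv layer + ReLU
        features.append(y)
    fused = np.mean(features, axis=0)                # connection layer (fusion)
    return x + fused                                 # residual connection

def filling_network(b_prev, b_cur, rng):
    """I_{i-1,i} = F_1(B_{i-1}, B_i) with the layer order from the text."""
    x = b_prev + b_cur                               # merge adjacent groups
    x = conv(conv(x, rng.standard_normal(5) * 0.2),  # first conv layer: k=5
             rng.standard_normal(3) * 0.2)           # then k=3
    x = residual_dense_block(
        x, [rng.standard_normal(3) * 0.2 for _ in range(3)])
    x = conv(conv(x, rng.standard_normal(1)),        # second conv layer: k=1
             rng.standard_normal(3) * 0.2)           # then k=3
    return x

rng = np.random.default_rng(2)
B1, B2 = rng.standard_normal(64), rng.standard_normal(64)
I_12 = filling_network(B1, B2, rng)   # first intermediate variable
```

The same function applied to the second and third groups of data would yield the second intermediate variable, mirroring the shared-structure, different-parameters relationship between F_1 and F_2.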
In some embodiments of the present application, the dividing the voice data into a plurality of groups of frame signals, and the extracting the feature coefficients of the voice data according to the frame signals includes:
acquiring preset division duration, and dividing the voice data into a plurality of groups of frame signals according to the preset division duration;
and calculating the Mel cepstrum coefficient of each group of frame signals, and taking the Mel cepstrum coefficient as the characteristic coefficient of the voice data.
In this embodiment, when extracting the feature coefficients of the voice data, a preset division duration is obtained, which is a preset frame length, for example 10 ms per frame, and the voice data is divided into multiple groups of frame signals. The Mel cepstrum coefficients of each group of frame signals are calculated, and the Mel cepstrum coefficients of all frame signals are output as a two-dimensional array, which is the feature coefficient of the current voice data. The two-dimensional array may be represented as X = [x_1, x_2, ..., x_n], where

x_i ∈ R^k

n is the number of frame signals, x_i is the Mel cepstrum coefficient vector calculated from each frame signal, and k is a coefficient constant, for example k = 40.
In the embodiment, the extraction of the voice data feature coefficient is realized, so that the missing frame signal can be positioned through the feature coefficient, and the positioning accuracy of the missing frame is improved.
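A self-contained numpy sketch of this extraction: frame the signal at the preset division duration (10 ms), then compute a simplified mel-cepstrum per frame (triangular mel filterbank, log energies, DCT-II). This is a textbook approximation, not the patent's exact feature pipeline; in practice a library such as librosa would be used:

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced on the mel scale (simplified)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for j in range(l, c):
            fb[i, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i, j] = (r - j) / max(r - c, 1)
    return fb

def feature_coefficients(signal, sr, frame_ms=10, k=40):
    """X = [x_1, ..., x_n]: one k-dim mel-cepstrum vector per frame."""
    frame_len = int(sr * frame_ms / 1000)
    n = len(signal) // frame_len
    fb = mel_filterbank(k, frame_len, sr)
    window = np.hanning(frame_len)
    m = np.arange(k)
    dct = np.cos(np.pi * np.outer(m, 2 * m + 1) / (2 * k))  # DCT-II basis
    feats = []
    for i in range(n):
        frame = signal[i * frame_len:(i + 1) * frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        logmel = np.log(fb @ power + 1e-10)
        feats.append(dct @ logmel)      # cepstral coefficients of this frame
    return np.stack(feats)              # two-dimensional array, shape (n, k)

sr = 16000
t = np.arange(sr)                       # 1 s of a 440 Hz tone
X = feature_coefficients(np.sin(2 * np.pi * 440 * t / sr), sr)
```

With 1 s of audio at 10 ms per frame and k = 40, the feature coefficient is a 100 × 40 array, one row per frame signal, matching the X = [x_1, ..., x_n] form above.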
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown sequentially as indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated herein, they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
With further reference to fig. 5, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a speech file restoration apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 5, the voice file restoration apparatus 500 according to the present embodiment includes: a dividing module 501, a positioning module 502, an acquisition module 503, a calculation module 504, and a processing module 505, wherein:
a dividing module 501, configured to divide voice data into multiple groups of frame signals, and extract feature coefficients of the voice data according to the frame signals;
wherein, the dividing module 501 includes:
the dividing unit is used for acquiring preset dividing duration and dividing the voice data into a plurality of groups of frame signals according to the preset dividing duration;
and a first calculation unit that calculates mel cepstrum coefficients for each group of the frame signals, and uses the mel cepstrum coefficients as feature coefficients of the voice data.
In this embodiment, when acquiring voice data, dividing the voice data into a plurality of groups of frame signals, where a frame signal is a voice signal in a frame unit; the dividing mode may be dividing the voice data according to a preset dividing duration, or dividing the voice data according to a preset dividing length. By dividing the voice data, a plurality of groups of frame signals corresponding to the voice data can be obtained through division. When the frame signal is obtained, extracting a feature coefficient of the speech data according to the frame signal, wherein the feature coefficient is a feature representation of the current speech data, and a mel-frequency cepstrum coefficient is taken as an example of the feature coefficient of the speech data. And calculating the Mel cepstrum coefficient of each frame signal of the voice data, and combining the Mel cepstrum coefficients of all the frame signals into a two-dimensional array, wherein the two-dimensional array is the characteristic coefficient of the current voice data.
A positioning module 502, configured to position a missing frame of the speech data based on a preset detection model and the feature coefficient, and determine a group position of the missing frame in the speech data as a first group of data;
wherein, the positioning module 502 comprises:
the first acquisition unit is used for acquiring a preset detection model, wherein the preset detection model comprises a detection neural network and a full connection layer, the characteristic coefficient is input into the detection neural network, and a detection value is obtained through calculation;
and the second calculation unit is used for inputting the detection value to the full-connection layer, calculating to obtain an output result, and positioning the missing frame of the voice data according to the output result.
Wherein, the first acquisition unit includes:
the segmentation sub-unit is used for acquiring an original file in a preset corpus, segmenting the original file into a plurality of frame data, randomly extracting a preset number of sub-frame data from all the frame data, replacing a signal in a preset time period in the sub-frame data by Gaussian white noise to obtain replaced sub-frame data, and combining the replaced sub-frame data and the frame data which is not replaced to form a training data set;
and the training subunit is used for training a basic detection model according to the training data set, and the basic detection model which is successfully trained is a preset detection model of the voice data.
In this embodiment, when the feature coefficient of the speech data is obtained, the missing frame of the current speech data is located according to the preset detection model and the feature coefficient; that is, the time point of the missing signal in the speech data is located, and the frame at that time point is the missing frame. The preset detection model is a preset acoustic detection model; computing on the obtained feature coefficients with this model determines the time point of the missing signal in the current speech data, i.e. finds the corresponding missing frame. Specifically, the feature coefficient is a two-dimensional array containing a coefficient for each frame signal. The coefficient of each frame signal is input in turn into the preset detection model, which computes an output result for the current speech data. The output result is a two-dimensional array in which 0 and 1 respectively indicate that the corresponding frame signal is missing or not missing: if a value in the output result is 0, the corresponding frame signal is a missing frame; if it is 1, the corresponding frame signal is not a missing frame. When a missing frame is found, the group position of the missing frame in the speech data is determined as the first group of data. The frame signals of the speech data are divided into groups of a preset number, for example five frame signals per group, yielding several different groups of data; the group containing the missing frame is the first group of data.
An obtaining module 503, configured to acquire the groups of data preceding and following the first group of data, take them respectively as a second group of data and a third group of data, combine the first group of data, the second group of data, and the third group of data to form a first repair group, determine that the repair group preceding the first repair group is a second repair group, and obtain a hidden state parameter of the second repair group;
wherein, the obtaining module 503 includes:
a second obtaining unit, configured to determine that a previous repair group of the second repair group is a third repair group, and obtain a cell state of the third repair group;
and the third calculating unit is used for calculating the hidden state parameters of the second repairing group according to the cell state and a preset long-short term memory network.
In this embodiment, when the first group of data is obtained, the groups of data immediately before and after it are acquired; the preceding group is taken as the second group of data and the following group as the third group of data. The first, second, and third groups of data are combined to form the first repair group; that is, the group containing the missing frame together with the groups before and after it serves as the first repair group. The repair group preceding the first repair group is the second repair group, which comprises the group data of the previous missing frame and the groups before and after it. For example, if the first repair group is (B_4, B_5, B_6), where B_5 is the first group of data containing the missing frame and the previous missing frame lies in B_4, then the second repair group is (B_3, B_4, B_5).
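The grouping and repair-group construction above can be sketched in plain Python (function names and the group size of five are illustrative; the (B_4, B_5, B_6) / (B_3, B_4, B_5) example is taken from the text):

```python
def make_groups(frames, group_size=5):
    """Divide frame signals into groups of a preset number (e.g. five)."""
    return [frames[i:i + group_size]
            for i in range(0, len(frames) - group_size + 1, group_size)]

def repair_group(groups, idx):
    """A repair group: the group containing the missing frame (B_idx)
    together with the groups immediately before and after it."""
    return (groups[idx - 1], groups[idx], groups[idx + 1])

frames = list(range(30))            # 30 frame signals -> 6 groups of 5
groups = make_groups(frames)
first = repair_group(groups, 4)     # missing frame sits in B_5 (index 4)
second = repair_group(groups, 3)    # previous repair group, around B_4
```

With 0-based indexing, `groups[4]` plays the role of B_5, so `first` corresponds to (B_4, B_5, B_6) and `second` to (B_3, B_4, B_5).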
When the second repair group is obtained, its hidden state parameter is obtained. The hidden state parameter is a short-term memory parameter calculated by a preset Long Short-Term Memory network (LSTM); the output of the repair group at the current moment is generally related to the hidden state parameter of the repair group at the previous moment, so the hidden state parameter of the current repair group can be calculated from that of the previous repair group. The hidden state parameter ensures data consistency among the repair groups.
A calculating module 504, configured to input the hidden state parameter, the first group of data, the second group of data, and the third group of data into a preset first audio padding network, and calculate to obtain a repair spectrum corresponding to the missing frame;
wherein, the calculating module 504 comprises:
the fourth calculation unit is used for inputting the first group of data and the second group of data into a preset second audio filling network, calculating to obtain a first intermediate variable, inputting the second group of data and the third group of data into the second audio filling network, and calculating to obtain a second intermediate variable;
and a fifth calculating unit, configured to input the hidden state parameter, the first intermediate variable, and the second intermediate variable into a preset first audio filling network, and calculate to obtain a repair spectrum corresponding to the missing frame, where the first audio filling network and the second audio filling network have the same structure and different parameters.
Wherein the second audio filling network comprises a first convolution layer, a second convolution layer and a residual dense network, and the fourth computing unit comprises:
the first calculation subunit is used for inputting the first group of data and the second group of data into the first convolution layer to calculate to obtain a first parameter value;
and the second calculation subunit is used for inputting the first parameter value to the residual error dense network, calculating to obtain a second parameter value, and inputting the second parameter value to the second convolution layer to obtain a first intermediate variable.
In this embodiment, when the hidden state parameter of the second repair group is obtained, the hidden state parameter and the first, second, and third groups of data are input into the first audio filling network, from which the repair spectrum corresponding to the current missing frame is calculated.
It is emphasized that, to further ensure the privacy and security of the repair spectrum, the repair spectrum may also be stored in a node of a block chain.
The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
And a processing module 505, configured to process the repaired spectrum based on a preset vocoder, so as to obtain a repaired voice of the voice data.
In this embodiment, when the repair spectrum is obtained, it is processed by a preset vocoder to obtain the speech signal corresponding to the repair spectrum; this speech signal is the repair speech of the current speech data. Taking WaveNet as the preset vocoder as an example: WaveNet is a deep-learning-based speech generation model that can model raw speech data directly. When the repair spectrum is received, it is input into the speech generation model to obtain the repair speech corresponding to the repair spectrum. The missing speech information in the current speech data can be recovered through the repair speech.
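The patent's preset vocoder is WaveNet; as a library-free stand-in that illustrates the same spectrum-to-waveform step, the classic Griffin-Lim algorithm recovers a waveform from a magnitude spectrum by alternating between the known magnitude and the phase of the current reconstruction (the STFT parameters and function names here are illustrative assumptions, not part of the patent):

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Short-time Fourier transform via framed rfft, shape (freq, time)."""
    w = np.hanning(n_fft)
    frames = [w * x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames]).T

def istft(S, n_fft=256, hop=64):
    """Inverse STFT by overlap-add with squared-window normalization."""
    w = np.hanning(n_fft)
    n_frames = S.shape[1]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for t in range(n_frames):
        frame = np.fft.irfft(S[:, t], n=n_fft)
        out[t * hop:t * hop + n_fft] += w * frame
        norm[t * hop:t * hop + n_fft] += w ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=30, n_fft=256, hop=64):
    """Recover a waveform from a magnitude (repair) spectrum by iterating
    between the fixed magnitude and the phase of each reconstruction."""
    rng = np.random.default_rng(3)
    phase = np.exp(1j * rng.uniform(0, 2 * np.pi, magnitude.shape))
    for _ in range(n_iter):
        x = istft(magnitude * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(magnitude * phase, n_fft, hop)

# Magnitude spectrum of a short tone, then reconstruction from magnitude only.
sr = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(2048) / sr)
mag = np.abs(stft(tone))
restored = griffin_lim(mag)
```

A neural vocoder such as WaveNet replaces this iterative phase estimation with a learned generative model, which is why the patent's approach can produce higher-quality repair speech.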
In the embodiment, the damaged voice data is quickly repaired, the repairing speed is increased, the file repairing quality is improved, the voice recognition accuracy of the voice recognition application is improved, the complete voice signal is more easily and correctly recognized, and the recognition efficiency of the voice recognition application is further improved.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 6, fig. 6 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 6 comprises a memory 61, a processor 62, and a network interface 63, communicatively connected to one another via a system bus. It is noted that only a computer device 6 having components 61-63 is shown, but it should be understood that not all of the shown components need be implemented; more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 61 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or a memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit of the computer device 6 and an external storage device thereof. In this embodiment, the memory 61 is generally used for storing an operating system installed in the computer device 6 and various application software, such as program codes of a voice file repairing method. Further, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute the program code stored in the memory 61 or process data, for example, execute the program code of the voice file repairing method.
The network interface 63 may comprise a wireless network interface or a wired network interface, and the network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
The computer device provided by the embodiment realizes the quick repair of damaged voice data, improves the quality of file repair while improving the repair speed, and in addition, also improves the voice recognition accuracy of voice recognition application, so that a complete voice signal is more easily and correctly recognized, and further improves the recognition efficiency of the voice recognition application.
The present application further provides another embodiment, which is a computer-readable storage medium storing a voice file repair program executable by at least one processor to cause the at least one processor to perform the steps of voice file repair as described above.
The computer-readable storage medium provided by this embodiment achieves fast repair of damaged voice data, improving the quality of file repair while increasing the repair speed. In addition, because a complete voice signal is more easily recognized correctly, it also improves the recognition accuracy and efficiency of voice recognition applications.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk), which includes instructions for causing a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods of the embodiments of the present application.
Blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each block contains a batch of network transaction records used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be understood that the above embodiments are merely some, not all, of the embodiments of the present application, and that the accompanying drawings show preferred embodiments without limiting the scope of the application. The present application may be implemented in many different forms; these embodiments are provided so that the disclosure of the application is thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their technical features. Any equivalent structure made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, falls within the protection scope of the present application.
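To give a concrete feel for the control flow of the claimed method (framing, missing-frame detection, filling), the following minimal NumPy sketch imitates it with deliberately simplified stand-ins: per-frame energy thresholding stands in for the trained detection model, and linear interpolation from the preceding and following frames stands in for the learned audio filling networks. All names and parameter values are illustrative assumptions, not part of the patent.

```python
import numpy as np

FRAME_LEN = 160  # hypothetical frame length (10 ms at a 16 kHz sampling rate)

def split_frames(signal, frame_len=FRAME_LEN):
    """Divide voice data into equal-length groups of frame signals."""
    n = len(signal) // frame_len
    return signal[: n * frame_len].reshape(n, frame_len)

def locate_missing(frames, energy_floor=1e-6):
    """Stand-in detector: flag near-silent frames as missing.
    The patent instead uses a trained detection network on feature coefficients."""
    energy = np.mean(frames ** 2, axis=1)
    return np.where(energy < energy_floor)[0]

def fill_missing(frames, missing):
    """Stand-in for the audio filling network: interpolate each missing
    frame from the data preceding and following it."""
    repaired = frames.copy()
    for i in missing:
        prev_f = repaired[i - 1] if i > 0 else repaired[i + 1]
        next_f = repaired[i + 1] if i + 1 < len(repaired) else repaired[i - 1]
        repaired[i] = 0.5 * (prev_f + next_f)
    return repaired

# Usage: a sine signal with one frame zeroed out to simulate a dropout.
t = np.arange(FRAME_LEN * 8) / 16000.0
frames = split_frames(np.sin(2 * np.pi * 440 * t))
frames[3] = 0.0                      # simulated missing frame
missing = locate_missing(frames)     # -> detects frame index 3
repaired = fill_missing(frames, missing)
```

In the patented method the filling step additionally conditions on hidden state parameters carried over from the previous repair group, which this sketch omits.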

Claims (10)

1. A voice file repair method, characterized by comprising the following steps:
dividing voice data into a plurality of groups of frame signals, and extracting feature coefficients of the voice data from the frame signals;
locating a missing frame of the voice data based on a preset detection model and the feature coefficients, and determining a group position of the missing frame in the voice data as a first group of data;
acquiring data preceding and following the first group of data, using the preceding data and the following data as a second group of data and a third group of data respectively, combining the first group of data, the second group of data, and the third group of data to form a first repair group, determining that the repair group preceding the first repair group is a second repair group, and acquiring hidden state parameters of the second repair group;
inputting the hidden state parameters, the first group of data, the second group of data and the third group of data into a preset first audio filling network, and calculating to obtain a repair frequency spectrum corresponding to the missing frame;
and processing the repaired frequency spectrum based on a preset vocoder to obtain the repaired voice of the voice data.
2. The voice file repair method according to claim 1, wherein the step of locating the missing frame of the voice data based on the preset detection model and the feature coefficients comprises:
acquiring the preset detection model, wherein the preset detection model comprises a detection neural network and a fully connected layer, inputting the feature coefficients into the detection neural network, and calculating to obtain a detection value;
and inputting the detection value into the fully connected layer, calculating to obtain an output result, and locating the missing frame of the voice data according to the output result.
3. The voice file repair method according to claim 2, wherein the step of acquiring the preset detection model comprises:
acquiring an original file from a preset corpus, segmenting the original file into a plurality of pieces of frame data, randomly extracting a preset number of pieces of sub-frame data from all the frame data, replacing the signal within a preset time period in each piece of sub-frame data with Gaussian white noise to obtain replaced sub-frame data, and combining the replaced sub-frame data with the frame data that was not replaced to form a training data set;
and training a basic detection model on the training data set, wherein the successfully trained basic detection model is the preset detection model for the voice data.
4. The voice file repair method according to claim 1, wherein the step of acquiring the hidden state parameters of the second repair group comprises:
determining that the repair group preceding the second repair group is a third repair group, and acquiring a cell state of the third repair group;
and calculating the hidden state parameters of the second repair group according to the cell state and a preset long short-term memory (LSTM) network.
5. The voice file repair method according to claim 1, wherein the step of inputting the hidden state parameters, the first group of data, the second group of data, and the third group of data into a preset first audio filling network, and calculating to obtain the repair frequency spectrum corresponding to the missing frame comprises:
inputting the first group of data and the second group of data into a preset second audio filling network, calculating to obtain a first intermediate variable, inputting the second group of data and the third group of data into the second audio filling network, and calculating to obtain a second intermediate variable;
inputting the hidden state parameters, the first intermediate variable, and the second intermediate variable into the preset first audio filling network, and calculating to obtain the repair frequency spectrum corresponding to the missing frame, wherein the first audio filling network and the second audio filling network have the same structure but different parameters.
6. The voice file repair method according to claim 5, wherein the second audio filling network comprises a first convolution layer, a second convolution layer, and a residual dense network, and the step of inputting the first group of data and the second group of data into the second audio filling network and calculating to obtain the first intermediate variable comprises:
inputting the first group of data and the second group of data into the first convolution layer, and calculating to obtain a first parameter value;
and inputting the first parameter value into the residual dense network, calculating to obtain a second parameter value, and inputting the second parameter value into the second convolution layer to obtain the first intermediate variable.
7. The voice file repair method according to claim 1, wherein the step of dividing the voice data into a plurality of groups of frame signals and extracting the feature coefficients of the voice data from the frame signals comprises:
acquiring a preset division duration, and dividing the voice data into a plurality of groups of frame signals according to the preset division duration;
and calculating Mel cepstrum coefficients for each group of frame signals, and using the Mel cepstrum coefficients as the feature coefficients of the voice data.
8. A voice file repair apparatus, characterized by comprising:
a dividing module, configured to divide voice data into a plurality of groups of frame signals and extract feature coefficients of the voice data from the frame signals;
a positioning module, configured to locate a missing frame of the voice data based on a preset detection model and the feature coefficients, and determine a group position of the missing frame in the voice data as a first group of data;
an acquiring module, configured to acquire data preceding and following the first group of data, use the preceding data and the following data as a second group of data and a third group of data respectively, combine the first group of data, the second group of data, and the third group of data to form a first repair group, determine that the repair group preceding the first repair group is a second repair group, and acquire hidden state parameters of the second repair group;
a computing module, configured to input the hidden state parameters, the first group of data, the second group of data, and the third group of data into a preset first audio filling network, and calculate to obtain a repair frequency spectrum corresponding to the missing frame;
and a processing module, configured to process the repair frequency spectrum based on a preset vocoder to obtain repaired voice of the voice data.
9. A computer device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor, when executing the computer program, implements the steps of the voice file repair method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the voice file repair method according to any one of claims 1 to 7.
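Claim 7 uses per-frame Mel cepstrum coefficients as the feature coefficients. The sketch below shows one common from-scratch NumPy computation of such coefficients (windowed power spectrum, triangular mel filterbank, log, DCT-II); the FFT size, filter count, and coefficient count are illustrative assumptions, not values specified by the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Mel cepstrum coefficients of one frame signal."""
    # Power spectrum of the Hamming-windowed, zero-padded frame.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Triangular mel filterbank spanning 0 Hz to the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    log_mel = np.log(fb @ spec + 1e-10)
    # DCT-II decorrelates the log filterbank energies; keep the first n_ceps.
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_mels))
    return dct @ log_mel
```

A production system would more likely use a library such as librosa for this step; the point here is only to make the "feature coefficient" extraction of claims 1 and 7 concrete.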
CN202010990031.1A 2020-09-18 2020-09-18 Voice file restoration method and device, computer equipment and storage medium Active CN112071331B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010990031.1A CN112071331B (en) 2020-09-18 2020-09-18 Voice file restoration method and device, computer equipment and storage medium
PCT/CN2020/124898 WO2021169356A1 (en) 2020-09-18 2020-10-29 Voice file repairing method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010990031.1A CN112071331B (en) 2020-09-18 2020-09-18 Voice file restoration method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112071331A true CN112071331A (en) 2020-12-11
CN112071331B CN112071331B (en) 2023-05-30

Family

ID=73681627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010990031.1A Active CN112071331B (en) 2020-09-18 2020-09-18 Voice file restoration method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112071331B (en)
WO (1) WO2021169356A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112652310A (en) * 2020-12-31 2021-04-13 乐鑫信息科技(上海)股份有限公司 Distributed speech processing system and method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090083031A1 (en) * 2007-09-26 2009-03-26 University Of Washington Clipped-waveform repair in acoustic signals using generalized linear prediction
CN101894565A (en) * 2009-05-19 2010-11-24 华为技术有限公司 Voice signal restoration method and device
WO2011002489A1 (en) * 2009-06-29 2011-01-06 Audience, Inc. Reparation of corrupted audio signals
EP3471267A1 (en) * 2017-10-13 2019-04-17 Vestel Elektronik Sanayi ve Ticaret A.S. Method and apparatus for repairing distortion of an audio signal
CN109887515A (en) * 2019-01-29 2019-06-14 北京市商汤科技开发有限公司 Audio-frequency processing method and device, electronic equipment and storage medium
CN110136735A (en) * 2019-05-13 2019-08-16 腾讯音乐娱乐科技(深圳)有限公司 A kind of audio restorative procedure, equipment and readable storage medium storing program for executing
CN110534120A (en) * 2019-08-31 2019-12-03 刘秀萍 A kind of surround sound error-resilience method under mobile network environment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107564533A (en) * 2017-07-12 2018-01-09 同济大学 Speech frame restorative procedure and device based on information source prior information
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112071331B (en) 2023-05-30
WO2021169356A1 (en) 2021-09-02

Similar Documents

Publication Publication Date Title
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
WO2021155713A1 (en) Weight grafting model fusion-based facial recognition method, and related device
WO2022007438A1 (en) Emotional voice data conversion method, apparatus, computer device, and storage medium
WO2021135455A1 (en) Semantic recall method, apparatus, computer device, and storage medium
CN114492831B (en) Method and device for generating federal learning model
CN112836521A (en) Question-answer matching method and device, computer equipment and storage medium
CN112699213A (en) Speech intention recognition method and device, computer equipment and storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN114358023A (en) Intelligent question-answer recall method and device, computer equipment and storage medium
CN112071331B (en) Voice file restoration method and device, computer equipment and storage medium
CN113723077A (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN112100491A (en) Information recommendation method, device and equipment based on user data and storage medium
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN116705034A (en) Voiceprint feature extraction method, speaker recognition method, model training method and device
CN113936677A (en) Tone conversion method, device, computer equipment and storage medium
CN113505595A (en) Text phrase extraction method and device, computer equipment and storage medium
JP7099254B2 (en) Learning methods, learning programs and learning devices
CN112417886A (en) Intention entity information extraction method and device, computer equipment and storage medium
CN112732913B (en) Method, device, equipment and storage medium for classifying unbalanced samples
CN114238583B (en) Natural language processing method, device, computer equipment and storage medium
CN110992067B (en) Message pushing method, device, computer equipment and storage medium
CN117133318A (en) Voice quality evaluation method, device, computer equipment and storage medium
CN111311616B (en) Method and apparatus for segmenting an image
CN113361629A (en) Training sample generation method and device, computer equipment and storage medium
CN114372889A (en) Method and device for monitoring underwriting, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant