WO2021169356A1 - Voice file repair method, apparatus, computer device, and storage medium - Google Patents

Voice file repair method, apparatus, computer device, and storage medium

Info

Publication number
WO2021169356A1
WO2021169356A1 PCT/CN2020/124898 CN2020124898W WO2021169356A1 WO 2021169356 A1 WO2021169356 A1 WO 2021169356A1 CN 2020124898 W CN2020124898 W CN 2020124898W WO 2021169356 A1 WO2021169356 A1 WO 2021169356A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
group
preset
voice
frame
Prior art date
Application number
PCT/CN2020/124898
Other languages
English (en)
French (fr)
Inventor
罗剑
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021169356A1 publication Critical patent/WO2021169356A1/zh


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device, computer equipment, and storage medium for repairing a voice file.
  • For example, voice recognition widely used in intelligent customer service systems and speaker recognition used in secure payment systems both involve intelligent voice processing.
  • However, the inventor realized that for both the voice recognition of a customer service system and the speaker recognition of a payment system, poor network conditions or low-quality equipment will cause the voice signal to freeze or lose frames. Voice repair is therefore the key to solving such problems.
  • In addition, voice repair is also a necessary task in the textual study of historical documents and in film restoration.
  • At present there is little work on voice repair; traditional voice repair focuses more on speech enhancement, such as speech de-reverberation, speech noise reduction, and speech separation.
  • Speech de-reverberation eliminates the sound blurring caused by the reflection of sound signals in the spatial environment, such as removing echo in an open environment; speech noise reduction is used to reduce various environmental noises; and speech separation suppresses sound signals from other speakers.
  • This type of speech enhancement processing is mostly used to improve a target sound signal, that is, the sound signal to be processed already exists but its quality is poor; when the signal itself is lost, however, such as frame loss caused by a poor network environment, speech enhancement cannot improve the quality of the audio file.
  • The quality of audio files directly determines the accuracy of intelligent voice processing, and how to repair an audio file on the basis of an existing, poor-quality audio file is a technical problem yet to be solved.
  • the purpose of the embodiments of the present application is to propose a method, device, computer equipment, and storage medium for repairing a voice file, aiming to solve the technical problem of repairing a damaged voice file.
  • an embodiment of the present application provides a method for repairing a voice file, which adopts the following technical solutions:
  • a method for repairing voice files includes the following steps:
  • dividing voice data into multiple groups of frame signals, and extracting feature coefficients of the voice data according to the frame signals;
  • locating missing frames of the voice data based on a preset detection model and the feature coefficients, and determining the group position of a missing frame in the voice data as the first group of data;
  • obtaining the preceding and following group data of the first group of data, taking the preceding and following group data respectively as the second group of data and the third group of data, combining the first group of data, the second group of data, and the third group of data into the first repair group, determining the previous repair group of the first repair group as the second repair group, and acquiring the hidden state parameter of the second repair group;
  • inputting the hidden state parameter, the first group of data, the second group of data, and the third group of data into a preset first audio filling network, and calculating the repaired spectrum corresponding to the missing frame;
  • processing the repaired spectrum based on a preset vocoder to obtain the repaired voice of the voice data.
  • an embodiment of the present application also provides a voice file repairing device, which adopts the following technical solutions:
  • a dividing module configured to divide voice data into multiple groups of frame signals, and extract characteristic coefficients of the voice data according to the frame signals;
  • a positioning module configured to locate the missing frame of the voice data based on a preset detection model and the feature coefficient, and determine the group position of the missing frame in the voice data as the first set of data;
  • an acquisition module, configured to acquire the preceding and following group data of the first group of data, take the preceding and following group data respectively as the second group of data and the third group of data, combine the first group of data, the second group of data, and the third group of data into the first repair group, determine the previous repair group of the first repair group as the second repair group, and obtain the hidden state parameter of the second repair group;
  • a calculation module, configured to input the hidden state parameter, the first group of data, the second group of data, and the third group of data into a preset first audio filling network, and calculate the repaired spectrum corresponding to the missing frame;
  • the processing module is configured to process the repaired frequency spectrum based on a preset vocoder to obtain the repaired voice of the voice data.
  • an embodiment of the present application also provides a computer device, including a memory and a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor, when executing the computer-readable instructions, implements the steps of the voice file repair method described below:
  • dividing voice data into multiple groups of frame signals, and extracting feature coefficients of the voice data according to the frame signals;
  • locating missing frames of the voice data based on a preset detection model and the feature coefficients, and determining the group position of a missing frame in the voice data as the first group of data;
  • obtaining the preceding and following group data of the first group of data, taking the preceding and following group data respectively as the second group of data and the third group of data, combining the first group of data, the second group of data, and the third group of data into the first repair group, determining the previous repair group of the first repair group as the second repair group, and acquiring the hidden state parameter of the second repair group;
  • inputting the hidden state parameter, the first group of data, the second group of data, and the third group of data into a preset first audio filling network, and calculating the repaired spectrum corresponding to the missing frame;
  • processing the repaired spectrum based on a preset vocoder to obtain the repaired voice of the voice data.
  • embodiments of the present application also provide a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions, when executed by a processor, implement the following steps of the voice file repair method:
  • dividing voice data into multiple groups of frame signals, and extracting feature coefficients of the voice data according to the frame signals;
  • locating missing frames of the voice data based on a preset detection model and the feature coefficients, and determining the group position of a missing frame in the voice data as the first group of data;
  • obtaining the preceding and following group data of the first group of data, taking the preceding and following group data respectively as the second group of data and the third group of data, combining the first group of data, the second group of data, and the third group of data into the first repair group, determining the previous repair group of the first repair group as the second repair group, and acquiring the hidden state parameter of the second repair group;
  • inputting the hidden state parameter, the first group of data, the second group of data, and the third group of data into a preset first audio filling network, and calculating the repaired spectrum corresponding to the missing frame;
  • processing the repaired spectrum based on a preset vocoder to obtain the repaired voice of the voice data.
  • In the above voice file repair method, device, computer equipment, and storage medium, the voice data is divided into multiple groups of frame signals, and the feature coefficients of the voice data are extracted according to the frame signals; the missing frames of the voice data are located based on a preset detection model and the feature coefficients, so the frames where voice signals are missing in the current voice data can be found; when a missing frame is located, the group position of the missing frame in the voice data is determined as the first group of data, where the voice data is divided into multiple groups according to the number of frames and the group containing the current missing frame is the first group of data. Because the output result for the current first group of data is often related to the adjacent groups of data, the preceding and following group data of the first group of data are obtained and taken respectively as the second group of data and the third group of data; the first group of data, the second group of data, and the third group of data are combined into the first repair group, and the previous repair group of the first repair group is determined as the second repair group. Because the output result of the current repair group is usually related to the state of the previous repair group, the hidden state parameter of the second repair group is obtained; the hidden state parameter, the first group of data, the second group of data, and the third group of data are input into the preset first audio filling network, from which the repaired spectrum corresponding to the missing frame can be calculated; the repaired spectrum is processed based on the preset vocoder to obtain the final repaired voice of the voice data.
  • In this way, rapid repair of damaged voice data through the repaired spectrum is realized, and the quality of file repair is improved along with the repair speed.
  • The accuracy of speech recognition applications is also improved, so that a complete speech signal is more easily recognized correctly, which further improves the recognition efficiency of speech recognition applications.
  • Figure 1 is an exemplary system architecture diagram to which the present application can be applied;
  • Figure 2 is a schematic flowchart of an embodiment of the voice file repair method;
  • Figure 3 is a basic framework diagram of audio repair;
  • Figure 4 is a structural diagram of the audio filling network;
  • Figure 5 is a schematic structural diagram of an embodiment of the voice file repair device according to the present application;
  • Figure 6 is a schematic structural diagram of an embodiment of the computer device according to the present application.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • Various communication client applications such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software, can be installed on the terminal devices 101, 102, and 103.
  • the terminal devices 101, 102, 103 may be various electronic devices with a display screen and support for web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, and so on.
  • the server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.
  • the voice file repairing method provided in the embodiments of the present application is generally executed by the server/terminal, and accordingly, the voice file repairing device is generally set in the server/terminal device.
  • It should be understood that the numbers of terminals, networks, and servers in Figure 1 are merely illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs.
  • the voice file repair method includes the following steps:
  • Step S201 Divide voice data into multiple groups of frame signals, and extract feature coefficients of the voice data according to the frame signals;
  • when the voice data is acquired, the voice data is divided into multiple groups of frame signals, where a frame signal is a voice signal in units of frames; the division may be performed according to a preset division duration, or according to a preset division length.
  • In this way, the multiple groups of frame signals corresponding to the voice data are obtained.
  • Then the feature coefficients of the voice data are extracted according to the frame signals, where the feature coefficients are the feature representation of the current voice data. Taking the Mel cepstral coefficients as the feature coefficients of the voice data as an example, the Mel cepstral coefficients of each frame signal of the voice data are calculated, and the Mel cepstral coefficients of all frame signals are combined into a two-dimensional array; this two-dimensional array constitutes the feature coefficients of the current voice data.
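  • As a non-authoritative illustration of this step, the following minimal sketch extracts per-frame Mel cepstral coefficients and stacks them into the two-dimensional feature array; librosa, the 10 ms frame length, and the coefficient count are assumptions for illustration, not choices mandated by the application.

```python
import librosa

def extract_feature_coefficients(path, frame_ms=10, n_mfcc=13):
    """Frame the audio and stack per-frame Mel cepstral coefficients
    into a 2-D array (the 'feature coefficients' of the voice data)."""
    y, sr = librosa.load(path, sr=None)
    hop = int(sr * frame_ms / 1000)      # samples per preset division duration
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=2 * hop, hop_length=hop)
    return mfcc.T                        # shape: (num_frames, n_mfcc)
```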
  • Step S202 locate the missing frame of the voice data based on a preset detection model and the feature coefficient, and determine that the group position of the missing frame in the voice data is the first set of data;
  • the missing frame of the current voice data is located according to the preset detection model and the feature coefficient, that is, the time point of the missing signal in the voice data is located.
  • the frame at the time point is the missing frame.
  • the preset detection model is a preset sound detection model, and the acquired feature coefficients are calculated according to the preset detection model, so that the time point of the missing signal in the current voice data can be determined, that is, the corresponding missing frame can be found.
  • the characteristic coefficients are a two-dimensional array, which includes the coefficients corresponding to each frame signal; the coefficients of each frame signal are sequentially input into the preset detection model, and the output result corresponding to the current voice data can be calculated.
  • the output result is a two-dimensional array in which 0 and 1 indicate whether the corresponding frame signal is missing: if a value in the output result is 0, the corresponding frame signal is a missing frame; if a value is 1, the corresponding frame signal is a non-missing frame.
  • the group position in the voice data where the missing frame is located is then determined as the first group of data, where the group positions of different frame signals may differ; the frame signals of the voice data are divided into groups according to a preset number, for example, five frame signals form one group of data. The voice data can thus be divided into multiple groups of data, and the group where the missing frame is located is the first group of data.
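  • A sketch of this grouping logic, using the five-frames-per-group example from the text (the helper name is hypothetical):

```python
import numpy as np

def locate_missing_groups(detection_output, frames_per_group=5):
    """Map the detector's per-frame 0/1 output to group positions:
    0 marks a missing frame, and the group holding it is the
    'first group of data' to be repaired."""
    out = np.asarray(detection_output)
    missing_frames = np.where(out == 0)[0]
    first_groups = np.unique(missing_frames // frames_per_group)
    return missing_frames, first_groups

# frames 7 and 12 are lost -> groups 1 and 2 become "first groups of data"
frames, groups = locate_missing_groups([1]*7 + [0] + [1]*4 + [0] + [1]*3)
```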
  • Step S203: Obtain the preceding and following group data of the first group of data, take the preceding and following group data respectively as the second group of data and the third group of data, combine the first group of data, the second group of data, and the third group of data into the first repair group, determine the previous repair group of the first repair group as the second repair group, and acquire the hidden state parameter of the second repair group;
  • the preceding and following groups of data of the first group of data are obtained; the preceding group is determined to be the second group of data, and the following group is determined to be the third group of data.
  • The first group of data, the second group of data, and the third group of data are combined into the first repair group; that is, the group where the missing frame is located and the two groups of data before and after it serve as the first repair group.
  • The previous repair group of the first repair group is the second repair group, and the second repair group includes the group data of the previous missing frame before the current missing frame and the group data before and after that previous missing frame.
  • For example, the first repair group is (B4, B5, B6), where B5 is the first group of data containing the missing frame in the first repair group; if the group data where the previous missing frame is located is B4,
  • then the second repair group is (B3, B4, B5).
  • Then the hidden state parameter of the second repair group is obtained, where the hidden state parameter is calculated according to a preset Long Short-Term Memory (LSTM) network. The output of the repair group at the current moment is often related to the hidden state parameter of the repair group at the previous moment, and the hidden state parameter of the repair group at the current moment can be calculated from the hidden state parameter of the repair group at the previous moment.
  • This hidden state parameter ensures the continuity of the data between repair groups.
  • Step S204: Input the hidden state parameter, the first group of data, the second group of data, and the third group of data into a preset first audio filling network, and calculate the repaired spectrum corresponding to the missing frame;
  • When the hidden state parameter of the second repair group is obtained, the hidden state parameter, the first group of data, the second group of data, and the third group of data are passed to the first audio filling network, and the first audio filling network calculates the repaired spectrum of the current voice data.
  • the first audio filling network is a preset audio filling network, which includes multiple convolutional layers and related residual dense networks.
  • When the hidden state parameter, the first group of data, the second group of data, and the third group of data are obtained, they are input into the first audio filling network, and the repaired spectrum corresponding to the current repaired frame is obtained.
  • the above-mentioned repaired spectrum can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • Step S205 Process the repaired frequency spectrum based on a preset vocoder to obtain a repaired voice of the voice data.
  • When the repaired spectrum is obtained, it is processed by the preset vocoder, from which the speech signal corresponding to the repaired spectrum can be obtained; this speech signal is the repaired current voice data.
  • Taking wavenet as the preset vocoder as an example: wavenet is a speech generation model based on deep learning that can directly model raw voice data. Therefore, when the repaired spectrum is received, it is input into the speech generation model, and the repaired speech corresponding to the repaired spectrum is obtained. The voice information missing from the current voice data can be recovered through the repaired speech.
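  • The application names wavenet as the preset vocoder, but a trained WaveNet does not ship with common libraries; as a hedged stand-in, the sketch below inverts the repaired Mel spectrogram with librosa's Griffin-Lim-based mel_to_audio. Any neural vocoder could be substituted at this step; the sample rate and hop length are assumptions.

```python
import librosa

def spectrum_to_repaired_voice(repaired_mel, sr=16000, hop_length=160):
    """Turn the repaired (Mel power) spectrum back into a waveform;
    stands in for the preset vocoder described in the text."""
    return librosa.feature.inverse.mel_to_audio(
        repaired_mel, sr=sr, hop_length=hop_length)
```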
  • the above-mentioned locating the missing frame of the voice data based on the preset detection model and the feature coefficient includes:
  • acquiring the preset detection model, the preset detection model including a detection neural network and a fully connected layer, inputting the feature coefficients into the detection neural network, and calculating a detection value;
  • inputting the detection value into the fully connected layer, calculating an output result, and locating the missing frame of the voice data according to the output result.
  • the preset detection model is a preset sound detection model, including a detection neural network and a fully connected layer.
  • the coefficients corresponding to each frame signal are sequentially input into the detection neural network, and the corresponding first detection parameters and second detection parameters are calculated by the detection neural network.
  • The calculation is as follows:
  • (h_t, c_t) = LSTM(x_t, h_{t-1}, c_{t-1})
  • I = σ(W·H + b)
  • where h_t and c_t are the detection parameters at the current moment, namely the first detection parameter and the second detection parameter; x_t is the coefficient of the input frame signal; W is the weight parameter of the fully connected layer and b is its bias; H is the detection value output by the detection neural network, calculated from the first and second detection parameters; and I is the output result of the fully connected layer, taking values in {0, 1}, where 0 indicates that the corresponding frame signal is a missing frame and 1 indicates a non-missing frame.
  • The detection value is calculated by the detection neural network according to the first detection parameter and the second detection parameter; the detection value is then input into the fully connected layer, and the output result is calculated.
  • From the value of the output result it can be judged whether the corresponding frame signal is a missing frame: if the value of the current output result is the preset value representing a missing frame, the frame signal corresponding to the output result is determined to be a missing frame; if the value is the preset value representing a non-missing frame, the frame signal is determined to be a non-missing frame.
  • The output result contains only the value representing missing frames and the value representing non-missing frames; there are only these two cases.
  • the rapid and accurate positioning of the missing frame according to the preset detection model is realized, which further improves the repair efficiency and accuracy of the subsequent repair of the missing frame.
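  • A minimal sketch of such a detector, assuming PyTorch and illustrative layer sizes (the application does not fix them): an LSTM produces the detection parameters per frame, and a fully connected layer with weights W and bias b maps the detection value to a missing/non-missing decision.

```python
import torch
import torch.nn as nn

class MissingFrameDetector(nn.Module):
    """LSTM detection network followed by a fully connected layer,
    emitting 0 (missing) or 1 (non-missing) per frame signal."""
    def __init__(self, n_coeff=13, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_coeff, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)       # weights W and bias b

    def forward(self, coeffs):               # (batch, frames, n_coeff)
        h, _ = self.lstm(coeffs)             # detection values per frame
        prob = torch.sigmoid(self.fc(h).squeeze(-1))
        return (prob > 0.5).long()           # 1 = present, 0 = missing
```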
  • the foregoing obtaining the preset detection model includes:
  • The basic detection model is trained according to the training data set, and the successfully trained basic detection model serves as the preset detection model for the voice data.
  • Before the preset detection model is acquired, the basic detection model needs to be trained in advance, and the trained basic model is the preset detection model.
  • the original document in a preset corpus is acquired, and the preset corpus may be a public corpus.
  • a voice file that has not undergone any processing in the preset corpus is used as an original file, and the original file is divided into multiple frame data, for example, using 20s as a frame, the original file is divided into multiple frame data.
  • a preset number of subframe data is randomly extracted from all the frame data, and the signal of the preset time period in the subframe data is replaced with Gaussian white noise, thereby obtaining the replaced subframe data; the replaced subframe data and the unreplaced frame data are combined into a training data set.
  • For example, the original file is segmented to obtain 10 frame data, 5 subframe data are randomly extracted from the 10 frame data, and a 0.1s voice signal is randomly selected within each of the 5 subframe data and replaced with Gaussian white noise to simulate a missing voice signal.
  • The replaced subframe data are thus obtained, and the replaced subframe data are combined with the remaining 5 unreplaced frame data to form the training data set.
  • The training data set is then input into the basic detection model for training.
  • a number of verification frame data are randomly selected as verification data.
  • the verification frame data includes frame data with missing signals and frame data without missing signals.
  • the training of the preset detection model is implemented, so that the missing frames in the voice data can be accurately located by the preset detection model, which further improves the accuracy of locating the missing frames.
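  • A sketch of this corpus corruption, assuming numpy input arrays; the noise variance is an assumption (the text only specifies Gaussian white noise), and the helper name is hypothetical.

```python
import numpy as np

def build_training_set(signal, sr, frame_s=20, n_corrupt=5, noise_s=0.1, seed=0):
    """Cut the original file into 20 s frames, pick n_corrupt of them at
    random, and overwrite a random 0.1 s span in each with Gaussian white
    noise, simulating a missing voice signal."""
    rng = np.random.default_rng(seed)
    flen = frame_s * sr
    frames = [signal[i:i + flen].copy()
              for i in range(0, len(signal) - flen + 1, flen)]
    for idx in rng.choice(len(frames), size=min(n_corrupt, len(frames)),
                          replace=False):
        nlen = int(noise_s * sr)
        start = rng.integers(0, flen - nlen)
        frames[idx][start:start + nlen] = rng.normal(0.0, 1.0, nlen)
    return frames   # replaced + untouched frames form the training data set
```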
  • the foregoing acquiring the hidden state parameter of the second repair group includes:
  • the previous repair group of the second repair group needs to be obtained, and the previous repair group of the second repair group is the third repair group.
  • Denote the cell state of the third repair group as C_{t-1}; the cell state is input into the long short-term memory network, and the cell state of the current second repair group, C_t, is calculated through the network;
  • the cell state is the topmost information-transmission path in the long short-term memory network, which allows information to be passed along the sequence.
  • the current cell state is related to the cell state at the previous moment.
  • the hidden state parameters of the current second repair group can be obtained through the long and short-term memory network.
  • The hidden state parameter is denoted H_t, and (H_t, C_t) = LSTM(I_{t-1}, C_{t-1}), where C_t is the cell state of the current second repair group, C_{t-1} is the cell state of the third repair group, and I_{t-1} is the memory value of the third repair group.
  • the calculation of the hidden state parameter is implemented, so that the repaired spectrum can be accurately calculated by the hidden state parameter, and the accuracy of the voice data repair is further improved.
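  • A minimal sketch of this state propagation, assuming PyTorch and an illustrative 128-dimensional state: an LSTM cell carries the cell state C and hidden state parameter H from one repair group to the next.

```python
import torch
import torch.nn as nn

dim = 128                        # illustrative state size, not from the source
lstm_cell = nn.LSTMCell(input_size=dim, hidden_size=dim)

H_prev = torch.zeros(1, dim)     # hidden state of the third repair group
C_prev = torch.zeros(1, dim)     # cell state C_{t-1} of the third repair group
I_prev = torch.randn(1, dim)     # memory value I_{t-1} of the third repair group

# hidden state parameter H_t and cell state C_t of the second repair group
H_t, C_t = lstm_cell(I_prev, (H_prev, C_prev))
```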
  • In some embodiments of this application, the above-mentioned inputting of the hidden state parameter, the first group of data, the second group of data, and the third group of data into the preset first audio filling network to calculate the repaired spectrum corresponding to the missing frame includes:
  • inputting the first group of data and the second group of data into a preset second audio filling network to calculate a first intermediate variable, and inputting the second group of data and the third group of data into the second audio filling network to calculate a second intermediate variable;
  • Specifically, the first group of data and the second group of data are input into the second audio filling network, where the second audio filling network has the same structure as, but different parameters from, the first audio filling network.
  • The first group of data and the second group of data are processed by the second audio filling network to obtain the first intermediate variable, and the second group of data and the third group of data are processed by the second audio filling network to obtain the second intermediate variable.
  • When the first intermediate variable and the second intermediate variable are obtained, the first intermediate variable, the second intermediate variable, and the hidden state parameter are input into the first audio filling network, so that the repaired spectrum corresponding to the current missing frame can be calculated.
  • The repaired spectrum is calculated as follows:
  • I_i = F_2(I_{i-1,i}, I_{i,i+1}, H_{t-1})
  • where I_i is the repaired spectrum, F_2 is the first audio filling network, I_{i-1,i} is the first intermediate variable, I_{i,i+1} is the second intermediate variable, and H_{t-1} is the hidden state parameter of the second repair group.
  • Figure 3 is the basic framework diagram of audio repair.
  • B1, B2, and B3 form the first repair group corresponding to the first group of data B2 where the current missing frame is located;
  • F2 is the first audio filling network,
  • F1 is the second audio filling network,
  • I_{1,2} is the first intermediate variable of the first repair group,
  • I_{2,3} is the second intermediate variable of the first repair group, and
  • I_2 is the repaired spectrum of the first repair group.
  • H_{t-1} is the hidden state parameter of the previous repair group of the current repair group, that is, of the second repair group, and H_t is the hidden state parameter of the current repair group, that is, of the first repair group. In the middle area, LSTM is the preset long short-term memory network, C_{t-1} is the cell state of the second repair group, C_t is the cell state of the first repair group, σ is the sigmoid activation function, whose output lies between 0 and 1, and Tanh is the hyperbolic tangent function, whose output lies between -1 and 1. In the right triangular area, B2, B3, and B4 form the repair group corresponding to the group data B3 where the next missing frame is located; I_{2,3} is the first intermediate variable corresponding to that repair group, I_{3,4} is the second intermediate variable corresponding to that repair group, and I_3 is the repaired spectrum corresponding to that repair group.
  • the calculation of the repaired spectrum is implemented, so that the repaired spectrum can be accurately calculated through the audio filling network, which further improves the accurate repair of the voice data and improves the quality of the repair of the voice data.
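  • The composition just described can be summarized in a short sketch; F1 and F2 here stand for any modules with the stated signatures (for one possible internal structure, see the network sketch further below).

```python
def repair_missing_group(B_prev, B_cur, B_next, H_prev, F1, F2):
    """Filling pipeline of Figure 3: the second audio filling network F1
    maps adjacent group pairs to intermediate variables, and the first
    audio filling network F2 fuses them with the previous repair group's
    hidden state H_{t-1} into the repaired spectrum I_i."""
    I_left = F1(B_prev, B_cur)            # I_{i-1,i}: first intermediate variable
    I_right = F1(B_cur, B_next)           # I_{i,i+1}: second intermediate variable
    return F2(I_left, I_right, H_prev)    # I_i = F_2(I_{i-1,i}, I_{i,i+1}, H_{t-1})
```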
  • In some embodiments, the second audio filling network includes a first convolutional layer, a second convolutional layer, and a residual dense network, and inputting the first group of data and the second group of data into the preset second audio filling network to calculate the first intermediate variable includes the following.
  • Specifically, the second audio filling network includes a first convolutional layer, a second convolutional layer, and a residual dense network; the first convolutional layer includes a convolution with a kernel size of 5 and a convolution with a kernel size of 3, and the second convolutional layer includes a convolution with a kernel size of 1 and a convolution with a kernel size of 3.
  • The residual dense network includes a convolutional layer, an activation layer, and a connection layer.
  • Figure 4 is a structural diagram of the audio filling network; the first audio filling network and the second audio filling network both have the structure shown in the diagram.
  • In the diagram, convolution 5*5 is a convolutional layer with a kernel size of 5, convolution 3*3 is a convolutional layer with a kernel size of 3, and convolution 1*1 is a convolutional layer with a kernel size of 1.
  • The structure of the residual dense network is shown in detail on the right of Figure 4 and includes convolutional, activation, and connection layers.
  • The activation layer is a rectified linear unit (ReLU).
  • When the first group of data and the second group of data are obtained, they are input into the first convolutional layer of the second audio filling network, and the first parameter value is obtained through the convolutions with kernel sizes of 5 and 3; when the first parameter value is obtained, it is input into the residual dense network to calculate the second parameter value; the second parameter value is then input into the convolution with kernel size 1 and the convolution with kernel size 3 of the second convolutional layer, from which the first intermediate variable is calculated.
  • The first intermediate variable and the second intermediate variable are calculated as follows:
  • I_{i-1,i} = F_1(B_{i-1}, B_i)
  • where I_{i-1,i} denotes the first intermediate variable (or, analogously, the second intermediate variable), F_1 is the second audio filling network, and B_{i-1} and B_i are two adjacent groups of data.
  • the calculation method of the second intermediate variable is the same as that of the first intermediate variable, except that the input parameters are different.
  • the input parameters of the second intermediate variable are the second group of data and the third group of data.
  • the calculation of the intermediate variable is implemented, and the voice data processing speed and data accuracy are improved by means of convolution, so that the repaired spectrum calculated according to the intermediate variable is more accurate.
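  • A hedged PyTorch sketch of this structure; channel counts, block depth, and the use of 2-D convolutions over spectrogram inputs are assumptions, since the application only fixes the kernel sizes and the layer order.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Residual dense network of Figure 4: convolutions, ReLU activations,
    and a 1x1 'connection layer' fusing the densely concatenated features."""
    def __init__(self, ch=32, growth=16, layers=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(ch + i * growth, growth, 3, padding=1)
            for i in range(layers))
        self.fuse = nn.Conv2d(ch + layers * growth, ch, 1)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return x + self.fuse(torch.cat(feats, dim=1))   # residual connection

class AudioFillingNetwork(nn.Module):
    """First convolutional layer (kernel 5 then 3), residual dense network,
    second convolutional layer (kernel 1 then 3), as described above; F2
    would share this structure with different parameters."""
    def __init__(self, ch=32):
        super().__init__()
        self.first = nn.Sequential(nn.Conv2d(2, ch, 5, padding=2),
                                   nn.Conv2d(ch, ch, 3, padding=1))
        self.rdb = ResidualDenseBlock(ch)
        self.second = nn.Sequential(nn.Conv2d(ch, ch, 1),
                                    nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, B_a, B_b):            # spectrograms of adjacent groups
        x = torch.stack([B_a, B_b], dim=1)  # (batch, 2, freq, time)
        return self.second(self.rdb(self.first(x)))  # intermediate variable
```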
  • the above-mentioned division of voice data into multiple sets of frame signals, and extracting characteristic coefficients of the voice data according to the frame signals includes:
  • Specifically, the preset division duration is obtained, where the preset division duration is the preset frame division duration, such as 10ms per frame, and the voice data is divided into multiple groups of frame signals accordingly.
  • the feature coefficient of the voice data is extracted, so that the missing frame signal can be located through the feature coefficient, and the positioning accuracy of the missing frame is improved.
  • the processes in the above-mentioned embodiment methods can be implemented by instructing relevant hardware through computer-readable instructions, which can be stored in a computer-readable storage medium.
  • When the computer-readable instructions are executed, they may include the processes of the above-mentioned method embodiments.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
  • this application provides an embodiment of a device for repairing a voice file.
  • the device embodiment corresponds to the method embodiment shown in FIG. 2, and the device can be applied to various electronic devices.
  • the voice file repairing device 500 in this embodiment includes: a dividing module 501, a positioning module 502, an acquiring module 503, a calculating module 504, and a processing module 505. Among them:
  • the dividing module 501 is configured to divide voice data into multiple groups of frame signals, and extract characteristic coefficients of the voice data according to the frame signals;
  • the dividing module 501 includes:
  • a dividing unit configured to obtain a preset dividing time length, and divide the voice data into multiple groups of frame signals according to the preset dividing time length;
  • a first calculation unit, configured to calculate the Mel cepstral coefficients of each group of the frame signals, and use the Mel cepstral coefficients as the feature coefficients of the voice data.
  • when the voice data is acquired, the voice data is divided into multiple groups of frame signals, where a frame signal is a voice signal in units of frames; the division may be performed according to a preset division duration, or according to a preset division length.
  • In this way, the multiple groups of frame signals corresponding to the voice data are obtained.
  • Then the feature coefficients of the voice data are extracted according to the frame signals, where the feature coefficients are the feature representation of the current voice data. Taking the Mel cepstral coefficients as the feature coefficients of the voice data as an example, the Mel cepstral coefficients of each frame signal of the voice data are calculated, and the Mel cepstral coefficients of all frame signals are combined into a two-dimensional array; this two-dimensional array constitutes the feature coefficients of the current voice data.
  • the positioning module 502 is configured to locate the missing frame of the voice data based on a preset detection model and the feature coefficient, and determine the group position of the missing frame in the voice data as the first set of data;
  • the positioning module 502 includes:
  • the first acquiring unit is configured to acquire a preset detection model, where the preset detection model includes a detection neural network and a fully connected layer, input the feature coefficients into the detection neural network, and calculate a detection value;
  • the second calculation unit is configured to input the detection value to the fully connected layer, calculate an output result, and locate the missing frame of the voice data according to the output result.
  • the first obtaining unit includes:
  • a slicing subunit, used to obtain the original file in the preset corpus, divide the original file into multiple frame data, randomly extract a preset number of subframe data from all the frame data, replace the signal of the preset time period in the subframe data with Gaussian white noise to obtain the replaced subframe data, and combine the replaced subframe data and the unreplaced frame data into a training data set;
  • the training subunit is used to train the basic detection model according to the training data set, and obtain the successfully trained basic detection model as the preset detection model of the voice data.
  • the missing frame of the current voice data is located according to the preset detection model and the feature coefficient, that is, the time point of the missing signal in the voice data is located.
  • the frame at the time point is the missing frame.
  • the preset detection model is a preset sound detection model, and the acquired feature coefficients are calculated according to the preset detection model, so that the time point of the missing signal in the current voice data can be determined, that is, the corresponding missing frame can be found.
  • the characteristic coefficients are a two-dimensional array, which includes the coefficients corresponding to each frame signal; the coefficients of each frame signal are sequentially input into the preset detection model, and the output result corresponding to the current voice data can be calculated.
  • the output result is a two-dimensional array in which 0 and 1 indicate whether the corresponding frame signal is missing: if a value in the output result is 0, the corresponding frame signal is a missing frame; if a value is 1, the corresponding frame signal is a non-missing frame.
  • the group position in the voice data where the missing frame is located is then determined as the first group of data, where the group positions of different frame signals may differ; the frame signals of the voice data are divided into groups according to a preset number, for example, five frame signals form one group of data. The voice data can thus be divided into multiple groups of data, and the group where the missing frame is located is the first group of data.
  • the obtaining module 503 is configured to obtain the preceding and following group data of the first group of data, take the preceding and following group data respectively as the second group of data and the third group of data, combine the first group of data, the second group of data, and the third group of data into the first repair group, determine the previous repair group of the first repair group as the second repair group, and obtain the hidden state parameter of the second repair group;
  • the obtaining module 503 includes:
  • a second acquiring unit configured to determine that the previous repair group of the second repair group is the third repair group, and acquire the cell state of the third repair group;
  • the third calculation unit is configured to calculate the hidden state parameters of the second repair group according to the cell state and a preset long-short-term memory network.
  • the preceding and following groups of data of the first group of data are obtained; the preceding group is determined to be the second group of data, and the following group is determined to be the third group of data.
  • The first group of data, the second group of data, and the third group of data are combined into the first repair group; that is, the group where the missing frame is located and the two groups of data before and after it serve as the first repair group.
  • The previous repair group of the first repair group is the second repair group, and the second repair group includes the group data of the previous missing frame before the current missing frame and the group data before and after that previous missing frame.
  • For example, the first repair group is (B4, B5, B6), where B5 is the first group of data containing the missing frame in the first repair group; if the group data where the previous missing frame is located is B4,
  • then the second repair group is (B3, B4, B5).
  • Then the hidden state parameter of the second repair group is obtained, where the hidden state parameter is calculated according to a preset Long Short-Term Memory (LSTM) network. The output of the repair group at the current moment is often related to the hidden state parameter of the repair group at the previous moment, and the hidden state parameter of the repair group at the current moment can be calculated from the hidden state parameter of the repair group at the previous moment.
  • This hidden state parameter ensures the continuity of data between repair groups.
  • the calculation module 504 is configured to input the hidden state parameter, the first group of data, the second group of data, and the third group of data into a preset first audio filling network, and calculate the repaired spectrum corresponding to the missing frame;
  • the calculation module 504 includes:
  • the fourth calculation unit is configured to input the first set of data and the second set of data into a preset second audio filling network, calculate a first intermediate variable, and combine the second set of data with the The third group of data is input into the second audio filling network, and the second intermediate variable is obtained by calculation;
  • the fifth calculation unit is configured to input the hidden state parameter, the first intermediate variable, and the second intermediate variable into the preset first audio filling network, and calculate the repaired spectrum corresponding to the missing frame, wherein
  • the first audio filling network and the second audio filling network have the same structure and different parameters.
  • the second audio filling network includes a first convolutional layer, a second convolutional layer, and a residual dense network
  • the fourth calculation unit includes:
  • a first calculation subunit configured to input the first set of data and the second set of data into the first convolutional layer to calculate a first parameter value
  • the second calculation subunit is used to input the first parameter value to the residual dense network, calculate the second parameter value, and input the second parameter value to the second convolutional layer to obtain the first intermediate variable.
  • When the hidden state parameter of the second repair group is obtained, the hidden state parameter, the first group of data, the second group of data, and the third group of data are passed to the first audio filling network, and the first audio filling network calculates the repaired spectrum of the current voice data.
  • The first audio filling network is a preset audio filling network, which includes multiple convolutional layers and an associated residual dense network.
  • After the hidden state parameter, the first group of data, the second group of data, and the third group of data are obtained, they are input into the first audio filling network to obtain the repaired spectrum corresponding to the current repaired frame.
  • the above-mentioned repaired spectrum can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the processing module 505 is configured to process the repaired frequency spectrum based on a preset vocoder to obtain the repaired voice of the voice data.
  • When the repaired spectrum is obtained, it is processed by the preset vocoder, from which the speech signal corresponding to the repaired spectrum can be obtained; this speech signal is the repaired current voice data.
  • Taking wavenet as the preset vocoder as an example: wavenet is a speech generation model based on deep learning that can directly model raw voice data. Therefore, when the repaired spectrum is received, it is input into the speech generation model, and the repaired speech corresponding to the repaired spectrum is obtained. The voice information missing from the current voice data can be recovered through the repaired speech.
  • FIG. 6 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device 6 includes a memory 61, a processor 62, and a network interface 63 that communicate with each other through a system bus. It should be pointed out that the figure only shows the computer device 6 with components 61-63, but it should be understood that it is not required to implement all of the illustrated components, and more or fewer components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, and the like.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • the memory 61 includes at least one type of readable storage medium, the readable storage medium including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and the like.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or a memory of the computer device 6.
  • the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk equipped on the computer device 6, a smart media card (SMC), a secure digital (Secure Digital, SD) card, Flash Card, etc.
  • the memory 61 may also include both the internal storage unit of the computer device 6 and its external storage device.
  • the memory 61 is generally used to store an operating system and various application software installed in the computer device 6, such as computer-readable instructions for a voice file repair method.
  • the memory 61 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 62 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 62 is generally used to control the overall operation of the computer device 6.
  • the processor 62 is configured to run computer-readable instructions or process data stored in the memory 61, for example, computer-readable instructions for running the voice file repair method.
  • the network interface 63 may include a wireless network interface or a wired network interface, and the network interface 63 is generally used to establish a communication connection between the computer device 6 and other electronic devices.
  • the computer device proposed in this embodiment realizes rapid repair of damaged voice data, and improves the quality of file repair while increasing the repair speed. In addition, it also improves the accuracy of voice recognition in voice recognition applications, which makes a complete speech signal easier to recognize correctly and further improves the recognition efficiency of speech recognition applications.
  • This application also provides another implementation manner, namely a computer-readable storage medium storing a voice file repair program, where the voice file repair program can be executed by at least one processor, so that the at least one processor executes the steps of the voice file repair method as described above.
  • the computer-readable storage medium proposed in this embodiment realizes rapid repair of damaged voice data, and improves the quality of file repair while increasing the repair speed. In addition, it also improves the voice recognition accuracy of voice recognition applications, making a complete speech signal easier to recognize correctly and further improving the recognition efficiency of such applications.
  • the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to make a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of the present application.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

A voice file repair method and related device, applicable to intelligent customer service or secure payment in smart hospitals, financial venues, and the like, including: extracting feature coefficients of voice data according to frame signals; locating missing frames of the voice data based on a preset detection model and the feature coefficients, and determining the group position of a missing frame in the voice data as the first group of data; combining the first group of data, the second group of data, and the third group of data into a first repair group, and obtaining the hidden state parameter of a second repair group; inputting the hidden state parameter, the first group of data, the second group of data, and the third group of data into a preset first audio filling network, and calculating the repaired spectrum corresponding to the missing frame; and processing the repaired spectrum based on a preset vocoder to obtain the repaired voice of the voice data. In addition, the method also involves blockchain technology: the repaired spectrum can be stored in a blockchain, thereby realizing the repair of damaged audio.

Description

Voice file repair method, apparatus, computer device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 18, 2020, with application number 202010990031.1 and the invention title "语音文件修复方法、装置、计算机设备及存储介质" (Voice file repair method, apparatus, computer device, and storage medium), the entire contents of which are incorporated into this application by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a voice file repair method, apparatus, computer device, and storage medium.
Background
At present, with the rapid development of artificial intelligence, intelligent voice processing is becoming more and more widespread in daily life. For example, voice recognition widely used in intelligent customer service systems and speaker recognition used in secure payment systems both involve intelligent voice processing. However, the inventor realized that for both the voice recognition of a customer service system and the speaker recognition of a payment system, poor network conditions or low-quality equipment will cause the voice signal to freeze or lose frames. Voice repair is therefore the key to solving such problems. In addition, voice repair is also a necessary task in the textual study of historical documents and in film restoration.
At present there is little work on voice repair; traditional voice repair focuses more on speech enhancement, such as speech de-reverberation, speech noise reduction, and speech separation. Among these, speech de-reverberation eliminates the sound blurring caused by the reflection of sound signals in the spatial environment, such as removing echo in an open environment; speech noise reduction is used to reduce various environmental noises; and speech separation suppresses sound signals from other speakers. This type of speech enhancement processing is mostly used to improve a target sound signal, that is, the sound signal to be processed already exists but its quality is poor; when the signal is lost, however, such as frame loss caused by a poor network environment, speech enhancement cannot improve the quality of the audio file. The quality of audio files directly determines the accuracy of intelligent voice processing, and how to repair an audio file on the basis of an existing, poor-quality audio file is a technical problem yet to be solved.
Summary
The purpose of the embodiments of this application is to propose a voice file repair method, apparatus, computer device, and storage medium, aiming to solve the technical problem of repairing damaged voice files.
To solve the above technical problem, an embodiment of this application provides a voice file repair method, adopting the following technical solution:
A voice file repair method, comprising the following steps:
dividing voice data into multiple groups of frame signals, and extracting feature coefficients of the voice data from the frame signals;
locating missing frames of the voice data based on a preset detection model and the feature coefficients, and determining the group position of a missing frame in the voice data as a first data group;
obtaining the data groups before and after the first data group as a second data group and a third data group respectively, combining the first, second, and third data groups into a first repair group, determining the repair group preceding the first repair group as a second repair group, and obtaining a hidden state parameter of the second repair group;
inputting the hidden state parameter and the first, second, and third data groups into a preset first audio filling network, and computing the repaired spectrum corresponding to the missing frame;
processing the repaired spectrum based on a preset vocoder to obtain the repaired speech of the voice data.
To solve the above technical problem, an embodiment of this application further provides a voice file repair apparatus, adopting the following technical solution:
a division module, configured to divide voice data into multiple groups of frame signals and extract feature coefficients of the voice data from the frame signals;
a positioning module, configured to locate missing frames of the voice data based on a preset detection model and the feature coefficients, and determine the group position of a missing frame in the voice data as a first data group;
an acquisition module, configured to obtain the data groups before and after the first data group as a second data group and a third data group respectively, combine the first, second, and third data groups into a first repair group, determine the repair group preceding the first repair group as a second repair group, and obtain a hidden state parameter of the second repair group;
a computation module, configured to input the hidden state parameter and the first, second, and third data groups into a preset first audio filling network and compute the repaired spectrum corresponding to the missing frame;
a processing module, configured to process the repaired spectrum based on a preset vocoder to obtain the repaired speech of the voice data.
To solve the above technical problem, an embodiment of this application further provides a computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the steps of the following voice file repair method:
dividing voice data into multiple groups of frame signals, and extracting feature coefficients of the voice data from the frame signals;
locating missing frames of the voice data based on a preset detection model and the feature coefficients, and determining the group position of a missing frame in the voice data as a first data group;
obtaining the data groups before and after the first data group as a second data group and a third data group respectively, combining the first, second, and third data groups into a first repair group, determining the repair group preceding the first repair group as a second repair group, and obtaining a hidden state parameter of the second repair group;
inputting the hidden state parameter and the first, second, and third data groups into a preset first audio filling network, and computing the repaired spectrum corresponding to the missing frame;
processing the repaired spectrum based on a preset vocoder to obtain the repaired speech of the voice data.
To solve the above technical problem, an embodiment of this application further provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the following voice file repair method:
dividing voice data into multiple groups of frame signals, and extracting feature coefficients of the voice data from the frame signals;
locating missing frames of the voice data based on a preset detection model and the feature coefficients, and determining the group position of a missing frame in the voice data as a first data group;
obtaining the data groups before and after the first data group as a second data group and a third data group respectively, combining the first, second, and third data groups into a first repair group, determining the repair group preceding the first repair group as a second repair group, and obtaining a hidden state parameter of the second repair group;
inputting the hidden state parameter and the first, second, and third data groups into a preset first audio filling network, and computing the repaired spectrum corresponding to the missing frame;
processing the repaired spectrum based on a preset vocoder to obtain the repaired speech of the voice data.
Details of one or more embodiments of this application are set forth in the drawings and the description below; other features and advantages of this application will become apparent from the description, drawings, and claims.
In the above voice file repair method, apparatus, computer device, and storage medium, voice data is divided into multiple groups of frame signals and the feature coefficients of the voice data are extracted from the frame signals; the missing frames of the voice data are located based on a preset detection model and the feature coefficients, so the frames where speech signals are missing from the current voice data can be found. When a missing frame is located, its group position in the voice data is determined as the first data group, the voice data being divided into multiple groups by frame count, with the group containing the current missing frame being that first data group. The output for the current first data group is usually related to its neighboring groups, so the groups before and after the first data group are obtained as the second and third data groups respectively; the first, second, and third data groups are combined into a first repair group, and the repair group preceding it is determined as the second repair group. The output of the current repair group is usually related to the state of the previous repair group, so the hidden state parameter of the second repair group is obtained; the hidden state parameter and the first, second, and third data groups are input into a preset first audio filling network, from which the repaired spectrum corresponding to the missing frame is computed; and the repaired spectrum is processed by a preset vocoder to obtain the final repaired speech of the voice data. This achieves rapid repair of damaged voice data via the repaired spectrum, improving the quality of file repair while increasing the repair speed; it also improves the speech recognition accuracy of speech recognition applications, making the complete speech signal easier to recognize correctly and further improving their recognition efficiency.
Brief Description of the Drawings
To explain the solutions in this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a diagram of an exemplary system architecture to which this application can be applied;
Fig. 2 is a schematic flowchart of one embodiment of the voice file repair method;
Fig. 3 is a basic framework diagram of audio repair;
Fig. 4 is a structural diagram of the audio filling network;
Fig. 5 is a schematic structural diagram of one embodiment of the voice file repair apparatus according to this application;
Fig. 6 is a schematic structural diagram of one embodiment of the computer device according to this application.
Reference numerals: voice file repair apparatus 500; division module 501, positioning module 502, acquisition module 503, computation module 504, and processing module 505.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for describing specific embodiments and are not intended to limit this application. The terms "comprising" and "having" and any variations thereof in the specification, claims, and drawing descriptions are intended to cover a non-exclusive inclusion. Terms such as "first" and "second" in the specification, claims, or drawings are used to distinguish different objects, not to describe a particular order.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearance of the phrase at various places in the specification does not necessarily refer to the same embodiment, nor to independent or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application, not to limit it. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links or fiber-optic cables.
Users may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as web browsers, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102, 103 may be various electronic devices with a display screen that support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers.
The server 105 may be a server providing various services, for example a backend server supporting the pages displayed on the terminal devices 101, 102, 103.
It should be noted that the voice file repair method provided by the embodiments of this application is generally executed by the server/terminal; accordingly, the voice file repair apparatus is generally arranged in the server/terminal device.
It should be understood that the numbers of terminals, networks, and servers in Fig. 1 are merely illustrative; any number of terminal devices, networks, and servers may be provided as needed.
Continuing to refer to Fig. 2, a flowchart of one embodiment of the voice file repair method according to this application is shown. The voice file repair method comprises the following steps:
Step S201: divide voice data into multiple groups of frame signals, and extract feature coefficients of the voice data from the frame signals.
In this embodiment, when voice data is obtained, it is divided into multiple groups of frame signals, where a frame signal is a speech signal in units of frames; the division may be by a preset division duration or by a preset division length. Dividing the voice data yields its corresponding multiple groups of frame signals. The feature coefficients of the voice data are then extracted from these frame signals, the feature coefficient being a feature representation of the current voice data. Taking Mel-frequency cepstral coefficients (MFCCs) as the feature coefficients as an example, the MFCCs of each frame signal of the voice data are computed, and the MFCCs of all frame signals are combined into a two-dimensional array; this two-dimensional array is the feature coefficient of the current voice data.
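As a concrete illustration, the following is a minimal sketch of this framing-and-MFCC step, assuming the librosa library, 10 ms frames as in the later embodiment, and k = 40 coefficients; the function name and parameters are illustrative, not taken from the patent.

```python
# A minimal sketch of step S201 under stated assumptions: librosa for audio
# I/O and MFCC extraction, 10 ms frames, and k = 40 coefficients per frame.
import librosa

def extract_feature_coefficients(wav_path, frame_ms=10, n_mfcc=40):
    y, sr = librosa.load(wav_path, sr=None)   # keep the native sample rate
    hop = int(sr * frame_ms / 1000)           # one frame every 10 ms
    # One MFCC vector per frame; librosa returns shape (n_mfcc, n_frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=2 * hop, hop_length=hop)
    # The feature coefficient X = [x_1, ..., x_n] with x_i in R^k.
    return mfcc.T                             # shape (n_frames, n_mfcc)
```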
Step S202: locate the missing frames of the voice data based on a preset detection model and the feature coefficients, and determine the group position of a missing frame in the voice data as the first data group.
In this embodiment, when the feature coefficients of the voice data are obtained, the missing frames of the current voice data are located from the preset detection model and the feature coefficients; that is, the time points at which signals are missing from the voice data are located, and the frame at such a time point is a missing frame. The preset detection model is a predetermined sound detection model; computing on the obtained feature coefficients with this model determines the time points of missing signals, i.e., finds the corresponding missing frames. Specifically, the feature coefficient is a two-dimensional array containing the coefficients of each frame signal; inputting each frame's coefficients into the preset detection model in turn yields the output result for the current voice data, a two-dimensional array whose entries use 0 and 1 to indicate, respectively, that the corresponding frame signal is missing or not missing. An entry of 0 means the corresponding frame signal is a missing frame; an entry of 1 means it is not. When a missing frame is found, its group position in the voice data is determined as the first data group. Different frame signals may fall in different groups: the frame signals of the voice data are divided into groups of a preset size, e.g., five frame signals per data group, yielding multiple distinct data groups, and the group containing the missing frame is the first data group.
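A minimal sketch of mapping the detection output back to a group position, assuming the 0/1 output convention and the five-frame groups described above (all names are illustrative):

```python
# Map each missing frame (flag 0) to the index of the data group it falls in,
# assuming fixed groups of `group_size` consecutive frames.
def locate_missing_groups(flags, group_size=5):
    return sorted({i // group_size for i, flag in enumerate(flags) if flag == 0})

# e.g. flags = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1]
# -> missing frame 6 lies in group 1, so locate_missing_groups(...) == [1]
```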
Step S203: obtain the data groups before and after the first data group as a second data group and a third data group respectively, combine the first, second, and third data groups into a first repair group, determine the repair group preceding the first repair group as a second repair group, and obtain the hidden state parameter of the second repair group.
In this embodiment, when the first data group is obtained, its preceding and following data groups are obtained, with the preceding group determined as the second data group and the following group as the third data group. The first, second, and third data groups are combined into the first repair group; that is, the group containing the missing frame together with the groups on either side of it forms the first repair group. The repair group preceding the first repair group is the second repair group, which includes the data group containing the previous missing frame together with that frame's preceding and following groups. For example, if the first repair group is (B_4, B_5, B_6), where B_5 is the first data group containing the missing frame and B_4 is the group containing the previous missing frame, then the second repair group is (B_3, B_4, B_5).
When the second repair group is obtained, its hidden state parameter is obtained. The hidden state parameter is a short-term memory parameter computed by a preset long short-term memory (LSTM) network; the output of the repair group at the current moment is usually related to the hidden state parameter of the repair group at the previous moment, and the current repair group's hidden state parameter can be computed from the previous one's. The hidden state parameter thus guarantees the coherence of the data across the repair groups.
Step S204: input the hidden state parameter and the first, second, and third data groups into a preset first audio filling network, and compute the repaired spectrum corresponding to the missing frame.
In this embodiment, when the hidden state parameter of the second repair group is obtained, the hidden state parameter and the first, second, and third data groups are input into the first audio filling network, and the repaired spectrum of the current voice data is computed from this network. The first audio filling network is a predetermined audio filling network comprising multiple convolutional layers and an associated residual dense network. When the hidden state parameter and the first, second, and third data groups are obtained, inputting them into the first audio filling network yields the repaired spectrum corresponding to the current frame to be repaired.
It should be emphasized that, to further ensure the privacy and security of the above repaired spectrum, the repaired spectrum may also be stored in a node of a blockchain.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
Step S205: process the repaired spectrum based on a preset vocoder to obtain the repaired speech of the voice data.
In this embodiment, when the repaired spectrum is obtained, it is processed by the preset vocoder to obtain the speech signal corresponding to the repaired spectrum; this speech signal is the repaired speech of the current voice data. Taking WaveNet as the preset vocoder as an example, WaveNet is a deep-learning-based speech generation model that can model raw speech data directly. Therefore, when the repaired spectrum is received, inputting it into this speech generation model yields the repaired speech corresponding to the repaired spectrum, from which the speech information missing from the current voice data can be obtained.
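The embodiment names WaveNet as the preset vocoder; since no trained WaveNet model is specified here, the sketch below substitutes librosa's Griffin-Lim reconstruction as a stand-in that likewise maps a magnitude spectrum back to a waveform. This is a plainly different technique, used only to illustrate the spectrum-to-speech step.

```python
# A stand-in for the preset vocoder: Griffin-Lim iteratively estimates the
# phase discarded by a magnitude spectrogram and returns time-domain speech.
import librosa

def spectrum_to_waveform(repaired_magnitude_spectrogram, hop_length):
    return librosa.griffinlim(repaired_magnitude_spectrogram,
                              hop_length=hop_length)
```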
This embodiment achieves rapid repair of damaged voice data, improving the quality of file repair while increasing the repair speed; in addition, it improves the speech recognition accuracy of speech recognition applications, making the complete speech signal easier to recognize correctly and further improving their recognition efficiency. This application belongs to the field of artificial intelligence and performs well in both machine learning and deep learning.
In some embodiments of this application, locating the missing frames of the voice data based on the preset detection model and the feature coefficients includes:
obtaining the preset detection model, the preset detection model comprising a detection neural network and a fully connected layer, and inputting the feature coefficients into the detection neural network to compute detection values;
inputting the detection values into the fully connected layer to compute an output result, and locating the missing frames of the voice data according to the output result.
In this embodiment, the preset detection model is a predetermined sound detection model comprising a detection neural network and a fully connected layer. When the feature coefficients are obtained, the coefficients corresponding to each frame signal are input into the detection neural network in turn, and the corresponding first and second detection parameters are computed through the detection neural network as follows:

$$h_t,\; c_t = \mathrm{LSTM}(h_{t-1},\, c_{t-1},\, x_t)$$

$$\hat{y} = \mathrm{softmax}(W H + b)$$

where $h_t$ and $c_t$ denote the detection parameters at the current moment, i.e., the first and second detection parameters, $x_t$ denotes the coefficients of the input frame signal, $W$ is the weight parameter of the fully connected layer, $b$ is its bias parameter, $H$ is the detection value output by the detection neural network, and $\hat{y}$ is the output result of the fully connected layer, in which 0 indicates that the corresponding frame signal is a missing frame and 1 indicates that the corresponding frame signal is not a missing frame.
When the first and second detection parameters are obtained, the detection value is computed from them through the detection neural network and then input into the fully connected layer, from which the output result is computed. Whether the corresponding frame signal is a missing frame is judged from the value of the output result: if the value is the preset value denoting a missing frame, the corresponding frame signal is determined to be a missing frame; if it is the preset value denoting a non-missing frame, the corresponding frame signal is determined not to be missing. The output result contains only these two kinds of values.
This embodiment achieves fast and precise location of missing frames by the preset detection model, further improving the efficiency and accuracy of the subsequent repair of those missing frames.
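A minimal PyTorch sketch of the detection model described by these formulas, an LSTM producing the detection values H followed by a fully connected layer with softmax; layer sizes and the 0 = missing class convention are illustrative assumptions.

```python
# Preset detection model sketch: LSTM over per-frame feature coefficients,
# then a fully connected layer emitting one missing/not-missing decision
# per frame.
import torch
import torch.nn as nn

class FrameLossDetector(nn.Module):
    def __init__(self, n_mfcc=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)  # h_t, c_t recursion
        self.fc = nn.Linear(hidden, 2)                          # W, b of the FC layer

    def forward(self, x):                 # x: (batch, n_frames, n_mfcc)
        h, _ = self.lstm(x)               # detection values H for every frame
        logits = self.fc(h)               # (batch, n_frames, 2)
        # Assumed convention: class 0 = missing frame, class 1 = intact frame.
        return logits.softmax(dim=-1).argmax(dim=-1)
```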
In some embodiments of this application, obtaining the preset detection model includes:
obtaining original files from a preset corpus, splitting the original files into multiple frame data, randomly sampling a preset number of sub-frame data from all the frame data, replacing the signal of a preset time period in the sub-frame data with Gaussian white noise to obtain replaced sub-frame data, and combining the replaced sub-frame data and the unreplaced frame data into a training data set;
training a base detection model on the training data set, the successfully trained base detection model being the preset detection model for the voice data.
In this embodiment, before the preset detection model is obtained, the base detection model needs to be trained in advance; the trained base model is the preset detection model. Specifically, original files are obtained from a preset corpus, which may be a public corpus, with speech files in the corpus that have not undergone any processing serving as original files. An original file is split into multiple frame data, for example with 20 s as one frame. When the frame data are obtained, a preset number of sub-frame data are randomly sampled from all the frame data, and the signal of a preset time period within each sub-frame data is replaced with Gaussian white noise, yielding the replaced sub-frame data; the replaced sub-frame data and the unreplaced frame data are combined into the training data set. For example, an original file is split into 10 frame data, 5 sub-frame data are randomly drawn from them, and within each of the 5 a randomly chosen 0.1 s speech segment is replaced with Gaussian white noise to simulate the loss of the speech signal; the replaced sub-frame data together with the remaining 5 unreplaced frame data form the training data set.
When the training data set is obtained, it is fed into the base detection model for training. After training on the training data set, several verification frame data are randomly selected as validation data, including both frame data with missing signals and frame data without, and the trained base detection model is verified against this validation data. If its verification accuracy on the validation data reaches a preset accuracy, the base detection model is determined to be successfully trained, and this successfully trained base detection model is the preset detection model.
This embodiment achieves the training of the preset detection model, enabling precise location of missing frames in the voice data through that model and further improving the accuracy of missing-frame location.
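A sketch of this training-set construction, assuming plain NumPy, 20 s frames, five corrupted sub-frames, and a 0.1 s Gaussian-noise gap as in the example; all names are illustrative.

```python
# Build a training set by splitting a signal into frames and replacing a
# random short segment of some frames with Gaussian white noise to simulate
# a lost signal. Labels: 1 = intact frame, 0 = corrupted frame.
import numpy as np

def build_training_set(signal, sr, frame_s=20, n_corrupt=5, gap_s=0.1, rng=None):
    rng = rng or np.random.default_rng()
    frame_len = int(sr * frame_s)
    frames = [signal[i:i + frame_len].copy()
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    corrupt_idx = rng.choice(len(frames), size=min(n_corrupt, len(frames)),
                             replace=False)
    gap = int(sr * gap_s)
    labels = np.ones(len(frames), dtype=int)
    for i in corrupt_idx:
        start = rng.integers(0, frame_len - gap)
        frames[i][start:start + gap] = rng.normal(0.0, 1.0, gap)  # white noise
        labels[i] = 0
    return frames, labels
```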
In some embodiments of this application, obtaining the hidden state parameter of the second repair group includes:
determining the repair group preceding the second repair group as a third repair group, and obtaining the cell state of the third repair group;
computing the hidden state parameter of the second repair group from the cell state and a preset long short-term memory network.
In this embodiment, to compute the hidden state parameter of the second repair group, the repair group preceding it must be obtained; this preceding repair group is the third repair group. The cell state C_{t-1} of the third repair group is obtained and input into the long short-term memory network, which computes from it the cell state C_t of the current second repair group. The cell state is the topmost information-transmission path in the LSTM, allowing information to be passed along the sequence; the current cell state is related to the cell state of the previous moment. When the cell state of the second repair group is obtained, the hidden state parameter of the current second repair group is obtained through the LSTM. Specifically, with the hidden state parameter denoted H_t, the corresponding formula is

$$H_t,\; C_t = \mathrm{LSTM}(H_{t-1},\, C_{t-1},\, I_{t-1})$$

where C_t is the cell state of the current second repair group, C_{t-1} is the cell state of the third repair group, and I_{t-1} is the memory value of the third repair group.
This embodiment achieves the computation of the hidden state parameter, enabling precise computation of the repaired spectrum from it and further improving the precision of voice data repair.
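A minimal sketch of carrying the hidden state H_t and cell state C_t across repair groups with an LSTM cell; the repair-group encoding I_{t-1} is assumed, for illustration only, to be a fixed-size vector.

```python
# Carry (H, C) across successive repair groups so each group's output stays
# coherent with the previous group's state, as in H_t, C_t = LSTM(H_{t-1},
# C_{t-1}, I_{t-1}).
import torch
import torch.nn as nn

dim = 128                                        # illustrative feature size
cell = nn.LSTMCell(dim, dim)
H, C = torch.zeros(1, dim), torch.zeros(1, dim)  # states before the first group

for I in torch.randn(10, 1, dim):                # encodings of 10 repair groups
    H, C = cell(I, (H, C))                       # new states from previous ones
```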
In some embodiments of this application, inputting the hidden state parameter and the first, second, and third data groups into the preset first audio filling network and computing the repaired spectrum corresponding to the missing frame includes:
inputting the first and second data groups into a preset second audio filling network to compute a first intermediate variable, and inputting the second and third data groups into the second audio filling network to compute a second intermediate variable;
inputting the hidden state parameter, the first intermediate variable, and the second intermediate variable into the preset first audio filling network to compute the repaired spectrum corresponding to the missing frame, where the first and second audio filling networks have the same structure and different parameters.
In this embodiment, when the hidden state parameter and the first, second, and third data groups are obtained, the first and second data groups are input into the second audio filling network, which has the same structure as the first audio filling network but different parameters. Computing on the first and second data groups with the second audio filling network yields the first intermediate variable, and computing on the second and third data groups yields the second intermediate variable. When the first and second intermediate variables are obtained, the first intermediate variable, the second intermediate variable, and the hidden state parameter are input into the first audio filling network, from which the repaired spectrum corresponding to the current missing frame is computed:

$$I_i = F_2(I_{i-1,i},\; I_{i,i+1},\; H_{t-1})$$

where $I_i$ is the repaired spectrum, $F_2$ is the first audio filling network, $I_{i-1,i}$ is the first intermediate variable, $I_{i,i+1}$ is the second intermediate variable, and $H_{t-1}$ is the hidden state parameter of the second repair group.
As shown in Fig. 3, the basic framework diagram of audio repair: in the left triangular region, B_1, B_2, B_3 form the first repair group corresponding to the first data group B_2 containing the current missing frame; F_2 is the first audio filling network, F_1 the second audio filling network, I_{1-2} the first intermediate variable of the first repair group, I_{2-3} its second intermediate variable, I_2 its repaired spectrum, H_{t-1} the hidden state parameter of the repair group preceding the current one (the second repair group), and H_t the hidden state parameter of the current repair group (the first repair group). In the middle region, LSTM is the preset long short-term memory network, C_{t-1} the cell state of the second repair group, C_t the cell state of the first repair group, σ the sigmoid activation function, whose output lies between 0 and 1, and Tanh the hyperbolic tangent function, whose output lies between -1 and 1. In the right triangular region, B_2, B_3, B_4 form the repair group corresponding to the group B_3 containing the next missing frame, I_{2-3} is that repair group's first intermediate variable, I_{3-4} its second intermediate variable, and I_3 its repaired spectrum.
This embodiment achieves the computation of the repaired spectrum, enabling its precise computation through the audio filling networks, further achieving precise repair of the voice data and improving the repair quality.
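A minimal sketch of the composition in Fig. 3, with f1 and f2 standing for the second (F_1) and first (F_2) audio filling networks; their internals are sketched separately below, and the tensor shapes and channel-axis concatenation are assumptions for illustration.

```python
# Compose the two filling networks as in Fig. 3 (shapes are assumptions).
import torch

def repair_spectrum(f1, f2, B_prev, B_cur, B_next, H_prev):
    I_prev_cur = f1(torch.cat([B_prev, B_cur], dim=1))   # I_{i-1,i} = F_1(B_{i-1}, B_i)
    I_cur_next = f1(torch.cat([B_cur, B_next], dim=1))   # I_{i,i+1} = F_1(B_i, B_{i+1})
    # I_i = F_2(I_{i-1,i}, I_{i,i+1}, H_{t-1}): the repaired spectrum of the
    # missing frame, conditioned on the previous repair group's hidden state.
    return f2(torch.cat([I_prev_cur, I_cur_next, H_prev], dim=1))
```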
In some embodiments of this application, the second audio filling network comprises a first convolutional layer, a second convolutional layer, and a residual dense network, and inputting the first and second data groups into the preset second audio filling network to compute the first intermediate variable includes:
inputting the first and second data groups into the first convolutional layer to compute a first parameter value;
inputting the first parameter value into the residual dense network to compute a second parameter value, and inputting the second parameter value into the second convolutional layer to obtain the first intermediate variable.
In this embodiment, the second audio filling network comprises a first convolutional layer, a second convolutional layer, and a residual dense network, where the first convolutional layer comprises a convolution with kernel size 5 and a convolution with kernel size 3, the second convolutional layer comprises a convolution with kernel size 1 and a convolution with kernel size 3, and the residual dense network comprises convolutional, activation, and concatenation layers. As shown in Fig. 4, the structural diagram of the audio filling network, both the first and the second audio filling networks have this structure: "convolution, 5×5" denotes a convolutional layer with kernel size 5, "convolution, 3×3" one with kernel size 3, and "convolution, 1×1" one with kernel size 1. The structure of the residual dense network is shown on the right of Fig. 4 and comprises convolution, activation, and concatenation layers, the activation layer being a rectified linear unit (ReLU).
When the first and second data groups are obtained, they are input into the first convolutional layer of the second audio filling network, passing in turn through the kernel-5 convolution and the kernel-3 convolution to compute the first parameter value. The first parameter value is input into the residual dense network to compute the second parameter value, which is then input into the kernel-1 and kernel-3 convolutions of the second convolutional layer to compute the first intermediate variable. The first and second intermediate variables are computed as

$$I_{i-1,i} = F_1(B_{i-1},\; B_i)$$

where $I_{i-1,i}$ is the first or second intermediate variable, $F_1$ is the second audio filling network, and $B_{i-1}$ and $B_i$ are two adjacent data groups. The second intermediate variable is computed in the same way as the first; only the inputs differ, the second intermediate variable taking the second and third data groups as input.
This embodiment achieves the computation of the intermediate variables, and the convolutional approach increases the processing speed and accuracy of the voice data, making the repaired spectrum computed from the intermediate variables more accurate.
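A PyTorch sketch of an audio filling network with the structure just described: a kernel-5 then kernel-3 convolution stage, a residual dense block of convolution, ReLU, and concatenation, then kernel-1 and kernel-3 convolutions. Channel counts and depths are illustrative assumptions.

```python
# Sketch of F_1 / F_2: head convolutions (5 then 3), a residual dense block,
# and tail convolutions (1 then 3); channel counts are assumptions.
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    def __init__(self, ch=64, growth=32, layers=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(ch + i * growth, growth, 3, padding=1) for i in range(layers))
        self.fuse = nn.Conv2d(ch + layers * growth, ch, 1)   # back to ch channels

    def forward(self, x):
        feats = [x]
        for conv in self.convs:                 # conv -> ReLU -> concatenate
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return x + self.fuse(torch.cat(feats, dim=1))        # residual connection

class AudioFillingNetwork(nn.Module):
    def __init__(self, in_ch=2, ch=64):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(in_ch, ch, 5, padding=2),
                                  nn.Conv2d(ch, ch, 3, padding=1))
        self.rdn = ResidualDenseBlock(ch)
        self.tail = nn.Sequential(nn.Conv2d(ch, ch, 1),
                                  nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, x):                       # x: two stacked group spectrograms
        return self.tail(self.rdn(self.head(x)))
```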
In some embodiments of this application, dividing the voice data into multiple groups of frame signals and extracting the feature coefficients of the voice data from the frame signals includes:
obtaining a preset division duration, and dividing the voice data into multiple groups of frame signals according to the preset division duration;
computing the Mel-frequency cepstral coefficients of each group of frame signals, and taking the Mel-frequency cepstral coefficients as the feature coefficients of the voice data.
In this embodiment, to extract the feature coefficients of the voice data, a preset division duration is obtained, i.e., a predetermined frame division duration, for example 10 ms per frame, and the voice data is divided into multiple groups of frame signals accordingly. The Mel-frequency cepstral coefficients of each group of frame signals are computed, and the MFCCs of all frame signals are output as one two-dimensional array, which is the feature coefficient of the current voice data. This two-dimensional array can be expressed as

$$X = [x_1, x_2, \ldots, x_n], \qquad x_i \in \mathbb{R}^k$$

where $n$ is the number of frame signals, $x_i$ is the Mel-frequency cepstral coefficient computed for each frame signal, and $k$ is a constant coefficient dimension, e.g., $k = 40$.
This embodiment achieves the extraction of the feature coefficients of the voice data, enabling the missing frame signals to be located through these coefficients and improving the accuracy of missing-frame location.
Those of ordinary skill in the art can understand that all or part of the processes in the above embodiment methods can be completed by instructing relevant hardware through computer-readable instructions, which can be stored in a computer-readable storage medium and which, when executed, may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or a random access memory (RAM), etc.
It should be understood that although the steps in the flowcharts of the drawings are displayed in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on their execution, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
With further reference to Fig. 5, as an implementation of the method shown in Fig. 2 above, this application provides an embodiment of a voice file repair apparatus; this apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus can specifically be applied to various electronic devices.
As shown in Fig. 5, the voice file repair apparatus 500 of this embodiment comprises: a division module 501, a positioning module 502, an acquisition module 503, a computation module 504, and a processing module 505. Among them,
the division module 501 is configured to divide voice data into multiple groups of frame signals and extract feature coefficients of the voice data from the frame signals;
wherein the division module 501 includes:
a division unit, configured to obtain a preset division duration and divide the voice data into multiple groups of frame signals according to the preset division duration;
a first computation unit, configured to compute the Mel-frequency cepstral coefficients of each group of frame signals and take the Mel-frequency cepstral coefficients as the feature coefficients of the voice data.
In this embodiment, when voice data is obtained, it is divided into multiple groups of frame signals, where a frame signal is a speech signal in units of frames; the division may be by a preset division duration or by a preset division length. Dividing the voice data yields its corresponding multiple groups of frame signals, from which the feature coefficients of the voice data are extracted; the feature coefficient is a feature representation of the current voice data. Taking Mel-frequency cepstral coefficients as the feature coefficients as an example, the MFCCs of each frame signal of the voice data are computed and those of all frame signals are combined into a two-dimensional array, which is the feature coefficient of the current voice data.
the positioning module 502 is configured to locate missing frames of the voice data based on a preset detection model and the feature coefficients, and determine the group position of a missing frame in the voice data as a first data group;
wherein the positioning module 502 includes:
a first acquisition unit, configured to obtain the preset detection model, the preset detection model comprising a detection neural network and a fully connected layer, and input the feature coefficients into the detection neural network to compute detection values;
a second computation unit, configured to input the detection values into the fully connected layer to compute an output result, and locate the missing frames of the voice data according to the output result.
wherein the first acquisition unit includes:
a splitting subunit, configured to obtain original files from a preset corpus, split the original files into multiple frame data, randomly sample a preset number of sub-frame data from all the frame data, replace the signal of a preset time period in the sub-frame data with Gaussian white noise to obtain replaced sub-frame data, and combine the replaced sub-frame data and the unreplaced frame data into a training data set;
a training subunit, configured to train a base detection model on the training data set, the successfully trained base detection model being the preset detection model for the voice data.
In this embodiment, when the feature coefficients of the voice data are obtained, the missing frames of the current voice data are located from the preset detection model and the feature coefficients; that is, the time points at which signals are missing from the voice data are located, and the frame at such a time point is a missing frame. The preset detection model is a predetermined sound detection model; computing on the obtained feature coefficients with this model determines the time points of missing signals, i.e., finds the corresponding missing frames. Specifically, the feature coefficient is a two-dimensional array containing the coefficients of each frame signal; inputting each frame's coefficients into the preset detection model in turn yields the output result for the current voice data, a two-dimensional array whose entries use 0 and 1 to indicate, respectively, that the corresponding frame signal is missing or not missing. An entry of 0 means the corresponding frame signal is a missing frame; an entry of 1 means it is not. When a missing frame is found, its group position in the voice data is determined as the first data group. Different frame signals may fall in different groups: the frame signals of the voice data are divided into groups of a preset size, e.g., five frame signals per data group, yielding multiple distinct data groups, and the group containing the missing frame is the first data group.
the acquisition module 503 is configured to obtain the data groups before and after the first data group as a second data group and a third data group respectively, combine the first, second, and third data groups into a first repair group, determine the repair group preceding the first repair group as a second repair group, and obtain the hidden state parameter of the second repair group;
wherein the acquisition module 503 includes:
a second acquisition unit, configured to determine the repair group preceding the second repair group as a third repair group and obtain the cell state of the third repair group;
a third computation unit, configured to compute the hidden state parameter of the second repair group from the cell state and a preset long short-term memory network.
In this embodiment, when the first data group is obtained, its preceding and following data groups are obtained, with the preceding group determined as the second data group and the following group as the third data group. The first, second, and third data groups are combined into the first repair group; that is, the group containing the missing frame together with the groups on either side of it forms the first repair group. The repair group preceding the first repair group is the second repair group, which includes the data group containing the previous missing frame together with that frame's preceding and following groups. For example, if the first repair group is (B_4, B_5, B_6), where B_5 is the first data group containing the missing frame and B_4 is the group containing the previous missing frame, then the second repair group is (B_3, B_4, B_5).
When the second repair group is obtained, its hidden state parameter is obtained. The hidden state parameter is a short-term memory parameter computed by a preset long short-term memory (LSTM) network; the output of the repair group at the current moment is usually related to the hidden state parameter of the repair group at the previous moment, and the current repair group's hidden state parameter can be computed from the previous one's. The hidden state parameter thus guarantees the coherence of the data across the repair groups.
the computation module 504 is configured to input the hidden state parameter and the first, second, and third data groups into a preset first audio filling network and compute the repaired spectrum corresponding to the missing frame;
wherein the computation module 504 includes:
a fourth computation unit, configured to input the first and second data groups into a preset second audio filling network to compute a first intermediate variable, and input the second and third data groups into the second audio filling network to compute a second intermediate variable;
a fifth computation unit, configured to input the hidden state parameter, the first intermediate variable, and the second intermediate variable into the preset first audio filling network to compute the repaired spectrum corresponding to the missing frame, where the first and second audio filling networks have the same structure and different parameters.
wherein the second audio filling network comprises a first convolutional layer, a second convolutional layer, and a residual dense network, and the fourth computation unit includes:
a first computation subunit, configured to input the first and second data groups into the first convolutional layer to compute a first parameter value;
a second computation subunit, configured to input the first parameter value into the residual dense network to compute a second parameter value, and input the second parameter value into the second convolutional layer to obtain the first intermediate variable.
In this embodiment, when the hidden state parameter of the second repair group is obtained, the hidden state parameter and the first, second, and third data groups are input into the first audio filling network, and the repaired spectrum of the current voice data is computed from this network. The first audio filling network is a predetermined audio filling network comprising multiple convolutional layers and an associated residual dense network; when the hidden state parameter and the first, second, and third data groups are obtained, inputting them into the first audio filling network yields the repaired spectrum corresponding to the current frame to be repaired.
It should be emphasized that, to further ensure the privacy and security of the above repaired spectrum, the repaired spectrum may also be stored in a node of a blockchain.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
the processing module 505 is configured to process the repaired spectrum based on a preset vocoder to obtain the repaired speech of the voice data.
In this embodiment, when the repaired spectrum is obtained, it is processed by the preset vocoder to obtain the speech signal corresponding to the repaired spectrum; this speech signal is the repaired speech of the current voice data. Taking WaveNet as the preset vocoder as an example, WaveNet is a deep-learning-based speech generation model that can model raw speech data directly. Therefore, when the repaired spectrum is received, inputting it into this speech generation model yields the repaired speech corresponding to the repaired spectrum, from which the speech information missing from the current voice data can be obtained.
This embodiment achieves rapid repair of damaged voice data, improving the quality of file repair while increasing the repair speed; in addition, it improves the speech recognition accuracy of speech recognition applications, making the complete speech signal easier to recognize correctly and further improving their recognition efficiency.
To solve the above technical problem, an embodiment of this application further provides a computer device. Refer specifically to Fig. 6, a block diagram of the basic structure of the computer device of this embodiment.
The computer device 6 comprises a memory 61, a processor 62, and a network interface 63 communicatively connected to one another through a system bus. It should be pointed out that the figure only shows the computer device 6 with components 61-63, but it should be understood that not all the illustrated components are required; more or fewer components may be implemented instead. Those skilled in the art understand that the computer device here is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions, and that its hardware includes but is not limited to microprocessors, application specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server, and may interact with the user through a keyboard, mouse, remote control, touchpad, voice-controlled device, or the like.
The memory 61 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, and the like. The computer-readable storage medium may be non-volatile or volatile. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as its hard disk or internal memory. In other embodiments, the memory 61 may be an external storage device of the computer device 6, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on the computer device 6. Of course, the memory 61 may also include both the internal storage unit of the computer device 6 and its external storage device. In this embodiment, the memory 61 is generally used to store the operating system and various application software installed on the computer device 6, such as computer-readable instructions of the voice file repair method, and may also be used to temporarily store various data that has been output or is to be output.
The processor 62 may, in some embodiments, be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is generally used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is used to run the computer-readable instructions or process the data stored in the memory 61, for example to run the computer-readable instructions of the voice file repair method.
The network interface 63 may include a wireless or wired network interface and is generally used to establish communication connections between the computer device 6 and other electronic devices.
The computer device proposed in this embodiment achieves rapid repair of damaged voice data, improving the quality of file repair while increasing the repair speed; in addition, it improves the speech recognition accuracy of speech recognition applications, making the complete speech signal easier to recognize correctly and further improving their recognition efficiency.
This application also provides another implementation, namely a computer-readable storage medium storing a voice file repair process, the voice file repair process being executable by at least one processor so as to cause the at least one processor to execute the steps of the voice file repair method described above.
The computer-readable storage medium proposed in this embodiment achieves rapid repair of damaged voice data, improving the quality of file repair while increasing the repair speed; in addition, it improves the speech recognition accuracy of speech recognition applications, making the complete speech signal easier to recognize correctly and further improving their recognition efficiency.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and comprising several instructions to cause a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, etc.) to execute the methods described in the embodiments of this application.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
Obviously, the embodiments described above are only some rather than all of the embodiments of this application; the drawings show preferred embodiments of this application but do not limit its patent scope. This application may be implemented in many different forms; rather, these embodiments are provided so that the disclosure of this application will be understood more thoroughly and comprehensively. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing specific implementations or make equivalent replacements for some of the technical features therein. Any equivalent structure made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims (20)

1. A voice file repair method, comprising the following steps:
    dividing voice data into multiple groups of frame signals, and extracting feature coefficients of the voice data from the frame signals;
    locating missing frames of the voice data based on a preset detection model and the feature coefficients, and determining the group position of a missing frame in the voice data as a first data group;
    obtaining the data groups before and after the first data group as a second data group and a third data group respectively, combining the first, second, and third data groups into a first repair group, determining the repair group preceding the first repair group as a second repair group, and obtaining a hidden state parameter of the second repair group;
    inputting the hidden state parameter and the first, second, and third data groups into a preset first audio filling network, and computing a repaired spectrum corresponding to the missing frame;
    processing the repaired spectrum based on a preset vocoder to obtain repaired speech of the voice data.
2. The voice file repair method according to claim 1, wherein the step of locating the missing frames of the voice data based on the preset detection model and the feature coefficients comprises:
    obtaining the preset detection model, the preset detection model comprising a detection neural network and a fully connected layer, and inputting the feature coefficients into the detection neural network to compute detection values;
    inputting the detection values into the fully connected layer to compute an output result, and locating the missing frames of the voice data according to the output result.
3. The voice file repair method according to claim 2, wherein the step of obtaining the preset detection model comprises:
    obtaining original files from a preset corpus, splitting the original files into multiple frame data, randomly sampling a preset number of sub-frame data from all the frame data, replacing a signal of a preset time period in the sub-frame data with Gaussian white noise to obtain replaced sub-frame data, and combining the replaced sub-frame data and the unreplaced frame data into a training data set;
    training a base detection model on the training data set, the successfully trained base detection model being the preset detection model for the voice data.
4. The voice file repair method according to claim 1, wherein the step of obtaining the hidden state parameter of the second repair group comprises:
    determining the repair group preceding the second repair group as a third repair group, and obtaining a cell state of the third repair group;
    computing the hidden state parameter of the second repair group from the cell state and a preset long short-term memory network.
5. The voice file repair method according to claim 1, wherein the step of inputting the hidden state parameter and the first, second, and third data groups into the preset first audio filling network and computing the repaired spectrum corresponding to the missing frame comprises:
    inputting the first and second data groups into a preset second audio filling network to compute a first intermediate variable, and inputting the second and third data groups into the second audio filling network to compute a second intermediate variable;
    inputting the hidden state parameter, the first intermediate variable, and the second intermediate variable into the preset first audio filling network to compute the repaired spectrum corresponding to the missing frame, wherein the first and second audio filling networks have the same structure and different parameters.
6. The voice file repair method according to claim 5, wherein the second audio filling network comprises a first convolutional layer, a second convolutional layer, and a residual dense network, and the step of inputting the first and second data groups into the preset second audio filling network to compute the first intermediate variable comprises:
    inputting the first and second data groups into the first convolutional layer to compute a first parameter value;
    inputting the first parameter value into the residual dense network to compute a second parameter value, and inputting the second parameter value into the second convolutional layer to obtain the first intermediate variable.
7. The voice file repair method according to claim 1, wherein the step of dividing the voice data into multiple groups of frame signals and extracting the feature coefficients of the voice data from the frame signals comprises:
    obtaining a preset division duration, and dividing the voice data into multiple groups of frame signals according to the preset division duration;
    computing Mel-frequency cepstral coefficients of each group of frame signals, and taking the Mel-frequency cepstral coefficients as the feature coefficients of the voice data.
8. A voice file repair apparatus, comprising:
    a division module, configured to divide voice data into multiple groups of frame signals and extract feature coefficients of the voice data from the frame signals;
    a positioning module, configured to locate missing frames of the voice data based on a preset detection model and the feature coefficients, and determine the group position of a missing frame in the voice data as a first data group;
    an acquisition module, configured to obtain the data groups before and after the first data group as a second data group and a third data group respectively, combine the first, second, and third data groups into a first repair group, determine the repair group preceding the first repair group as a second repair group, and obtain a hidden state parameter of the second repair group;
    a computation module, configured to input the hidden state parameter and the first, second, and third data groups into a preset first audio filling network and compute a repaired spectrum corresponding to the missing frame;
    a processing module, configured to process the repaired spectrum based on a preset vocoder to obtain repaired speech of the voice data.
9. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor, when executing the computer-readable instructions, implements the steps of the following voice file repair method:
    dividing voice data into multiple groups of frame signals, and extracting feature coefficients of the voice data from the frame signals;
    locating missing frames of the voice data based on a preset detection model and the feature coefficients, and determining the group position of a missing frame in the voice data as a first data group;
    obtaining the data groups before and after the first data group as a second data group and a third data group respectively, combining the first, second, and third data groups into a first repair group, determining the repair group preceding the first repair group as a second repair group, and obtaining a hidden state parameter of the second repair group;
    inputting the hidden state parameter and the first, second, and third data groups into a preset first audio filling network, and computing a repaired spectrum corresponding to the missing frame;
    processing the repaired spectrum based on a preset vocoder to obtain repaired speech of the voice data.
10. The computer device according to claim 9, wherein the step of locating the missing frames of the voice data based on the preset detection model and the feature coefficients comprises:
    obtaining the preset detection model, the preset detection model comprising a detection neural network and a fully connected layer, and inputting the feature coefficients into the detection neural network to compute detection values;
    inputting the detection values into the fully connected layer to compute an output result, and locating the missing frames of the voice data according to the output result.
11. The computer device according to claim 10, wherein the step of obtaining the preset detection model comprises:
    obtaining original files from a preset corpus, splitting the original files into multiple frame data, randomly sampling a preset number of sub-frame data from all the frame data, replacing a signal of a preset time period in the sub-frame data with Gaussian white noise to obtain replaced sub-frame data, and combining the replaced sub-frame data and the unreplaced frame data into a training data set;
    training a base detection model on the training data set, the successfully trained base detection model being the preset detection model for the voice data.
12. The computer device according to claim 9, wherein the step of obtaining the hidden state parameter of the second repair group comprises:
    determining the repair group preceding the second repair group as a third repair group, and obtaining a cell state of the third repair group;
    computing the hidden state parameter of the second repair group from the cell state and a preset long short-term memory network.
13. The computer device according to claim 9, wherein the step of inputting the hidden state parameter and the first, second, and third data groups into the preset first audio filling network and computing the repaired spectrum corresponding to the missing frame comprises:
    inputting the first and second data groups into a preset second audio filling network to compute a first intermediate variable, and inputting the second and third data groups into the second audio filling network to compute a second intermediate variable;
    inputting the hidden state parameter, the first intermediate variable, and the second intermediate variable into the preset first audio filling network to compute the repaired spectrum corresponding to the missing frame, wherein the first and second audio filling networks have the same structure and different parameters.
14. The computer device according to claim 13, wherein the second audio filling network comprises a first convolutional layer, a second convolutional layer, and a residual dense network, and the step of inputting the first and second data groups into the preset second audio filling network to compute the first intermediate variable comprises:
    inputting the first and second data groups into the first convolutional layer to compute a first parameter value;
    inputting the first parameter value into the residual dense network to compute a second parameter value, and inputting the second parameter value into the second convolutional layer to obtain the first intermediate variable.
15. The computer device according to claim 9, wherein the step of dividing the voice data into multiple groups of frame signals and extracting the feature coefficients of the voice data from the frame signals comprises:
    obtaining a preset division duration, and dividing the voice data into multiple groups of frame signals according to the preset division duration;
    computing Mel-frequency cepstral coefficients of each group of frame signals, and taking the Mel-frequency cepstral coefficients as the feature coefficients of the voice data.
16. A computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the following voice file repair method:
    dividing voice data into multiple groups of frame signals, and extracting feature coefficients of the voice data from the frame signals;
    locating missing frames of the voice data based on a preset detection model and the feature coefficients, and determining the group position of a missing frame in the voice data as a first data group;
    obtaining the data groups before and after the first data group as a second data group and a third data group respectively, combining the first, second, and third data groups into a first repair group, determining the repair group preceding the first repair group as a second repair group, and obtaining a hidden state parameter of the second repair group;
    inputting the hidden state parameter and the first, second, and third data groups into a preset first audio filling network, and computing a repaired spectrum corresponding to the missing frame;
    processing the repaired spectrum based on a preset vocoder to obtain repaired speech of the voice data.
17. The computer-readable storage medium according to claim 16, wherein the step of locating the missing frames of the voice data based on the preset detection model and the feature coefficients comprises:
    obtaining the preset detection model, the preset detection model comprising a detection neural network and a fully connected layer, and inputting the feature coefficients into the detection neural network to compute detection values;
    inputting the detection values into the fully connected layer to compute an output result, and locating the missing frames of the voice data according to the output result.
18. The computer-readable storage medium according to claim 17, wherein the step of obtaining the preset detection model comprises:
    obtaining original files from a preset corpus, splitting the original files into multiple frame data, randomly sampling a preset number of sub-frame data from all the frame data, replacing a signal of a preset time period in the sub-frame data with Gaussian white noise to obtain replaced sub-frame data, and combining the replaced sub-frame data and the unreplaced frame data into a training data set;
    training a base detection model on the training data set, the successfully trained base detection model being the preset detection model for the voice data.
19. The computer-readable storage medium according to claim 16, wherein the step of obtaining the hidden state parameter of the second repair group comprises:
    determining the repair group preceding the second repair group as a third repair group, and obtaining a cell state of the third repair group;
    computing the hidden state parameter of the second repair group from the cell state and a preset long short-term memory network.
20. The computer-readable storage medium according to claim 16, wherein the step of inputting the hidden state parameter and the first, second, and third data groups into the preset first audio filling network and computing the repaired spectrum corresponding to the missing frame comprises:
    inputting the first and second data groups into a preset second audio filling network to compute a first intermediate variable, and inputting the second and third data groups into the second audio filling network to compute a second intermediate variable;
    inputting the hidden state parameter, the first intermediate variable, and the second intermediate variable into the preset first audio filling network to compute the repaired spectrum corresponding to the missing frame, wherein the first and second audio filling networks have the same structure and different parameters.
PCT/CN2020/124898 2020-09-18 2020-10-29 Voice file repair method, apparatus, computer device, and storage medium WO2021169356A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010990031.1 2020-09-18
CN202010990031.1A CN112071331B (zh) 2020-09-18 Voice file repair method, apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021169356A1 true WO2021169356A1 (zh) 2021-09-02

Family

ID=73681627

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/124898 WO2021169356A1 (zh) 2020-10-29 Voice file repair method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112071331B (zh)
WO (1) WO2021169356A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112652310B (zh) * 2020-12-31 2024-08-09 Espressif Systems (Shanghai) Co., Ltd. Distributed voice processing system and method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894565A (zh) * 2009-05-19 2010-11-24 Huawei Technologies Co., Ltd. Voice signal restoration method and device
US20110142257A1 (en) * 2009-06-29 2011-06-16 Goodwin Michael M Reparation of Corrupted Audio Signals
CN107564533A (zh) * 2017-07-12 2018-01-09 Tongji University Speech frame repair method and device based on source prior information
CN108899032A (zh) * 2018-06-06 2018-11-27 Ping An Technology (Shenzhen) Co., Ltd. Voiceprint recognition method and apparatus, computer device, and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8126578B2 (en) * 2007-09-26 2012-02-28 University Of Washington Clipped-waveform repair in acoustic signals using generalized linear prediction
EP3471267B1 (en) * 2017-10-13 2021-01-06 Vestel Elektronik Sanayi ve Ticaret A.S. Method and apparatus for repairing distortion of an audio signal
CN109887515B (zh) * 2019-01-29 2021-07-09 Beijing SenseTime Technology Development Co., Ltd. Audio processing method and apparatus, electronic device, and storage medium
CN110136735B (zh) * 2019-05-13 2021-09-28 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Audio repair method, device, and readable storage medium
CN110534120B (zh) * 2019-08-31 2021-10-01 Shenzhen Youkai Communication Technology Co., Ltd. Surround sound error correction method in a mobile network environment


Also Published As

Publication number Publication date
CN112071331B (zh) 2023-05-30
CN112071331A (zh) 2020-12-11

Similar Documents

Publication Publication Date Title
WO2021155713A1 (zh) Face recognition method based on model fusion with weight grafting, and related devices
WO2022105125A1 (zh) Image segmentation method and apparatus, computer device, and storage medium
WO2021135455A1 (zh) Semantic recall method and apparatus, computer device, and storage medium
US20210097159A1 (en) Electronic device, method and system of identity verification and computer readable storage medium
WO2023035531A1 (zh) Text image super-resolution reconstruction method and related devices
CN112395390B (zh) Training corpus generation method for an intent recognition model, and related devices
WO2022142032A1 (zh) Handwritten signature verification method and apparatus, computer device, and storage medium
WO2021139316A1 (zh) Method and apparatus for establishing an expression recognition model, computer device, and storage medium
WO2021218027A1 (zh) Method, apparatus, device, and medium for extracting professional terms in intelligent interviews
US20220044105A1 (en) Training multimodal representation learning model on unnanotated multimodal data
CN115312033A (zh) Artificial-intelligence-based speech emotion recognition method, apparatus, device, and medium
CN111126084B (zh) Data processing method and apparatus, electronic device, and storage medium
CN114241411B (zh) Counting model processing method and apparatus based on object detection, and computer device
CN107729944B (zh) Method, apparatus, server, and storage medium for identifying vulgar images
CN115438149A (zh) End-to-end model training method and apparatus, computer device, and storage medium
CN116796730A (zh) Artificial-intelligence-based text error correction method, apparatus, device, and storage medium
CN113988223B (zh) Certificate image recognition method and apparatus, computer device, and storage medium
US20240153490A1 (en) Systems and methods for correcting automatic speech recognition errors
CN113436633B (zh) Speaker recognition method and apparatus, computer device, and storage medium
WO2021169356A1 (zh) Voice file repair method, apparatus, computer device, and storage medium
CN117132950A (zh) Vehicle tracking method, system, device, and storage medium
CN114372889A (zh) Underwriting monitoring method and apparatus, computer device, and storage medium
CN115062136A (zh) Event disambiguation method based on graph neural network, and related devices
US10924629B1 (en) Techniques for validating digital media content
CN113361629A (zh) Method and apparatus for generating training samples, computer device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921655

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20921655

Country of ref document: EP

Kind code of ref document: A1