CN115547339A - Voice processing method, processing device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115547339A
Authority
CN
China
Prior art keywords: spectrogram, segment, target, voice, time information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210956020.0A
Other languages
Chinese (zh)
Inventor
丁俊豪
陈东鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Voiceai Technologies Co ltd
Original Assignee
Voiceai Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Voiceai Technologies Co ltd filed Critical Voiceai Technologies Co ltd
Priority to CN202210956020.0A
Publication of CN115547339A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice processing method, a processing device, an electronic device and a storage medium. The voice processing method comprises: obtaining a first sound segment corresponding to a comparison voice and a second sound segment corresponding to a target voice; generating a first spectrogram corresponding to the first sound segment; generating a second spectrogram corresponding to the second sound segment; and performing alignment according to the first spectrogram and the second spectrogram to generate a target spectrogram, wherein the target spectrogram comprises the first spectrogram and an alignment spectrogram corresponding to the second spectrogram, and the time position of the alignment spectrogram is aligned with the time position of the first spectrogram. In this way, a target spectrogram containing the first spectrogram of the first sound segment and the alignment spectrogram of the second sound segment is generated automatically from the first sound segment corresponding to the comparison voice and the second sound segment corresponding to the target voice; the first spectrogram and the second spectrogram do not need to be aligned manually, and the efficiency of voice identity identification is improved.

Description

Voice processing method, processing device, electronic equipment and storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a speech processing method, a processing apparatus, an electronic device, and a storage medium.
Background
Voiceprint identification (Voice Identification) is one of the biometric identification technologies and is also called voice identity identification. In the identification process, when different pieces of voice data need to be compared, for example, to confirm whether they come from the same speaker, the similarity of the spectrograms corresponding to the different pieces of voice data can be compared.
At present, comparing the similarity of the spectrograms corresponding to different voice data requires manually aligning those spectrograms; the operation is cumbersome, and the efficiency of voice identity identification is therefore low.
Disclosure of Invention
In view of the above problems, the present application provides a speech processing method, a processing device, an electronic device and a storage medium to overcome or at least partially solve the above problems of the prior art.
In a first aspect, an embodiment of the present application provides a speech processing method, including: acquiring a first sound segment corresponding to the comparison voice and a second sound segment corresponding to the target voice; generating a first spectrogram corresponding to the first sound segment; generating a second spectrogram corresponding to the second sound segment; and aligning according to the first spectrogram and the second spectrogram to generate a target spectrogram, wherein the target spectrogram comprises an alignment spectrogram corresponding to the first spectrogram and the second spectrogram, and the time position of the alignment spectrogram is aligned with the time position of the first spectrogram.
In a second aspect, an embodiment of the present application provides a speech processing apparatus, including: the acquisition module is used for acquiring a first sound segment corresponding to the comparison voice and a second sound segment corresponding to the target voice; the first generating module is used for generating a first spectrogram corresponding to the first sound segment; the second generating module is used for generating a second spectrogram corresponding to the second sound segment; and the third generation module is used for aligning according to the first spectrogram and the second spectrogram to generate a target spectrogram, wherein the target spectrogram comprises an alignment spectrogram corresponding to the first spectrogram and the second spectrogram, and the boundary of the alignment spectrogram is aligned with the boundary of the first spectrogram.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory; one or more processors coupled with the memory; one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the speech processing method as provided in the first aspect above.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a program code is stored in the computer-readable storage medium, and the program code can be called by a processor to execute the speech processing method provided in the first aspect.
According to the scheme, a first sound segment corresponding to the comparison voice and a second sound segment corresponding to the target voice are obtained, a first spectrogram corresponding to the first sound segment and a second spectrogram corresponding to the second sound segment are generated, and alignment is performed according to the first spectrogram and the second spectrogram to generate a target spectrogram. The target spectrogram comprises the first spectrogram and an alignment spectrogram corresponding to the second spectrogram, and the time position of the alignment spectrogram is aligned with the time position of the first spectrogram. In this way, the target spectrogram containing the first spectrogram of the first sound segment and the alignment spectrogram of the second sound segment is generated automatically from the first sound segment corresponding to the comparison voice and the second sound segment corresponding to the target voice; the first spectrogram and the second spectrogram do not need to be aligned manually, and the efficiency of voice identity identification is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 shows a flowchart of a speech processing method according to an embodiment of the present application.
Fig. 2 shows another flow chart of the speech processing method provided in the embodiment of the present application.
Fig. 3 is a schematic flow chart illustrating a speech processing method according to an embodiment of the present application.
Fig. 4 shows a block diagram of a speech processing apparatus according to an embodiment of the present application.
Fig. 5 shows a functional block diagram of an electronic device provided in an embodiment of the present application.
Fig. 6 illustrates a computer-readable storage medium provided by an embodiment of the present application for storing or carrying program codes for implementing a speech processing method provided by an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are exemplary only for explaining the present application and are not to be construed as limiting the present application.
The following disclosure provides many different embodiments or examples for implementing different features of the application. In order to simplify the disclosure of the present application, specific example components and arrangements are described below. Of course, they are merely examples and are not intended to limit the present application. Moreover, the present application may repeat reference numerals and/or letters in the various examples, such repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
Voiceprint identification (Voice Identification) is one of the biometric identification technologies and is also called voice identity identification. In the identification process, when different pieces of voice data need to be compared, for example, to confirm whether they come from the same speaker, the similarity of the spectrograms corresponding to the different pieces of voice data can be compared.
At present, comparing the similarity of the spectrograms corresponding to different voice data requires manually aligning those spectrograms; the operation is cumbersome, and the efficiency of voice identity identification is therefore low.
In view of the above problems, the inventors have conducted long-term research and provide the voice processing method, processing apparatus, electronic device, and storage medium of the embodiments of the present application. A target spectrogram including the first spectrogram corresponding to the first sound segment and the alignment spectrogram corresponding to the second sound segment is generated automatically from the first sound segment corresponding to the comparison voice and the second sound segment corresponding to the target voice, without manually aligning the first spectrogram and the second spectrogram, thereby improving the efficiency of voice identity identification.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Referring to fig. 1, a flowchart of a speech processing method according to an embodiment of the present application is shown. In a specific embodiment, the voice processing method may be executed by an electronic device with processing capability, such as a terminal device like a desktop computer or a notebook computer, or may be executed interactively by a processing system including a server and a terminal. As shown in fig. 1, the voice processing method may include steps S110 to S140.
Step S110: and acquiring a first sound segment corresponding to the comparison voice and a second sound segment corresponding to the target voice.
In the embodiment of the application, of the two pieces of voice data that need voice identity authentication, the voice data whose identity information is undetermined is the comparison voice, and the voice data whose identity information is known is the target voice.
A corresponding first sound segment is obtained by performing Automatic Speech Recognition (ASR) on the comparison voice, and a corresponding second sound segment is obtained by performing ASR on the target voice.
The first sound segment may include a speech segment carrying start and end time information of a specific phoneme in the comparison voice (e.g., an initial or final in Chinese, or a consonant or vowel in English), or a speech segment carrying start and end time information of a syllable in the comparison voice (e.g., a single Chinese character, or a single English syllable).
The second sound segment may include a speech segment carrying start and end time information of a specific phoneme in the target voice (e.g., an initial or final in Chinese, or a consonant or vowel in English), or a speech segment carrying start and end time information of a syllable in the target voice (e.g., a single Chinese character, or a single English syllable).
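For intuition, the following is a minimal Python sketch, not taken from the patent text, of how such a segment and its start and end time information might be represented; the type and field names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    unit: str        # a phoneme (e.g. a Chinese initial/final or an English consonant/vowel) or a syllable
    start_s: float   # start time within the utterance, in seconds
    end_s: float     # end time within the utterance, in seconds

# For example, the first sound segment could carry the syllable "ai" spanning 1.20 s to 1.48 s
# of the comparison voice; the second sound segment would carry the same unit from the target voice.
first_segment = [Segment("ai", 1.20, 1.48)]
```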
In the embodiment of the present application, a phoneme (phone) is the smallest phonetic unit divided according to the natural attributes of speech, and phonemes fall into two categories, vowels and consonants. Phonemes are analyzed according to the pronunciation actions within a syllable, with one action forming one phoneme. For example, the Mandarin syllable a (ā) has only one phoneme; ai (ài) has two phonemes, a and i; and dai (dài) has three phonemes, d, a and i. Similarly, the English word a has only one phoneme; an has two phonemes, a and n; and red has three phonemes, r, e and d.
It can be understood that, because the same letter is pronounced differently in different languages, a phoneme may be a phoneme of any language, for example a Chinese phoneme or an English phoneme, which is not limited here.
A syllable of the word type is a word, and a syllable of the phoneme type is a phoneme. Thus, in this scheme, the same syllables may be the same word or the same phoneme.
Step S120: and generating a first spectrogram corresponding to the first sound segment.
In this embodiment of the application, the acquired first segment may be preprocessed, and Matlab parameter configuration may be performed on the preprocessed first segment, so as to generate a corresponding first spectrogram.
The preprocessing may include normalization, pre-emphasis, framing, and windowing, in that order. The first sound segment is normalized to reduce differences among different voice fragments; a pre-emphasis technique, implemented with a first-order high-pass filter, boosts the high-frequency components of the first sound segment so that its spectrum is relatively flat from low to high frequencies; the normalized and pre-emphasized first sound segment is then framed and windowed by multiplying it with a window function of a certain length to obtain each windowed frame; and Matlab parameter configuration is performed on each windowed frame to generate the corresponding first spectrogram.
The window function can be a Hamming window, a Hanning window or a rectangular window, etc.; the configured parameter may be at least any one of a signal bandwidth parameter, a dynamic range parameter, a sampling range parameter, an attenuation parameter, a high-frequency boost parameter, a windowing type parameter, and the like.
A spectrogram represents how the voice spectrum varies with time: the vertical axis is frequency, the horizontal axis is time position, and the strength of any given frequency component at a given moment is represented by the gray scale or shade of the corresponding point. A darker point indicates stronger speech energy at that point, while a lighter point indicates weaker energy. Spectrograms can be divided into narrowband and wideband spectrograms. A narrowband spectrogram clearly shows the harmonic structure and reflects the time-varying process of the fundamental frequency; a wideband spectrogram clearly shows the formant structure and the spectral envelope and reflects the rapid time variation of the spectrum, so a higher time resolution is obtained on the wideband spectrogram.
In this embodiment of the application, the first spectrogram may be a wideband spectrogram; it describes the frequency-domain features of each speech frame, such as frequency and speech energy, in chronological order, that is, the first spectrogram is related to time.
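As a rough illustration of this generation step, the following Python/numpy sketch carries out the chain described above (normalization, first-order pre-emphasis, framing, Hamming windowing, per-frame FFT energies). It is not the patent's Matlab configuration, and the frame length, hop and FFT size are illustrative assumptions chosen to give a wideband-style spectrogram.

```python
import numpy as np

def wideband_spectrogram(x, sr, frame_ms=5.0, hop_ms=1.0, alpha=0.97, n_fft=512):
    # Normalize to reduce level differences between voice fragments.
    x = np.asarray(x, dtype=float)
    x = x / (np.max(np.abs(x)) + 1e-12)
    # First-order high-pass pre-emphasis to boost high-frequency components.
    x = np.append(x[0], x[1:] - alpha * x[:-1])
    # Framing and Hamming windowing.
    frame_len = max(1, int(sr * frame_ms / 1000))
    hop = max(1, int(sr * hop_ms / 1000))
    win = np.hamming(frame_len)
    n_frames = 1 + max(0, len(x) - frame_len) // hop
    spec = np.empty((n_fft // 2 + 1, n_frames))
    for i in range(n_frames):
        frame = x[i * hop:i * hop + frame_len]
        frame = np.pad(frame, (0, frame_len - len(frame))) * win
        # Per-frame spectrum: frequency on the vertical axis, one column per frame.
        spec[:, i] = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    return 10 * np.log10(spec + 1e-12)  # speech energy in dB per frequency bin and time position
```

Short frames of a few milliseconds are what give the wideband spectrogram its high time resolution, consistent with the description above.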
Step S130: and generating a second spectrogram corresponding to the second sound segment.
In this embodiment of the present application, the obtained second segment may be preprocessed, and Matlab parameter configuration may be performed on the preprocessed second segment, so as to generate a corresponding second spectrogram. The process of preprocessing the second sound segment and configuring Matlab parameters is the same as the process of preprocessing the first sound segment and configuring Matlab parameters, and is not described herein again.
The second spectrogram may be a wideband spectrogram, and the second spectrogram describes frequency domain features of each speech frame, such as frequency and speech energy, according to a time sequence, that is, the second spectrogram is related to time.
Step S140: and aligning according to the first spectrogram and the second spectrogram to generate a target spectrogram.
In the embodiment of the present application, because different segments have different durations (i.e., different time information), their spectrograms have different widths at the same time resolution. The first spectrogram and the second spectrogram can therefore be aligned to generate a target spectrogram, where the target spectrogram comprises the first spectrogram and an alignment spectrogram corresponding to the second spectrogram, and the time position of the alignment spectrogram is aligned with the time position of the first spectrogram. In this way, the target spectrogram containing the first spectrogram of the first sound segment and the alignment spectrogram of the second sound segment is generated automatically from the first sound segment corresponding to the comparison voice and the second sound segment corresponding to the target voice; the first spectrogram and the second spectrogram do not need to be aligned manually, and the efficiency of voice identity identification is improved.
In some embodiments, the first spectrogram may be used as a reference map, the second spectrogram may be aligned with the first spectrogram to obtain an alignment spectrogram, and the first spectrogram and the alignment spectrogram are aligned in the same canvas to generate a target spectrogram. The canvas is a layer for displaying a spectrogram.
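A minimal sketch of composing such a target spectrogram in one canvas is given below. Placing the two spectrograms one above the other with their time axes (columns) aligned is an assumed layout, since the embodiment only specifies that they are aligned in the same canvas.

```python
import numpy as np

def target_spectrogram(first_spec, aligned_spec, gap_px=4):
    # Both spectrograms must already share the same number of time columns.
    assert first_spec.shape[1] == aligned_spec.shape[1], "time positions must already be aligned"
    # Thin separator row; the stacked layout itself is an assumption of this sketch.
    gap = np.full((gap_px, first_spec.shape[1]), first_spec.max(), dtype=first_spec.dtype)
    return np.vstack([first_spec, gap, aligned_spec])  # reference on top, alignment spectrogram below
```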
As an implementation, the first spectrogram may be used as a reference: the ratio between the time position of the second spectrogram and the time position of the first spectrogram is calculated to obtain a time position ratio, and the second spectrogram is scaled according to this ratio to obtain a corresponding scaled spectrogram, which is the alignment spectrogram. The first spectrogram and the alignment spectrogram may then be aligned in the same canvas to generate the target spectrogram.
It can be understood that when the time position ratio is greater than 1, the time position of the second spectrogram is greater than the time position of the first spectrogram, and the second spectrogram is reduced according to the time position ratio to obtain a corresponding reduced spectrogram, which is the alignment spectrogram. When the time position proportion is equal to 1, the time position of the second spectrogram is equal to the time position of the first spectrogram, the second spectrogram is not required to be scaled, and the second spectrogram is the alignment spectrogram. And when the time position proportion is less than 1, the time position of the second spectrogram is less than that of the first spectrogram, and the second spectrogram is amplified according to the time position proportion to obtain a corresponding amplified spectrogram, namely the alignment spectrogram.
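A simple sketch of this ratio-based scaling follows, assuming each spectrogram is a 2-D array with one column per time position. Nearest-neighbour column resampling stands in here for the image scaling; the next embodiment replaces such scaling with FFT re-framing to avoid mosaic artifacts.

```python
import numpy as np

def align_by_time_ratio(spec_ref, spec_other):
    # Time position ratio between the second spectrogram and the reference (first) spectrogram.
    ratio = spec_other.shape[1] / spec_ref.shape[1]
    if ratio == 1.0:
        return spec_other  # ratio == 1: no scaling needed
    # ratio > 1: shrink the time axis; ratio < 1: stretch it, by resampling columns.
    cols = np.round(np.linspace(0, spec_other.shape[1] - 1, spec_ref.shape[1])).astype(int)
    return spec_other[:, cols]  # same width as the reference, i.e. the alignment spectrogram
```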
As an embodiment, since the number of pixels of a spectrogram displayed in a given image area is fixed, directly scaling the image of the second spectrogram to obtain the alignment spectrogram produces mosaic artifacts, which degrade the image quality of the alignment spectrogram. Instead, the second spectrogram can be analyzed to obtain its horizontal pixels, which correspond to the time positions of the second spectrogram; a target Fourier transform spectrum corresponding to these horizontal pixels is calculated according to a preset rule such that the time positions of the target Fourier transform spectrum are aligned with those of the first spectrogram; and the corresponding alignment spectrogram is then generated from the target Fourier transform spectrum.
The preset rule can be a frame extraction calculation rule for the horizontal pixels: frame extraction calculation is performed on the horizontal pixels of the second spectrogram to obtain the corresponding target Fourier transform spectrum. This ensures that the horizontal pixels of the generated alignment spectrogram always remain consistent with those of the second spectrogram, avoids mosaic artifacts in the generated alignment spectrogram, and guarantees its image quality.
As an example, suppose the vertical height of the second spectrogram is x pixels and its horizontal width is y pixels. According to the frame extraction calculation rule for the horizontal pixels, y frames are uniformly extracted from the segment data corresponding to the second spectrogram according to a frame length parameter, an N-point target Fourier transform spectrum is calculated for each frame of the voice signal, and a spectrogram energy data matrix of size y × (N/2 + 1) is obtained. The energy values in the spectrogram energy data matrix are then mapped to gray values of an image to obtain a mapping spectrogram, namely the alignment spectrogram.
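The following Python sketch mirrors this example under stated assumptions (the frame length and N are illustrative, and a Hamming window is assumed): it uniformly extracts y frames from the segment data, takes an N-point FFT of each to build the y × (N/2 + 1) energy matrix, and maps the energy values to gray values.

```python
import numpy as np

def aligned_spectrogram(samples, target_width_px, n_fft=512, frame_len=256):
    samples = np.asarray(samples, dtype=float)
    y = int(target_width_px)
    # Uniformly extract y frames from the segment data (one per horizontal pixel).
    starts = np.linspace(0, max(0, len(samples) - frame_len), y).astype(int)
    win = np.hamming(frame_len)
    energy = np.empty((y, n_fft // 2 + 1))  # spectrogram energy data matrix of size y x (N/2 + 1)
    for i, s in enumerate(starts):
        frame = samples[s:s + frame_len]
        frame = np.pad(frame, (0, frame_len - len(frame))) * win
        energy[i] = np.abs(np.fft.rfft(frame, n_fft)) ** 2  # N-point target Fourier transform spectrum
    db = 10 * np.log10(energy + 1e-12)
    # Map energy values to gray values of an image (0..255) to obtain the mapping spectrogram.
    gray = 255 * (db - db.min()) / (db.max() - db.min() + 1e-12)
    return gray.astype(np.uint8).T  # frequency on the vertical axis, one column per horizontal pixel
```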
According to the scheme, a first sound segment corresponding to the comparison voice and a second sound segment corresponding to the target voice are obtained, a first spectrogram corresponding to the first sound segment and a second spectrogram corresponding to the second sound segment are generated, and alignment is performed according to the first spectrogram and the second spectrogram to generate a target spectrogram. The target spectrogram comprises the first spectrogram and an alignment spectrogram corresponding to the second spectrogram, and the time position of the alignment spectrogram is aligned with the time position of the first spectrogram. In this way, the target spectrogram containing the first spectrogram of the first sound segment and the alignment spectrogram of the second sound segment is generated automatically from the first sound segment corresponding to the comparison voice and the second sound segment corresponding to the target voice; the first spectrogram and the second spectrogram do not need to be aligned manually, and the efficiency of voice identity identification is improved.
Referring to fig. 2, a flowchart of a speech processing method according to another embodiment of the present application is shown. In a specific embodiment, the voice processing method may be executed by an electronic device with processing capability, such as a terminal device like a desktop computer or a notebook computer, and may also be executed interactively by a processing system including a server and a terminal. As shown in fig. 2, the voice processing method may include steps S210 to S260.
Step S210: and acquiring a first sound segment corresponding to the comparison voice and a second sound segment corresponding to the target voice.
Step S220: and generating a first spectrogram corresponding to the first sound segment.
In this embodiment, step S210 and step S220 may refer to the content of the corresponding steps in the foregoing embodiments, and are not described herein again.
Step S230: and determining whether to adjust time information corresponding to the segment data of the second segment.
In this embodiment, the time information in a segment obtained by performing ASR on speech contains errors, so the start and end times of the segment may deviate from the image in the spectrogram. The segment data of the second segment may therefore be analyzed to obtain an analysis result, and whether to adjust the time information corresponding to the segment data of the second segment is determined according to the analysis result, where the segment data may include phoneme data, syllable data, and the like.
In some embodiments, the analysis result may be a time matching degree, the time information corresponding to the segment data of the second segment may be matched with preset time information to obtain a time matching degree, and whether to adjust the time information corresponding to the segment data of the second segment is determined according to the time matching degree. The preset time information is the actual time information corresponding to the actual voice data of the second voice segment.
When the time matching degree is greater than or equal to the time matching degree threshold value, determining to adjust time information corresponding to the segment data of the second segment; and when the time matching degree is smaller than the time matching degree threshold value, determining not to adjust the time information corresponding to the sound segment data of the second sound segment.
Step S240: and when the time information corresponding to the segment data of the second segment is determined to be adjusted, adjusting the time information corresponding to the segment data of the second segment to obtain an adjusted segment.
In this embodiment, when it is determined to adjust the time information corresponding to the segment data of the second segment, that time information may be adjusted to obtain an adjusted segment whose segment data has time information matching the preset time information. The time information of the second segment is thereby corrected, which can improve the accuracy of voice identity identification.
In some embodiments, when it is determined to adjust time information corresponding to segment data of the second segment, the second segment may be input to a deep learning network model trained in advance, and the deep learning network model may be configured to adjust the second segment to an adjusted segment whose time information corresponding to the segment data matches preset time information, and receive the adjusted segment output by the deep learning network model.
The deep learning network model may be a Convolutional Neural Network (CNN) model, a Deep Belief Network (DBN) model, a Stacked Auto-Encoder (SAE) network model, a Recurrent Neural Network (RNN) model, a Deep Neural Network (DNN) model, a Long Short-Term Memory (LSTM) network model, a Gated Recurrent Unit (GRU) model, or the like. The type of the deep learning network model is not limited here and may be set according to actual requirements.
Step S250: and generating a spectrogram corresponding to the adjusted sound segment, and taking the spectrogram as a second spectrogram corresponding to the second sound segment.
In this embodiment, the adjusted segment may be preprocessed, Matlab parameter configuration may be performed on the preprocessed adjusted segment, a spectrogram corresponding to the adjusted segment is generated, and this spectrogram is used as the second spectrogram corresponding to the second segment. The process of preprocessing the adjusted segment and configuring the Matlab parameters is the same as that for the first segment and the second segment, and is not described here again.
Step S260: and aligning according to the first spectrogram and the second spectrogram to generate a target spectrogram.
In this embodiment, the step S260 may refer to the content of the corresponding step in the foregoing embodiments, and is not described herein again.
According to the scheme provided by this embodiment, a first sound segment corresponding to the comparison voice and a second sound segment corresponding to the target voice are obtained, and a first spectrogram corresponding to the first sound segment is generated. Whether to adjust the time information corresponding to the segment data of the second sound segment is then determined; when it is determined to adjust it, the time information is adjusted to obtain an adjusted segment, and the spectrogram corresponding to the adjusted segment is generated and used as the second spectrogram corresponding to the second sound segment. Alignment is performed according to the first spectrogram and the second spectrogram to generate a target spectrogram containing the first spectrogram of the first sound segment and the alignment spectrogram of the second sound segment, so the two spectrograms do not need to be aligned manually, and the efficiency of voice identity identification is improved.
Further, when the time information corresponding to the segment data of the second segment is determined to be adjusted, the time information corresponding to the segment data of the second segment is adjusted to obtain the adjusted segment, so that the time information of the second segment is corrected, and the identification accuracy of the voice identity identification is improved.
Referring to fig. 3, a flowchart of a speech processing method according to still another embodiment of the present application is shown. In a specific embodiment, the voice processing method may be executed by an electronic device with processing capability, such as a terminal device like a desktop computer or a notebook computer, or may be executed interactively by a processing system including a server and a terminal. As shown in fig. 3, the voice processing method may include steps S310 to S370.
Step S310: and acquiring a first sound segment corresponding to the comparison voice and a second sound segment corresponding to the target voice.
Step S320: and generating a first spectrogram corresponding to the first sound segment.
Step S330: and generating a second spectrogram corresponding to the second sound segment.
Step S340: and aligning according to the first spectrogram and the second spectrogram to generate a target spectrogram.
In this embodiment, step S310, step S320, step S330, and step S340 may refer to the content of the corresponding steps in the foregoing embodiments, and are not described herein again.
Step S350: and acquiring a first low-frequency formant of the first spectrogram and a second low-frequency formant of the alignment spectrogram corresponding to the second spectrogram.
In this embodiment, consistency of sound quality is reflected very clearly in the spectrograms of segments: for segments with the same sound quality, the trends of the low-frequency formants and of their center frequencies between the start and end times are basically consistent. Therefore, in voice identity identification, the two segments used for identification are generally two segments with the same sound quality.
The low-frequency formants of the voice frames corresponding to the first spectrogram can be automatically calculated through a voice signal processing algorithm to obtain first low-frequency formants, and the low-frequency formants of the voice frames corresponding to the aligned spectrogram can be automatically calculated through the voice signal processing algorithm to obtain second low-frequency formants. The speech signal processing algorithm may be an autocorrelation algorithm, a cepstrum algorithm, or a Linear Predictive Coding (LPC) algorithm, for example.
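A compact sketch of one such algorithm, LPC analysis via the autocorrelation (Levinson-Durbin) method, is shown below for a single windowed speech frame. The prediction order, the 90 Hz floor and the number of formants returned are illustrative assumptions rather than values from the patent.

```python
import numpy as np

def low_frequency_formants(frame, sr, order=12, n_keep=3):
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation of the windowed frame up to the LPC order.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    # Levinson-Durbin recursion for the LPC coefficients.
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0] + 1e-12
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        e *= (1.0 - k * k)
    # Formant frequencies from the angles of the prediction-polynomial roots.
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 1e-3]  # keep one root of each complex-conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    return freqs[freqs > 90.0][:n_keep]  # the lowest (low-frequency) formants
```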
Step S360: respectively intercepting a first target spectrogram corresponding to the first low-frequency formant on the first spectrogram and a second target spectrogram corresponding to the second low-frequency formant on the alignment spectrogram.
In this embodiment, the spectrogram corresponding to the first low-frequency formant may be intercepted from the first spectrogram to obtain a first target spectrogram, and the spectrogram corresponding to the second low-frequency formant may be intercepted from the alignment spectrogram to obtain a second target spectrogram. The first target segment corresponding to the first target spectrogram and the second target segment corresponding to the second target spectrogram have matched sound quality.
Step S370: and determining the identity identification result of the comparison voice and the target voice according to the similarity of the spectrogram of the first target spectrogram and the spectrogram of the second target spectrogram.
In this embodiment, the spectrogram similarity between the first target spectrogram and the second target spectrogram may be obtained, and the identity identification result of the comparison voice and the target voice may be determined according to the spectrogram similarity. When the spectrogram similarity is greater than or equal to the spectrogram similarity threshold, it is determined that the comparison voice and the target voice are voices of the same person; when the spectrogram similarity is smaller than the spectrogram similarity threshold, it is determined that they are not voices of the same person. In this way, the identity identification result of the comparison voice and the target voice is determined automatically according to the first spectrogram of the comparison voice and the alignment spectrogram of the target voice; they do not need to be compared and judged manually, and the efficiency of voice identity identification is improved.
In some embodiments, the spectrogram similarity may be a first spectrogram similarity. According to a first preset algorithm rule, the first spectrogram similarity between the first target spectrogram and the second target spectrogram is calculated, and the identity identification result of the comparison voice and the target voice is determined according to the first spectrogram similarity.
If the first spectrogram similarity is larger than or equal to a spectrogram similarity threshold, determining that the comparison voice and the target voice are voices of the same person; and if the similarity of the first spectrogram is smaller than the similarity threshold of the spectrogram, determining that the comparison voice and the target voice are not voices of the same person.
The first preset algorithm rule may at least include any one of a histogram algorithm, a gray scale distribution algorithm, an image template matching algorithm, a structural similarity algorithm, a peak signal-to-noise ratio algorithm, a perceptual hash algorithm, and the like.
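As an illustration of such a first preset algorithm rule, the sketch below uses one of the listed options, a gray-level histogram comparison by correlation, together with the threshold decision of step S370. The bin count and the 0.8 threshold are illustrative assumptions, not values from the patent.

```python
import numpy as np

def spectrogram_similarity(img_a, img_b, bins=64):
    # Gray-level histograms of the two target spectrograms, normalized as densities.
    hist_a, _ = np.histogram(img_a, bins=bins, range=(0, 255), density=True)
    hist_b, _ = np.histogram(img_b, bins=bins, range=(0, 255), density=True)
    a = hist_a - hist_a.mean()
    b = hist_b - hist_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12
    return float((a * b).sum() / denom)  # correlation in [-1, 1]; higher means more similar

def same_speaker(img_a, img_b, threshold=0.8):
    # Decision rule of step S370: same person iff similarity >= spectrogram similarity threshold.
    return spectrogram_similarity(img_a, img_b) >= threshold
```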
In some embodiments, the spectrogram similarity can be a second spectrogram similarity. The first target spectrogram and the second target spectrogram can be input into a pre-trained spectrogram similarity detection model, the second spectrogram similarity output by the spectrogram similarity detection model is received, and the identity identification result of the comparison voice and the target voice is determined according to the second spectrogram similarity.
If the second spectrogram similarity is larger than or equal to the spectrogram similarity threshold, determining that the comparison voice and the target voice are voices of the same person; and if the second spectrogram similarity is smaller than the spectrogram similarity threshold, determining that the comparison voice and the target voice are not voices of the same person.
The spectrogram similarity detection model can be used to detect the second spectrogram similarity between the first target spectrogram and the second target spectrogram. The spectrogram similarity detection model may be a Convolutional Neural Network (CNN) model, a Deep Belief Network (DBN) model, a Stacked Auto-Encoder (SAE) network model, a Recurrent Neural Network (RNN) model, a Deep Neural Network (DNN) model, a Long Short-Term Memory (LSTM) network model, a Gated Recurrent Unit (GRU) model, or the like. The type of the spectrogram similarity detection model is not limited here and may be set according to actual requirements.
According to the scheme provided by this embodiment, a first sound segment corresponding to the comparison voice and a second sound segment corresponding to the target voice are obtained, a first spectrogram corresponding to the first sound segment and a second spectrogram corresponding to the second sound segment are generated, and alignment is performed according to the first spectrogram and the second spectrogram to generate a target spectrogram. A first low-frequency formant of the first spectrogram and a second low-frequency formant of the alignment spectrogram are obtained, a first target spectrogram corresponding to the first low-frequency formant is intercepted from the first spectrogram, a second target spectrogram corresponding to the second low-frequency formant is intercepted from the alignment spectrogram, and the identity identification result of the comparison voice and the target voice is determined according to the spectrogram similarity between the first target spectrogram and the second target spectrogram. In this way, the target spectrogram containing the first spectrogram of the first sound segment and the alignment spectrogram of the second sound segment is generated automatically, the two spectrograms do not need to be aligned manually, and the efficiency of voice identity identification is improved.
Furthermore, the identity identification result of the comparison voice and the target voice is determined according to the spectrogram similarity between the first target spectrogram corresponding to the comparison voice and the second target spectrogram corresponding to the target voice. The identification result is thus determined automatically from the first spectrogram of the comparison voice and the alignment spectrogram of the target voice; they do not need to be compared and judged manually, and the efficiency of voice identity identification is improved.
Referring to fig. 4, which illustrates a speech processing apparatus 400 according to an embodiment of the present application, as shown in fig. 4, the speech processing apparatus 400 may include an obtaining module 410, a first generating module 420, a second generating module 430, and a third generating module 440.
The obtaining module 410 may be configured to obtain a first sound segment corresponding to the comparison voice and a second sound segment corresponding to the target voice; the first generating module 420 may be configured to generate a first spectrogram corresponding to a first segment; the second generating module 430 may be configured to generate a second spectrogram corresponding to a second sound segment; the third generating module 440 may be configured to align according to the first spectrogram and the second spectrogram, and generate a target spectrogram, where the target spectrogram includes an alignment spectrogram corresponding to the first spectrogram and the second spectrogram, and a boundary of the alignment spectrogram is aligned with a boundary of the first spectrogram.
In some embodiments, the speech processing apparatus 400 may further include a first determination module and a second determination module.
The first determining module may be configured to determine whether to adjust time information corresponding to segment data of the second segment before the second generating module 430 generates the second spectrogram corresponding to the second segment, where the segment data includes phoneme data and syllable data; the second determining module may be configured to, when it is determined to adjust time information corresponding to the segment data of the second segment, adjust the time information corresponding to the segment data of the second segment to obtain an adjusted segment, where the time information corresponding to the segment data of the adjusted segment is matched with the preset time information.
In some implementations, the second generation module 430 can include a first generation unit.
The first generating unit may be configured to generate a spectrogram corresponding to the adjusted segment, and use the spectrogram as the second spectrogram corresponding to the second segment.
In some embodiments, the first determination module may include a matching unit, a first determination unit, a second determination unit, and a third determination unit.
The matching unit may be configured to match time information corresponding to the segment data of the second segment with preset time information to obtain a time matching degree; the first determining unit may be configured to determine whether to adjust time information corresponding to the segment data of the second segment according to the time matching degree; the second determining unit may be configured to determine, when the time matching degree is greater than or equal to the time matching degree threshold, to adjust time information corresponding to the segment data of the second segment; the third determining unit may be configured to determine not to adjust the time information corresponding to the segment data of the second segment when the time matching degree is smaller than the time matching degree threshold.
In some embodiments, the second determination module may include an input unit and a receiving unit.
The input unit may be configured to input the second segment into a deep learning network model trained in advance when it is determined to adjust time information corresponding to segment data of the second segment, where the deep learning network model is configured to adjust the second segment into an adjusted segment in which the time information corresponding to the segment data matches preset time information; the receiving unit may be configured to receive the adjusted segment output by the deep learning network model.
In some embodiments, the third generation module 440 may include an alignment unit and a second generation unit.
The alignment unit may be configured to align the second spectrogram with the first spectrogram to obtain an aligned spectrogram; the second generating unit may be configured to generate a target spectrogram according to the first spectrogram and the alignment spectrogram.
In some embodiments, the alignment unit may include an acquisition subunit, a calculation subunit, and a generation subunit.
The obtaining subunit is configured to obtain a horizontal pixel corresponding to the second spectrogram, where the horizontal pixel corresponds to a time position of the second spectrogram; the calculation subunit is configured to calculate a target fourier transform spectrum corresponding to the horizontal pixel according to a preset rule, and a time position corresponding to the target fourier transform spectrum is aligned with a time position of the first spectrogram; the generating subunit may be configured to generate a corresponding alignment spectrogram according to the target fourier transform spectrum.
In some embodiments, the preset rule may be a frame extraction calculation rule, and the calculation subunit may include a calculation sub-subunit.
The calculation sub-subunit can be used to calculate the target Fourier transform spectrum corresponding to the horizontal pixels according to the frame extraction calculation rule and obtain a corresponding spectrogram energy data matrix.
In some embodiments, the generating subunit may include a mapping subunit.
The mapping subunit may be configured to map the energy values in the spectrogram energy data matrix to gray values of an image to obtain a mapping spectrogram, and to use the mapping spectrogram as the corresponding alignment spectrogram.
According to the scheme, a first sound segment corresponding to the comparison voice and a second sound segment corresponding to the target voice are obtained, a first spectrogram corresponding to the first sound segment and a second spectrogram corresponding to the second sound segment are generated, and alignment is performed according to the first spectrogram and the second spectrogram to generate a target spectrogram. The target spectrogram comprises the first spectrogram and an alignment spectrogram corresponding to the second spectrogram, and the time position of the alignment spectrogram is aligned with the time position of the first spectrogram. In this way, the target spectrogram containing the first spectrogram of the first sound segment and the alignment spectrogram of the second sound segment is generated automatically from the first sound segment corresponding to the comparison voice and the second sound segment corresponding to the target voice; the first spectrogram and the second spectrogram do not need to be aligned manually, and the efficiency of voice identity identification is improved.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment. For any processing manner described in the method embodiment, all the processing manners may be implemented by corresponding processing modules in the apparatus embodiment, and details in the apparatus embodiment are not described again.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Referring to fig. 5, which shows a functional block diagram of an electronic device 500 provided in an embodiment of the present application, the electronic device 500 may include one or more of the following components: memory 510, processor 520, and one or more applications, wherein the one or more applications may be stored in the memory 510 and configured to be executed by the one or more processors 520, the one or more applications configured to perform a method as described in the aforementioned method embodiments.
The Memory 510 may include a Random Access Memory (RAM) or a Read-Only Memory (Read-Only Memory). The memory 510 may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory 510 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (e.g., obtaining a first sound segment, obtaining a second sound segment, generating a first spectrogram, generating a second spectrogram, aligning the first spectrogram and the second spectrogram, generating a target spectrogram, determining whether to adjust time information, determining adjust time information, adjusting time information, obtaining an adjusted sound segment, generating a spectrogram corresponding to an adjusted sound segment, matching time information and preset time information of the second sound segment, inputting the second sound segment to a pre-trained deep learning network model, receiving an adjusted sound segment, obtaining an aligned spectrogram, obtaining transverse pixels, calculating a target fourier transform spectrum, obtaining a spectrogram energy matrix, and mapping energy values in the spectrogram energy data matrix, etc.), instructions for implementing various method embodiments described below, and the like. The stored data area may also store data created by the electronic device 500 during use (such as comparison speech, first speech segment, target speech, second speech segment, first speech spectrogram, second speech spectrogram, target speech spectrogram, alignment speech spectrogram, time position, speech segment data, time information, phoneme data, syllable data, adjusted speech segment, preset time information, time matching degree threshold, pre-trained deep learning network model, alignment speech spectrogram, horizontal pixel, preset rule, target fourier transform spectrum, frame extraction calculation rule, speech spectrogram energy data matrix, energy value, image, gray value, and mapping speech spectrogram), and the like.
Processor 520 may include one or more processing cores. The processor 520, using various interfaces and connections throughout the electronic device 500, performs various functions and processes data for the electronic device 500 by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 510, and invoking data stored in the memory 510. Alternatively, the processor 520 may be implemented in at least one hardware form of Digital Signal Processing (DSP), field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 520 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is used for rendering and drawing display content; the modem is used to handle wireless communications. It is understood that the modem may not be integrated into the processor 520, but may be implemented by a communication chip.
Referring to fig. 6, a block diagram of a computer-readable storage medium provided in an embodiment of the present application is shown. The computer-readable storage medium 600 has program code 610 stored therein, and the program code 610 can be called by the processor to execute the method described in the above method embodiments.
The computer-readable storage medium 600 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 600 includes a non-volatile computer-readable storage medium. The computer readable storage medium 600 has storage space for program code 610 for performing any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 610 may be compressed, for example, in a suitable form.
According to the scheme, a first sound segment corresponding to the comparison voice and a second sound segment corresponding to the target voice are obtained, a first spectrogram corresponding to the first sound segment and a second spectrogram corresponding to the second sound segment are generated, and alignment is performed according to the first spectrogram and the second spectrogram to generate a target spectrogram. The target spectrogram comprises the first spectrogram and an alignment spectrogram corresponding to the second spectrogram, and the time position of the alignment spectrogram is aligned with the time position of the first spectrogram. In this way, the target spectrogram containing the first spectrogram of the first sound segment and the alignment spectrogram of the second sound segment is generated automatically from the first sound segment corresponding to the comparison voice and the second sound segment corresponding to the target voice; the first spectrogram and the second spectrogram do not need to be aligned manually, and the efficiency of voice identity identification is improved.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method of speech processing, comprising:
acquiring a first sound segment corresponding to a comparison voice and a second sound segment corresponding to a target voice;
generating a first spectrogram corresponding to the first sound segment;
generating a second spectrogram corresponding to the second sound segment;
and aligning according to the first spectrogram and the second spectrogram to generate a target spectrogram, wherein the target spectrogram comprises the first spectrogram and an alignment spectrogram corresponding to the second spectrogram, and the time position of the alignment spectrogram is aligned with the time position of the first spectrogram.
2. The speech processing method according to claim 1, wherein before generating the second spectrogram corresponding to the second sound segment, the method further comprises:
determining whether to adjust time information corresponding to segment data of the second sound segment, wherein the segment data comprises phoneme data and syllable data;
when it is determined to adjust the time information corresponding to the segment data of the second sound segment, adjusting the time information corresponding to the segment data of the second sound segment to obtain an adjusted sound segment, wherein the time information corresponding to the segment data of the adjusted sound segment matches preset time information;
the generating the second spectrogram corresponding to the second sound segment comprises:
and generating a spectrogram corresponding to the adjusted sound segment, and using the spectrogram as the second spectrogram corresponding to the second sound segment.
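As an editorial illustration of claim 2, the sketch below adjusts the time information of segment data (here, phoneme spans) so that each phoneme's duration matches a preset duration. The function name, the span representation, and the naive resampling are assumptions; a real system would more likely use a proper time-scale modification.

```python
# Hypothetical sketch: stretch or shrink each phoneme so its duration matches
# the preset duration; the adjusted sound segment is the concatenation.
import numpy as np

def adjust_segment(samples, phoneme_spans, preset_durations, sr=16000):
    """phoneme_spans: list of (start_sec, end_sec); preset_durations: seconds."""
    adjusted = []
    for (start, end), target_dur in zip(phoneme_spans, preset_durations):
        piece = samples[int(start * sr):int(end * sr)]
        target_len = max(int(target_dur * sr), 1)
        # Naive linear resampling as a stand-in for time-scale modification.
        idx = np.linspace(0, len(piece) - 1, target_len)
        adjusted.append(np.interp(idx, np.arange(len(piece)), piece))
    return np.concatenate(adjusted)
```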
3. The speech processing method according to claim 2, wherein the determining whether to adjust the time information corresponding to the segment data of the second sound segment comprises:
matching the time information corresponding to the segment data of the second sound segment with the preset time information to obtain a time matching degree;
determining, according to the time matching degree, whether to adjust the time information corresponding to the segment data of the second sound segment;
when the time matching degree is greater than or equal to a time matching degree threshold, determining to adjust the time information corresponding to the segment data of the second sound segment;
and when the time matching degree is smaller than the time matching degree threshold, determining not to adjust the time information corresponding to the segment data of the second sound segment.
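The following sketch shows one possible reading of the matching step in claim 3: per-phoneme durations are compared with the preset durations and converted into a single matching degree, which is then compared with the threshold. The formula and the default threshold value are assumptions and are not specified by the application; the decision direction (adjust when the degree is greater than or equal to the threshold) follows the claim as written.

```python
# Illustrative matching degree computation and threshold decision (assumed formula).
import numpy as np

def time_matching_degree(durations, preset_durations):
    durations = np.asarray(durations, dtype=float)
    preset = np.asarray(preset_durations, dtype=float)
    # 1.0 when every duration equals its preset value, smaller otherwise.
    return float(np.mean(1.0 - np.abs(durations - preset) / np.maximum(preset, 1e-6)))

def should_adjust(durations, preset_durations, threshold=0.8):
    # Claim 3: adjust when the matching degree is greater than or equal to the threshold.
    return time_matching_degree(durations, preset_durations) >= threshold
```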
4. The speech processing method according to claim 2, wherein when it is determined to adjust the time information corresponding to the segment data of the second sound segment, the adjusting the time information corresponding to the segment data of the second sound segment to obtain an adjusted sound segment comprises:
when it is determined to adjust the time information corresponding to the segment data of the second sound segment, inputting the second sound segment into a pre-trained deep learning network model, wherein the deep learning network model is used for adjusting the second sound segment into the adjusted sound segment, and the time information corresponding to the segment data of the adjusted sound segment matches the preset time information;
and receiving the adjusted sound segment output by the deep learning network model.
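Purely to show the interface implied by claim 4 (waveform in, adjusted waveform of a preset length out), here is a minimal PyTorch stand-in for the pre-trained deep learning network model. The architecture is an assumption chosen for brevity and does not reflect the application's actual model.

```python
# Hypothetical adjuster network: encode the second sound segment, resample the
# feature sequence to the preset length, and decode back to a waveform.
import torch
import torch.nn as nn

class SegmentAdjuster(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, 1)

    def forward(self, segment, target_len):
        # segment: (batch, samples) waveform; target_len: preset number of samples.
        feats, _ = self.encoder(segment.unsqueeze(-1))              # (batch, T, hidden)
        feats = nn.functional.interpolate(feats.transpose(1, 2),    # stretch or shrink
                                          size=target_len)          # the time axis
        return self.decoder(feats.transpose(1, 2)).squeeze(-1)      # (batch, target_len)

# Usage sketch: map a 1 s segment at 16 kHz onto a preset length of 0.75 s.
# adjusted = SegmentAdjuster()(torch.randn(1, 16000), target_len=12000)
```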
5. The speech processing method according to claim 1, wherein the aligning according to the first spectrogram and the second spectrogram to generate a target spectrogram comprises:
aligning the second spectrogram with the first spectrogram to obtain the alignment spectrogram;
and generating the target spectrogram according to the first spectrogram and the alignment spectrogram.
6. The speech processing method according to claim 5, wherein the aligning the second spectrogram with the first spectrogram to obtain the alignment spectrogram comprises:
acquiring a transverse pixel corresponding to the second spectrogram, wherein the transverse pixel corresponds to the time position of the second spectrogram;
according to a preset rule, calculating a target Fourier transform spectrum corresponding to the transverse pixel, wherein the time position corresponding to the target Fourier transform spectrum is aligned with the time position of the first spectrogram;
and generating the corresponding alignment spectrogram according to the target Fourier transform spectrum.
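One possible reading of claim 6 is sketched below: for every time position of the first spectrogram, a frame of the second sound segment is chosen and its Fourier transform recomputed, so the resulting columns (transverse pixels) line up with the first spectrogram. The even-spacing mapping is an assumed stand-in for the claim's preset rule.

```python
# Illustrative computation of a target Fourier transform spectrum whose columns
# are aligned with the time positions of the first spectrogram (assumed rule).
import numpy as np

def target_fourier_spectrum(second_segment, n_columns_first, n_fft=512):
    # Spread n_columns_first frame positions evenly over the second segment
    # (assumes the segment is at least one frame long).
    starts = np.linspace(0, len(second_segment) - n_fft, n_columns_first).astype(int)
    window = np.hanning(n_fft)
    columns = [np.abs(np.fft.rfft(second_segment[s:s + n_fft] * window, n_fft))
               for s in starts]
    return np.stack(columns, axis=1)   # (freq_bins, n_columns_first) energy matrix
```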
7. The speech processing method according to claim 6, wherein the preset rule is a frame extraction calculation rule, and the calculating, according to the preset rule, a target Fourier transform spectrum corresponding to the transverse pixel comprises:
calculating the target Fourier transform spectrum corresponding to the transverse pixel according to the frame extraction calculation rule to obtain a corresponding spectrogram energy data matrix;
the generating the corresponding alignment spectrogram according to the target Fourier transform spectrum comprises:
and mapping the energy values in the spectrogram energy data matrix to gray values of an image to obtain a mapping spectrogram, and using the mapping spectrogram as the corresponding alignment spectrogram.
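The mapping step of claim 7 can be illustrated with a few lines of Python: energy values of the spectrogram energy data matrix are mapped to 8-bit gray values to obtain the mapping spectrogram. The log compression and min-max scaling are assumptions about the mapping, which the claim does not fix.

```python
# Hypothetical energy-to-gray mapping producing the mapping spectrogram image.
import numpy as np

def energy_to_gray_image(energy_matrix):
    log_energy = np.log1p(energy_matrix)                        # compress dynamic range
    lo, hi = log_energy.min(), log_energy.max()
    gray = (log_energy - lo) / max(hi - lo, 1e-12) * 255.0      # scale to [0, 255]
    return gray.astype(np.uint8)                                # gray-value image
```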
8. A speech processing apparatus, comprising:
the acquisition module is used for acquiring a first sound segment corresponding to the comparison voice and a second sound segment corresponding to the target voice;
the first generating module is used for generating a first spectrogram corresponding to the first sound segment;
the second generating module is used for generating a second spectrogram corresponding to the second sound segment;
and the third generating module is used for aligning according to the first spectrogram and the second spectrogram to generate a target spectrogram, wherein the target spectrogram comprises the first spectrogram and an alignment spectrogram corresponding to the second spectrogram, and the time position of the alignment spectrogram is aligned with the time position of the first spectrogram.
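Purely as an editorial illustration, the apparatus of claim 8 could be organized as a thin wrapper around spectrogram and alignment helpers like those sketched earlier; all class and method names here are hypothetical and do not reflect the application's implementation.

```python
# Hypothetical composition of the modules of claim 8.
class SpeechProcessingApparatus:
    def __init__(self, make_spectrogram, align_to_reference):
        self.make_spectrogram = make_spectrogram        # first / second generating modules
        self.align_to_reference = align_to_reference    # used by the third generating module

    def process(self, comparison_segment, target_segment):
        # Acquisition module: the two sound segments arrive as the arguments.
        first_spec = self.make_spectrogram(comparison_segment)   # first generating module
        second_spec = self.make_spectrogram(target_segment)      # second generating module
        aligned_spec = self.align_to_reference(second_spec, first_spec)
        return first_spec, aligned_spec                          # parts of the target spectrogram
```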
9. An electronic device, comprising:
a memory;
one or more processors coupled with the memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the speech processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a program code is stored, the program code being invokable by a processor to execute the speech processing method according to any of claims 1 to 7.
CN202210956020.0A 2022-08-10 2022-08-10 Voice processing method, processing device, electronic equipment and storage medium Pending CN115547339A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210956020.0A CN115547339A (en) 2022-08-10 2022-08-10 Voice processing method, processing device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210956020.0A CN115547339A (en) 2022-08-10 2022-08-10 Voice processing method, processing device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115547339A true CN115547339A (en) 2022-12-30

Family

ID=84724445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210956020.0A Pending CN115547339A (en) 2022-08-10 2022-08-10 Voice processing method, processing device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115547339A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862635A (en) * 2023-02-28 2023-03-28 北京海天瑞声科技股份有限公司 Data processing method, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11030998B2 (en) Acoustic model training method, speech recognition method, apparatus, device and medium
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
US11056097B2 (en) Method and system for generating advanced feature discrimination vectors for use in speech recognition
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
WO2020029404A1 (en) Speech processing method and device, computer device and readable storage medium
CN111145786A (en) Speech emotion recognition method and device, server and computer readable storage medium
US7233899B2 (en) Speech recognition system using normalized voiced segment spectrogram analysis
US11335324B2 (en) Synthesized data augmentation using voice conversion and speech recognition models
CN112397051B (en) Voice recognition method and device and terminal equipment
CN110648672A (en) Character image generation method, interaction method, device and terminal equipment
CN113744722A (en) Off-line speech recognition matching device and method for limited sentence library
CN111179910A (en) Speed of speech recognition method and apparatus, server, computer readable storage medium
US20220392485A1 (en) System and Method For Identifying Sentiment (Emotions) In A Speech Audio Input
CN112599148A (en) Voice recognition method and device
CN115547339A (en) Voice processing method, processing device, electronic equipment and storage medium
CN110718210B (en) English mispronunciation recognition method, device, medium and electronic equipment
CN112967711B (en) Spoken language pronunciation evaluation method, spoken language pronunciation evaluation system and storage medium for small languages
Nedjah et al. Automatic speech recognition of Portuguese phonemes using neural networks ensemble
CN113506586A (en) Method and system for recognizing emotion of user
Zolnay et al. Using multiple acoustic feature sets for speech recognition
CN112908360A (en) Online spoken language pronunciation evaluation method and device and storage medium
US20150256137A1 (en) Formant amplifier
CN111048068B (en) Voice wake-up method, device and system and electronic equipment
Vlaj et al. Voice activity detection algorithm using nonlinear spectral weights, hangover and hangbefore criteria
CN115547338A (en) Voice identity detection method, detection device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination