CN108847217A - Speech segmentation method and apparatus, computer device and storage medium - Google Patents
Speech segmentation method and apparatus, computer device and storage medium
- Publication number
- CN108847217A (application number CN201810548508.3A)
- Authority
- CN
- China
- Prior art keywords
- frame
- audio data
- mute
- voice
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
Abstract
The invention discloses a speech segmentation method and apparatus, a computer device, and a storage medium. The method includes: obtaining a voice file to be segmented and pre-processing it to obtain audio data; normalizing the audio data and dividing it into frames; computing the frame energy of every speech frame; if the frame energy of a speech frame is below a preset frame-energy threshold, marking that speech frame as a silent frame; if the number of consecutive silent frames is detected to exceed a preset silent-frame count threshold, marking those frames as a silent segment; determining the cut frames of the voice file from the silent segments, and cutting the voice file at the cut frames to obtain the target file. By using frame energy as the segmentation criterion, the technical solution of the present invention requires no manual intervention, has low complexity, and can accurately identify silence and pauses within sentences, so that the voice file is segmented accurately while segmentation efficiency is effectively improved.
Description
Technical field
The present invention relates to the field of speech processing, and more particularly to a speech segmentation method and apparatus, a computer device, and a storage medium.
Background technique
In the field of speech processing, segmenting a voice file is a key problem, because a long voice file consumes substantial system resources during speech recognition and conversion, and recognition accuracy suffers. After the voice file is split, the computation required for speech recognition is reduced and the recognition accuracy of the speech recognition system improves. At the same time, the accuracy of speech segmentation directly affects the speech recognition result: if the segmentation is wrong, recognition of the speech signal may deviate greatly or even become impossible.
At present, however, when a voice file needs to be cut sentence by sentence, the cutting is mostly done manually, so the segmentation accuracy is low, or it must be handled by complex algorithms, so the segmentation efficiency is low.
Summary of the invention
Embodiments of the present invention provide a speech segmentation method and apparatus, a computer device, and a storage medium, to solve the current problems of low segmentation efficiency and low segmentation accuracy for voice files.
A speech segmentation method, including:
obtaining a voice file to be segmented;
pre-processing the voice file to obtain audio data, wherein the audio data includes the sampled values of n sampling points, and n is a positive integer;
normalizing the audio data to obtain standard data corresponding to the audio data, wherein the standard data includes a standard value corresponding to each sampled value;
framing the audio data according to a preset frame length and a preset step size to obtain K speech frames, wherein K is a positive integer;
computing the frame energy of each speech frame from the standard data;
for each speech frame, if the frame energy of the speech frame is below a preset frame-energy threshold, marking the speech frame as a silent frame;
if the number of consecutive silent frames is detected to exceed a preset silent-frame count threshold, marking those consecutive silent frames as a silent segment;
determining the cut frames of the voice file from the silent segments, and cutting the voice file at the cut frames to obtain the target file.
A speech segmentation apparatus, including:
a voice file acquisition module, for obtaining a voice file to be segmented;
a voice file pre-processing module, for pre-processing the voice file to obtain audio data, wherein the audio data includes the sampled values of n sampling points, and n is a positive integer;
an audio data processing module, for normalizing the audio data to obtain standard data corresponding to the audio data, wherein the standard data includes a standard value corresponding to each sampled value;
an audio data framing module, for framing the audio data according to a preset frame length and a preset step size to obtain K speech frames, wherein K is a positive integer;
a frame energy computation module, for computing the frame energy of each speech frame from the standard data;
a silent frame marking module, for marking each speech frame whose frame energy is below a preset frame-energy threshold as a silent frame;
a silent segment marking module, for marking consecutive silent frames as a silent segment when their number is detected to exceed a preset silent-frame count threshold;
a target file acquisition module, for determining the cut frames of the voice file from the silent segments and cutting the voice file at the cut frames to obtain the target file.
A computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the steps of the above speech segmentation method when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program implements the steps of the above speech segmentation method when executed by a processor.
In the above speech segmentation method and apparatus, computer device, and storage medium, framing the voice file and normalizing the audio data improves the processing efficiency of the voice data. The frame energy of each speech frame is then computed to judge the short-time power of the frame, and the silent segments of the audio data are determined from the frame energies, so that silence and pauses within sentences can be identified accurately. Cutting the voice file at the determined cut frames achieves correct sentence-level segmentation, avoids damaging sentence integrity, and improves the segmentation accuracy. At the same time, using frame energy as the segmentation criterion requires no manual intervention and has low complexity, so the voice file is segmented accurately while segmentation efficiency is effectively improved.
Detailed description of the invention
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative labor.
Fig. 1 is a schematic diagram of an application environment of the speech segmentation method in an embodiment of the present invention;
Fig. 2 is a flow chart of the speech segmentation method in an embodiment of the present invention;
Fig. 3 is a detailed flow chart of step S2 in Fig. 2;
Fig. 4 is a functional block diagram of the speech segmentation apparatus in an embodiment of the present invention;
Fig. 5 is a schematic diagram of a computer device in an embodiment of the present invention.
Detailed description of embodiments
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, Fig. 1 shows the application environment of the speech segmentation method provided by an embodiment of the present invention. The speech segmentation method is applied in a speech recognition system for training speech recognition models. The speech recognition system includes a server side and a client connected over a network; the user inputs speech through the client. The client may specifically be, but is not limited to, a personal computer, a laptop, a smart phone, a tablet computer, or a portable wearable device; the server side may be implemented as an independent server or as a server cluster composed of multiple servers. The speech segmentation method provided by the embodiment of the present invention is applied at the server side.
In one of the embodiments, Fig. 2 shows a flow chart of the speech segmentation method. As shown in Fig. 2, the speech segmentation method includes steps S1 to S8, described in detail below:
S1: Obtain the voice file to be segmented.
In the embodiment of the present invention, the voice file to be segmented may be obtained from a corpus used for speech training, or collected through a third-party tool such as WeChat, or obtained from a public sound library; no restriction is imposed here.
Further, the server side detects whether the audio format of the obtained voice file is the wav format, that is, whether the file has the extension ".wav". Wav is a lossless audio file format that fully preserves the data of the speech. Specifically, if the audio format of the audio file is wav, the audio file is used directly for identification and segmentation; otherwise, the audio file is converted into the wav format with an audio format converter, and the converted audio file is taken as the voice file to be segmented.
S2: Pre-process the voice file to obtain audio data, wherein the audio data includes the sampled values of n sampling points, and n is a positive integer.
Specifically, after the voice file to be segmented is obtained, it is encoded by pulse code modulation (PCM): the analog signal of the voice file is sampled at one sampling point per preset time interval to discretize it. The interval is determined by the sampling frequency of the PCM encoding, which can be set from historical experience; for example, a sampling frequency of 8000 Hz means that 8000 sampled signals are acquired per second. It can also be configured according to the practical application; no restriction is imposed here.
Further, the sampled signals of the n sampling points are quantized, and the quantized digital signal is output as binary code groups, obtaining a voice signal corresponding to the voice file, wherein the number of sampling points n is the product of the duration of the voice file and the sampling frequency.
Further, pre-emphasis is applied to the voice signal to enhance its high-frequency components, avoiding excessive attenuation of the high-frequency components during transmission, which would make the resulting speech unclear and affect the accuracy of speech recognition. Specifically, time-domain or frequency-domain techniques can be used to pre-emphasize the voice signal and strengthen the speech energy of the voiced segments, obtaining the audio data.
S3: Normalize the audio data to obtain standard data corresponding to the audio data, wherein the standard data includes a standard value corresponding to each sampled value.
In the embodiment of the present invention, the audio data obtained in step S2 is normalized. The normalization may divide the sampled value of each sampling point by the maximum of the sampled values of the audio data, or by the mean of those sampled values, converging the data into a specific interval and facilitating data processing.
Specifically, after normalization each sampled value in the audio data is converted into a corresponding standard value, yielding the standard data corresponding to the audio data.
S4: Frame the audio data according to a preset frame length and a preset step size to obtain K speech frames, wherein K is a positive integer.
In the embodiment of the present invention, the audio data is divided into non-overlapping frames according to the preset frame length and step size, where the frame length is the length of each speech frame and the step size is the time interval between successive frames. When the frame length equals the step size, the speech frames obtained after framing do not overlap; K is then the quotient of the duration of the voice file divided by the duration of a speech frame. This improves data processing efficiency without affecting the frame-energy computation of the speech frames.
Specifically, the frame length is usually set in the range of 0.01 s to 0.03 s, over which the voice signal is relatively stationary; it can also be configured according to the needs of the practical application, and no restriction is imposed here.
For example, if the frame length is set to 0.01 s, the step size to 0.01 s, and the sampling frequency to 8000 Hz (8000 sampled signals acquired per second), then every 80 sampled values of the audio data are taken as one speech frame. If the last speech frame has fewer than 80 sampled values, data with sampled value 0 is appended to it, so that the last speech frame also contains 80 sampled values.
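As an illustrative sketch (not part of the patent text, with `frame_signal` being a hypothetical helper name), the framing step just described — fixed frame length, step size equal to the frame length, and zero-padding of the final short frame — could look like this:

```python
def frame_signal(samples, frame_len, step):
    """Split a sample sequence into frames of frame_len samples, advancing
    by step samples.  With step == frame_len the frames do not overlap,
    matching the non-overlapping framing described above; the final short
    frame is padded with zero-valued samples."""
    frames = []
    for start in range(0, len(samples), step):
        frame = list(samples[start:start + frame_len])
        frame += [0] * (frame_len - len(frame))  # zero-pad the last frame
        frames.append(frame)
    return frames

# 8000 Hz audio with 0.01 s frames -> 80 samples per frame, as in the example
frames = frame_signal(list(range(200)), frame_len=80, step=80)
```

With 200 samples and 80 samples per frame this yields three frames, the last one zero-padded, mirroring the worked example above.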
S5: Compute the frame energy of each speech frame from the standard data.
In the embodiment of the present invention, the frame energy is the short-time energy of the voice signal; it reflects the amount of speech information in a frame and can be used to judge whether the speech frame is a sentence frame or a silent frame.
Further, since the standard values converge well, computing the frame energy of each speech frame from the standard data improves data processing efficiency.
S6: For each speech frame, if the frame energy of the speech frame is below the preset frame-energy threshold, mark the speech frame as a silent frame.
In the embodiment of the present invention, the frame-energy threshold is a preset parameter: if the computed frame energy is below the frame-energy threshold, the corresponding speech frame is marked as a silent frame. The threshold can be set from historical experience, for example 0.5, or set by concrete analysis of the frame energies computed for the speech frames; no restriction is imposed here.
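The threshold test of step S6 is a simple comparison; as an illustrative sketch (not from the patent, with the function name being an assumption), each frame can be flagged as silent or not:

```python
def mark_silent_frames(energies, energy_threshold=0.5):
    """Mark each speech frame whose frame energy falls below the
    frame-energy threshold as silent (True); 0.5 is only the example
    threshold value mentioned in the text."""
    return [e < energy_threshold for e in energies]
```

The resulting boolean list preserves the frame order, so runs of consecutive silent frames can be located afterwards.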
S7: If the number of consecutive silent frames is detected to exceed the preset silent-frame count threshold, mark those consecutive silent frames as a silent segment.
In the embodiment of the present invention, the silent-frame count threshold is a preset parameter: if a run of consecutive silent frames whose number exceeds the threshold is detected, those consecutive silent frames are marked as a silent segment. The threshold can be set from historical experience, for example 5, or set by concrete analysis of the frame energies computed for the speech frames; no restriction is imposed here.
S8: Determine the cut frames of the voice file from the silent segments, and cut the voice file at the cut frames to obtain the target file.
In the embodiment of the present invention, to ensure that no voiced segment is cut through and that a certain duration of silence remains before and after each voiced segment, the middle frame of each silent segment's run of consecutive frame numbers is taken as the separation point; if the number of consecutive frames is even, the smaller of the two middle frame numbers is marked as the cut frame. No restriction is imposed here.
For example, with a frame-energy threshold of 0.5 and a silent-frame count threshold of 5, suppose screening finds that the frame energies Ene1, Ene2, Ene8, Ene13, Ene14, Ene15, Ene16, Ene17, and Ene18 are below 0.5. The frame numbers of these speech frames are marked as silent frames. Runs of more than 5 consecutive silent frames are then filtered out, so the frames corresponding to Ene13 through Ene18 are marked as a silent segment; the smaller middle frame number of this run is obtained, and the 15th speech frame is marked as the cut frame.
According to the marked cut frames, the audio data is cut at each cut frame, the frames between adjacent cut points are merged into an independent voiced segment, and a target file containing the voice files after cutting is obtained.
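Steps S7 and S8 can be sketched end to end as follows. This is an illustrative sketch, not the patent's implementation; all function names are assumptions, and dropping the cut frame itself when splitting is an interpretation, since the text names only the cut frame:

```python
def find_silent_segments(silent_flags, min_count=5):
    """Runs of more than min_count consecutive silent frames become
    silent segments, returned as (start, end) frame-index pairs."""
    segments, run_start = [], None
    for i, silent in enumerate(list(silent_flags) + [False]):  # sentinel closes a trailing run
        if silent:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start > min_count:
                segments.append((run_start, i - 1))
            run_start = None
    return segments

def choose_cut_frames(segments):
    """The cut frame is the middle frame of each silent segment; for an
    even-length run the smaller of the two central frame numbers is used."""
    return [(start + end) // 2 for start, end in segments]

def split_at_cut_frames(frames, cut_frames):
    """Cut the frame list at each cut frame, merging the frames between
    adjacent cut points into independent segments (the cut frame itself
    is dropped here, an assumption)."""
    out, prev = [], 0
    for c in cut_frames:
        out.append(frames[prev:c])
        prev = c + 1
    out.append(frames[prev:])
    return out

# Worked example from the text: frame indices 1, 2, 8 and 13-18 are silent.
flags = [i in (1, 2, 8, 13, 14, 15, 16, 17, 18) for i in range(20)]
```

Running the example reproduces the text: only the run 13-18 exceeds 5 frames, and its smaller middle frame, frame 15, becomes the cut frame.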
In one of the embodiments, pre-processing the voice file yields audio data, converting the voice file into a data format directly supported by the sound card. Normalizing the audio data and dividing it into multiple speech frames improves data processing efficiency. The frame energy of each speech frame is computed from the standard data corresponding to the audio data; if the frame energy of a speech frame is below the preset frame-energy threshold, the speech frame is marked as a silent frame. Further, if the number of consecutive silent frames detected exceeds the preset silent-frame count threshold, those consecutive silent frames are marked as a silent segment, and the frame number of the cut frame is determined. Finally the voice file is cut at the cut frames to obtain the target file. By framing the voice file, computing the frame energy of each speech frame, and determining the silent segments of the audio data from the frame energies, silence and pauses within sentences can be identified accurately, correct sentence-level segmentation is achieved, sentence integrity is preserved, and segmentation accuracy improves. At the same time, using frame energy as the segmentation criterion requires no manual intervention and has low complexity, so the voice file is segmented accurately while segmentation efficiency is effectively improved.
In one of the embodiments, a specific implementation of pre-processing the voice file to obtain audio data is described in detail.
Referring to Fig. 3, Fig. 3 shows a detailed flow chart of step S2, described as follows:
S21: Encode the voice file by pulse code modulation to obtain a voice signal.
In the embodiment of the present invention, encoding the voice file by pulse code modulation includes sampling, quantization, and encoding: the continuous-time analog signal is converted into a discrete-amplitude digital signal, and the quantized digital signal is output as binary code groups, obtaining a voice signal corresponding to the voice file.
Specifically, if the voice file is a monaural voice file, the sampled value of each sampling point is an 8-bit unsigned integer; if the voice file is a two-channel (stereo) voice file, the sampled value of each sampling point is a 16-bit signed integer. The value ranges of the 8-bit and 16-bit PCM waveform encoding data formats are shown in Table 1.
Table 1
Sample type | Data format | Minimum value | Maximum value |
Monaural voice file | unsigned int (8-bit) | 0 | 255 |
Two-channel voice file | int (16-bit) | -32767 | 32767 |
For example, take a monaural wav file with a sampling rate of 8000 Hz encoded with PCM: the audio file's encoding format is (8000 Hz, 8 bit, unsigned), i.e., the sampling frequency is 8000 times per second and each sampled value is represented by an 8-bit unsigned integer in the range [0, 255]. The speech volume is proportional to the PCM-encoded sampled value: the larger the speech volume, the higher the level acquired by sampling, and the larger the quantized unsigned integer.
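The two PCM layouts in Table 1 can be decoded with Python's standard `struct` module; this is an illustrative sketch (the function name is an assumption, and little-endian byte order is assumed, as is standard for wav PCM):

```python
import struct

def decode_pcm(raw, sample_width):
    """Decode raw PCM bytes into integer sampled values.  8-bit PCM is
    stored as unsigned integers (0..255); 16-bit PCM as signed
    little-endian integers, matching the value ranges in Table 1."""
    if sample_width == 1:
        return list(struct.unpack("%dB" % len(raw), raw))
    if sample_width == 2:
        return list(struct.unpack("<%dh" % (len(raw) // 2), raw))
    raise ValueError("unsupported sample width")
```

For real wav files the standard-library `wave` module exposes the sample width and raw frames, which could feed such a decoder.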
S22: Pre-emphasize the voice signal to obtain audio data.
In the embodiment of the present invention, the energy of the low-frequency band of a voice signal is large while that of the high-frequency band is markedly smaller, so the signal-to-noise ratio of the voice signal is high at low frequencies and low at high frequencies; the high-frequency components are therefore weak and difficult to transmit.
Specifically, a high-pass filter is used to emphasize the high-frequency band of the voice file, obtaining the audio data. This enhances the high-frequency energy of the voice signal and increases the amplitude of the rising and falling edges of the high-band voice signal, thereby increasing the high-frequency signal-to-noise ratio and improving the quality of the voice signal.
In one of the embodiments, the voice file is encoded by pulse code modulation, converting the continuous-time analog signal into a discrete-amplitude digital signal and outputting the quantized digital signal as binary code groups to obtain the voice signal, which facilitates identifying and processing the information in the voice file. The voice signal is then pre-emphasized to enhance its high-frequency energy, obtaining the audio data and improving the quality of the voice signal.
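A common form of the high-pass pre-emphasis described above is a first-order filter; as an illustrative sketch (the patent names no filter form or coefficient, so both the function name and the coefficient value are assumptions):

```python
def pre_emphasis(samples, alpha=0.97):
    """First-order high-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1].
    Boosts the high-frequency band relative to the low band; alpha around
    0.95-0.97 is a conventional choice."""
    out = [float(samples[0])]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out
```

Note how a constant (purely low-frequency) input is strongly attenuated after the first sample, which is exactly the high-pass behaviour the step relies on.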
In one of the embodiments, the normalization of the audio data in step S3, which obtains the standard data corresponding to the audio data, can specifically be realized as follows:
The standard data is calculated according to formula (1):
X = Y / max(Y)    formula (1)
where Y is the audio data, X is the standard data, and max(Y) is the amplitude of the audio data.
It should be noted that the amplitude of the audio data is the maximum of the sampled values of the sampling points in the audio data.
In the embodiment of the present invention, the standard data corresponding to the audio data can be calculated accurately by formula (1), converging the data into a specific interval and improving the efficiency of data processing.
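Formula (1) can be sketched directly; note one interpretation made here (not stated in the patent): the amplitude is taken as the largest absolute sampled value, so signed audio lands in [-1, 1], and the all-zero case is guarded:

```python
def normalize(samples):
    """Normalize the audio data by its amplitude, as in formula (1):
    X = Y / max(Y).  The amplitude is taken as the largest absolute
    sampled value (an interpretation of max(Y) for signed data)."""
    amplitude = max(abs(s) for s in samples)
    if amplitude == 0:
        return [0.0 for _ in samples]
    return [s / amplitude for s in samples]
```

Each standard value then corresponds one-to-one to a sampled value, as required by step S3.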
In one of the embodiments, the computation in step S5 of the frame energy of each speech frame from the standard data can specifically be realized as follows:
The frame energy of each speech frame is calculated according to formula (2):
Ene[i] = A × sum(Xi²)    formula (2)
where Ene[i] is the frame energy of the i-th speech frame, A is a preset regulating factor, and sum(Xi²) is the sum of the squares of the standard values of the sampling points contained in the i-th speech frame.
It should be noted that A is a regulating factor preset according to the characteristics of the voice file. It prevents the volume of sentences in the voice file being too small, or the ambient noise being too large, from making sentences insufficiently distinguishable from silence, which would affect the accuracy of speech segmentation.
In the embodiment of the present invention, the frame energy of each speech frame can be calculated quickly by formula (2). It reflects the amount of speech information in each speech frame, improves the accuracy of speech segmentation, and can further be used to judge whether the speech frame is a sentence frame or a silent frame.
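Formula (2) translates to a one-line sketch (the default value of the regulating factor here is only a placeholder; the patent presets A per voice file):

```python
def frame_energy(frame, a=1.0):
    """Frame energy per formula (2): Ene[i] = A * sum(Xi^2), the regulating
    factor A times the sum of squared standard values in the frame.
    a=1.0 is only a placeholder default for the preset factor."""
    return a * sum(x * x for x in frame)
```

Since the standard values are already normalized, these energies can be compared directly against a fixed frame-energy threshold such as the 0.5 used in the example.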
In one of the embodiments, after the cut frames of the voice file are determined from the silent segments in step S8 and the voice file is cut at the cut frames to obtain the target file, the target file can further be used for speech recognition model training; that is, the speech segmentation method further includes:
carrying out speech recognition model training using the target file.
Specifically, the model training module of the speech recognition system receives the target file obtained in step S8 and, in batches, assigns the corresponding identification information to the voice files after cutting in the target file, for the model training of the speech recognition system.
It should be noted that if the voice file used is long, effects such as automatic alignment by the system degrade the training during the training of the speech recognition model, so that the recognition rate for that voice file is not high; moreover, speech recognition transcription of a long voice file consumes substantial system resources.
In the embodiment of the present invention, using the target file for speech recognition model training ensures that the sentences of the voice files used in the model training of the speech recognition system are all short sentences, improving the efficiency and accuracy of the model training.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation process of the embodiments of the present invention.
In one embodiment, Fig. 4 shows a speech segmentation apparatus corresponding one-to-one to the speech segmentation method in the above embodiments. For ease of description, only the parts related to the embodiments of the present invention are shown.
As shown in Fig. 4, the speech segmentation apparatus includes a voice file acquisition module 31, a voice file pre-processing module 32, an audio data processing module 33, an audio data framing module 34, a frame energy computation module 35, a silent frame marking module 36, a silent segment marking module 37, and a target file acquisition module 38. Each functional module is described in detail as follows:
The voice file obtaining module 31 is configured to obtain a voice file to be segmented.
The voice file preprocessing module 32 is configured to preprocess the voice file to obtain audio data, where the audio data includes the sampled values of n sampling points, and n is a positive integer.
The audio data processing module 33 is configured to normalize the audio data to obtain standard data corresponding to the audio data, where the standard data includes a standard value corresponding to each sampled value.
The audio data framing module 34 is configured to perform framing on the audio data according to a preset frame length and a preset step size to obtain K speech frames, where K is a positive integer.
The frame energy computation module 35 is configured to calculate the frame energy of each speech frame according to the standard data.
The mute frame marking module 36 is configured to, for each speech frame, mark the speech frame as a mute frame if its frame energy is less than a preset frame energy threshold.
The mute segment marking module 37 is configured to mark consecutive mute frames as a mute segment if the number of consecutive mute frames is detected to be greater than a preset mute frame quantity threshold.
The target file obtaining module 38 is configured to determine cutting frames of the voice file according to the mute segments, and to segment the voice file using the cutting frames to obtain target files.
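As an illustration only (not part of the patent disclosure), the cooperation of modules 34 to 38 can be sketched in Python. The frame length, step size, both thresholds, the adjustment factor A, and the choice of a mute segment's middle frame as the cutting frame are all assumptions made for the sketch:

```python
import numpy as np

def find_cut_points(standard_data, frame_len=400, step=160,
                    energy_thresh=1e-4, min_mute_frames=20, a=1.0):
    """Sketch of modules 34-38: frame the normalized samples, compute
    frame energies, mark mute frames and mute segments, pick cut points."""
    # Module 34: split the standard data into K overlapping frames.
    starts = range(0, len(standard_data) - frame_len + 1, step)
    # Module 35: frame energy Ene[i] = A * sum(Xi^2).
    energies = [a * np.sum(standard_data[s:s + frame_len] ** 2) for s in starts]
    # Module 36: a frame is mute if its energy is below the threshold.
    is_mute = [e < energy_thresh for e in energies]
    # Module 37: runs of mute frames longer than the threshold form mute segments.
    cuts, run_start = [], None
    for i, mute in enumerate(is_mute + [False]):  # sentinel flushes a trailing run
        if mute and run_start is None:
            run_start = i
        elif not mute and run_start is not None:
            if i - run_start > min_mute_frames:
                # Module 38 (assumed policy): cut at the middle frame of the segment.
                cuts.append((run_start + i - 1) // 2)
            run_start = None
    return cuts
```

For example, a signal consisting of speech, half a second of silence, and more speech would yield a single cut point inside the silent stretch; the actual file splitting would then translate that frame index back to a sample offset.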
Further, the voice file preprocessing module 32 includes:
an encoding submodule 321, configured to encode the voice file using a pulse code modulation scheme to obtain a voice signal; and
a preemphasis submodule 322, configured to perform preemphasis processing on the voice signal to obtain the audio data.
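The patent does not specify the preemphasis filter or its coefficient; as an illustrative assumption, the common first-order high-pass form y[t] = x[t] − α·x[t−1] with α = 0.97 could look like this:

```python
import numpy as np

def preemphasis(signal, alpha=0.97):
    """First-order preemphasis: y[t] = x[t] - alpha * x[t-1].

    The filter form and the coefficient alpha are assumptions; the
    patent only states that preemphasis processing is performed.
    """
    signal = np.asarray(signal, dtype=float)
    # Keep the first sample unchanged; subtract the scaled previous sample elsewhere.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```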
Further, the audio data processing module 33 includes:
a standard data computation submodule 331, configured to calculate the standard data according to formula (1):
X = Y / max(Y)    formula (1)
where Y is the audio data, X is the standard data, and max(Y) is the maximum amplitude of the audio data.
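As a sketch of formula (1), reading "amplitude" as the maximum absolute sample value (an assumption, since the patent does not define it further), the normalization maps the audio data into [−1, 1]:

```python
import numpy as np

def normalize(audio):
    """Formula (1): X = Y / max(Y).

    'max(Y)' is interpreted here as the maximum absolute sample value,
    which is an assumption; the guard avoids division by zero on silence.
    """
    audio = np.asarray(audio, dtype=float)
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio
```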
Further, the frame energy computation module 35 includes:
a frame energy computation submodule 351, configured to calculate the frame energy of each speech frame according to formula (2):
Ene[i] = A × sum(Xi²)    formula (2)
where Ene[i] is the frame energy of the i-th speech frame, A is a preset adjustment factor, and sum(Xi²) is the sum of the squares of the standard values of the sampling points included in the i-th speech frame.
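Formula (2) reduces to a scaled sum of squares over one frame's standard values; a minimal sketch (the default A = 1.0 is an assumption, since the patent leaves the adjustment factor as a preset):

```python
import numpy as np

def frame_energy(frame, a=1.0):
    """Formula (2): Ene[i] = A * sum(Xi^2).

    `frame` holds the standard (normalized) values of the i-th speech
    frame; `a` is the preset adjustment factor A (default assumed).
    """
    frame = np.asarray(frame, dtype=float)
    return a * np.sum(frame ** 2)
```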
Further, the phonetic segmentation device further includes:
a model training module 39, configured to perform speech recognition model training using the target files.
For the specific process by which each module of the phonetic segmentation device provided in this embodiment realizes its function, refer to the description of the above method embodiment; details are not repeated here.
This embodiment provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the phonetic segmentation method in the above embodiment is realized; alternatively, when the computer program is executed by a processor, the functions of the modules of the phonetic segmentation device in the above embodiment are realized. To avoid repetition, details are not described here again.
It is to be appreciated that the computer-readable storage medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electric carrier signal, a telecommunication signal, and the like.
Fig. 5 is a schematic diagram of a computer equipment provided by an embodiment of the present invention. As shown in Fig. 5, the computer equipment 5 of this embodiment includes a processor 51, a memory 52, and a computer program 53 stored in the memory 52 and executable on the processor 51. When executing the computer program 53, the processor 51 realizes the steps in each of the above phonetic segmentation method embodiments, such as steps S1 to S8 shown in Fig. 2; alternatively, when executing the computer program 53, the processor 51 realizes the functions of the modules of the phonetic segmentation device in the above embodiment, such as the functions of modules 31 to 38 shown in Fig. 4.
Illustratively, the computer program 53 may be divided into one or more modules, which are stored in the memory 52 and executed by the processor 51 to complete the present invention. The one or more modules may be a series of computer program instruction segments capable of completing specific functions, the instruction segments being used to describe the execution process of the computer program 53 in the computer equipment 5. For example, the computer program 53 may be divided into a voice file obtaining module, a voice file preprocessing module, an audio data processing module, an audio data framing module, a frame energy computation module, a mute frame marking module, a mute segment marking module and a target file obtaining module, the specific functions of each module being as shown in the above device embodiment; to avoid repetition, they are not described one by one here.
The computer equipment 5 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The computer equipment 5 may include, but is not limited to, the processor 51 and the memory 52. Those skilled in the art will understand that Fig. 5 is only an example of the computer equipment 5 and does not constitute a limitation on it; the computer equipment 5 may include more or fewer components than illustrated, combine certain components, or have different components, and may, for example, also include input/output devices, network access devices, a bus, and the like.
The processor 51 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 52 may be an internal storage unit of the computer equipment 5, such as a hard disk or memory of the computer equipment 5. The memory 52 may also be an external storage device of the computer equipment 5, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card equipped on the computer equipment 5. Further, the memory 52 may include both the internal storage unit and the external storage device of the computer equipment 5. The memory 52 is used to store the computer program and other programs and data needed by the computer equipment 5, and may also be used to temporarily store data that has been or will be output.
It is apparent to those skilled in the art that, for convenience and brevity of description, the division into the above functional units and modules is used only as an example; in practical applications, the above functions may be allocated to different functional units and modules as needed, that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
The embodiments described above are merely illustrative of the technical solutions of the present invention and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features; such modifications or replacements, insofar as they do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, shall all be included within the protection scope of the present invention.
Claims (10)
1. A phonetic segmentation method, characterized in that the phonetic segmentation method comprises:
obtaining a voice file to be segmented;
preprocessing the voice file to obtain audio data, wherein the audio data comprises the sampled values of n sampling points, and n is a positive integer;
normalizing the audio data to obtain standard data corresponding to the audio data, wherein the standard data comprises a standard value corresponding to each sampled value;
framing the audio data according to a preset frame length and a preset step size to obtain K speech frames, wherein K is a positive integer;
calculating the frame energy of each speech frame according to the standard data;
for each speech frame, marking the speech frame as a mute frame if the frame energy of the speech frame is less than a preset frame energy threshold;
if the number of consecutive mute frames is detected to be greater than a preset mute frame quantity threshold, marking the consecutive mute frames as a mute segment; and
determining cutting frames of the voice file according to the mute segments, and segmenting the voice file using the cutting frames to obtain target files.
2. The phonetic segmentation method according to claim 1, characterized in that the preprocessing the voice file to obtain audio data comprises:
encoding the voice file using a pulse code modulation scheme to obtain a voice signal; and
performing preemphasis processing on the voice signal to obtain the audio data.
3. The phonetic segmentation method according to claim 1, characterized in that the normalizing the audio data to obtain the standard data corresponding to the audio data comprises:
calculating the standard data according to the following formula:
X = Y / max(Y)
wherein Y is the audio data, X is the standard data, and max(Y) is the maximum amplitude of the audio data.
4. The phonetic segmentation method according to claim 1, characterized in that the calculating the frame energy of each speech frame according to the standard data comprises:
calculating the frame energy of each speech frame according to the following formula:
Ene[i] = A × sum(Xi²)
wherein Ene[i] is the frame energy of the i-th speech frame, A is a preset adjustment factor, and sum(Xi²) is the sum of the squares of the standard values of the sampling points included in the i-th speech frame.
5. The phonetic segmentation method according to claim 1, characterized in that, after the determining cutting frames of the voice file according to the mute segments and segmenting the voice file using the cutting frames to obtain target files, the phonetic segmentation method further comprises:
performing speech recognition model training using the target files.
6. A phonetic segmentation device, characterized in that the phonetic segmentation device comprises:
a voice file obtaining module, configured to obtain a voice file to be segmented;
a voice file preprocessing module, configured to preprocess the voice file to obtain audio data, wherein the audio data comprises the sampled values of n sampling points, and n is a positive integer;
an audio data processing module, configured to normalize the audio data to obtain standard data corresponding to the audio data, wherein the standard data comprises a standard value corresponding to each sampled value;
an audio data framing module, configured to frame the audio data according to a preset frame length and a preset step size to obtain K speech frames, wherein K is a positive integer;
a frame energy computation module, configured to calculate the frame energy of each speech frame according to the standard data;
a mute frame marking module, configured to, for each speech frame, mark the speech frame as a mute frame if the frame energy of the speech frame is less than a preset frame energy threshold;
a mute segment marking module, configured to mark consecutive mute frames as a mute segment if the number of consecutive mute frames is detected to be greater than a preset mute frame quantity threshold; and
a target file obtaining module, configured to determine cutting frames of the voice file according to the mute segments, and to segment the voice file using the cutting frames to obtain target files.
7. The phonetic segmentation device according to claim 6, characterized in that the voice file preprocessing module comprises:
an encoding submodule, configured to encode the voice file using a pulse code modulation scheme to obtain a voice signal; and
a preemphasis submodule, configured to perform preemphasis processing on the voice signal to obtain the audio data.
8. The phonetic segmentation device according to claim 6, characterized in that the frame energy computation module comprises:
a frame energy computation submodule, configured to calculate the frame energy of each speech frame according to the following formula:
Ene[i] = A × sum(Xi²)
wherein Ene[i] is the frame energy of the i-th speech frame, A is a preset adjustment factor, and sum(Xi²) is the sum of the squares of the standard values of the sampling points included in the i-th speech frame.
9. A computer equipment, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, realizes the steps of the phonetic segmentation method according to any one of claims 1 to 5.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, realizes the steps of the phonetic segmentation method according to any one of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810548508.3A CN108847217A (en) | 2018-05-31 | 2018-05-31 | A kind of phonetic segmentation method, apparatus, computer equipment and storage medium |
PCT/CN2018/092566 WO2019227547A1 (en) | 2018-05-31 | 2018-06-25 | Voice segmenting method and apparatus, and computer device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810548508.3A CN108847217A (en) | 2018-05-31 | 2018-05-31 | A kind of phonetic segmentation method, apparatus, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108847217A true CN108847217A (en) | 2018-11-20 |
Family
ID=64210253
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810548508.3A Pending CN108847217A (en) | 2018-05-31 | 2018-05-31 | A kind of phonetic segmentation method, apparatus, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108847217A (en) |
WO (1) | WO2019227547A1 (en) |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109495496A (en) * | 2018-12-11 | 2019-03-19 | 泰康保险集团股份有限公司 | Method of speech processing, device, electronic equipment and computer-readable medium |
CN109840052A (en) * | 2019-01-31 | 2019-06-04 | 成都超有爱科技有限公司 | A kind of audio-frequency processing method, device, electronic equipment and storage medium |
CN109948124A (en) * | 2019-03-15 | 2019-06-28 | 腾讯科技(深圳)有限公司 | Voice document cutting method, device and computer equipment |
CN110457002A (en) * | 2019-07-03 | 2019-11-15 | 平安科技(深圳)有限公司 | A kind of multimedia file processing method, device and computer storage medium |
CN110491370A (en) * | 2019-07-15 | 2019-11-22 | 北京大米科技有限公司 | A kind of voice stream recognition method, device, storage medium and server |
WO2019227547A1 (en) * | 2018-05-31 | 2019-12-05 | 平安科技(深圳)有限公司 | Voice segmenting method and apparatus, and computer device and storage medium |
CN110602302A (en) * | 2019-08-15 | 2019-12-20 | 厦门快商通科技股份有限公司 | Voice intercepting method and device for telephone robot and storage medium |
CN110992989A (en) * | 2019-12-06 | 2020-04-10 | 广州国音智能科技有限公司 | Voice acquisition method and device and computer readable storage medium |
CN111108553A (en) * | 2019-12-24 | 2020-05-05 | 广州国音智能科技有限公司 | Voiceprint detection method, device and equipment for sound collection object |
CN111213205A (en) * | 2019-12-30 | 2020-05-29 | 深圳市优必选科技股份有限公司 | Streaming voice conversion method and device, computer equipment and storage medium |
CN111312219A (en) * | 2020-01-16 | 2020-06-19 | 上海携程国际旅行社有限公司 | Telephone recording marking method, system, storage medium and electronic equipment |
CN111326172A (en) * | 2018-12-17 | 2020-06-23 | 北京嘀嘀无限科技发展有限公司 | Conflict detection method and device, electronic equipment and readable storage medium |
CN111627453A (en) * | 2020-05-13 | 2020-09-04 | 广州国音智能科技有限公司 | Public security voice information management method, device, equipment and computer storage medium |
CN111696526A (en) * | 2020-06-22 | 2020-09-22 | 北京达佳互联信息技术有限公司 | Method for generating voice recognition model, voice recognition method and device |
CN111710332A (en) * | 2020-06-30 | 2020-09-25 | 北京达佳互联信息技术有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN112185390A (en) * | 2020-09-27 | 2021-01-05 | 中国商用飞机有限责任公司北京民用飞机技术研究中心 | Onboard information assisting method and device |
CN112185424A (en) * | 2020-09-29 | 2021-01-05 | 国家计算机网络与信息安全管理中心 | Voice file cutting and restoring method, device, equipment and storage medium |
CN112331188A (en) * | 2019-07-31 | 2021-02-05 | 武汉Tcl集团工业研究院有限公司 | Voice data processing method, system and terminal equipment |
CN112614515A (en) * | 2020-12-18 | 2021-04-06 | 广州虎牙科技有限公司 | Audio processing method and device, electronic equipment and storage medium |
CN112615869A (en) * | 2020-12-22 | 2021-04-06 | 平安银行股份有限公司 | Audio data processing method, device, equipment and storage medium |
CN112712791A (en) * | 2020-12-08 | 2021-04-27 | 深圳市优必选科技股份有限公司 | Mute voice detection method, device, terminal equipment and storage medium |
CN112750453A (en) * | 2020-12-24 | 2021-05-04 | 北京猿力未来科技有限公司 | Audio signal screening method, device, equipment and storage medium |
CN112767920A (en) * | 2020-12-31 | 2021-05-07 | 深圳市珍爱捷云信息技术有限公司 | Method, device, equipment and storage medium for recognizing call voice |
CN113593528A (en) * | 2021-06-30 | 2021-11-02 | 北京百度网讯科技有限公司 | Training method and device of voice segmentation model, electronic equipment and storage medium |
CN113823277A (en) * | 2021-11-23 | 2021-12-21 | 北京百瑞互联技术有限公司 | Keyword recognition method, system, medium, and apparatus based on deep learning |
CN114283840A (en) * | 2021-12-22 | 2022-04-05 | 天翼爱音乐文化科技有限公司 | Instruction audio generation method, system, device and storage medium |
CN116847245A (en) * | 2023-06-30 | 2023-10-03 | 杭州雄迈集成电路技术股份有限公司 | Digital audio automatic gain method, system and computer storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101308653A (en) * | 2008-07-17 | 2008-11-19 | 安徽科大讯飞信息科技股份有限公司 | End-point detecting method applied to speech identification system |
CN103366739A (en) * | 2012-03-28 | 2013-10-23 | 郑州市科学技术情报研究所 | Self-adaptive endpoint detection method and self-adaptive endpoint detection system for isolate word speech recognition |
CN106887241A (en) * | 2016-10-12 | 2017-06-23 | 阿里巴巴集团控股有限公司 | A kind of voice signal detection method and device |
CN107170464A (en) * | 2017-05-25 | 2017-09-15 | 厦门美图之家科技有限公司 | A kind of changing speed of sound method and computing device based on music rhythm |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100505040C (en) * | 2005-07-26 | 2009-06-24 | 浙江大学 | Audio frequency splitting method for changing detection based on decision tree and speaking person |
CN101221762A (en) * | 2007-12-06 | 2008-07-16 | 上海大学 | MP3 compression field audio partitioning method |
CN101685446A (en) * | 2008-09-25 | 2010-03-31 | 索尼(中国)有限公司 | Device and method for analyzing audio data |
CN103345922B (en) * | 2013-07-05 | 2016-07-06 | 张巍 | A kind of large-length voice full-automatic segmentation method |
CN108847217A (en) * | 2018-05-31 | 2018-11-20 | 平安科技(深圳)有限公司 | A kind of phonetic segmentation method, apparatus, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2019227547A1 (en) | 2019-12-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108847217A (en) | A kind of phonetic segmentation method, apparatus, computer equipment and storage medium | |
US10565983B2 (en) | Artificial intelligence-based acoustic model training method and apparatus, device and storage medium | |
CN102446504B (en) | Voice/Music identifying method and equipment | |
WO2021128741A1 (en) | Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium | |
KR102128926B1 (en) | Method and device for processing audio information | |
CN106653056B (en) | Fundamental frequency extraction model and training method based on LSTM recurrent neural network | |
US9805712B2 (en) | Method and device for recognizing voice | |
EP3989220B1 (en) | Time delay estimation method and device | |
CN104966517A (en) | Voice frequency signal enhancement method and device | |
CN112786029B (en) | Method and apparatus for training VAD using weakly supervised data | |
WO2021196475A1 (en) | Intelligent language fluency recognition method and apparatus, computer device, and storage medium | |
US10147443B2 (en) | Matching device, judgment device, and method, program, and recording medium therefor | |
CN104952449A (en) | Method and device for identifying environmental noise sources | |
CN102376306B (en) | Method and device for acquiring level of speech frame | |
CN112331188A (en) | Voice data processing method, system and terminal equipment | |
CN110111811A (en) | Audio signal detection method, device and storage medium | |
KR101140896B1 (en) | Method and apparatus for speech segmentation | |
CN114023342B (en) | Voice conversion method, device, storage medium and electronic equipment | |
CN114267342A (en) | Recognition model training method, recognition method, electronic device and storage medium | |
CN103474067B (en) | speech signal transmission method and system | |
US10276186B2 (en) | Parameter determination device, method, program and recording medium for determining a parameter indicating a characteristic of sound signal | |
CN112820305A (en) | Encoding device, encoding method, encoding program, and recording medium | |
CN113314134B (en) | Bone conduction signal compensation method and device | |
CN113613159B (en) | Microphone blowing signal detection method, device and system | |
Hsieh et al. | Energy-based VAD with grey magnitude spectral subtraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181120 |