CN114512118A - Intelligent sentence dividing method based on sound spectrogram, computer device and storage medium


Info

Publication number
CN114512118A
CN114512118A
Authority
CN
China
Prior art keywords
spectrogram
frequency spectrum
sentence
data
spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210005950.8A
Other languages
Chinese (zh)
Inventor
柯韦
许立文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Macao Polytechnic Institute
Original Assignee
Macao Polytechnic Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Macao Polytechnic Institute filed Critical Macao Polytechnic Institute
Priority to CN202210005950.8A
Publication of CN114512118A
Legal status: Pending

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/18: the extracted parameters being spectral information of each sub-band
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an intelligent sentence segmentation method based on a sound spectrogram, a computer device and a storage medium. The method comprises the following steps: obtaining speech data to be segmented, and converting the speech data into spectrogram data; identifying silent spectrum segments in the spectrogram data; acquiring a pre-spectrum of a first preset duration before a silent spectrum segment and a post-spectrum of a second preset duration after the silent spectrum segment, and combining the pre-spectrum and the post-spectrum into a spectrogram to be identified; identifying the spectrogram to be identified with a preset classification model, and confirming the pause category of the silent spectrum segment; and segmenting the speech data into sentences according to the pause category. Applying this intelligent sentence segmentation method based on the sound spectrogram can effectively improve the accuracy of speech sentence segmentation.

Description

Intelligent sentence dividing method based on sound spectrogram, computer device and storage medium
Technical Field
The invention relates to the technical field of speech recognition, and in particular to an intelligent sentence segmentation method based on a sound spectrogram, a computer device applying the method, and a computer-readable storage medium applying the method.
Background
Language is an important tool for person-to-person communication and for obtaining information. When sound travels through a medium to a listener's ear, the brain processes it, forms an understanding, and responds with speech or action. For a computer to understand human language, speech recognition technology, an important technology of human-computer interaction, is required. Computer speech recognition, i.e., speech-to-text (STT) or automatic speech recognition (ASR), is the process by which a computer recognizes spoken language and converts it into text.
One of the core factors in the accuracy of automatic speech recognition (ASR) is the collection and quality assurance of parallel corpora (speech and text). Although corpus preprocessing is tedious and time-consuming, it remains a critical step. Preprocessing includes data crawling, sentence segmentation, corpus alignment, noise removal, speech normalization, and the like.
There are many ways for a computer to automatically split a recording into smaller files, and the field of natural language processing also provides methods (speech segmentation algorithms) for automatically splitting recorded human speech into sentences. Several splitting methods are common at present: the first splits by fixed time length; the second splits by fixed file size; the third automatically detects silent segments in the speech and splits at them; the fourth first recognizes the speech as text and then splits the text into sentences using a corresponding language model.
Each existing method has certain problems: 1. since the content of a sentence is not of fixed length, the first and second methods are unsuitable for most cases; 2. the third method can only split a passage of speech into many small pieces, which in many cases are short clauses or phrases rather than complete sentences; 3. the fourth method involves both speech recognition and sentence segmentation techniques, i.e., a two-step conversion in which the segmentation quality depends on the recognition result. Although speech recognition accuracy is now good for widely used languages such as English and Mandarin, it remains unsatisfactory for other languages such as European Portuguese. A more accurate speech sentence segmentation method is therefore needed.
Disclosure of Invention
The first object of the invention is to provide an intelligent sentence segmentation method based on a sound spectrogram that can effectively improve the accuracy of speech sentence segmentation.
The second object of the invention is to provide a computer device that effectively improves the accuracy of speech sentence segmentation.
The third object of the invention is to provide a computer-readable storage medium that effectively improves the accuracy of speech sentence segmentation.
To achieve the first object, the intelligent sentence segmentation method based on the sound spectrogram provided by the invention comprises the following steps: obtaining speech data to be segmented, and converting the speech data into spectrogram data; identifying silent spectrum segments in the spectrogram data; acquiring a pre-spectrum of a first preset duration before a silent spectrum segment and a post-spectrum of a second preset duration after the silent spectrum segment, and combining the pre-spectrum and the post-spectrum into a spectrogram to be identified; identifying the spectrogram to be identified with a preset classification model, and confirming the pause category of the silent spectrum segment; and segmenting the speech data into sentences according to the pause category.
According to this scheme, when preprocessing the speech data to be segmented, the method acquires a pre-spectrum of a first preset duration before each silent spectrum segment and a post-spectrum of a second preset duration after it, and combines them into a spectrogram to be identified. The preset classification model then analyzes the spectrogram to be identified and, using the characteristics of the spectrogram before and after a pause in speech, determines the pause category of the silent segment, after which the speech data can be segmented into sentences. The accuracy of speech sentence segmentation is thereby effectively improved.
In a further aspect, the step of combining the pre-spectrum and the post-spectrum into the spectrogram to be identified comprises: adding a mute spectrum of a third preset duration between the pre-spectrum and the post-spectrum to obtain the spectrogram to be identified.
Thus, adding a mute spectrum of a third preset duration between the pre-spectrum and the post-spectrum makes it easy to distinguish the audio before and after the pause, and lets the spectrogram to be identified conform to the recognition standard, which facilitates model analysis.
In a further aspect, the third preset duration ranges from 1/5 to 1/4 of the total spectrum duration in the spectrogram to be identified.
Thus, setting the third preset duration to between one fifth and one quarter of the total spectrum duration in the spectrogram to be identified guarantees both recognition accuracy and recognition speed.
In a further aspect, the second preset duration is three times as long as the first preset duration.
Thus, because the audio at the beginning of a sentence is more representative than the audio at the end of the preceding sentence, increasing the proportion of sentence-initial audio improves recognition accuracy.
In a further aspect, the step of identifying a silent spectrum segment from the spectrogram data comprises: when the frequency amplitude appearing in the spectrogram data is smaller than a preset value and this lasts for a preset duration, regarding that spectrogram segment as a silent spectrum segment.
Thus, a silent spectrum segment is identified from the frequency amplitude of the audio: when the amplitude stays below the preset value for the preset duration, a silent spectrum segment can be considered to have occurred.
In a further aspect, the preset classification model is obtained by convolutional neural network learning.
Thus, obtaining the preset classification model through convolutional neural network learning improves the accuracy of model recognition.
In a further aspect, the step of convolutional neural network learning comprises: acquiring spectrogram data corresponding to training speech data; labeling all silent spectrum segments in the spectrogram data with pause categories; acquiring, for each piece of spectrogram data, a pre-spectrum of a first preset duration before each silent spectrum segment and a post-spectrum of a second preset duration after it to form a training spectrogram; and performing model training on the training spectrograms with a convolutional neural network algorithm to obtain the preset classification model.
Thus, when the preset classification model is obtained by convolutional neural network learning, labeling all silent spectrum segments in the spectrogram data with pause categories facilitates recognition, classification and judgment; meanwhile, forming each training spectrogram from a pre-spectrum of a first preset duration before a silent segment and a post-spectrum of a second preset duration after it reduces the amount of data to be analyzed and improves training efficiency.
In a further aspect, after the step of segmenting the speech data into sentences according to the pause category, the method further comprises: storing the sentences obtained by segmentation in a preset format.
Thus, storing the segmented sentences in a preset format facilitates subsequent application processing.
To achieve the second object, the invention provides a computer device comprising a processor and a memory, wherein the memory stores a computer program that, when executed by the processor, implements the steps of the above intelligent sentence segmentation method based on the sound spectrogram.
To achieve the third object, the invention provides a computer-readable storage medium on which a computer program is stored; when executed by a controller, the computer program implements the steps of the above intelligent sentence segmentation method based on the sound spectrogram.
Drawings
Fig. 1 is a flowchart of an embodiment of the intelligent sentence segmentation method based on the sound spectrogram.
Fig. 2 is a schematic diagram of a spectrogram to be identified in the embodiment of the intelligent sentence segmentation method based on the sound spectrogram.
Fig. 3 is a flowchart of the convolutional neural network learning step in the embodiment of the intelligent sentence segmentation method based on the sound spectrogram.
The invention is further explained with reference to the drawings and the embodiments.
Detailed Description
Embodiment of the intelligent sentence segmentation method based on the sound spectrogram:
The intelligent sentence segmentation method based on the sound spectrogram is applied in a computer application program and is used to perform sentence segmentation on speech data.
As shown in Fig. 1, in this embodiment the method first executes step S1: obtain the speech data to be segmented and convert it into spectrogram data. The speech data to be segmented may be acquired in real time or recorded in advance. After the speech data is obtained, it must be converted into spectrogram data to facilitate analysis. Converting speech data into spectrogram data is well known to those skilled in the art and is not described in detail here.
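The patent treats this conversion as known art; purely as an illustrative sketch, the following Python code shows one common way to do it. The use of librosa, the sampling rate and the STFT parameters are all assumptions of the illustration, not part of the invention.

```python
import librosa
import numpy as np

def speech_to_spectrogram(path, sr=16000, n_fft=512, hop_length=160):
    """Load a speech file and convert it to log-magnitude spectrogram data."""
    y, sr = librosa.load(path, sr=sr)                   # resampled waveform
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
    return spec_db, sr                                  # (freq_bins, frames)
```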
After the speech data is converted into spectrogram data, step S2 is executed: identify silent spectrum segments in the spectrogram data. In natural language, because of speaking habits or the rules of the language, a speaker pauses for a period of time at the end of a complete sentence before beginning the next one; pauses therefore usually exist in speech data, appearing as silent spectrum segments in the audio, and whether a sentence has ended can be judged from them. However, silent spectrum segments also occur at non-sentence-end pauses, such as those caused by commas, semicolons or parenthetical insertions, so sentence-end pauses must be distinguished from non-end pauses, which first requires identifying the silent segments in the audio.
In this embodiment, the step of identifying a silent spectrum segment from the spectrogram data comprises: when the frequency amplitude appearing in the spectrogram data is smaller than a preset value and this lasts for a preset duration, the spectrogram segment is regarded as a silent spectrum segment. When a pause occurs, the frequency amplitude in the spectrogram stays low for a stretch; by checking the frequency amplitude of the audio, an amplitude below the preset value sustained for the preset duration can be taken as the occurrence of a silent spectrum segment.
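A minimal sketch of this detection rule follows, assuming the log-magnitude spectrogram from the previous sketch. The preset value (-40 dB) and preset duration (200 ms) are illustrative assumptions; the patent leaves both to experiment.

```python
import numpy as np

def find_silent_segments(spec_db, sr=16000, hop_length=160,
                         threshold_db=-40.0, min_dur_s=0.20):
    """Return (start, end) frame ranges whose level stays below the threshold."""
    frame_level = spec_db.mean(axis=0)          # mean magnitude per frame (dB)
    quiet = frame_level < threshold_db
    min_frames = int(min_dur_s * sr / hop_length)
    segments, start = [], None
    for i, q in enumerate(quiet):
        if q and start is None:
            start = i                           # a quiet run begins
        elif not q and start is not None:
            if i - start >= min_frames:         # long enough: silent segment
                segments.append((start, i))
            start = None
    if start is not None and len(quiet) - start >= min_frames:
        segments.append((start, len(quiet)))    # quiet run at end of audio
    return segments
```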
After the silent spectrum segments are identified, step S3 is executed: obtain a pre-spectrum of a first preset duration before each silent segment and a post-spectrum of a second preset duration after it, and combine them into a spectrogram to be identified. Studies have shown that in some languages the words used at the beginnings and ends of sentences are common: in Portuguese, for example, the most frequent sentence-initial words "O" and "A" occur with frequencies of 7.88% and 5.15% respectively, while the two most common sentence-final words (shown as inline images in the original publication, not reproducible here) occur with frequencies of 0.54% and 0.47%. Because such frequently used sentence-initial and sentence-final words have characteristic spectra, the spectrogram features before and after a pause make it possible to identify whether a silent spectrum segment is a sentence-end pause. The first and second preset durations are set in advance according to experimental data. Since the audio at the beginning of a sentence is more representative than the audio at the end, recognition accuracy can be improved by increasing the proportion of sentence-initial audio; in this embodiment the second preset duration is three times the first, preferably 100 ms for the first and 300 ms for the second.
In this embodiment, the step of combining the pre-spectrum and the post-spectrum into the spectrogram to be identified comprises: adding a mute spectrum of a third preset duration between the pre-spectrum and the post-spectrum to obtain the spectrogram to be identified. As shown in Fig. 2, the pre-spectrum 1, the mute spectrum 2 and the post-spectrum 3 are combined into the spectrogram to be identified. Placing a mute spectrum between the pre-spectrum and the post-spectrum makes it easy to distinguish the audio before and after the pause, and the third preset duration is set so that the spectrogram to be identified conforms to the model's input standard, which facilitates analysis. To guarantee recognition accuracy and speed, the third preset duration may be preset according to experimental data; in this embodiment it ranges from 1/5 to 1/4 of the total spectrum duration in the spectrogram to be identified.
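A sketch of step S3 under the embodiment's values (100 ms pre-spectrum, 300 ms post-spectrum), with an assumed 100 ms mute spectrum, which makes the gap 1/5 of the 500 ms total and thus falls inside the stated 1/5 to 1/4 range. A real implementation would also pad pre- and post-spectra clipped at the file boundaries so that every candidate has the same width.

```python
import numpy as np

def build_candidate(spec_db, segment, sr=16000, hop_length=160,
                    pre_s=0.10, post_s=0.30, gap_s=0.10):
    """Combine pre-spectrum, mute spectrum and post-spectrum for one pause."""
    frames_per_s = sr / hop_length
    pre_f = int(pre_s * frames_per_s)
    post_f = int(post_s * frames_per_s)
    gap_f = int(gap_s * frames_per_s)
    start, end = segment                             # silent segment bounds
    pre = spec_db[:, max(0, start - pre_f):start]    # spectrum before pause
    post = spec_db[:, end:end + post_f]              # spectrum after pause
    gap = np.full((spec_db.shape[0], gap_f), spec_db.min())  # mute spectrum
    return np.concatenate([pre, gap, post], axis=1)
```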
After the spectrogram to be identified is obtained, step S4 is executed: identify the spectrogram to be identified with the preset classification model and confirm the pause category of the silent spectrum segment, where the pause categories are sentence-end pause and non-sentence-end pause. In this embodiment, the preset classification model is obtained by convolutional neural network learning, which improves the accuracy of model recognition.
Referring to Fig. 3, in this embodiment convolutional neural network learning first executes step S41: acquire spectrogram data corresponding to the training speech data. For the trained model to classify accurately, a large amount of speech data must be learned. In deep learning, image classification appeared earlier than other applications and its techniques are more mature: an accurate model can be trained from a comparatively small training set, and such an approach is more efficient than applying speech recognition directly. Training speech data is therefore acquired and processed into corresponding spectrogram data for model training.
After the spectrogram data corresponding to the training speech data is acquired, step S42 is executed: label all silent spectrum segments in the spectrogram data with pause categories. Each silent spectrum segment is labeled manually: when the segment corresponds to the end of a spoken sentence, it is labeled with the number "1", representing a sentence-end pause; when the pause is not at a sentence end, i.e., it is caused by a comma, semicolon or parenthetical insertion, or the intonation indicates that the sentence is unfinished, it is labeled with the number "0", representing a non-sentence-end pause.
After the pause categories are labeled, step S43 is executed: for each piece of spectrogram data, acquire a pre-spectrum of the first preset duration before each silent spectrum segment and a post-spectrum of the second preset duration after it to form a training spectrogram. Forming training spectrograms in this way reduces the amount of data to be analyzed and speeds up model training. Each training spectrogram has the same spectrum structure as the spectrogram to be identified, i.e., a mute spectrum of the third preset duration is placed between the pre-spectrum and the post-spectrum. The resulting training spectrograms are stored in a training spectrogram library for use in training.
After the training spectrograms are obtained, step S44 is executed: perform model training on the training spectrograms with a convolutional neural network algorithm to obtain the preset classification model. Training a classification model with a convolutional neural network algorithm is a technique known to those skilled in the art and is not described again here; in this embodiment, the algorithm yields a preset classification model that can identify a silent spectrum segment as a sentence-end pause or a non-sentence-end pause.
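Since the patent does not disclose the network architecture, the following Keras sketch is only one plausible binary classifier over the 500 ms candidate spectrograms of the earlier sketches; every layer size is an assumption of the illustration.

```python
import tensorflow as tf

def build_classifier(freq_bins, frames):
    """A small binary CNN: output is P(sentence-end pause)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(freq_bins, frames, 1)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # label 1: end, 0: non-end
    ])

model = build_classifier(257, 50)  # 257 bins for n_fft=512; 50 frames = 500 ms
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_specs[..., None], train_labels, epochs=10)  # labels are 1/0
```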
After the pause category of a silent spectrum segment is confirmed, step S5 is executed: segment the speech data into sentences according to the pause category. Once a silent segment is confirmed as a sentence-end pause or a non-sentence-end pause, the speech can be cut accordingly: if the pause category is a sentence-end pause, the sentence is complete and the silent segment is used as a cut point; if it is a non-sentence-end pause, the sentence is incomplete and the next silent segment must be examined, until one whose pause category is a sentence-end pause is found. By segmenting the speech in this way, all the sentences in a speech file can be identified.
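Tying the sketches together, step S5 might be implemented as below; the helper functions, the fixed candidate size and the 0.5 decision threshold are all assumptions carried over from the earlier illustrations.

```python
def split_sentences(spec_db, segments, model):
    """Cut the spectrogram only at silent segments classified as sentence ends."""
    sentences, prev = [], 0
    for seg in segments:
        candidate = build_candidate(spec_db, seg)
        p_end = float(model.predict(candidate[None, ..., None], verbose=0)[0, 0])
        if p_end >= 0.5:                     # sentence-end pause: cut here
            cut = (seg[0] + seg[1]) // 2     # cut inside the silent segment
            sentences.append((prev, cut))
            prev = cut
        # non-sentence-end pause: continue to the next silent segment
    sentences.append((prev, spec_db.shape[1]))   # final (or only) sentence
    return sentences                             # frame ranges, one per sentence
```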
Of course, if a passage of speech contains no silent spectrum segment, it contains only one sentence and needs no segmentation.
After segmentation, step S6 is executed: store the segmented sentences in a preset format. The preset format may be set according to application requirements; in this embodiment it is the MP3 format. Storing the segmented sentences in a preset format facilitates subsequent applications such as text conversion, translation and editing.
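As an illustrative sketch of step S6, assuming pydub (with ffmpeg available for MP3 encoding) and the frame-to-millisecond conversion implied by the hop length used in the earlier sketches:

```python
from pydub import AudioSegment   # pydub is an assumption; requires ffmpeg for MP3

def export_sentences(audio_path, sentences, sr=16000, hop_length=160):
    """Write each sentence's frame range out as a numbered MP3 file."""
    audio = AudioSegment.from_file(audio_path)
    ms_per_frame = 1000.0 * hop_length / sr      # 10 ms per frame here
    for i, (a, b) in enumerate(sentences):
        clip = audio[int(a * ms_per_frame):int(b * ms_per_frame)]
        clip.export(f"sentence_{i:03d}.mp3", format="mp3")
```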
When the intelligent sentence segmentation method based on the sound spectrogram preprocesses the speech data to be segmented, it acquires a pre-spectrum of the first preset duration before each silent spectrum segment and a post-spectrum of the second preset duration after it, and combines them into a spectrogram to be identified; the preset classification model analyzes the spectrogram to be identified and, using the characteristics of the spectrogram before and after a pause in speech, determines the pause category of the silent segment, after which the speech data can be segmented into sentences. The accuracy of speech sentence segmentation is thereby effectively improved.
Computer device embodiment:
The computer device of this embodiment includes a controller and a memory; when the controller executes the computer program stored in the memory, the steps of the above embodiment of the intelligent sentence segmentation method based on the sound spectrogram are implemented.
For example, the computer program may be partitioned into one or more modules that are stored in the memory and executed by the controller to implement the invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in the computer device.
The computer device may include, but is not limited to, a controller and a memory. Those skilled in the art will appreciate that the computer device may include more or fewer components than listed, or combine certain components, or use different components; for example, it may also include input-output devices, network access devices, buses, and the like.
For example, the controller may be a central processing unit (CPU), another general-purpose controller, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose controller may be a microcontroller, or the controller may be any conventional controller or the like. The controller is the control center of the computer device and connects the various parts of the entire device using various interfaces and lines.
The memory may be used to store the computer program and/or modules, and the controller implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and by invoking data stored in the memory. For example, the memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function, and the like. In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
Computer-readable storage medium embodiment:
If the modules integrated by the computer device of the above embodiment are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the above embodiment of the intelligent sentence segmentation method based on the sound spectrogram may also be completed by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a controller, implements the steps of the above method embodiment. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that the above is only a preferred embodiment of the invention, but the design concept of the invention is not limited thereto, and any insubstantial modification made using this design concept also falls within the protection scope of the invention.

Claims (10)

1. An intelligent sentence segmentation method based on a sound spectrogram, characterized in that the method comprises the following steps:
obtaining speech data to be segmented, and converting the speech data to be segmented into spectrogram data to be segmented;
identifying a silent spectrum segment according to the spectrogram data to be segmented;
acquiring a pre-spectrum of a first preset duration before the silent spectrum segment and a post-spectrum of a second preset duration after the silent spectrum segment, and combining the pre-spectrum and the post-spectrum into a spectrogram to be identified;
identifying the spectrogram to be identified with a preset classification model, and confirming the pause category of the silent spectrum segment;
and segmenting the speech data to be segmented into sentences according to the pause category.
2. The intelligent sentence segmentation method based on the sound spectrogram of claim 1, characterized in that:
the step of combining the pre-spectrum and the post-spectrum into the spectrogram to be identified comprises:
adding a mute spectrum of a third preset duration between the pre-spectrum and the post-spectrum to obtain the spectrogram to be identified.
3. The intelligent sentence segmentation method based on the sound spectrogram of claim 2, characterized in that:
the third preset duration ranges from 1/5 to 1/4 of the total spectrum duration in the spectrogram to be identified.
4. The intelligent sentence segmentation method based on the sound spectrogram of claim 3, characterized in that:
the second preset duration is three times as long as the first preset duration.
5. The intelligent sentence segmentation method based on the sound spectrogram according to any one of claims 1 to 4, characterized in that:
the step of identifying a silent spectrum segment from the spectrogram data comprises:
when the frequency amplitude appearing in the spectrogram data is smaller than a preset value and lasts for a preset duration, regarding the spectrogram segment as a silent spectrum segment.
6. The intelligent sentence segmentation method based on the sound spectrogram according to any one of claims 1 to 4, characterized in that:
the preset classification model is obtained by convolutional neural network learning.
7. The intelligent sentence segmentation method based on the sound spectrogram of claim 6, characterized in that:
the step of convolutional neural network learning comprises:
acquiring spectrogram data corresponding to training speech data;
labeling all silent spectrum segments in the spectrogram data with pause categories;
acquiring, for each piece of spectrogram data, a pre-spectrum of a first preset duration before the silent spectrum segment and a post-spectrum of a second preset duration after the silent spectrum segment to form a training spectrogram;
and performing model training on the training spectrogram with a convolutional neural network algorithm to obtain the preset classification model.
8. The intelligent sentence segmentation method based on the sound spectrogram according to any one of claims 1 to 4, characterized in that:
after the step of segmenting the speech data to be segmented into sentences according to the pause category, the method further comprises:
storing the sentences obtained by segmentation in a preset format.
9. A computer device comprising a processor and a memory, characterized in that: the memory stores a computer program which, when executed by the processor, implements the steps of the intelligent sentence segmentation method based on the sound spectrogram according to any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a controller, implements the steps of the intelligent sentence segmentation method based on the sound spectrogram according to any one of claims 1 to 8.
CN202210005950.8A 2022-01-04 2022-01-04 Intelligent sentence dividing method based on sound spectrogram, computer device and storage medium Pending CN114512118A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210005950.8A CN114512118A (en) 2022-01-04 2022-01-04 Intelligent sentence dividing method based on sound spectrogram, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210005950.8A CN114512118A (en) 2022-01-04 2022-01-04 Intelligent sentence dividing method based on sound spectrogram, computer device and storage medium

Publications (1)

Publication Number Publication Date
CN114512118A 2022-05-17

Family

ID=81549896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210005950.8A Pending CN114512118A (en) 2022-01-04 2022-01-04 Intelligent sentence dividing method based on sound spectrogram, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN114512118A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116916085A (en) * 2023-09-12 2023-10-20 飞狐信息技术(天津)有限公司 End-to-end caption generating method and device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN110263322B (en) Audio corpus screening method and device for speech recognition and computer equipment
CN109599093B (en) Intelligent quality inspection keyword detection method, device and equipment and readable storage medium
CN109817198B (en) Speech synthesis method, apparatus and storage medium
CN101326572B (en) Speech recognition system with huge vocabulary
CN111785275A (en) Voice recognition method and device
CN102831891A (en) Processing method and system for voice data
WO2022100692A1 (en) Human voice audio recording method and apparatus
EP3489951B1 (en) Voice dialogue apparatus, voice dialogue method, and program
CN106303695A (en) Audio translation multiple language characters processing method and system
CN110265000A (en) A method of realizing Rapid Speech writing record
CN114120985A (en) Pacifying interaction method, system and equipment of intelligent voice terminal and storage medium
CN114512118A (en) Intelligent sentence dividing method based on sound spectrogram, computer device and storage medium
CN112466287B (en) Voice segmentation method, device and computer readable storage medium
CN113539243A (en) Training method of voice classification model, voice classification method and related device
CN111354354A (en) Training method and device based on semantic recognition and terminal equipment
CN112185341A (en) Dubbing method, apparatus, device and storage medium based on speech synthesis
Cahyaningtyas et al. Development of under-resourced Bahasa Indonesia speech corpus
CN114203160A (en) Method, device and equipment for generating sample data set
CN109559752B (en) Speech recognition method and device
JPS6138479B2 (en)
Santos et al. CORAA NURCSP Minimal Corpus: a manually annotated corpus of Brazilian Portuguese spontaneous speech
CN114512122A (en) Acoustic model training method, speech recognition algorithm, storage medium, and electronic device
US20200013428A1 (en) Emotion estimation system and non-transitory computer readable medium
CN113539247A (en) Voice data processing method, device, equipment and computer readable storage medium
JP2813209B2 (en) Large vocabulary speech recognition device

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination