KR20170100705A - Apparatus and computer program stored in computer-readable medium for improving of voice recognition performance - Google Patents

Apparatus and computer program stored in computer-readable medium for improving of voice recognition performance Download PDF

Info

Publication number
KR20170100705A
Authority
KR
South Korea
Prior art keywords
segment
frame
speaker model
speech
voice
Prior art date
Application number
KR1020160022510A
Other languages
Korean (ko)
Other versions
KR101780932B1 (en)
Inventor
이강규
이항섭
윤재선
한명수
김수환
금명철
Original Assignee
SELVAS AI Inc. (주식회사 셀바스에이아이)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SELVAS AI Inc. (주식회사 셀바스에이아이)
Priority to KR1020160022510A priority Critical patent/KR101780932B1/en
Publication of KR20170100705A publication Critical patent/KR20170100705A/en
Application granted granted Critical
Publication of KR101780932B1 publication Critical patent/KR101780932B1/en

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/04: Segmentation; Word boundary detection
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065: Adaptation
    • G10L15/07: Adaptation to the speaker
    • G10L15/08: Speech classification or search
    • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Telephone Function (AREA)

Abstract

According to an embodiment of the present invention, a computer program for improving voice recognition performance is disclosed, stored in a computer-readable medium and including instructions executable by at least one processor that cause the processor to perform the following operations. The operations include: receiving voice data; generating at least one voice segment having a start point and an end point by segmenting the received voice data with a voice region detection algorithm; generating, using a speaker recognition algorithm, a segment speaker model matched with each voice segment and a frame speaker model matched with at least one frame related to the voice segment; determining the similarity between the segment speaker model and the frame speaker model; and re-segmenting the voice segments based on the determined similarity.

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to speech recognition technology, and more particularly, to a computer program and an apparatus for improving speech recognition performance.

Voice is the most universal and convenient means of information delivery used by humans. Speech plays an important role not only as a means of communication between human beings, but also as a means of operating machines and devices by voice. Recently, speech recognition technology has advanced with the development of computer performance, various media, and signal and information processing technology.

Speech recognition is a technique by which a computer analyzes or understands a human voice. A human voice carries specific frequency characteristics that vary with mouth shape and tongue position according to pronunciation; speech recognition technology converts the uttered speech into an electrical signal, extracts the frequency characteristics of the speech signal, and recognizes the pronunciation.

U.S. Pat. No. 4,867,778 discloses a device that searches for a valid speech recognition result by presenting the most efficient and effective alternative retrieved, and by suggesting subsequent alternatives when a presented alternative proves invalid.

When a speech signal is received, only the portion actually uttered by the speaker should be detected; this speech detection step greatly affects speech recognition performance. Most real speech recognition environments are acoustically poor due to ambient noise and the like, so the detected region often includes noise.

Accordingly, there is a demand in the art for increasing the voice recognition rate.

There is also a need in the art for accurate voice segment detection.

The present invention has been devised in response to the above-described background art, and is intended to detect an accurate voice interval and improve the voice recognition rate.

According to an embodiment of the present invention, to solve the foregoing problems, a computer program for improving speech recognition performance is disclosed, executable by one or more processors and comprising instructions that cause the one or more processors to perform the following operations: receiving voice data; segmenting the received speech data using a speech region detection algorithm to generate one or more speech segments each having a starting point and an ending point; generating, using a speaker recognition algorithm, a segment speaker model corresponding to each of the speech segments and a frame speaker model corresponding to each of one or more frames associated with the speech segment; determining a degree of similarity between the segment speaker model and the frame speaker model; and performing re-segmentation of the speech segment based on the determined similarity.

Also disclosed is an apparatus according to an embodiment of the present invention. The apparatus comprises: an input unit for receiving voice data; a voice segment generation unit for segmenting the received voice data using a voice region detection algorithm to generate one or more voice segments each having a start point and an end point; a speaker model generation unit for generating, using a speaker recognition algorithm, a segment speaker model corresponding to each of the speech segments and a frame speaker model corresponding to each of one or more frames associated with the speech segment; a similarity determination unit for determining the similarity between the segment speaker model and the frame speaker model; and a re-segmentation processing unit for performing re-segmentation of the speech segment based on the determined similarity.

According to an embodiment of the present invention, accurate voice intervals can be detected and the voice recognition rate can be improved.

Various aspects are now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following examples, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more aspects. However, it will be apparent that such aspect(s) may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing one or more aspects.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram for explaining a problem of conventional speech recognition to be solved by the present invention.
FIG. 2 shows a flow chart of a program according to an embodiment of the present invention.
FIG. 3 shows a block diagram of an apparatus according to an embodiment of the present invention.
FIG. 4 illustrates one or more voice segments generated in accordance with an embodiment of the present invention.
FIG. 5 is a diagram for explaining a first embodiment of a program according to an embodiment of the present invention.
FIG. 6 illustrates a re-segmented speech segment according to the first embodiment of a program according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining a second embodiment of a program according to an embodiment of the present invention.
FIG. 8 illustrates a second re-segmented speech segment according to the second embodiment of a program according to an embodiment of the present invention.
FIG. 9 shows a speech segment that can be detected according to an embodiment of the present invention and a speech segment detected by a conventional technique.
FIG. 10 is a diagram for explaining a first embodiment of a program according to another embodiment of the present invention.
FIG. 11 shows a re-segmented speech segment according to the first embodiment of a program according to another embodiment of the present invention.
FIG. 12 is a diagram for explaining a second embodiment of a program according to another embodiment of the present invention.
FIG. 13 shows a second re-segmented speech segment according to the second embodiment of a program according to another embodiment of the present invention.
FIG. 14 shows another speech segment that can be detected according to an embodiment of the present invention and another speech segment detected by a conventional technique.

Various embodiments and/or aspects are now described with reference to the drawings. In the following description, numerous specific details are set forth for purposes of explanation in order to provide a thorough understanding of one or more aspects. It will, however, be appreciated by those of ordinary skill in the art that such aspect(s) may be practiced without these specific details. The following description and the annexed drawings set forth certain illustrative aspects in detail; these aspects are indicative of only a few of the various ways in which the principles described herein may be practiced, and the description is intended to include all such aspects and their equivalents.

As used herein, terms such as "embodiment," "example," "aspect," and the like are not to be construed as indicating that any aspect or design described is better than, or advantageous over, other aspects or designs.

In addition, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or." That is, unless otherwise specified or unclear from context, "X uses A or B" is intended to mean any of the natural inclusive permutations: X uses A; X uses B; or X uses both A and B. It should also be understood that the term "and/or" as used herein refers to and includes all possible combinations of one or more of the listed related items.

It is also to be understood that the terms "comprises" and/or "comprising" mean that the stated features and/or components are present, but do not exclude the presence or addition of one or more other features, components, and/or groups thereof. Also, unless the context clearly dictates otherwise, the singular forms in this specification and claims should generally be construed to mean "one or more."

FIG. 1 is a diagram for explaining a problem of conventional speech recognition to be solved by the present invention.

With reference to FIG. 1(a), assume that the received voice data contains at least one desired voice segment, here a first voice segment 10 and a second voice segment 20. That is, the first voice segment 10 and the second voice segment 20 are the segments that should ultimately be detected using a voice region detection algorithm.

FIGS. 1(b) and 1(c) show speech segments in which the starting point is detected incorrectly. Although FIGS. 1(b) and 1(c) illustrate cases where the starting point is incorrectly detected and the voice segment is therefore erroneously segmented, the scope of the present invention is not limited thereto. As described above, the present invention aims to detect an accurate voice segment; that is, to segment the received voice data into voice segments each having an accurate starting point and/or ending point.

Referring to FIGS. 1(a) and 1(b), in the case of the second speech segment 20a, the starting point was detected earlier than that of the desired second speech segment 20. That is, the second speech segment 20a includes an extra region (a) compared to the desired second speech segment 20, which may contain, for example, noise or filler sounds. More specifically, the region (a) may correspond to "um ...". This type of error is referred to as an "insertion error."

Referring to FIGS. 1(a) and 1(c), in the case of the second speech segment 20b, the starting point was detected later than that of the desired second speech segment 20. That is, the second speech segment 20b omits a certain region (b) compared to the desired second speech segment 20. For example, if the desired second speech segment 20 corresponds to "I", the speech region corresponding to (b) is not included in the second speech segment 20b. This type of error is referred to as a "removal error" or a "deletion error."

Such insertion errors and / or removal errors may have a negative impact on correct speech recognition.

According to the present invention described below, an accurate voice interval can be detected, so that the voice recognition rate can be improved. That is, according to the present invention, an accurate starting point and / or ending point can be detected.

Hereinafter, a method of accurately detecting a start point and / or an end point and generating a desired speech segment according to an embodiment of the present invention will be described.

FIG. 2 shows a flow chart of a program according to an embodiment of the present invention.

The steps shown in FIG. 2 may be performed by the device 300 (see FIG. 3). For example, the method shown in FIG. 2 may be performed by the hardware or the OS of the device itself; that is, some or all of the steps shown in FIG. 2 may be computed or generated by the device 300. Alternatively, some or all of the steps shown in FIG. 2 may be implemented by a computer program stored in the device 300, executable by one or more processors and including instructions that cause the one or more processors to perform the operations. Optionally or alternatively, some or all of the steps shown in FIG. 2 may be computed or generated by a server that receives information computed or generated by the device 300.

A computer program stored in a computer-readable medium according to an embodiment of the present invention may be executable by one or more processors and may include instructions that cause the one or more processors to perform the following operations, shown in FIG. 2.

The program according to an embodiment of the present invention may include an operation (S110) of receiving voice data. The voice data may be received, for example, by the input unit 310 of the device 300 (see FIG. 3), described below. Additionally or alternatively, the voice data may be received at the device 300 and sent to a server.

The program according to an embodiment of the present invention may include an operation (S120) of generating one or more voice segments from the received voice data, after the operation (S110) of receiving voice data. One or more voice segments generated by the voice segment generation operation (S120) are shown in FIG. 4. The voice segment generation operation (S120) may be performed, for example, by the voice segment generation unit 320 of the device 300 (see FIG. 3).

More specifically, one or more voice segments are generated by segmenting the received voice data using a voice region detection algorithm. Each voice segment has a start point and an end point. It will be appreciated by those skilled in the art that the start point is the point at which the speech segment begins and the end point is the point at which it ends.

The speech region detection algorithm according to an embodiment of the present invention may be an end-point detection (EPD) algorithm based on at least one of a rule-based method and a machine learning method. EPD is employed to find the starting and ending points of a speech region.

The rule-based method may be based on at least one of frame energy, zero-crossing rate, energy entropy, TEO energy, and Mel-scale filter bank features. The machine learning method may be based on at least one of a Gaussian Mixture Model (GMM), a Hidden Markov Model (HMM), a Support Vector Machine (SVM), and a Deep Neural Net (DNN). These are exemplary algorithms for detecting a speech region from speech data, and the scope of the present invention is not limited thereto. A minimal sketch of such a rule-based detector follows.
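
For illustration only, a minimal rule-based end-point detector in the spirit described above might look like the following Python sketch. It is not the patent's reference implementation; the frame length, hop size, and the energy and zero-crossing thresholds are assumed values chosen for readability.

```python
import numpy as np

def detect_speech_segments(signal, sr, frame_ms=25, hop_ms=10,
                           energy_thresh=0.02, zcr_thresh=0.35):
    """Toy rule-based EPD: a frame is speech when its energy is high
    and its zero-crossing rate is low; consecutive speech frames are
    merged into (start, end) segments given in seconds. All threshold
    values are illustrative assumptions, not taken from the patent."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    segments, seg_start = [], None

    for i in range(0, len(signal) - frame_len, hop):
        frame = signal[i:i + frame_len]
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
        is_speech = energy > energy_thresh and zcr < zcr_thresh

        if is_speech and seg_start is None:
            seg_start = i / sr                    # starting point found
        elif not is_speech and seg_start is not None:
            segments.append((seg_start, i / sr))  # ending point found
            seg_start = None

    if seg_start is not None:                     # signal ends mid-speech
        segments.append((seg_start, len(signal) / sr))
    return segments
```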

As described above with reference to FIG. 1, if the speech region is not accurate, that is, if the start point and/or end point of the speech region is not accurately detected, the speech recognition rate will be degraded no matter how good the speech recognition engine is.

According to the operations (S130, S140, and S150) described below, the present invention can detect an accurate starting point and ending point of a speech region, thereby improving the speech recognition rate. Hereinafter, operations S130 to S150 are described in turn.

The program according to an embodiment of the present invention may include an operation (S130) of generating a segmented speaker model and a frame speaker model.

More specifically, using the speaker recognition algorithm, a segment speaker model corresponding to each of the voice segments can be generated. In addition, a frame-speaker model corresponding to each of one or more frames associated with the speech segment may be generated.

Here, the one or more frames associated with the voice segment may be, for example, one or more frames related to the starting point of the voice segment. Alternatively, one or more frames associated with the voice segment may be one or more frames associated with the endpoint of the voice segment.

More specifically, the one or more frames associated with the voice segment may include a first frame having a first section extending from the starting point toward the outer region of the voice segment, a second frame having a second section extending from the starting point toward the inner region of the voice segment, a third frame having a third section extending from the ending point toward the outer region of the voice segment, and a fourth frame having a fourth section extending from the ending point toward the inner region of the voice segment.

The first, second, third, and fourth sections may all have the same length. Alternatively, at least one of the first, second, third, and fourth sections may have a different length, and the scope of the present invention is not limited thereto. A sketch of how such boundary frames could be computed follows.
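
As a sketch only, the four boundary frames described above can be derived from a segment's start and end points as follows; the section lengths are assumptions (equal by default), since the patent allows the sections to be equal or different.

```python
def boundary_frames(start, end, outer=0.2, inner=0.2):
    """Return the four analysis frames (in seconds) around a segment.

    first  : section extending outward from the starting point
    second : section extending inward from the starting point
    third  : section extending outward from the ending point
    fourth : section extending inward from the ending point
    The 0.2 s section lengths are illustrative assumptions."""
    return {
        "first":  (start - outer, start),
        "second": (start, start + inner),
        "third":  (end, end + outer),
        "fourth": (end - inner, end),
    }
```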

The speaker recognition algorithm may employ, for example, at least one of GMM, HMM, DNN, and i-vector approaches, but the scope of the present invention is not limited thereto.

The speaker model may be generated by applying a pre-stored algorithm to a Universal Background Model (UBM). Here, the pre-stored algorithm may include at least one of MAP, MLLR, and Eigenvoice methods, but various algorithms not described above may also be employed to generate the speaker model.
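
As one concrete possibility, mean-only MAP adaptation of a GMM-UBM, one of the options named above, can be sketched as follows. The relevance factor of 16 is a commonly used value assumed here, and the use of scikit-learn's GaussianMixture as the UBM container is an implementation choice, not something specified by the patent.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, feats, relevance=16.0):
    """Mean-only MAP adaptation of a fitted GMM-UBM (sketch).

    feats: (n_frames, n_dims) feature vectors (e.g., MFCCs) from one
    speech segment or one boundary frame."""
    post = ubm.predict_proba(feats)               # (n_frames, n_components)
    n_k = post.sum(axis=0)                        # soft frame counts
    # Posterior-weighted mean of the data for each component.
    ex_k = (post.T @ feats) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]    # adaptation coefficients

    adapted = GaussianMixture(n_components=ubm.n_components,
                              covariance_type=ubm.covariance_type)
    adapted.weights_ = ubm.weights_               # weights kept from UBM
    adapted.covariances_ = ubm.covariances_       # covariances kept from UBM
    adapted.precisions_cholesky_ = ubm.precisions_cholesky_
    adapted.means_ = alpha * ex_k + (1 - alpha) * ubm.means_
    return adapted
```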

The operation S130 as described above can be performed, for example, by the speaker model generation unit 330 of the device 300 (see FIG. 3).

After the operation (S130) of generating the segment speaker model and the frame speaker model, the program according to an embodiment of the present invention may perform an operation (S140) of determining the degree of similarity between the segment speaker model and the frame speaker model. To determine the similarity according to an embodiment of the present invention, for example, a probability value may be calculated based on extracted feature vectors, but the scope of the present invention is not limited thereto. In other words, those skilled in the art will appreciate that known algorithms can be employed to measure the similarity between different speaker models. One such scoring scheme is sketched below.
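
One known scoring scheme, assumed here purely for illustration, is an average log-likelihood comparison: score the frame's feature vectors under both adapted models and take the difference.

```python
def model_similarity(segment_model, frame_model, frame_feats):
    """Similarity between a segment speaker model and a frame speaker
    model as a per-frame log-likelihood ratio (illustrative sketch).
    Higher values mean the frame resembles the segment's speaker."""
    ll_segment = segment_model.score_samples(frame_feats).mean()
    ll_frame = frame_model.score_samples(frame_feats).mean()
    return ll_segment - ll_frame
```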

That is, according to an embodiment of the present invention, voice data is received (S110) and segmented into one or more voice segments (S120); a speaker model is generated for each voice segment, and a speaker model is generated for each associated frame (S130). Thereafter, the similarity between the generated segment speaker model and frame speaker model can be determined (S140).

Re-segmentation of the speech segment may be performed based on the similarity determined by the above-described operations (S150). That is, the re-segmentation operation (S150) can re-detect the start and / or end points of the speech segment, thereby improving the speech recognition rate.

In more detail, the operation (S150) of performing re-segmentation may include determining whether the frame speaker model and the segment speaker model are identical by comparing the similarity with a predetermined threshold value.

For example, when the similarity is equal to or greater than the predetermined threshold value, the frame speaker model and the segment speaker model may be determined to be the same. Conversely, when the similarity is less than the predetermined threshold value, the frame speaker model and the segment speaker model may be determined not to be the same.

In addition, when the frame speaker model and the segment speaker model are determined to be the same, re-segmentation can be performed so that the voice segment includes the frame. When they are determined not to be the same, re-segmentation can be performed so that the voice segment does not include the frame. This decision logic is sketched below.
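
The threshold test and the resulting boundary adjustment can be sketched as below, using the boundary_frames naming introduced earlier; the threshold of 0.0 is an assumed value, not one given in the patent.

```python
def resegment(segment, frames, similarities, threshold=0.0):
    """Adjust a (start, end) segment from per-frame similarity scores.

    If a boundary frame's speaker model matches the segment's model
    (similarity >= threshold), the segment grows to include the frame;
    otherwise the frame is excluded."""
    start, end = segment
    if similarities["first"] >= threshold:    # same speaker before start
        start = frames["first"][0]            # grow segment backward
    if similarities["second"] < threshold:    # different speaker after start
        start = frames["second"][1]           # trim segment forward
    if similarities["third"] >= threshold:    # same speaker after end
        end = frames["third"][1]              # grow segment forward
    if similarities["fourth"] < threshold:    # different speaker before end
        end = frames["fourth"][0]             # trim segment backward
    return (start, end)
```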

Additionally, for the re-segmented voice segment, an operation of generating a frame speaker model corresponding to each of one or more frames associated with the re-segmented voice segment may be performed, followed by an operation of determining the similarity between the segment speaker model and the frame speaker model, and an operation of performing a second re-segmentation of the re-segmented voice segment based on the determined similarity. That is, the re-segmented voice segment may itself be re-segmented again; see FIGS. 5 to 14 in this regard. The whole iteration can be expressed as the loop sketched below.
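
Building on the sketches above, the re-segmentation and second re-segmentation can be expressed as a loop that rebuilds the models, rescores the boundary frames, and adjusts the boundaries on each pass, with smaller frames on later passes as FIG. 7 suggests. The feats_for helper, the pass count, and the shrink factor are all assumptions for illustration.

```python
def iterative_resegment(segment, feats_for, ubm, passes=2,
                        outer=0.2, inner=0.2, shrink=0.5):
    """Repeat re-segmentation on the updated segment (sketch).

    feats_for(interval) is an assumed helper that extracts the feature
    vectors for a (start, end) time interval of the input audio."""
    for _ in range(passes):
        seg_model = map_adapt_means(ubm, feats_for(segment))
        frames = boundary_frames(*segment, outer=outer, inner=inner)
        sims = {}
        for name, interval in frames.items():
            frame_model = map_adapt_means(ubm, feats_for(interval))
            sims[name] = model_similarity(seg_model, frame_model,
                                          feats_for(interval))
        segment = resegment(segment, frames, sims)
        outer *= shrink    # smaller boundary frames on the next pass
        inner *= shrink
    return segment
```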

Through the operations described above, the speech segment generated by operation S120 can be re-segmented to resolve insertion errors and/or deletion errors. This is described later with reference to FIGS. 4 to 14.

The operation S150 for performing the re-segmentation as described above may be performed by the re-segment processor 350 of the apparatus 300 (see FIG. 3).

Although not shown, an operation of removing at least one of noise, interjections (filler sounds), and background noise from the re-segmented voice segment may additionally be performed.

Some of the operations shown may be omitted in accordance with an embodiment of the present invention. Further, the operations shown in FIG. 2 are exemplary, and additional operations may also be included within the scope of the present invention.

FIG. 3 shows a block diagram of an apparatus according to an embodiment of the present invention.

The device 300 according to the present invention may be a cell phone, tablet PC, wearable device, laptop, stick PC, PMP, MP3 player, or any other device having speech recognition capability, including types not described above.

The device 300 includes an input unit 310, a voice segment generation unit 320, a speaker model generation unit 330, a similarity determination unit 340, a re-segmentation processing unit 350, and a memory unit 360.

The input unit 310 receives a voice signal uttered by an arbitrary speaker. The input unit 310 may be, for example, a microphone included in the device 300, and the scope of the present invention is not limited thereto.

The input unit 310 may further include a module such as a filter to remove noise included in the input voice signal.

The speech segment generation unit 320 may generate one or more speech segments each having a start point and an end point by segmenting the received speech data using a speech region detection algorithm.

The speaker model generation unit 330 may generate, using a speaker recognition algorithm, a segment speaker model corresponding to each of the voice segments and a frame speaker model corresponding to each of one or more frames associated with the voice segment.

The similarity determination unit 340 may determine the similarity between the segmented speaker model and the frame speaker model.

In addition, the re-segmentation processing unit 350 may perform re-segmentation of the speech segment based on the determined similarity.

The memory unit 360 according to an embodiment of the present invention may store software code executable by at least one of the input unit 310, the voice segment generation unit 320, the speaker model generation unit 330, the similarity determination unit 340, and the re-segmentation processing unit 350. In addition, the memory unit 360 may store all kinds of software code for operating a program according to an embodiment of the present invention.

For example, the memory unit 360 may store a computer program executable by one or more processors and comprising instructions that cause the one or more processors to perform the following operations: receiving voice data; segmenting the received speech data using a speech region detection algorithm to generate one or more speech segments each having a starting point and an ending point; generating, using a speaker recognition algorithm, a segment speaker model corresponding to each of the speech segments and a frame speaker model corresponding to each of one or more frames associated with the speech segment; determining a degree of similarity between the segment speaker model and the frame speaker model; and performing re-segmentation of the speech segment based on the determined similarity.

The memory unit 360 may store various information for executing a program according to an embodiment of the present invention.

For example, the memory unit 360 may store a speech region detection algorithm and a speaker recognition algorithm. Also, the memory unit 360 may store at least one of a speech model, a speaker model, and a language model. The information that the memory unit 360 can store is not limited to the above-described contents. In addition, the information as described above can be stored for each voice segment, and the scope of rights of the present invention is not limited thereto.

The memory 360 may store data for operation of the device 300, and temporarily store input / output data.

In a further aspect of the present invention, the memory unit 360 may store various data needed by the device 300 and may provide the requested data upon request from other components.

The memory unit 360 may include at least one type of storage medium among a flash memory type, hard disk type, multimedia card micro type, card type memory (for example, SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, and optical disk. The program according to an embodiment of the present invention may also operate in association with a web storage that performs the storage function of the memory unit 360 over the Internet.

To this end, the device 300 may include a communication unit. The communication unit may include a wired / wireless communication module for network connection.

WLAN (Wi-Fi), WiBro (Wireless Broadband), WiMAX (World Interoperability for Microwave Access), HSDPA (High Speed Downlink Packet Access), and the like can be used as wireless Internet technologies. Wired Internet technologies include xDSL (Digital Subscriber Line), FTTH (Fiber To The Home), and PLC (Power Line Communication).

The communication unit may include a short-range communication module and may transmit and receive data to and from an electronic device that also includes a short-range communication module. Bluetooth, Radio Frequency Identification (RFID), Infrared Data Association (IrDA), Ultra Wideband (UWB), ZigBee, and the like can be used as short-range communication technologies. The communication technologies described above are merely examples, and the scope of the present invention is not limited thereto.

In an aspect of the present invention, data transmitted and/or received via the communication unit may be stored in the memory unit 360 or transmitted to other nearby devices via the short-range communication module.

In accordance with a further embodiment of the present invention, the device 300 may further include components not shown. For example, the device 300 may include a display unit. The display unit may display all of the information that can be displayed on the device 300. The display unit may include at least one of a liquid crystal display (LCD), a thin film transistor-liquid crystal display (TFT LCD), an organic light-emitting diode (OLED) display, a flexible display, and a three-dimensional (3D) display.

Some of these displays may be configured to be transparent or light-transmissive so that the outside can be seen through them. Such a display may be referred to as a transparent display, a typical example of which is the transparent OLED (TOLED).

The various embodiments described herein may be embodied in a recording medium or storage medium readable by a computer or similar device using, for example, software, hardware, or a combination thereof.

For example, according to a hardware implementation, the embodiments described herein may be implemented using at least one of application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and electrical units for performing other functions. In some cases, the embodiments described herein may be implemented by the control unit itself.

In another example, according to a software implementation, embodiments such as the procedures and functions described herein may be implemented with separate software modules. Each of the software modules may perform one or more of the functions and operations described herein. Software code may be implemented in a software application written in a suitable programming language. The software code is stored in the memory unit 360 and can be executed by the control unit.

FIG. 4 illustrates one or more voice segments generated in accordance with an embodiment of the present invention.

By segmenting the received speech data using the speech region detection algorithm, one or more speech segments 100 and 200, each having a starting point and an ending point, can be generated.

FIG. 5 is a diagram for explaining a first embodiment of a program according to an embodiment of the present invention.

Figure 6 illustrates a re-segmented speech segment according to a first embodiment of a program according to an embodiment of the present invention.

FIG. 5 shows a voice segment 100 and one or more frames 101, 102, 103, and 104 associated with the voice segment 100.

More specifically, FIG. 5 shows a first frame 101 having a first section extending from the starting point toward the outer region of the voice segment, a second frame 102 having a second section extending from the starting point toward the inner region of the voice segment, a third frame 103 having a third section extending from the ending point toward the outer region of the voice segment, and a fourth frame 104 having a fourth section extending from the ending point toward the inner region of the voice segment.

The first to fourth frames described above may be determined to have the same and/or different sizes. In addition, the sizes of the frames may be set based on the distance from an adjacent segment, and the scope of the present invention is not limited thereto.

As described above, using the speaker recognition algorithm, a segment speaker model corresponding to the speech segment 100 can be generated.

Also, a frame speaker model corresponding to each of one or more frames (here, 101, 102, 103, and 104) associated with the speech segment may be generated.

According to one embodiment of the present invention, the degree of similarity between the segmented speaker model and the frame speaker model can be determined. The re-segmentation of the speech segment is performed based on the determined similarity.

In more detail, when it is determined that the frame speaker model and the segment speaker model are the same, re-segmentation is performed such that the speech segment includes the frame. When they are determined not to be the same, re-segmentation is performed such that the speech segment does not include the frame.

Referring to FIGS. 5 and 6, it is determined that the segment speaker model of the voice segment 100 and the frame speaker model of the frame 101 are not the same, so the voice segment 100 does not include the frame 101. Likewise, it is determined that the segment speaker model of the voice segment 100 and the frame speaker model of the frame 102 are not the same, so the voice segment 100 does not include the frame 102.

Also referring to FIGS. 5 and 6, it has been determined that the segment speaker model of the voice segment 100, the frame speaker model of the frame 103, and the frame speaker model of the frame 104 are the same. Thus, the voice segment 100 includes the frame 103 and the frame 104.

The process described above is referred to as re-segmentation, and FIG. 6 shows the re-segmented voice segment 100a obtained through it.

FIG. 7 is a diagram for explaining a second embodiment of a program according to an embodiment of the present invention.

Figure 8 illustrates a second re-segmented speech segment according to a second embodiment of the program according to an embodiment of the present invention.

An additional re-segmentation (e.g., a second re-segmentation) may be performed on the re-segmented speech segment 100a.

In more detail, a frame speaker model corresponding to each of one or more frames associated with the re-segmented voice segment is generated, and the similarity between the segment speaker model and the frame speaker model can be determined. A second re-segmentation of the re-segmented speech segment may then be performed based on the determined similarity.

Figure 7 shows a re-segmented speech segment 100a and one or more frames 111, 112 and 113 associated with the re-segmented speech segment 100a.

More specifically, FIG. 7 shows a first frame 111 having a first section extending from the starting point toward the inner region of the re-segmented voice segment 100a, a second frame 112 having a second section extending from the ending point toward the outer region of the re-segmented voice segment 100a, and a third frame 113 having a third section extending from the ending point toward the inner region of the re-segmented voice segment 100a.

When the re-segmented voice segment is segmented again, the results of the previous re-segmentation may be referenced. For example, referring again to FIGS. 5 and 6: in FIG. 5, both the first frame 101, whose first section extends outward from the starting point of the voice segment 100, and the second frame 102, whose second section extends inward from the starting point, were removed. Therefore, in the second re-segmentation, the only frame that needs to be examined at the starting point of the re-segmented voice segment 100a is the first frame 111, whose section extends toward the inner region.

In addition, the size of the frame associated with the re-segmented voice segment when the second re-segmentation is performed may differ from the size of the frame associated with the voice segment when re-segmentation is performed. As shown, the size of the frame associated with the re-segmented voice segment may be set to be less than the size of the frame associated with the voice segment, but the scope of the rights of the present invention is not limited thereto.

As described above, using the speaker recognition algorithm, a segment speaker model corresponding to the re-segmented voice segment 100a can be generated.

Also, a frame-speaker model corresponding to each of one or more frames (here, 111, 112 and 113) associated with the re-segmented voice segment 100a may be generated.

According to one embodiment of the present invention, the degree of similarity between the segmented speaker model and the frame speaker model can be determined. A second re-segmentation of the speech segment is performed based on the determined similarity.

Referring to FIGS. 7 and 8, it is determined that the segment speaker model of the re-segmented voice segment 100a and the frame speaker model of the frame 111 are not the same, so the re-segmented voice segment 100a does not include the frame 111. Likewise, the segment speaker model of the re-segmented voice segment 100a and the frame speaker model of the frame 112 are determined not to be the same, so the re-segmented voice segment 100a does not include the frame 112. The segment speaker model of the re-segmented voice segment 100a and the frame speaker model of the frame 113 are also determined not to be the same, so the re-segmented voice segment 100a does not include the frame 113.

The process described above is referred to as second re-segmentation, and FIG. 8 shows the second re-segmented speech segment 100b obtained through it. Compared with the re-segmented speech segment 100a shown in FIG. 6, it can be confirmed that the insertion error and/or deletion error has been resolved.

FIG. 9 shows a speech segment that can be detected according to an embodiment of the present invention and a speech segment detected by a conventional technique.

Referring to FIG. 9, there is shown a voice segment 100 detected by a conventional technique and a voice segment 100c that can be detected according to an embodiment of the present invention.

Referring to FIG. 9, according to the present invention, it is possible to overcome deletion errors and to remove noise, interjections, and silence periods. Through this, a more accurate speech region can be detected and the speech recognition rate can be improved. It will be apparent to those skilled in the art that the effects of the present invention are not limited to those described above.

FIG. 10 is a diagram for explaining a first embodiment of a program according to another embodiment of the present invention.

Figure 11 shows a re-segmented speech segment according to a first embodiment of a program according to another embodiment of the present invention.

Figure 10 shows a speech segment 200 and one or more frames 201, 202, 203 and 204 associated with the speech segment 200.

More specifically, FIG. 10 shows a first frame 201 having a first section extending from the starting point toward the outer region of the voice segment, a second frame 202 having a second section extending from the starting point toward the inner region of the voice segment, a third frame 203 having a third section extending from the ending point toward the outer region of the voice segment, and a fourth frame 204 having a fourth section extending from the ending point toward the inner region of the voice segment.

As described above, using the speaker recognition algorithm, a segment speaker model corresponding to the speech segment 200 can be generated.

In addition, a frame-speaker model corresponding to each of one or more frames (here, 201, 202, 203, and 204) associated with the speech segment may be generated.

According to one embodiment of the present invention, the degree of similarity between the segmented speaker model and the frame speaker model can be determined. The re-segmentation of the speech segment is performed based on the determined similarity.

The segment speaker model of the speech segment 200 and the frame speaker model of the frame 201 are determined not to be the same, so the speech segment 200 does not include the frame 201. In addition, it is determined that the segment speaker model of the speech segment 200 and the frame speaker model of the frame 202 are the same, so the speech segment 200 includes the frame 202. It has also been determined that the segment speaker model of the speech segment 200, the frame speaker model of the frame 203, and the frame speaker model of the frame 204 are the same. Thus, the speech segment 200 includes the frame 203 and the frame 204.

The process described above is referred to as re-segmentation, and FIG. 11 shows the re-segmented speech segment 200a obtained through it.

FIG. 12 is a diagram for explaining a second embodiment of a program according to another embodiment of the present invention.

Figure 13 shows a second re-segmented speech segment according to a second embodiment of the program according to another embodiment of the present invention.

In more detail, a frame speaker model corresponding to each of one or more frames associated with the re-segmented voice segment is generated, and the similarity between the segment speaker model and the frame speaker model can be determined. A second re-segmentation of the re-segmented speech segment may then be performed based on the determined similarity.

FIG. 12 shows a re-segmented speech segment 200a and one or more frames 211, 212, and 213 associated with the re-segmented speech segment 200a.

The segment speaker model of the re-segmented speech segment 200a and the frame speaker model of the frame 211 are determined not to be the same, so the re-segmented speech segment 200a does not include the frame 211. Likewise, the segment speaker model of the re-segmented speech segment 200a and the frame speaker model of the frame 212 are determined not to be the same, so the re-segmented speech segment 200a does not include the frame 212. The segment speaker model of the re-segmented speech segment 200a and the frame speaker model of the frame 213 are also determined not to be the same, so the re-segmented speech segment 200a does not include the frame 213.

The process described above is referred to as second re-segmentation, and FIG. 13 shows the second re-segmented speech segment 200b obtained through it.

Figure 14 illustrates another voice segment that may be detected in accordance with one embodiment of the present invention and another voice segment detected by a conventional technique.

Referring to FIG. 14, there is shown another speech segment 200 detected by a conventional technique and another speech segment 200c that may be detected in accordance with an embodiment of the present invention.

Referring to FIG. 14, according to the present invention, an insertion error can be overcome.

Through this, a more accurate speech region can be detected and the speech recognition rate can be improved. It will be apparent to those skilled in the art that the effects of the present invention are not limited to those described above.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention as defined in the appended claims. It is therefore intended that the present invention cover such modifications and variations of this invention.

Those of ordinary skill in the art will understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

Those skilled in the art will appreciate that the various illustrative logical blocks, modules, processors, means, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software (which may be referred to herein as "software"), or a combination of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the design constraints imposed on the particular application and the overall system. Those skilled in the art may implement the described functionality in various ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The various embodiments presented herein may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" includes a computer program, carrier, or media accessible from any computer-readable device. For example, computer-readable media include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips, etc.), optical disks (e.g., CD, DVD, etc.), smart cards, and flash memory devices (e.g., EEPROM, cards, sticks, key drives, etc.). The various storage media presented herein also include one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" includes, but is not limited to, wireless channels and various other media capable of storing, holding, and/or transferring instruction(s) and/or data.

It will be appreciated that the particular order or hierarchy of steps in the presented processes is an example of exemplary approaches. It will be appreciated that, based on design priorities, certain orders or hierarchies of steps in processes may be rearranged within the scope of the present invention. The appended method claims provide elements of the various steps in a sample order, but are not meant to be limited to the specific order or hierarchy presented.

The description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features presented herein.


Claims (10)

1. A computer program for enhancing speech recognition performance, comprising instructions executable by one or more processors to cause the one or more processors to perform the following operations,
the operations comprising:
receiving voice data;
segmenting the received speech data using a speech region detection algorithm to generate one or more speech segments each having a starting point and an ending point;
generating, using a speaker recognition algorithm, a segment speaker model corresponding to each of the speech segments and a frame speaker model corresponding to each of one or more frames associated with the speech segment;
determining a degree of similarity between the segment speaker model and the frame speaker model; and
performing re-segmentation of the speech segment based on the determined similarity,
A computer program stored on a computer readable medium for enhancing speech recognition performance.
2. The computer program according to claim 1, wherein the operation of performing the re-segmentation comprises:
determining whether the frame speaker model and the segment speaker model are identical by comparing the similarity with a predetermined threshold value,
A computer program stored on a computer readable medium for enhancing speech recognition performance.
3. The computer program according to claim 2, wherein the operation of performing the re-segmentation further comprises:
performing re-segmentation such that the speech segment includes the frame when the frame speaker model and the segment speaker model are determined to be the same; and
performing re-segmentation such that the speech segment does not include the frame when the frame speaker model and the segment speaker model are determined not to be the same,
A computer program stored on a computer readable medium for enhancing speech recognition performance.
4. The computer program according to claim 1, wherein the operations further comprise:
generating a frame speaker model corresponding to each of one or more frames associated with the re-segmented voice segment;
determining a degree of similarity between the segment speaker model and the frame speaker model; and
performing a second re-segmentation of the re-segmented speech segment based on the determined similarity,
A computer program stored on a computer readable medium for enhancing speech recognition performance.
5. The computer program according to claim 1, wherein the one or more frames associated with the voice segment comprise:
a first frame having a first section extending from the starting point toward an outer region of the voice segment, a second frame having a second section extending from the starting point toward an inner region of the voice segment, a third frame having a third section extending from the ending point toward an outer region of the voice segment, and a fourth frame having a fourth section extending from the ending point toward an inner region of the voice segment,
A computer program stored on a computer readable medium for enhancing speech recognition performance.
6. The computer program according to claim 1, wherein the operations further comprise:
removing at least one of noise, interjections, and background noise from the re-segmented speech segment,
A computer program stored on a computer readable medium for enhancing speech recognition performance.
7. The computer program according to claim 1, wherein the speech region detection algorithm is an end-point detection (EPD) algorithm based on at least one of a rule-based method and a machine learning method,
wherein the rule-based method is based on at least one of frame energy, zero-crossing rate, energy entropy, TEO energy, and Mel-scale filter bank features, and
wherein the machine learning method is based on at least one of a Gaussian Mixture Model (GMM), a Hidden Markov Model (HMM), a Support Vector Machine (SVM), and a Deep Neural Net (DNN),
A computer program stored on a computer readable medium for enhancing speech recognition performance.
8. The computer program according to claim 1, wherein the speaker recognition algorithm includes at least one of GMM, HMM, DNN, and i-vector,
A computer program stored on a computer readable medium for enhancing speech recognition performance.
9. The computer program according to claim 1, wherein the speaker model is generated by executing a pre-stored algorithm on a Universal Background Model (UBM), and
wherein the pre-stored algorithm includes at least one of MAP, MLLR, and Eigenvoice methods,
A computer program stored on a computer readable medium for enhancing speech recognition performance.
10. An apparatus, comprising:
an input unit for receiving voice data;
a voice segment generation unit for segmenting the received voice data using a voice region detection algorithm to generate one or more voice segments each having a start point and an end point;
a speaker model generation unit for generating, using a speaker recognition algorithm, a segment speaker model corresponding to each of the speech segments and a frame speaker model corresponding to each of one or more frames associated with the speech segment;
a similarity determination unit for determining a similarity between the segment speaker model and the frame speaker model; and
a re-segmentation processing unit for performing re-segmentation of the speech segment based on the determined similarity.
KR1020160022510A 2016-02-25 2016-02-25 Apparatus and computer program stored in computer-readable medium for improving of voice recognition performance KR101780932B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020160022510A KR101780932B1 (en) 2016-02-25 2016-02-25 Apparatus and computer program stored in computer-readable medium for improving of voice recognition performance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020160022510A KR101780932B1 (en) 2016-02-25 2016-02-25 Apparatus and computer program stored in computer-readable medium for improving of voice recognition performance

Publications (2)

Publication Number Publication Date
KR20170100705A (en) 2017-09-05
KR101780932B1 KR101780932B1 (en) 2017-09-27

Family

ID=59924645

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020160022510A KR101780932B1 (en) 2016-02-25 2016-02-25 Apparatus and computer program stored in computer-readable medium for improving of voice recognition performance

Country Status (1)

Country Link
KR (1) KR101780932B1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190099988A (en) * 2018-02-19 2019-08-28 주식회사 셀바스에이아이 Device for voice recognition using end point detection and method thereof
KR20220075550A (en) * 2020-11-30 2022-06-08 네이버 주식회사 Method, system, and computer program to speaker diarisation using speech activity detection based on spearker embedding

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4572218B2 (en) * 2007-06-27 2010-11-04 日本電信電話株式会社 Music segment detection method, music segment detection device, music segment detection program, and recording medium
JP4964204B2 (en) * 2008-08-27 2012-06-27 日本電信電話株式会社 Multiple signal section estimation device, multiple signal section estimation method, program thereof, and recording medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190099988A (en) * 2018-02-19 2019-08-28 주식회사 셀바스에이아이 Device for voice recognition using end point detection and method thereof
KR20220075550A (en) * 2020-11-30 2022-06-08 네이버 주식회사 Method, system, and computer program to speaker diarisation using speech activity detection based on spearker embedding

Also Published As

Publication number Publication date
KR101780932B1 (en) 2017-09-27

Similar Documents

Publication Publication Date Title
US20210312681A1 (en) Joint audio-video facial animation system
CN114578969B (en) Method, apparatus, device and medium for man-machine interaction
WO2021232594A1 (en) Speech emotion recognition method and apparatus, electronic device, and storage medium
US10540958B2 (en) Neural network training method and apparatus using experience replay sets for recognition
RU2688277C1 (en) Re-speech recognition with external data sources
US11955119B2 (en) Speech recognition method and apparatus
CN108694940A (en) A kind of audio recognition method, device and electronic equipment
CN107526967A (en) A kind of risk Address Recognition method, apparatus and electronic equipment
CN110647675B (en) Method and device for recognition of stop point and training of prediction model and storage medium
CN113344089B (en) Model training method and device and electronic equipment
CN111160004A (en) Method and device for establishing sentence-breaking model
KR101780932B1 (en) Apparatus and computer program stored in computer-readable medium for improving of voice recognition performance
US10733537B2 (en) Ensemble based labeling
US10529337B2 (en) Symbol sequence estimation in speech
CN113658586B (en) Training method of voice recognition model, voice interaction method and device
CN112185382B (en) Method, device, equipment and medium for generating and updating wake-up model
CN111128134A (en) Acoustic model training method, voice awakening method, device and electronic equipment
CN113408273A (en) Entity recognition model training and entity recognition method and device
CN112529159A (en) Network training method and device and electronic equipment
CN111984983A (en) User privacy encryption method
KR20170109728A (en) Apparatus, method and computer program stored on computer-readable medium for recognizing continuous speech
CN115967549A (en) Anti-leakage method based on internal and external network information transmission and related equipment thereof
CN112633381A (en) Audio recognition method and training method of audio recognition model
CN114724090B (en) Training method of pedestrian re-identification model, and pedestrian re-identification method and device
US20230130263A1 (en) Method For Recognizing Abnormal Sleep Audio Clip, Electronic Device

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant