KR20170100705A - Apparatus and computer program stored in computer-readable medium for improving of voice recognition performance - Google Patents
- Publication number
- KR20170100705A (Application number KR1020160022510A)
- Authority
- KR
- South Korea
- Prior art keywords
- segment
- frame
- speaker model
- speech
- voice
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Abstract
Description
BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to speech recognition technology and, more particularly, to a technique for enhancing speech recognition performance.
Voice is the most universal and convenient means of information delivery used by humans. Speech represented by voice plays an important role not only as means of communication between human beings and human beings, but also as a means of operating machines and apparatuses using human voice. Recently, speech recognition technology has been developed due to development of computer performance, development of various media, and development of signal and information processing technology.
Speech recognition technology is a technique by which a computer analyzes or understands a human voice. A voiced sound is produced at specific frequencies as the speaker changes mouth shape and tongue position according to pronunciation; speech recognition converts this voiced speech into an electric signal, extracts the frequency characteristics of the speech signal, and recognizes the pronunciation.
U.S. Pat. No. 4,867,778 discloses a device that searches for a valid speech recognition result by presenting the most efficient and effective alternative retrieved, and by suggesting the following alternatives when a presented alternative proves false.
When a speech signal is received, only the speech portion uttered by the actual speaker should be detected. This speech detection step greatly affects speech recognition performance. The actual speech recognition environment is often very poor because of ambient noise and the like, so noise is frequently included in the detected region.
Accordingly, there is a demand in the art for increasing the voice recognition rate.
There is also a need in the art for accurate voice segment detection.
The present invention has been devised in response to the background art described above, and is intended to detect accurate voice intervals and improve the voice recognition rate.
According to an embodiment of the present invention, to solve the foregoing problems, there is disclosed a computer program for improving speech recognition performance, stored in a computer-readable medium, executable by one or more processors, and comprising instructions for causing the one or more processors to perform the following operations: receiving voice data; segmenting the received voice data using a speech region detection algorithm to generate one or more voice segments, each having a starting point and an ending point; generating, using a speaker recognition algorithm, a segment speaker model corresponding to each voice segment and a frame speaker model corresponding to each of one or more frames associated with the voice segment; determining a degree of similarity between the segment speaker model and the frame speaker model; and performing re-segmentation of the voice segment based on the determined similarity.
Also disclosed is an apparatus according to an embodiment of the present invention. The apparatus comprises: an input unit for receiving voice data; a voice segment generation unit for segmenting the received voice data using a speech region detection algorithm to generate one or more voice segments, each having a starting point and an ending point; a speaker model generation unit for generating, using a speaker recognition algorithm, a segment speaker model corresponding to each voice segment and a frame speaker model corresponding to each of one or more frames associated with the voice segment; a similarity determination unit for determining the similarity between the segment speaker model and the frame speaker model; and a re-segmentation processor for performing re-segmentation of the voice segment based on the determined similarity.
According to an embodiment of the present invention, accurate voice intervals can be detected and the voice recognition rate can be improved.
Various aspects are now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more aspects. However, it will be apparent that such aspect(s) may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing one or more aspects.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram for explaining a problem of conventional speech recognition to be solved by the present invention.
FIG. 2 shows a flow chart of a program according to an embodiment of the present invention.
FIG. 3 shows a block diagram of an apparatus according to an embodiment of the present invention.
FIG. 4 illustrates one or more voice segments generated in accordance with an embodiment of the present invention.
FIG. 5 is a diagram for explaining a first embodiment of a program according to an embodiment of the present invention.
FIG. 6 illustrates a re-segmented voice segment according to the first embodiment of a program according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining a second embodiment of a program according to an embodiment of the present invention.
FIG. 8 illustrates a second re-segmented voice segment according to the second embodiment of a program according to an embodiment of the present invention.
FIG. 9 shows a voice segment that can be detected according to an embodiment of the present invention and a voice segment detected by a conventional technique.
FIG. 10 is a diagram for explaining a first embodiment of a program according to another embodiment of the present invention.
FIG. 11 shows a re-segmented voice segment according to the first embodiment of a program according to another embodiment of the present invention.
FIG. 12 is a diagram for explaining a second embodiment of a program according to another embodiment of the present invention.
FIG. 13 shows a second re-segmented voice segment according to the second embodiment of a program according to another embodiment of the present invention.
FIG. 14 shows a voice segment that can be detected according to an embodiment of the present invention and a voice segment detected by a conventional technique.
Various embodiments and/or aspects are now described with reference to the drawings. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more aspects. However, it will also be appreciated by those of ordinary skill in the art that such aspect(s) may be practiced without these specific details. The following description and the annexed drawings set forth in detail certain illustrative aspects. These aspects are illustrative, however, and represent only some of the various ways in which the principles of the various aspects may be practiced; the description is intended to include all such aspects and their equivalents.
As used herein, the terms "embodiment," "example," "aspect," and the like are not to be construed as indicating that any aspect or design described is better or more advantageous than other aspects or designs.
In addition, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or." That is, unless otherwise specified or unclear from context, "X uses A or B" is intended to mean any of the natural inclusive substitutions: if X uses A, X uses B, or X uses both A and B, then "X uses A or B" applies in any of these cases. It should also be understood that the term "and/or" as used herein refers to and includes all possible combinations of one or more of the listed related items.
It is also to be understood that the terms "comprises" and/or "comprising" mean that the stated features and/or components are present, but do not exclude the presence or addition of one or more other features, components, and/or groups thereof. Also, unless the context clearly dictates otherwise, singular forms in this specification and the claims should generally be construed to mean "one or more."
FIG. 1 is a diagram for explaining a problem of conventional speech recognition to be solved by the present invention.
With reference to FIG. 1(a), it is assumed that there is at least one voice segment; FIG. 1(a) shows the preferred voice segment, i.e., the segment whose starting and ending points are correctly detected.

FIGS. 1(b) and 1(c) show voice segments in which the starting point is detected incorrectly. Although FIGS. 1(b) and 1(c) illustrate cases in which the starting point is incorrectly detected and the voice segment is erroneously segmented, the scope of the present invention is not limited thereto. As described above, the present invention is for detecting an accurate voice segment; that is, the received voice data is segmented into voice segments having an accurate starting point and/or ending point.

Referring to FIGS. 1(a) and 1(b), the starting point is detected earlier than the preferred starting point, so that a non-speech region is included in the voice segment (an insertion error).

Referring to FIGS. 1(a) and 1(c), in the case of the second voice segment 20b, the starting point is detected later than the preferred starting point, so that part of the actual speech is excluded from the voice segment (a removal error).

Such insertion errors and/or removal errors may have a negative impact on correct speech recognition.
According to the present invention described below, an accurate voice interval can be detected, so that the voice recognition rate can be improved. That is, according to the present invention, an accurate starting point and / or ending point can be detected.
Hereinafter, a method of accurately detecting a start point and / or an end point and generating a desired speech segment according to an embodiment of the present invention will be described.
FIG. 2 shows a flow chart of a program according to an embodiment of the present invention.
The steps shown in FIG. 2 may be performed by the device 300 (see FIG. 3). For example, the method shown in FIG. 2 may be performed by the hardware of the apparatus or by the OS itself; that is, some or all of the steps shown in FIG. 2 may be computed or generated by the device 300. Alternatively, some or all of the steps shown in FIG. 2 may be implemented as instructions that are executed by one or more processors of the device 300 and cause the one or more processors to perform the operations. Optionally or alternatively, some or all of the steps shown in FIG. 2 may be computed or generated by a server, which receives information that the device 300 has computed or generated.
A computer program stored in a computer-readable medium according to an embodiment of the present invention may be executable on one or more processors and may include instructions that cause the one or more processors to perform the following operations. The following operations are shown in Fig.
The program according to an embodiment of the present invention may include an operation (S110) of receiving voice data. The voice data may be received, for example, by the input unit of the device 300 (see FIG. 3).
The program according to an embodiment of the present invention may include an operation (S120) of generating one or more voice segments from the voice data received in operation S110. One or more voice segments generated by the voice segment generation operation S120 are shown in FIG. 4. The voice segment generation operation S120 may be performed, for example, by the voice segment generation unit of the device 300 (see FIG. 3).
More specifically, one or more voice segments are generated by segmenting the received voice data using a speech region detection algorithm. Each voice segment has a starting point and an ending point. It will be appreciated by those skilled in the art that the starting point is the point at which the voice segment begins and the ending point is the point at which it ends.
The speech region detection algorithm according to an embodiment of the present invention is an end-point detection (EPD) algorithm based on at least one of a rule-based method and a machine learning method. EPD is employed to find the starting and ending points of the speech region.

The rule-based method is based on at least one of frame energy, zero-crossing rate, energy entropy, TEO (Teager Energy Operator) energy, and mel-scale filter bank features. The machine learning method is based on at least one of a GMM (Gaussian Mixture Model), an HMM (Hidden Markov Model), an SVM (Support Vector Machine), and a DNN (Deep Neural Network). The algorithms described above are exemplary algorithms for detecting a speech region from speech data, and the scope of the present invention is not limited thereto.
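As an illustration, the rule-based branch can be sketched in a few lines of Python. This is a minimal, hypothetical EPD, not the patent's implementation: it thresholds short-time frame energy against an estimated noise floor and merges runs of speech frames into (start, end) sample ranges; the function name and parameters are assumptions.

```python
import numpy as np

def detect_speech_segments(signal, rate, frame_ms=25, hop_ms=10,
                           energy_ratio=4.0, min_frames=5):
    """Toy rule-based end-point detection (EPD).

    Frames whose short-time energy exceeds a threshold derived from an
    estimated noise floor are marked as speech; runs of at least
    `min_frames` speech frames become (start, end) segments in samples.
    """
    frame = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    n = 1 + max(0, (len(signal) - frame) // hop)
    energies = np.array([np.sum(signal[i*hop:i*hop+frame].astype(float)**2)
                         for i in range(n)])
    # Estimate the noise floor from the quietest 10% of frames.
    floor = np.percentile(energies, 10)
    speech = energies > energy_ratio * max(floor, 1e-10)
    segments, start = [], None
    for i, s in enumerate(speech):
        if s and start is None:
            start = i
        elif not s and start is not None:
            if i - start >= min_frames:
                segments.append((start * hop, (i - 1) * hop + frame))
            start = None
    if start is not None and n - start >= min_frames:
        segments.append((start * hop, (n - 1) * hop + frame))
    return segments
```

As FIG. 1 illustrates, a scheme this simple misjudges the boundaries whenever noise raises the energy near a segment edge, which is exactly the situation the re-segmentation operations below are meant to correct.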
As described above with reference to FIG. 1, if the speech region is not accurate, that is, if the starting point and/or the ending point of the speech region is not accurately detected, the speech recognition rate will be degraded no matter how good the speech recognition engine is.
According to the operations (S130, S140, and S150) described below according to an embodiment of the present invention, the present invention can detect an accurate starting point and an ending point of a speech region, thereby improving the speech recognition rate. Hereinafter, S130 to S150 will be described in turn.
The program according to an embodiment of the present invention may include an operation (S130) of generating a segmented speaker model and a frame speaker model.
More specifically, using the speaker recognition algorithm, a segment speaker model corresponding to each of the voice segments can be generated. In addition, a frame-speaker model corresponding to each of one or more frames associated with the speech segment may be generated.
Here, the one or more frames associated with the voice segment may be, for example, one or more frames related to the starting point of the voice segment. Alternatively, one or more frames associated with the voice segment may be one or more frames associated with the endpoint of the voice segment.
More specifically, the one or more frames associated with the voice segment may include a first frame having a first section extending from the starting point toward an outer region of the voice segment, a second frame having a second section extending from the starting point toward an inner region of the voice segment, a third frame having a third section extending from the ending point toward an outer region of the voice segment, and a fourth frame having a fourth section extending from the ending point toward an inner region of the voice segment.
The first section, the second section, the third section, and the fourth section may all have the same length. Alternatively, at least one of the first to fourth sections may be determined to have a different length, and the scope of the present invention is not limited thereto.
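The four inspection sections around the starting and ending points can be made concrete with a small sketch. The function name and the sample-range representation are assumptions for illustration:

```python
def inspection_frames(start, end, outer=400, inner=400):
    """Return the four inspection frames for a segment (start, end)
    as (lo, hi) sample ranges.

    frame1: from the starting point outward (before the segment)
    frame2: from the starting point inward (inside the segment)
    frame3: from the ending point outward (after the segment)
    frame4: from the ending point inward (inside the segment)
    """
    return {
        "frame1": (max(0, start - outer), start),
        "frame2": (start, min(end, start + inner)),
        "frame3": (end, end + outer),
        "frame4": (max(start, end - inner), end),
    }
```

Passing different `outer` and `inner` values corresponds to the case above in which the four sections have different lengths.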
The speaker recognition algorithm may employ, for example, at least one of GMM, HMM, DNN, and I-vector, but the scope of rights of the present invention is not limited thereto.
The speaker model is generated by performing a pre-stored algorithm on a UBM (Universal Background Model). Here, the pre-stored algorithm may include at least one of the MAP, MLLR, and Eigenvoice methods, but various algorithms not described above may also be employed to generate the speaker model.
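As one illustrative possibility, adapting the means of a diagonal-covariance GMM-UBM by relevance MAP (one of the options listed above) might look as follows. The function name, relevance factor `r`, and array layout are assumptions, not the patent's implementation:

```python
import numpy as np

def map_adapt_means(ubm_weights, ubm_means, ubm_covars, X, r=16.0):
    """MAP-adapt the means of a diagonal-covariance GMM-UBM to data X.

    Relevance MAP: new_mean = a * E[x | component] + (1 - a) * ubm_mean,
    with a = n_c / (n_c + r) per component, where n_c is the soft count
    of frames assigned to that component.
    """
    # Posterior responsibilities of each UBM component for each frame.
    diff = X[:, None, :] - ubm_means[None, :, :]              # (N, C, D)
    log_p = -0.5 * np.sum(diff**2 / ubm_covars
                          + np.log(2 * np.pi * ubm_covars), axis=2)
    log_p += np.log(ubm_weights)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                 # (N, C)
    n_c = gamma.sum(axis=0)                                   # (C,)
    Ex = gamma.T @ X / np.maximum(n_c[:, None], 1e-10)        # (C, D)
    alpha = (n_c / (n_c + r))[:, None]
    return alpha * Ex + (1 - alpha) * ubm_means
```

Components that the adaptation data never touches keep their UBM means, which is what makes the adapted model usable even for the short segments and frames considered here.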
The operation S130 described above may be performed, for example, by the speaker model generation unit of the device 300 (see FIG. 3).
After the operation (S130) of generating the segment speaker model and the frame speaker model, the program according to an embodiment of the present invention may perform an operation (S140) of determining the degree of similarity between the segment speaker model and the frame speaker model. To determine the degree of similarity according to an embodiment of the present invention, for example, a probability value may be calculated based on an extracted feature vector, but the scope of the present invention is not limited thereto. In other words, those skilled in the art will appreciate that known algorithms can be employed to measure the similarity between different speaker models.
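For instance, with GMM-based speaker models, one common similarity score is the average per-frame log-likelihood of the frames under the model. A minimal sketch, assuming diagonal covariances; the function name and layout are illustrative:

```python
import numpy as np

def gmm_log_likelihood(weights, means, covars, X):
    """Average per-frame log-likelihood of frames X under a
    diagonal-covariance GMM; usable as a similarity score between
    a frame's features and a (segment) speaker model."""
    diff = X[:, None, :] - means[None, :, :]                  # (N, C, D)
    log_p = -0.5 * np.sum(diff**2 / covars
                          + np.log(2 * np.pi * covars), axis=2)
    log_p += np.log(weights)
    mx = log_p.max(axis=1, keepdims=True)
    # Stable log-sum-exp over components, then average over frames.
    return float(np.mean(mx.squeeze(1)
                         + np.log(np.exp(log_p - mx).sum(axis=1))))
```

Frames drawn from the same speaker as the model score higher than frames from noise or another speaker, which is the property the threshold comparison in operation S150 relies on.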
That is, according to an embodiment of the present invention, the received voice data (S110) is segmented into one or more voice segments (S120), a speaker model of each generated voice segment is created (S130), and a speaker model of each related frame is generated (S130). Thereafter, the degree of similarity between the generated segment speaker model and the frame speaker model can be determined (S140).
Re-segmentation of the speech segment may be performed based on the similarity determined by the above-described operations (S150). That is, the re-segmentation operation (S150) can re-detect the start and / or end points of the speech segment, thereby improving the speech recognition rate.
In more detail, the operation (S150) of performing re-segmentation may include determining whether the frame speaker model and the segment speaker model are identical by comparing the similarity with a predetermined threshold value.
For example, when the degree of similarity is equal to or greater than the predetermined threshold value, it may be determined that the frame speaker model and the segment speaker model are the same. Conversely, if the similarity is less than the predetermined threshold value, the frame speaker model and the segment speaker model may be determined not to be the same.
In addition, when it is determined that the frame speaker model and the segment speaker model are the same, re-segmentation can be performed so that the voice segment includes the frame. When it is determined that the frame speaker model and the segment speaker model are not the same, re-segmentation can be performed so that the voice segment does not include the frame.
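The include/exclude rule above can be sketched as a boundary update, assuming the similarity scores for the four inspection frames described earlier have already been computed. All names and the threshold convention are assumptions for illustration:

```python
def resegment(start, end, frame_scores, threshold, frame_len):
    """Adjust (start, end) given similarity scores for the four
    inspection frames.  A frame whose score is >= threshold is judged
    to contain the same speaker and is merged into the segment;
    otherwise the corresponding boundary region is dropped.
    """
    if frame_scores["frame1"] >= threshold:      # same speaker before start
        start -= frame_len                       # grow segment backward
    elif frame_scores["frame2"] < threshold:     # non-matching inner region
        start += frame_len                       # shrink segment forward
    if frame_scores["frame3"] >= threshold:      # same speaker after end
        end += frame_len                         # grow segment forward
    elif frame_scores["frame4"] < threshold:     # non-matching inner region
        end -= frame_len                         # shrink segment backward
    return start, end
```

Growing the segment corrects removal errors (speech cut off at a boundary), while shrinking it corrects insertion errors (noise absorbed into the segment).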
Additionally, for the re-segmented voice segment, an operation may be performed to generate a frame speaker model corresponding to each of one or more frames associated with the re-segmented voice segment. Further, an operation of determining the degree of similarity between the segment speaker model and the frame speaker model may be performed, and an operation of performing a second re-segmentation of the re-segmented voice segment based on the determined similarity may be further performed. That is, the re-segmented voice segment may be re-segmented again. Reference is made to FIGS. 5 to 14 in this regard.
Through the operations described above, the voice segment generated in operation S120 can be re-segmented to correct insertion errors and/or removal errors. This will be described later with reference to FIGS. 4 to 14.
The operation S150 of performing the re-segmentation described above may be performed by the re-segmentation processor of the device 300 (see FIG. 3).
Although not shown, an operation of removing at least one of noise, interjections (filler sounds), and silence from the re-segmented voice segment may additionally be performed.
Some of the operations shown may be omitted in accordance with an embodiment of the present invention. Further, the operations shown in FIG. 2 are exemplary, and additional operations may also be included within the scope of the present invention.
FIG. 3 shows a block diagram of an apparatus according to an embodiment of the present invention.
The device 300 according to an embodiment of the present invention may include an input unit, a voice segment generation unit, a speaker model generation unit, a similarity determination unit, and a re-segmentation processor.

The input unit may receive voice data (see operation S110 of FIG. 2).

The voice segment generation unit may segment the received voice data using a speech region detection algorithm to generate one or more voice segments, each having a starting point and an ending point (see operation S120 of FIG. 2).

The speaker model generation unit may generate, using a speaker recognition algorithm, a segment speaker model corresponding to each voice segment and a frame speaker model corresponding to each of one or more frames associated with the voice segment (see operation S130 of FIG. 2).

The similarity determination unit may determine the degree of similarity between the segment speaker model and the frame speaker model (see operation S140 of FIG. 2).

In addition, re-segmentation of the voice segment may be performed based on the determined similarity by the re-segmentation processor (see operation S150 of FIG. 2).

The device 300 may further include a memory. The memory may store data received, computed, or generated by the device 300, for example the received voice data, the generated voice segments, and the generated speaker models.

In a further aspect of the present invention, the device 300 may include a communication unit. The communication unit may transmit and receive data to and from an external electronic device and/or a server. To this end, the communication unit may include a wired and/or wireless Internet module and/or a short-range communication module.
Wireless Internet technologies include WLAN (Wi-Fi), WiBro (Wireless Broadband), WiMAX (World Interoperability for Microwave Access), HSDPA (High Speed Downlink Packet Access), and the like. Wired Internet technologies include xDSL (Digital Subscriber Line), FTTH (Fiber To The Home), and PLC (Power Line Communication).
The communication unit may include a short-range communication module and may transmit and receive data to and from an electronic device that includes a short-range communication module. Bluetooth, RFID (Radio Frequency Identification), IrDA (Infrared Data Association), UWB (Ultra Wideband), ZigBee, and the like can be used as short-range communication technologies. The communication technologies described above are merely examples, and the scope of the present invention is not limited thereto.
In an aspect of the present invention, data transmitted and/or received via the communication unit may be stored in the memory.
In accordance with a further embodiment of the present invention, the device 300 may further include a display unit.
Some of these displays may be transparent or light-transmissive so that the outside can be seen through them. Such a display may be referred to as a transparent display, and a typical example of the transparent display is the TOLED (Transparent OLED).
The various embodiments described herein may be embodied in a recording medium or storage medium readable by a computer or similar device using, for example, software, hardware, or a combination thereof.
For example, according to a hardware implementation, the embodiments described herein may be implemented using at least one of application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and electrical units for performing other functions. In some cases, the embodiments described herein may be implemented by the control unit itself.
In another example, according to a software implementation, embodiments such as the procedures and functions described herein may be implemented as separate software modules. Each of the software modules may perform one or more of the functions and operations described herein. Software code may be implemented in a software application written in a suitable programming language. The software code may be stored in the memory and executed by a processor.
FIG. 4 illustrates one or more voice segments generated in accordance with an embodiment of the present invention.
By segmenting the received voice data using the speech region detection algorithm, one or more voice segments (100, 200), each having a starting point and an ending point, can be generated.
FIG. 5 is a diagram for explaining a first embodiment of a program according to an embodiment of the present invention.

FIG. 6 illustrates a re-segmented voice segment according to the first embodiment of a program according to an embodiment of the present invention.
FIG. 5 shows a first voice segment 100 generated according to an embodiment of the present invention, together with one or more frames associated with the first voice segment 100.

More specifically, FIG. 5 shows a first frame 101 having a first section extending from the starting point toward an outer region of the voice segment, a second frame 102 having a second section extending from the starting point toward an inner region of the voice segment, a third frame 103 having a third section extending from the ending point toward an outer region of the voice segment, and a fourth frame 104 having a fourth section extending from the ending point toward an inner region of the voice segment.

The first to fourth frames described above may be determined to have the same and/or different sizes. In addition, the sizes of the frames described above may be set based on the distance from an adjacent segment, and the scope of the present invention is not limited thereto.

As described above, using the speaker recognition algorithm, a segment speaker model corresponding to the first voice segment 100 can be generated.

Also, a frame speaker model corresponding to each of the one or more frames (here, 101, 102, 103, and 104) associated with the voice segment may be generated.
According to one embodiment of the present invention, the degree of similarity between the segmented speaker model and the frame speaker model can be determined. The re-segmentation of the speech segment is performed based on the determined similarity.
In more detail, when it is determined that a frame speaker model and the segment speaker model are the same, re-segmentation is performed so that the voice segment includes that frame. When it is determined that a frame speaker model and the segment speaker model are not the same, re-segmentation is performed so that the voice segment does not include that frame.
Referring to FIGS. 5 and 6, when it is determined that the segment speaker model of the first voice segment 100 and the frame speaker model of a given frame are the same, the re-segmented voice segment is extended to include that frame.

Referring to FIGS. 5 and 6, when it has been determined that the segment speaker model of the first voice segment 100 and the frame speaker model of a given frame are not the same, the re-segmented voice segment excludes that frame.

The process described above is referred to as re-segmentation, and FIG. 6 shows the re-segmented voice segment.
FIG. 7 is a diagram for explaining a second embodiment of a program according to an embodiment of the present invention.

FIG. 8 illustrates a second re-segmented voice segment according to the second embodiment of a program according to an embodiment of the present invention.
An additional re-segmentation (e.g., a second re-segmentation) may be performed for the re-segmented voice segment.

In more detail, a frame speaker model corresponding to each of one or more frames associated with the re-segmented voice segment is generated, and the similarity between the segment speaker model and the frame speaker model can be determined. A second re-segmentation of the re-segmented voice segment may then be performed based on the determined similarity.
FIG. 7 shows the re-segmented voice segment together with one or more frames associated with the re-segmented voice segment.

More specifically, FIG. 7 shows the frames 111, 112, and 113 associated with the re-segmented voice segment.
When the re-segmented voice segment is re-segmented again, the contents of the previous re-segmentation may be referenced. For example, referring again to FIGS. 5 and 6, the determination results for the frames already examined during the previous re-segmentation may be referenced in the second re-segmentation.
In addition, the size of the frames associated with the re-segmented voice segment when the second re-segmentation is performed may differ from the size of the frames associated with the voice segment when the first re-segmentation is performed. As shown, the size of a frame associated with the re-segmented voice segment may be set to be smaller than the size of a frame associated with the original voice segment, but the scope of the present invention is not limited thereto.
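One way to realize repeated re-segmentation with smaller frames is an iterative loop that halves the inspection-frame size on each pass. This is a sketch under stated assumptions, not the patent's method: the score function, pass count, and halving schedule are all illustrative, and `score_fn(lo, hi)` stands in for the speaker-model similarity of the frame `[lo, hi)`.

```python
def iterative_resegment(start, end, score_fn, threshold,
                        frame_len=400, passes=3):
    """Refine segment boundaries over several passes, halving the
    inspection-frame size each pass.  `score_fn(lo, hi)` returns the
    similarity between the segment speaker model and frame [lo, hi).
    """
    for _ in range(passes):
        # Grow at the start if the outer frame matches the speaker,
        # shrink if even the inner frame does not; likewise at the end.
        if score_fn(start - frame_len, start) >= threshold:
            start -= frame_len
        elif score_fn(start, start + frame_len) < threshold:
            start += frame_len
        if score_fn(end, end + frame_len) >= threshold:
            end += frame_len
        elif score_fn(end - frame_len, end) < threshold:
            end -= frame_len
        frame_len //= 2              # finer frames on the next pass
        if frame_len == 0:
            break
    return start, end
```

The coarse first pass corrects large boundary errors, and each finer pass narrows the remaining error, mirroring the coarse-to-fine intent of the first and second re-segmentations described above.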
As described above, using the speaker recognition algorithm, a segment speaker model corresponding to the re-segmented voice segment can be generated.

Also, a frame speaker model corresponding to each of the one or more frames (here, 111, 112, and 113) associated with the re-segmented voice segment may be generated.
According to one embodiment of the present invention, the degree of similarity between the segmented speaker model and the frame speaker model can be determined. A second re-segmentation of the speech segment is performed based on the determined similarity.
Referring to FIGS. 7 and 8, when it is determined that the segment speaker model of the re-segmented voice segment and the frame speaker model of a given frame are the same, the segment is extended to include that frame; otherwise, the frame is excluded.

The process described above is referred to as the second re-segmentation, and FIG. 8 shows the second re-segmented voice segment.
FIG. 9 shows a voice segment that can be detected according to an embodiment of the present invention and a voice segment detected by a conventional technique.
Referring to FIG. 9, there are shown a voice segment detected by a conventional technique and a voice segment that can be detected according to an embodiment of the present invention.
Referring to FIG. 9, according to the present invention, it is possible to overcome removal errors and to correct for noise, interjections, and silence periods. Through this, a more accurate speech region can be detected and the speech recognition rate can be improved. It will be apparent to those skilled in the art that the effects of the present invention are not limited to those described above.
10 is a diagram for explaining a first embodiment of a program according to another embodiment of the present invention.
Figure 11 shows a re-segmented speech segment according to a first embodiment of a program according to another embodiment of the present invention.
Figure 10 shows a voice segment and one or more frames associated with the voice segment, according to the first embodiment.
As described above, using the speaker recognition algorithm, a segment speaker model corresponding to the voice segment may be generated.
In addition, a frame-speaker model corresponding to each of one or more frames (here, 201, 202, 203, and 204) associated with the speech segment may be generated.
According to one embodiment of the present invention, the degree of similarity between the segment speaker model and each frame speaker model can be determined. Re-segmentation of the speech segment is then performed based on the determined similarity.
The segment speaker model of the voice segment is compared with the frame speaker model of each of the frames (here, 201, 202, 203, and 204), and the voice segment is re-segmented so as to include frames determined to correspond to the same speaker and to exclude frames determined not to.
The process as described above is referred to as re-segmentation, and Fig. 11 shows a re-segmented voice segment.
12 is a diagram for explaining a second embodiment of a program according to another embodiment of the present invention.
Figure 13 shows a second re-segmented speech segment according to a second embodiment of the program according to another embodiment of the present invention.
In more detail, a frame speaker model corresponding to each of one or more frames associated with the re-segmented voice segment is generated, and the similarity between the segment speaker model and the frame speaker model can be determined. A second re-segmentation of the re-segmented voice segment may then be performed based on the determined similarity.
FIG. 12 shows a re-segmented voice segment and one or more frames associated with the re-segmented voice segment.
The segment speaker model of the re-segmented voice segment is compared with the frame speaker model of each frame, and frames determined not to correspond to the same speaker may be excluded from the segment.
The process as described above is referred to as second re-segmentation, and a second re-segmented voice segment is shown in FIG. 13.
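The boundary adjustment performed by re-segmentation can be sketched as follows (an illustrative Python fragment, not the patented implementation). Here `same_speaker` stands in for the segment-model/frame-model similarity test, and the single-step expand/trim policy at each boundary is an assumption for illustration.

```python
def adjust_segment(start, end, frame_len, same_speaker):
    """One re-segmentation pass over a segment [start, end).

    `same_speaker(t0, t1)` reports whether the frame [t0, t1) matches
    the segment speaker model. A matching outer frame is pulled in
    (fixing an erasure error); a non-matching inner frame is trimmed
    off (fixing an insertion error).
    """
    # Start boundary: look one frame outside, then one frame inside.
    if same_speaker(start - frame_len, start):
        start -= frame_len            # speech was cut off: extend
    elif not same_speaker(start, start + frame_len):
        start += frame_len            # non-speech leaked in: trim
    # End boundary, symmetrically.
    if same_speaker(end, end + frame_len):
        end += frame_len
    elif not same_speaker(end - frame_len, end):
        end -= frame_len
    return start, end
```

A full implementation would iterate this until the boundaries stabilize; one pass is shown to keep the sketch short.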
Figure 14 illustrates another voice segment that may be detected in accordance with one embodiment of the present invention and another voice segment detected by a conventional technique.
Referring to FIG. 14, another voice segment that may be detected according to an embodiment of the present invention is shown together with another voice segment detected by a conventional technique.
Referring to FIG. 14, according to the present invention, an insertion error can be overcome.
Through this, a more accurate speech region can be detected and the speech recognition rate can be improved. It will be apparent to those skilled in the art that the effects of the present invention are not limited to those described above.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention as defined in the appended claims.
Those of ordinary skill in the art will understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, the data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced in the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Those skilled in the art will appreciate that the various illustrative logical blocks, modules, processors, means, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software (which may be referred to herein as "software"), or a combination of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the design constraints imposed on the particular application and the overall system. Those skilled in the art may implement the described functionality in various ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various embodiments presented herein may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" includes a computer program, carrier, or media accessible from any computer-readable device. For example, computer-readable media include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips, etc.), optical disks (e.g., CD, DVD, etc.), smart cards, and flash memory devices (e.g., EEPROM, cards, sticks, key drives, etc.). The various storage media presented herein also include one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" includes, but is not limited to, wireless channels and various other media capable of storing, holding, and/or transferring instruction(s) and/or data.
It will be appreciated that the particular order or hierarchy of steps in the presented processes is an example of exemplary approaches. It will be appreciated that, based on design priorities, certain orders or hierarchies of steps in processes may be rearranged within the scope of the present invention. The appended method claims provide elements of the various steps in a sample order, but are not meant to be limited to the specific order or hierarchy presented.
The description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features presented herein.
Claims (10)
A computer program stored in a computer-readable medium, the computer program, when executed, causing a processor to perform operations, the operations including:
Receiving voice data;
Segmenting the received speech data using a speech region detection algorithm to generate one or more speech segments each having a starting point and an ending point;
Generating a segment speaker model corresponding to each of the speech segments and a frame speaker model corresponding to each of the one or more frames associated with the speech segment using a speaker recognition algorithm;
Determining a degree of similarity between the segment speaker model and the frame speaker model; And
Performing re-segmentation of the speech segment based on the determined similarity;
A computer program stored on a computer readable medium for enhancing speech recognition performance.
The operation of performing the re-segmentation comprises:
Determining whether the frame speaker model and the segment speaker model are identical by comparing the similarity with a predetermined threshold value;
A computer program stored on a computer readable medium for enhancing speech recognition performance.
The operation of performing the re-segmentation comprises:
Performing re-segmentation such that the speech segment includes the frame if the frame speaker model and the segment speaker model are determined to be the same; And
Performing re-segmentation such that the speech segment does not include the frame if it is determined that the frame speaker model and the segment speaker model are not the same;
A computer program stored on a computer readable medium for enhancing speech recognition performance.
Generating a frame-speaker model corresponding to each of the one or more frames associated with the re-segmented voice segment;
Determining a degree of similarity between the segment speaker model and the frame speaker model; And
Performing a second re-segmentation of the re-segmented speech segment based on the determined similarity;
A computer program stored on a computer readable medium for enhancing speech recognition performance.
Wherein the one or more frames associated with the voice segment comprises:
A first frame having a first section extending from the starting point to an outer region of the voice segment, a second frame having a second section extending from the starting point to an inner region of the voice segment, a third frame having a third section extending from the end point to an outer region of the voice segment, and a fourth frame having a fourth section extending from the end point to an inner region of the voice segment,
A computer program stored on a computer readable medium for enhancing speech recognition performance.
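The four boundary frames recited above can be enumerated directly. The hypothetical Python sketch below (times taken as sample indices, `frame_len` as the frame size, and the dictionary keys as illustrative labels) merely writes out the four intervals: one outside and one inside each boundary.

```python
def boundary_frames(start, end, frame_len):
    """Return the four analysis frames around a voice segment
    [start, end), as (begin, end) index pairs."""
    return {
        "first":  (start - frame_len, start),  # outside, before the start point
        "second": (start, start + frame_len),  # inside, after the start point
        "third":  (end, end + frame_len),      # outside, after the end point
        "fourth": (end - frame_len, end),      # inside, before the end point
    }
```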
Removing at least one of noise, interjections, and silence from the re-segmented speech segment;
A computer program stored on a computer readable medium for enhancing speech recognition performance.
Wherein the speech area detection algorithm comprises:
An EPD (End-Point Detection) algorithm based on at least one of a rule-based method and a machine learning method,
The rule-based method being
Based on at least one of frame energy, zero-crossing rate, energy entropy, TEO energy, and mel-scale filter bank energy, and
The machine learning method being
Based on at least one of a Gaussian Mixture Model (GMM), a Hidden Markov Model (HMM), a Support Vector Machine (SVM), and a Deep Neural Network (DNN),
A computer program stored on a computer readable medium for enhancing speech recognition performance.
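Two of the rule-based features listed above, frame energy and zero-crossing rate, are straightforward to compute. The minimal Python sketch below combines them into a toy speech/non-speech decision; the thresholds and the combined rule are illustrative assumptions, not values from the patent.

```python
def frame_energy(frame):
    """Mean squared amplitude of a frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def is_speech(frame, energy_thr=0.01, zcr_thr=0.5):
    """Toy rule: speech frames carry enough energy and a moderate
    zero-crossing rate (a very high ZCR suggests broadband noise)."""
    return frame_energy(frame) > energy_thr and zero_crossing_rate(frame) < zcr_thr
```

A production EPD would smooth these per-frame decisions over time (hangover schemes) before declaring start and end points.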
The speaker recognition algorithm includes:
At least one of GMM, HMM, DNN, and i-vector.
A computer program stored on a computer readable medium for enhancing speech recognition performance.
The speaker model is:
Generated by executing a pre-stored algorithm on a UBM (Universal Background Model), and
The pre-stored algorithm includes at least one of the MAP, MLLR, and Eigenvoice methods.
A computer program stored on a computer readable medium for enhancing speech recognition performance.
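MAP adaptation, one of the methods recited above, derives a speaker model by nudging UBM parameters toward the target speaker's data; a common form updates each mixture mean as mu' = (n·x̄ + r·mu)/(n + r), where n is the soft count of frames assigned to the mixture, x̄ their mean, and r a relevance factor. The one-dimensional Python sketch below illustrates only this mean update, under those assumptions; it is not the patent's implementation.

```python
def map_adapt_means(ubm_means, frame_stats, relevance=16.0):
    """MAP-adapt each (scalar) UBM mixture mean toward the data.

    `frame_stats[k]` is (n, x_bar): the soft count and mean of frames
    assigned to mixture k. With no data (n == 0) the UBM mean is kept.
    """
    adapted = []
    for mu, (n, x_bar) in zip(ubm_means, frame_stats):
        alpha = n / (n + relevance)          # data weight for this mixture
        adapted.append(alpha * x_bar + (1 - alpha) * mu)
    return adapted
```

The relevance factor controls how much observed data is needed before a mean moves away from the UBM prior; 16 is a conventional choice in GMM-UBM systems.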
An input unit for receiving voice data;
A voice segment generation unit for segmenting the received voice data using a voice region detection algorithm to generate one or more voice segments each having a start point and an end point;
A speaker model generation unit that generates a segment speaker model corresponding to each of the voice segments and a frame speaker model corresponding to each of one or more frames associated with the voice segment using a speaker recognition algorithm;
A similarity determination unit for determining a similarity between the segment speaker model and the frame speaker model; And
A re-segmentation processor for performing re-segmentation of the speech segment based on the determined similarity;
/ RTI >
Device.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020160022510A KR101780932B1 (en) | 2016-02-25 | 2016-02-25 | Apparatus and computer program stored in computer-readable medium for improving of voice recognition performance |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020160022510A KR101780932B1 (en) | 2016-02-25 | 2016-02-25 | Apparatus and computer program stored in computer-readable medium for improving of voice recognition performance |
Publications (2)
Publication Number | Publication Date |
---|---|
KR20170100705A true KR20170100705A (en) | 2017-09-05 |
KR101780932B1 KR101780932B1 (en) | 2017-09-27 |
Family
ID=59924645
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020160022510A KR101780932B1 (en) | 2016-02-25 | 2016-02-25 | Apparatus and computer program stored in computer-readable medium for improving of voice recognition performance |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR101780932B1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20190099988A (en) * | 2018-02-19 | 2019-08-28 | 주식회사 셀바스에이아이 | Device for voice recognition using end point detection and method thereof |
KR20220075550A (en) * | 2020-11-30 | 2022-06-08 | 네이버 주식회사 | Method, system, and computer program to speaker diarisation using speech activity detection based on spearker embedding |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4572218B2 (en) * | 2007-06-27 | 2010-11-04 | 日本電信電話株式会社 | Music segment detection method, music segment detection device, music segment detection program, and recording medium |
JP4964204B2 (en) * | 2008-08-27 | 2012-06-27 | 日本電信電話株式会社 | Multiple signal section estimation device, multiple signal section estimation method, program thereof, and recording medium |
-
2016
- 2016-02-25 KR KR1020160022510A patent/KR101780932B1/en active IP Right Grant
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20190099988A (en) * | 2018-02-19 | 2019-08-28 | 주식회사 셀바스에이아이 | Device for voice recognition using end point detection and method thereof |
KR20220075550A (en) * | 2020-11-30 | 2022-06-08 | 네이버 주식회사 | Method, system, and computer program to speaker diarisation using speech activity detection based on spearker embedding |
Also Published As
Publication number | Publication date |
---|---|
KR101780932B1 (en) | 2017-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210312681A1 (en) | Joint audio-video facial animation system | |
CN114578969B (en) | Method, apparatus, device and medium for man-machine interaction | |
WO2021232594A1 (en) | Speech emotion recognition method and apparatus, electronic device, and storage medium | |
US10540958B2 (en) | Neural network training method and apparatus using experience replay sets for recognition | |
RU2688277C1 (en) | Re-speech recognition with external data sources | |
US11955119B2 (en) | Speech recognition method and apparatus | |
CN108694940A (en) | A kind of audio recognition method, device and electronic equipment | |
CN107526967A (en) | A kind of risk Address Recognition method, apparatus and electronic equipment | |
CN110647675B (en) | Method and device for recognition of stop point and training of prediction model and storage medium | |
CN113344089B (en) | Model training method and device and electronic equipment | |
CN111160004A (en) | Method and device for establishing sentence-breaking model | |
KR101780932B1 (en) | Apparatus and computer program stored in computer-readable medium for improving of voice recognition performance | |
US10733537B2 (en) | Ensemble based labeling | |
US10529337B2 (en) | Symbol sequence estimation in speech | |
CN113658586B (en) | Training method of voice recognition model, voice interaction method and device | |
CN112185382B (en) | Method, device, equipment and medium for generating and updating wake-up model | |
CN111128134A (en) | Acoustic model training method, voice awakening method, device and electronic equipment | |
CN113408273A (en) | Entity recognition model training and entity recognition method and device | |
CN112529159A (en) | Network training method and device and electronic equipment | |
CN111984983A (en) | User privacy encryption method | |
KR20170109728A (en) | Apparatus, method and computer program stored on computer-readable medium for recognizing continuous speech | |
CN115967549A (en) | Anti-leakage method based on internal and external network information transmission and related equipment thereof | |
CN112633381A (en) | Audio recognition method and training method of audio recognition model | |
CN114724090B (en) | Training method of pedestrian re-identification model, and pedestrian re-identification method and device | |
US20230130263A1 (en) | Method For Recognizing Abnormal Sleep Audio Clip, Electronic Device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant |