KR101681988B1 - Speech recognition apparatus, vehicle having the same and speech recognition method - Google Patents
Speech recognition apparatus, vehicle having the same and speech recognition method
- Publication number
- KR101681988B1 (application KR1020150106376A)
- Authority
- KR
- South Korea
- Prior art keywords
- speech
- signal
- voice
- noise signal
- speech recognition
- Prior art date
Links
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
Abstract
Disclosed are a speech recognition apparatus capable of improving the accuracy of preprocessing for speech recognition by using noise signals received from other nearby speech recognition apparatuses in the preprocessing, a vehicle including the same, a server for exchanging signals with a plurality of speech recognition apparatuses, and a speech recognition method.
A speech recognition apparatus that converts input speech into a speech signal and recognizes the speech when a user's speech is input includes: a preprocessing unit that extracts a first noise signal from the speech signal, synthesizes the first noise signal and a second noise signal extracted by a nearby speech recognition apparatus, based on a first time at which the speech is input and a second time at which the speech is input to the nearby speech recognition apparatus, to generate a synthesized noise signal, and performs preprocessing using the synthesized noise signal; and a recognition unit that recognizes the preprocessed speech signal.
Description
The disclosed invention relates to a speech recognition apparatus that recognizes a user's speech, a vehicle that includes the apparatus and performs a specific function according to the recognized speech, and a speech recognition method.
Speech recognition technology is a technology that increases the usability of a device by allowing the user to control the device by simply uttering a command without physically manipulating the interface.
Such speech recognition technology can be applied to various fields. In recent years, attempts have been made to apply speech recognition technology to a vehicle in order to reduce a driver's operation load.
For speech recognition technology to be applied effectively, the accuracy of speech recognition must be guaranteed to some extent, and much research and development has been conducted to improve it. To improve the accuracy of speech recognition, noise removal must be performed efficiently.
The present invention provides a speech recognition apparatus, a vehicle including the same, and a speech recognition method capable of improving the accuracy of preprocessing for speech recognition by using noise signals received from other nearby speech recognition apparatuses in the preprocessing.
A speech recognition apparatus that converts input speech into a speech signal and recognizes the speech when a user's speech is input includes: a preprocessing unit that extracts a first noise signal from the speech signal, synthesizes the first noise signal and a second noise signal extracted by a nearby speech recognition apparatus, based on a first time at which the speech is input and a second time at which the speech is input to the nearby speech recognition apparatus, to generate a synthesized noise signal, and performs preprocessing using the synthesized noise signal; and a recognition unit that recognizes the preprocessed speech signal.
The preprocessing unit may extract a speech section by removing the synthesized noise signal from the speech signal, and may extract a feature from the speech section.
The recognition unit can recognize the voice signal by comparing the extracted feature with a model stored in advance.
The apparatus may further include a voice input unit for receiving the voice signal, and a communication unit for receiving the second time and the second noise signal.
The communication unit may receive the second time and the second noise signal from an external server.
The communication unit may receive the second time and the second noise signal from the peripheral speech recognition apparatus.
The communication unit may use at least one of wireless LAN, Wi-Fi, Bluetooth, ZigBee, Wi-Fi Direct, Ultra-Wideband (UWB), Infrared Data Association (IrDA), Bluetooth Low Energy (BLE), Near Field Communication (NFC), and Radio Frequency Identification (RFID).
The communication unit may transmit the first time and the first noise signal to the peripheral speech recognition apparatus.
The communication unit may transmit the first time and the first noise signal to an external server.
According to an embodiment of the present invention, there is provided a vehicle including: a voice input unit for receiving a user's voice; a preprocessing unit that extracts a first noise signal from the input speech signal, synthesizes the first noise signal and a second noise signal extracted by a peripheral speech recognition apparatus, based on a first time at which the speech is input and a second time at which the speech is input to the peripheral speech recognition apparatus, to generate a synthesized noise signal, and performs preprocessing using the synthesized noise signal; and a recognition unit for recognizing the preprocessed speech signal.
The preprocessing unit may extract a speech section by removing the synthesized noise signal from the speech signal, and may extract a feature from the speech section.
The recognition unit can recognize the voice signal by comparing the extracted feature with a model stored in advance.
The vehicle may further include a communication unit for receiving the second time and the second noise signal.
The communication unit may receive the second time and the second noise signal from an external server.
The communication unit may receive the second time and the second noise signal from the peripheral speech recognition apparatus.
According to an embodiment of the present invention, there is provided a speech recognition method for converting input speech into a speech signal and recognizing the speech, the method comprising: extracting a first noise signal from the speech signal; synthesizing the first noise signal and a second noise signal extracted by a peripheral speech recognition apparatus, based on a first time at which the speech is input and a second time at which the speech is input to the peripheral speech recognition apparatus, to generate a synthesized noise signal; performing preprocessing using the synthesized noise signal; and recognizing the preprocessed speech signal.
Performing the preprocessing may include extracting a speech section by removing the synthesized noise signal from the speech signal, and extracting features from the speech section.
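The preprocessing described above, removing the synthesized noise signal from the speech signal and then extracting features from the speech section, might be sketched as follows. The use of frame-wise spectral subtraction and per-frame log energy as the feature is an illustrative assumption; the patent does not specify these particular techniques, and the function and parameter names are hypothetical.

```python
import numpy as np

def preprocess(speech, synthesized_noise, frame=256):
    """Remove an estimated noise signal from a speech signal by
    frame-wise spectral subtraction, then return per-frame
    log-energy features. Illustrative sketch only."""
    n = min(len(speech), len(synthesized_noise))
    speech, noise = speech[:n], synthesized_noise[:n]
    features = []
    for start in range(0, n - frame + 1, frame):
        s = np.fft.rfft(speech[start:start + frame])
        d = np.fft.rfft(noise[start:start + frame])
        # Subtract the noise magnitude spectrum, floored at zero,
        # keeping the phase of the noisy speech.
        mag = np.maximum(np.abs(s) - np.abs(d), 0.0)
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(s)), frame)
        features.append(np.log(np.sum(clean ** 2) + 1e-10))
    return np.array(features)
```

Frames where the residual energy stays high after subtraction would correspond to the extracted speech section.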
Recognizing the preprocessed speech signal may include recognizing the speech signal by comparing the extracted feature with a previously stored model.
The method may further include receiving the second time and the second noise signal from an external server or the peripheral speech recognition apparatus.
According to one aspect of the present invention, a speech recognition apparatus, a vehicle including the same, and a speech recognition method can receive noise signals from other nearby speech recognition devices and use them in preprocessing, thereby improving the accuracy of preprocessing for speech recognition.
FIG. 1 and FIG. 2 are views for schematically illustrating a relationship between a speech recognition apparatus according to an embodiment and other speech recognition apparatuses located around the speech recognition apparatus.
FIGS. 3 and 4 are control block diagrams of a speech recognition apparatus according to an embodiment.
FIGS. 5 to 7 are views showing examples of voice signals input to the first, second, and third voice recognition devices.
FIG. 8 is a diagram illustrating a process of synthesizing a plurality of noise signals.
FIG. 9 is a diagram showing a voice signal from which a synthesized noise signal has been removed.
FIG. 10 is an external view of a vehicle according to an embodiment.
FIG. 11 is a view showing an internal configuration of a vehicle according to an embodiment.
FIG. 12 is a control block diagram of a server according to an embodiment.
FIG. 13 is a flowchart illustrating a process in which a plurality of speech recognition apparatuses and a server exchange signals with each other.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Hereinafter, embodiments of a speech recognition apparatus, a vehicle including the same, and a speech recognition method will be described in detail with reference to the accompanying drawings.
FIG. 1 and FIG. 2 are views schematically illustrating the relationship between a speech recognition apparatus according to an embodiment and other speech recognition apparatuses located around it. Both FIGS. 1 and 2 are plan views of the vehicle from above.
Referring to FIG. 1, a
Other
In the example of FIG. 1, there are two
In order to recognize the voice of the user aboard the
When the
As shown in FIG. 1, the
The
When the noise signals extracted by the peripheral
The
The peripheral
FIGS. 3 and 4 are control block diagrams of a speech recognition apparatus according to an embodiment.
3, the
The
For example, the
The
When synthesizing the noise signal, it is possible to consider the time information when the speech is input to the
Then, the
The
Acoustic models can be divided into a direct comparison method, which sets the recognition target as a feature vector model and compares it directly with the feature vectors of the speech data, and a statistical method, which uses the feature vectors of the recognition target statistically.
The direct comparison method sets units of the recognition target, such as words or phonemes, as feature vector models and measures how similar the input speech is to them; vector quantization is a typical example. In the vector quantization method, the feature vectors of the input speech data are mapped to a codebook, which serves as the reference model, and encoded as representative values, and these code values are then compared with one another.
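The vector quantization comparison described above can be sketched as follows: each feature vector of the input speech is encoded as the index of its nearest codebook entry, and the total distortion serves as a similarity score against the reference codebook. This is a minimal illustration; the function names and the Euclidean distance measure are assumptions, not taken from the patent.

```python
import numpy as np

def vq_encode(features, codebook):
    """Map each feature vector to the index of its nearest
    codebook entry (Euclidean distance)."""
    # distances has shape (num_frames, codebook_size)
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

def vq_distortion(features, codebook):
    """Total quantization distortion: lower means the input
    speech is closer to this reference codebook."""
    codes = vq_encode(features, codebook)
    return float(np.sum(np.linalg.norm(features - codebook[codes], axis=1)))
```

In a direct-comparison recognizer, the codebook with the smallest distortion for the input utterance would be chosen as the recognition result.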
The statistical model method constructs the units of the recognition target as state sequences and uses the relations between the state sequences, where a state sequence may consist of a plurality of nodes. Methods that use the relations between state sequences include Dynamic Time Warping (DTW), the Hidden Markov Model (HMM), and methods using neural networks.
Dynamic time warping compensates for differences along the time axis when comparing against a reference model, taking into account the dynamic characteristic of speech that the length of the signal varies over time even when the same person pronounces the same word. The hidden Markov model assumes that speech is a Markov process with state transition probabilities and observation probabilities of nodes (output symbols) in each state; it estimates the state transition probabilities and the node observation probabilities from training data, and calculates the probability that the input speech was generated by the estimated model.
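The time-axis compensation performed by dynamic time warping can be illustrated with the standard textbook recurrence; this is a generic sketch of DTW, not the patented implementation.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences,
    compensating for differences along the time axis."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Allow a match, an insertion, or a deletion,
            # always taking the cheapest alignment so far.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because the warping path can stretch or compress either sequence, a slower repetition of the same utterance can still align with the reference at zero or low cost.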
Meanwhile, the language model, which models the linguistic order relations of words and syllables, can reduce acoustic ambiguity and recognition errors by applying the order relations between the units constituting the language to the units obtained by speech recognition. Language models include statistical models and models based on Finite State Automata (FSA); statistical language models use the chained probabilities of words, such as unigrams, bigrams, and trigrams.
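The chained word probabilities used by the statistical language models mentioned above, here the bigram case, can be estimated from a corpus as follows. This is an unsmoothed maximum-likelihood sketch for illustration only; the function name is hypothetical.

```python
from collections import Counter

def bigram_probs(corpus):
    """Estimate bigram probabilities P(w2 | w1) from a token list
    by maximum likelihood (no smoothing)."""
    unigrams = Counter(corpus[:-1])          # counts of each context word
    bigrams = Counter(zip(corpus[:-1], corpus[1:]))  # counts of word pairs
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}
```

A recognizer would multiply (or sum in log space) these probabilities along a candidate word sequence to penalize linguistically unlikely orderings.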
The
The recognition result of the
At least one of the
Some or all of the
The
The
In addition, the processor and the memory may be provided in a single configuration, a plurality of configurations, physically separated configurations, or a single chip, depending on their capacities.
As described above, the
4, the
The
The
The
The
Also, the
On the other hand, there may be cases where other
Hereinafter, the process of removing the noise signal from the speech signal by the
FIGS. 5 to 7 are views showing examples of voice signals input to the first, second, and third voice recognition devices.
When the voice signal input to the first
6, the time at which the speech interval starts can be estimated as T 2 , and the second
7, the time at which the voice section starts can be estimated to be T 3 , and the third
FIG. 8 is a diagram illustrating a process of synthesizing a plurality of noise signals, and FIG. 9 is a diagram illustrating a voice signal from which a synthesized noise signal is removed.
The second noise signal extracted by the second
The
Specifically, as shown in FIG. 8, the noise signals can be synthesized in time order: the first noise signal may be arranged from T1, the second noise signal from T2, and the third noise signal from T3, to generate a synthesized noise signal.
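The time-ordered synthesis shown in FIG. 8 might be sketched as follows, with each device's noise segment placed on a common timeline starting at its own input time (T1, T2, T3). Averaging overlapping regions is an assumption made here for illustration; the patent does not state how overlapping segments are combined, and the function name is hypothetical.

```python
import numpy as np

def synthesize_noise(segments, total_len):
    """Arrange noise segments from several devices on one timeline.
    segments: list of (start_sample, signal) pairs, where
    start_sample is the sample index of that device's input time.
    Overlapping regions are averaged; uncovered samples stay zero."""
    acc = np.zeros(total_len)
    count = np.zeros(total_len)
    for start, sig in segments:
        end = min(start + len(sig), total_len)
        acc[start:end] += sig[: end - start]
        count[start:end] += 1
    return np.where(count > 0, acc / np.maximum(count, 1), 0.0)
```

The resulting synthesized noise signal could then be subtracted from the speech signal during preprocessing.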
The
When the user selects the first
However, when the user selects the second
In this case, the first
FIG. 10 is an external view of a vehicle according to an embodiment, and FIG. 11 is a diagram illustrating an internal configuration of a vehicle according to an embodiment.
The
10, the outer appearance of the
The
The
The
The side mirrors 204L and 204R include a
In addition, the
The proximity sensor can send a detection signal to the side or rear of the vehicle and receive a reflection signal reflected from an obstacle such as another vehicle. The presence or absence of an obstacle on the side or rear of the
Referring to FIG. 11, an AVN (Audio Video Navigation) terminal 270 may be provided in a central area of the
The
The
The
A
The
For example, the
On the other hand, the
The
The microphone for receiving voice from the user may be provided in the head lining 209, the
As described above, the peripheral
When the
12 is a control block diagram of a server according to an embodiment.
The
The
The wireless communication module may include an antenna or a wireless communication chip for transmitting and receiving wireless signals with at least one of a base station and an external device over a mobile communication network. For example, the wireless communication module may be a Wi-Fi module conforming to a wireless LAN standard (IEEE 802.11x).
The short-range communication module refers to a module for performing short-range communication with a device located within a predetermined distance. Examples of short-range communication technologies applicable to an embodiment include wireless LAN, Wi-Fi, Bluetooth, ZigBee, Wi-Fi Direct, ultra-wideband (UWB), infrared data association (IrDA), Bluetooth low energy (BLE), and Near Field Communication (NFC).
When at least one of the
When the
The
It is also possible that the
The speech recognition program or the speech recognition application that executes the operation to perform the above-described operations of the
The speech recognition program may be installed at the time of manufacturing the
The
When the storage medium is included in the application or the server providing the program, it is possible to access the server through the Internet and download the speech recognition program. Here, the server providing the program may be the same as or different from the
In the case where the storage medium is implemented as an auxiliary storage device such as a magnetic disk, an optical disk, a CD-ROM, or a DVD, it is also possible to download a speech recognition program by inserting an auxiliary storage device into the
FIG. 13 is a flowchart illustrating a process in which a plurality of speech recognition apparatuses and a server exchange signals with each other. All or some of the following processes may be included in the speech recognition method according to an embodiment.
In the example of FIG. 13, it is assumed that there are two peripheral speech recognition devices, and that the speech recognition devices exchange signals through the server. As in the examples of FIGS. 5 to 9, the
First, the user turns on the voice recognition mode of the first
The first
The first
The
Here, the second
Information regarding which speech recognition apparatus has agreed to provide the noise signal may be stored in the
Alternatively, it is also possible that information regarding which speech recognition apparatus has agreed to provide the noise signal is stored in the
If the second
The
Alternatively, when the
The above examples show a method of turning on the voice recognition mode of the second
When the first, second and third
When a beep sound is generated, the user inputs a voice, the first
The second
The second
The second
The third
The
Then, noise preprocessing is performed by removing the synthesized noise signal from the speech signal (615).
On the other hand, when the first
According to the above-described embodiments, various noise environments can be reflected in speech recognition, and an effect of utilizing the different speech recognition engines of a plurality of speech recognition devices located at different positions can be obtained.
100: Speech recognition device
200: vehicle
300,400: Peripheral speech recognition device
110:
120:
130: Post-
140:
500: Server
Claims (20)
A communication unit for transmitting a signal for turning on a voice recognition mode of the peripheral voice recognition device to an external server connected to the peripheral voice recognition device or the peripheral voice recognition device when the voice recognition mode is turned on;
A preprocessor configured to extract a first noise signal from the speech signal, to synthesize the first noise signal and a second noise signal extracted by the peripheral speech recognition apparatus, based on a first time at which the speech is input and a second time at which the speech is input to the peripheral speech recognition apparatus, to generate a synthesized noise signal, and to perform preprocessing using the synthesized noise signal; And
And a recognition unit for recognizing the speech signal on which the preprocessing has been performed.
The pre-
Removes the synthesized noise signal from the speech signal to extract a speech section, and extracts a feature from the speech section.
Wherein,
And comparing the extracted feature with a previously stored model to recognize the speech signal.
And a voice input unit for receiving the voice signal.
Wherein,
And receives the second time and the second noise signal from the external server.
Wherein,
And receives the second time and the second noise signal from the peripheral speech recognition apparatus.
Wherein,
Wireless LAN, Wi-Fi, Bluetooth, ZigBee, Wi-Fi Direct, Ultra-Wideband (UWB), Infrared Data Association (IrDA), Bluetooth Low Energy (BLE), Near Field Communication (NFC), and Radio Frequency Identification (RFID).
Wherein,
And transmits the first time and the first noise signal to the peripheral speech recognition apparatus.
Wherein,
And transmits the first time and the first noise signal to the external server.
A communication unit for transmitting a signal for turning on a voice recognition mode of the peripheral voice recognition device to an external server connected to the peripheral voice recognition device or the peripheral voice recognition device when the voice recognition mode is turned on;
Extracting a first noise signal from a speech signal converted from the input speech, synthesizing the first noise signal and a second noise signal extracted by the peripheral speech recognition apparatus, based on a first time at which the speech is input and a second time at which the speech is input to the peripheral speech recognition apparatus, to generate a synthesized noise signal, and performing preprocessing using the synthesized noise signal; And
And a recognition unit for recognizing the speech signal on which the preprocessing has been performed.
The pre-
Removing the synthesized noise signal from the voice signal to extract a voice section, and extracting features from the voice section.
Wherein,
And comparing the extracted feature with a previously stored model to recognize the speech signal.
Wherein,
And receives the second time and the second noise signal from the external server.
Wherein,
And receives the second time and the second noise signal from the peripheral speech recognition apparatus.
Transmitting a signal for turning on the voice recognition mode of the peripheral voice recognition device to an external server connected to the peripheral voice recognition device or the peripheral voice recognition device when the voice recognition mode is turned on;
Extracting a first noise signal from the speech signal;
Synthesizing the first noise signal and the second noise signal extracted by the peripheral speech recognition apparatus, based on a first time at which the speech is input and a second time at which the speech is input to the peripheral speech recognition apparatus, to generate a synthesized noise signal;
Performing preprocessing using the synthesized noise signal;
And recognizing the speech signal on which the preprocessing has been performed.
Performing the pre-
Removing the synthesized noise signal from the speech signal to extract a speech section, and extracting features from the speech section.
Recognizing the speech signal on which the preprocess has been performed,
And comparing the extracted feature with a previously stored model to recognize the speech signal.
And receiving the second time and the second noise signal.
Wherein the second time and the second noise signal are received from the external server or the peripheral speech recognition apparatus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150106376A KR101681988B1 (en) | 2015-07-28 | 2015-07-28 | Speech recognition apparatus, vehicle having the same and speech recongition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150106376A KR101681988B1 (en) | 2015-07-28 | 2015-07-28 | Speech recognition apparatus, vehicle having the same and speech recongition method |
Publications (1)
Publication Number | Publication Date |
---|---|
KR101681988B1 (en) | 2016-12-02 |
Family
ID=57571634
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020150106376A KR101681988B1 (en) | 2015-07-28 | 2015-07-28 | Speech recognition apparatus, vehicle having the same and speech recongition method |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR101681988B1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109509465A (en) * | 2017-09-15 | 2019-03-22 | 阿里巴巴集团控股有限公司 | Processing method, component, equipment and the medium of voice signal |
KR20210081050A (en) * | 2019-12-23 | 2021-07-01 | 주식회사 에스에이치비쥬얼 | System having Power Distributor controller with energy saving function |
KR20230001968A (en) * | 2021-06-29 | 2023-01-05 | 혜윰기술 주식회사 | Voice and gesture integrating device of vehicle |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003510645A (en) * | 1999-09-23 | 2003-03-18 | コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ | Voice recognition device and consumer electronic system |
JP2006227634A (en) * | 2006-03-29 | 2006-08-31 | Seiko Epson Corp | Equipment control method using voice recognition, equipment control system using voice recognition and recording medium which records equipment control program using voice recognition |
JP2011128391A (en) * | 2009-12-18 | 2011-06-30 | Toshiba Corp | Voice processing device, voice processing program and voice processing method |
- 2015-07-28: KR application KR1020150106376A filed, patent KR101681988B1 (en), active, IP Right Grant
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003510645A (en) * | 1999-09-23 | 2003-03-18 | コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ | Voice recognition device and consumer electronic system |
JP2006227634A (en) * | 2006-03-29 | 2006-08-31 | Seiko Epson Corp | Equipment control method using voice recognition, equipment control system using voice recognition and recording medium which records equipment control program using voice recognition |
JP2011128391A (en) * | 2009-12-18 | 2011-06-30 | Toshiba Corp | Voice processing device, voice processing program and voice processing method |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109509465A (en) * | 2017-09-15 | 2019-03-22 | 阿里巴巴集团控股有限公司 | Processing method, component, equipment and the medium of voice signal |
CN109509465B (en) * | 2017-09-15 | 2023-07-25 | 阿里巴巴集团控股有限公司 | Voice signal processing method, assembly, equipment and medium |
KR20210081050A (en) * | 2019-12-23 | 2021-07-01 | 주식회사 에스에이치비쥬얼 | System having Power Distributor controller with energy saving function |
KR102295020B1 (en) | 2019-12-23 | 2021-08-27 | 정혜진 | System having Power Distributor controller with energy saving function |
KR20230001968A (en) * | 2021-06-29 | 2023-01-05 | 혜윰기술 주식회사 | Voice and gesture integrating device of vehicle |
KR102492229B1 (en) * | 2021-06-29 | 2023-01-26 | 혜윰기술 주식회사 | Voice and gesture integrating device of vehicle |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10854195B2 (en) | Dialogue processing apparatus, a vehicle having same, and a dialogue processing method | |
US9619645B2 (en) | Authentication for recognition systems | |
US10839797B2 (en) | Dialogue system, vehicle having the same and dialogue processing method | |
KR101579533B1 (en) | Vehicle and controlling method for the same | |
US9756161B2 (en) | Voice recognition apparatus, vehicle having the same, and method of controlling the vehicle | |
US20180350366A1 (en) | Situation-based conversation initiating apparatus, system, vehicle and method | |
US8005681B2 (en) | Speech dialog control module | |
US11004447B2 (en) | Speech processing apparatus, vehicle having the speech processing apparatus, and speech processing method | |
KR101681988B1 (en) | Speech recognition apparatus, vehicle having the same and speech recongition method | |
US11189276B2 (en) | Vehicle and control method thereof | |
US10861460B2 (en) | Dialogue system, vehicle having the same and dialogue processing method | |
EP3982359A1 (en) | Electronic device and method for recognizing voice by same | |
US20230102157A1 (en) | Contextual utterance resolution in multimodal systems | |
US10770070B2 (en) | Voice recognition apparatus, vehicle including the same, and control method thereof | |
CN111421557A (en) | Electronic device and control method thereof | |
KR102339443B1 (en) | An apparatus for determining an action based on a situation, a vehicle which is capable of determining an action based on a situation, a method of an action based on a situation and a method for controlling the vehicle | |
KR102594310B1 (en) | Dialogue processing apparatus, vehicle having the same and dialogue processing method | |
WO2024070080A1 (en) | Information processing device, information processing method, and program | |
WO2022038724A1 (en) | Voice interaction device and interaction target determination method carried out in voice interaction device | |
KR101804765B1 (en) | Vehicle and control method for the same | |
KR20230075915A (en) | An electronic apparatus and a method thereof | |
KR102304342B1 (en) | Method for recognizing voice and apparatus used therefor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
E902 | Notification of reason for refusal | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant |