WO2014088377A1

WO2014088377A1 - Voice recognition device and method of controlling same

Info

Publication number: WO2014088377A1
Application number: PCT/KR2013/011321
Authority: WO
Inventors: 박은상; 김경덕; 김명재; 리우유; 류성한; 이근배
Original assignee: 삼성전자 주식회사
Priority date: 2012-12-07
Filing date: 2013-12-09
Publication date: 2014-06-12

Abstract

A voice recognition device and a method of controlling the same are disclosed. According to the present invention, the voice recognition device includes: an extracting unit that extracts, from a user's utterance voice, at least one of a first utterance element representing an execution command and a second utterance element representing a subject; a domain determining unit that determines the current domain for providing response information on the utterance voice based on the first and second utterance elements; and a control unit that determines a candidate conversation frame for providing response information on the utterance voice in at least one of the current domain and a previous domain, based on a conversation state of the current domain and of the previous domain pre-determined from the user's previous utterance voice. Thus, the voice recognition device may provide response information suited to the user's intention by taking into account the various possible cases of the user's utterance voice.

Description

Speech recognition device and control method thereof

The present invention relates to a speech recognition apparatus and a response information providing method, and more particularly, to a speech recognition apparatus and a response information providing method for providing response information corresponding to a spoken voice of a user.

A conventional speech recognition apparatus that provides response information for a user's spoken voice analyzes the received spoken voice to determine the domain intended by the user and, based on the determined domain, provides response information for the user's spoken voice.

However, the conventional speech recognition apparatus determines a domain based on the user's current spoken voice and provides response information on the user's spoken voice based on the determined domain. That is, the conventional speech recognition apparatus recognizes the user's intention according to the user's current spoken voice and provides response information about the user's spoken voice without considering the dialogue context between the user's previous spoken voice and the current spoken voice.

For example, the previous spoken voice "What is an action movie?" may carry the user's intention regarding action movies provided as TV programs. When the current spoken voice "What is the VOD?" is subsequently input, the speech recognition apparatus determines the user's intention from the currently input spoken voice alone, without considering the dialogue context associated with the previous spoken voice. However, because the current spoken voice "What is the VOD?" contains no execution target, the speech recognition apparatus cannot correctly determine the user's intention from it. Accordingly, the speech recognition apparatus provides response information that differs from the user's intention or asks the user to speak again, and the user must bear the inconvenience of providing more detailed speech in order to receive the intended response information.

Summary of the Invention

The present invention has been made in view of the above-described needs. Its object is to provide, in a speech recognition apparatus that provides response information for a user's spoken voice in an interactive system, response information appropriate to the user's intention by considering the various possible cases of the user's spoken voice.

According to an aspect of the present invention, there is provided a speech recognition apparatus including: an extractor configured to extract at least one of a first speech element representing an execution command and a second speech element representing a target from a spoken voice of a user; a domain determiner configured to determine a current domain for providing response information for the spoken voice based on the first and second speech elements; and a controller configured to determine a candidate conversation frame for providing response information for the spoken voice on at least one of the current domain and a previous domain, based on a conversation state on the current domain and on the previous domain predetermined from the user's previous spoken voice.

The domain determiner may determine the current domain for providing response information for the spoken voice based on a main action and parameters corresponding to the first and second speech elements extracted by the extractor.

The controller may determine whether the current domain and the previous domain are the same, and whether the conversation context has been switched, from the current conversation frame and the previous conversation frame generated in association with the previous domain. Based on these determinations, the controller may determine a candidate conversation frame for providing response information for the spoken voice on at least one of the current domain and the previous domain.

If the current domain and the previous domain are the same and the conversation context on the two domains is not switched, the controller may determine a candidate conversation frame for the current conversation frame based on a previous conversation frame.

If the current domain and the previous domain are different and the conversation context on the two domains is not switched, the controller may determine, based on the previous conversation frame, a candidate conversation frame for the current conversation frame on the previous domain and on the current domain.

If the current domain and the previous domain are the same and the conversation context on the two domains is switched, the controller may determine, on the previous domain, a candidate conversation frame associated with at least one of the current conversation frame and an initialization conversation frame initialized with respect to the current conversation frame.

If the current domain and the previous domain are different and the conversation context on the two domains is switched, the controller may determine at least one of: a candidate conversation frame for the current conversation frame based on the previous conversation frame; a candidate conversation frame for the current conversation frame on the previous domain; and a candidate conversation frame for an initialization conversation frame initialized with respect to the current conversation frame on the current domain.

The apparatus may further include a storage unit configured to store, matched with each other, conversation example information related to the previous conversation frame for each domain and counting information reflecting how frequently spoken voices related to that conversation example information occur.

The controller may determine the priority of each candidate conversation frame based on the counting information matched to at least one piece of conversation example information for each previous conversation frame stored in the storage unit, and may provide response information for the candidate conversation frames in descending order of priority.

The storage unit may further store indexing information for indexing at least one speech element included in the conversation example information for each of the at least one previous conversation frame. Referring to that indexing information, the controller may provide response information for the candidate conversation frames for the spoken voice in descending order of the number of pieces of indexing information each candidate conversation frame has.
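The count-based ordering of candidate conversation frames could be sketched as follows. This is only an illustration under stated assumptions: the function name `rank_candidates`, the string frame labels, and the count values are hypothetical and do not come from the specification.

```python
from typing import Dict, List


def rank_candidates(candidates: List[str], counts: Dict[str, int]) -> List[str]:
    """Order candidate frames so the most frequently seen ones come first.

    `counts` stands in for the counting information kept in the storage unit;
    a frame with no stored count is treated as never seen (count 0).
    """
    return sorted(candidates, key=lambda frame: counts.get(frame, 0), reverse=True)


# Made-up counting information for three candidate conversation frames.
example_counts = {"frame_a": 3, "frame_b": 12, "frame_c": 7}
ranked = rank_candidates(["frame_a", "frame_b", "frame_c"], example_counts)
```

The same shape would apply to ordering by the number of pieces of indexing information: only the value stored per frame changes.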

Meanwhile, according to an embodiment of the present invention, a method of controlling a speech recognition apparatus includes: extracting at least one of a first speech element representing an execution command and a second speech element representing a target from a spoken voice of a user; determining a current domain for providing response information for the spoken voice based on the first and second speech elements; determining a candidate conversation frame for providing response information for the spoken voice on at least one of the current domain and a previous domain, based on a conversation state on the current domain and on the previous domain predetermined from the user's previous spoken voice; and providing response information for the spoken voice based on the candidate conversation frame.

The determining may include determining the current domain for providing response information for the spoken voice based on a main action and parameters corresponding to the extracted first and second speech elements.

The providing may include determining whether the current domain and the previous domain are the same, and whether the conversation context has been switched, from the current conversation frame and the previous conversation frame generated in relation to the previous domain, and accordingly determining a candidate conversation frame for providing response information for the spoken voice on at least one of the current domain and the previous domain.

The providing may include determining a candidate conversation frame for the current conversation frame based on a previous conversation frame if the current domain and the previous domain are the same and the conversation context on the two domains is not switched.

In addition, the providing may include, if the current domain and the previous domain are different and the conversation context on the two domains is not switched, determining, based on the previous conversation frame, a candidate conversation frame for the current conversation frame on the previous domain and on the current domain.

The providing may include, if the current domain and the previous domain are the same and the conversation context on the two domains is switched, determining, on the previous domain, a candidate conversation frame associated with at least one of the current conversation frame and an initialization conversation frame initialized with respect to the current conversation frame.

The providing may include, if the current domain and the previous domain are different and the conversation context on the two domains is switched, determining at least one of: a candidate conversation frame for the current conversation frame based on the previous conversation frame; a candidate conversation frame for the current conversation frame on the previous domain; and a candidate conversation frame for an initialization conversation frame initialized with respect to the current conversation frame on the current domain.

The method may further include storing, matched with each other, conversation example information related to the previous conversation frame for each domain and counting information reflecting how frequently spoken voices related to that conversation example information occur.

The providing may include determining the priority of each candidate conversation frame based on the counting information matched with the pre-stored conversation example information for each of the at least one previous conversation frame, and providing response information for the candidate conversation frames in descending order of priority.

The storing may further include storing indexing information for indexing at least one speech element included in the conversation example information for each of the at least one previous conversation frame. The providing may include, referring to the pre-stored indexing information, providing response information for the candidate conversation frames for the spoken voice in descending order of the number of pieces of indexing information each candidate conversation frame has.

As described above, according to various embodiments of the present disclosure, the speech recognition apparatus in the interactive system may provide response information suited to the user's intention by considering the various possible cases of the user's spoken voice.

FIG. 1 is an exemplary diagram of an interactive system according to an embodiment of the present invention;

FIG. 2 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention;

FIG. 3 is a first exemplary diagram of determining a candidate conversation frame for providing response information about a spoken voice of a user in a speech recognition apparatus according to an embodiment of the present invention;

FIG. 4 is a second exemplary diagram of determining a candidate conversation frame for providing response information about a spoken voice of a user in a speech recognition apparatus according to another embodiment of the present invention;

FIG. 5 is a third exemplary diagram of determining a candidate conversation frame for providing response information about a spoken voice of a user in a speech recognition apparatus according to another embodiment of the present invention;

FIG. 6 is a fourth exemplary diagram of determining a candidate conversation frame for providing response information about a spoken voice of a user in a speech recognition apparatus according to another embodiment of the present invention;

FIG. 7 is a flowchart illustrating a method of providing response information corresponding to a spoken voice of a user in a speech recognition apparatus according to an exemplary embodiment of the present invention.


Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram of an interactive system according to an embodiment of the present invention.

As shown in FIG. 1, the interactive system includes a speech recognition apparatus 100 and a display apparatus 200. The speech recognition apparatus 100 receives the user's spoken voice signal (hereinafter referred to as a spoken voice) from the display apparatus 200 and determines which domain the received spoken voice belongs to. Thereafter, the speech recognition apparatus 100 generates response information for the user's spoken voice based on the determined domain (hereinafter referred to as the current domain) and the conversation pattern on the previous domain predetermined from the user's previous spoken voice, and transmits the response information to the display apparatus 200.

The display apparatus 200 may be a smart TV, but this is only an example; it may be implemented as any of various electronic devices, such as a mobile phone (for example, a smartphone), a desktop PC, a notebook, or a navigation device. The display apparatus 200 collects the user's spoken voice and transmits it to the speech recognition apparatus 100. As described above, the speech recognition apparatus 100 then determines the current domain to which the spoken voice received from the display apparatus 200 belongs, generates response information for the user's spoken voice based on the conversation pattern on the determined current domain and on the previous domain determined from the user's previous spoken voice, and transmits the response information to the display apparatus 200. The display apparatus 200 may then output the response information received from the speech recognition apparatus 100 through a speaker or display it on the screen.

In detail, when a spoken voice of a user is received from the display apparatus 200, the speech recognition apparatus 100 analyzes the received spoken voice to determine the current domain for it. Subsequently, the speech recognition apparatus 100 provides response information for the user's spoken voice on at least one of the current domain and the previous domain, based on the conversation state on the current domain and on the previous domain predetermined from the user's previous spoken voice.

Specifically, the speech recognition apparatus 100 determines whether the previous domain and the current domain are the same, and if the two domains are the same, analyzes the conversation patterns on the two domains to determine whether the same conversation context is maintained. As a result of the determination, when the same dialogue context is maintained, the voice recognition apparatus 100 may generate response information about the spoken voice of the current user on the previous domain and transmit the response information to the display apparatus 200.

However, if analysis of the conversation patterns on the two domains shows that the conversation context has been switched, that the same conversation context is maintained across different domains, or that the conversation context has been switched across different domains, the speech recognition apparatus 100 may provide response information for the user's current spoken voice on both domains, based on the current conversation frame for the user's current spoken voice and the previous conversation frame for the user's previous spoken voice.

For example, while the previous domain, the VOD domain, has been determined from the user's previous spoken voice "What is the animation VOD?", the user's spoken voice "What is a TV program?" may be received. In this case, the speech recognition apparatus 100 extracts the first speech element indicating an execution command, "TV program", from the spoken voice "What is a TV program?" and, based on the extracted first speech element, may generate the current conversation frame "search_program ()". In addition, the speech recognition apparatus 100 may determine from the spoken voice "What is a TV program?" that the current domain for providing response information for the user's spoken voice is the TV program domain.

When the current domain is determined, the speech recognition apparatus 100 compares the previous domain with the current domain and, if the two domains differ, analyzes the conversation patterns on the two domains to determine whether the conversation context has been switched. As in the above example, the spoken voice uttered by the user on the previous domain, the VOD domain, may be "What is the animation VOD?". When it is determined that the two domains differ and that the conversation context on the two domains has been switched, the speech recognition apparatus 100 may determine a plurality of candidate conversation frames for providing response information for the user's current spoken voice on the two domains. Here, the candidate conversation frames may be a previous conversation frame generated from the user's previous spoken voice, a current conversation frame generated from the current spoken voice, and an initialization conversation frame initialized with respect to the current conversation frame.

When the plurality of candidate conversation frames are determined, the speech recognition apparatus 100 generates response information about the spoken voice of the user based on the candidate conversation frames determined for each domain and transmits the response information to the display apparatus 200.

As described above, even when the user's current spoken voice is unrelated to the previous spoken voice, or when the domains related to the two spoken voices differ and the user's intention is therefore unclear, the speech recognition apparatus 100 according to the present invention may provide response information for the user's spoken voice by applying the various possible cases.

So far, the interactive system according to the present invention has been outlined. Hereinafter, the speech recognition apparatus 100 that provides response information corresponding to the spoken voice of the user in the interactive system according to the present invention will be described in detail.

FIG. 2 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.

As shown in FIG. 2, the speech recognition apparatus 100 includes a communication unit 110, a voice recognition unit 120, an extraction unit 130, a domain determination unit 140, a control unit 150, and a storage unit 160.

The communication unit 110 performs wired or wireless data communication with the display apparatus 200 to receive the user's spoken voice recognized through the display apparatus 200 and to transmit the response information generated for the received spoken voice to the display apparatus 200. Here, the response information may include content-related information or keyword search result information requested by the user.

The communication unit 110 may include various communication modules, such as a short-range wireless communication module (not shown) and a wireless communication module (not shown). Here, the short-range wireless communication module is a module for communicating with a nearby external device according to a short-range wireless communication scheme such as Bluetooth or ZigBee. The wireless communication module is a module that connects to an external network and communicates according to a wireless communication protocol such as Wi-Fi, IEEE, and the like. The wireless communication module may further include a mobile communication module that accesses a mobile communication network and communicates according to various mobile communication standards such as 3rd generation (3G), 3rd generation partnership project (3GPP), and long term evolution (LTE).

The voice recognition unit 120 recognizes the user's spoken voice received from the display apparatus 200 through the communication unit 110 and converts it into text. According to an embodiment, the voice recognition unit 120 may convert the received spoken voice into text using a speech-to-text (STT) algorithm. Once the spoken voice has been converted into text, the extraction unit 130 extracts utterance elements from that text. In detail, the extraction unit 130 may extract utterance elements from the converted text based on a corpus table previously stored in the storage unit 160. Here, an utterance element is a keyword in the user's spoken voice for performing the operation requested by the user. Utterance elements can be classified into a first utterance element, which represents a user action (an execution command), and a second utterance element, which represents a main feature, that is, a target. For example, for the user's spoken voice "Show action movie!", the extraction unit 130 may extract the first utterance element "Show", indicating an execution command, and the second utterance element "action movie", indicating a target.
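The extraction step can be pictured with a small sketch. It is a minimal illustration, not the specification's implementation: the corpus table contents, the function name `extract_elements`, and the use of plain substring matching are all assumptions.

```python
from typing import Optional, Tuple

# Hypothetical corpus table: maps known phrases to an element type.
# "command" = first utterance element, "target" = second utterance element.
CORPUS_TABLE = {
    "show": "command",
    "find": "command",
    "action movie": "target",
    "animation": "target",
}


def extract_elements(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (first_element, second_element) found in the recognized text."""
    lowered = text.lower()
    command: Optional[str] = None
    target: Optional[str] = None
    # Check longer phrases first so multi-word entries are matched first.
    for phrase in sorted(CORPUS_TABLE, key=len, reverse=True):
        if phrase not in lowered:
            continue
        kind = CORPUS_TABLE[phrase]
        if kind == "command" and command is None:
            command = phrase
        elif kind == "target" and target is None:
            target = phrase
    return command, target


elements = extract_elements("Show action movie!")
```

Either element may be missing, which corresponds to the "at least one of" wording: an utterance that matches no table entry yields `(None, None)`.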

When at least one of the first and second utterance elements is extracted, the domain determination unit 140 determines the current domain for providing response information for the user's spoken voice, based on the main action and parameters corresponding to the first and second utterance elements extracted by the extraction unit 130. In detail, the domain determination unit 140 may generate a conversation frame (hereinafter referred to as the current conversation frame) based on the main action and parameters corresponding to the extracted first and second utterance elements. When the current conversation frame is generated, the domain determination unit 140 may determine the current domain to which the current conversation frame belongs by referring to a domain table previously stored in the storage unit 160.

Here, the domain table may be a table in which, for each of a plurality of preset domains, conversation frames are matched that were generated based on the main action corresponding to the first utterance element and the parameters corresponding to the second utterance element extracted from the user's previous spoken voice. Therefore, when the current conversation frame is generated, the domain determination unit 140 may obtain at least one domain to which the current conversation frame belongs by referring to the domain table previously stored in the storage unit 160, and may determine the obtained domain as the current domain.
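A domain table lookup of this kind might look like the sketch below. The table contents, the `(main_action, parameter names)` frame representation, and the function name `determine_domains` are assumptions chosen for illustration; the specification does not give the table's layout.

```python
from typing import Dict, FrozenSet, List, Tuple

# A conversation frame is modeled here as (main_action, parameter names).
Frame = Tuple[str, FrozenSet[str]]

# Hypothetical domain table: for each preset domain, the frames matched to it.
DOMAIN_TABLE: Dict[str, List[Frame]] = {
    "VOD": [
        ("search_program", frozenset({"genre"})),
        ("search_program", frozenset({"content_rating"})),
        ("search_program", frozenset({"title"})),
    ],
    "TV program": [
        ("search_program", frozenset()),
        ("search_program", frozenset({"title"})),
    ],
}


def determine_domains(frame: Frame) -> List[str]:
    """Return every domain whose table entry contains the given frame."""
    return [domain for domain, frames in DOMAIN_TABLE.items() if frame in frames]


title_domains = determine_domains(("search_program", frozenset({"title"})))
```

Returning a list rather than a single name reflects the "at least one domain" wording: a frame may belong to more than one domain.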

Meanwhile, the controller 150 controls the overall operation of each component of the speech recognition apparatus 100. In particular, the controller 150 determines a candidate conversation frame for providing response information for the user's spoken voice on at least one of the current domain and the previous domain, based on the conversation state on the current domain determined by the domain determination unit 140 and on the previous domain determined from the user's previous spoken voice.

In detail, the controller 150 determines whether the current domain and the previous domain are the same, and whether the conversation context has been switched, from the current conversation frame and the previous conversation frame generated in relation to the previous domain. According to the results of these two determinations, the controller 150 may then determine a candidate conversation frame for providing response information for the user's spoken voice on at least one of the current domain and the previous domain.

According to an embodiment, if it is determined that the current domain and the previous domain are the same and the conversation context on the two domains is not switched, the controller 150 may determine a candidate conversation frame for the current conversation frame based on the previous conversation frame.

If it is determined that the current domain and the previous domain are different and the conversation context on the two domains is not switched, the controller 150 may determine, based on the previous conversation frame, a candidate conversation frame for the current conversation frame on the previous domain and on the current domain.

If it is determined that the current domain and the previous domain are the same and the conversation context on the two domains is switched, the controller 150 may determine, on the previous domain, a candidate conversation frame associated with at least one of the current conversation frame and an initialization conversation frame initialized with respect to the current conversation frame.

If it is determined that the current domain and the previous domain are different and the conversation context on the two domains is switched, the controller 150 may determine at least one of: a candidate conversation frame for the current conversation frame based on the previous conversation frame; a candidate conversation frame for the current conversation frame on the previous domain; and a candidate conversation frame for an initialization conversation frame initialized with respect to the current conversation frame on the current domain.

As such, when at least one candidate conversation frame is determined according to whether the current domain and the previous domain are the same and whether the conversation context is switched on the two domains, the controller 150 generates response information about the determined candidate conversation frame, The generated response information may be transmitted to the display apparatus 200.
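The four combinations of "same domain?" and "context switched?" can be summarized as a small decision function. This is a sketch under assumptions: the names `Candidate` and `plan_candidates` and the string labels are hypothetical, and where the text does not say which domain a candidate belongs to, the choice made here is marked in a comment.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    domain: str  # "previous" or "current"
    basis: str   # how the candidate conversation frame is derived


def plan_candidates(same_domain: bool, context_switched: bool) -> List[Candidate]:
    """Map the two determinations to the candidate frames to build."""
    if same_domain and not context_switched:
        # Current frame interpreted in light of the previous frame.
        return [Candidate("previous", "current_with_previous")]
    if not same_domain and not context_switched:
        # Same derivation, but tried on both domains.
        return [
            Candidate("previous", "current_with_previous"),
            Candidate("current", "current_with_previous"),
        ]
    if same_domain and context_switched:
        # Current frame as-is, plus an initialized frame, on the previous domain.
        return [
            Candidate("previous", "current"),
            Candidate("previous", "initialized"),
        ]
    # Different domains and a switched context.
    return [
        # Assumption: the text names no domain for this candidate.
        Candidate("previous", "current_with_previous"),
        Candidate("previous", "current"),
        Candidate("current", "initialized"),
    ]


plan = plan_candidates(same_domain=False, context_switched=True)
```

The less certain the relation between the two utterances, the more candidates are kept, which is how the apparatus avoids committing to a single interpretation too early.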

Hereinafter, an operation of determining a candidate conversation frame for providing response information about the user's spoken voice based on the user's spoken voice and the previous spoken voice will be described in detail with reference to FIGS. 3 to 6.

FIG. 3 is a first exemplary diagram of determining a candidate dialogue frame for providing response information about a spoken voice of a user in a speech recognition apparatus according to an embodiment of the present invention.

As shown in FIG. 3, if it is determined that the domains determined in relation to the user's previous spoken voice and current spoken voice are the same, and that the conversation context on the two domains is not switched, the controller 150 may determine a candidate conversation frame for the current conversation frame based on the previous conversation frame.

For example, as shown in the dialogue context area 310, the user's previous spoken voice may be "What is the animation VOD?", the previous conversation frame generated based on the utterance elements extracted from the previous spoken voice may be "search_program (genre = animation)", and the previous domain determined based on the previous conversation frame may be the VOD domain. In addition, the user's current spoken voice may be "Show only those for all audiences", the current conversation frame generated based on the utterance element extracted from the current spoken voice may be "search_program (content_rating)", and the current domain determined based on the current conversation frame may be the VOD domain.

In this case, the controller 150 may determine that the domains determined in relation to the user's previous spoken voice and current spoken voice are both the VOD domain. In addition, the controller 150 may analyze the user's conversation pattern from the previous conversation frame "search_program (genre = animation)" and the current conversation frame "search_program (content_rating)" and determine that the conversation context on the two domains is not switched. That is, the controller 150 may determine that the user's current spoken voice continues the VOD conversation context of the previous spoken voice.

As such, when it is determined that the domain determined in relation to the user's previous spoken voice and the current spoken voice is the same, and the conversation context on the two domains is not switched, the controller 150 determines the user's spoken voice on the previous domain, the VOD domain. The candidate dialog frame 320 for providing response information may be determined.

In detail, the controller 150 may determine the current conversation frame “search_program (content_rating)” as the candidate conversation frame 320 based on the previous conversation frame “search_program (genre = animation)”.

As such, when the candidate conversation frame 320 is determined, the controller 150 may, on the previous domain (the VOD domain), search for animations rated for all audiences based on the previous conversation frame "search_program (genre = animation)" and the frame "search_program (content_rating)" determined as the candidate conversation frame 320, generate response information including the search result information, and transmit it to the display apparatus 200.
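Combining the previous and current frames into one search, as in this FIG. 3 case, could be sketched as follows. The `Frame` class, the `merge_frames` function, and in particular the slot value `"all"` for `content_rating` are assumptions for illustration; the text gives the frame only as "search_program (content_rating)" without a value.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Frame:
    action: str
    slots: Dict[str, str] = field(default_factory=dict)


def merge_frames(previous: Frame, current: Frame) -> Frame:
    """Carry the previous frame's slots forward; current slots win on conflict."""
    if previous.action != current.action:
        # Actions differ, so this sketch inherits nothing from the previous frame.
        return current
    return Frame(current.action, {**previous.slots, **current.slots})


previous_frame = Frame("search_program", {"genre": "animation"})
# Hypothetical value "all" stands for an all-audiences rating.
current_frame = Frame("search_program", {"content_rating": "all"})
query = merge_frames(previous_frame, current_frame)
```

The merged frame carries both constraints, so a search on the VOD domain would be restricted to animations rated for all audiences.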

FIG. 4 is a second exemplary view for determining a candidate dialogue frame for providing response information about a spoken voice of a user in a speech recognition apparatus according to another embodiment of the present invention.

As shown in FIG. 4, if it is determined that the current domain and the previous domain are different and the conversation context on the two domains is not switched, the controller 150 may determine the current conversation frame on both domains based on the previous conversation frame. Candidate conversation frames may be determined.

For example, as shown in the dialogue context area 410, the user's previous spoken voice may be "What is the animation VOD?", the previous conversation frame generated based on the speech elements extracted from the previous spoken voice may be "search_program (genre = animation)", and the previous domain determined based on the previous conversation frame may be the VOD domain. The user's current spoken voice may then be "Show ○○○ animation", the current conversation frame generated based on the speech elements extracted from the current spoken voice may be "search_program (title = ○○○ animation)", and the current domains determined based on the current conversation frame may be the TV program domain and the VOD domain.

Therefore, when the domain determined in relation to the user's current spoken voice is the TV program domain, the controller 150 may determine that it is different from the VOD domain, which is the previous domain determined in relation to the previous spoken voice. In addition, the controller 150 may analyze the user's conversation pattern from the previous conversation frame "search_program (genre = animation)" and the current conversation frame "search_program (title = ○○○ animation)" and determine that the conversation context on the two different domains has not been switched.

As such, when it is determined that the two domains determined in relation to the previous spoken voice and the current spoken voice of the user are different from each other, and that the conversation context on the two domains has not been switched, the controller 150 may determine first and second candidate conversation frames 420 and 430 for providing response information about the user's spoken voice on the two domains.

Specifically, the controller 150 may change the current conversation frame "search_program (title = ○○○ animation)" to "play_program (title = ○○○ animation)" based on the previous conversation frame "search_program (genre = animation)", and determine the changed "play_program (title = ○○○ animation)" as the first candidate conversation frame 420. In addition, the controller 150 may determine "search_program (title = ○○○ animation)", which is the current conversation frame, as the second candidate conversation frame 430.

As such, when the first and second candidate conversation frames 420 and 430 are determined, the controller 150 may provide response information about the user's spoken voice based on the determined first and second candidate conversation frames 420 and 430.

In detail, in order to provide response information for the first candidate conversation frame 420, the controller 150 searches the previously searched animations for the ○○○ animation on the VOD domain, which is the previous domain, based on the previous conversation frame "search_program (genre = animation)" and the first candidate conversation frame 420 "play_program (title = ○○○ animation)", and generates execution information for the retrieved ○○○ animation.

In addition, in order to provide response information for the second candidate conversation frame 430, the controller 150 searches for the ○○○ animation on the TV program domain, which is the current domain, based on the second candidate conversation frame 430 "search_program (title = ○○○ animation)", and generates search result information for the retrieved ○○○ animation.

Subsequently, the controller 150 may generate response information including the execution information for the ○○○ animation generated in relation to the first candidate conversation frame 420 and the search result information for the ○○○ animation generated in relation to the second candidate conversation frame 430, and transmit the response information to the display apparatus 200.

FIG. 5 is a third exemplary view of determining a candidate dialogue frame for providing response information about a spoken voice of a user in a speech recognition apparatus according to another embodiment of the present invention.

As shown in FIG. 5, if it is determined that the current domain and the previous domain are identical to each other and that the conversation context on the two domains has been switched, the controller 150 may determine, on the previous domain, a candidate conversation frame associated with at least one of the current conversation frame and an initialization conversation frame initialized with respect to the current conversation frame.

For example, as shown in the dialogue context area 510, the user's previous spoken voice may be "What is the animation VOD?", the previous conversation frame generated based on the speech elements extracted from the previous spoken voice may be "search_program (genre = animation)", and the previous domain determined based on the previous conversation frame may be the VOD domain. The user's current spoken voice may then be "What is the action VOD?", the current conversation frame generated based on the speech elements extracted from the current spoken voice may be "search_program (genre = action)", and the current domain determined based on the current conversation frame may be the VOD domain.

In this case, the controller 150 may determine that the domains determined in relation to both the previous spoken voice and the current spoken voice are the VOD domain. In addition, the controller 150 may analyze the user's conversation pattern from the previous conversation frame "search_program (genre = animation)" and the current conversation frame "search_program (genre = action)" and determine that the conversation context on the two identical domains has been switched.

As such, when it is determined that the two domains determined in relation to the previous spoken voice and the current spoken voice of the user are identical to each other and that the conversation context on the two domains has been switched, the controller 150 may determine first and second candidate conversation frames 520 and 530 for providing response information about the user's spoken voice on the VOD domain, which is the previous domain.

Specifically, the controller 150 may change the current conversation frame "search_program (genre = action)" to "search_program (genre = action animation)" based on the previous conversation frame "search_program (genre = animation)", and determine the changed "search_program (genre = action animation)" as the first candidate conversation frame 520. In addition, the controller 150 may determine "search_program (genre = action)", which is the current conversation frame, as the second candidate conversation frame 530.

As such, when the first and second candidate conversation frames 520 and 530 are determined, the controller 150 may provide response information about the user's spoken voice based on the determined first and second candidate conversation frames 520 and 530.

In detail, in order to provide response information for the first candidate conversation frame 520, the controller 150 searches the previously searched animations for action animations on the VOD domain, which is the previous domain, based on the previous conversation frame "search_program (genre = animation)" and the first candidate conversation frame 520 "search_program (genre = action animation)", and generates search result information about the action animations found.

In addition, in order to provide response information for the second candidate conversation frame 530, the controller 150 generates search result information about action-related content among the content provided on VOD, based on "search_program (genre = action)", the second candidate conversation frame 530, on the VOD domain that is the previous domain.

Subsequently, the controller 150 may generate response information including the search result information on the action animations generated in relation to the first candidate conversation frame 520 and the search result information on the action-related content generated in relation to the second candidate conversation frame 530, and transmit the response information to the display apparatus 200.

FIG. 6 is a fourth exemplary diagram of determining a candidate dialogue frame for providing response information about a spoken voice of a user in a speech recognition apparatus according to another embodiment of the present invention.

As illustrated in FIG. 6, if it is determined that the current domain and the previous domain are different from each other and that the conversation context on the two domains has been switched, the controller 150 may determine at least one of a candidate conversation frame for the current conversation frame based on the previous conversation frame, a candidate conversation frame for the current conversation frame on the previous domain, and a candidate conversation frame for the initialization conversation frame initialized with respect to the current conversation frame on the current domain.

For example, as shown in the dialogue context area 610, the user's previous spoken voice may be "What is the animation VOD?", the previous conversation frame generated based on the speech elements extracted from the previous spoken voice may be "search_program (genre = animation)", and the previous domain determined based on the previous conversation frame may be the VOD domain. The user's current spoken voice may then be "Then what TV program?", the current conversation frame generated based on the speech element extracted from the current spoken voice may be "search_program ()", and the current domain determined based on the current conversation frame may be the TV program domain.

In this case, the controller 150 may determine that the domains determined in relation to the previous spoken voice and the current spoken voice are different. In addition, the controller 150 may analyze the user's conversation pattern from the previous conversation frame "search_program (genre = animation)" and the current conversation frame "search_program ()" and determine that the conversation context on the two different domains has been switched.

As such, when it is determined that the two domains determined in relation to the previous spoken voice and the current spoken voice of the user are different from each other, and that the conversation context on the two domains has been switched, the controller 150 may determine first to third candidate conversation frames 620 to 640 for providing response information about the user's spoken voice on the two domains.

Specifically, the controller 150 may change the current conversation frame "search_program ()" to "search_program (genre = animation)" based on the previous conversation frame "search_program (genre = animation)", and determine the changed "search_program (genre = animation)" as the first candidate conversation frame 620. In addition, the controller 150 may determine "search_program ()", which is the current conversation frame, as the second candidate conversation frame 630. In addition, the controller 150 may determine the initialization conversation frame, initialized with respect to the current conversation frame, as the third candidate conversation frame 640. Here, since the current conversation frame is "search_program ()", the initialization conversation frame may be the same as the current conversation frame "search_program ()". If the current conversation frame had been generated based on driving and parameters corresponding to the first and second speech elements, the initialization conversation frame would be a conversation frame generated based only on the driving corresponding to the first speech element, excluding the parameters corresponding to the second speech element.
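The relationship between a conversation frame and its initialization conversation frame can be sketched as follows. This is only an illustration under stated assumptions: the `DialogFrame` class, its `action` and `params` fields, and the `initialized` method are hypothetical names that do not appear in the specification, and the frame is modeled simply as driving plus a tuple of parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DialogFrame:
    """Hypothetical model of a conversation frame (names are assumptions)."""
    action: str          # driving from the first speech element, e.g. "search_program"
    params: tuple = ()   # (name, value) pairs from the second speech element

    def initialized(self) -> "DialogFrame":
        """Keep the driving and drop every parameter."""
        return DialogFrame(self.action)

    def __str__(self) -> str:
        args = ", ".join(f"{name} = {value}" for name, value in self.params)
        return f"{self.action} ({args})"


frame = DialogFrame("search_program", (("genre", "action"),))
print(frame)                # search_program (genre = action)
print(frame.initialized())  # search_program ()
```

For a frame that already has no parameters, such as "search_program ()", `initialized` returns an equal frame, which matches the case described for FIG. 6.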

As such, when the first to third candidate conversation frames 620 to 640 are determined, the controller 150 may provide response information about the user's spoken voice based on the determined first to third candidate conversation frames 620 to 640.

In detail, in order to provide response information for the first candidate conversation frame 620, the controller 150 searches for animations on the TV program domain, which is the current domain, based on the first candidate conversation frame 620 "search_program (genre = animation)", and generates search result information about the animations found.

In addition, in order to provide response information for the second candidate conversation frame 630, the controller 150 generates search result information on TV program related content provided on VOD, based on the second candidate conversation frame 630 "search_program ()" on the VOD domain, which is the previous domain.

In addition, in order to provide response information for the third candidate conversation frame 640, the controller 150 generates search result information on TV program related content provided on TV programs, based on "search_program ()", which is the third candidate conversation frame 640, on the TV program domain that is the current domain.

Subsequently, the controller 150 may generate response information including the search result information on animations generated in relation to the first candidate conversation frame 620 and the search result information on TV program related content generated in relation to the second and third candidate conversation frames 630 and 640, and transmit the response information to the display apparatus 200.

Meanwhile, the controller 150 may determine the priority of the at least one candidate conversation frame determined in the above-described embodiments according to a preset condition, and then provide response information for the candidate conversation frames in descending order of priority.

According to an embodiment of the present disclosure, the controller 150 determines the priority of the at least one determined candidate conversation frame based on the counting information matched to the conversation example information for each conversation frame previously stored in the storage 160. Subsequently, the controller 150 may provide response information for the candidate conversation frames in descending order of the determined priority.

In detail, as described above, the storage 160 may store a domain table in which, for each of a plurality of preset domains, previous conversation frames generated based on speech elements extracted from the user's previous spoken voice are matched. In addition, the storage 160 may match and store, for the previous conversation frames matched to each of the plurality of domains, the related conversation example information and counting information according to the frequency of the user's spoken voice related to that conversation example information.

For example, the previous conversation frame for "search_program (genre = animation)" may be matched to the VOD domain and the TV program domain. The previous conversation frame for "search_program (genre = animation)" matched to each domain may in turn be matched with conversation example information related to the user's previous spoken voice, such as "What is animation?" and "Find animation", and with counting information according to the frequency of the user's spoken voice related to that conversation example information.

Accordingly, when a plurality of candidate conversation frames are determined, the controller 150 may determine a rank for each candidate conversation frame based on counting information about the matched conversation example information with respect to each candidate conversation frame determined.

For example, as described with reference to FIG. 6, the first to third candidate conversation frames 620 to 640 may be determined. Among them, the frequency for the conversation example information related to the first candidate conversation frame 620 for "search_program (genre = animation)" on the TV program domain may be the highest, and the frequency for the conversation example information related to the second candidate conversation frame 630 for "search_program ()" on the VOD domain may be the lowest.

In this case, the controller 150 generates response information including the search result information generated based on the first to third candidate conversation frames 620 to 640 and ranking information on the first to third candidate conversation frames 620 to 640, and transmits the response information to the display apparatus 200. Accordingly, the display apparatus 200 may display the respective search result information in descending order of the ranking of the corresponding candidate conversation frames, based on the ranking information included in the received response information.
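The ranking by counting information described above can be sketched as a simple sort. The sketch assumes that each candidate is identified by a (domain, frame) pair and that the counting information is available as a dictionary; the function name `rank_by_count` and the count values are illustrative assumptions, chosen only so that frame 620 has the highest frequency and frame 630 the lowest, as in the example.

```python
def rank_by_count(candidates, counts):
    """Sort candidates by descending utterance frequency.

    candidates: list of (domain, frame) pairs (hypothetical representation).
    counts: dict mapping (domain, frame) to its counting information.
    Candidates with no stored count are treated as frequency 0.
    """
    return sorted(candidates, key=lambda c: counts.get(c, 0), reverse=True)


# Illustrative counts only; the specification gives no concrete numbers.
counts = {
    ("TV program", "search_program (genre = animation)"): 42,
    ("VOD", "search_program ()"): 3,
    ("TV program", "search_program ()"): 17,
}
candidates = [
    ("TV program", "search_program (genre = animation)"),  # candidate frame 620
    ("VOD", "search_program ()"),                          # candidate frame 630
    ("TV program", "search_program ()"),                   # candidate frame 640
]
ranked = rank_by_count(candidates, counts)
for rank, (domain, frame) in enumerate(ranked, start=1):
    print(rank, domain, frame)
```

The resulting order would serve as the ranking information sent with the search results, so the display side only needs to present results in that order.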

According to another exemplary embodiment, the controller 150 may refer to the indexing information of the conversation example information for each of the at least one previous conversation frame stored in the storage 160, and provide response information for the candidate conversation frames for the user's spoken voice in descending order of the number of pieces of indexing information.

In detail, the storage 160 may further store indexing information for indexing at least one speech element included in the conversation example information for each of the at least one previous conversation frame. For example, the previous conversation frame for "search_program (genre = animation)" is a conversation frame generated based on the first and second speech elements, and may therefore include indexing information for each of the first and second speech elements. Meanwhile, the previous conversation frame for "search_program ()" is a conversation frame generated based on the first speech element, and may therefore include indexing information only for the first speech element.

Therefore, when a plurality of candidate conversation frames are determined, the controller 150 refers to the number of pieces of indexing information for the speech elements constituting each candidate conversation frame, and ranks the plurality of candidate conversation frames in descending order of that number. Thereafter, the controller 150 generates response information including search result information about each candidate conversation frame and the ranking information determined for each candidate conversation frame, and transmits the response information to the display apparatus 200.

Accordingly, the display apparatus 200 may display the respective search result information in order of search result information for the candidate dialog frame having the highest ranking based on the ranking information included in the received response information.
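The indexing-based ranking can be sketched in the same way. The sketch below assumes that a frame carries one piece of indexing information for its driving (first speech element) and one for each parameter (second speech element); the tuple representation and the names `index_count` and `rank_by_index_count` are hypothetical.

```python
def index_count(frame):
    """Count indexed speech elements: one for the driving, one per parameter."""
    action, params = frame
    return (1 if action else 0) + len(params)


def rank_by_index_count(frames):
    """Sort frames so that those with more indexing information come first."""
    return sorted(frames, key=index_count, reverse=True)


frames = [
    ("search_program", {}),                      # first speech element only
    ("search_program", {"genre": "animation"}),  # first and second speech elements
]
for frame in rank_by_index_count(frames):
    print(index_count(frame), frame)
```

Under this assumption, "search_program (genre = animation)" is ranked above "search_program ()" because it indexes two speech elements rather than one.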

Up to now, each configuration of the speech recognition apparatus 100 that provides response information corresponding to the spoken voice of the user in the interactive system according to the present invention has been described in detail. Hereinafter, a method of providing response information corresponding to a spoken voice of a user in the speech recognition apparatus 100 of the interactive system according to the present invention will be described in detail.

As illustrated in FIG. 7, when a spoken voice signal of the user (hereinafter referred to as a spoken voice) collected by the display apparatus 200 is received, the speech recognition apparatus 100 extracts, from the received spoken voice, at least one of a first speech element representing an execution command and a second speech element representing a target (S710 and S720).

In detail, when a spoken voice of a user is received from the display apparatus 200, the speech recognition apparatus 100 recognizes the received spoken voice and converts it into text. According to an embodiment, the speech recognition apparatus 100 may convert the received spoken voice into text using a speech to text (STT) algorithm. When the user's spoken voice has been converted into text, the speech recognition apparatus 100 extracts at least one of the first speech element representing the execution command and the second speech element representing the target from the text. For example, for the spoken voice "Find an action movie!", the speech recognition apparatus 100 may extract the first speech element "Find", which represents the execution command, and the second speech element "action movie", which represents the target.
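As a toy illustration of this extraction step, the sketch below splits already-transcribed text into a command word and a target phrase using a small word list. The specification does not say how the extraction is performed, so the lexicon `COMMAND_WORDS`, the stop-word list, and the function `extract_elements` are all assumptions; a real system would rely on a trained language-understanding component rather than keyword matching.

```python
# Hypothetical lexicons, for illustration only.
COMMAND_WORDS = {"find", "show", "play"}
STOP_WORDS = {"a", "an", "the", "me"}


def extract_elements(text):
    """Return (first_element, second_element) from transcribed text.

    first_element: the first word found in COMMAND_WORDS (execution command).
    second_element: the remaining non-stop words joined together (target).
    Either value is None when nothing matching is found.
    """
    command = None
    target_words = []
    for word in text.strip(" !?.").split():
        lowered = word.lower()
        if command is None and lowered in COMMAND_WORDS:
            command = word
        elif lowered not in STOP_WORDS:
            target_words.append(word)
    return command, (" ".join(target_words) or None)


print(extract_elements("Find an action movie!"))  # ('Find', 'action movie')
```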

When such speech elements are extracted, the speech recognition apparatus 100 determines a current domain for providing response information about the user's spoken voice based on the extracted first and second speech elements (S730). In detail, the speech recognition apparatus 100 may determine the current domain based on driving and parameters corresponding to the extracted first and second speech elements. More specifically, the speech recognition apparatus 100 generates a current conversation frame based on the driving and parameters corresponding to the first and second speech elements extracted from the user's spoken voice. When the current conversation frame is generated, the speech recognition apparatus 100 may determine the current domain to which the current conversation frame belongs by referring to a predetermined domain table. Here, the domain table may be a table in which, for each of a plurality of preset domains, conversation frames generated based on the driving corresponding to the first speech element and the parameters corresponding to the second speech element extracted from the user's previous spoken voice are matched.

Therefore, when the current conversation frame is generated, the speech recognition apparatus 100 may obtain at least one domain to which the current conversation frame belongs by referring to the previously stored domain table, and determine the obtained domain as the current domain.
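The domain table lookup can be sketched as a membership test over per-domain sets of frame shapes. The table contents below are invented for illustration, and keying a frame by its driving plus its sorted parameter names is an assumption; the specification only states that frames are matched to domains in the table.

```python
# Hypothetical domain table: each domain lists the frame shapes matched to it,
# keyed by (driving, sorted parameter names).
DOMAIN_TABLE = {
    "VOD": {
        ("search_program", ("genre",)),
        ("search_program", ("title",)),
        ("play_program", ("title",)),
    },
    "TV program": {
        ("search_program", ("genre",)),
        ("search_program", ("title",)),
        ("search_program", ()),
    },
}


def current_domains(action, params):
    """Return every domain whose table entry matches the current frame."""
    key = (action, tuple(sorted(params)))
    return [domain for domain, frames in DOMAIN_TABLE.items() if key in frames]


print(current_domains("search_program", {"title": "○○○ animation"}))
print(current_domains("play_program", {"title": "○○○ animation"}))
```

A frame may match more than one domain, which corresponds to the case in which both the TV program domain and the VOD domain are determined as current domains.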

As such, when the current domain for the user's spoken voice is determined, the speech recognition apparatus 100 determines a candidate conversation frame for providing response information about the user's spoken voice on at least one of the current domain and the previous domain, based on a conversation state on the current domain and on the previous domain predetermined from the user's previous spoken voice (S740).

Specifically, when the current domain for the user's spoken voice is determined, the speech recognition apparatus 100 determines whether the current domain is the same as the previous domain, and determines whether the conversation context has been switched from the current conversation frame and the previous conversation frame generated in relation to the previous domain. Thereafter, the speech recognition apparatus 100 may determine a candidate conversation frame for providing response information about the user's spoken voice on at least one of the current domain and the previous domain, according to the result of determining whether the two domains are the same and the result of determining whether the conversation context has been switched.

According to an embodiment, if it is determined that the current domain and the previous domain are the same and that the conversation context on the two domains has not been switched, the speech recognition apparatus 100 may determine a candidate conversation frame for the current conversation frame based on the previous conversation frame.

On the other hand, if it is determined that the current domain and the previous domain are different and that the conversation context on the two domains has not been switched, the speech recognition apparatus 100 may determine candidate conversation frames for the current conversation frame on the previous domain and the current domain based on the previous conversation frame.

On the other hand, if it is determined that the current domain and the previous domain are the same and that the conversation context on the two domains has been switched, the speech recognition apparatus 100 may determine, on the previous domain, a candidate conversation frame associated with at least one of the current conversation frame and the initialization conversation frame initialized with respect to the current conversation frame.

On the other hand, if it is determined that the current domain and the previous domain are different and that the conversation context on the two domains has been switched, the speech recognition apparatus 100 may determine at least one of a candidate conversation frame for the current conversation frame based on the previous conversation frame, a candidate conversation frame for the current conversation frame on the previous domain, and a candidate conversation frame for the initialization conversation frame initialized with respect to the current conversation frame on the current domain.
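The four cases above amount to a decision table over two conditions: whether the domains are the same and whether the conversation context has been switched. The sketch below is one literal reading of those cases, not a definitive implementation. The `Basis` enum and the function `candidate_plan` are hypothetical names, the output says only which frame to use on which domain, and placing the first candidate of the last case on the current domain follows the FIG. 6 example rather than the general wording.

```python
from enum import Enum


class Basis(Enum):
    WITH_PREVIOUS = "current frame interpreted using the previous frame"
    CURRENT = "current frame as generated"
    INITIALIZED = "current frame with its parameters cleared"


def candidate_plan(same_domain, context_switched):
    """Return (domain, basis) pairs describing the candidate conversation frames."""
    if same_domain and not context_switched:
        return [("previous", Basis.WITH_PREVIOUS)]
    if not same_domain and not context_switched:
        return [("previous", Basis.WITH_PREVIOUS), ("current", Basis.WITH_PREVIOUS)]
    if same_domain and context_switched:
        return [("previous", Basis.CURRENT), ("previous", Basis.INITIALIZED)]
    # Different domains, context switched. The domain of the first candidate
    # is an assumption taken from the FIG. 6 example.
    return [
        ("current", Basis.WITH_PREVIOUS),
        ("previous", Basis.CURRENT),
        ("current", Basis.INITIALIZED),
    ]


for same in (True, False):
    for switched in (False, True):
        plan = candidate_plan(same, switched)
        print(same, switched, [(d, b.name) for d, b in plan])
```

Because each case is described as determining "at least one of" the listed frames, an implementation could also return only a subset of each list.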

As such, when at least one candidate conversation frame is determined according to whether the current domain and the previous domain are the same and whether the conversation context on the two domains has been switched, the speech recognition apparatus 100 generates response information for the determined candidate conversation frame and transmits the generated response information to the display apparatus 200 (S750).

In detail, the speech recognition apparatus 100 may determine the priority of the at least one candidate conversation frame determined in the above embodiments according to a preset condition, and then provide response information for the candidate conversation frames in descending order of priority.

According to an embodiment, the speech recognition apparatus 100 determines the priority of the at least one candidate conversation frame determined based on the counting information matched with the conversation example information for each of the at least one previous conversation frame. Thereafter, the speech recognition apparatus 100 may provide the display apparatus 200 with response information about the candidate dialogue frame in the order of the candidate dialogue frames having the highest priority based on the determined priority.

Specifically, before each of the above-described steps, the speech recognition apparatus 100 may match and store the previous conversation frames matched to each of a plurality of domains, the conversation example information related to the previous conversation frames matched to each domain, and counting information according to the frequency of the user's spoken voice related to the conversation example information.

Therefore, when a plurality of candidate conversation frames are determined, the speech recognition apparatus 100 may determine a rank for each candidate conversation frame based on the counting information for the conversation example information matched to each determined candidate conversation frame. When the ranking of the plurality of candidate conversation frames is determined, the speech recognition apparatus 100 generates response information including the respective search result information generated based on the plurality of candidate conversation frames and the ranking information of each candidate conversation frame, and transmits the response information to the display apparatus 200. Accordingly, the display apparatus 200 may display the respective search result information in descending order of the ranking of the corresponding candidate conversation frames, based on the ranking information included in the received response information.

According to another exemplary embodiment, the speech recognition apparatus 100 may refer to the indexing information of the conversation example information for each of the at least one previous conversation frame, and rank the candidate conversation frames for providing response information about the user's spoken voice in descending order of the number of pieces of indexing information. Thereafter, the speech recognition apparatus 100 generates response information including search result information about each candidate conversation frame and the ranking information determined for each candidate conversation frame, and transmits the response information to the display apparatus 200.

The present invention has been described above with reference to its preferred embodiments.

While preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications can be made by those of ordinary skill in the technical field to which the invention belongs without departing from the spirit of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or scope of the present invention.

Claims

A speech recognition device comprising:

an extraction unit that extracts at least one of a first speech element representing an execution command and a second speech element representing a target from a spoken voice of a user;

a domain determination unit that determines a current domain for providing response information for the spoken voice based on the first and second speech elements; and

a control unit that determines a candidate conversation frame for providing response information for the spoken voice on at least one of the current domain and a previous domain, based on a conversation state on the current domain and on the previous domain predetermined from a previous spoken voice of the user.
The speech recognition device of claim 1,

wherein the domain determination unit determines the current domain for providing response information for the spoken voice based on driving and parameters corresponding to the first and second speech elements extracted by the extraction unit.
The speech recognition device of claim 2,

wherein the control unit determines whether the current domain and the previous domain are the same, and whether the conversation context has been switched from the current conversation frame and a previous conversation frame generated in relation to the previous domain, and thereby determines a candidate conversation frame for providing response information for the spoken voice on at least one of the current domain and the previous domain.
The speech recognition device of claim 3,

wherein, if the current domain and the previous domain are the same and the conversation context on the two domains has not been switched, the control unit determines a candidate conversation frame for the current conversation frame based on the previous conversation frame.
The speech recognition device of claim 3,

wherein, if the current domain and the previous domain are different and the conversation context on the two domains has not been switched, the control unit determines candidate conversation frames for the current conversation frame on the previous domain and the current domain based on the previous conversation frame.
The speech recognition device of claim 3,

wherein, if the current domain and the previous domain are the same and the conversation context on the two domains has been switched, the control unit determines, on the previous domain, a candidate conversation frame associated with at least one of the current conversation frame and an initialization conversation frame initialized with respect to the current conversation frame.
The speech recognition device of claim 3,

Wherein, if the current domain and the previous domain are different and the conversation context between the two domains has switched, the control unit determines at least one of: a candidate conversation frame for the current conversation frame based on the previous conversation frame; a candidate conversation frame for the current conversation frame on the previous domain; and a candidate conversation frame for an initialization conversation frame initialized with respect to the current conversation frame on the current domain.
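The four cases recited above (same or different domain, conversation context switched or not) can be sketched as a small decision function. This is only an illustrative sketch: the function name, the string labels, and the tuple representation of a candidate are assumptions made for the example and do not appear in the claims. Where a claim recites "at least one of" several candidates, the sketch simply returns all of them.

```python
from typing import List, Tuple

# Hypothetical representation of a candidate: (frame_basis, domain).
#   "current+previous": current frame interpreted using the previous frame
#   "current":          current frame taken on its own
#   "initialized":      initialization frame derived from the current frame
Candidate = Tuple[str, str]


def candidate_frames(current_domain: str,
                     previous_domain: str,
                     context_switched: bool) -> List[Candidate]:
    """Return candidate conversation frames for the four recited cases."""
    same_domain = current_domain == previous_domain

    if not context_switched:
        if same_domain:
            # Same domain, context kept: build on the previous frame.
            return [("current+previous", current_domain)]
        # Different domains, context kept: build on the previous frame
        # on both the previous and the current domain.
        return [("current+previous", previous_domain),
                ("current+previous", current_domain)]

    if same_domain:
        # Same domain, context switched: current frame and its
        # initialization frame, both on the previous domain.
        return [("current", previous_domain),
                ("initialized", previous_domain)]

    # Different domains, context switched. Placing the first candidate
    # on the previous domain is an assumption; the claim does not name
    # a domain for it.
    return [("current+previous", previous_domain),
            ("current", previous_domain),
            ("initialized", current_domain)]
```

For example, `candidate_frames("VOD", "TV program", False)` yields one candidate on each of the two domains, both built on the previous conversation frame.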
The speech recognition device of claim 1,

Further comprising a storage unit that matches and stores conversation example information related to the previous conversation frame matched to each domain, together with counting information according to the frequency of spoken voices related to the conversation example information.
The speech recognition device of claim 8,

Wherein the control unit determines a priority of the candidate conversation frames based on the counting information matched to the conversation example information for each of the at least one previous conversation frame stored in the storage unit, and provides response information for the candidate conversation frames in descending order of priority.
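The count-based prioritization recited above can be illustrated with a short sketch. The names `rank_by_count` and `counts`, and the use of a plain dictionary as the storage unit, are assumptions for illustration only; the claims do not specify a data structure.

```python
from typing import Dict, List


def rank_by_count(candidates: List[str], counts: Dict[str, int]) -> List[str]:
    """Order candidate frames by stored utterance frequency, highest first.

    `counts` stands in for the counting information held by the storage
    unit; frames with no stored count are treated as having a count of 0.
    """
    # sorted() is stable, so frames with equal counts keep their input order.
    return sorted(candidates, key=lambda frame: counts.get(frame, 0), reverse=True)
```

Response information would then be provided for the frames in the returned order.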
The speech recognition device of claim 8,

Wherein the storage unit further stores indexing information for indexing at least one spoken element included in the conversation example information for each of the at least one previous conversation frame, and

The control unit, with reference to the indexing information of the conversation example information for each of the at least one previous conversation frame stored in the storage unit, provides response information for the candidate conversation frames, among the candidate conversation frames for providing response information for the spoken voice, in descending order of the number of items of indexing information.
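Ordering by indexing information can be sketched in the same way. Here the indexing information is assumed, purely for illustration, to be a mapping from each candidate frame to the set of spoken elements indexed for it; `rank_by_index_size` and `index` are hypothetical names, and the claims do not define how the number of items of indexing information is computed.

```python
from typing import Dict, List, Set


def rank_by_index_size(candidates: List[str],
                       index: Dict[str, Set[str]]) -> List[str]:
    """Order candidate frames by how many indexed spoken elements each has."""
    # Frames without any stored indexing information count as zero entries.
    return sorted(candidates,
                  key=lambda frame: len(index.get(frame, set())),
                  reverse=True)
```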
A control method of a speech recognition device, the method comprising:

Extracting at least one of a first spoken element representing an execution command and a second spoken element representing a target from a user's spoken voice;

Determining a current domain for providing response information for the spoken voice based on the first and second spoken elements;

Determining a candidate conversation frame for providing response information for the spoken voice on at least one of the current domain and a previous domain, based on a conversation state on the current domain and on the previous domain, the previous domain having been predetermined from a previous spoken voice of the user; and

Providing response information for the spoken voice based on the candidate conversation frame.
The control method of claim 11,

Wherein the determining of the current domain comprises determining the current domain for providing response information for the spoken voice based on a main action and a parameter corresponding to the extracted first and second spoken elements.
The control method of claim 12,

Wherein the providing comprises determining whether the current domain and the previous domain are the same, and whether the conversation context has switched, based on the current conversation frame and a previous conversation frame generated in relation to the previous domain, and thereby determining a candidate conversation frame for providing response information for the spoken voice on at least one of the current domain and the previous domain.
The control method of claim 13,

Wherein the providing comprises, if the current domain and the previous domain are the same and the conversation context between the two domains has not switched, determining a candidate conversation frame for the current conversation frame based on the previous conversation frame.
The control method of claim 13,

Wherein the providing comprises, if the current domain and the previous domain are different and the conversation context between the two domains has not switched, determining a candidate conversation frame for the current conversation frame on the previous domain and on the current domain based on the previous conversation frame.