CN112489640A - Speech processing apparatus and speech processing method


Info

Publication number
CN112489640A
Authority
CN
China
Prior art keywords
sentence
text
speech
processing
specific
Prior art date
Legal status
Pending
Application number
CN201910783144.1A
Other languages
Chinese (zh)
Inventor
李丹 (Li Dan)
Current Assignee
Alpine Electronics Inc
Original Assignee
Alpine Electronics Inc
Priority date
Filing date
Publication date
Application filed by Alpine Electronics Inc filed Critical Alpine Electronics Inc
Priority to CN201910783144.1A
Publication of CN112489640A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L17/24 Interactive procedures; Man-machine interfaces, the user being prompted to utter a password or a predefined phrase
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The problem to be solved is to provide a speech processing apparatus and a speech processing method that spare the user from redoing work even when a sentence not intended to be sent as text is uttered while speech to be converted into text is being received. As a solution, a speech processing apparatus (1) includes: a speech input unit (11) that inputs speech uttered by a user; a sentence processing unit (13) that outputs the text of the sentences represented by the speech input through the speech input unit (11) during a speech reception period (the period in which speech to be converted into text is received), with any specific sentence removed; and a sentence transmission unit (15) that transmits the text output by the sentence processing unit (13). Rather than converting everything represented by the speech input during the speech reception period into text and transmitting it, the apparatus automatically removes the specific sentence.

Description

Speech processing apparatus and speech processing method
Technical Field
The present invention relates to a speech processing apparatus and a speech processing method, and more particularly to a technique for converting speech uttered by a user into text and transmitting that text.
Background
Conventionally, devices that perform speech recognition on speech uttered by a user are known. For example, patent document 1 describes a speech processing device in which, even while text is being read aloud by a speech synthesis function, the speech recognition function can interrupt the readout so that another function can be executed.
Also conventionally available as devices capable of speech recognition are those that input the voice uttered by the user, convert the input voice into text, and send that text as a chat-application message or as e-mail. With such a device, the user can send text of the desired content to a partner merely by speaking, without using his or her hands.
Documents of the prior art
Patent document
Patent document 1: Japanese Patent Application Laid-Open No. H10-161846
Disclosure of Invention
The above-described apparatus, which converts the voice uttered by the user into text and transmits it, has conventionally had the following problem. A conventional apparatus converts into text everything the user utters during the period in which it receives speech. Therefore, if during that period the user needs to utter, and does utter, a specific sentence that is not intended to be transmitted as text, the text that is ultimately produced ends up containing that specific sentence. For example, some conventional apparatuses have a function of executing a specific process corresponding to a specific sentence when the user utters that sentence. When such an apparatus is used and the user must have it execute a specific process, the user utters the specific sentence, and its text ends up included in the text that is finally produced. When text the user did not want transmitted is included in the final text, the user must cancel the transmission and redo the work of speaking the whole text again, which is troublesome for the user.
The present invention has been made to solve the above problem, and its object is to spare the user from redoing work even when a sentence not intended to be converted into text and transmitted is uttered while speech to be converted into text is being received.
To solve the above problem, the present invention transmits the text of the sentences represented by the speech input during the speech reception period, excluding any specific sentence. With this configuration, instead of converting everything represented by the received speech into text and transmitting it, the apparatus automatically removes the specific sentence and transmits the text with the specific sentence removed. Therefore, even if the user utters, during the speech reception period, a specific sentence that the user does not want transmitted as text, that sentence is automatically removed from the text that is finally transmitted, and the user need not cancel the transmission and speak the text over again.
Drawings
Fig. 1 is a block diagram showing an example of a functional configuration of a speech processing device according to embodiment 1 of the present invention.
Fig. 2 is a diagram showing an example of a chat room screen.
Fig. 3 is a diagram for explaining the relationship between periods including a speech reception period.
Fig. 4 is a diagram showing an example of the content of the full-speech text data.
Fig. 5 is a flowchart showing an example of the operation of the speech processing device according to embodiment 1 of the present invention.
Fig. 6 is a flowchart showing an example of the operation of the speech processing device according to embodiment 1 of the present invention.
Fig. 7 is a flowchart showing an example of the operation of the speech processing device according to embodiment 1 of the present invention.
Fig. 8 is a flowchart showing an example of the operation of the speech processing device according to embodiment 1 of the present invention.
Fig. 9 is a block diagram showing an example of a functional configuration of the speech processing device according to embodiment 2 of the present invention.
Fig. 10 is a diagram showing a case where a sentence is added to a message field.
Fig. 11 is a flowchart showing an example of the operation of the speech processing device according to embodiment 2 of the present invention.
Fig. 12 is a block diagram showing an example of a functional configuration of the speech processing device according to embodiment 3 of the present invention.
Description of reference numerals:
1, 1A, 1B: speech processing device
11: voice input unit
13, 13A, 13B: sentence processing unit
14: specific processing execution control unit
15: sentence transmission unit
16: voice command list storage unit (storage unit)
Detailed Description
< embodiment 1 >
Embodiment 1 of the present invention is described below with reference to the drawings. Fig. 1 is a block diagram showing an example of the functional configuration of the speech processing apparatus 1. The speech processing device 1 according to the present embodiment is mounted on a vehicle and has a function of providing a user interface for text chat, in which multiple people exchange text messages. In particular, the speech processing device 1 according to the present embodiment has a function of inputting, during a predetermined period of a text chat, the voice uttered by the person using the device (hereinafter simply the "user"), converting the text represented by that voice into text form, and transmitting it as a message (hereinafter the "message voice input function"). Using the message voice input function, the user can compose and send a message to the chat partner without manual input.
Further, the speech processing device 1 according to the present embodiment has the following function: when the user utters any of a plurality of voice commands prepared in advance, the uttered voice command is recognized, and a specific process corresponding to that voice command is executed, or another device is made to execute it (hereinafter the "voice command receiving function"). The voice commands are prepared in advance, and the user knows each voice command and the specific process executed when it is uttered. In the present embodiment, at least the sentence "Activate the wiper." is prepared as a voice command. This voice command is hereinafter referred to as the "wiper drive instruction command". The wiper drive instruction command instructs the start of wiper driving; when the user utters it, "drive the wiper" is executed as the corresponding specific process.
Note that the voice commands are not limited to the one described in the present embodiment. For example, a voice command may be a sentence instructing a search for facilities of a specific category under a specific condition, such as "Search for a nearby convenience store.", or a sentence instructing a route search to the user's home, such as "Home.". In response to such voice commands, the voice processing device 1 causes a navigation device, not shown, to execute the corresponding processing. Hereinafter, the vehicle on which the voice processing device 1 is mounted is referred to as the "own vehicle".
In the following description, it is assumed that voice commands are received as appropriate outside the voice reception period described later, and a description of the processing that the voice processing apparatus 1 executes for voice commands outside the voice reception period is omitted.
As shown in fig. 1, the voice processing apparatus 1 is connected to a microphone 2 and a touch screen 3. The microphone 2 is provided at a position where it can pick up the voice uttered by a user riding in the vehicle. The microphone 2 picks up the voice and outputs a corresponding voice signal.
The touch screen 3 includes a display panel, such as a liquid crystal panel or an organic EL panel, and a touch sensor arranged to overlap the display panel; it displays screens in a display region and detects touch operations on a touch detection region. Various screens related to text chat are displayed on the touch screen 3. The touch screen 3 is provided at a position, such as the center of the dashboard, where the user can view the display region and perform touch operations on the touch detection region.
As shown in fig. 1, the voice processing device 1 includes an overall control unit 10, a voice input unit 11, a voice data analysis unit 12, a sentence processing unit 13, a specific processing execution control unit 14, and a sentence transmission unit 15. Each of the functional blocks 10 to 15 can be configured by any of hardware, a DSP (Digital Signal Processor), and software. When configured by software, for example, each functional block actually comprises the CPU, RAM, ROM, etc. of a computer and is realized by running a program stored in a recording medium such as a RAM, ROM, hard disk, or semiconductor memory.
As shown in fig. 1, the voice processing apparatus 1 includes a voice command list storage unit 16 (corresponding to the "storage unit" in the claims) as a storage medium. The voice command list storage unit 16 stores voice command list data 17. The voice command list data 17 describes, for each voice command, the text of the voice command's sentence (hereinafter a "voice command sentence"). In the present embodiment, the voice command list data 17 describes at least the text of the voice command sentence "Activate the wiper.", i.e. the wiper drive instruction command.
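As an illustration only (the patent specifies no data format), the voice command list data 17 can be pictured as a table mapping each voice command sentence to an identifier of its specific process. The following is a minimal Python sketch under that assumption; the sentence texts and process identifiers are illustrative, not taken from the patent.

```python
# Hypothetical sketch of the voice command list data 17: a table mapping
# the text of each voice command sentence to an identifier of the specific
# process to execute. Entries and identifiers are illustrative.
VOICE_COMMAND_LIST = {
    "Activate the wiper.": "DRIVE_WIPER",  # wiper drive instruction command
    "Search for a nearby convenience store.": "POI_SEARCH_CONVENIENCE",
    "Home.": "ROUTE_TO_HOME",
}

def lookup_command(sentence: str) -> str | None:
    """Return the process identifier if the sentence is a voice command
    sentence described in the list, otherwise None."""
    return VOICE_COMMAND_LIST.get(sentence)
```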
The following describes the operation of the speech processing apparatus 1 when its operation mode is switched from another mode to the chat mode and text uttered by the user is transmitted as a text message. The chat mode is an operation mode in which the voice processing apparatus 1 provides the user with a user interface for text chat so that the user can chat with a desired partner.
The overall control unit 10 collectively controls each unit of the voice processing apparatus 1 through the firmware of the voice processing apparatus 1, applications running on the firmware, and other programs. The overall control unit 10 can control the touch screen 3 to display various screens and can detect the position coordinates of touch operations performed on the touch screen 3.
When the operation mode is not the chat mode, the overall control unit 10 displays a button (icon) for instructing a transition to the chat mode on the touch screen 3. When the user touches this button, the overall control unit 10 activates an application related to text chat and, through the functions of the application and accompanying programs, displays various screens related to text chat on the touch screen 3.
When the user desires a text chat with a partner (one person or several), the user performs a predetermined touch operation on a predetermined screen to select whether the message voice input function is on or off, and then instructs the device to open a message exchange chat room, the place where messages are exchanged with the partner. The overall control unit 10 turns the message voice input function on or off according to the user's selection and, according to the user's instruction, displays the screen 20 of the instructed message exchange chat room on the touch screen 3. Hereinafter, the screen 20 of the message exchange chat room is referred to as the "chat room screen 20".
Fig. 2 (A) is a simplified diagram of the chat room screen 20 according to the present embodiment. As shown in fig. 2 (A), on the chat room screen 20, message fields 21 containing the partner's messages are displayed in time series on the left side of the screen, and message fields 21 containing the user's messages are displayed in time series on the right side. The whole of the text described in one message field 21 is the unit transmitted as one message in one exchange. The on/off state of the message voice input function is also explicitly indicated on the chat room screen 20.
When the message exchange chat room is opened with the message voice input function on, the overall control unit 10 outputs function-on instruction information, which instructs starting the message voice input function, to the voice input unit 11 and the voice data analysis unit 12. When the message voice input function is turned off, or the message exchange chat room is closed by the user's instruction or the like, the overall control unit 10 outputs function-off instruction information, which instructs stopping the message voice input function, to the voice input unit 11 and the voice data analysis unit 12.
The voice input unit 11 inputs a voice uttered by a user. The processing of the voice input unit 11 will be described in detail below.
The voice input unit 11 executes the following processing during the period from when the function-on instruction information is input from the overall control unit 10 until the function-off instruction information is input (hereinafter the "message voice input period"). That is, the voice input unit 11 receives the voice signal output from the microphone 2, performs analog-to-digital conversion including sampling, quantization, and encoding on the voice signal, performs other signal processing to generate voice data, and buffers the voice data in the buffer 18. The buffer 18 is memory allocated in a work area such as RAM. The voice data is waveform data sampled at a predetermined sampling frequency (16 kHz as an example).
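The buffering step can be pictured as follows. This is a minimal sketch, assuming a mono 16-bit PCM stream sampled at 16 kHz; the read_frames callback standing in for the microphone 2 and its signal-processing chain is hypothetical.

```python
import collections

SAMPLE_RATE_HZ = 16_000  # sampling frequency used in the embodiment (16 kHz)

# Buffer 18: holds the sampled speech waveform for the message voice
# input period. A deque keeps appends cheap; a plain list would also do.
buffer_18: collections.deque[int] = collections.deque()

def buffer_voice(read_frames, function_off_requested):
    """Keep appending digitized samples to buffer 18 until the function-off
    instruction arrives. read_frames() is a hypothetical callback returning
    the next chunk of 16-bit PCM samples after A/D conversion."""
    while not function_off_requested():
        buffer_18.extend(read_frames())
```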
The voice data analysis unit 12 detects the start and end of a voice reception period (described later). The voice data analysis unit 12 detects that a message-sending word is present in the voice data buffered in the buffer 18. The processing of the voice data analysis unit 12 will be described in detail below.
Fig. 3 is a diagram showing the relationship among the message voice input period, the period during which the message start word is uttered, the period during which the message end word is uttered, the period during which the message sending word is uttered, and the voice reception period, arranged on an axis representing the passage of time. On the axis, time passes from left to right in the figure. First, the message start word, message end word, message sending word, and voice reception period are described with reference to fig. 3.
In the present embodiment, when the user wants to convert some text into text form and send it as a message using the message voice input function, the user utters a message start word consisting of a fixed sentence, then utters the text to be converted, and then utters a message end word consisting of a fixed sentence. The message start word is, for example, a phrase such as "message start", and the message end word a phrase such as "message end". That is, in the present embodiment, the period from the end of the utterance of the message start word to the start of the utterance of the message end word is the period in which the speech of the text to be converted (the text the user wants converted into text) is received. This period corresponds to the "speech reception period".
In the present embodiment, as described later, after the user utters the message end word, the text of the sentences to be sent to the partner is displayed in the message field 21 of the chat room screen 20. The user checks the content displayed in the message field 21 and, if there is no problem with it and the user wishes to send it, utters a message sending word. The message sending word is, for example, a sentence such as "message send". The text is sent to the partner in response to the user uttering the message sending word.
In fig. 3, the message voice input period starts at timing TS and ends at timing TE. The utterance of the message start word starts at timing T1 after timing TS and ends at timing T2 after timing T1. The utterance of the message end word starts at timing T3 after timing T2 and ends at timing T4 after timing T3. The utterance of the message sending word starts at timing T5 after timing T4 and ends at timing T6 after timing T5. In the example of fig. 3, the period from timing T2 to timing T3 corresponds to the speech reception period, during which the text uttered by the user is to be converted into text.
When the message voice input period starts (when the function-on instruction information is input from the overall control unit 10), the voice data analysis unit 12 analyzes the voice data buffered in the buffer 18 as needed and monitors whether the voice waveform of the message start word appears in the voice data. In the present embodiment, a voice pattern of the message start word (the pattern of the voice waveform produced when the message start word is uttered) is registered in advance. Multiple voice patterns may be registered. The voice data analysis unit 12 continually compares the voice waveform of the voice data with the voice pattern of the message start word to compute a similarity, and determines that the waveform of the message start word has appeared in the voice data when the similarity is at or above a certain level.
When it detects that the voice waveform of the message start word appears in the voice data, the voice data analysis unit 12 identifies the end position of that waveform in the voice data (hereinafter the "start word end position"). The start word end position corresponds to the timing at which the voice reception period starts (timing T2 in fig. 3). A position in the voice data is expressed, for example, as the n-th sampling period counted from the timing at which the message voice input period starts (timing TS in fig. 3). For example, when the voice data is sampled at 16 kHz, the start word end position is expressed in a form such as "the 16324th period". After identifying the start word end position, the voice data analysis unit 12 outputs information indicating it to the sentence processing unit 13.
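The monitoring described above can be sketched as a sliding-window comparison between the buffered waveform and the registered voice pattern. Normalized cross-correlation is used here purely as a stand-in for whatever similarity measure the device actually employs, and the threshold value is illustrative.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.8  # "a certain level" in the text; value is illustrative

def find_marker_end(voice_data: np.ndarray, pattern: np.ndarray) -> int | None:
    """Slide the registered voice pattern over the buffered waveform and
    return the sample index just after the first match whose similarity
    reaches the threshold, i.e. the start word end position expressed as
    the n-th sampling period from the start of the message voice input
    period. Returns None if the marker word has not appeared yet."""
    n, m = len(voice_data), len(pattern)
    pat = (pattern - pattern.mean()) / (pattern.std() + 1e-9)
    for start in range(n - m + 1):
        win = voice_data[start:start + m]
        w = (win - win.mean()) / (win.std() + 1e-9)
        similarity = float(np.dot(w, pat)) / m  # normalized cross-correlation
        if similarity >= SIMILARITY_THRESHOLD:
            return start + m  # e.g. "the 16324th period"
    # A real implementation would match against several registered patterns
    # and examine only newly buffered samples rather than rescanning.
    return None
```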
The voice data analysis unit 12 then continues to analyze the voice data buffered in the buffer 18 as needed and monitors whether the voice waveform of the message end word appears in the voice data. This monitoring is performed, based on a pre-registered voice pattern of the message end word, in the same way as the monitoring for the message start word described above. When it detects that the voice waveform of the message end word appears in the voice data, the voice data analysis unit 12 identifies the start position of that waveform in the voice data (hereinafter the "end word start position"). The end word start position corresponds to the timing at which the speech reception period ends (timing T3 in fig. 3). After identifying the end word start position, the voice data analysis unit 12 outputs information indicating it to the sentence processing unit 13.
The voice data analysis unit 12 then continues to analyze the buffered voice data and monitors whether the voice waveform of the message sending word appears. This monitoring too is performed, based on a pre-registered voice pattern of the message sending word, in the same way as for the message start word. When it detects that the voice waveform of the message sending word appears in the voice data, the voice data analysis unit 12 outputs transmission instruction information notifying this to the overall control unit 10. The voice data analysis unit 12 then starts monitoring again for the voice waveform of the message start word.
As a result of the above processing by the voice data analysis unit 12, when the user utters the message start word, information indicating the start word end position is promptly output to the sentence processing unit 13; when the user utters the message end word, information indicating the end word start position is promptly output to the sentence processing unit 13; and when the user utters the message sending word, transmission instruction information notifying that its voice waveform has appeared in the voice data is promptly output to the overall control unit 10.
The sentence processing unit 13 outputs the text of the sentences represented by the voice input through the voice input unit 11 during the voice reception period (the period in which the voice of the text to be converted is received), with any specific sentence removed. In particular, the sentence processing unit 13 according to the present embodiment, after the voice reception period ends, extracts and deletes the text of the specific sentence from the text formed from the sentences represented by all the voice input during the voice reception period, and outputs the text after deletion. In doing so, the sentence processing unit 13 extracts the text of the specific sentence based on its correspondence with the text of the specific sentence stored in advance in the voice command list storage unit 16. The processing of the sentence processing unit 13 is described in detail below.
When the voice data analysis unit 12 outputs information indicating the start word end position, the sentence processing unit 13 receives it; likewise, when the voice data analysis unit 12 outputs information indicating the end word start position, the sentence processing unit 13 receives it. When the information indicating the end word start position is input, the sentence processing unit 13 identifies, from it and the previously input information indicating the start word end position, the start word end position and the end word start position in the voice data buffered in the buffer 18. The sentence processing unit 13 then acquires, from the voice data buffered in the buffer 18, the voice data in the range from the start word end position to the end word start position (hereinafter the "processing target voice data").
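In terms of buffer indices, acquiring the processing target voice data amounts to taking a slice between the two marker positions. A minimal sketch, assuming the voice data is held as a flat sequence of samples indexed from the start of the message voice input period:

```python
def extract_processing_target(buffer_18, start_word_end: int, end_word_start: int):
    """Return the portion of the buffered voice data between the end of the
    message start word and the start of the message end word, i.e. the
    speech uttered during the speech reception period."""
    return list(buffer_18)[start_word_end:end_word_start]
```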
After acquiring the processing target voice data, the sentence processing unit 13 performs speech recognition on it to convert the text recorded in the processing target voice data into text form, and generates text data 23 describing that text (hereinafter the "full-speech text data 23"). When generating the full-speech text data 23, the sentence processing unit 13 divides the text into sentences and describes it sentence by sentence. A sentence is the smallest unit among the elements of a text that expresses a completed content. In Japanese and Chinese a sentence basically ends with the full stop "。", and in English it basically ends with a period ".".
Fig. 4 shows examples of the content of the full-speech text data 23. For example, assume that during the speech reception period the user utters "Thank you for today. I look forward to working with you." and that this text is described in the full-speech text data 23. In this case, as shown in fig. 4 (A), the element "Thank you for today." constitutes one sentence, and the element "I look forward to working with you." constitutes another. Note that in fig. 4 (A) (and likewise in fig. 4 (B) described later), the delimiter symbol "/" is used as appropriate to separate sentences.
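Under the delimiter convention just described, the sentence division can be sketched as a split on sentence-final punctuation. This is a crude stand-in for the device's actual analysis, not its implementation:

```python
import re

# Split after sentence-final punctuation: "。" in Japanese/Chinese,
# "." etc. in English. A real analyzer would use morphological analysis;
# this naive rule would, for instance, mis-split abbreviations like "e.g.".
SENTENCE_END = re.compile(r"(?<=[。．.!?！？])\s*")

def split_sentences(text: str) -> list[str]:
    """Divide recognized text into sentences, as when generating the
    full-speech text data 23."""
    return [s for s in SENTENCE_END.split(text) if s]

# split_sentences("Thank you for today. I look forward to working with you.")
# -> ["Thank you for today.", "I look forward to working with you."]
```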
As described above, the voice processing device 1 according to the present embodiment has the voice command receiving function, and the user knows that uttering any of the voice commands prepared in advance causes the corresponding specific process to be executed. The user may therefore utter a voice command during the voice reception period in order to have a specific process executed.
For example, assume the user wants the text "Thank you for today. I look forward to working with you." to be converted into text and sent as a message, and, after uttering the message start word, first utters "Thank you for today.". At that moment the user notices that the rain outside the vehicle is getting heavier and judges that the wiper needs to be activated. In this case, to get the wiper running as soon as possible, the user utters "Activate the wiper." before uttering "I look forward to working with you.". As a result, as shown in fig. 4 (B), the full-speech text data 23 describes the three sentences "Thank you for today.", "Activate the wiper.", and "I look forward to working with you.".
The full-speech text data 23 is generated from the speech recognition results by appropriately carrying out morphological analysis, syntactic analysis, semantic analysis, and the like based on known natural language processing techniques; some of these may also use artificial intelligence techniques. The sentence processing unit 13 may also generate the full-speech text data 23 in cooperation with an external device. For example, the speech processing apparatus 1 may be configured to access a network, and the sentence processing unit 13 may transmit the voice data to a server that has a function of generating the full-speech text data 23 from voice data and receive the full-speech text data 23 in response.
After generating the full-speech text data 23, the sentence processing unit 13 refers to the voice command list data 17 stored in the voice command list storage unit 16 and executes the following processing. That is, for each sentence described in the full-speech text data 23, the sentence processing unit 13 determines whether the sentence matches any of the voice command sentences described in the voice command list data 17. As described above, in the present embodiment the voice command list data 17 describes at least the text of the voice command sentence "Activate the wiper.". Therefore, in the present embodiment, the sentence processing unit 13 determines at least, for each sentence described in the full-speech text data 23, whether it matches the voice command sentence "Activate the wiper.".
When determining whether a sentence described in the full-speech text data 23 matches a voice command sentence described in the voice command list data 17, the sentence processing unit 13 in the present embodiment takes into account not only agreement of the character strings but also agreement under intent interpretation based on natural language processing. Therefore, besides the case where the character strings match completely, the sentence processing unit 13 may determine that a sentence matches even when the characters differ, as long as the intent is the same.
For a sentence in the full-speech text data 23 that does not match any voice command sentence, the sentence processing unit 13 keeps it as described in the full-speech text data 23. On the other hand, the sentence processing unit 13 deletes from the full-speech text data 23 any sentence that matches one of the voice command sentences. For example, when the content of the full-speech text data 23 is as shown in fig. 4 (B), the 2nd sentence "Activate the wiper." matches a voice command sentence, so the sentence processing unit 13 deletes the 2nd sentence from the full-speech text data 23. As a result of the above processing, whenever a sentence matching one of the voice command sentences is described in the full-speech text data 23, that sentence is extracted from the full-speech text data 23 by the sentence processing unit 13 and deleted.
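Putting the pieces together, the deletion step can be sketched as a filter over the sentence list. Exact comparison after light normalization stands in here for the character-string plus intent-interpretation matching described above; the normalization rule is an assumption.

```python
def normalize(s: str) -> str:
    # Crude stand-in for intent-interpretation matching: ignore case,
    # surrounding whitespace, and trailing punctuation when comparing.
    return s.strip().rstrip("。．.!?！？").lower()

def remove_command_sentences(sentences: list[str], command_sentences: list[str]):
    """Split the sentences of the full-speech text data 23 into the ones to
    keep (the processed speech text data) and the ones matching a voice
    command sentence (reported for specific-process execution)."""
    commands = {normalize(c) for c in command_sentences}
    kept, matched = [], []
    for s in sentences:
        (matched if normalize(s) in commands else kept).append(s)
    return kept, matched

# With the fig. 4 (B) content:
sentences = ["Thank you for today.", "Activate the wiper.",
             "I look forward to working with you."]
kept, matched = remove_command_sentences(sentences, ["Activate the wiper."])
# kept    -> ["Thank you for today.", "I look forward to working with you."]
# matched -> ["Activate the wiper."]
```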
After executing the processing of determining matches with the voice command sentences and deleting the matching sentences, the sentence processing unit 13 outputs the resulting full-speech text data 23 (hereinafter the "processed speech text data") to the overall control unit 10. Furthermore, when it determines that a sentence in the full-speech text data 23 matches a voice command sentence, the sentence processing unit 13 outputs identification information identifying the voice command corresponding to that voice command sentence to the specific processing execution control unit 14.
The specific processing execution control unit 14, according to the specific sentence removed by the sentence processing unit 13, executes the specific process or causes a device that has the function of executing it to do so. The processing of the specific processing execution control unit 14 is described in detail below.
When the identification information of a voice command is input from the sentence processing unit 13, the specific processing execution control unit 14 executes the following processing. That is, the specific processing execution control unit 14 identifies the voice command (the voice command uttered by the user) from the input identification information and executes processing for bringing about the state in which the specific process corresponding to the identified voice command has been executed. The processing to execute is predetermined for each voice command; by executing that predetermined processing, the specific processing execution control unit 14 brings about the state in which the specific process corresponding to the voice command has been executed.
For example, when the identification information of the wiper drive instruction command is input, the specific processing execution control unit 14 outputs a control instruction instructing wiper driving to the control means that controls the driving of the wiper. The control means receives the control instruction and starts driving the wiper.
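The specific processing execution control unit 14 can be pictured as a dispatch table from recognized command identifiers to actions. The identifiers and the wiper-control call below are illustrative; the patent does not specify the control interface.

```python
def drive_wiper():
    # Stand-in for outputting a control instruction to the control means
    # that drives the wiper; the actual interface is not specified.
    print("control means: start driving the wiper")

# Dispatch table: identification information (command identifier) -> handler.
SPECIFIC_PROCESSES = {
    "DRIVE_WIPER": drive_wiper,
}

def on_identification_info(command_id: str) -> None:
    handler = SPECIFIC_PROCESSES.get(command_id)
    if handler is not None:
        handler()  # brings about the state after the specific process runs
```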
When the processed speech text data is input from the sentence processing unit 13, the overall control unit 10 executes the following processing. That is, the overall control unit 10 generates a message field 21 on the chat room screen 20 and displays the text described in the input processed speech text data in the message field 21 as a message. For example, if, while the chat room screen 20 shown in fig. 2 (A) is displayed, processed speech text data with the same content as the full-speech text data 23 of fig. 4 (A) is input, the overall control unit 10 generates the message field 21 as shown in fig. 2 (B) and displays the message in it. The user checks the content of the message in the message field 21 and, if there is no problem with it, utters the message sending word, thereby instructing transmission of the message. When the message sending word is uttered, its voice waveform appears in the voice data, and the voice data analysis unit 12 outputs the transmission instruction information to the overall control unit 10.
After the message is displayed, the overall control unit 10 monitors whether or not the transmission instruction information is input from the voice data analysis unit 12. When the transmission instruction information is input, the overall control unit 10 instructs the sentence transmission unit 15 to transmit a message.
The sentence transmission unit 15 transmits the message to a predetermined server according to a protocol, based on the instruction from the overall control unit 10. As a result, the text uttered by the user during the voice reception period, with the voice command sentence removed, is converted into text and transmitted to the partner as a message.
As described above, the speech processing apparatus 1 according to the present embodiment transmits the text of the sentences represented by the speech input during the speech reception period, with the voice command sentence (specific sentence) removed. With this configuration, instead of converting everything represented by that speech into text and transmitting it, the apparatus automatically removes the voice command sentence and transmits the text with it removed. Therefore, even when the user utters, during the speech reception period, a voice command that the user does not want converted into text and transmitted, the voice command is automatically removed from the text that is finally transmitted, and the user need not cancel the transmission and speak the text over again.
Next, the operation of the speech processing apparatus 1 will be described with reference to a flowchart. A flowchart FA in fig. 5 is a flowchart showing an example of the operation of overall control unit 10 related to the display of chat room screen 20 and the output of function on instruction information and function off instruction information. In the processing of the overall control unit 10 described with reference to the flowchart FA, the user selects to turn on the message voice input function when instructing to open the message exchange chat room.
As shown in the flowchart FA of fig. 5, the user performs a prescribed touch operation with respect to the touch screen 3, selects to turn on the message voice input function, and instructs to open a message exchange chat room for text chat with a desired object (step SX 1). In response to the instruction of step SX1, overall controller 10 turns on the message voice input function based on the user's selection (step SA1), and displays corresponding chat room screen image 20 on touch screen 3 (step SA 2).
Next, the overall control unit 10 outputs function activation instruction information to the voice input unit 11 and the voice data analysis unit 12 (step SA 3). Then, the overall control unit 10 monitors whether or not the message voice input function is turned off, or whether or not the message exchange chat room is closed according to an instruction of the user or the like (step SA 4). When the message voice input function is turned off or the message exchange chat room is closed, the function-off instruction information is output to the voice input unit 11 and the voice data analysis unit 12 (step SA 5).
The flowchart FB of fig. 6 is a flowchart showing an example of the operation of the voice input unit 11. The voice input unit 11 repeatedly executes the processing of the flowchart FB. As shown in fig. 6, voice input unit 11 monitors whether or not function on instruction information output from overall control unit 10 is input in step SA3 of flowchart FA (step SB 1). When the function on instruction information is input (yes in step SB1), the voice input unit 11 starts generating voice data based on the voice signal input from the microphone 2 and buffering the voice data in the buffer 18 (step SB 2). Next, the voice input unit 11 monitors whether or not function shutdown instruction information is input from the overall control unit 10 (step SB 3). When the function shutdown instruction information is input (yes in step SB3), the speech input unit 11 ends the generation and buffering of speech data (step SB 4). After the process of step SB4, the flowchart FB ends.
Fig. 7 and 8 are flowcharts showing operation examples of the voice data analysis unit 12, the sentence processing unit 13, the specific processing execution control unit 14, the overall control unit 10, and the sentence transmission unit 15. In fig. 7, a flowchart FC shows an operation example of the speech data analysis unit 12, and a flowchart FD shows an operation example of the sentence processing unit 13. In fig. 8, a flowchart FE shows an operation example of the specific process execution control unit 14, a flowchart FF shows an operation example of the overall control unit 10, and a flowchart FG shows an operation example of the text transmission unit 15.
As shown in the flowchart FC of fig. 7, the voice data analysis unit 12 monitors whether the function-on instruction information output by the overall control unit 10 in step SA3 of the flowchart FA is input (step SC1). When it is input (yes in step SC1), the voice data analysis unit 12 monitors whether the function-off instruction information output by the overall control unit 10 in step SA5 of the flowchart FA is input (step SC2) and also monitors whether the voice waveform of the message start word appears in the voice data (step SC3). When the function-off instruction information is input (yes in step SC2), the processing of the flowchart FC ends. When the voice waveform of the message start word appears (yes in step SC3), the voice data analysis unit 12 identifies the start word end position and outputs information indicating it to the sentence processing unit 13 (step SC4).
Next, the voice data analysis unit 12 monitors whether or not a voice waveform of the end-of-message word appears in the voice data (step SC 5). When the speech waveform is present (YES in step SC5), the speech data analysis unit 12 specifies the end word start position and outputs information indicating the end word start position to the sentence processing unit 13 (step SC 6).
Next, the voice data analysis unit 12 monitors whether or not a voice waveform of the message sending word appears in the voice data (step SC 7). When the occurrence of a speech waveform is detected (YES in step SC7), the speech data analysis unit 12 outputs transmission instruction information to the overall control unit 10 (step SC 8). After the processing at step SC8, the flowchart FC ends. The voice data analysis unit 12 repeatedly executes the processing of the flowchart FC.
As shown in the flowchart FD of fig. 7, the sentence processing unit 13 monitors whether or not information indicating the end position of the message start word is input (step SD 1). When the message is input (YES in step SD1), the sentence processing unit 13 monitors whether or not information indicating the start position of the message end word is input (step SD 2). When the input is made (yes in step SD2), the sentence processing unit 13 acquires the processing target speech data based on the information input in step SD1 and the information input in step SD2 (step SD 3).
Next, the sentence processing unit 13 performs speech recognition on the processing target voice data to convert the text recorded in it into text form, and generates the full-speech text data 23 (step SD4). Next, the sentence processing unit 13 refers to the voice command list data 17 stored in the voice command list storage unit 16 and executes the following processing. That is, for each sentence described in the full-speech text data 23, the sentence processing unit 13 determines whether it matches any of the voice command sentences described in the voice command list data 17, deletes the matching sentences from the full-speech text data 23, and generates the processed speech text data (step SD5). Next, the sentence processing unit 13 outputs the processed speech text data to the overall control unit 10 (step SD6). Furthermore, when it determines that a sentence in the full-speech text data 23 matches a voice command sentence, the sentence processing unit 13 outputs the identification information related to that voice command sentence to the specific processing execution control unit 14 (step SD7). After the processing of step SD7, the flowchart FD ends. The sentence processing unit 13 repeatedly executes the processing of the flowchart FD.
As shown in the flowchart FE of fig. 8, the specific processing execution control unit 14 monitors whether the identification information is input (step SE1). When it is input (yes in step SE1), the specific processing execution control unit 14 identifies the voice command from the input identification information and executes the processing for bringing about the state in which the specific process corresponding to the identified voice command has been executed (step SE2). After the processing of step SE2, the flowchart FE ends. The specific processing execution control unit 14 repeatedly executes the processing of the flowchart FE.
As shown in the flowchart FF of fig. 8, the overall control unit 10 monitors whether the processed speech text data is input from the sentence processing unit 13 (step SF1). When it is input (yes in step SF1), the overall control unit 10 generates a message field 21 on the chat room screen 20 and displays the text described in the input processed speech text data in the message field 21 as a message (step SF2). Next, the overall control unit 10 monitors whether the transmission instruction information is input from the voice data analysis unit 12 (step SF3). When it is input (yes in step SF3), the overall control unit 10 instructs the sentence transmission unit 15 to transmit the message (step SF4). After the processing of step SF4, the flowchart FF ends. The overall control unit 10 repeatedly executes the processing of the flowchart FF.
As shown in the flowchart FG of fig. 8, the sentence transmission unit 15 monitors whether an instruction is given from the overall control unit 10 (step SG1). When instructed (yes in step SG1), the sentence transmission unit 15 transmits the message displayed in the message field 21 (step SG2). After the processing of step SG2, the flowchart FG ends. The sentence transmission unit 15 repeatedly executes the processing of the flowchart FG.
< embodiment 2 >
Next, embodiment 2 will be described. In the following description of embodiment 2, the same elements as those of embodiment 1 are given the same reference numerals, and their detailed description is omitted. Fig. 9 is a block diagram showing an example of the functional configuration of the speech processing device 1A according to the present embodiment. As shown in fig. 9, the voice processing device 1A according to the present embodiment includes a sentence processing unit 13A in place of the sentence processing unit 13 of embodiment 1 and an overall control unit 10A in place of the overall control unit 10 of embodiment 1. In the processing of the voice processing device 1A according to the present embodiment, the processing of the sentence processing unit 13A and of the overall control unit 10A after the start of the message voice input period differs from embodiment 1. The processing of the speech processing apparatus 1A after the start of the message voice input period is described below.
As in embodiment 1, the voice data analysis unit 12 analyzes the voice data buffered in the buffer 18 as needed and, based on the analysis results, outputs information indicating the start word end position and information indicating the end word start position to the sentence processing unit 13A. When the information indicating the start word end position is input, the sentence processing unit 13A thereafter performs speech recognition and language analysis on the voice data buffered in the buffer 18 as needed and monitors whether a sentence appears in the text represented by the voice data.
For example, assume the user utters the text shown in fig. 4 (B). In this case, at the time the end of the voice data corresponding to the 1st sentence "Thank you for today." is buffered in the buffer 18, the sentence processing unit 13A detects from the results of the speech recognition and language analysis that a sentence has appeared. Similarly, the sentence processing unit 13A detects that a sentence has appeared at the times the ends of the voice data corresponding to the 2nd and 3rd sentences are buffered. However, since the analysis by the sentence processing unit 13A takes some time, a slight lag may occur between the time the end of the voice data corresponding to a sentence is buffered in the buffer 18 and the time the appearance of that sentence is detected.
The sentence processing unit 13A monitors whether a sentence appears in the text represented by the voice data until the information indicating the end word start position is input from the voice data analysis unit 12. When that information is input, the sentence processing unit 13A identifies the end word start position in the voice data, discards the analysis results for the voice data after the end word start position, and stops monitoring for sentences in the voice data after that position.
Each time it detects that a sentence has appeared, the sentence processing unit 13A executes the following processing. That is, the sentence processing unit 13A refers to the voice command list data 17 stored in the voice command list storage unit 16 and determines whether the appearing sentence matches any of the voice command sentences described in the voice command list data 17. When the appearing sentence does not match any voice command sentence, the sentence processing unit 13A outputs text data describing the text of that sentence to the overall control unit 10A.
On the other hand, when the appearing sentence matches one of the voice command sentences, the sentence processing unit 13A discards it; in this case, no text data describing the text of the sentence is output to the overall control unit 10A. Furthermore, when the appearing sentence matches a voice command sentence, the sentence processing unit 13A outputs the identification information of the voice command corresponding to the matching voice command sentence to the specific processing execution control unit 14. The processing of the specific processing execution control unit 14 when the identification information is input is as described in embodiment 1.
Each time text data describing one sentence is input from the sentence processing unit 13A, the overall control unit 10A adds the text of that sentence to the message field 21 of the chat room screen 20, generating the message field 21 as appropriate. Further, the overall control unit 10A monitors whether the transmission instruction information is input from the voice data analysis unit 12. When it is input, the overall control unit 10A instructs the sentence transmission unit 15 to transmit the message. The sentence transmission unit 15 transmits the message (a message consisting of all the sentences described in the message field 21) in accordance with the instruction, as in embodiment 1.
As a result of the above processing, for each sentence included in the text uttered by the user during the speech reception period, the sentence processing unit 13A outputs the text of the sentence to the overall control unit 10A if it does not match a voice command sentence. On the other hand, for a sentence that matches a voice command sentence, the sentence processing unit 13A does not output its text to the overall control unit 10A, and the text of that sentence is not targeted for transmission by the sentence transmission unit 15.
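The per-sentence flow of embodiment 2 can be sketched as a small streaming loop: each detected sentence is either forwarded for display or diverted to the specific-process path as soon as it appears. The callback names are illustrative, and the normalization is the same crude stand-in used in the earlier sketch.

```python
def _norm(s: str) -> str:
    # Crude stand-in for the intent-interpretation matching.
    return s.strip().rstrip("。．.!?！？").lower()

def process_sentence_stream(detected_sentences, command_sentences,
                            display_sentence, execute_command):
    """Embodiment-2 style processing: handle each sentence as soon as it is
    detected during the speech reception period, rather than waiting for the
    full text. display_sentence stands in for adding text to the message
    field 21; execute_command for notifying the specific processing
    execution control unit 14."""
    commands = {_norm(c) for c in command_sentences}
    for sentence in detected_sentences:    # sentences appear one by one
        if _norm(sentence) in commands:
            execute_command(sentence)      # discarded from the message
        else:
            display_sentence(sentence)     # appended to the message field 21
```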
Fig. 10 is a diagram showing how sentences are added to the message field 21 by the processing of the overall control unit 10A. Suppose the content of the chat room screen 20 is as shown in fig. 10 (A), and in this state the user utters the text of fig. 4 (B). When the user utters the 1st sentence "Thank you for today.", the message field 21A is generated in response and the sentence is displayed in it (fig. 10 (B)). Next, when the user utters the 2nd sentence "Activate the wiper.", that sentence matches a voice command sentence and is therefore not added to the message field 21A. When the user then utters the 3rd sentence "I look forward to working with you.", the sentence is displayed in the message field 21A in response (fig. 10 (C)). Then, if the user utters the message sending word, the text consisting of all the sentences described in the message field 21A is transmitted as a message by the sentence transmission unit 15.
The configuration of the present embodiment has the same effects as embodiment 1. That is, instead of converting everything represented by the speech input during the speech reception period into text and transmitting it, the apparatus automatically removes the voice command sentence and transmits the text with it removed. Therefore, even when the user utters, during the speech reception period, a voice command that the user does not want converted into text and transmitted, the voice command is automatically removed from the text that is finally transmitted, and the user need not cancel the transmission and speak the text over again.
Next, an operation example of the speech processing device 1A according to the present embodiment will be described with reference to a flowchart. Fig. 11 is a flowchart FH showing an example of the operation of the sentence processing unit 13A, and a flowchart FI showing an example of the operation of the overall control unit 10A.
As shown in the flowchart FH of fig. 11, the sentence processing unit 13A monitors whether information indicating the start word end position is input (step SH1). When it is input (yes in step SH1), the sentence processing unit 13A monitors whether information indicating the end word start position is input (step SH2) and whether a sentence appears in the text represented by the voice data (step SH3). When a sentence appears (yes in step SH3), the sentence processing unit 13A refers to the voice command list data 17 stored in the voice command list storage unit 16 and determines whether the appearing sentence matches any of the voice command sentences described in the voice command list data 17 (step SH4).
When the appearing sentence does not match any of the voice command sentences (NO in step SH4), the sentence processing unit 13A outputs text data describing the text of the appearing sentence to the overall control unit 10A (step SH5). After the processing in step SH5, the procedure returns to step SH2. On the other hand, when the appearing sentence matches one of the voice command sentences (YES in step SH4), the procedure returns to step SH2. In this case, the text data describing the text of the appearing sentence is not output to the overall control unit 10A.
When the information indicating the start position of the end word is input (YES in step SH2), the sentence processing unit 13A recognizes the end word start position in the speech data and discards the analysis results for the speech data after that position (step SH6). After the processing of step SH6, flowchart FH ends. The sentence processing unit 13A repeatedly executes the processing of flowchart FH.
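A minimal sketch of flowchart FH, assuming the voice data analysis unit 12 delivers its analysis results as a stream of (kind, payload) events; the event names and function signature are hypothetical and only serve to illustrate the control flow described above.

```python
def run_flowchart_fh(events, voice_commands, output_to_overall_control):
    """Sketch of flowchart FH for the sentence processing unit 13A.

    `events` is assumed to be an iterable of (kind, payload) tuples:
      ("start_word_end", None)  - end position of the start word (SH1)
      ("sentence", text)        - a sentence appeared (SH3)
      ("end_word_start", None)  - start position of the end word (SH2)
    """
    events = iter(events)
    # SH1: wait for the information indicating the start word end position.
    for kind, _ in events:
        if kind == "start_word_end":
            break
    # SH2/SH3: monitor for the end word and for appearing sentences.
    for kind, payload in events:
        if kind == "end_word_start":
            # SH6: discard analysis results after the end word start
            # position; nothing more is output for this reception period.
            break
        if kind == "sentence":
            # SH4: compare the appearing sentence with the voice command
            # sentences of the voice command list data 17.
            if payload in voice_commands:
                continue  # matched: do not output (back to SH2)
            # SH5: no match, so output the text data (back to SH2).
            output_to_overall_control(payload)
```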
As shown in the flowchart FI of fig. 11, the overall control unit 10A monitors whether or not transmission instruction information is input from the voice data analysis unit 12 (step SI1) and whether or not text data is input from the sentence processing unit 13A (step SI2). When text data is input (YES in step SI2), the overall control unit 10A adds the text of the sentence described in the text data to the message column 21 (step SI3). After the processing of step SI3, the procedure returns to step SI1. On the other hand, when the transmission instruction information is input (YES in step SI1), the overall control unit 10A instructs the sentence transmission unit 15 to transmit a message (step SI4). After the processing of step SI4, flowchart FI ends. The overall control unit 10A repeatedly executes the processing of flowchart FI.
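Flowchart FI can be sketched in the same event-driven style; again the event names, the message_column list, and the instruct_transmission callback are assumptions made for illustration.

```python
def run_flowchart_fi(events, message_column, instruct_transmission):
    """Sketch of flowchart FI for the overall control unit 10A.

    `events` is assumed to yield ("text_data", text) items from the
    sentence processing unit 13A and ("transmission_instruction", None)
    items from the voice data analysis unit 12.
    """
    for kind, payload in events:
        if kind == "text_data":
            # SI3: add the sentence to the message column 21, then keep
            # monitoring (back to SI1).
            message_column.append(payload)
        elif kind == "transmission_instruction":
            # SI4: instruct the sentence transmission unit 15 to send a
            # message including all sentences in the message column 21.
            instruct_transmission(list(message_column))
            break  # flowchart FI ends and is then executed again
```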
< embodiment 3 >
Next, embodiment 3 will be described. In the following description of embodiment 3, the same elements as those of embodiment 1 are given the same reference numerals, and detailed description thereof will be omitted. Fig. 12 is a block diagram showing an example of a functional configuration of the speech processing device 1B according to the present embodiment. As shown in fig. 12, the speech processing device 1B differs from the speech processing device 1 according to embodiment 1 in that it includes a sentence processing unit 13B instead of the sentence processing unit 13 according to embodiment 1 and does not include the specific processing execution control unit 14.
In the present embodiment, the host vehicle is provided with a specific processing execution device, not shown, in addition to the speech processing device 1B. The specific processing execution device is a device independent of the speech processing device 1B. It has the following function: when the user issues a certain voice command, it recognizes the issued voice command and executes the specific process corresponding to the recognized voice command, or causes another device under its control to execute that specific process. In an environment where such a specific processing execution device is provided, as in embodiment 1, the user may issue a voice command during the speech reception period in order to cause the specific processing execution device (or a device under its control) to execute a specific process.
The sentence processing unit 13B of the speech processing device 1B according to the present embodiment differs from the sentence processing unit 13 according to embodiment 1 in the following respect. When it determines that a certain sentence in the all-speech text data 23 matches a certain voice command sentence, the sentence processing unit 13 according to embodiment 1 outputs the identification information of the voice command corresponding to that voice command sentence to the specific processing execution control unit 14. The sentence processing unit 13B according to the present embodiment, on the other hand, does not output such identification information.
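The difference between the two units can be illustrated with a small sketch; the class and method names are hypothetical, and the embodiment 1 behaviour is simplified to the single point being contrasted.

```python
class SentenceProcessingUnit13:
    """Simplified embodiment 1 behaviour (hypothetical class names)."""

    def __init__(self, voice_commands, overall_control, execution_control=None):
        self.voice_commands = voice_commands        # sentence -> command id
        self.overall_control = overall_control      # overall control unit
        self.execution_control = execution_control  # unit 14 (embodiment 1)

    def on_sentence(self, sentence):
        command_id = self.voice_commands.get(sentence)
        if command_id is None:
            # Not a voice command sentence: keep it as text to transmit.
            self.overall_control.add_text(sentence)
        else:
            # Embodiment 1: hand the identification information to the
            # specific processing execution control unit 14.
            self.execution_control.execute(command_id)


class SentenceProcessingUnit13B(SentenceProcessingUnit13):
    """Embodiment 3: a matching sentence is still removed from the text,
    but no identification information is output, because the independent
    specific processing execution device reacts to the voice command."""

    def on_sentence(self, sentence):
        if sentence not in self.voice_commands:
            self.overall_control.add_text(sentence)
        # A matching sentence is simply discarded here.
```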
According to the configuration of the present embodiment, as in embodiment 1, even when the user issues a voice command that the user does not want converted into text and transmitted during the speech reception period, the voice command is automatically removed from the text that is finally transmitted, and the user does not need to temporarily cancel the transmission of the text and perform the repetitive work of speaking again.
Three embodiments have been described above, but these embodiments merely show examples of embodying the present invention, and the technical scope of the present invention should not be interpreted restrictively on their basis. That is, the present invention can be implemented in various forms without departing from its gist or main features.
For example, in the above embodiments, the sentence transmission unit 15 transmits the text as a message in a text chat, but the transmission of the text is not limited to this. For example, the text may be sent by mail. The transmission of text does not mean transmission only to a specific recipient; it is a concept that broadly includes transmission of text to an external device, such as transmission of text to a server or a specific host device. For example, transmission of the text of a post to a message contribution website or a forum website is also included in the transmission of text.
In embodiment 1, the speech processing device 1 is provided in a vehicle and prevents the text of a voice command uttered in the vehicle from being transmitted. However, the speech processing device 1 need not be installed in a vehicle, and the object whose text transmission is to be prevented is not limited to voice commands. The object whose text transmission is to be prevented may be, for example, a sentence registered in advance as unsuitable for transmission as text. The same applies to embodiments 2 and 3.
In the above embodiments, the user utters the message start word to start the speech reception period. Alternatively, the speech reception period may be started when the user performs a predetermined touch operation on the touch screen 3 or, in a configuration capable of detecting gestures, when the user performs a predetermined gesture. The same applies to the message end word and the message transmission word. In particular, for the message end word, the configuration may be such that the absence of an utterance by the user for a predetermined period is detected and the speech reception period is thereby ended.
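As one example of the last variation, ending the speech reception period on silence could be realised with a simple timeout check such as the following sketch; the threshold value is an assumption, as the specification only speaks of a "predetermined period".

```python
import time

SILENCE_TIMEOUT_S = 5.0  # hypothetical value for the "predetermined period"

def reception_period_ended(last_utterance_time: float) -> bool:
    """Return True when the user has not uttered anything for the
    predetermined period, which is then treated like the message end
    word and ends the speech reception period."""
    return time.monotonic() - last_utterance_time >= SILENCE_TIMEOUT_S
```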

Claims (9)

1. A speech processing device is characterized by comprising:
a voice input unit for inputting a voice uttered by a user;
a sentence processing unit that outputs a text of a sentence obtained by removing a specific sentence from a sentence indicated by the voice input by the voice input unit during a speech reception period, that is, a period during which the voice of the sentence to be converted into text is received; and
a sentence transmission unit that transmits the text output by the sentence processing unit.
2. The speech processing apparatus of claim 1,
the specific sentence is a sentence instructing to perform a specific process,
the speech processing device further includes:
a specific processing execution control unit configured to execute the specific processing, or cause a device having a function of executing the specific processing to execute the specific processing, based on the specific sentence removed by the sentence processing unit.
3. The speech processing apparatus of claim 1,
after the speech reception period ends, the sentence processing unit extracts and deletes the text of the specific sentence from the text of the sentence indicated by all the voice input by the voice input unit during the speech reception period, and outputs the text of the sentence after the deletion.
4. The speech processing apparatus of claim 3,
the sentence processing unit extracts the text of the specific sentence from the sentence converted into text by matching against the text of the specific sentence stored in advance in the storage unit.
5. The speech processing apparatus of claim 3,
the specific sentence is a sentence instructing to perform a specific process,
the speech processing device further includes:
a specific processing execution control unit configured to execute the specific processing, or cause a device having a function of executing the specific processing to execute the specific processing, based on the text of the specific sentence extracted by the sentence processing unit.
6. The speech processing apparatus of claim 1,
the sentence processing unit performs speech recognition on the voice input by the voice input unit as needed during the speech reception period, monitors whether or not a sentence appears, determines, each time a sentence appears, whether or not the appearing sentence is the specific sentence, outputs the text of the appearing sentence when it is not the specific sentence, and does not output the text of the appearing sentence when it is the specific sentence.
7. The speech processing apparatus of claim 6,
the sentence processing unit determines whether or not the sentence converted into text is the specific sentence by matching against the text of the specific sentence stored in advance in the storage unit.
8. The speech processing apparatus of claim 6,
the specific sentence is a sentence instructing to perform a specific process,
the speech processing device further includes:
a specific processing execution control unit that executes the specific processing, or causes a device having a function of executing the specific processing to execute the specific processing, when the sentence converted into text is determined by the sentence processing unit to be the specific sentence.
9. A speech processing method, comprising the steps of:
a step in which a sentence processing unit of the speech processing apparatus outputs a text of a sentence obtained by removing a specific sentence from a sentence indicated by the voice input by a voice input unit of the speech processing apparatus during a speech reception period, that is, a period during which the voice of the sentence to be converted into text is received; and
a step in which a sentence transmission unit of the speech processing apparatus transmits the text output by the sentence processing unit.
CN201910783144.1A 2019-08-23 2019-08-23 Speech processing apparatus and speech processing method Pending CN112489640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910783144.1A CN112489640A (en) 2019-08-23 2019-08-23 Speech processing apparatus and speech processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910783144.1A CN112489640A (en) 2019-08-23 2019-08-23 Speech processing apparatus and speech processing method

Publications (1)

Publication Number Publication Date
CN112489640A true CN112489640A (en) 2021-03-12

Family

ID=74920220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910783144.1A Pending CN112489640A (en) 2019-08-23 2019-08-23 Speech processing apparatus and speech processing method

Country Status (1)

Country Link
CN (1) CN112489640A (en)

Similar Documents

Publication Publication Date Title
JP3662780B2 (en) Dialogue system using natural language
US7260529B1 (en) Command insertion system and method for voice recognition applications
KR101229034B1 (en) Multimodal unification of articulation for device interfacing
WO2017071182A1 (en) Voice wakeup method, apparatus and system
US20020123894A1 (en) Processing speech recognition errors in an embedded speech recognition system
JP4827274B2 (en) Speech recognition method using command dictionary
WO2006054724A1 (en) Voice recognition device and method, and program
JP2011504624A (en) Automatic simultaneous interpretation system
US6591236B2 (en) Method and system for determining available and alternative speech commands
US20020123893A1 (en) Processing speech recognition errors in an embedded speech recognition system
EP3503091A1 (en) Dialogue control device and method
JP2010048953A (en) Interaction sentence generating device
US11900931B2 (en) Information processing apparatus and information processing method
JP2006208486A (en) Voice inputting device
JP2010197644A (en) Speech recognition system
JP2001142484A (en) Method for voice conversation and system therefor
JP2003108581A (en) Interactive information retrieving device and interactive information retrieving method
JP2018045675A (en) Information presentation method, information presentation program and information presentation system
CN112489640A (en) Speech processing apparatus and speech processing method
JP2006172110A (en) Response data output device, and response data outputting method and program
JP4498906B2 (en) Voice recognition device
KR102417899B1 (en) Apparatus and method for recognizing voice of vehicle
JP2002259113A (en) Voice macro processor, its method, computer program, and recording medium with its program recorded
JP4951422B2 (en) Speech recognition apparatus and speech recognition method
KR20210098250A (en) Electronic device and Method for controlling the electronic device thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination