US20190180751A1 - Information processing apparatus, method for processing information, and program - Google Patents

Info

Publication number
US20190180751A1
Authority
US
United States
Prior art keywords
word
voice recognition
unit
phrase
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/323,734
Other languages
English (en)
Inventor
Shinichi Kawano
Yuhei Taki
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Corp
Original Assignee
Sony Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Corp filed Critical Sony Corp
Assigned to SONY CORPORATION reassignment SONY CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KAWANO, SHINICHI, TAKI, Yuhei
Publication of US20190180751A1 publication Critical patent/US20190180751A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187: Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26: Speech to text systems
    • G10L2015/225: Feedback of the input speech
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93: Discriminating between voiced and unvoiced parts of speech signals

Definitions

  • the present disclosure relates to an information processing apparatus, a method for processing information, and a program, and in particular, to an information processing apparatus, a method for processing information, and a program capable of performing voice recognition with higher accuracy.
  • a more accurate voice recognition result can be obtained by prompting re-utterance and correcting the voice recognition result, for example.
  • for example, there are a technique for improving voice recognition accuracy by having the user re-utter, in a phrase unit, the portion of the voice recognition result to be corrected, and a technique for easily correcting the voice recognition result by sectioning the re-utterance into phrase units on the basis of sound information.
  • according to Patent Document 1, when a sentence expression is changed or added, the burden on the user can be reduced by preparing a sentence in a phrase unit including postpositional particles, predicates, and the like for words.
  • Patent Document 1: Japanese Patent Application Laid-Open No. 2012-053634
  • the present disclosure has been conceived in view of such a situation, and it is intended to enable voice recognition with higher accuracy.
  • An information processing apparatus includes: a voice recognition unit that obtains a word string representing utterance content as a voice recognition result by obtaining voice information obtained from utterance of a user and performing voice recognition on the voice information; a confidence level acquisition unit that obtains, at a time when the voice recognition unit performs the voice recognition on the voice information, a confidence level of each word recognized as the voice recognition result as an index representing a degree of reliability of the voice recognition result; a phrase unit determination unit that determines a phrase unit including a word with a low confidence level obtained by the confidence level acquisition unit; and an output processing unit that outputs voice recognition result information from which the phrase unit determined by the phrase unit determination unit is recognized together with the voice recognition result.
  • a method for processing information or a program includes steps of: obtaining a word string representing utterance content as a voice recognition result by obtaining voice information obtained from utterance of a user and performing voice recognition on the voice information; obtaining, at a time when the voice recognition is performed on the voice information, a confidence level of each word recognized as the voice recognition result as an index representing a degree of reliability of the voice recognition result; determining a phrase unit including a word with a low confidence level; and outputting voice recognition result information from which the phrase unit is recognized together with the voice recognition result.
  • a word string representing utterance content is obtained as a voice recognition result by obtaining voice information obtained from utterance of a user and performing voice recognition on the voice information, and at the time when the voice recognition is performed on the voice information, a confidence level of each word recognized as the voice recognition result, which is an index representing a degree of reliability of the voice recognition result, is obtained. Then, a phrase unit including a word with a low confidence level is determined, and voice recognition result information from which the phrase unit is recognized is output together with the voice recognition result.
  • voice recognition can be performed with higher accuracy.
  • FIG. 1 is a block diagram illustrating an exemplary configuration of a voice recognition system to which the present technology is applied according to an embodiment.
  • FIG. 2 is a block diagram illustrating a first exemplary configuration of a voice recognition server.
  • FIG. 3 is a diagram illustrating an example of phrase unit determination processing.
  • FIG. 4 is a diagram illustrating a pronunciation information table.
  • FIG. 5 is a diagram illustrating an example of voice recognition result output processing.
  • FIG. 6 is a flowchart illustrating a voice recognition process.
  • FIG. 7 is a flowchart illustrating a phrase unit determination process.
  • FIG. 8 is a flowchart illustrating a starting-end-word specifying process.
  • FIG. 9 is a flowchart illustrating a termination word specifying process.
  • FIG. 10 is a block diagram illustrating a second exemplary configuration of the voice recognition server.
  • FIG. 11 is a diagram illustrating a variation of the phrase unit determination processing.
  • FIG. 12 is a diagram illustrating a variation of a user interface for voice recognition.
  • FIG. 13 is a diagram illustrating a variation of the voice recognition result output processing.
  • FIG. 14 is a block diagram illustrating an exemplary configuration of a computer to which the present technology is applied according to an embodiment.
  • FIG. 1 is a block diagram illustrating an exemplary configuration of a voice recognition system to which the present technology is applied according to an embodiment.
  • in the voice recognition system 11, a plurality of (N in the example of FIG. 1) client terminals 13-1 to 13-N and a voice recognition server 14 are connected via a network 12 such as the Internet.
  • the client terminals 13-1 to 13-N are configured in a manner similar to each other, and are referred to as a client terminal 13 as appropriate in a case where there is no need to distinguish them from each other.
  • the client terminal 13 includes a voice information acquisition device, such as a microphone, to which voice uttered by a user is input to obtain voice information, and transmits the voice information obtained by the voice information acquisition device to the voice recognition server 14 via the network 12. Furthermore, the client terminal 13 receives a voice recognition result transmitted from the voice recognition server 14, and presents it to the user. For example, the client terminal 13 displays video (an image) representing the voice recognition result on a video output device, and outputs synthetic voice representing the voice recognition result from a voice output device.
  • the voice recognition server 14 performs voice recognition processing on the voice information transmitted from the client terminal 13 via the network 12 . Then, the voice recognition server 14 transmits, to the client terminal 13 via the network 12 , a word string or the like recognized from the voice information as the voice recognition result. At this time, the voice recognition server 14 can transmit the voice recognition result not only to the client terminal 13 from which the voice information is transmitted, but also to another client terminal 13 of another user in communication with the user of the client terminal 13 , for example.
  • the voice recognition system 11 is configured as described above: the voice information obtained from utterance of the user of the client terminal 13 is transmitted to the voice recognition server 14, the voice recognition server 14 performs voice recognition processing, and the voice recognition result is transmitted back to the client terminal 13. Therefore, even if the processing capacity of an individual client terminal 13 is low, the voice recognition system 11 can provide voice recognition processing with higher accuracy by implementing the latest high-performance voice recognition processing in the voice recognition server 14, for example.
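  • as a minimal sketch of this client-server exchange, the following client-side code illustrates the flow (the server URL, content type, and JSON response shape are assumptions for illustration; the patent does not specify a wire protocol):

```python
# Hypothetical client-side sketch of the exchange described above; the
# endpoint URL and response fields are assumptions, not part of the patent.
import requests

def recognize(voice_bytes: bytes,
              server_url: str = "http://voice-recognition-server/recognize") -> dict:
    # Transmit the voice information obtained by the microphone to the
    # voice recognition server (14) over the network (12).
    response = requests.post(
        server_url,
        data=voice_bytes,
        headers={"Content-Type": "application/octet-stream"},
    )
    response.raise_for_status()
    # The server replies with the recognition result together with the
    # phrase units used to prompt re-utterance, e.g.:
    # {"text": "I sue a person with a red shoot",
    #  "phrase_units": [["I", "sue", "a", "person"], ["a", "red", "shoot"]]}
    return response.json()
```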
  • FIG. 2 is a block diagram illustrating a first exemplary configuration of the voice recognition server 14 .
  • the voice recognition server 14 includes a communication unit 21 , an input sound processing unit 22 , a voice recognition unit 23 , a confidence level acquisition unit 24 , a phonetic symbol conversion unit 25 , a phrase unit determination processing unit 26 , and a voice recognition result output processing unit 27 .
  • the communication unit 21 performs various types of communication with the client terminal 13 via the network 12 in FIG. 1 .
  • the communication unit 21 receives the voice information transmitted from the client terminal 13 , and supplies it to the input sound processing unit 22 . Furthermore, the communication unit 21 transmits the voice recognition result information supplied from the voice recognition result output processing unit 27 to the client terminal 13 .
  • the input sound processing unit 22 performs, on the voice information supplied from the communication unit 21, various types of preprocessing necessary before the voice recognition unit 23 performs voice recognition. For example, the input sound processing unit 22 performs voice activity detection (VAD) processing in which sections including no sound and sections including only noise are excluded from the voice information and an utterance section including uttered voice is detected, and supplies the voice information of the utterance section to the voice recognition unit 23.
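  • as a rough illustration of such utterance-section detection, the following is a minimal energy-threshold VAD sketch (the patent does not specify the VAD algorithm; the frame size and threshold here are illustrative):

```python
import numpy as np

def detect_utterance_sections(samples: np.ndarray, rate: int,
                              frame_ms: int = 20, threshold: float = 0.01):
    """Return (start, end) sample indices of sections containing speech energy.

    A simple energy-based VAD sketch; practical systems add noise-floor
    tracking and hangover smoothing on top of this.
    """
    frame_len = int(rate * frame_ms / 1000)
    sections, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len].astype(float)
        energy = float(np.mean(frame ** 2))
        if energy >= threshold and start is None:
            start = i                    # utterance section begins
        elif energy < threshold and start is not None:
            sections.append((start, i))  # utterance section ends
            start = None
    if start is not None:
        sections.append((start, len(samples)))
    return sections
```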
  • the voice recognition unit 23 performs the voice recognition on the voice information supplied from the input sound processing unit 22 , recognizes utterance content included in the voice information, and supplies a word string representing the utterance content to the phonetic symbol conversion unit 25 and the phrase unit determination processing unit 26 .
  • the confidence level acquisition unit 24 obtains, when the voice recognition unit 23 performs the voice recognition on the voice information, a confidence level for each recognized word as an index representing the degree of reliability of the voice recognition result, and supplies it to the phrase unit determination processing unit 26.
  • the confidence level acquisition unit 24 can obtain the confidence level on the basis of a word graph generated in the process of the voice recognition performed by the voice recognition unit 23 .
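  • the patent does not give a formula for the confidence level; as one illustrative stand-in for posterior computation over a word graph, a per-word confidence can be approximated from an n-best hypothesis list:

```python
import math

def word_confidences(nbest):
    """Approximate per-word confidence from an n-best list.

    `nbest` is a list of (word_list, log_score) hypotheses. The confidence
    of each word of the best hypothesis is the share of probability mass of
    hypotheses containing the same word at the same position. Position-based
    matching is a simplification of true lattice alignment.
    """
    total = sum(math.exp(score) for _, score in nbest)
    best_words, _ = max(nbest, key=lambda hyp: hyp[1])
    confidences = []
    for pos, word in enumerate(best_words):
        mass = sum(math.exp(score) for words, score in nbest
                   if pos < len(words) and words[pos] == word)
        confidences.append((word, mass / total))
    return confidences
```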
  • the phonetic symbol conversion unit 25 refers to a dictionary in which a word is associated with a phonetic symbol, for example, converts the word string supplied from the voice recognition unit 23 into phonetic symbols associated with respective words, and supplies them to the phrase unit determination processing unit 26 .
  • the phrase unit determination processing unit 26 performs, on the word string supplied from the voice recognition unit 23 , phrase unit determination processing in which a phrase unit is determined as described later with reference to FIG. 3 on the basis of the confidence level supplied from the confidence level acquisition unit 24 and the phonetic symbols supplied from the phonetic symbol conversion unit 25 .
  • the phrase unit includes one or more words, obtained by sectioning the word string recognized by the voice recognition unit 23 into, for example, parts that are preferably uttered collectively when the user is prompted to make a re-utterance.
  • the phrase unit determination processing unit 26 obtains the confidence level of the voice recognition result in a certain unit (a unit of "article + word" in English, and "morpheme + postpositional particle or auxiliary verb" in Japanese), and in a case where there is a word with a low confidence level, a phrase unit is determined from the words around that word.
  • the phrase unit determination processing unit 26 can refer to a pronunciation information table in which a voiced sound and an unvoiced sound are associated with phonetic symbols as illustrated in FIG. 4 on the basis of the phonetic symbols converted by the phonetic symbol conversion unit 25 , and can determine the phrase unit.
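  • a minimal sketch of such a lookup is shown below; the symbol sets are an illustrative IPA-style voiced/unvoiced split, not the actual pronunciation information table of FIG. 4:

```python
# Illustrative voiced/unvoiced split over IPA-style symbols. The actual
# pronunciation information table of FIG. 4 is not reproduced here.
VOICED_SYMBOLS = set("aeiouæʌɑɔəɪʊbdgvzðʒmnŋlrwj")
UNVOICED_SYMBOLS = set("ptkfsθʃh")

def starts_with_voiced_sound(phonetic: str) -> bool:
    """True if the first phonetic symbol of a word is a voiced sound."""
    return phonetic[0] in VOICED_SYMBOLS
```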
  • the phrase unit determination processing unit 26 sequentially selects the word arranged before the word with a low confidence level from the word immediately preceding the word with a low confidence level, and specifies a starting-end word of the phrase unit on the basis of whether or not the selected word starts with a voiced sound.
  • the phrase unit determination processing unit 26 sequentially selects the word arranged after the word with a low confidence level from the word immediately following the word with the low confidence level, and specifies a termination word of the phrase unit on the basis of determination on whether or not the selected word starts with a voiced sound.
  • the voice recognition result output processing unit 27 performs voice recognition result output processing, in which voice recognition result information serving as a user interface for allowing the user of the client terminal 13 to recognize the phrase unit determined by the phrase unit determination processing unit 26 together with the voice recognition result is generated and output.
  • the voice recognition result output processing unit 27 generates and outputs display information for displaying a user interface (see FIG. 5) in which the characters representing the voice recognition result are clearly shown sectioned into phrase units, or generates and outputs synthetic voice information for outputting synthetic voice representing the voice recognition result sectioned into phrase units.
  • the voice recognition server 14 is configured as described above: voice recognition is performed on the voice information transmitted from the client terminal 13, a phrase unit for sectioning the recognized word string is determined, and the voice recognition result in which the word string is sectioned into phrase units can be transmitted to the client terminal 13. Accordingly, in a case where the voice recognition result presented to the user on the client terminal 13 includes an incorrect word, it is possible to prompt the user to make a re-utterance in the phrase unit including the incorrectly recognized word.
  • then, by performing the voice recognition on the re-uttered phrase unit, the voice recognition server 14 can correct the voice recognition result so that it includes the correct word, and output the corrected result. In this manner, since the voice recognition result can be corrected, the voice recognition server 14 can consequently perform the voice recognition with higher accuracy.
  • phrase unit determination processing performed by the phrase unit determination processing unit 26 will be described with reference to FIGS. 3 to 5 .
  • the confidence level acquisition unit 24 obtains the confidence level “0.99” for the word “I” in the voice recognition result, obtains the confidence level “0.23” for the word “sue”, and obtains the confidence level “0.98” for the word “person”. Likewise, the confidence level acquisition unit 24 obtains the confidence level “0.99” for the word “with”, obtains the confidence level “0.98” for the word “red”, and obtains the confidence level “0.12” for the word “shoot”. Furthermore, the phonetic symbol conversion unit 25 converts each word of the voice recognition result into the phonetic symbols as illustrated.
  • the phrase unit determination processing unit 26 refers to the pronunciation information table in FIG. 4 on the basis of the phonetic symbols converted by the phonetic symbol conversion unit 25 , and determines the phrase unit such that a word starting with a voiced sound is arranged at both the front and the back of the word with the low confidence level.
  • the phrase unit determination processing unit 26 may determine the phrase unit such that a word starting with a voiced sound is arranged in at least one of the front and the back of the word with the low confidence level.
  • the phrase unit determination processing unit 26 determines, as a phrase unit, "I sue a person", in which the word "I" starting with a voiced sound is arranged before the word "sue" with the low confidence level and the word "person" starting with a voiced sound is arranged after the word "sue". Furthermore, since the word "shoot" with the low confidence level is arranged at the end of the sentence, the phrase unit determination processing unit 26 determines, as a phrase unit, "a red shoot", in which the word "red" starting with a voiced sound is arranged before the word "shoot".
  • the phrase unit determination processing unit 26 may specify, on the basis of the confidence level, a word with a high confidence level as a starting-end-word and a termination word in a phrase unit including a word with a low confidence level.
  • the phrase unit determination processing unit 26 may specify the starting-end-word and the termination word in the phrase unit including the word with the low confidence level on the basis of both the confidence level and phonetic symbols.
  • FIG. 5 illustrates a user interface displayed on a video output device of the client terminal 13 as an example of voice recognition result output processing performed by the voice recognition result output processing unit 27 .
  • the phrase unit is determined by the phrase unit determination processing unit 26 with respect to the voice recognition result of “I sue a person with a red shoot” obtained by the voice recognition unit 23 .
  • the voice recognition result output processing unit 27 performs the voice recognition result output processing for outputting the display information for displaying the voice recognition result on the user interface clearly indicating that the voice recognition result is sectioned into the phrase unit “I sue a person” and the phrase unit “a red shoot”. Accordingly, as illustrated in FIG. 5 , for example, the user interface in which the phrase unit “I sue a person” and the phrase unit “a red shoot” are surrounded by different frames is displayed on the video output device of the client terminal 13 .
  • the voice recognition unit 23 performs the voice recognition with respect to the re-uttered voice information in the phrase unit including the incorrect word, whereby more accurate voice recognition result can be obtained compared with, for example, a case where only the incorrect word is uttered.
  • FIG. 6 is a flowchart illustrating a voice recognition process executed in the voice recognition server 14 .
  • in step S11, the input sound processing unit 22 performs processing of detecting an utterance section including the voice uttered by the user of the client terminal 13 from the voice information supplied from the communication unit 21.
  • in step S12, the input sound processing unit 22 determines whether or not the utterance of the user of the client terminal 13 has started according to the detection result of the utterance section in the processing in step S11.
  • in a case where the input sound processing unit 22 determines in step S12 that the utterance has not started, the process returns to step S11, and the process is suspended until it is determined that the utterance has started.
  • in a case where the input sound processing unit 22 determines in step S12 that the utterance of the user of the client terminal 13 has started, the process proceeds to step S13.
  • in step S13, the input sound processing unit 22 supplies the voice information in the utterance section to the voice recognition unit 23, and the voice recognition unit 23 performs the voice recognition on the voice information.
  • in step S14, the input sound processing unit 22 determines whether or not the utterance of the user of the client terminal 13 has ended. In a case where the input sound processing unit 22 determines in step S14 that the utterance has not ended, the process returns to step S13, and the voice recognition performed by the voice recognition unit 23 continues. On the other hand, in a case where the input sound processing unit 22 determines in step S14 that the utterance of the user of the client terminal 13 has ended, the process proceeds to step S15.
  • in step S15, the voice recognition unit 23 obtains, as the voice recognition result of the voice recognition in step S13 based on the voice information from the start to the end of the utterance, the word string representing the utterance content included in the voice information. Then, the voice recognition unit 23 supplies the voice recognition result to the phonetic symbol conversion unit 25 and the phrase unit determination processing unit 26.
  • in step S16, the phonetic symbol conversion unit 25 converts the word string supplied from the voice recognition unit 23 in step S15 into phonetic symbols for each word, and supplies them to the phrase unit determination processing unit 26.
  • in step S17, the confidence level acquisition unit 24 obtains the confidence level of each word recognized when the voice recognition unit 23 performed the voice recognition in step S13, and supplies it to the phrase unit determination processing unit 26.
  • in step S18, the phrase unit determination processing unit 26 performs, on the word string supplied from the voice recognition unit 23 in step S15, phrase unit determination processing (the flowchart in FIG. 7 described later) on the basis of the phonetic symbols supplied from the phonetic symbol conversion unit 25 in step S16 and the confidence level supplied from the confidence level acquisition unit 24 in step S17. Then, the phrase unit determination processing unit 26 supplies the phrase unit determined in the phrase unit determination processing to the voice recognition result output processing unit 27 together with the word string.
  • in step S19, the voice recognition result output processing unit 27 outputs the voice recognition result information for displaying the user interface in which it is clearly indicated that the word string recognized by the voice recognition unit 23 is sectioned by the phrase unit determined by the phrase unit determination processing unit 26. Then, the communication unit 21 transmits the voice recognition result information output from the voice recognition result output processing unit 27 to the client terminal 13 via the network 12, and the voice recognition process is terminated.
  • FIG. 7 is a flowchart illustrating the phrase unit determination processing in step S18 of the voice recognition process in FIG. 6.
  • in the phrase unit determination process, for example, processing is performed sequentially from the word at the beginning of the sentence of the word string recognized by the voice recognition unit 23.
  • in step S21, the phrase unit determination processing unit 26 first sets the word at the beginning of the sentence as a processing target.
  • in step S22, the phrase unit determination processing unit 26 determines whether or not the confidence level obtained for the word to be processed is equal to or less than a predetermined threshold value.
  • in a case where the phrase unit determination processing unit 26 determines in step S22 that the confidence level is equal to or less than the predetermined threshold value, the process proceeds to step S23.
  • in step S23, the phrase unit determination processing unit 26 performs starting-end-word specifying processing (the flowchart in FIG. 8 described later) for specifying a starting-end word to be the starting end of the phrase unit in which the word to be processed is included.
  • in step S24, the phrase unit determination processing unit 26 performs termination word specifying processing (the flowchart in FIG. 9 described later) for specifying a termination word to be the terminal of the phrase unit in which the word to be processed is included. After step S24, the process proceeds to step S25.
  • in step S25, the phrase unit determination processing unit 26 determines whether or not all the words included in the word string recognized by the voice recognition unit 23 have been set as the processing target.
  • in a case where the phrase unit determination processing unit 26 determines in step S25 that not all the words have been set as the processing target, in other words, in a case where there is a word that has not been set as the processing target, the process proceeds to step S26.
  • in step S26, the phrase unit determination processing unit 26 newly sets the word next to the word that is currently the processing target as the processing target. Then, the process returns to step S22, and a similar process is repeated for the word newly set as the processing target.
  • on the other hand, in a case where the phrase unit determination processing unit 26 determines in step S25 that all the words have been set as the processing target, the phrase unit determination process is terminated.
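  • the following is a minimal code sketch of this process, using parallel lists of words, confidence levels, and phonetic transcriptions (the 0.5 threshold is illustrative; the patent only speaks of a predetermined threshold value). It relies on the boundary searches sketched after the FIG. 8 and FIG. 9 descriptions below:

```python
def determine_phrase_units(words, confidences, phonetics, threshold=0.5):
    """Sketch of the phrase unit determination process of FIG. 7.

    Scans the word string from the beginning of the sentence; whenever a
    word's confidence level is at or below the threshold, the starting-end
    word and the termination word of its phrase unit are specified.
    Returns a list of (start, end) inclusive word indices.
    """
    phrase_units = []
    for target, conf in enumerate(confidences):
        if conf <= threshold:  # S22: low-confidence word found
            start = specify_start_word(target, confidences, phonetics, threshold)
            end = specify_end_word(target, confidences, phonetics, threshold)
            phrase_units.append((start, end))
    return phrase_units
```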
  • FIG. 8 is a flowchart illustrating the starting-end-word specifying processing in step S23 of the phrase unit determination process in FIG. 7.
  • in step S31, the phrase unit determination processing unit 26 determines whether or not all the words preceding the word to be processed have been selected as targets for specifying the starting-end word.
  • in a case where the phrase unit determination processing unit 26 determines in step S31 that not all the words preceding the word to be processed have been selected as the targets for specifying the starting-end word, the process proceeds to step S32.
  • in step S32, the phrase unit determination processing unit 26 selects the immediately preceding word as a target for specifying the starting-end word. For example, in a case where the starting-end-word specifying processing is performed for the first time, the phrase unit determination processing unit 26 selects the word immediately preceding the word set as the processing target in step S21 or S26 in FIG. 7. Furthermore, in a case where the starting-end-word specifying processing is performed for the second time or later, the phrase unit determination processing unit 26 selects the word immediately preceding the word currently being selected.
  • in step S33, the phrase unit determination processing unit 26 determines whether or not the confidence level of the word selected in the immediately preceding step S32 is equal to or less than a predetermined threshold value.
  • in a case where the phrase unit determination processing unit 26 determines in step S33 that the confidence level of the selected word is not equal to or less than the predetermined threshold value (i.e., the confidence level is larger than the predetermined threshold value), the process proceeds to step S34.
  • in step S34, the phrase unit determination processing unit 26 determines whether or not the phonetic symbols of the selected word start with a voiced sound according to the phonetic symbols supplied from the phonetic symbol conversion unit 25.
  • in a case where the phrase unit determination processing unit 26 determines in step S34 that the phonetic symbols of the selected word start with a voiced sound, the process proceeds to step S35.
  • in step S35, the phrase unit determination processing unit 26 specifies the selected word as the starting-end word.
  • on the other hand, in a case where the phrase unit determination processing unit 26 determines in step S34 that the phonetic symbols of the selected word do not start with a voiced sound, in other words, start with an unvoiced sound, the process returns to step S31, and a similar process is repeated thereafter.
  • in a case where the phrase unit determination processing unit 26 determines in step S33 that the confidence level of the selected word is equal to or less than the predetermined threshold value, the process proceeds to step S36.
  • in step S36, the phrase unit determination processing unit 26 specifies, as the starting-end word, the word immediately following the word selected as a target for specifying the starting-end word at this point. Note that, for example, in a case where the starting-end-word specifying processing is performed for the first time, the word immediately preceding the word to be processed is selected as the target for specifying the starting-end word, and the word to be processed, which immediately follows the selected word, is specified as the starting-end word.
  • in a case where the phrase unit determination processing unit 26 determines in step S31 that all the words preceding the word to be processed have been selected as the targets for specifying the starting-end word, the process proceeds to step S37.
  • in step S37, the phrase unit determination processing unit 26 specifies the word at the beginning of the sentence of the word string recognized by the voice recognition unit 23 as the starting-end word.
  • after the processing in step S35, step S36, or step S37, the starting-end-word specifying process is terminated.
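  • a minimal sketch of this backward search, reusing starts_with_voiced_sound from the earlier sketch (the threshold is again illustrative):

```python
def specify_start_word(target, confidences, phonetics, threshold=0.5):
    """Sketch of the starting-end-word specifying process of FIG. 8.

    Walks backward from the word immediately preceding the low-confidence
    word at index `target` and returns the index of the starting-end word.
    """
    for i in range(target - 1, -1, -1):
        if confidences[i] <= threshold:
            return i + 1  # S36: stop just after another low-confidence word
        if starts_with_voiced_sound(phonetics[i]):
            return i      # S35: high-confidence word starting with a voiced sound
    return 0              # S37: reached the beginning of the sentence
```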
  • FIG. 9 is a flowchart illustrating the termination word specifying processing in step S24 of the phrase unit determination process in FIG. 7.
  • in step S41, the phrase unit determination processing unit 26 determines whether or not all the words following the word to be processed have been selected as targets for specifying the termination word.
  • in a case where the phrase unit determination processing unit 26 determines in step S41 that not all the words following the word to be processed have been selected as the targets for specifying the termination word, the process proceeds to step S42.
  • in step S42, the phrase unit determination processing unit 26 selects the immediately following word as a target for specifying the termination word. For example, in a case where the termination word specifying processing is performed for the first time, the phrase unit determination processing unit 26 selects the word immediately following the word set as the processing target in step S21 or S26 in FIG. 7. Furthermore, in a case where the termination word specifying processing is performed for the second time or later, the phrase unit determination processing unit 26 selects the word immediately following the word currently being selected.
  • in step S43, the phrase unit determination processing unit 26 determines whether or not the confidence level of the word selected in the immediately preceding step S42 is equal to or less than a predetermined threshold value.
  • in a case where the phrase unit determination processing unit 26 determines in step S43 that the confidence level of the selected word is not equal to or less than the predetermined threshold value (i.e., the confidence level is larger than the predetermined threshold value), the process proceeds to step S44.
  • in step S44, the phrase unit determination processing unit 26 determines whether or not the phonetic symbols of the selected word start with a voiced sound according to the phonetic symbols supplied from the phonetic symbol conversion unit 25.
  • in a case where the phrase unit determination processing unit 26 determines in step S44 that the phonetic symbols of the selected word start with a voiced sound, the process proceeds to step S45.
  • in step S45, the phrase unit determination processing unit 26 specifies the selected word as the termination word.
  • on the other hand, in a case where the phrase unit determination processing unit 26 determines in step S44 that the phonetic symbols of the selected word do not start with a voiced sound, in other words, start with an unvoiced sound, the process returns to step S41, and a similar process is repeated thereafter.
  • in a case where the phrase unit determination processing unit 26 determines in step S43 that the confidence level of the selected word is equal to or less than the predetermined threshold value, the process proceeds to step S46.
  • in step S46, the phrase unit determination processing unit 26 specifies, as the termination word, the word immediately preceding the word selected as a target for specifying the termination word at this point. Note that, for example, in a case where the termination word specifying processing is performed for the first time, the word immediately following the word to be processed is selected as the target for specifying the termination word, and the word to be processed, which immediately precedes the selected word, is specified as the termination word.
  • in a case where the phrase unit determination processing unit 26 determines in step S41 that all the words following the word to be processed have been selected as the targets for specifying the termination word, the process proceeds to step S47.
  • in step S47, the phrase unit determination processing unit 26 specifies the word at the end of the sentence of the word string recognized by the voice recognition unit 23 as the termination word.
  • after the processing in step S45, step S46, or step S47, the termination word specifying process is terminated.
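  • the forward search mirrors the backward one; a minimal sketch under the same assumptions:

```python
def specify_end_word(target, confidences, phonetics, threshold=0.5):
    """Sketch of the termination word specifying process of FIG. 9.

    Walks forward from the word immediately following the low-confidence
    word at index `target` and returns the index of the termination word.
    """
    last = len(confidences) - 1
    for i in range(target + 1, last + 1):
        if confidences[i] <= threshold:
            return i - 1  # S46: stop just before another low-confidence word
        if starts_with_voiced_sound(phonetics[i]):
            return i      # S45: high-confidence word starting with a voiced sound
    return last           # S47: reached the end of the sentence
```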
  • the voice recognition server 14 determines the phrase unit through the phrase unit determination processing at the time of performing the voice recognition on the voice information transmitted from the client terminal 13, whereby a user interface by which the phrase unit can be recognized together with the voice recognition result can be presented. Accordingly, it is possible to cause the user to make a re-utterance in the phrase unit, whereby a more accurate voice recognition result can be obtained.
  • FIG. 10 is a block diagram illustrating a second exemplary configuration of the voice recognition server 14 . Note that, in a voice recognition server 14 A illustrated in FIG. 10 , configurations common to the voice recognition server 14 in FIG. 2 are denoted by the same reference numerals, and detailed descriptions thereof are omitted.
  • the voice recognition server 14 A has a configuration common to the voice recognition server 14 in FIG. 2 in that it includes the communication unit 21 , the input sound processing unit 22 , the voice recognition unit 23 , the confidence level acquisition unit 24 , the phonetic symbol conversion unit 25 , the phrase unit determination processing unit 26 , and the voice recognition result output processing unit 27 . Moreover, the voice recognition server 14 A includes a one-character voice recognition unit 28 , and a natural language analysis unit 29 .
  • the one-character voice recognition unit 28 is capable of performing voice recognition on the voice information supplied from the input sound processing unit 22 in a unit of one character.
  • the one-character voice recognition unit 28 includes a voice recognition engine specialized for the voice recognition in the unit of one character as compared with the voice recognition unit 23 .
  • the phrase unit determination processing unit 26 determines the phrase unit including only the word with the low confidence level. In other words, in this case, re-utterance of only the word with the low confidence level is prompted. Thereafter, when the input sound processing unit 22 obtains the voice information associated with the re-utterance, the input sound processing unit 22 supplies the voice information associated with the word with the low confidence level to the one-character voice recognition unit 28 , and causes the one-character voice recognition unit 28 to perform voice recognition.
  • the accuracy of the voice recognition can be improved compared with a case where the voice recognition unit 23 is caused to perform the voice recognition on the re-utterance.
  • the voice recognition result obtained by the voice recognition unit 23 having performed the voice recognition on the voice information is supplied to the natural language analysis unit 29 .
  • the natural language analysis unit 29 performs natural language analysis on the voice recognition result, and obtains sentence elements (sentence components) of the words included in the voice recognition result as an analysis result.
  • the natural language analysis unit 29 obtains sentence elements for each word included in the voice recognition result.
  • the natural language analysis unit 29 obtains the analysis result in which the word “I” is a noun (subject), the word “sue” is a verb, the word “a” is an article, the word “person” is a noun (object), the word “with” is a preposition, the word “a” is an article, the word “red” is an adjective, and the word “shoot” is a noun.
  • the phrase unit determination processing unit 26 can determine the phrase unit on the basis of the language structure according to such sentence elements. For example, the phrase unit determination processing unit 26 determines “I sue a person” as a phrase unit on the basis of the strongly connected language structure of a subject, a verb, and an object. Furthermore, the phrase unit determination processing unit 26 determines “a red shoot” as a phrase unit on the basis of the strongly connected language structure of an article, an adjective, and a noun, for example.
  • the phrase unit determination processing unit 26 may select, in the strongly connected language structure of a subject, a verb, and an object, a word starting with an unvoiced sound as a starting-end-word or a termination word.
  • the voice recognition unit 23 obtains the voice recognition result of “She prays with her hair” on the basis of the voice information “She plays with her hair” uttered by the user.
  • the natural language analysis unit 29 obtains the analysis result in which the word “She” is a noun (subject), the word “prays” is a verb, the word “with” is a preposition, the word “her” is a noun (object), and the word “hair” is a noun (object). At this time, the confidence level of the word “prays” is low, and the phrase unit determination processing unit 26 performs processing for determining the phrase unit including the word “prays”.
  • the phrase unit determination processing unit 26 can determine that this is a strongly connected language structure of a subject and a verb and that the connection between the sounds is strong. Therefore, the phrase unit determination processing unit 26 selects the word "She", even though it does not start with a voiced sound, as the starting-end word, and can determine "She plays with" as a phrase unit.
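  • a minimal sketch of grouping words into phrase units by such strongly connected structures is shown below (the tag set and patterns are assumptions for illustration; the patent does not specify the analyzer or its rules):

```python
# Illustrative "strongly connected" sentence-element patterns; these rules
# are assumptions, not the patent's actual natural language analysis.
STRONG_PATTERNS = [
    ("noun", "verb", "article", "noun"),  # subject + verb + object
    ("article", "adjective", "noun"),     # article + adjective + noun
    ("noun", "verb", "preposition"),      # subject + verb + preposition
]

def group_by_structure(words, tags):
    """Greedily match strongly connected tag patterns from left to right."""
    units, i = [], 0
    while i < len(words):
        for pattern in STRONG_PATTERNS:
            if tuple(tags[i:i + len(pattern)]) == pattern:
                units.append(words[i:i + len(pattern)])
                i += len(pattern)
                break
        else:
            units.append([words[i]])  # no pattern matched: word stands alone
            i += 1
    return units

# For "I sue a person with a red shoot" tagged noun/verb/article/noun/
# preposition/article/adjective/noun, this yields the phrase units
# ["I", "sue", "a", "person"], ["with"], and ["a", "red", "shoot"].
```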
  • as described above, the phrase unit determination processing unit 26 performs processing on the basis of whether or not a word starts with a voiced sound. This is because voice recognition can be performed on a word starting with a voiced sound with higher accuracy than on a word starting with an unvoiced sound. In addition, in a case where a word includes a relatively large number of voiced sounds (e.g., more than half of the sounds in the word), it is considered that the voice recognition can be performed with high accuracy even if the word starts with an unvoiced sound.
  • the phrase unit determination processing unit 26 can determine the phrase unit by specifying the word starting with an unvoiced sound and including a relatively large number of voiced sounds as a starting-end-word or a termination word.
  • the voice recognition unit 23 has obtained the voice recognition result of “Statistics shoes that people are having fewer children” on the basis of the voice information “Statistics shows that people are having fewer children” uttered by the user.
  • although the word "Statistics" does not start with a voiced sound, it can be determined that its content rate of voiced sounds is high, and the phrase unit determination processing unit 26 can thus specify the word "Statistics" as the starting-end word and determine "Statistics shows that" as a phrase unit.
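  • a minimal sketch of this relaxed boundary test, reusing the illustrative symbol sets from the earlier sketch (treating "more than half" as a 0.5 ratio is an assumption):

```python
def voiced_ratio(phonetic: str) -> float:
    """Fraction of a word's phonetic symbols that are voiced sounds."""
    symbols = [s for s in phonetic
               if s in VOICED_SYMBOLS or s in UNVOICED_SYMBOLS]
    if not symbols:
        return 0.0
    return sum(s in VOICED_SYMBOLS for s in symbols) / len(symbols)

def acceptable_boundary_word(phonetic: str, ratio: float = 0.5) -> bool:
    """A word can serve as a starting-end or termination word if it starts
    with a voiced sound, or if its content rate of voiced sounds is high."""
    return starts_with_voiced_sound(phonetic) or voiced_ratio(phonetic) > ratio
```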
  • a variation of the user interface for voice recognition will be described with reference to FIG. 12 .
  • FIG. 12 illustrates an exemplary user interface in a system that allows the user to select the content to be uttered and input voice.
  • the voice recognition system 11 can be applied to the system that allows the user to utter either the option “Seashell” or the option “glass ball” in response to the question “Which do you like better?”.
  • the option “Seashell” includes a large number of unvoiced sounds, and it is assumed that the accuracy of the voice recognition is lowered.
  • the voice recognition system 11 can change the option to a word having a meaning similar to the option “Seashell” and including a large number of voiced sounds.
  • the voice recognition system 11 can change the option “Seashell” to the option “Shellfish”. Accordingly, the voice recognition can be performed with higher accuracy by making the user utter the option “Shellfish” including voiced sounds more than that of the option “Seashell”.
  • FIG. 13 illustrates a variation of the user interface output by the voice recognition result output processing unit 27 .
  • the user interface in which the voice recognition result is clearly indicated to be sectioned into the phrase unit "I sue a person" and the phrase unit "a red shoot" is presented to the user, whereby re-utterance of the phrase unit "a red shoot" can be prompted.
  • however, in a case where the termination word starts with an unvoiced sound, it is considered difficult to greatly improve the accuracy of the voice recognition with respect to the re-utterance.
  • in view of this, the voice recognition result output processing unit 27 can output a user interface that adds a warning sentence prompting the user to speak with "please" appended after the phrase unit "a red shoot", in which the word starting with an unvoiced sound is the termination word.
  • the user interface prompting utterance for adding, after the termination word, a word that does not influence the sentence and starts with a voiced sound is presented.
  • the user interface prompting re-utterance for adding, before the starting-end-word, a word that does not influence the sentence and starts with a voiced sound may be presented.
  • each processing described with reference to the flowcharts described above is not necessarily processed in a time series manner in the order illustrated in the flowchart, and may be executed in parallel or individually (e.g., parallel processing or object processing).
  • the program may be processed by one central processing unit (CPU), or may be subjected to distributed processing by a plurality of CPUs.
  • the series of processing described above may be executed by hardware or by software.
  • a program constituting the software is installed from a program recording medium in which the program is recorded into a computer incorporated in dedicated hardware or, for example, a general-purpose personal computer or the like capable of executing various functions by installing various programs therein.
  • FIG. 14 is a block diagram illustrating an exemplary hardware configuration of a computer that executes, using a program, the series of processing described above.
  • a computer 101 illustrated in FIG. 14 corresponds to, for example, the client terminal 13 in FIG. 1 , which has an exemplary configuration capable of performing voice recognition processing using the client terminal 13 alone without performing processing via the network 12 .
  • the computer 101 includes a voice information acquisition device 102 , a video output device 103 , a voice output device 104 , a CPU 105 , a memory 106 , a storage device 107 , and a network input-output device 108 .
  • the computer 101 includes the input sound processing unit 22 , the voice recognition unit 23 , the confidence level acquisition unit 24 , the phonetic symbol conversion unit 25 , the phrase unit determination processing unit 26 , and the voice recognition result output processing unit 27 .
  • the computer 101 includes the one-character voice recognition unit 28 and the natural language analysis unit 29 .
  • the voice information acquisition device 102 includes a microphone, the video output device 103 includes a display, and the voice output device 104 includes a speaker.
  • the network input-output device 108 corresponds to the communication unit 21 in FIG. 2 , and is capable of performing communication according to the standard of the local area network (LAN), for example.
  • the CPU 105 loads the program stored in the storage device 107 into the memory 106 and executes it, thereby performing the series of processing described above.
  • the program to be executed by the CPU 105 may be provided by recording it in package media including, for example, a magnetic disk (including a flexible disk), an optical disk (e.g., compact disc read only memory (CD-ROM) and digital versatile disc (DVD)), a magneto-optic disk, a semiconductor memory, or the like, or may be provided via a wired or wireless transmission medium by using the network input-output device 108 .
  • An information processing apparatus including:
  • a voice recognition unit that obtains a word string representing utterance content as a voice recognition result by obtaining voice information obtained from utterance of a user and performing voice recognition on the voice information;
  • a confidence level acquisition unit that obtains, at a time when the voice recognition unit performs the voice recognition on the voice information, a confidence level of each word recognized as the voice recognition result as an index representing a degree of reliability of the voice recognition result;
  • a phrase unit determination unit that determines a phrase unit including a word with a low confidence level obtained by the confidence level acquisition unit; and
  • an output processing unit that outputs voice recognition result information from which the phrase unit determined by the phrase unit determination unit is recognized together with the voice recognition result.
  • the information processing apparatus further including:
  • a phonetic symbol conversion unit that converts the word string recognized as the voice recognition result into a phonetic symbol of each word, in which
  • the phrase unit determination unit determines the phrase unit on the basis of the phonetic symbol converted by the phonetic symbol conversion unit.
  • the phrase unit determination unit refers to the phonetic symbol converted by the phonetic symbol conversion unit, and specifies a word starting with a voiced sound as a word to be a starting end or a terminal of the phrase unit.
  • the phrase unit determination unit sequentially selects a word arranged before the word with a low confidence level from a word immediately preceding the word with a low confidence level, and specifies a starting-end word of the phrase unit on the basis of whether or not the selected word starts with a voiced sound.
  • the phrase unit determination unit sequentially selects a word arranged after the word with a low confidence level from a word immediately following the word with a low confidence level, and specifies a termination word of the phrase unit on the basis of whether or not the selected word starts with a voiced sound.
  • the information processing apparatus according to any one of (1) to (5) above, further including:
  • a natural language analysis unit that performs natural language analysis on a sentence including the word string recognized as the voice recognition result, in which
  • the phrase unit determination unit refers to an analysis result obtained by the natural language analysis unit, and determines the phrase unit on the basis of a strongly connected language structure.
  • the information processing apparatus according to any one of (1) to (6) above, further including:
  • a one-character voice recognition unit that performs voice recognition on the voice information in a unit of one character, in which
  • the one-character voice recognition unit performs voice recognition on voice information re-uttered with respect to the word with a low confidence level.
  • the output processing unit causes a user interface for prompting re-utterance in which a word that does not influence a sentence and starts with a voiced sound is added before or after the phrase unit to be presented.
  • the information processing apparatus according to any one of (1) to (8) above, further including:
  • a communication unit that communicates with another apparatus via a network; and
  • an input sound processing unit that performs processing for detecting an utterance section in which the voice information includes voice, in which
  • the communication unit obtains the voice information transmitted from the other apparatus via the network and supplies the voice information to the input sound processing unit, and
  • the voice recognition result information output from the output processing unit is transmitted to the other apparatus via the network.
  • a method for processing information including steps of:
  • obtaining a word string representing utterance content as a voice recognition result by obtaining voice information obtained from utterance of a user and performing voice recognition on the voice information;
  • a program for causing a computer to execute information processing including steps of:
  • obtaining a word string representing utterance content as a voice recognition result by obtaining voice information obtained from utterance of a user and performing voice recognition on the voice information;

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
US16/323,734 2016-08-31 2017-08-17 Information processing apparatus, method for processing information, and program Abandoned US20190180751A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP2016170308 2016-08-31
JP2016-170308 2016-08-31
PCT/JP2017/029493 WO2018043139A1 (ja) 2016-08-31 2017-08-17 Information processing apparatus, information processing method, and program

Publications (1)

Publication Number Publication Date
US20190180751A1 true US20190180751A1 (en) 2019-06-13

Family

ID=61300773

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/323,734 Abandoned US20190180751A1 (en) 2016-08-31 2017-08-17 Information processing apparatus, method for processing information, and program

Country Status (5)

Country Link
US (1) US20190180751A1 (ja)
EP (1) EP3509060A4 (ja)
JP (1) JPWO2018043139A1 (ja)
CN (1) CN109643547A (ja)
WO (1) WO2018043139A1 (ja)

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000029492A (ja) * 1998-07-09 2000-01-28 Hitachi Ltd Speech translation apparatus, speech translation method, and speech recognition apparatus
US20050114131A1 (en) * 2003-11-24 2005-05-26 Kirill Stoimenov Apparatus and method for voice-tagging lexicon
JP2005157166A (ja) * 2003-11-28 2005-06-16 Toyota Central Res & Dev Lab Inc Speech recognition apparatus, speech recognition method, and program
CN101002455B (zh) * 2004-06-04 2011-12-28 B·F·加萨比安 Device and method for enhancing data entry in mobile and fixed environments
JP2006010739A (ja) * 2004-06-22 2006-01-12 Toyota Central Res & Dev Lab Inc Speech recognition apparatus
TWI277949B (en) * 2005-02-21 2007-04-01 Delta Electronics Inc Method and device of speech recognition and language-understanding analysis and nature-language dialogue system using the method
CN101082836A (zh) * 2007-06-29 2007-12-05 Huazhong University of Science and Technology Chinese character input system integrating voice input and handwriting input functions
US8326631B1 (en) * 2008-04-02 2012-12-04 Verint Americas, Inc. Systems and methods for speech indexing
JP2010197669A (ja) * 2009-02-25 2010-09-09 Kyocera Corp Portable terminal, editing guidance program, and editing apparatus
JP5550496B2 (ja) 2010-08-31 2014-07-16 Fujifilm Corp Document creation support apparatus, document creation support method, and document creation support program
WO2015059976A1 (ja) * 2013-10-24 2015-04-30 Sony Corp Information processing apparatus, information processing method, and program
CN103810996B (zh) * 2014-02-21 2016-08-31 北京凌声芯语音科技有限公司 Method, apparatus, and system for processing speech to be tested
JP2016109725A (ja) * 2014-12-02 2016-06-20 Sony Corp Information processing apparatus, information processing method, and program

Also Published As

Publication number Publication date
WO2018043139A1 (ja) 2018-03-08
EP3509060A4 (en) 2019-08-28
EP3509060A1 (en) 2019-07-10
CN109643547A (zh) 2019-04-16
JPWO2018043139A1 (ja) 2019-06-24

Similar Documents

Publication Publication Date Title
US11037553B2 (en) Learning-type interactive device
US9704413B2 (en) Non-scorable response filters for speech scoring systems
US20180190269A1 (en) Pronunciation guided by automatic speech recognition
US11810471B2 (en) Computer implemented method and apparatus for recognition of speech patterns and feedback
WO2012134997A2 (en) Non-scorable response filters for speech scoring systems
WO2020256749A1 (en) Word lattice augmentation for automatic speech recognition
EP3509062B1 (en) Audio recognition device, audio recognition method, and program
KR20160122542A (ko) 발음 유사도 측정 방법 및 장치
CN112017633B (zh) 语音识别方法、装置、存储介质及电子设备
US20200219487A1 (en) Information processing apparatus and information processing method
CN110675866B (zh) 用于改进至少一个语义单元集合的方法、设备及计算机可读记录介质
JP6875819B2 (ja) 音響モデル入力データの正規化装置及び方法と、音声認識装置
JP2012194245A (ja) 音声認識装置、音声認識方法及び音声認識プログラム
EP2806415B1 (en) Voice processing device and voice processing method
US9805740B2 (en) Language analysis based on word-selection, and language analysis apparatus
JP2019124952A (ja) 情報処理装置、情報処理方法、およびプログラム
WO2018079294A1 (ja) 情報処理装置及び情報処理方法
US11961510B2 (en) Information processing apparatus, keyword detecting apparatus, and information processing method
US20190228765A1 (en) Speech analysis apparatus, speech analysis system, and non-transitory computer readable medium
US20190180751A1 (en) Information processing apparatus, method for processing information, and program
Chen et al. A proof-of-concept study for automatic speech recognition to transcribe AAC speakers’ speech from high-technology AAC systems
CN113763921B (zh) 用于纠正文本的方法和装置
CN110880327B (zh) 一种音频信号处理方法及装置
WO2021171417A1 (ja) 発話終端検出装置、制御方法、及びプログラム
JP2017219733A (ja) 言語判断装置、音声認識装置、言語判断方法、およびプログラム

Legal Events

Date Code Title Description
AS Assignment

Owner name: SONY CORPORATION, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:KAWANO, SHINICHI;TAKI, YUHEI;REEL/FRAME:048278/0915

Effective date: 20181122

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION