WO2022254909A1 - Sound recognition device - Google Patents

Sound recognition device

Info

Publication number
WO2022254909A1
Authority
WO
WIPO (PCT)
Prior art keywords
event
speech recognition
sound information
recognition result
unit
Prior art date
Application number
PCT/JP2022/014596
Other languages
French (fr)
Japanese (ja)
Inventor
悠輔 中島
拓 加藤
太一 片山
圭 菊入
Original Assignee
株式会社Nttドコモ
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社Nttドコモ filed Critical 株式会社Nttドコモ
Priority to JP2023525437A priority Critical patent/JPWO2022254909A1/ja
Publication of WO2022254909A1 publication Critical patent/WO2022254909A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the present invention relates to a speech recognition device that performs speech recognition.
  • Patent Document 1 describes a conference support system that recognizes text data representing speech from speech intervals while distinguishing between speech intervals and non-speech intervals contained in speech data in conferences and the like.
  • An object of the present invention is to provide a speech recognition device that can grasp the atmosphere of a conversation in a meeting or the like.
  • A speech recognition apparatus according to the present disclosure includes a sound information acquisition unit that acquires sound information, and a sound information processing unit that performs speech recognition processing on first sound information among the sound information to acquire a recognition result, determines that second sound information associated with the first sound information is the occurrence of a nonverbal event, and outputs information indicating the event in association with the recognition result.
  • FIG. 1 is a block diagram showing the functional configuration of a speech recognition device 100 according to the present disclosure.
  • FIG. 2 is an explanatory diagram showing the reliability of speech recognition results and the reliability of each recognized event.
  • FIG. 3 is a diagram showing an example of processing results for a certain utterance.
  • FIG. 4 is a diagram showing an outline of determination processing when the determination interval is changed.
  • FIG. 5 is a flowchart showing the operation of the speech recognition device 100.
  • FIG. 6 is a diagram showing processing when the utterance is in English.
  • FIG. 7 is a diagram showing an example of result output in the process of adding event tag information.
  • FIG. 8 is a diagram showing an example of determination processing for user A and user B.
  • FIG. 9 is a diagram showing an example of result output in another example.
  • FIG. 10 is a diagram showing an example of result output in yet another example.
  • FIG. 11 is a diagram showing an example in which event tag information in the result output can be selected to display the linked recognition result text.
  • FIG. 12 is a block diagram showing the functional configuration of a speech recognition device 100a according to a modification of the present disclosure.
  • FIG. 13 is a diagram showing a specific example of correction.
  • FIG. 14 is a block diagram showing the functional configuration of a speech recognition device 100b according to another modification.
  • FIG. 15 is a graph showing the frequency of event types for each topic.
  • FIG. 16 is a diagram showing an example of the hardware configuration of the speech recognition device 100.
  • FIG. 1 is a block diagram showing the functional configuration of the speech recognition device 100 according to the present disclosure.
  • The speech recognition device 100 includes a speech acquisition unit 101, a speech recognition unit 102, a non-verbal speech recognition unit 103 (non-verbal sound recognition unit), a score determination unit 104, and a result output unit 105. Each configuration is described below.
  • the speech acquisition unit 101 is a part that acquires speech in meetings, lectures, or the like.
  • the voice acquisition unit 101 is a microphone.
  • It may also be a unit that acquires audio from an external source.
  • the speech acquisition unit 101 detects speech segments from the speech waveform signal and outputs the segments to the speech recognition unit 102 and the non-language speech recognition unit 103 .
  • the speech recognition unit 102 performs speech recognition processing on the verbal or non-verbal sounds in the speech section output from the speech acquisition unit 101 using a known language model and acoustic model, and acquires the recognition result text for each recognition unit. It is also a part that derives the reading and reliability of the recognition result text.
  • the speech recognition unit 102 outputs the recognition result text, its reading and reliability in units of utterance, sentence, phrase, word, kana or phoneme, or time as recognition units.
  • This reliability is information indicating how much the recognition result of the speech recognition process can be trusted. It is generally expressed between 0 and 1, but is not limited to this; it may be expressed as an integer, on a scale of 0 to 100, or as a normalized value.
  • The reliability of the speech recognition result is obtained based on the reliabilities held by the language model and the acoustic model, but is not limited to this; other known reliability derivation methods, such as those used in end-to-end speech recognition, may be used.
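  • A minimal sketch of the outputs described above, assuming Python dataclasses; the class and field names are illustrative only and do not appear in the disclosure:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class RecognitionResult:
    """Output of the speech recognition unit 102 for one recognition unit (e.g. a word)."""
    text: str          # recognition result text
    reading: str       # reading (kana/romaji) of the text
    confidence: float  # reliability, here normalized to 0..1
    start_sec: float   # elapsed time of the utterance (used later for alignment)
    end_sec: float

@dataclass
class EventResult:
    """Output of the non-verbal speech recognition unit 103 for the same time span."""
    # reliability per event type, e.g. {"laughter": 0.7, "no_laughter": 0.3, "cough": 0.05}
    confidences: Dict[str, float] = field(default_factory=dict)
```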
  • The non-verbal speech recognition unit 103 is the part that uses a known non-verbal sound recognition model to perform recognition processing for non-verbal sounds on the verbal or non-verbal sounds in the speech section output from the speech acquisition unit 101.
  • The non-verbal speech recognition unit 103 generates a reliability for each event, according to the type of non-verbal sound, for each recognition time corresponding to the recognition unit used by the speech recognition unit 102. For example, when the speech recognition unit 102 recognizes word by word, the utterance time of the recognized word becomes the recognition unit of the non-verbal speech recognition unit 103.
  • Non-verbal sounds include sounds such as laughter, backchannel responses, nods (affirmative or negative), sighs, sneezes, coughs, keyboard sounds, and background music. For example, a positive event indicates that laughter occurred, and a negative event indicates that it did not.
  • the non-verbal speech recognizer 103 recognizes these non-verbal sounds and generates confidence levels for positive and negative events.
  • the degree of reliability for an event is information that indicates how much the recognition result in non-verbal speech recognition processing can be trusted.
  • This reliability is generally expressed between 0 and 1, but is not limited to this; it may be expressed as an integer, on a scale of 0 to 100, or as a normalized value. It is obtained based on the reliabilities held by the recognition model for non-verbal sounds, but is not limited to this; other known reliability derivation methods may be used.
  • FIG. 2 is an explanatory diagram showing the reliability of the speech recognition result text and the reliability of each event. As shown in the figure, assume the utterance content is "ahahaha", which indicates laughter.
  • Since the speech recognition unit 102 tries to recognize this speech as a verbal sound, it recognizes the text data as "Aboha" (a kanji misconversion) with the reading "ahahaha". The speech recognition unit 102 derives a reliability for this recognition result text using the acoustic model and the language model.
  • the non-verbal speech recognition unit 103 derives the reliability for each event.
  • the reliability is derived according to the presence or absence of laughter. That is, the degree of confidence that "ahahaha" ("ahahaha" is uttered) is not laughter and the degree of confidence that it is laughter are calculated. In FIG. 2, the reliability is calculated as 0.3 without laughter and 0.7 with laughter.
  • the reliability of text data as laughter may be calculated.
  • other types of reliability may be calculated, such as the presence or absence of coughing or the presence or absence of keyboard sounds.
  • the non-verbal speech recognition unit 103 has recognition models for recognizing laughter, coughing, or keyboard sounds, respectively, and can calculate the respective reliability based on these recognition models.
  • Recognition models for recognizing backchannel sounds, nodding sounds, and sneezes may also be provided.
  • a recognition model may also be provided that outputs the reliability of each event with respect to a plurality of events.
  • The score determination unit 104 is the part that determines which of the recognition result text and the event is appropriate, based on the reliability of the recognition result text for the verbal sound recognized by the speech recognition unit 102 and the reliability of the event recognized by the non-verbal speech recognition unit 103.
  • The score determination unit 104 compares the reliability of the recognition result text "Aboha" (a misrecognition of "ahahaha") from the speech recognition unit 102 with the reliabilities of the positive and negative events (with/without laughter) from the non-verbal speech recognition unit 103.
  • the speech recognition unit 102 outputs the reliability of the recognition result text: 0.3.
  • the nonverbal speech recognition unit 103 outputs reliability of negative event (no laughter): 0.3 and reliability of positive event (with laughter): 0.7.
  • the score determination unit 104 determines the recognition result text or event with the highest reliability.
  • the reliability of the positive event (with laughter): 0.7 is the highest reliability, so "ahahaha" is determined as an event indicating laughter.
  • the score determination unit 104 may determine the recognition result text or event based on the highest reliability, or may use values adjusted by weighting for each reliability. For example, the score determination unit 104 may determine which event the utterance content is based on a value obtained by multiplying the reliability of the positive event or the negative event by a predetermined coefficient. More specifically, the score determination unit 104 may determine the non-verbal sound by comparing the value obtained by multiplying the reliability of the positive event by 2 with the reliability of the negative event.
  • The score determination unit 104 may also multiply the reliability of the positive event by 0.7, subtract 0.1, and compare the result with a threshold to determine whether the positive event is appropriate for the non-verbal sound.
  • These coefficients and threshold values may be stored in a memory or the like as fixed values in advance, or an input unit may be provided so that they can be input from the outside. Further, the given coefficients and thresholds may be varied according to predetermined formulas.
  • the score determination unit 104 is not limited to comparison with the binary reliability of each of the positive event and the negative event, and may include other values, for example, comparison with three or more values. For example, laughter and cough may be treated as affirmative events, no laughter and no cough as negative events, and keyboard sounds as noise events.
  • Such weighting adjustments are determined, for example, according to the attributes or type of the users who use the speech recognition device 100 of the present disclosure, or the contents of the meeting. For example, for meeting content that tends to provoke laughter, laughter is easy to recognize, but for meeting content that does not, laughter is harder to recognize as laughter. For such meetings or users, the weights can be adjusted as described above to enable accurate recognition.
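  • A minimal sketch of the score determination described above, assuming per-unit confidences in the range 0 to 1; the function name and the weight and offset parameters are illustrative assumptions rather than the claimed method:

```python
def determine(text_conf, positive_event_conf, negative_event_conf,
              weight=1.0, offset=0.0):
    """Return 'event' if the weighted positive-event score wins, else 'text'."""
    adjusted = positive_event_conf * weight + offset
    # The event is adopted only when it beats both the negative event and the
    # speech recognition confidence for the same span.
    if adjusted > negative_event_conf and adjusted > text_conf:
        return "event"
    return "text"

# FIG. 2 example: text "Aboha" conf 0.3, laughter 0.7, no laughter 0.3.
print(determine(0.3, 0.7, 0.3))  # -> "event" (treated as laughter)
```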
  • the result output unit 105 selects a highly reliable recognition result text based on the determination result of the score determination unit 104, or adds and outputs event tag information according to the event.
  • the result output unit 105 inputs the information of the event determined by the score determination unit 104 and acquires the event tag information corresponding to the event from the storage unit or the like.
  • the storage section or the like stores these event tag information in advance.
  • Event tag information is, for example, information that graphically represents non-verbal sounds, and in the case of laughter, it is a graphical mark of laughter.
  • Event tag information may be, for example, predefined text information.
  • the output may be output to an external terminal, or may be displayed on a display.
  • FIG. 3 is a diagram showing an example of processing results for a certain utterance.
  • FIG. 3 shows, as examples of processing results, the recognition result text, the recognition result morphemes, the recognition result text reliability, the event reliabilities (laughter, cough, keyboard sound), the judgment result, the result output, and the supplemental output. It also shows the reliabilities, result output, and so on in units of words.
  • the following utterance content is acquired, and the recognition result by the speech recognition unit 102 is obtained.
  • Utterance content: I, um, (laughter: hahaha), something, (cough: gohho), (keyboard sound: kakakaka), nice.
  • The above utterance is in Japanese, spoken as "watashi wa ano- hahaha nanka gohho kakakaka ii". In Japanese, the particle "ha" following the subject is pronounced "wa".
  • Recognition result text: the whole utterance is recognized as Japanese text, segmented as watashi, wa, ano-, hahaha, nanka, gohho, kakakaka, and ii; the laughter "hahaha" is misrecognized as the word "haha" (mother) and the cough "gohho" as "Gohho" (Van Gogh).
  • recognition results: "I” (watashi) and "wa” (wa) are obtained for the utterance: "I am” (watashiwa). Confidences of 0.95 and 0.91 are derived for the recognition result text, respectively.
  • The reliability of each event derived by the non-verbal speech recognition unit 103 for "I" (watashi) is as shown in FIG. 3.
  • The score determination unit 104 selects, based on these reliabilities, the most reliable recognition result texts "I" (watashi) and "wa" (wa), and the result output unit 105 outputs "I am" (watashi wa).
  • For the utterance "ano-", the speech recognition unit 102 sets the speech recognition reliability of the recognition result text "ano-" to 0.9 and recognizes it as a filler.
  • For the utterance "hahaha", the speech recognition unit 102 produces the recognition result text "haha" (mother) with a reliability of only 0.1, while the non-verbal speech recognition unit 103 recognizes it as a laughter event.
  • the result output unit 105 outputs event tag information indicating that laughter occurred as a supplemental output. Note that the result output unit 105 does not have to output the event tag information.
  • The score determination unit 104 treats a recognition result text identified as a filler from its recognition result morpheme as a filler even if that text has a high reliability, and the result output unit 105 then does not include it in the result output. Event tag information indicating a filler may be output as a supplemental output if necessary.
  • the reliability is calculated for each word, and the determination is made based on the reliability of each word, but it is not limited to this.
  • The score determination unit 104 may make its determination in a determination unit different from the recognition unit of the recognition result text output by the speech recognition unit 102 and the non-verbal speech recognition unit 103. For example, in FIG. 3 the speech recognition unit 102 and the non-verbal speech recognition unit 103 calculate the reliability for each word, but the score determination unit 104 may integrate the reliabilities per clause or per sentence and make the determination based on the integrated reliability.
  • By changing the score determination range, the position at which the event tag information indicating the non-verbal sound is added can also be changed, which makes the text easier to read. For example, when score determination is made per sentence, the event tag information is added to the end of the sentence.
  • FIG. 4 shows an overview of the processing when the determination unit is changed to sentences. For convenience of explanation, the event reliabilities and the like are shown in simplified form compared with FIG. 3. Assume that the following utterance content is input and the recognition result text below is obtained.
  • Recognition result text: "I, umm, mother, something, Van Gogh, kakakaka, good" (an English gloss of the Japanese recognition result, into which a cough and keyboard noise are mixed). In romaji, the recognition result text is segmented as watashi, wa, ano-, hahaha, nanka, gohho, kakakaka, and ii.
  • In this example, the score determination unit 104 performs score determination on a sentence-by-sentence basis. That is, it totals the speech recognition reliabilities and the non-verbal speech recognition reliabilities over the recognition result text of one sentence. Taking FIG. 4 as an example, the score determination unit 104 calculates, for the sentence, the total reliability of the recognition result text and the total reliability of each event (the positive event and the negative event). Based on these totals it determines the adequacy of the recognition result text and the presence or absence of each event. In the example of FIG. 4, the score determination unit 104 determines that the totals for the recognition result text and for the affirmative events are equal to or greater than the predetermined value.
  • Based on this determination, the score determination unit 104 outputs to the result output unit 105 the recognition result texts whose reliability is equal to or higher than the predetermined value ("I", "wa", "ano-", "something", and "good"), and also outputs the determined event information (laughter, cough, keyboard sound).
  • the result output unit 105 acquires the event tag information from the event information and outputs it together with the recognition result text whose reliability is equal to or higher than a predetermined value.
  • event tag information can be output for each sentence, and event tag information can be added to the end of the recognition result text, making it easier to read.
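  • As a rough sketch of such sentence-unit determination (the thresholds and tag strings below are assumptions for illustration), per-word reliabilities can be summed over a sentence and the event tag information appended at the end of the sentence:

```python
def output_sentence(words, word_confs, event_confs_per_word,
                    word_threshold=0.5, event_threshold=1.0):
    # Keep only words whose speech recognition reliability clears the threshold.
    kept = [w for w, c in zip(words, word_confs) if c >= word_threshold]
    # Sum each event's confidence over the whole sentence.
    totals = {}
    for confs in event_confs_per_word:
        for event, c in confs.items():
            totals[event] = totals.get(event, 0.0) + c
    tags = [f"({event})" for event, total in totals.items() if total >= event_threshold]
    return " ".join(kept) + (" " + " ".join(tags) if tags else "")

sentence = output_sentence(
    ["I", "um", "hahaha", "nice"],
    [0.95, 0.9, 0.1, 0.8],
    [{}, {}, {"laughter": 1.4}, {}],
)
print(sentence)  # -> "I um nice (laughter)"
```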
  • FIG. 5 is a flow chart showing the operation of the speech recognition apparatus 100.
  • The speech acquisition unit 101 acquires a speech waveform signal (S101), detects speech segments from the signal, and outputs the speech (or other sound) in each segment to the speech recognition unit 102 and the non-verbal speech recognition unit 103 (S102).
  • the speech recognition unit 102 performs speech recognition processing on the speech signal and outputs the recognition result text, reading, and reliability (S103).
  • the non-verbal speech recognition unit 103 also performs non-verbal speech recognition processing on the speech signal and outputs the reliability of each event for each recognition target time (S104).
  • the score determination unit 104 determines the validity of the recognition result text or event for each recognition target based on the reliability of the recognition result text by speech recognition processing and the reliability of each event recognized by non-verbal speech recognition. (S105).
  • the result output unit 105 selects an appropriate recognition result text from the recognition result texts based on the determination result, or acquires and outputs event tag information (S106).
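  • The flow of S101 to S106 can be sketched as follows; the six callables are placeholders standing in for the units described above, and their names and signatures are assumptions made for illustration:

```python
def run_pipeline(acquire_waveform, detect_segments,
                 recognize_speech, recognize_nonverbal,
                 determine, render_output):
    waveform = acquire_waveform()                           # S101: speech acquisition unit 101
    for segment in detect_segments(waveform):               # S102: speech segment detection
        text_results = recognize_speech(segment)            # S103: unit 102 -> text, reading, reliability
        event_results = recognize_nonverbal(segment)        # S104: unit 103 -> per-event reliability
        decisions = determine(text_results, event_results)  # S105: score determination unit 104
        render_output(decisions)                            # S106: result output unit 105
```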
  • FIG. 6 is a diagram showing processing when the utterance is in English. For convenience of explanation, the description is simplified. In FIG. 6, the following utterance is made.
  • Utterance content: I go (laughter: hahaha) to (cough: coff coff) (keyboard sound: clatter) school.
  • Recognition result text I go ah the her head to Costoco caca grata.
  • the recognition result texts "I”, “go”, “to”, and "school” recognized by the speech recognition unit 102 have high speech recognition reliability.
  • the recognition result texts "the her head”, “Costoco”, and “caca gratta” recognized by the speech recognition unit 102 have low recognition result text reliability but high event reliability.
  • The event reliability for laughter is high for the recognition result text "the her head". This is because the speech recognition unit tried to recognize the laughter "hahaha" as a verbal sound.
  • Next, the process of adding event tag information (a mark such as an image, a symbol, or text) indicating what kind of event a non-verbal sound corresponds to will be described. In the following, expressions and processes peculiar to Japanese are explained using Japanese notation.
  • FIG. 7 is a diagram showing a specific example thereof.
  • FIG. 7(a) is a diagram showing actual utterance contents.
  • FIG. 7(b) is a diagram showing a result output based on the recognition result.
  • In FIG. 7, conversations between users A and B are shown. To simplify the explanation, the content of the conversation is omitted and indicated as "---".
  • user A is having a conversation while laughing, and user B is giving a nod in response to the conversation.
  • The speech recognition unit 102 and the non-verbal speech recognition unit 103 can recognize who the speaker is based on the speech channel or on sound source separation. In English, the conversation is, for example: Mr. A: "I am Japanese. (laughs) I live in Tokyo." Mr. B: "hi".
  • the speech recognition unit 102 and the non-language speech recognition unit 103 distinguish between user A's and user B's utterances (sound sources) and perform speech recognition processing and non-language speech recognition processing, respectively.
  • The speech recognition unit 102 and the non-verbal speech recognition unit 103 perform speech recognition processing and non-verbal speech recognition processing on user A's speech and record the elapsed time of the utterance.
  • Likewise, the speech recognition unit 102 and the non-verbal speech recognition unit 103 perform speech recognition processing and non-verbal speech recognition processing on user B's speech and record its elapsed time. The result output unit 105 can therefore use these elapsed times to indicate that user B is giving a backchannel response to user A's utterance.
  • FIG. 8 is a diagram showing an example of processing determination for user A and user B.
  • the speech recognition unit 102 and the non-language speech recognition unit 103 perform speech recognition processing and non-language speech recognition processing separately for user A and user B based on the sound source separation technique.
  • The speech recognition unit 102 and the non-verbal speech recognition unit 103 that process user A's voice recognize user A's utterance "desu." (a sentence-ending expression).
  • This "desu." marks the end of a sentence in Japanese. In Japanese the sentence ends with a verb (copula), but in other languages the ending is not necessarily a verb.
  • The speech recognition unit 102 and the non-verbal speech recognition unit 103 that process user B's voice recognize user B's backchannel event "un".
  • The speech recognition unit 102 and the non-verbal speech recognition unit 103 each track the passage of time. Although omitted in FIG. 8, the elapsed time is associated with each recognition result text and managed. When recognizing the end of a sentence in an English conversation, it is judged from the conversation as a whole.
  • The result output unit 105 outputs user A's "desu." and user B's event tag information (a backchannel mark) according to the determination by the score determination unit 104, together with the other recognized recognition result texts.
  • FIG. 7(b) is a diagram showing a specific example of the result output.
  • The result output unit 105 outputs the backchannel mark according to the elapsed time of user B's speech. As a result, the backchannel mark is placed at the position corresponding to user A's recognition result text "I am --- desu." (meaning "I am ---.").
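  • One possible way to place another speaker's backchannel mark using elapsed time is sketched below; the data layout and insertion rule are assumptions for illustration, not the claimed method:

```python
def insert_backchannel(a_words, b_events):
    """a_words: [(word, end_sec)], b_events: [(mark, time_sec)] -> merged string."""
    out, bi = [], 0
    for word, end_sec in a_words:
        out.append(word)
        # Insert every backchannel mark whose time falls before this word ends.
        while bi < len(b_events) and b_events[bi][1] <= end_sec:
            out.append(b_events[bi][0])
            bi += 1
    out.extend(mark for mark, _ in b_events[bi:])  # any trailing marks
    return " ".join(out)

print(insert_backchannel(
    [("I", 0.4), ("am", 0.8), ("---", 1.6), ("desu.", 2.0)],
    [("(nod)", 1.7)],
))  # -> "I am --- desu. (nod)"
```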
  • FIG. 9 is a diagram showing a result output example in another example. As shown in the figure, FIG. 9(a) shows the utterance content.
  • the speech recognition unit 102 and the non-language speech recognition unit 103 recognize the contents of user A's and user B's utterances, respectively.
  • the result output unit 105 outputs recognition result text and event tag information according to the determination result of the score determination unit 104 .
  • the result output unit 105 adds event tag information to the recognition result text for each sentence. That is, event tag information is added to each user's utterance.
  • In FIG. 9, the result output unit 105 adds user B's event tag information (one backchannel mark) to user A's recognition result text "---- desu.".
  • As event tag information of users A and B, one laughter mark for user A and one backchannel mark for user B are added.
  • The non-verbal speech recognition unit 103 recognizes the occurrence time of user B's backchannel response and can insert the mark into user A's recognition result text according to that time.
  • The result output unit 105 outputs the recognition result text of user A's utterance together with the event tag information: one laughter mark for user A and two backchannel marks for user B. The result output unit 105 operates so that this event tag information is output at the end of the recognition result text of user A's utterance.
  • That is, the result output unit 105 appends to the recognition result text the event tag information of a mark indicating that user A laughed and a mark indicating that user B gave a backchannel response.
  • The non-verbal speech recognition unit 103 recognizes that user A laughed once and that user B gave backchannel responses twice.
  • utterances and responses to the utterances in conversations of a plurality of users are treated as one paragraph unit.
  • the paragraph unit means the unit of the user who uttered the recognition result text, but it may also be determined based on the interval between utterances.
  • Here, the result output unit 105 adds the laughter mark to user A's recognition result text, and the event tag information (backchannel mark) that is the recognition result for user B is added to the end of user A's recognition result text.
  • That is, the result output unit 105 outputs the recognition result text and event tag information based on the speaker whose utterance contains recognition result text, here user A; when user B's utterance is only an event, user B's event tag information is output at the end of user A's recognition result text (or event tag information).
  • the result output unit 105 may output the event tag information together with the recognition result text of user B.
  • FIG. 10 is a diagram showing a result output example in another example.
  • The result output unit 105 can also recognize topic boundaries based on the contents of user A's and user B's utterances (or recognition result texts). For example, when the result output unit 105 detects a character string indicating an intention to change the topic, such as "by the way", it places the marks at the end of the topic so far. In FIG. 10, the result output unit 105 determines that a topic ends when user B responds with "hmm", and adds the smiley mark for user A and the backchannel mark for user B at that point. A topic may also be estimated, or a scene divided, using a known topic estimation engine or topic segmentation engine.
  • When an event occurs multiple times, a mark corresponding to the number of occurrences, or a number indicating the count, may be added.
  • FIGS. 7 and 9 show the case where laughter occurs as an example, but the speech recognition apparatus 100 may store event tag information (images, marks, text) for each event, and for each occurrence frequency, in a memory or the like (not shown), and change the event tag information according to the frequency. That is, a reference value for the occurrence frequency may be prepared and the event tag information changed when it is reached.
  • Event tag information may also be provided per frequency band. For example, when coughing occurs fewer than five times, a mark indicating coughing is added, and when coughing occurs five times or more, event tag information expressing concern, for example a mark showing a concerned face, is added. In a meeting or the like, sections in which laughter events occur frequently (at or above a predetermined frequency) can be given an approving mark or the like instead of the laughter mark.
  • Similarly, when an event such as keyboard sounds occurs at or above a certain frequency, a warning mark to that effect may be added. If the volume is above a certain level, a mark indicating a stronger warning (for example, about loud keystroke sounds) may be added, and beyond that, a mark indicating an even stronger warning message is added.
  • These reference values or predetermined counts may be fixed in advance, or may be determined from the frequency of occurrence of events in the whole conversation. For example, the occurrence frequency of a certain event over all utterances, averaged per unit of time (or per paragraph, sentence, etc.), may serve as the reference value (or predetermined count), and the event tag information may be added when the frequency exceeds that value.
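  • A hedged sketch of such frequency-dependent event tag information, with the reference value taken as the average occurrence per paragraph; the tag strings and the five-occurrence cough rule mirror the examples above, while the helper names are assumptions:

```python
def reference_value(counts_per_paragraph):
    """Average occurrences per paragraph, used as the reference value."""
    return sum(counts_per_paragraph) / max(len(counts_per_paragraph), 1)

def choose_tag(event, count, reference):
    # Swap the tag for a stronger one when the count exceeds the reference.
    if event == "cough":
        return "(cough)" if count < 5 else "(concerned face)"
    if event == "laughter":
        return "(laughter)" if count < reference else "(approval)"
    return f"({event})"

ref = reference_value([1, 0, 3, 2])    # -> 1.5
print(choose_tag("laughter", 4, ref))  # -> "(approval)"
```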
  • As event tag information, pictograms such as smileys, or videos, may be used instead of marks.
  • Alternatively, the character color of the recognition result text of the utterance immediately before the event may be changed. For example, for laughter the recognition result text may be given a bright color (for example, yellow) and the background a bright color (sky blue); for a sigh, the text may be dark (for example, gray) and the background dark (gray).
  • In this way, the contents of the event tag information are changed according to the type of event, the score determined by the score determination unit 104, and the frequency of the event; the display form of the event tag information may also be changed.
  • In addition, a warning or advice may be displayed, for example a message prompting the speaker to talk in a brighter voice.
  • The speech recognition device 100 may perform speech recognition processing and non-verbal speech recognition processing on conversations of one or more users in real time, or on recorded conversation data.
  • When speech recognition processing and non-verbal speech recognition processing are performed in real time at a conference or the like, the speech recognition apparatus 100 may output (display) the results only to a specific person (terminal), or output (display) them for everyone to see. If there is a display field associated with each speaker (terminal), as in a web conference, the results may be displayed in the field corresponding to the speaker. The results may also be shown only to user A, or displayed on the screens of other participants such as users C and D as well.
  • FIG. 11 is a diagram showing an example thereof.
  • FIG. 11(a) shows actual utterance content
  • FIG. 11(b) shows an example of output of results.
  • the result output unit 105 outputs the recognition result text recognized by the speech recognition unit 102 and the event tag information recognized by the non-language speech recognition unit 103 .
  • The result output unit 105 manages the recognition result output by associating the event tag information (laughter mark L1 or backchannel marks L2 and L3) with the recognition result text.
  • When the result output unit 105 displays the recognition result text and the event tag information on the display as the result output, the user viewing the display can select the event tag information L (L1 to L3) by an operation such as a mouse click. The result output unit 105 receives the selection and outputs to the display the recognition result text linked to that event tag information, which the display then shows (FIG. 11(c)).
  • This process may be performed when the determination result of the score determination unit 104 satisfies a predetermined condition. That is, when neither the reliability of a certain recognition result text nor the corresponding event reliability reaches a predetermined value, it may be difficult to determine accurately which is appropriate. In that case, the result output unit 105 may acquire the recognition result text, the event tag information, and their respective reliabilities from the score determination unit 104, and switch the display according to the user's operation.
  • When the output destination is an external terminal, the event tag information is likewise output in association with the recognition result text.
  • The external terminal displays the recognition result text and the event tag information, and when the event tag information L (L1 to L3) is selected by a user operation on the external terminal (for example, a mouse click), the external terminal displays the recognition result text linked to that event tag information.
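  • A minimal sketch of keeping an event tag linked to its recognition result text so that a display or external terminal can toggle between them on selection; the class and method names are assumptions for illustration:

```python
class LinkedResult:
    def __init__(self, tag: str, text: str, show_tag: bool = True):
        self.tag, self.text, self.show_tag = tag, text, show_tag

    def toggle(self) -> str:
        """Called when the user selects the displayed item (e.g. a mouse click)."""
        self.show_tag = not self.show_tag
        return self.render()

    def render(self) -> str:
        return self.tag if self.show_tag else self.text

item = LinkedResult(tag="(laughter)", text="hahaha")
print(item.render())  # -> "(laughter)"
print(item.toggle())  # -> "hahaha"
```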
  • FIG. 12 is a block diagram showing the functional configuration of the speech recognition device 100a according to the modification of the present disclosure. As shown in the figure, the speech recognition device 100a includes a correction unit 106 in addition to the functional configuration of the speech recognition device 100 in FIG.
  • the correction unit 106 is a part that corrects the recognition result text or event tag information of the result output that is output from the result output unit 105 and displayed on the display.
  • This display displays the recognition result text and event tag information, and the correction unit 106 accepts the correction portion indicated by the pointer or the like (part of the recognition result text or event tag information) according to the user's operation.
  • One or a plurality of correction candidates corresponding to the corrected portion are displayed to the user on the display. For example, the correction candidates are displayed in a pulldown.
  • the result output unit 105 associates the recognition result text and event tag information for the same utterance with each other and manages (stores) them as correction candidates.
  • the recognition result text includes text converted into kanji, text in hiragana, text in katakana, and other converted symbols or text.
  • The correction unit 106 switches the display to the correction candidate selected by the user, and the result output unit 105 outputs that candidate to the display.
  • the correction candidates are recognition result texts and events recognized by the speech recognition unit 102 and the non-language speech recognition unit 103 .
  • the candidate corrections may include text converted to kanji, plain hiragana text, plain katakana text, or other converted symbols or text.
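  • A minimal sketch of how the correction unit 106 might hold correction candidates per displayed item and apply the user's selection; the structure and identifiers are assumptions for illustration:

```python
class CorrectionUnit:
    def __init__(self):
        # Candidates per displayed item: recognition result text variants
        # (kanji / hiragana / katakana) plus recognized event tags.
        self.candidates = {}

    def register(self, item_id, candidates):
        self.candidates[item_id] = candidates

    def correct(self, item_id, selected_index):
        """Return the candidate the user selected from the pulldown."""
        return self.candidates[item_id][selected_index]

unit = CorrectionUnit()
unit.register("utt-3", ["(laughter)", "mother", "hahaha"])
print(unit.correct("utt-3", 2))  # -> "hahaha"
```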
  • FIG. 13 is a diagram showing a specific example of the correction.
  • FIG. 13(a) shows utterance content
  • FIG. 13(b) is a diagram showing an example of a correction screen by the user.
  • the correction unit 106 moves the pointer P according to user's operation.
  • the correction unit 106 displays a correction candidate B when the correction portion indicated by the pointer P is selected.
  • the result output unit 105 outputs the candidate as a result output to the display.
  • The correction candidates B include "mother" (the misrecognized text) as well as the text "hahaha" recognized as-is, and either can be selected.
  • the event tag information is explained as the correction target, but of course the recognition result text may also be the correction target.
  • FIG. 14 is a block diagram showing the functional configuration of a speech recognition device 100b of another modification. This speech recognition device 100b has a tabulation unit 107 in addition to the functional configuration of the speech recognition device 100.
  • This tabulation unit 107 is a part that tabulates the frequency of occurrence of events recognized by the non-verbal speech recognition unit 103 determined by the score determination unit 104 .
  • the tabulation unit 107 tabulates the types of events such as laughter and nod determined by the score determination unit 104 and their occurrence frequencies.
  • The tabulation unit 107 may tabulate by time period, paragraph, topic, or speaker. It may also delimit the tallies by a long pause, a change of paragraph, or the like. As a result, a user analyzing the speech recognition result can judge which utterances are important, and how important, from the amount of laughter or nodding for each classification such as time period or topic.
  • The tabulation unit 107 treats a predetermined character string obtained by the speech recognition of the speech recognition unit 102 (for example, "by the way") as a topic delimiter.
  • Speakers can also be similarly distinguished by the speech recognition unit 102 and the non-verbal speech recognition unit 103 based on speakers distinguished by source separation or speech channels.
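  • A hedged sketch of the tabulation described above, counting determined events grouped by topic (delimited by a cue phrase such as "by the way") and by speaker; the record format is an assumption for illustration:

```python
from collections import defaultdict

def tabulate(records, topic_cues=("by the way",)):
    """records: [(speaker, text_or_None, event_or_None)] in utterance order."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[(topic, speaker)][event]
    topic = 0
    for speaker, text, event in records:
        if text and any(cue in text.lower() for cue in topic_cues):
            topic += 1                    # start a new topic at the cue phrase
        if event:
            counts[(topic, speaker)][event] += 1
    return counts

counts = tabulate([
    ("A", "I am --- desu.", "laughter"),
    ("B", None, "nod"),
    ("A", "by the way, ---", None),
    ("B", None, "nod"),
])
print(dict(counts[(0, "B")]))  # -> {'nod': 1}
```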
  • FIG. 15 is a graph showing the frequency of event types for each topic.
  • the speech recognition device 100b can provide information on the frequency of occurrence of each event for each topic in order to provide this graph.
  • a table may be used instead of the graph.
  • the occurrence frequency of each event for each speaker may be provided. Accordingly, it is possible to obtain a positive or negative analysis result of being a topic or speaker for each topic or speaker.
  • time zone A time period in which many laughter sounds or nods occur is regarded as a positive time period, and a time period in which there are few laughter sounds or nods or many sighs is regarded as a negative time period, and this can be used as an analysis result.
  • As described above, the speech recognition apparatus 100 of the present disclosure includes the speech acquisition unit 101 functioning as a sound information acquisition unit that acquires sound information, the speech recognition unit 102 that acquires a recognition result text for user A's voice (first sound information) among the sound information, the non-verbal speech recognition unit 103 that determines that user B's voice (second sound information), which is associated with user A's voice, is the occurrence of a non-verbal event (for example, a nod), and the result output unit 105 that outputs event tag information indicating the event in association with the recognition result text.
  • The event tag information is additional information such as a mark indicating laughter, which is one of the non-verbal sound recognition results. Such additional information may be represented by colors as well as by symbols.
  • the result output unit 105 in the speech recognition apparatus 100 outputs the event tag information in association with the recognition result text when the occurrence frequency of the event satisfies a predetermined condition.
  • For example, when the frequency of occurrence of laughter is at or above a predetermined number, a mark indicating laughter is added to the recognition result text. This makes it possible to intuitively grasp the atmosphere of the conversation: the atmosphere differs between one or two laughs and many more.
  • The result output unit 105 adds user B's event tag information (information indicating an event) to the recognition result text for a sound information group (user A's utterance and user B's utterance) that satisfies a predetermined condition and includes user A's utterance.
  • the predetermined condition is that the sound information group is divided into sentence units, paragraph units, or topic units.
  • The result output unit 105 adds the event tag information to the end of user A's recognition result text when an event occurs for another speaker such as user B. That is, for utterances by different speakers, the event tag information is added to the end of the recognition result text. This makes it possible to intuitively grasp the atmosphere of the conversation.
  • the speech recognition device 100a further includes a correction unit 106 that corrects the recognition result text or the information indicating the event.
  • the voice recognition unit 102 performs voice recognition processing on the voices of user A and user B (the first sound information and the second sound information) to obtain respective recognition result texts.
  • the non-verbal speech recognition unit 103 performs non-linguistic speech recognition processing on the speech of user A and user B (the first sound information and the second sound information) to determine the occurrence of each event.
  • the correction unit 106 corrects the sound information using the recognition result text or event tag information.
  • the recognition result text and event tag information can be corrected, and correct recognition results can be obtained.
  • The speech recognition unit 102 and the non-verbal speech recognition unit 103 perform speech recognition processing and determine whether an event has occurred. The result output unit 105 then acquires the recognition result text and the event tag information according to the determination result of the score determination unit 104, and outputs, for example, the recognition result text of the laughter together with the event tag information of user B's laughter to the display.
  • the operating user who sees this display can switch the display between the event tag information of user B and the recognition result text. That is, when the operating user performs a switching operation (for example, selection by a pointer), the result output unit 105 performs output control for switching display between the event tag information and the recognition result.
  • When the output destination is not the display but an external terminal, the result output unit 105 outputs user B's event tag information and the recognition result text to the external terminal. This output corresponds to output control that enables switching between the event tag information and the recognition result text. On the external terminal, the event tag information is displayed, and the display can be switched between the event tag information and the recognition result text according to the user's operation of the external terminal.
  • In the above embodiments, the determination of reliability and the output of results are performed within the speech recognition apparatus 100, but these processes may be delegated to an external terminal. That is, instead of producing the result output itself, the speech recognition apparatus 100 may output to the external terminal the recognition result text from the speech recognition process and its reliability, together with the recognition result (event, etc.) from the non-verbal speech recognition process and its reliability.
  • the external terminal can obtain the recognition result text and the like based on the information.
  • The sound information processing unit includes the score determination unit 104, which evaluates the recognition results of the speech recognition unit 102 and the non-verbal speech recognition unit 103, and the result output unit 105, which processes and outputs the recognition result text of the speech recognition unit 102 according to that determination. That is, based on the determination result of the score determination unit 104, the result output unit 105 does not output the non-verbal sound portion of the recognition result text and outputs only the language portion.
  • The speech recognition unit 102 derives the verbal sound reliability, that is, the reliability that the sound information is language (the reliability of the recognition result text), and the non-verbal speech recognition unit 103 derives the non-verbal sound reliability, that is, the reliability that the sound information is a non-verbal sound.
  • the score determination unit 104 determines the recognition results (recognition result text and each event) by the speech recognition unit 102 and the non-language speech recognition unit 103 based on these reliability levels (the verbal sound reliability level and the non-verbal sound reliability level). judge.
  • The event reliabilities indicating the non-verbal sound reliability are the reliability that the speech waveform signal (sound information) is a non-verbal sound and the reliability that it is not a non-verbal sound; that is, they correspond to the positive event and the negative event, respectively.
  • the score determination unit 104 performs determination processing by weighting at least one of the reliability of each of the positive event and the negative event.
  • the nonverbal language is at least one of laughter, coughing, nodding, sneezing, and keyboard sounds. It is not limited to these.
  • The speech recognition unit 102 performs speech recognition in a predetermined speech recognition unit (sentence, clause, word, etc.), and the non-verbal speech recognition unit 103 performs non-verbal speech recognition in time units corresponding to that speech recognition unit.
  • the score determination unit 104 may determine the recognition result in a determination unit different from the voice recognition unit. For example, speech recognition and non-language speech recognition may be performed on a word-by-word basis, and score determination may be made on a sentence-by-sentence basis.
  • each functional block may be implemented using one device physically or logically coupled, or directly or indirectly using two or more physically or logically separate devices (e.g. , wired, wireless, etc.) and may be implemented using these multiple devices.
  • a functional block may be implemented by combining software in the one device or the plurality of devices.
  • Functions include judging, determining, calculating, computing, processing, deriving, investigating, searching, confirming, receiving, transmitting, outputting, accessing, resolving, selecting, choosing, establishing, comparing, assuming, expecting, regarding, broadcasting, notifying, communicating, forwarding, configuring, reconfiguring, allocating, mapping, assigning, and the like, but are not limited to these.
  • For example, a functional block (component) that performs the transmitting function is called a transmitting unit or a transmitter.
  • the implementation method is not particularly limited.
  • the speech recognition devices 100, 100a, and 100b may function as computers that perform processing of the speech recognition method of the present disclosure.
  • the speech recognition device 100 will be described below, the same applies to the speech recognition devices 100a and 100b.
  • FIG. 16 is a diagram showing an example of the hardware configuration of the speech recognition device 100 according to an embodiment of the present disclosure.
  • the speech recognition device 100 described above may be physically configured as a computer device including a processor 1001, a memory 1002, a storage 1003, a communication device 1004, an input device 1005, an output device 1006, a bus 1007, and the like.
  • the term "apparatus” can be read as a circuit, device, unit, or the like.
  • the hardware configuration of the speech recognition apparatus 100 may be configured to include one or more of each device shown in the figure, or may be configured without some of the devices.
  • Each function of the speech recognition apparatus 100 is realized by causing the processor 1001 to perform calculations, controlling communication by the communication device 1004, and controlling at least one of reading and writing of data in the memory 1002 and the storage 1003.
  • the processor 1001 for example, operates an operating system and controls the entire computer.
  • the processor 1001 may be configured by a central processing unit (CPU) including an interface with peripheral devices, a control device, an arithmetic device, registers, and the like.
  • the speech recognition unit 102 and the non-verbal speech recognition unit 103 described above may be implemented by the processor 1001 .
  • the processor 1001 also reads programs (program codes), software modules, data, etc. from at least one of the storage 1003 and the communication device 1004 to the memory 1002, and executes various processes according to them.
  • As the program, a program that causes a computer to execute at least part of the operations described in the above embodiments is used.
  • the speech recognition unit 102 may be implemented by a control program stored in the memory 1002 and running on the processor 1001, and other functional blocks may be implemented similarly.
  • The processor 1001 may be implemented by one or more chips. The program may be transmitted from a network via a telecommunication line.
  • The memory 1002 is a computer-readable recording medium, and may be composed of at least one of, for example, ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), and RAM (Random Access Memory).
  • the memory 1002 may also be called a register, cache, main memory (main storage device), or the like.
  • the memory 1002 can store executable programs (program code), software modules, etc. for implementing a speech recognition method according to an embodiment of the present disclosure.
  • the storage 1003 is a computer-readable recording medium, for example, an optical disc such as a CD-ROM (Compact Disc ROM), a hard disk drive, a flexible disc, a magneto-optical disc (for example, a compact disc, a digital versatile disc, a Blu-ray disk), smart card, flash memory (eg, card, stick, key drive), floppy disk, magnetic strip, and/or the like.
  • Storage 1003 may also be called an auxiliary storage device.
  • the storage medium described above may be, for example, a database, server, or other suitable medium including at least one of memory 1002 and storage 1003 .
  • the communication device 1004 is hardware (transmitting/receiving device) for communicating between computers via at least one of a wired network and a wireless network, and is also called a network device, a network controller, a network card, a communication module, or the like.
  • The communication device 1004 may include, for example, a high-frequency switch, a duplexer, a filter, and a frequency synthesizer in order to realize at least one of frequency division duplex (FDD) and time division duplex (TDD).
  • the voice acquisition unit 101 and the like described above may be implemented by the communication device 1004 .
  • the voice acquisition unit 101 may be physically or logically separated into a transmitting unit and a receiving unit.
  • the input device 1005 is an input device (for example, keyboard, mouse, microphone, switch, button, sensor, etc.) that receives input from the outside.
  • the output device 1006 is an output device (eg, display, speaker, LED lamp, etc.) that outputs to the outside. Note that the input device 1005 and the output device 1006 may be integrated (for example, a touch panel).
  • Each device such as the processor 1001 and the memory 1002 is connected by a bus 1007 for communicating information.
  • the bus 1007 may be configured using a single bus, or may be configured using different buses between devices.
  • The speech recognition device 100 may also include hardware such as a microprocessor, a digital signal processor (DSP), an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), and an FPGA (Field Programmable Gate Array), and part or all of each functional block may be implemented by such hardware.
  • processor 1001 may be implemented using at least one of these pieces of hardware.
  • notification of information is not limited to the aspects/embodiments described in the present disclosure, and may be performed using other methods.
  • For example, notification of information may be implemented by physical layer signaling (e.g., DCI (Downlink Control Information), UCI (Uplink Control Information)), higher layer signaling (e.g., RRC (Radio Resource Control) signaling, MAC (Medium Access Control) signaling, broadcast information (MIB (Master Information Block), SIB (System Information Block))), other signals, or a combination thereof.
  • RRC signaling may also be called an RRC message, and may be, for example, an RRC connection setup message, an RRC connection reconfiguration message, or the like.
  • Input/output information may be stored in a specific location (for example, memory) or managed using a management table. Input/output information and the like may be overwritten, updated, or appended. The output information and the like may be deleted. The entered information and the like may be transmitted to another device.
  • The determination may be made by a value represented by one bit (0 or 1), by a true/false value (Boolean: true or false), or by numerical comparison (for example, comparison with a predetermined value).
  • Notification of predetermined information is not limited to being performed explicitly; it may be performed implicitly (for example, by not notifying the predetermined information).
  • Software, whether referred to as software, firmware, middleware, microcode, hardware description language, or by any other name, should be interpreted broadly to include instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, and the like.
  • software, instructions, information, etc. may be transmitted and received via a transmission medium.
  • For example, when software is transmitted from a website, server, or other remote source using wired technology (coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), etc.) and/or wireless technology (infrared, microwave, etc.), these wired and/or wireless technologies are included within the definition of a transmission medium.
  • data, instructions, commands, information, signals, bits, symbols, chips, etc. may refer to voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, light fields or photons, or any of these. may be represented by a combination of
  • The terms "judging" and "determining" used in this disclosure may encompass a wide variety of actions.
  • "Judging" and "determining" may include, for example, deeming that judging, calculating, computing, processing, deriving, investigating, looking up, searching, or inquiring (e.g., looking up in a table, a database, or another data structure) constitutes "judging" or "determining".
  • "Judging" and "determining" may include deeming that receiving (e.g., receiving information), transmitting (e.g., transmitting information), input, output, or accessing (e.g., accessing data in memory) constitutes "judging" or "determining".
  • "Judging" and "determining" may include deeming that resolving, selecting, choosing, establishing, comparing, and the like constitutes "judging" or "determining".
  • In other words, "judging" and "determining" may include deeming that some action constitutes "judging" or "determining".
  • "Judging" ("determining") may be read as "assuming", "expecting", "considering", and the like.
  • The terms "connected" and "coupled", and any variations thereof, mean any direct or indirect connection or coupling between two or more elements, and can include the presence of one or more intermediate elements between two elements that are "connected" or "coupled" to each other. The coupling or connection between elements may be physical, logical, or a combination thereof. For example, "connection" may be read as "access".
  • As used in this disclosure, two elements may be considered to be "connected" or "coupled" to each other by using at least one of one or more wires, cables, and printed electrical connections and, as some non-limiting and non-exhaustive examples, by using electromagnetic energy having wavelengths in the radio frequency, microwave, and light (both visible and invisible) regions.
  • any reference to elements using the "first,” “second,” etc. designations used in this disclosure does not generally limit the quantity or order of those elements. These designations may be used in this disclosure as a convenient method of distinguishing between two or more elements. Thus, reference to a first and second element does not imply that only two elements can be employed or that the first element must precede the second element in any way.
  • The statement "A and B are different" may mean "A and B are different from each other".
  • The term may also mean that "A and B are each different from C".
  • Terms such as “separate,” “coupled,” etc. may also be interpreted in the same manner as “different.”
  • DESCRIPTION OF SYMBOLS: 100... speech recognition device; 101... speech acquisition unit; 102... speech recognition unit; 103... non-verbal speech recognition unit; 104... score determination unit; 105... result output unit; 106... correction unit; 107... aggregation unit.

Abstract

The purpose of the present invention is to provide a sound recognition device capable of ascertaining the mood of a conversation at, for example, a meeting. A sound recognition device 100 comprises: a sound acquisition unit 101 that functions as an audio information acquisition unit which acquires audio information; a sound recognition unit 102 that acquires a recognition result text of a sound of a user A (first audio information) out of the audio information; a nonverbal sound recognition unit 103 that determines that a sound of a user B (second audio information) with respect to (associated with) the sound of the user A (first audio information) is the occurrence of a nonverbal event (for example, a nod); and a result output unit 105 that outputs event tag information indicative of an event in association with the recognition result text.

Description

Speech recognition device
The present invention relates to a speech recognition device that performs speech recognition.
Patent Document 1 describes a conference support system that recognizes text data representing speech from speech segments while distinguishing between speech segments and non-speech segments contained in audio data of a conference or the like.
JP 2018-45208 A
However, such a system cannot recognize laughter and the like in a meeting, and therefore cannot convey the atmosphere of the meeting. It is thus difficult to analyze the quality of the meeting, such as its atmosphere.
An object of the present invention is to provide a speech recognition device that makes it possible to grasp the atmosphere of a conversation in a meeting or the like.
A speech recognition device of the present invention includes: a sound information acquisition unit that acquires sound information; and a sound information processing unit that performs speech recognition processing on first sound information among the sound information to acquire a recognition result, determines that second sound information related to the first sound information is the occurrence of a nonverbal event, and outputs information indicating the event in association with the recognition result.
According to the present invention, it is possible to obtain speech recognition results that make it easy to grasp the atmosphere of a conversation in a meeting or the like.
FIG. 1 is a block diagram showing the functional configuration of a speech recognition device 100 according to the present disclosure.
FIG. 2 is an explanatory diagram showing the reliability of a speech recognition result and the reliability of each recognized event.
FIG. 3 is a diagram showing an example of processing results for a certain utterance.
FIG. 4 is a diagram showing an outline of determination processing when the determination interval is changed.
FIG. 5 is a flowchart showing the operation of the speech recognition device 100.
FIG. 6 is a diagram showing processing when the utterance is in English.
FIG. 7 is a diagram showing an example of output results in the process of adding event tag information.
FIG. 8 is a diagram showing an example of processing determination for user A and user B.
FIG. 9 is a diagram showing an example of output results in another example.
FIG. 10 is a diagram showing an example of output results in yet another example.
FIG. 11 is a diagram illustrating an example in which event tag information and recognition result text are associated with each other.
FIG. 12 is a block diagram showing the functional configuration of a speech recognition device 100a according to a modification of the present disclosure.
FIG. 13 is a diagram showing a specific example of correction.
FIG. 14 is a block diagram showing the functional configuration of a speech recognition device 100b according to another modification.
FIG. 15 is a graph showing the frequency of event types for each topic.
FIG. 16 is a diagram showing an example of the hardware configuration of the speech recognition device 100 according to an embodiment of the present disclosure.
An embodiment of the present disclosure will be described with reference to the accompanying drawings. Where possible, the same parts are denoted by the same reference numerals, and overlapping descriptions are omitted.
FIG. 1 is a block diagram showing the functional configuration of the speech recognition device 100 according to the present disclosure. As shown in the figure, the speech recognition device 100 includes a speech acquisition unit 101, a speech recognition unit 102, a non-verbal speech recognition unit 103 (non-verbal sound recognition unit), a score determination unit 104, and a result output unit 105. Each of these components will be described below.
The speech acquisition unit 101 is a part that acquires speech in a meeting, lecture, or the like. For example, the speech acquisition unit 101 is a microphone. It is not limited to this, and may be a part that acquires an audio signal transmitted by wire or wirelessly. The speech acquisition unit 101 detects speech segments from the speech waveform signal and outputs each segment to the speech recognition unit 102 and the non-verbal speech recognition unit 103.
The speech recognition unit 102 is a part that performs speech recognition processing, using a known language model and acoustic model, on the verbal or non-verbal sounds in the speech segments output from the speech acquisition unit 101, acquires a recognition result text for each recognition unit, and derives the reading and the reliability of the recognition result text. The speech recognition unit 102 outputs the recognition result text, its reading, and its reliability per utterance, sentence, phrase, word, kana or phoneme, or time unit.
This reliability is information indicating how much the recognition result of the speech recognition processing can be trusted. It is generally expressed as a value between 0 and 1, but is not limited to this; it may be expressed as an integer or on a scale of 0 to 100, and it may also be a normalized value. In the present disclosure, the reliability of a speech recognition result is obtained based on the reliabilities stored in the language model and the acoustic model, but the method is not limited to this; other known reliability derivation methods, such as those used in end-to-end speech recognition, may be used.
The non-verbal speech recognition unit 103 is a part that performs recognition processing for non-verbal sounds, using a known recognition model for non-verbal sounds, on the verbal or non-verbal sounds in the speech segments output from the speech acquisition unit 101. The non-verbal speech recognition unit 103 generates a reliability for each event corresponding to a type of non-verbal sound, for each recognition time corresponding to the recognition unit used by the speech recognition unit 102. For example, when the speech recognition unit 102 performs recognition word by word, the utterance time of each recognized word becomes the recognition unit of the non-verbal speech recognition unit 103.
An event indicates a type of non-verbal sound, and events include a positive event and a negative event. In the present disclosure, non-verbal sounds include voice-based non-verbal sounds such as laughter, backchannel sounds, nodding sounds (affirmative or negative can be distinguished), sighs, sneezes, and coughs, as well as other sounds such as keyboard sounds and music such as BGM. For example, a positive event indicates that the sound is laughter, and a negative event indicates that it is not laughter. The non-verbal speech recognition unit 103 recognizes these non-verbal sounds and generates reliabilities for the positive event and the negative event.
The reliability of an event is information indicating how much the recognition result of the non-verbal speech recognition processing can be trusted. This reliability is generally expressed as a value between 0 and 1, but is not limited to this; it may be expressed as an integer or on a scale of 0 to 100, and it may also be a normalized value. This reliability is obtained based on the reliabilities stored in the recognition model for recognizing non-verbal sounds, but is not limited to this; other known reliability derivation methods may be used.
FIG. 2 is an explanatory diagram showing the reliability of a speech recognition result text and the reliability of each event. As shown in the figure, assume that there is an utterance 「あははは」 (uttered as "ahahaha"). This indicates laughter.
Because the speech recognition unit 102 tries to recognize the speech as verbal sound, it recognizes it as the text data 「あ母派」 (converted into Japanese kanji) with the reading "ahahaha". The speech recognition unit 102 derives the reliability of the recognized recognition result text using the acoustic model and the language model.
Meanwhile, the non-verbal speech recognition unit 103 derives the reliability of each event. In FIG. 2, it derives reliabilities according to the presence or absence of laughter. That is, it calculates the reliability that 「あははは」 ("ahahaha") is not laughter and the reliability that it is laughter. In FIG. 2, the calculated reliabilities are 0.3 for "no laughter" and 0.7 for "laughter".
In addition, the reliability of the text data being laughter may be calculated. Reliabilities of other types of non-verbal sounds, such as the presence or absence of a cough or of keyboard sounds, may also be calculated.
The non-verbal speech recognition unit 103 has recognition models for recognizing laughter, coughs, and keyboard sounds, and can calculate the respective reliabilities based on these recognition models. Naturally, it may also have recognition models for recognizing backchannel sounds, nodding sounds, and sneezes. It may also have a recognition model that outputs the reliability of each of a plurality of events.
The score determination unit 104 is a part that determines which of the recognition result text and the event is more appropriate, based on the reliability of the recognition result text for the verbal sound recognized by the speech recognition unit 102 and the reliability of the event recognized by the non-verbal speech recognition unit 103.
The detailed processing will be explained using FIG. 2. In FIG. 2, it is assumed that 「あははは」 ("ahahaha") is input as speech and processed by the speech recognition unit 102 and the non-verbal speech recognition unit 103. The score determination unit 104 compares the reliability of the recognition result text 「あ母派」 of the speech recognition unit 102 (the Japanese text into which "ahahaha" was recognized; here, an example of misrecognition) with the reliabilities of the positive event and the negative event (with/without laughter) of the non-verbal speech recognition unit 103. In FIG. 2, the speech recognition unit 102 outputs a reliability of 0.3 for the recognition result text, and the non-verbal speech recognition unit 103 outputs a reliability of 0.3 for the negative event (no laughter) and 0.7 for the positive event (with laughter).
The score determination unit 104 selects the recognition result text or event with the highest reliability. In FIG. 2, since the reliability of the positive event (with laughter), 0.7, is the highest, 「あははは」 ("ahahaha") is determined to be an event indicating laughter.
Note that the score determination unit 104 may determine the recognition result text or event based on the highest reliability as described above, or may use values adjusted by weighting each reliability. For example, the score determination unit 104 may determine which event the utterance corresponds to based on a value obtained by multiplying the reliability of the positive event or the negative event by a predetermined coefficient. More specifically, the score determination unit 104 may determine the non-verbal sound by multiplying the reliability of the positive event by 2 and comparing the result with the reliability of the negative event.
The score determination unit 104 may also determine whether the positive event is appropriate for the non-verbal sound by multiplying the reliability of the positive event by 0.7, subtracting 0.1, and comparing the result with a threshold. These coefficients and thresholds may be stored in advance in a memory or the like as fixed values, or an input unit may be provided so that they can be input from the outside. Furthermore, the given coefficients and thresholds may be varied according to a predetermined formula. In addition, the score determination unit 104 is not limited to comparing the two reliabilities of the positive event and the negative event, and may compare three or more values. For example, the presence of laughter and the presence of a cough may be treated as positive events, the absence of laughter and the absence of a cough as negative events, and the presence of keyboard sounds as a noise event, and the reliabilities of these three kinds of events may be compared.
Such weighting adjustments are determined according to, for example, the attributes or types of the users who use the speech recognition device 100 of the present disclosure, or the content of the meeting. For example, in a meeting whose content tends to provoke laughter, laughter is easy to recognize, but there are also meetings where this is not the case, and laughter is then hard to recognize as laughter. For such meetings or users, the adjustment described above enables accurate recognition.
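As a concrete illustration of the comparison described above, the following is a minimal sketch in Python that is not taken from the patent itself: the function name, the weighting dictionary, and the floor value are assumptions introduced only to show how a highest-(adjusted-)reliability judgment with optional coefficients and a threshold-like offset could be organized.

```python
# Hypothetical sketch of the kind of score judgment the score determination
# unit 104 performs; weights, floor, and label names are illustrative only.

def judge_segment(text_confidence, event_confidences, weights=None, floor=0.0):
    """Pick the most plausible interpretation of one recognition unit.

    text_confidence: reliability of the recognition result text (e.g. 0.3)
    event_confidences: dict of event label -> reliability,
        e.g. {"laughter": 0.7, "no_laughter": 0.3}
    weights: optional per-label multiplicative coefficients
    floor: optional value subtracted after weighting
    """
    weights = weights or {}
    candidates = {"text": text_confidence}
    for label, conf in event_confidences.items():
        candidates[label] = conf * weights.get(label, 1.0) - floor
    # The judgment is simply the candidate with the highest adjusted score.
    best = max(candidates, key=candidates.get)
    return best, candidates[best]


if __name__ == "__main__":
    # Example corresponding to the "ahahaha" utterance: text 0.3, laughter 0.7.
    label, score = judge_segment(0.3, {"laughter": 0.7, "no_laughter": 0.3})
    print(label, score)  # -> laughter 0.7
```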
The result output unit 105 selects a highly reliable recognition result text based on the determination result of the score determination unit 104, or adds event tag information corresponding to the event, and outputs the result. When outputting, the result output unit 105 receives the information on the event determined by the score determination unit 104 and acquires the event tag information corresponding to the event from a storage unit or the like, which stores the event tag information in advance. Event tag information is, for example, information that graphically represents a non-verbal sound; in the case of laughter, it is a mark representing laughter as a figure. Event tag information may also be, for example, predefined text information. The output may be sent to an external terminal or displayed on a display screen.
FIG. 3 is a diagram showing an example of processing results for a certain utterance. FIG. 3 shows, as the example processing results, the recognition result text, the recognition result morphemes, the reliability of the recognition result text, the event reliabilities for laughter, cough, and keyboard sound, the determination result, the result output, and the supplemental output. It also shows the reliabilities, result output, and so on in units of words.
The following utterance content is acquired, and the recognition results by the speech recognition unit 102 are obtained.
Utterance content: 「私は、あのー、ははは、なんか、(cough: gohho)、(keyboard sound: kakakaka)、いい。」
The utterance is in Japanese and is spoken as "watashiwa ano- hahaha nanka (gohho) kakakaka ii". In Japanese, the particle 「は」 that follows the subject is pronounced "wa".
Recognition result text: 「私 は あのー 母派 なんか ゴッホ かかかか いい」
This is the recognition result text converted into Japanese, in which the cough and the keyboard sound are mixed in. Described in romaji, it is segmented into watashi, wa, ano-, hahaha, nanka, gohho, kakakaka, and ii and converted into Japanese.
Here, for the utterance 「私は」 (watashiwa), the recognition results 「私」 (watashi) and 「は」 (wa) are obtained, with recognition result text reliabilities of 0.95 and 0.91, respectively. Meanwhile, the reliabilities of each event derived by the non-verbal speech recognition unit 103 for 「私」 (watashi) are as follows.
Reliability of the laughter positive event: 0.23
Reliability of the laughter negative event: 0.90
Reliability of the cough positive event: 0.21
Reliability of the cough negative event: 0.70
Reliability of the keyboard-sound positive event: 0.15
Reliability of the keyboard-sound negative event: 0.75
The reliabilities of each event derived by the non-verbal speech recognition unit 103 for 「は」 (wa) are as follows.
Reliability of the laughter positive event: 0.35
Reliability of the laughter negative event: 0.85
Reliability of the cough positive event: 0.05
Reliability of the cough negative event: 0.81
Reliability of the keyboard-sound positive event: 0.12
Reliability of the keyboard-sound negative event: 0.85
The score determination unit 104 selects 「私」 (watashi) and 「は」 (wa), the recognition result texts with the highest reliability, based on these reliabilities, and the result output unit 105 outputs 「私は」 (watashiwa).
On the other hand, for the utterance 「あのー」 (ano-) and the recognition result text 「あのー」 (ano-), the speech recognition unit 102 gives a speech recognition reliability of 0.9 and recognizes it as a filler.
For the utterance 「ははは」 (hahaha) and the recognition result text 「母派」 (the text into which hahaha was recognized; reliability: 0.1), the non-verbal speech recognition unit 103 calculates a reliability of 0.7 for the positive event (with laughter). Since the reliability of the positive event (with laughter) is higher than the reliability of the recognition result text and the reliabilities of the other positive events (cough, keyboard sound), the score determination unit 104 determines that the recognition result 「母派」 is laughter. The result output unit 105 outputs, as a supplemental output, event tag information indicating that laughter occurred. Note that the result output unit 105 does not have to output this event tag information.
In the present disclosure, the score determination unit 104 treats a recognition result text that it has determined to be a filler based on the recognition result morphemes as a filler, even if the recognition result text has a high reliability. The result output unit 105 then does not output that recognition result text as a result output. If necessary, event tag information indicating a filler may be output as a supplemental output.
Incidentally, in the example of FIG. 3, the reliability is calculated for each word and the determination is made based on the reliability of each word, but this is not restrictive. The score determination unit 104 may change the recognition unit of the recognition result text output by the speech recognition unit 102 and the non-verbal speech recognition unit 103 into a different determination unit for score determination. For example, in FIG. 3 the speech recognition unit 102 and the non-verbal speech recognition unit 103 calculate the reliability for each word, but the score determination unit 104 may integrate the reliabilities per phrase or per sentence and make the determination based on the integrated reliability. By changing the range of score determination in this way, for example, the position at which the event tag information indicating a non-verbal sound is added can be changed, making the text easier to read. For example, when the score is determined per sentence, the event tag information is added to the end of the sentence.
FIG. 4 is a diagram showing an overview of the processing when the score determination unit is changed to the sentence unit. For convenience of explanation, descriptions of event reliabilities and the like are simplified compared with FIG. 3. Assume that the following utterance content is input and the following recognition result text is obtained.
Utterance content: 「わたしは、あのー、(laughter: ははは)、なんか、(cough: gohho)、(keyboard sound: kakakaka)、いい」
The utterance is in Japanese and is spoken as "watashiwa ano- hahaha nanka (gohho) kakakaka ii". In Japanese, the particle 「は」 following the subject is pronounced "wa".
Recognition result text: 「私は、あのー、母派、なんか、ゴッホ、かかかか、いい」
This is the recognition result text converted into Japanese, in which the cough and the keyboard sound are mixed in. Described in romaji, it is segmented into watashi, wa, ano-, hahaha, nanka, gohho, kakakaka, and ii and converted into Japanese.
Here, the score determination unit 104 performs score determination per sentence. That is, the score determination unit 104 sums the speech recognition reliabilities and the non-verbal speech recognition reliabilities within the recognition result text of one sentence. Taking FIG. 4 as an example, the score determination unit 104 calculates, for the sentence, the total reliability of the recognition result text and the total reliability of each event (the reliabilities of the positive event and the negative event). Based on these totals, it determines the adequacy of the recognition result text and the presence or absence of each event. In the example of FIG. 4, since the totals of the reliabilities of the recognition result text and of the positive events are equal to or greater than predetermined values, the score determination unit 104 determines that this recognition result text includes speech, laughter, a cough, and a keyboard sound.
The score determination unit 104 then outputs to the result output unit 105 the recognition result texts whose reliability is equal to or greater than the predetermined value, namely 「私」, 「は」, 「あのー」, 「なんか」, and 「いい」, together with the event information (laughter, cough, keyboard sound). The result output unit 105 acquires the event tag information from the event information and outputs it together with the recognition result texts whose reliability is equal to or greater than the predetermined value.
In this way, event tag information can be output per sentence and added to the end of the recognition result text, producing text that is easy to read.
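The sentence-unit determination described above can be pictured with the following rough sketch; the data layout, the threshold values, and the bracketed tag strings are assumptions for illustration and are not defined in the disclosure.

```python
# Assumed sketch of aggregating word-level reliabilities into one
# sentence-level judgment, so that event tags are appended at the sentence end.

def judge_sentence(words, text_threshold=0.5, event_threshold=1.0):
    """words: list of dicts like
       {"text": "私", "text_conf": 0.95,
        "events": {"laughter": 0.23, "cough": 0.21, "keyboard": 0.15}}
    Thresholds are illustrative values only."""
    kept_text = [w["text"] for w in words if w["text_conf"] >= text_threshold]
    # Sum each event's positive reliability over the whole sentence.
    totals = {}
    for w in words:
        for label, conf in w["events"].items():
            totals[label] = totals.get(label, 0.0) + conf
    detected = [label for label, total in totals.items() if total >= event_threshold]
    # Event tags go after the sentence text, keeping the transcript readable.
    return "".join(kept_text) + "".join(f"[{label}]" for label in detected)
```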
Next, the operation of the speech recognition device 100 of the present disclosure will be described. FIG. 5 is a flowchart showing the operation of the speech recognition device 100. The speech acquisition unit 101 acquires a speech waveform signal (S101), detects speech segments from the speech waveform signal, and outputs the speech (or other sound) of each segment as an audio signal to the speech recognition unit 102 and the non-verbal speech recognition unit 103 (S102).
The speech recognition unit 102 performs speech recognition processing on the audio signal and outputs the recognition result text, its reading, and its reliability (S103). The non-verbal speech recognition unit 103 performs non-verbal speech recognition processing on the audio signal and outputs the reliability of each event for each recognition target time (S104).
The score determination unit 104 determines, for each recognition target, the adequacy of the recognition result text or the event, based on the reliability of the recognition result text obtained by the speech recognition processing and the reliability of each event recognized by the non-verbal speech recognition (S105).
The result output unit 105 selects an appropriate recognition result text from the recognition result texts based on the determination result, or acquires event tag information, and outputs it (S106).
Through such processing, verbal and non-verbal sounds during a meeting or the like can be recognized.
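The S101 to S106 flow can be summarized with the following schematic sketch; the recognizer objects and their method names are placeholders assumed for illustration, and only the control flow mirrors the description above.

```python
# Minimal end-to-end sketch of the S101-S106 flow. The vad, asr, nonverbal,
# judge, and render arguments are hypothetical components injected by the caller.

def run_pipeline(waveform, vad, asr, nonverbal, judge, render):
    segments = vad.split(waveform)                        # S101-S102: segment detection
    outputs = []
    for seg in segments:
        text, reading, text_conf = asr.recognize(seg)     # S103: recognition text, reading, reliability
        event_confs = nonverbal.recognize(seg)            # S104: per-event reliabilities
        decision = judge(text, text_conf, event_confs)    # S105: text vs. event judgment
        outputs.append(render(decision))                  # S106: text or event tag output
    return outputs
```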
Next, an application example for another language will be described. The above utterance examples were in Japanese, but the same processing is naturally possible in other languages. FIG. 6 is a diagram showing the processing when the utterance is in English. For convenience of explanation, the description is simplified. In FIG. 6, the following utterance is made.
Utterance content: I go (laughter: hahaha) to (cough: off coff cough) (keyboard sound: clatter) school.
Recognition result text: I go ah the her head to Costoco caca grata.
Here, the recognition result texts "I", "go", "to", and "school" recognized by the speech recognition unit 102 have high speech recognition reliability.
On the other hand, the recognition result texts "the her head", "Costoco", and "caca grata" recognized by the speech recognition unit 102 have low recognition result text reliability but high event reliability. For example, for the recognition result text "the her head", the event reliability of laughter is high. This is because the laughter "hahaha" was recognized as if it were verbal sound.
The same applies to the recognition result texts "Costoco" and "caca grata": as a result of attempting to recognize the cough and the keyboard sound as verbal sounds, their speech reliabilities are derived as low. Since these sounds are a cough and keyboard noise, the recognition result texts for them should not be output.
In the example of FIG. 6, the following result is output based on the recognition result text reliabilities and the event reliabilities.
Result output: I go to school.
In this way, non-verbal sounds can be prevented from being output even for English speech recognition.
Next, the process of adding event tag information (a mark such as an image, a symbol, or text) indicating what kind of event a recognized non-verbal sound relates to, when the speech recognition device 100 of the present disclosure recognizes a non-verbal sound, will be described. In the following, where there are expressions and processes peculiar to Japanese, the explanation includes Japanese notation.
FIG. 7 is a diagram showing a specific example. FIG. 7(a) shows the actual utterance content. FIG. 7(b) shows the result output based on the recognition results.
FIG. 7 shows a conversation between users A and B. To simplify the explanation, the content of the conversation is omitted and indicated as 「―――」. FIG. 7(a) shows that user A talks while laughing and that user B gives backchannel responses to that conversation. The speech recognition unit 102 and the non-verbal speech recognition unit 103 can recognize who the speaker is based on, for example, the audio channel or sound source separation.
Written in English, the conversation is, for example, as follows.
User A:  I am Japanese.  (laughter)   I live in Tokyo.
User B:                  yeah (backchannel)          yeah (backchannel)
In FIG. 7, the speech recognition unit 102 and the non-verbal speech recognition unit 103 distinguish the utterances (sound sources) of user A and user B and perform speech recognition processing and non-verbal speech recognition processing on each. The speech recognition unit 102 and the non-verbal speech recognition unit 103 track the elapsed time of each utterance: they perform speech recognition processing and non-verbal speech recognition processing on user A's speech while recognizing the elapsed time of that speech, and likewise perform speech recognition processing and non-verbal speech recognition processing on user B's speech while recognizing the elapsed time of that speech. The result output unit 105 can therefore use these elapsed times to show that user B is giving backchannel responses in response to user A's utterance.
FIG. 8 is a diagram showing an example of processing determination for user A and user B. As shown in the figure, the speech recognition unit 102 and the non-verbal speech recognition unit 103 distinguish user A and user B based on a sound source separation technique and perform speech recognition processing and non-verbal speech recognition processing on each.
As shown in FIG. 8, the speech recognition unit 102 and the non-verbal speech recognition unit 103 that process user A's speech recognize user A's utterance 「です。」 (pronounced "desu", a Japanese verb). This 「です。」 indicates the end of a sentence in Japanese. In Japanese the sentence ends with a verb, but in other languages this is not necessarily the case. In parallel, the speech recognition unit 102 and the non-verbal speech recognition unit 103 that process user B's speech recognize user B's backchannel event 「うん」 (un). The speech recognition unit 102 and the non-verbal speech recognition unit 103 each keep track of the elapsed time; although omitted in FIG. 8, the elapsed time is associated with and managed for each recognition result text. When recognizing the end of a sentence in an English conversation, the judgment is made from the conversation as a whole.
The result output unit 105 outputs user A's 「です。」 (desu) and user B's event tag information (backchannel mark) in accordance with the determination processing of the score determination unit 104. At that time, the result output unit 105 outputs the recognized recognition result text 「です。」 (desu) and the event tag information (backchannel mark) at positions corresponding to the elapsed time of each utterance.
FIG. 7(b) is a diagram showing a specific example of the resulting output. As shown in the figure, the result output unit 105 outputs the backchannel mark in accordance with the elapsed time of user B's utterance. As a result, this backchannel mark corresponds to the position of the recognition result text of user A's 「私は―――です。」 (meaning "I am ―――.").
In this way, by adding event tag information according to the utterances of multiple users and their utterance timing, the atmosphere of the conversation can be expressed accurately.
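One way to realize the time-based alignment described here might look like the following sketch; the tuple layout and tag strings are assumptions for illustration, not part of the disclosure.

```python
# Illustrative sketch of merging a second speaker's event tags into the first
# speaker's transcript by elapsed time, as in FIG. 7(b).

def merge_by_time(tokens_a, events_b):
    """tokens_a: list of (start_time, text) for speaker A's recognized words.
    events_b: list of (start_time, tag) for speaker B's detected events,
    e.g. (3.2, "[backchannel]")."""
    merged = sorted(
        [(t, text) for t, text in tokens_a] + [(t, tag) for t, tag in events_b],
        key=lambda item: item[0],  # stable sort: ties keep A's word before B's tag
    )
    return " ".join(text for _, text in merged)
```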
FIG. 9 is a diagram showing an example of output results in another example. FIG. 9(a) shows the utterance content. The speech recognition unit 102 and the non-verbal speech recognition unit 103 recognize the utterances of user A and user B, respectively. The result output unit 105 outputs the recognition result text and the event tag information in accordance with the determination result of the score determination unit 104.
In FIG. 9(b), the result output unit 105 adds the event tag information to the recognition result text per sentence, that is, per user utterance. In FIG. 9(b), the result output unit 105 adds user B's event tag information (backchannel mark x1) to user A's recognition result text 「----です。」 (meaning "I am ----."). Similarly, to the recognition result text of user A's next utterance 「-----なんだよ。」 (meaning "I think -----."; the verb is not limited to "think" and depends on the conversation), it adds the event tag information of users A and B: laughter mark of user A x1 and backchannel mark of user B x1. As described above, the non-verbal speech recognition unit 103 recognizes the time at which user B's backchannel occurred and can insert the tag into user A's recognition result text in accordance with that time.
In FIG. 9(c), the result output unit 105 outputs, per paragraph (per speaking user), the recognition result text and the event tag information: 「----です。-----なんだよ。」 (meaning "I am ----. I think -----.") together with "laughter mark of user A x1, backchannel mark of user B x2". The result output unit 105 then operates so as to output the event tag information at the end of the recognition result text of user A's utterances.
That is, the result output unit 105 adds to the recognition result text, and outputs, event tag information consisting of a mark indicating that user A laughed and a mark indicating that user B gave backchannel responses. Here, the non-verbal speech recognition unit 103 recognizes that user A laughed once and that user B gave backchannel responses twice, and the result output unit 105 appends to each mark a number indicating that count. In this case, an utterance in a conversation among multiple users and the responses to it are treated as one paragraph unit. A paragraph unit means the unit of the user who uttered the recognition result text, but it may also be determined based on the intervals between utterances.
In the example of FIG. 9(d), similar processing is performed, but the result output unit 105 adds the laughter mark within user A's recognition result text and adds the event tag information that is user B's recognition result (the backchannel mark) to the end of user A's recognition result text.
That is, the result output unit 105 outputs user A's recognition result text and event tag information based on the utterance of the speaker whose speech contains recognition result text, namely user A, and when user B's utterance consists only of an event, it outputs user B's event tag information at the end of user A's recognition result text (or event tag information). When the recognition result of user B's utterance includes recognition result text, the result output unit 105 may output the event tag information together with that recognition result text of user B.
Next, an example of output results in yet another example is shown. FIG. 10 is a diagram showing this output example. According to this figure, the result output unit 105 can recognize topic boundaries based on the utterance contents (or recognition result texts) of user A and user B. For example, when the result output unit 105 detects a character string indicating an intention to change the topic, such as 「ところで」 ("by the way"), it performs processing to put the marks at the end of the topic so far. In FIG. 10, the result output unit 105 judges that one topic ends at the point where user B gives the backchannel response 「ふーん」 ("hmm"), and adds, at the end of that topic, a laughter mark for user A and a backchannel mark for user B. Alternatively, a topic can be estimated or scenes can be divided using a known topic estimation engine or topic segmentation engine.
This also makes it possible to add event tag information per topic. As topics, for example, the explanation period and the question-and-answer period of a meeting can be distinguished.
In FIG. 10(b), a mark corresponding to the number of occurrences, or a number indicating that count, is added; in addition, when an event has occurred a predetermined number of times or more, a mark with an image emphasizing that frequency may be added. For example, if laughter occurred ten times or more, a large smiling-face mark may be added.
FIGS. 7 and 9 show cases where laughter occurred as examples, but the speech recognition device 100 may prepare event tag information (images, marks, text) for each event and for each occurrence frequency in a storage unit (not shown) such as a memory, and change the event tag information according to the frequency. That is, a reference value for the occurrence frequency may be prepared and the event tag information changed accordingly.
For example, a mark indicating laughter may be added when laughter occurs twice or more, and a mark indicating coughing may be added when coughing occurs five times or more. The event tag information may also depend on the frequency: for example, a mark indicating coughing may be added when coughing occurs fewer than five times, and a mark showing a worried face may be added when coughing occurs five times or more. In a meeting or the like, a section with many laughter events (a predetermined frequency or more) may be given an approval mark or the like instead of the laughter mark.
As for keyboard sounds, when keyboard sounds occur a predetermined number of times or less, a mark displaying a warning to that effect may be added. When the volume is above a certain level, a mark indicating an even stronger display (for example, that the keystrokes are loud) may be added. Also, when speech recognition is performed by the speech recognition unit 102 and a certain amount of recognition result text ends up being output (when the reliability of the recognition result text is equal to or greater than a predetermined value), a mark indicating an even stronger warning message may be added.
These reference values or predetermined numbers of times may be predetermined values, or may be determined based on the occurrence frequency of events in the whole conversation. For example, the occurrence frequency of a certain event in all utterances, averaged per unit of time (or per paragraph, per sentence, etc.), may be used as the reference value (or predetermined number of times), and the event tag information may be added when the occurrence frequency is equal to or greater than that value.
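A frequency-dependent choice of event tag, as described above, might be organized roughly as follows; the threshold values and tag strings are invented for illustration, since the disclosure only states that the tag may change with the occurrence count.

```python
# Hypothetical sketch of mapping an event type and its occurrence count in a
# section to a display tag; thresholds and labels are illustrative assumptions.

def pick_tag(event, count, thresholds=None):
    """thresholds: per-event count at which a stronger tag is used."""
    thresholds = thresholds or {"laughter": 10, "cough": 5}
    if count == 0:
        return ""
    strong = count >= thresholds.get(event, float("inf"))
    if event == "laughter":
        return "[big smile]" if strong else f"[laughter x{count}]"
    if event == "cough":
        return f"[worried x{count}]" if strong else f"[cough x{count}]"
    return f"[{event} x{count}]"
```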
The event tag information may also be, instead of a mark, a pictogram such as a smiling face, or a moving image. Instead of adding event tag information, the character color of the recognition result text of the utterance made immediately before may be changed. For example, for a smile, the recognition result text may be given a bright color (for example, yellow) or the background may be given a bright color (the color of a blue sky); for a sigh, the recognition result text may be given a dark color (for example, gray) or the background may be made dark (gray).
In the above, the content of the event tag information is changed according to the type of event, the score determined by the score determination unit 104, and the frequency of the event, but when multiple events occur, the display form of the event tag information may also be changed according to the ratio of each event to all the events.
When a gloomy event such as a sigh occurs, a warning or advice may be displayed, for example, a message prompting the speaker to talk in a brighter voice.
In the present disclosure, the speech recognition device 100 may perform the speech recognition processing and the non-verbal speech recognition processing on the conversation of one or more users in real time, or may perform the speech recognition processing and the non-verbal speech recognition processing on recorded conversation data.
When performing the speech recognition processing and the non-verbal speech recognition processing in real time in a meeting or the like, the speech recognition device 100 may output (display) the result only to a specific person (terminal) or output (display) it so that everyone can see it. In a web conference or the like, if there is a display field associated with each speaker (terminal), the result may be displayed in the field corresponding to that speaker. The result may be displayed only to user A, or may also be displayed on the screens of other participants such as user C or user D.
 つぎに、イベントタグ情報と認識結果テキストとを関連付けておき、それを切り替え表示可能にする事例について説明する。図11は、その例示を示す図である。図11(a)は、実際の発話内容を示し、図11(b)は、結果出力例を示す。図11(b)に示されるとおり、結果出力部105は、音声認識部102が認識した認識結果テキストおよび非言語音声認識部103が認識したイベントタグ情報を出力する。その際、結果出力部105は、イベントタグ情報(笑いマークL1または相槌マークL2、L3)については、認識結果テキストと紐付けて、認識結果出力管理を行う。 Next, an example will be described in which event tag information and recognition result text are associated with each other and can be switched and displayed. FIG. 11 is a diagram showing an example thereof. FIG. 11(a) shows actual utterance content, and FIG. 11(b) shows an example of output of results. As shown in FIG. 11B , the result output unit 105 outputs the recognition result text recognized by the speech recognition unit 102 and the event tag information recognized by the non-language speech recognition unit 103 . At this time, the result output unit 105 performs recognition result output management by associating the event tag information (laughing mark L1 or backtracking marks L2 and L3) with the recognition result text.
 結果出力部105が、結果出力として認識結果テキストおよびイベントタグ情報をディスプレイに表示する場合、その表示を見たユーザの操作により当該イベントタグ情報L(L1~L3)が選択(マウスによるクリック等)されると、結果出力部105は、その選択を受け付け、そのイベントタグ情報の認識結果テキストをディスプレイに出力し、ディスプレイはそれを表示する(図11(c))。 When the result output unit 105 displays the recognition result text and the event tag information on the display as the result output, the event tag information L (L1 to L3) is selected by the operation of the user viewing the display (clicking with a mouse, etc.). Then, the result output unit 105 receives the selection, outputs the recognition result text of the event tag information to the display, and the display displays it (FIG. 11(c)).
 なお、この処理は、スコア判定部104による判定結果が所定条件である場合に、行うようにしてもよい。すなわち、スコア判定部104が、ある認識結果テキストおよびそれに対応するイベント信頼度が所定の数値に達していない場合に、いずれが妥当か正確な判定が困難である場合がある。その場合に、結果出力部105は、スコア判定部104から認識結果テキストとイベントタグ情報とそれぞれの信頼度とを取得して、ユーザ操作により切り替え表示を可能にしてもよい。 Note that this process may be performed when the result of determination by the score determination unit 104 satisfies a predetermined condition. That is, when the score determination unit 104 does not reach a predetermined numerical value for a certain recognition result text and the event reliability corresponding thereto, it may be difficult to accurately determine which is appropriate. In that case, the result output unit 105 may acquire the recognition result text, the event tag information, and the reliability of each from the score determination unit 104, and switch the display by user operation.
 なお、結果出力部105が他の外部端末に出力する場合、認識結果テキストと紐付けて、認識結果出力を行う。当該外部端末は、認識結果テキストおよびイベントタグ情報を表示するとともに、その外部端末のユーザ操作により当該イベントタグ情報L(L1~L3)が選択(マウスによるクリック等)されると、外部端末は、イベントタグ情報に紐付けられた認識結果テキストを表示する。 When the result output unit 105 outputs to another external terminal, it outputs the recognition result with the event tag information linked to the recognition result text. The external terminal displays the recognition result text and the event tag information, and when the event tag information L (L1 to L3) is selected (for example, clicked with a mouse) by a user operation on the external terminal, the external terminal displays the recognition result text linked to that event tag information.
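 For illustration only, and not as part of the disclosed embodiments, the following minimal Python sketch shows one way a result output unit could keep each piece of event tag information linked to its recognition result text and switch what is shown when a tag is selected. All names (ResultOutput, register, on_tag_selected) and the data layout are assumptions introduced here.

    class ResultOutput:
        """Hypothetical helper: links each event tag to its recognized text."""
        def __init__(self):
            # tag_id -> (event label, recognition result text for that segment)
            self._links = {}

        def register(self, tag_id, event_label, recognized_text):
            self._links[tag_id] = (event_label, recognized_text)

        def render_tag(self, tag_id):
            # What is shown by default: a compact mark such as "[laugh]".
            event_label, _ = self._links[tag_id]
            return f"[{event_label}]"

        def on_tag_selected(self, tag_id):
            # Switching operation: show the recognized text behind the mark.
            _, recognized_text = self._links[tag_id]
            return recognized_text

    out = ResultOutput()
    out.register("L1", "laugh", "hahaha")
    print(out.render_tag("L1"))       # [laugh]
    print(out.on_tag_selected("L1"))  # hahaha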
 つぎに、音声認識結果の修正処理について説明する。図12は、本開示の変形における音声認識装置100aの機能構成を示すブロック図である。図に示されるとおり、音声認識装置100aは、図1における音声認識装置100の機能構成に加えて、修正部106を含んで構成されている。 Next, the correction processing of speech recognition results will be explained. FIG. 12 is a block diagram showing the functional configuration of the speech recognition device 100a according to the modification of the present disclosure. As shown in the figure, the speech recognition device 100a includes a correction unit 106 in addition to the functional configuration of the speech recognition device 100 in FIG.
 修正部106は、結果出力部105からディスプレイに出力されて表示された結果出力の認識結果テキストまたはイベントタグ情報を修正する部分である。このディスプレイは認識結果テキストおよびイベントタグ情報を表示しており、修正部106は、ユーザの操作に従って、ポインタ等で示された修正箇所(認識結果テキストまたはイベントタグ情報の一部)を受け付けると、その修正箇所に応じた一または複数の修正候補をユーザにディスプレイで表示する。例えば、プルダウンでその修正候補を表示する。 The correction unit 106 is a part that corrects the recognition result text or event tag information of the result output that is output from the result output unit 105 and displayed on the display. This display displays the recognition result text and event tag information, and the correction unit 106 accepts the correction portion indicated by the pointer or the like (part of the recognition result text or event tag information) according to the user's operation. One or a plurality of correction candidates corresponding to the corrected portion are displayed to the user on the display. For example, the correction candidates are displayed in a pulldown.
 なお、結果出力部105は、同じ発話に対する認識結果テキストおよびイベントタグ情報を対応付けて修正候補として紐付け管理(記憶)をしている。なお、認識結果テキストには、漢字に変換したテキスト、平仮名のままのテキスト、カタカナのままのテキスト、そのほかの変換した記号またはテキストを含むものとする。 It should be noted that the result output unit 105 associates the recognition result text and event tag information for the same utterance with each other and manages (stores) them as correction candidates. Note that the recognition result text includes text converted into kanji, text in hiragana, text in katakana, and other converted symbols or text.
 そして、修正部106は、ユーザにより選択された一の修正候補を切り替えて、結果出力部105が、その修正候補をディスプレイに出力して、表示させる。修正候補は、音声認識部102および非言語音声認識部103により認識された認識結果テキストおよびイベントである。上述の通り、修正候補は、漢字に変換したテキスト、平仮名のままのテキスト、カタカナのままのテキスト、またはそのほかの変換した記号若しくはテキストを含むものとする。 Then, the correction unit 106 switches the one correction candidate selected by the user, and the result output unit 105 outputs and displays the correction candidate on the display. The correction candidates are recognition result texts and events recognized by the speech recognition unit 102 and the non-language speech recognition unit 103 . As described above, the candidate corrections may include text converted to kanji, plain hiragana text, plain katakana text, or other converted symbols or text.
 図13は、その修正の具体例を示す図である。図13(a)は、発話内容を示し、図13(b)は、ユーザによる修正画面の一例を示す図である。図に示されるとおり、修正部106は、ユーザ操作に従って、ポインタPを移動させる。修正部106は、ポインタPで示された修正箇所が選択されると、修正候補Bを表示する。結果出力部105は、ユーザが修正候補Bから任意の候補を選択するとその候補を結果出力としてディスプレイに出力する。図13において、修正候補Bには、「母派」が含まれており、「ははは」をそのまま認識されたテキストも選択可能としている。なお、本開示においては、「母派」は誤認識されたテキストである。 FIG. 13 is a diagram showing a specific example of the correction. FIG. 13(a) shows the utterance content, and FIG. 13(b) is a diagram showing an example of a correction screen operated by the user. As shown in the figure, the correction unit 106 moves the pointer P according to the user's operation. When the correction portion indicated by the pointer P is selected, the correction unit 106 displays correction candidates B. When the user selects an arbitrary candidate from the correction candidates B, the result output unit 105 outputs that candidate to the display as the result output. In FIG. 13, the correction candidates B include "母派", and the text "ははは" recognized as it is can also be selected. Note that, in the present disclosure, "母派" is a misrecognized text.
 図では、イベントタグ情報を修正対象として説明しているが、当然に認識結果テキストを修正対象としてもよい。 In the diagram, the event tag information is explained as the correction target, but of course the recognition result text may also be the correction target.
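 As a further illustration (again, a sketch rather than the disclosed implementation), the correction flow above can be pictured as a per-segment list of interchangeable candidates, one of which is currently displayed. The segment identifiers and candidate strings below are hypothetical.

    # Per-segment correction candidates: the event tag plus recognized-text
    # variants (kanji conversion, hiragana, katakana, misrecognized text).
    correction_candidates = {
        "seg-3": ["[laugh]", "ははは", "ハハハ", "母派"],
    }
    display = {"seg-3": "[laugh]"}  # what is currently shown for each segment

    def list_candidates(segment_id):
        # Candidates offered in the pull-down for the selected segment.
        return correction_candidates.get(segment_id, [])

    def apply_correction(segment_id, chosen):
        if chosen not in correction_candidates.get(segment_id, []):
            raise ValueError("not a registered candidate")
        display[segment_id] = chosen
        return display[segment_id]

    print(list_candidates("seg-3"))
    print(apply_correction("seg-3", "ははは"))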
 つぎに、イベントの発生頻度の集計について説明する。本開示における変形例における音声認識装置100aは、音声認識を行う際、イベントの頻度を集計してもよい。イベントの発生頻度を集計することにより、その会話の重要度、会話の質等を判断することができる。図14は、別の変形例の音声認識装置100bの機能構成を示すブロック図である。この音声認識装置100bは、音声認識装置100の機能構成に加えて、集計部107を備えている。 Next, counting of the occurrence frequency of events will be described. The speech recognition device 100a according to a modification of the present disclosure may count the frequency of events when performing speech recognition. By counting the occurrence frequency of events, the importance of the conversation, the quality of the conversation, and the like can be judged. FIG. 14 is a block diagram showing the functional configuration of a speech recognition device 100b of another modification. This speech recognition device 100b includes a counting unit 107 in addition to the functional configuration of the speech recognition device 100.
 この集計部107は、スコア判定部104が判定した非言語音声認識部103が認識したイベントの発生頻度を集計する部分である。例えば、集計部107は、スコア判定部104において判定された笑い、頷き等のイベントの種類およびその発生頻度を集計する。その際、集計部107は、時間区間、段落ごと、話題ごと、または話者ごとに集約してもよい。また、長いポーズがあったり、段落が替わったり、等で集約してもよい。これにより、音声認識結果を分析するユーザは、時間区間、話題などの分類ごとの笑いまたは頷きの多さに基づいて、どの発話が重要であるか、その重要度を判断することができる。 This tabulation unit 107 is a part that tabulates the frequency of occurrence of events recognized by the non-verbal speech recognition unit 103 determined by the score determination unit 104 . For example, the tabulation unit 107 tabulates the types of events such as laughter and nod determined by the score determination unit 104 and their occurrence frequencies. At that time, the aggregating unit 107 may aggregate by time period, paragraph, topic, or speaker. Also, it may be aggregated by having a long pause, changing paragraphs, or the like. As a result, the user who analyzes the speech recognition result can determine which utterance is important and its importance level based on the amount of laughter or nod for each classification such as time period and topic.
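 As a rough illustration of this counting step (a sketch only; the record format and field names are assumptions, not taken from the disclosure), the tally can be grouped by any key such as topic, speaker, or time interval:

    from collections import Counter, defaultdict

    # Hypothetical event records produced by the score determination step.
    events = [
        {"topic": "topic-1", "speaker": "A", "event": "laugh"},
        {"topic": "topic-1", "speaker": "B", "event": "nod"},
        {"topic": "topic-2", "speaker": "B", "event": "sigh"},
        {"topic": "topic-1", "speaker": "B", "event": "laugh"},
    ]

    def tally(records, key):
        # Count how often each event type occurs per value of the chosen key.
        counts = defaultdict(Counter)
        for record in records:
            counts[record[key]][record["event"]] += 1
        return counts

    print(dict(tally(events, "topic")))    # per-topic event frequencies
    print(dict(tally(events, "speaker")))  # per-speaker event frequencies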
 なお、集計部107は、話題の区切りを判断するにあたって、音声認識部102が音声認識して得た所定の文字列(例えば、「ところで」(by the way))を話題の区切りとして判断するが、当然にそれ以外の方法で話題を判断してもよい。話者についても同様に、音声認識部102および非言語音声認識部103が、音源分離または音声チャネルで区別した話者に基づいて区別できる。 Note that, when judging the delimiters of topics, the counting unit 107 determines a predetermined character string (for example, “by the way”) obtained by speech recognition by the speech recognition unit 102 as the delimiters of topics. Of course, other methods may be used to determine the topic. Speakers can also be similarly distinguished by the speech recognition unit 102 and the non-verbal speech recognition unit 103 based on speakers distinguished by source separation or speech channels.
 図15は、話題ごとのイベント種別の頻度を示したグラフである。音声認識装置100bは、このグラフを提供するために話題ごとに、各イベントの発生頻度の情報を提供することができる。なお、グラフに代えて表にしてもよい。また、話題に代えて、話者ごとの各イベントの発生頻度を提供してもよい。これによって、話題または話者ごとに肯定的または否定的な話題または話者であることの分析結果を得ることができる。 FIG. 15 is a graph showing the frequency of each event type for each topic. To provide this graph, the speech recognition device 100b can provide information on the occurrence frequency of each event for each topic. A table may be used instead of the graph. In addition, instead of per topic, the occurrence frequency of each event may be provided for each speaker. This makes it possible to obtain, for each topic or each speaker, an analysis result indicating whether that topic or speaker is positive or negative.
 同様に、イベントの発生時刻を参照し、時間帯で分類してもよい。笑い声または頷きが多く発生した時間帯を肯定的な時間帯、笑い声または頷きが少なかったり、ため息が多い時間帯を否定的な時間帯とし、これを分析結果とすることができる。 Similarly, it is possible to refer to the time of event occurrence and classify by time zone. A time period in which many laughter sounds or nods occur is regarded as a positive time period, and a time period in which there are few laughter sounds or nods or many sighs is regarded as a negative time period, and this can be used as an analysis result.
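 A minimal sketch of this positive/negative classification, assuming events have already been bucketed per time zone; the decision rule and labels are illustrative and not values given in the disclosure:

    def classify_time_zone(counts):
        # More laughter/nods than sighs -> positive; the reverse -> negative.
        positive = counts.get("laugh", 0) + counts.get("nod", 0)
        negative = counts.get("sigh", 0)
        if positive > negative:
            return "positive"
        if negative > positive:
            return "negative"
        return "neutral"

    per_zone = {
        "10:00-10:10": {"laugh": 5, "nod": 3},
        "10:10-10:20": {"sigh": 4, "nod": 1},
    }
    print({zone: classify_time_zone(c) for zone, c in per_zone.items()})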
 つぎに、本開示の音声認識装置100の作用効果について説明する。 Next, the effects of the speech recognition device 100 of the present disclosure will be described.
 本開示の音声認識装置100は、音情報を取得する音情報取得部として機能する音声取得部101と、音情報のうちユーザAの音声(第1の音情報)の認識結果テキストを取得する音声認識部102と、ユーザAの音声(第1の音情報)に対する(関連する)ユーザBの音声(第2の音情報)が非言語によるイベント(例えば頷き)の発生であることを判断する非言語音声認識部103と、イベントを示すイベントタグ情報を認識結果テキストに関連付けて出力する結果出力部105を備える。 The speech recognition device 100 of the present disclosure includes the voice acquisition unit 101, which functions as a sound information acquisition unit that acquires sound information; the speech recognition unit 102, which acquires the recognition result text of user A's voice (first sound information) in the sound information; the non-verbal speech recognition unit 103, which determines that user B's voice (second sound information) related to user A's voice (first sound information) is the occurrence of a non-verbal event (for example, a nod); and the result output unit 105, which outputs event tag information indicating the event in association with the recognition result text.
 この構成により、イベントの発生を認識結果に関連付けてユーザ等に提示することができる。例えば、非言語音認識結果の一つである笑いを示すマークなどのイベントタグ情報(付加情報)を認識結果テキストに付加したり、認識結果テキストを色づけするなどで加工する。従って、視覚的にイベントの発生を関連付けた場合に、そのイベントタグ情報を受けたユーザは直感的に、また感覚的にそのイベントを認識することができる。本開示においては、ユーザであっても異なっていても、イベントタグ情報を認識結果テキストに付加することで直感的または感覚的にイベント認識を可能にする。 With this configuration, the occurrence of an event can be presented to a user or the like in association with the recognition result. For example, event tag information (additional information) such as a mark indicating laughter, which is one of the non-verbal sound recognition results, is added to the recognition result text, or the recognition result text is processed, for example by coloring it. Therefore, when the occurrence of the event is associated visually, the user who receives the event tag information can recognize the event intuitively and instinctively. In the present disclosure, whether the users are the same or different, adding the event tag information to the recognition result text enables intuitive and instinctive recognition of the event.
 これらマーク等の付加情報は、記号のほか、色で表した情報としてもよい。 Additional information such as these marks may be information represented by colors in addition to symbols.
 また、音声認識装置100における結果出力部105は、イベントの発生頻度が所定条件を満たす場合に、イベントタグ情報を認識結果テキストに関連付けて出力する。 Also, the result output unit 105 in the speech recognition apparatus 100 outputs the event tag information in association with the recognition result text when the occurrence frequency of the event satisfies a predetermined condition.
 例えば、笑いの発生頻度が所定回数以上である場合に、笑いを示すマークを、認識結果テキストに付加する。これによりその会話の場の雰囲気を直感的に把握できる。すなわち、1,2回程度の笑いと、それ以上の笑いが発生した場合とでは、その場の雰囲気は違う。 For example, when the frequency of occurrence of laughter is a predetermined number or more, a mark indicating laughter is added to the recognition result text. This makes it possible to intuitively grasp the atmosphere of the place of conversation. That is, the atmosphere of the place is different between one or two times of laughter and the case of more laughter.
 また、結果出力部105は、ユーザAの発話を含む所定の条件を満たした音情報群(ユーザAの発話およびユーザBの発話)に対する認識結果テキストに、ユーザBのイベントタグ情報(イベントを示す情報)を関連付ける。例えば、所定の条件とは、音情報群が、文単位、段落単位、または話題単位に区分されていることである。文末、段落末、または話題の末尾にイベントタグ情報を付加することで、ひとまとまりとなった会話でそのイベントの把握を容易にすることができる。 In addition, the result output unit 105 associates user B's event tag information (information indicating an event) with the recognition result text for a sound information group (user A's utterance and user B's utterance) that includes user A's utterance and satisfies a predetermined condition. For example, the predetermined condition is that the sound information group is divided into sentence units, paragraph units, or topic units. By adding the event tag information to the end of a sentence, a paragraph, or a topic, the event can easily be grasped for the conversation as a whole.
 また、結果出力部105は、ユーザBなどのイベントが発生した場合に、イベントタグ情報を、ユーザAの認識結果テキストの末尾に付加する。すなわち、話者が異なる発話における認識結果テキストの末尾にイベントタグ情報を付加する。これにより、会話の雰囲気を直感的に把握できる。 In addition, when an event by another user such as user B occurs, the result output unit 105 adds the event tag information to the end of user A's recognition result text. That is, the event tag information is added to the end of the recognition result text of an utterance by a different speaker. This makes it possible to intuitively grasp the atmosphere of the conversation.
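 For illustration, a minimal sketch of attaching an event mark to the end of another speaker's recognized sentence when the occurrence frequency satisfies a condition; the threshold of two laughs and the mark string are hypothetical, not values from the disclosure:

    LAUGH_MARK = "(laughs)"

    def annotate(recognized_text, listener_events, threshold=2):
        # Append the mark only if the listening user laughed often enough.
        laughs = sum(1 for e in listener_events if e == "laugh")
        if laughs >= threshold:
            return recognized_text + " " + LAUGH_MARK
        return recognized_text

    # User A's recognized sentence; user B's non-verbal events during it.
    print(annotate("Then the demo crashed again.", ["laugh", "laugh", "laugh"]))
    print(annotate("Let's move to the next item.", ["laugh"]))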
 また、音声認識装置100aは、認識結果テキストまたはイベントを示す情報を修正する修正部106をさらに備える。音声認識部102は、ユーザAおよびユーザBの音声(第1の音情報および前記第2の音情報)を音声認識処理してそれぞれの認識結果テキストを取得する。また、非言語音声認識部103は、ユーザAおよびユーザBの音声(第1の音情報および前記第2の音情報)を非言語音声認識処理してそれぞれのイベント発生を判断する。修正部106は、これら音情報のそれぞれの認識結果テキストまたはイベントタグ情報を用いて修正する。 The speech recognition device 100a further includes the correction unit 106, which corrects the recognition result text or the information indicating the event. The speech recognition unit 102 performs speech recognition processing on the voices of user A and user B (the first sound information and the second sound information) to acquire the respective recognition result texts. The non-verbal speech recognition unit 103 performs non-verbal speech recognition processing on the voices of user A and user B (the first sound information and the second sound information) to determine the occurrence of the respective events. The correction unit 106 performs the correction using the recognition result text or the event tag information of each of these pieces of sound information.
 これにより、認識結果テキストおよびイベントタグ情報を修正し、正しい認識結果を得ることができる。 As a result, the recognition result text and event tag information can be corrected, and correct recognition results can be obtained.
 また、音声認識装置100において、音声認識部102および非言語音声認識部103は、ユーザAおよびBの両方の音情報(第1の音情報、第2の音情報)に対して音声認識処理およびイベントの発生の判断を行う。そして、結果出力部105は、スコア判定部104の判定結果に従って、認識結果テキストとイベントタグ情報を取得する。そして、結果出力部105は、例えばユーザBの笑いのイベントタグ情報とともにその笑いの認識結果テキストをディスプレイに出力する。このディスプレイを見た操作ユーザは、ユーザBのイベントタグ情報と認識結果テキストを切り替えて表示させることができる。すなわち、結果出力部105は、操作ユーザが切替操作をすると(例えばポインタによる選択)、イベントタグ情報と認識結果との切り替え表示のための出力制御を行う。 In the speech recognition device 100, the speech recognition unit 102 and the non-verbal speech recognition unit 103 perform the speech recognition processing and the determination of event occurrence on the sound information of both user A and user B (the first sound information and the second sound information). The result output unit 105 then acquires the recognition result text and the event tag information according to the determination result of the score determination unit 104. The result output unit 105 outputs, for example, the event tag information of user B's laughter together with the recognition result text of that laughter to the display. The operating user who sees this display can switch the display between user B's event tag information and the recognition result text. That is, when the operating user performs a switching operation (for example, a selection with a pointer), the result output unit 105 performs output control for switching the display between the event tag information and the recognition result.
 出力先がディスプレイではなく、外部端末である場合には、結果出力部105は、ユーザBのイベントタグ情報と認識結果テキストを外部端末に出力する。この出力はイベントタグ情報と認識結果テキストの切り替えを可能にする出力制御に相当する。この外部端末側では、イベントタグ情報を表示するとともに、その外部端末のユーザ操作に従って、イベントタグ情報と認識結果テキストとの表示の切り替えを可能にする。 When the output destination is not the display but the external terminal, the result output unit 105 outputs the event tag information of user B and the recognition result text to the external terminal. This output corresponds to output control that enables switching between event tag information and recognition result text. On the external terminal side, the event tag information is displayed, and the display of the event tag information and the recognition result text can be switched according to the user's operation of the external terminal.
 上記実施形態においては、音声認識装置100内で、信頼度の判定およびそれに応じた結果出力を行っていたが、それら処理を外部端末に依頼してもよい。すなわち、音声認識処理に基づいた結果を出力しないための処理として、音声認識装置100は、音声認識処理による認識結果テキストおよびその信頼度、並びに、非言語音声認識処理による認識結果(イベント等)およびその信頼度を、外部端末に出力する。外部端末は、それら情報に基づいて認識結果テキスト等を得ることができる。 In the above embodiment, the determination of the reliabilities and the output of the corresponding result are performed within the speech recognition device 100, but these processes may be delegated to an external terminal. That is, as processing for not outputting a result based on the speech recognition processing, the speech recognition device 100 outputs, to the external terminal, the recognition result text obtained by the speech recognition processing and its reliability, as well as the recognition result (an event or the like) obtained by the non-verbal speech recognition processing and its reliability. The external terminal can obtain the recognition result text and the like based on this information.
 また、本開示の音声認識装置100において、音情報処理部は、音声認識部102および非言語音声認識部103のそれぞれの認識結果を判定するスコア判定部104と、その判定に従って音声認識部102による認識結果テキストを加工して出力する結果出力部105と、をさらに有する。すなわち、結果出力部105は、スコア判定部104による判定結果に基づいて、認識結果テキストのうち、非言語音の部分を出力しないようにし、言語の部分のみを出力する。 In the speech recognition device 100 of the present disclosure, the sound information processing unit further includes the score determination unit 104, which evaluates the respective recognition results of the speech recognition unit 102 and the non-verbal speech recognition unit 103, and the result output unit 105, which processes and outputs the recognition result text of the speech recognition unit 102 according to that determination. That is, based on the determination result of the score determination unit 104, the result output unit 105 does not output the non-verbal sound portion of the recognition result text and outputs only the language portion.
 これにより非言語音部分の認識結果テキストを出力しないことから読みやすい認識結果テキストを得ることができる。 This makes it possible to obtain easy-to-read recognition result text by not outputting the recognition result text for the non-verbal sound part.
 また、本開示の音声認識装置100において、音声認識部102は、音情報が言語であることに対する言語音信頼度、すなわち認識結果テキストに対する信頼度を導出し、非言語音声認識部103は、音情報が非言語であることに対する非言語音信頼度、すなわちイベントに対する信頼度を導出する。そして、スコア判定部104は、これら信頼度(言語音信頼度および非言語音信頼度)に基づいて、音声認識部102および非言語音声認識部103による認識結果(認識結果テキストおよび各イベント)を判定する。 In the speech recognition device 100 of the present disclosure, the speech recognition unit 102 derives a verbal sound reliability indicating that the sound information is language, that is, the reliability of the recognition result text, and the non-verbal speech recognition unit 103 derives a non-verbal sound reliability indicating that the sound information is non-verbal, that is, the reliability of each event. The score determination unit 104 then evaluates the recognition results of the speech recognition unit 102 and the non-verbal speech recognition unit 103 (the recognition result text and each event) based on these reliabilities (the verbal sound reliability and the non-verbal sound reliability).
 ここで、非言語音信頼度を示すイベントの信頼度は、音声波形信号(音情報)が非言語音であることの信頼度、および音声波形信号(音情報)が非言語音でないことの信頼度を示す。すなわち、肯定イベントおよび否定イベントのそれぞれを示す。 Here, the event reliabilities indicating the non-verbal sound reliability include a reliability that the speech waveform signal (sound information) is a non-verbal sound and a reliability that the speech waveform signal (sound information) is not a non-verbal sound, corresponding to a positive event and a negative event, respectively.
 そして、スコア判定部104は、肯定イベントおよび否定イベントのそれぞれの信頼度の少なくとも一方に対して重み付け処理を行って、判定処理を行う。 Then, the score determination unit 104 performs determination processing by weighting at least one of the reliability of each of the positive event and the negative event.
 このような重み付け処理を行うことで、ユーザの属性若しくは種別、会議の内容に応じた判定を行うことができる。 By performing such a weighting process, it is possible to make a judgment according to the user's attribute or type and the content of the meeting.
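 A minimal sketch of such a weighted comparison (the weights, scores, and decision rule are assumptions chosen for illustration, not the disclosed method):

    def judge(text_conf, event_conf_pos, event_conf_neg, w_pos=1.2, w_neg=1.0):
        # Weight the confidences that the segment is (or is not) a non-verbal
        # event, then compare against the language confidence of the text.
        event_score = w_pos * event_conf_pos - w_neg * event_conf_neg
        return "event" if event_score > text_conf else "text"

    print(judge(text_conf=0.35, event_conf_pos=0.80, event_conf_neg=0.10))  # event
    print(judge(text_conf=0.90, event_conf_pos=0.40, event_conf_neg=0.30))  # text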
 上記において非言語音は、笑い声、咳、頷き、くしゃみ、およびキーボード音の少なくとも一つである。これらに限られるものではない。 In the above, the non-verbal sound is at least one of a laughing voice, a cough, a nod, a sneeze, and a keyboard sound, but is not limited to these.
 また、本開示において、音声認識部102は、所定の音声認識単位(文単位、文節単位、単語単位など)で音声認識を行い、非言語音声認識部103は、所定の音声認識単位に応じた時間単位で非言語音声認識を行う。 Further, in the present disclosure, the speech recognition unit 102 performs speech recognition in predetermined speech recognition units (sentence units, phrase units, word units, and the like), and the non-verbal speech recognition unit 103 performs non-verbal speech recognition in time units corresponding to the predetermined speech recognition units.
 これにより、音声認識の単位に合わせた非言語音の認識を可能にする。 This makes it possible to recognize non-verbal sounds according to the unit of speech recognition.
 スコア判定部104は、音声認識単位とは異なる判定単位で、認識結果を判定してもよい。例えば、単語単位で音声認識および非言語音声認識をして、スコア判定に際しては、文単位としてもよい。 The score determination unit 104 may determine the recognition result in a determination unit different from the voice recognition unit. For example, speech recognition and non-language speech recognition may be performed on a word-by-word basis, and score determination may be made on a sentence-by-sentence basis.
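 As an illustrative sketch of judging on a coarser unit than recognition (here, word-level confidences are averaged over a sentence before the decision; the pooling rule is an assumption):

    def sentence_level_judgement(word_text_confs, word_event_confs):
        # Pool word-level confidences, then decide once per sentence.
        avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
        return "event" if avg(word_event_confs) > avg(word_text_confs) else "text"

    print(sentence_level_judgement([0.2, 0.3, 0.25], [0.7, 0.8, 0.75]))  # event
    print(sentence_level_judgement([0.9, 0.85], [0.3, 0.2]))             # text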
 上記実施形態の説明に用いたブロック図は、機能単位のブロックを示している。これらの機能ブロック(構成部)は、ハードウェアおよびソフトウェアの少なくとも一方の任意の組み合わせによって実現される。また、各機能ブロックの実現方法は特に限定されない。すなわち、各機能ブロックは、物理的または論理的に結合した1つの装置を用いて実現されてもよいし、物理的または論理的に分離した2つ以上の装置を直接的または間接的に(例えば、有線、無線などを用いて)接続し、これら複数の装置を用いて実現されてもよい。機能ブロックは、上記1つの装置または上記複数の装置にソフトウェアを組み合わせて実現されてもよい。 The block diagram used in the description of the above embodiment shows blocks for each function. These functional blocks (components) are realized by any combination of at least one of hardware and software. Also, the method of implementing each functional block is not particularly limited. That is, each functional block may be implemented using one device physically or logically coupled, or directly or indirectly using two or more physically or logically separate devices (e.g. , wired, wireless, etc.) and may be implemented using these multiple devices. A functional block may be implemented by combining software in the one device or the plurality of devices.
 機能には、判断、決定、判定、計算、算出、処理、導出、調査、探索、確認、受信、送信、出力、アクセス、解決、選択、選定、確立、比較、想定、期待、見做し、報知(broadcasting)、通知(notifying)、通信(communicating)、転送(forwarding)、構成(configuring)、再構成(reconfiguring)、割り当て(allocating、mapping)、割り振り(assigning)などがあるが、これらに限られない。たとえば、送信を機能させる機能ブロック(構成部)は、送信部(transmitting unit)や送信機(transmitter)と呼称される。いずれも、上述したとおり、実現方法は特に限定されない。 Functions include, but are not limited to, judging, deciding, determining, calculating, computing, processing, deriving, investigating, looking up (searching), ascertaining, receiving, transmitting, outputting, accessing, resolving, selecting, choosing, establishing, comparing, assuming, expecting, considering, broadcasting, notifying, communicating, forwarding, configuring, reconfiguring, allocating (mapping), assigning, and the like. For example, a functional block (component) that performs transmission is called a transmitting unit or a transmitter. In any case, as described above, the implementation method is not particularly limited.
 例えば、本開示の一実施の形態における音声認識装置100、100aおよび100bは、本開示の音声認識方法の処理を行うコンピュータとして機能してもよい。以下、音声認識装置100ついて説明するが、音声認識装置100aおよび100bについても同様である。図16は、本開示の一実施の形態に係る音声認識装置100のハードウェア構成の一例を示す図である。上述の音声認識装置100は、物理的には、プロセッサ1001、メモリ1002、ストレージ1003、通信装置1004、入力装置1005、出力装置1006、バス1007などを含むコンピュータ装置として構成されてもよい。 For example, the speech recognition devices 100, 100a, and 100b according to the embodiment of the present disclosure may function as computers that perform processing of the speech recognition method of the present disclosure. Although the speech recognition device 100 will be described below, the same applies to the speech recognition devices 100a and 100b. FIG. 16 is a diagram showing an example of the hardware configuration of the speech recognition device 100 according to an embodiment of the present disclosure. The speech recognition device 100 described above may be physically configured as a computer device including a processor 1001, a memory 1002, a storage 1003, a communication device 1004, an input device 1005, an output device 1006, a bus 1007, and the like.
 なお、以下の説明では、「装置」という文言は、回路、デバイス、ユニットなどに読み替えることができる。音声認識装置100のハードウェア構成は、図に示した各装置を1つまたは複数含むように構成されてもよいし、一部の装置を含まずに構成されてもよい。 In the following explanation, the term "apparatus" can be read as a circuit, device, unit, or the like. The hardware configuration of the speech recognition apparatus 100 may be configured to include one or more of each device shown in the figure, or may be configured without some of the devices.
 音声認識装置100における各機能は、プロセッサ1001、メモリ1002などのハードウェア上に所定のソフトウェア(プログラム)を読み込ませることによって、プロセッサ1001が演算を行い、通信装置1004による通信を制御したり、メモリ1002およびストレージ1003におけるデータの読み出しおよび書き込みの少なくとも一方を制御したりすることによって実現される。 Each function of the speech recognition device 100 is realized by loading predetermined software (a program) onto hardware such as the processor 1001 and the memory 1002, whereby the processor 1001 performs computation, controls communication by the communication device 1004, and controls at least one of reading and writing of data in the memory 1002 and the storage 1003.
 プロセッサ1001は、例えば、オペレーティングシステムを動作させてコンピュータ全体を制御する。プロセッサ1001は、周辺装置とのインターフェース、制御装置、演算装置、レジスタなどを含む中央処理装置(CPU:Central Processing Unit)によって構成されてもよい。例えば、上述の音声認識部102および非言語音声認識部103などは、プロセッサ1001によって実現されてもよい。 The processor 1001, for example, operates an operating system and controls the entire computer. The processor 1001 may be configured by a central processing unit (CPU) including an interface with peripheral devices, a control device, an arithmetic device, registers, and the like. For example, the speech recognition unit 102 and the non-verbal speech recognition unit 103 described above may be implemented by the processor 1001 .
 また、プロセッサ1001は、プログラム(プログラムコード)、ソフトウェアモジュール、データなどを、ストレージ1003および通信装置1004の少なくとも一方からメモリ1002に読み出し、これらに従って各種の処理を実行する。プログラムとしては、上述の実施の形態において説明した動作の少なくとも一部をコンピュータに実行させるプログラムが用いられる。例えば、音声認識部102は、メモリ1002に格納され、プロセッサ1001において動作する制御プログラムによって実現されてもよく、他の機能ブロックについても同様に実現されてもよい。上述の各種処理は、1つのプロセッサ1001によって実行される旨を説明してきたが、2以上のプロセッサ1001により同時または逐次に実行されてもよい。プロセッサ1001は、1以上のチップによって実装されてもよい。なお、プログラムは、電気通信回線を介してネットワークから送信されても良い。 The processor 1001 also reads programs (program codes), software modules, data, etc. from at least one of the storage 1003 and the communication device 1004 to the memory 1002, and executes various processes according to them. As the program, a program that causes a computer to execute at least part of the operations described in the above embodiments is used. For example, the speech recognition unit 102 may be implemented by a control program stored in the memory 1002 and running on the processor 1001, and other functional blocks may be implemented similarly. Although it has been described that the above-described various processes are executed by one processor 1001, they may be executed simultaneously or sequentially by two or more processors 1001. FIG. Processor 1001 may be implemented by one or more chips. Note that the program may be transmitted from a network via an electric communication line.
 メモリ1002は、コンピュータ読み取り可能な記録媒体であり、例えば、ROM(Read Only Memory)、EPROM(Erasable Programmable ROM)、EEPROM(Electrically Erasable Programmable ROM)、RAM(Random Access Memory)などの少なくとも1つによって構成されてもよい。メモリ1002は、レジスタ、キャッシュ、メインメモリ(主記憶装置)などと呼ばれてもよい。メモリ1002は、本開示の一実施の形態に係る音声認識方法を実施するために実行可能なプログラム(プログラムコード)、ソフトウェアモジュールなどを保存することができる。 The memory 1002 is a computer-readable recording medium and may be constituted by at least one of, for example, a ROM (Read Only Memory), an EPROM (Erasable Programmable ROM), an EEPROM (Electrically Erasable Programmable ROM), and a RAM (Random Access Memory). The memory 1002 may also be called a register, a cache, a main memory (main storage device), or the like. The memory 1002 can store executable programs (program code), software modules, and the like for implementing the speech recognition method according to an embodiment of the present disclosure.
 ストレージ1003は、コンピュータ読み取り可能な記録媒体であり、例えば、CD-ROM(Compact Disc ROM)などの光ディスク、ハードディスクドライブ、フレキシブルディスク、光磁気ディスク(例えば、コンパクトディスク、デジタル多用途ディスク、Blu-ray(登録商標)ディスク)、スマートカード、フラッシュメモリ(例えば、カード、スティック、キードライブ)、フロッピー(登録商標)ディスク、磁気ストリップなどの少なくとも1つによって構成されてもよい。ストレージ1003は、補助記憶装置と呼ばれてもよい。上述の記憶媒体は、例えば、メモリ1002およびストレージ1003の少なくとも一方を含むデータベース、サーバその他の適切な媒体であってもよい。 The storage 1003 is a computer-readable recording medium, for example, an optical disc such as a CD-ROM (Compact Disc ROM), a hard disk drive, a flexible disc, a magneto-optical disc (for example, a compact disc, a digital versatile disc, a Blu-ray disk), smart card, flash memory (eg, card, stick, key drive), floppy disk, magnetic strip, and/or the like. Storage 1003 may also be called an auxiliary storage device. The storage medium described above may be, for example, a database, server, or other suitable medium including at least one of memory 1002 and storage 1003 .
 通信装置1004は、有線ネットワークおよび無線ネットワークの少なくとも一方を介してコンピュータ間の通信を行うためのハードウェア(送受信デバイス)であり、例えばネットワークデバイス、ネットワークコントローラ、ネットワークカード、通信モジュールなどともいう。通信装置1004は、例えば周波数分割複信(FDD:Frequency Division Duplex)および時分割複信(TDD:Time Division Duplex)の少なくとも一方を実現するために、高周波スイッチ、デュプレクサ、フィルタ、周波数シンセサイザなどを含んで構成されてもよい。例えば、上述の音声取得部101などは、通信装置1004によって実現されてもよい。音声取得部101は、送信部と受信部とで、物理的に、または論理的に分離された実装がなされてもよい。 The communication device 1004 is hardware (a transmitting/receiving device) for performing communication between computers via at least one of a wired network and a wireless network, and is also called, for example, a network device, a network controller, a network card, or a communication module. The communication device 1004 may include a high-frequency switch, a duplexer, a filter, a frequency synthesizer, and the like in order to realize, for example, at least one of frequency division duplex (FDD) and time division duplex (TDD). For example, the voice acquisition unit 101 described above may be realized by the communication device 1004. The voice acquisition unit 101 may be implemented with a transmitting unit and a receiving unit that are physically or logically separated from each other.
 入力装置1005は、外部からの入力を受け付ける入力デバイス(例えば、キーボード、マウス、マイクロフォン、スイッチ、ボタン、センサなど)である。出力装置1006は、外部への出力を実施する出力デバイス(例えば、ディスプレイ、スピーカー、LEDランプなど)である。なお、入力装置1005および出力装置1006は、一体となった構成(例えば、タッチパネル)であってもよい。 The input device 1005 is an input device (for example, keyboard, mouse, microphone, switch, button, sensor, etc.) that receives input from the outside. The output device 1006 is an output device (eg, display, speaker, LED lamp, etc.) that outputs to the outside. Note that the input device 1005 and the output device 1006 may be integrated (for example, a touch panel).
 また、プロセッサ1001、メモリ1002などの各装置は、情報を通信するためのバス1007によって接続される。バス1007は、単一のバスを用いて構成されてもよいし、装置間ごとに異なるバスを用いて構成されてもよい。 Each device such as the processor 1001 and the memory 1002 is connected by a bus 1007 for communicating information. The bus 1007 may be configured using a single bus, or may be configured using different buses between devices.
 また、音声認識装置100は、マイクロプロセッサ、デジタル信号プロセッサ(DSP:Digital Signal Processor)、ASIC(Application Specific Integrated Circuit)、PLD(Programmable Logic Device)、FPGA(Field Programmable Gate Array)などのハードウェアを含んで構成されてもよく、当該ハードウェアにより、各機能ブロックの一部または全てが実現されてもよい。例えば、プロセッサ1001は、これらのハードウェアの少なくとも1つを用いて実装されてもよい。 The speech recognition device 100 also includes hardware such as a microprocessor, a digital signal processor (DSP), an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), and an FPGA (Field Programmable Gate Array). , and part or all of each functional block may be implemented by the hardware. For example, processor 1001 may be implemented using at least one of these pieces of hardware.
 情報の通知は、本開示において説明した態様/実施形態に限られず、他の方法を用いて行われてもよい。例えば、情報の通知は、物理レイヤシグナリング(例えば、DCI(Downlink Control Information)、UCI(Uplink Control Information))、上位レイヤシグナリング(例えば、RRC(Radio Resource Control)シグナリング、MAC(Medium Access Control)シグナリング、報知情報(MIB(Master Information Block)、SIB(System Information Block)))、その他の信号またはこれらの組み合わせによって実施されてもよい。また、RRCシグナリングは、RRCメッセージと呼ばれてもよく、例えば、RRC接続セットアップ(RRC Connection Setup)メッセージ、RRC接続再構成(RRC Connection Reconfiguration)メッセージなどであってもよい。 Notification of information is not limited to the aspects/embodiments described in the present disclosure, and may be performed using other methods. For example, notification of information includes physical layer signaling (e.g. DCI (Downlink Control Information), UCI (Uplink Control Information)), upper layer signaling (e.g. RRC (Radio Resource Control) signaling, MAC (Medium Access Control) signaling, It may be implemented by broadcast information (MIB (Master Information Block), SIB (System Information Block))), other signals, or a combination thereof. RRC signaling may also be called an RRC message, and may be, for example, an RRC connection setup message, an RRC connection reconfiguration message, or the like.
 本開示において説明した各態様/実施形態の処理手順、シーケンス、フローチャートなどは、矛盾の無い限り、順序を入れ替えてもよい。例えば、本開示において説明した方法については、例示的な順序を用いて様々なステップの要素を提示しており、提示した特定の順序に限定されない。 The order of the processing procedures, sequences, flowcharts, etc. of each aspect/embodiment described in the present disclosure may be changed as long as there is no contradiction. For example, the methods described in this disclosure present elements of the various steps using a sample order, and are not limited to the specific order presented.
 入出力された情報等は特定の場所(例えば、メモリ)に保存されてもよいし、管理テーブルを用いて管理してもよい。入出力される情報等は、上書き、更新、または追記され得る。出力された情報等は削除されてもよい。入力された情報等は他の装置へ送信されてもよい。 Input/output information may be stored in a specific location (for example, memory) or managed using a management table. Input/output information and the like may be overwritten, updated, or appended. The output information and the like may be deleted. The entered information and the like may be transmitted to another device.
 判定は、1ビットで表される値(0か1か)によって行われてもよいし、真偽値(Boolean:trueまたはfalse)によって行われてもよいし、数値の比較(例えば、所定の値との比較)によって行われてもよい。 The determination may be made by a value represented by one bit (0 or 1), by a true/false value (Boolean: true or false), or by numerical comparison (for example, a predetermined value).
 本開示において説明した各態様/実施形態は単独で用いてもよいし、組み合わせて用いてもよいし、実行に伴って切り替えて用いてもよい。また、所定の情報の通知(例えば、「Xであること」の通知)は、明示的に行うものに限られず、暗黙的(例えば、当該所定の情報の通知を行わない)ことによって行われてもよい。 Each aspect/embodiment described in the present disclosure may be used alone, may be used in combination, or may be switched in accordance with execution. In addition, notification of predetermined information (for example, notification of "being X") is not limited to explicit notification, and may be performed implicitly (for example, by not performing notification of the predetermined information).
 以上、本開示について詳細に説明したが、当業者にとっては、本開示が本開示中に説明した実施形態に限定されるものではないということは明らかである。本開示は、請求の範囲の記載により定まる本開示の趣旨および範囲を逸脱することなく修正および変更態様として実施することができる。したがって、本開示の記載は、例示説明を目的とするものであり、本開示に対して何ら制限的な意味を有するものではない。 Although the present disclosure has been described in detail above, it is clear to those skilled in the art that the present disclosure is not limited to the embodiments described in the present disclosure. The present disclosure can be practiced with modifications and variations without departing from the spirit and scope of the disclosure as defined by the claims. Accordingly, the description of the present disclosure is for illustrative purposes and is not meant to be limiting in any way.
 ソフトウェアは、ソフトウェア、ファームウェア、ミドルウェア、マイクロコード、ハードウェア記述言語と呼ばれるか、他の名称で呼ばれるかを問わず、命令、命令セット、コード、コードセグメント、プログラムコード、プログラム、サブプログラム、ソフトウェアモジュール、アプリケーション、ソフトウェアアプリケーション、ソフトウェアパッケージ、ルーチン、サブルーチン、オブジェクト、実行可能ファイル、実行スレッド、手順、機能などを意味するよう広く解釈されるべきである。 Software should be interpreted broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executable files, threads of execution, procedures, functions, and the like, regardless of whether it is called software, firmware, middleware, microcode, hardware description language, or by any other name.
 また、ソフトウェア、命令、情報などは、伝送媒体を介して送受信されてもよい。例えば、ソフトウェアが、有線技術(同軸ケーブル、光ファイバケーブル、ツイストペア、デジタル加入者回線(DSL:Digital Subscriber Line)など)および無線技術(赤外線、マイクロ波など)の少なくとも一方を使用してウェブサイト、サーバ、または他のリモートソースから送信される場合、これらの有線技術および無線技術の少なくとも一方は、伝送媒体の定義内に含まれる。 In addition, software, instructions, information, and the like may be transmitted and received via a transmission medium. For example, when software is transmitted from a website, a server, or another remote source using at least one of wired technology (coaxial cable, optical fiber cable, twisted pair, digital subscriber line (DSL), and the like) and wireless technology (infrared, microwave, and the like), at least one of these wired and wireless technologies is included within the definition of a transmission medium.
 本開示において説明した情報、信号などは、様々な異なる技術のいずれかを使用して表されてもよい。例えば、上記の説明全体に渡って言及され得るデータ、命令、コマンド、情報、信号、ビット、シンボル、チップなどは、電圧、電流、電磁波、磁界若しくは磁性粒子、光場若しくは光子、またはこれらの任意の組み合わせによって表されてもよい。 The information, signals, and the like described in the present disclosure may be represented using any of a variety of different technologies. For example, data, instructions, commands, information, signals, bits, symbols, chips, and the like that may be mentioned throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, optical fields or photons, or any combination thereof.
 なお、本開示において説明した用語および本開示の理解に必要な用語については、同一のまたは類似する意味を有する用語と置き換えてもよい。 The terms explained in the present disclosure and the terms necessary for understanding the present disclosure may be replaced with terms having the same or similar meanings.
 本開示で使用する「判断(determining)」、「決定(determining)」という用語は、多種多様な動作を包含する場合がある。「判断」、「決定」は、例えば、判定(judging)、計算(calculating)、算出(computing)、処理(processing)、導出(deriving)、調査(investigating)、探索(looking up、search、inquiry)(例えば、テーブル、データベースまたは別のデータ構造での探索)、確認(ascertaining)した事を「判断」「決定」したとみなす事などを含み得る。また、「判断」、「決定」は、受信(receiving)(例えば、情報を受信すること)、送信(transmitting)(例えば、情報を送信すること)、入力(input)、出力(output)、アクセス(accessing)(例えば、メモリ中のデータにアクセスすること)した事を「判断」「決定」したとみなす事などを含み得る。また、「判断」、「決定」は、解決(resolving)、選択(selecting)、選定(choosing)、確立(establishing)、比較(comparing)などした事を「判断」「決定」したとみなす事を含み得る。つまり、「判断」「決定」は、何らかの動作を「判断」「決定」したとみなす事を含み得る。また、「判断(決定)」は、「想定する(assuming)」、「期待する(expecting)」、「みなす(considering)」などで読み替えられてもよい。 The terms "determining" and "determining" used in this disclosure may encompass a wide variety of actions. "Judgement" and "determination" are, for example, judging, calculating, computing, processing, deriving, investigating, looking up, searching, inquiring (eg, lookup in a table, database, or other data structure); Also, "judgment" and "determination" are used for receiving (e.g., receiving information), transmitting (e.g., transmitting information), input, output, access (accessing) (for example, accessing data in memory) may include deeming that a "judgment" or "decision" has been made. In addition, "judgment" and "decision" are considered to be "judgment" and "decision" by resolving, selecting, choosing, establishing, comparing, etc. can contain. In other words, "judgment" and "decision" may include considering that some action is "judgment" and "decision". Also, "judgment (decision)" may be read as "assuming", "expecting", "considering", or the like.
 「接続された(connected)」、「結合された(coupled)」という用語、またはこれらのあらゆる変形は、2またはそれ以上の要素間の直接的または間接的なあらゆる接続または結合を意味し、互いに「接続」または「結合」された2つの要素間に1またはそれ以上の中間要素が存在することを含むことができる。要素間の結合または接続は、物理的なものであっても、論理的なものであっても、或いはこれらの組み合わせであってもよい。例えば、「接続」は「アクセス」で読み替えられてもよい。本開示で使用する場合、2つの要素は、1またはそれ以上の電線、ケーブルおよびプリント電気接続の少なくとも一つを用いて、並びにいくつかの非限定的かつ非包括的な例として、無線周波数領域、マイクロ波領域および光(可視および不可視の両方)領域の波長を有する電磁エネルギーなどを用いて、互いに「接続」または「結合」されると考えることができる。 The terms "connected," "coupled," or any variation thereof mean any direct or indirect connection or coupling between two or more elements, It can include the presence of one or more intermediate elements between two elements being "connected" or "coupled." Couplings or connections between elements may be physical, logical, or a combination thereof. For example, "connection" may be read as "access". As used in this disclosure, two elements are defined using at least one of one or more wires, cables, and printed electrical connections and, as some non-limiting and non-exhaustive examples, in the radio frequency domain. , electromagnetic energy having wavelengths in the microwave and light (both visible and invisible) regions, and the like.
 本開示において使用する「に基づいて」という記載は、別段に明記されていない限り、「のみに基づいて」を意味しない。言い換えれば、「に基づいて」という記載は、「のみに基づいて」と「に少なくとも基づいて」の両方を意味する。 The term "based on" as used in this disclosure does not mean "based only on" unless otherwise specified. In other words, the phrase "based on" means both "based only on" and "based at least on."
 本開示において使用する「第1の」、「第2の」などの呼称を使用した要素へのいかなる参照も、それらの要素の量または順序を全般的に限定しない。これらの呼称は、2つ以上の要素間を区別する便利な方法として本開示において使用され得る。したがって、第1および第2の要素への参照は、2つの要素のみが採用され得ること、または何らかの形で第1の要素が第2の要素に先行しなければならないことを意味しない。 Any reference to elements using the "first," "second," etc. designations used in this disclosure does not generally limit the quantity or order of those elements. These designations may be used in this disclosure as a convenient method of distinguishing between two or more elements. Thus, reference to a first and second element does not imply that only two elements can be employed or that the first element must precede the second element in any way.
 本開示において、「含む(include)」、「含んでいる(including)」およびそれらの変形が使用されている場合、これらの用語は、用語「備える(comprising)」と同様に、包括的であることが意図される。さらに、本開示において使用されている用語「または(or)」は、排他的論理和ではないことが意図される。 Where "include," "including," and variations thereof are used in this disclosure, these terms are inclusive, as is the term "comprising." is intended. Furthermore, the term "or" as used in this disclosure is not intended to be an exclusive OR.
 本開示において、例えば、英語でのa, anおよびtheのように、翻訳により冠詞が追加された場合、本開示は、これらの冠詞の後に続く名詞が複数形であることを含んでもよい。 In this disclosure, when articles are added by translation, such as a, an and the in English, the disclosure may include that nouns following these articles are plural.
 本開示において、「AとBが異なる」という用語は、「AとBが互いに異なる」ことを意味してもよい。なお、当該用語は、「AとBがそれぞれCと異なる」ことを意味してもよい。「離れる」、「結合される」などの用語も、「異なる」と同様に解釈されてもよい。 In the present disclosure, the term "A and B are different" may mean "A and B are different from each other." The term may also mean that "A and B are different from C". Terms such as "separate," "coupled," etc. may also be interpreted in the same manner as "different."
100…音声認識装置、101…音声取得部、102…音声認識部、103…非言語音声認識部、104…スコア判定部、105…結果出力部、106…修正部、107…集計部。 DESCRIPTION OF SYMBOLS 100... Speech recognition apparatus, 101... Speech acquisition part, 102... Speech recognition part, 103... Non-language speech recognition part, 104... Score determination part, 105... Result output part, 106... Correction part, 107... Aggregation part.

Claims (10)

  1.  音情報を取得する音情報取得部と、
     前記音情報のうち第1の音情報を音声認識処理して認識結果を取得するとともに、前記第1の音情報に関連する第2の音情報が非言語によるイベントの発生であることを判断し、前記イベントを示す情報を前記認識結果に関連付けて出力する音情報処理部と、
    を備える音声認識装置。
    A speech recognition device comprising:
    a sound information acquisition unit that acquires sound information; and
    a sound information processing unit that performs speech recognition processing on first sound information among the sound information to acquire a recognition result, determines that second sound information related to the first sound information is the occurrence of a non-verbal event, and outputs information indicating the event in association with the recognition result.
  2.  前記音情報処理部は、
     前記認識結果に前記イベントを示す付加情報を付加する、
    請求項1に記載の音声認識装置。
    The speech recognition device according to claim 1, wherein
    the sound information processing unit adds additional information indicating the event to the recognition result.
  3.  前記音情報処理部は、
     話者ごとに前記認識結果および前記イベントを示す情報を出力するとともに、
     一の話者による第1の音情報の認識結果に、異なる話者による第2の音情報のイベントを示す付加情報を付加する際、前記付加情報を、前記認識結果に対応付けて出力する、
    請求項2に記載の音声認識装置。
    The speech recognition device according to claim 2, wherein
    the sound information processing unit outputs the recognition result and the information indicating the event for each speaker, and
    when adding, to the recognition result of first sound information uttered by one speaker, additional information indicating an event of second sound information produced by a different speaker, outputs the additional information in association with that recognition result.
  4.  前記音情報処理部は、
     前記認識結果を前記イベントに従って加工する、
    請求項1~3のいずれか一項に記載の音声認識装置。
    The speech recognition device according to any one of claims 1 to 3, wherein
    the sound information processing unit processes the recognition result according to the event.
  5.  前記音情報処理部は、前記イベントの発生頻度が所定条件を満たす場合に、当該イベントを示す情報を前記認識結果に関連付けて出力する、
    請求項1~4のいずれか一項に記載の音声認識装置。
    When the occurrence frequency of the event satisfies a predetermined condition, the sound information processing unit outputs information indicating the event in association with the recognition result.
    A speech recognition device according to any one of claims 1 to 4.
  6.  前記音情報処理部は、
     前記第1の音情報を含む所定の条件を満たした音情報群に対する認識結果に、前記イベントを示す情報を関連付ける、
    請求項1~5のいずれか一項に記載の音声認識装置。
    The speech recognition device according to any one of claims 1 to 5, wherein
    the sound information processing unit associates the information indicating the event with a recognition result for a sound information group that includes the first sound information and satisfies a predetermined condition.
  7.  前記所定の条件とは、前記音情報群が、文単位、段落単位、または話題単位に区分されていることである、
     請求項6に記載の音声認識装置。
    The predetermined condition is that the sound information group is divided into sentence units, paragraph units, or topic units.
    7. The speech recognition device according to claim 6.
  8.  前記音情報処理部は、前記第1の音情報の話者と異なる話者によるイベントが発生した場合に、当該イベントを示す情報を、前記音情報群に対する認識結果の末尾に付加する、
    請求項6または7に記載の音声認識装置。
    When an event by a speaker different from the speaker of the first sound information occurs, the sound information processing unit adds information indicating the event to the end of the recognition result for the sound information group.
    8. The speech recognition device according to claim 6 or 7.
  9.  前記認識結果または前記イベントを示す情報を修正する修正部、
    をさらに備え、
     前記音情報処理部は、
     前記第1の音情報および前記第2の音情報を音声認識処理してそれぞれの認識結果を取得し、前記第1の音情報および前記第2の音情報を非言語音声認識処理してそれぞれのイベント発生を判断し、
     前記修正部は、前記第1の音情報および前記第2の音情報のそれぞれの前記認識結果または前記イベントを示す情報を用いて修正する、
    請求項1~8のいずれか一項に記載の音声認識装置。
    The speech recognition device according to any one of claims 1 to 8, further comprising
    a correction unit that corrects the recognition result or the information indicating the event, wherein
    the sound information processing unit performs speech recognition processing on the first sound information and the second sound information to acquire respective recognition results, and performs non-verbal speech recognition processing on the first sound information and the second sound information to determine the occurrence of respective events, and
    the correction unit performs the correction using the recognition result or the information indicating the event of each of the first sound information and the second sound information.
  10.  前記音情報処理部は、
     前記第2の音情報に対して音声認識処理をして認識結果を取得し、
     前記イベントを示す情報とともに、前記第2の音情報の認識結果を出力し、
     前記イベントを示す情報の提示と、前記第2の音情報の認識結果を提示との切替を可能とする出力制御を行う、
    請求項1~9のいずれか一項に記載の音声認識装置。
     
    The speech recognition device according to any one of claims 1 to 9, wherein the sound information processing unit
    performs speech recognition processing on the second sound information to acquire a recognition result,
    outputs the recognition result of the second sound information together with the information indicating the event, and
    performs output control that enables switching between presentation of the information indicating the event and presentation of the recognition result of the second sound information.
PCT/JP2022/014596 2021-06-01 2022-03-25 Sound recognition device WO2022254909A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2023525437A JPWO2022254909A1 (en) 2021-06-01 2022-03-25

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021-092495 2021-06-01
JP2021092495 2021-06-01

Publications (1)

Publication Number Publication Date
WO2022254909A1 true WO2022254909A1 (en) 2022-12-08

Family

ID=84323055

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/014596 WO2022254909A1 (en) 2021-06-01 2022-03-25 Sound recognition device

Country Status (2)

Country Link
JP (1) JPWO2022254909A1 (en)
WO (1) WO2022254909A1 (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004075168A1 (en) * 2003-02-19 2004-09-02 Matsushita Electric Industrial Co., Ltd. Speech recognition device and speech recognition method
JP2005065252A (en) * 2003-07-29 2005-03-10 Fuji Photo Film Co Ltd Cell phone
JP2012208630A (en) * 2011-03-29 2012-10-25 Mizuho Information & Research Institute Inc Speech management system, speech management method and speech management program
JP2015158582A (en) * 2014-02-24 2015-09-03 日本放送協会 Voice recognition device and program
JP2018045639A (en) * 2016-09-16 2018-03-22 株式会社東芝 Dialog log analyzer, dialog log analysis method, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
UENO, HIROSHI; INOUE, MASASHI: "A text dialogue system that gives individuality to the agreeable responses", IPSJ SIG TECHNICAL REPORTS, INFORMATION PROCESSING SOCIETY OF JAPAN, JP, vol. 2015-NL-221, no. 10, 30 April 2015 (2015-04-30), JP , pages 1 - 9, XP009541669, ISSN: 2188-8779 *

Also Published As

Publication number Publication date
JPWO2022254909A1 (en) 2022-12-08

Similar Documents

Publication Publication Date Title
CN108962282B (en) Voice detection analysis method and device, computer equipment and storage medium
US11450311B2 (en) System and methods for accent and dialect modification
CN108962255B (en) Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
US8386265B2 (en) Language translation with emotion metadata
KR20190046623A (en) Dialog system with self-learning natural language understanding
US10839788B2 (en) Systems and methods for selecting accent and dialect based on context
US11688416B2 (en) Method and system for speech emotion recognition
KR20170047268A (en) Orphaned utterance detection system and method
CN107578770B (en) Voice recognition method and device for network telephone, computer equipment and storage medium
US11093110B1 (en) Messaging feedback mechanism
WO2018093692A1 (en) Contextual dictionary for transcription
US20210232776A1 (en) Method for recording and outputting conversion between multiple parties using speech recognition technology, and device therefor
US10282599B2 (en) Video sentiment analysis tool for video messaging
US11683283B2 (en) Method for electronic messaging
CN111414772A (en) Machine translation method, device and medium
US20220414132A1 (en) Subtitle rendering based on the reading pace
US20210065708A1 (en) Information processing apparatus, information processing system, information processing method, and program
US10468031B2 (en) Diarization driven by meta-information identified in discussion content
WO2022254909A1 (en) Sound recognition device
US20190384466A1 (en) Linking comments to segments of a media presentation
WO2022254912A1 (en) Speech recognition device
US11416530B1 (en) Subtitle rendering based on the reading pace
CN112509570B (en) Voice signal processing method and device, electronic equipment and storage medium
WO2019235100A1 (en) Interactive device
JP2021082125A (en) Dialogue device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22815666

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023525437

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE