WO2022254909A1 - Sound recognition device - Google Patents

Sound recognition device

Info

Publication number
WO2022254909A1
Authority
WO
WIPO (PCT)
Prior art keywords
event
speech recognition
sound information
recognition result
unit
Prior art date
Application number
PCT/JP2022/014596
Other languages
French (fr)
Japanese (ja)
Inventor
悠輔 中島
拓 加藤
太一 片山
圭 菊入
Original Assignee
株式会社Nttドコモ
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社Nttドコモ filed Critical 株式会社Nttドコモ
Priority to JP2023525437A priority Critical patent/JPWO2022254909A1/ja
Publication of WO2022254909A1 publication Critical patent/WO2022254909A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the present invention relates to a speech recognition device that performs speech recognition.
  • Patent Document 1 describes a conference support system that recognizes text data representing speech from speech intervals while distinguishing between speech intervals and non-speech intervals contained in speech data in conferences and the like.
  • An object of the present invention is to provide a speech recognition device that can grasp the atmosphere of a conversation in a meeting or the like.
  • A speech recognition apparatus according to the present disclosure includes a sound information acquisition unit that acquires sound information, and a sound information processing unit that performs speech recognition processing on first sound information among the sound information to acquire a recognition result, determines that second sound information associated with the first sound information is the occurrence of a nonverbal event, and outputs information indicating the event in association with the recognition result.
  • FIG. 1 is a block diagram showing the functional configuration of a speech recognition device 100 according to the present disclosure.
  • FIG. 2 is an explanatory diagram showing the reliability of speech recognition results and the reliability of each recognized event.
  • FIG. 3 is a diagram showing an example of processing results for a certain utterance.
  • FIG. 4 is a diagram showing an outline of determination processing when the determination interval is changed.
  • FIG. 5 is a flowchart showing the operation of the speech recognition device 100.
  • FIG. 6 is a diagram showing processing when the utterance is in English.
  • FIG. 7 is a diagram showing an example of result output in the process of adding event tag information.
  • FIG. 8 is a diagram showing an example of determination processing for user A and user B.
  • FIG. 9 is a diagram showing an example of result output in another example.
  • FIG. 10 is a diagram showing an example of result output in yet another example.
  • FIG. 11 is a diagram showing an example in which event tag information in the result output can be selected to display the linked recognition result text.
  • FIG. 12 is a block diagram showing the functional configuration of a speech recognition device 100a according to a modification of the present disclosure.
  • FIG. 13 is a diagram showing a specific example of correction.
  • FIG. 14 is a block diagram showing the functional configuration of a speech recognition device 100b according to another modification.
  • FIG. 15 is a graph showing the frequency of event types for each topic.
  • FIG. 16 is a diagram showing an example of the hardware configuration of the speech recognition device 100.
  • FIG. 1 is a block diagram showing the functional configuration of the speech recognition device 100 according to the present disclosure.
  • The speech recognition device 100 includes a speech acquisition unit 101, a speech recognition unit 102, a non-verbal speech recognition unit 103 (non-verbal sound recognition unit), a score determination unit 104, and a result output unit 105. Each configuration is described below.
  • the speech acquisition unit 101 is a part that acquires speech in meetings, lectures, or the like.
  • the voice acquisition unit 101 is a microphone.
  • It may also be a unit that acquires audio from an external source.
  • the speech acquisition unit 101 detects speech segments from the speech waveform signal and outputs the segments to the speech recognition unit 102 and the non-language speech recognition unit 103 .
  • the speech recognition unit 102 performs speech recognition processing on the verbal or non-verbal sounds in the speech section output from the speech acquisition unit 101 using a known language model and acoustic model, and acquires the recognition result text for each recognition unit. It is also a part that derives the reading and reliability of the recognition result text.
  • the speech recognition unit 102 outputs the recognition result text, its reading and reliability in units of utterance, sentence, phrase, word, kana or phoneme, or time as recognition units.
  • This reliability is information indicating how much the recognition result of the speech recognition process can be trusted. It is generally expressed between 0 and 1, but is not limited to this; it may be expressed as an integer, on a scale of 0 to 100, or as a normalized value.
  • The reliability of the speech recognition result is obtained based on the reliabilities held by the language model and the acoustic model, but is not limited to this; other known reliability derivation methods, such as those used in end-to-end speech recognition, may be used.
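  • A minimal sketch of the outputs described above, assuming Python dataclasses; the class and field names are illustrative only and do not appear in the disclosure:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class RecognitionResult:
    """Output of the speech recognition unit 102 for one recognition unit (e.g. a word)."""
    text: str          # recognition result text
    reading: str       # reading (kana/romaji) of the text
    confidence: float  # reliability, here normalized to 0..1
    start_sec: float   # elapsed time of the utterance (used later for alignment)
    end_sec: float

@dataclass
class EventResult:
    """Output of the non-verbal speech recognition unit 103 for the same time span."""
    # reliability per event type, e.g. {"laughter": 0.7, "no_laughter": 0.3, "cough": 0.05}
    confidences: Dict[str, float] = field(default_factory=dict)
```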
  • The non-verbal speech recognition unit 103 is the part that uses a known non-verbal sound recognition model to perform recognition processing for non-verbal sounds on the verbal or non-verbal sounds in the speech section output from the speech acquisition unit 101.
  • The non-verbal speech recognition unit 103 generates a reliability for each event, according to the type of non-verbal sound, for each recognition time corresponding to the recognition unit used by the speech recognition unit 102. For example, when the speech recognition unit 102 recognizes word by word, the utterance time of the recognized word becomes the recognition unit of the non-verbal speech recognition unit 103.
  • Non-verbal sounds include sounds such as laughter, backchannel responses, nods (affirmative or negative), sighs, sneezes, coughs, keyboard sounds, and background music. For example, a positive event indicates that laughter occurred, and a negative event indicates that it did not.
  • the non-verbal speech recognizer 103 recognizes these non-verbal sounds and generates confidence levels for positive and negative events.
  • the degree of reliability for an event is information that indicates how much the recognition result in non-verbal speech recognition processing can be trusted.
  • This reliability is generally expressed between 0 and 1, but is not limited to this; it may be expressed as an integer, on a scale of 0 to 100, or as a normalized value. It is obtained based on the reliabilities held by the recognition model for non-verbal sounds, but is not limited to this; other known reliability derivation methods may be used.
  • FIG. 2 is an explanatory diagram showing the reliability of the speech recognition result text and the reliability of each event. As shown in the figure, assume the utterance content is "ahahaha", which indicates laughter.
  • Since the speech recognition unit 102 tries to recognize this speech as a verbal sound, it recognizes the text data as "Aboha" (a kanji misconversion) with the reading "ahahaha". The speech recognition unit 102 derives a reliability for this recognition result text using the acoustic model and the language model.
  • the non-verbal speech recognition unit 103 derives the reliability for each event.
  • the reliability is derived according to the presence or absence of laughter. That is, the degree of confidence that "ahahaha" ("ahahaha" is uttered) is not laughter and the degree of confidence that it is laughter are calculated. In FIG. 2, the reliability is calculated as 0.3 without laughter and 0.7 with laughter.
  • the reliability of text data as laughter may be calculated.
  • other types of reliability may be calculated, such as the presence or absence of coughing or the presence or absence of keyboard sounds.
  • the non-verbal speech recognition unit 103 has recognition models for recognizing laughter, coughing, or keyboard sounds, respectively, and can calculate the respective reliability based on these recognition models.
  • Recognition models for recognizing backchannel sounds, nodding sounds, and sneezes may also be provided.
  • a recognition model may also be provided that outputs the reliability of each event with respect to a plurality of events.
  • The score determination unit 104 is the part that determines which of the recognition result text and the event is appropriate, based on the reliability of the recognition result text for the verbal sound recognized by the speech recognition unit 102 and the reliability of the event recognized by the non-verbal speech recognition unit 103.
  • The score determination unit 104 compares the reliability of the recognition result text "Aboha" (a misrecognition of "ahahaha") from the speech recognition unit 102 with the reliabilities of the positive and negative events (with/without laughter) from the non-verbal speech recognition unit 103.
  • the speech recognition unit 102 outputs the reliability of the recognition result text: 0.3.
  • the nonverbal speech recognition unit 103 outputs reliability of negative event (no laughter): 0.3 and reliability of positive event (with laughter): 0.7.
  • the score determination unit 104 determines the recognition result text or event with the highest reliability.
  • the reliability of the positive event (with laughter): 0.7 is the highest reliability, so "ahahaha" is determined as an event indicating laughter.
  • the score determination unit 104 may determine the recognition result text or event based on the highest reliability, or may use values adjusted by weighting for each reliability. For example, the score determination unit 104 may determine which event the utterance content is based on a value obtained by multiplying the reliability of the positive event or the negative event by a predetermined coefficient. More specifically, the score determination unit 104 may determine the non-verbal sound by comparing the value obtained by multiplying the reliability of the positive event by 2 with the reliability of the negative event.
  • The score determination unit 104 may also multiply the reliability of the positive event by 0.7, subtract 0.1, and compare the result with a threshold to determine whether the positive event is appropriate for the non-verbal sound.
  • These coefficients and threshold values may be stored in a memory or the like as fixed values in advance, or an input unit may be provided so that they can be input from the outside. Further, the given coefficients and thresholds may be varied according to predetermined formulas.
  • the score determination unit 104 is not limited to comparison with the binary reliability of each of the positive event and the negative event, and may include other values, for example, comparison with three or more values. For example, laughter and cough may be treated as affirmative events, no laughter and no cough as negative events, and keyboard sounds as noise events.
  • Such weighting adjustments are determined, for example, according to the attributes or type of the users who use the speech recognition device 100 of the present disclosure, or the contents of the meeting. For example, for meeting content that tends to provoke laughter, laughter is easy to recognize, but for meeting content that does not, laughter is harder to recognize as laughter. For such meetings or users, the weights can be adjusted as described above to enable accurate recognition.
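  • A minimal sketch of the score determination described above, assuming per-unit confidences in the range 0 to 1; the function name and the weight and offset parameters are illustrative assumptions rather than the claimed method:

```python
def determine(text_conf, positive_event_conf, negative_event_conf,
              weight=1.0, offset=0.0):
    """Return 'event' if the weighted positive-event score wins, else 'text'."""
    adjusted = positive_event_conf * weight + offset
    # The event is adopted only when it beats both the negative event and the
    # speech recognition confidence for the same span.
    if adjusted > negative_event_conf and adjusted > text_conf:
        return "event"
    return "text"

# FIG. 2 example: text "Aboha" conf 0.3, laughter 0.7, no laughter 0.3.
print(determine(0.3, 0.7, 0.3))  # -> "event" (treated as laughter)
```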
  • the result output unit 105 selects a highly reliable recognition result text based on the determination result of the score determination unit 104, or adds and outputs event tag information according to the event.
  • the result output unit 105 inputs the information of the event determined by the score determination unit 104 and acquires the event tag information corresponding to the event from the storage unit or the like.
  • the storage section or the like stores these event tag information in advance.
  • Event tag information is, for example, information that graphically represents non-verbal sounds, and in the case of laughter, it is a graphical mark of laughter.
  • Event tag information may be, for example, predefined text information.
  • the output may be output to an external terminal, or may be displayed on a display.
  • FIG. 3 is a diagram showing an example of processing results for a certain utterance.
  • FIG. 3 shows, as examples of processing results, the recognition result text, the recognition result morphemes, the recognition result text reliability, the event reliabilities (laughter, cough, keyboard sound), the judgment result, the result output, and the supplemental output. It also shows the reliabilities, result output, and so on in units of words.
  • the following utterance content is acquired, and the recognition result by the speech recognition unit 102 is obtained.
  • Utterance content: I, um, (laughter: hahaha), something, (cough: gohho), (keyboard sound: kakakaka), nice.
  • The above utterance is in Japanese, spoken as "watashi wa ano- hahaha nanka gohho kakakaka ii". In Japanese, the particle "ha" following the subject is pronounced "wa".
  • Recognition result text: the whole utterance is recognized as Japanese text, segmented as watashi, wa, ano-, hahaha, nanka, gohho, kakakaka, and ii; the laughter "hahaha" is misrecognized as the word "haha" (mother) and the cough "gohho" as "Gohho" (Van Gogh).
  • recognition results: "I” (watashi) and "wa” (wa) are obtained for the utterance: "I am” (watashiwa). Confidences of 0.95 and 0.91 are derived for the recognition result text, respectively.
  • The reliability of each event derived by the non-verbal speech recognition unit 103 for "I" (watashi) is as shown in FIG. 3.
  • The score determination unit 104 selects, based on these reliabilities, the most reliable recognition result texts "I" (watashi) and "wa" (wa), and the result output unit 105 outputs "I am" (watashi wa).
  • For the utterance "ano-", the speech recognition unit 102 sets the speech recognition reliability of the recognition result text "ano-" to 0.9 and recognizes it as a filler.
  • For the utterance "hahaha", the speech recognition unit 102 produces the recognition result text "haha" (mother) with a reliability of only 0.1, while the non-verbal speech recognition unit 103 recognizes it as a laughter event.
  • the result output unit 105 outputs event tag information indicating that laughter occurred as a supplemental output. Note that the result output unit 105 does not have to output the event tag information.
  • The score determination unit 104 treats a recognition result text identified as a filler from its recognition result morpheme as a filler even if that text has a high reliability, and the result output unit 105 then does not include it in the result output. Event tag information indicating a filler may be output as a supplemental output if necessary.
  • the reliability is calculated for each word, and the determination is made based on the reliability of each word, but it is not limited to this.
  • The score determination unit 104 may make its determination in a determination unit different from the recognition unit of the recognition result text output by the speech recognition unit 102 and the non-verbal speech recognition unit 103. For example, in FIG. 3 the speech recognition unit 102 and the non-verbal speech recognition unit 103 calculate the reliability for each word, but the score determination unit 104 may integrate the reliabilities per clause or per sentence and make the determination based on the integrated reliability.
  • By changing the score determination range, the position at which the event tag information indicating the non-verbal sound is added can also be changed, which makes the text easier to read. For example, when score determination is made per sentence, the event tag information is added to the end of the sentence.
  • FIG. 4 shows an overview of the processing when the determination unit is changed to sentences. For convenience of explanation, the event reliabilities and the like are shown in simplified form compared with FIG. 3. Assume that the following utterance content is input and the recognition result text below is obtained.
  • Recognition result text: "I, umm, mother, something, Van Gogh, kakakaka, good" (an English gloss of the Japanese recognition result, into which a cough and keyboard noise are mixed). In romaji, the recognition result text is segmented as watashi, wa, ano-, hahaha, nanka, gohho, kakakaka, and ii.
  • In this example, the score determination unit 104 performs score determination on a sentence-by-sentence basis. That is, it totals the speech recognition reliabilities and the non-verbal speech recognition reliabilities over the recognition result text of one sentence. Taking FIG. 4 as an example, the score determination unit 104 calculates, for the sentence, the total reliability of the recognition result text and the total reliability of each event (the positive event and the negative event). Based on these totals it determines the adequacy of the recognition result text and the presence or absence of each event. In the example of FIG. 4, the score determination unit 104 determines that the totals for the recognition result text and for the affirmative events are equal to or greater than the predetermined value.
  • Based on this determination, the score determination unit 104 outputs to the result output unit 105 the recognition result texts whose reliability is equal to or higher than the predetermined value ("I", "wa", "ano-", "something", and "good"), and also outputs the determined event information (laughter, cough, keyboard sound).
  • the result output unit 105 acquires the event tag information from the event information and outputs it together with the recognition result text whose reliability is equal to or higher than a predetermined value.
  • event tag information can be output for each sentence, and event tag information can be added to the end of the recognition result text, making it easier to read.
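  • As a rough sketch of such sentence-unit determination (the thresholds and tag strings below are assumptions for illustration), per-word reliabilities can be summed over a sentence and the event tag information appended at the end of the sentence:

```python
def output_sentence(words, word_confs, event_confs_per_word,
                    word_threshold=0.5, event_threshold=1.0):
    # Keep only words whose speech recognition reliability clears the threshold.
    kept = [w for w, c in zip(words, word_confs) if c >= word_threshold]
    # Sum each event's confidence over the whole sentence.
    totals = {}
    for confs in event_confs_per_word:
        for event, c in confs.items():
            totals[event] = totals.get(event, 0.0) + c
    tags = [f"({event})" for event, total in totals.items() if total >= event_threshold]
    return " ".join(kept) + (" " + " ".join(tags) if tags else "")

sentence = output_sentence(
    ["I", "um", "hahaha", "nice"],
    [0.95, 0.9, 0.1, 0.8],
    [{}, {}, {"laughter": 1.4}, {}],
)
print(sentence)  # -> "I um nice (laughter)"
```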
  • FIG. 5 is a flow chart showing the operation of the speech recognition apparatus 100.
  • The speech acquisition unit 101 acquires a speech waveform signal (S101), detects speech segments from the signal, and outputs the speech (or other sound) in each segment to the speech recognition unit 102 and the non-verbal speech recognition unit 103 (S102).
  • the speech recognition unit 102 performs speech recognition processing on the speech signal and outputs the recognition result text, reading, and reliability (S103).
  • the non-verbal speech recognition unit 103 also performs non-verbal speech recognition processing on the speech signal and outputs the reliability of each event for each recognition target time (S104).
  • the score determination unit 104 determines the validity of the recognition result text or event for each recognition target based on the reliability of the recognition result text by speech recognition processing and the reliability of each event recognized by non-verbal speech recognition. (S105).
  • the result output unit 105 selects an appropriate recognition result text from the recognition result texts based on the determination result, or acquires and outputs event tag information (S106).
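  • The flow of S101 to S106 can be sketched as follows; the six callables are placeholders standing in for the units described above, and their names and signatures are assumptions made for illustration:

```python
def run_pipeline(acquire_waveform, detect_segments,
                 recognize_speech, recognize_nonverbal,
                 determine, render_output):
    waveform = acquire_waveform()                           # S101: speech acquisition unit 101
    for segment in detect_segments(waveform):               # S102: speech segment detection
        text_results = recognize_speech(segment)            # S103: unit 102 -> text, reading, reliability
        event_results = recognize_nonverbal(segment)        # S104: unit 103 -> per-event reliability
        decisions = determine(text_results, event_results)  # S105: score determination unit 104
        render_output(decisions)                            # S106: result output unit 105
```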
  • FIG. 6 is a diagram showing processing when the utterance is in English. For convenience of explanation, the description is simplified. In FIG. 6, the following utterance is made.
  • Utterance content: I go (laughter: hahaha) to (cough: coff coff) (keyboard sound: clatter) school.
  • Recognition result text I go ah the her head to Costoco caca grata.
  • the recognition result texts "I”, “go”, “to”, and "school” recognized by the speech recognition unit 102 have high speech recognition reliability.
  • the recognition result texts "the her head”, “Costoco”, and “caca gratta” recognized by the speech recognition unit 102 have low recognition result text reliability but high event reliability.
  • The event reliability for laughter is high for the recognition result text "the her head". This is because the speech recognition unit tried to recognize the laughter "hahaha" as a verbal sound.
  • Next, the process of adding event tag information (a mark such as an image, a symbol, or text) indicating what kind of event a non-verbal sound corresponds to will be described. In the following, expressions and processes peculiar to Japanese are explained using Japanese notation.
  • FIG. 7 is a diagram showing a specific example thereof.
  • FIG. 7(a) is a diagram showing actual utterance contents.
  • FIG. 7(b) is a diagram showing a result output based on the recognition result.
  • In FIG. 7, conversations between users A and B are shown. To simplify the explanation, the content of the conversation is omitted and indicated as "---".
  • user A is having a conversation while laughing, and user B is giving a nod in response to the conversation.
  • The speech recognition unit 102 and the non-verbal speech recognition unit 103 can recognize who the speaker is based on the speech channel or on sound source separation. In English, the conversation is, for example: Mr. A: "I am Japanese. (laughs) I live in Tokyo." Mr. B: "hi".
  • the speech recognition unit 102 and the non-language speech recognition unit 103 distinguish between user A's and user B's utterances (sound sources) and perform speech recognition processing and non-language speech recognition processing, respectively.
  • The speech recognition unit 102 and the non-verbal speech recognition unit 103 perform speech recognition processing and non-verbal speech recognition processing on user A's speech and record the elapsed time of the utterance.
  • Likewise, the speech recognition unit 102 and the non-verbal speech recognition unit 103 perform speech recognition processing and non-verbal speech recognition processing on user B's speech and record its elapsed time. The result output unit 105 can therefore use these elapsed times to indicate that user B is giving a backchannel response to user A's utterance.
  • FIG. 8 is a diagram showing an example of processing determination for user A and user B.
  • the speech recognition unit 102 and the non-language speech recognition unit 103 perform speech recognition processing and non-language speech recognition processing separately for user A and user B based on the sound source separation technique.
  • The speech recognition unit 102 and the non-verbal speech recognition unit 103 that process user A's voice recognize user A's utterance "desu." (a sentence-ending expression).
  • This "desu." marks the end of a sentence in Japanese. In Japanese the sentence ends with a verb (copula), but in other languages the ending is not necessarily a verb.
  • The speech recognition unit 102 and the non-verbal speech recognition unit 103 that process user B's voice recognize user B's backchannel event "un".
  • The speech recognition unit 102 and the non-verbal speech recognition unit 103 each track the passage of time. Although omitted in FIG. 8, the elapsed time is associated with each recognition result text and managed. When recognizing the end of a sentence in an English conversation, it is judged from the conversation as a whole.
  • The result output unit 105 outputs user A's "desu." and user B's event tag information (a backchannel mark) according to the determination by the score determination unit 104, together with the other recognized recognition result texts.
  • FIG. 7(b) is a diagram showing a specific example of the result output.
  • The result output unit 105 outputs the backchannel mark according to the elapsed time of user B's speech. As a result, the backchannel mark is placed at the position corresponding to user A's recognition result text "I am --- desu." (meaning "I am ---.").
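  • One possible way to place another speaker's backchannel mark using elapsed time is sketched below; the data layout and insertion rule are assumptions for illustration, not the claimed method:

```python
def insert_backchannel(a_words, b_events):
    """a_words: [(word, end_sec)], b_events: [(mark, time_sec)] -> merged string."""
    out, bi = [], 0
    for word, end_sec in a_words:
        out.append(word)
        # Insert every backchannel mark whose time falls before this word ends.
        while bi < len(b_events) and b_events[bi][1] <= end_sec:
            out.append(b_events[bi][0])
            bi += 1
    out.extend(mark for mark, _ in b_events[bi:])  # any trailing marks
    return " ".join(out)

print(insert_backchannel(
    [("I", 0.4), ("am", 0.8), ("---", 1.6), ("desu.", 2.0)],
    [("(nod)", 1.7)],
))  # -> "I am --- desu. (nod)"
```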
  • FIG. 9 is a diagram showing a result output example in another example. As shown in the figure, FIG. 9(a) shows the utterance content.
  • the speech recognition unit 102 and the non-language speech recognition unit 103 recognize the contents of user A's and user B's utterances, respectively.
  • the result output unit 105 outputs recognition result text and event tag information according to the determination result of the score determination unit 104 .
  • the result output unit 105 adds event tag information to the recognition result text for each sentence. That is, event tag information is added to each user's utterance.
  • In FIG. 9, the result output unit 105 adds user B's event tag information (one backchannel mark) to user A's recognition result text "---- desu.".
  • As event tag information of users A and B, one laughter mark for user A and one backchannel mark for user B are added.
  • The non-verbal speech recognition unit 103 recognizes the occurrence time of user B's backchannel response and can insert the mark into user A's recognition result text according to that time.
  • The result output unit 105 outputs the recognition result text of user A's utterance together with the event tag information: one laughter mark for user A and two backchannel marks for user B. The result output unit 105 operates so that this event tag information is output at the end of the recognition result text of user A's utterance.
  • That is, the result output unit 105 appends to the recognition result text the event tag information of a mark indicating that user A laughed and a mark indicating that user B gave a backchannel response.
  • The non-verbal speech recognition unit 103 recognizes that user A laughed once and that user B gave backchannel responses twice.
  • utterances and responses to the utterances in conversations of a plurality of users are treated as one paragraph unit.
  • the paragraph unit means the unit of the user who uttered the recognition result text, but it may also be determined based on the interval between utterances.
  • Here, the result output unit 105 adds the laughter mark to user A's recognition result text, and the event tag information (backchannel mark) that is the recognition result for user B is added to the end of user A's recognition result text.
  • That is, the result output unit 105 outputs the recognition result text and event tag information based on the speaker whose utterance contains recognition result text, here user A; when user B's utterance is only an event, user B's event tag information is output at the end of user A's recognition result text (or event tag information).
  • the result output unit 105 may output the event tag information together with the recognition result text of user B.
  • FIG. 10 is a diagram showing a result output example in another example.
  • The result output unit 105 can also recognize topic boundaries based on the contents of user A's and user B's utterances (or recognition result texts). For example, when the result output unit 105 detects a character string indicating an intention to change the topic, such as "by the way", it places the marks at the end of the topic so far. In FIG. 10, the result output unit 105 determines that a topic ends when user B responds with "hmm", and adds the smiley mark for user A and the backchannel mark for user B at that point. A topic may also be estimated, or a scene divided, using a known topic estimation engine or topic segmentation engine.
  • When an event occurs multiple times, a mark corresponding to the number of occurrences, or a number indicating the count, may be added.
  • FIGS. 7 and 9 show the case where laughter occurs as an example, but the speech recognition apparatus 100 may store event tag information (images, marks, text) for each event, and for each occurrence frequency, in a memory or the like (not shown), and change the event tag information according to the frequency. That is, a reference value for the occurrence frequency may be prepared and the event tag information changed when it is reached.
  • Event tag information may also be provided per frequency band. For example, when coughing occurs fewer than five times, a mark indicating coughing is added, and when coughing occurs five times or more, event tag information expressing concern, for example a mark showing a concerned face, is added. In a meeting or the like, sections in which laughter events occur frequently (at or above a predetermined frequency) can be given an approving mark or the like instead of the laughter mark.
  • Similarly, when an event such as keyboard sounds occurs at or above a certain frequency, a warning mark to that effect may be added. If the volume is above a certain level, a mark indicating a stronger warning (for example, about loud keystroke sounds) may be added, and beyond that, a mark indicating an even stronger warning message is added.
  • These reference values or predetermined counts may be fixed in advance, or may be determined from the frequency of occurrence of events in the whole conversation. For example, the occurrence frequency of a certain event over all utterances, averaged per unit of time (or per paragraph, sentence, etc.), may serve as the reference value (or predetermined count), and the event tag information may be added when the frequency exceeds that value.
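  • A hedged sketch of such frequency-dependent event tag information, with the reference value taken as the average occurrence per paragraph; the tag strings and the five-occurrence cough rule mirror the examples above, while the helper names are assumptions:

```python
def reference_value(counts_per_paragraph):
    """Average occurrences per paragraph, used as the reference value."""
    return sum(counts_per_paragraph) / max(len(counts_per_paragraph), 1)

def choose_tag(event, count, reference):
    # Swap the tag for a stronger one when the count exceeds the reference.
    if event == "cough":
        return "(cough)" if count < 5 else "(concerned face)"
    if event == "laughter":
        return "(laughter)" if count < reference else "(approval)"
    return f"({event})"

ref = reference_value([1, 0, 3, 2])    # -> 1.5
print(choose_tag("laughter", 4, ref))  # -> "(approval)"
```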
  • As event tag information, pictograms such as smileys, or videos, may be used instead of marks.
  • Alternatively, the character color of the recognition result text of the utterance immediately before the event may be changed. For example, for laughter the recognition result text may be given a bright color (for example, yellow) and the background a bright color (sky blue); for a sigh, the text may be dark (for example, gray) and the background dark (gray).
  • In this way, the contents of the event tag information are changed according to the type of event, the score determined by the score determination unit 104, and the frequency of the event; the display form of the event tag information may also be changed.
  • In addition, a warning or advice may be displayed, for example a message prompting the speaker to talk in a brighter voice.
  • The speech recognition device 100 may perform speech recognition processing and non-verbal speech recognition processing on conversations of one or more users in real time, or on recorded conversation data.
  • When speech recognition processing and non-verbal speech recognition processing are performed in real time at a conference or the like, the speech recognition apparatus 100 may output (display) the results only to a specific person (terminal), or output (display) them for everyone to see. If there is a display field associated with each speaker (terminal), as in a web conference, the results may be displayed in the field corresponding to the speaker. The results may also be shown only to user A, or displayed on the screens of other participants such as users C and D as well.
  • FIG. 11 is a diagram showing an example thereof.
  • FIG. 11(a) shows actual utterance content
  • FIG. 11(b) shows an example of output of results.
  • the result output unit 105 outputs the recognition result text recognized by the speech recognition unit 102 and the event tag information recognized by the non-language speech recognition unit 103 .
  • The result output unit 105 manages the recognition result output by associating the event tag information (laughter mark L1 or backchannel marks L2 and L3) with the recognition result text.
  • When the result output unit 105 displays the recognition result text and the event tag information on the display as the result output, the user viewing the display can select the event tag information L (L1 to L3) by an operation such as a mouse click. The result output unit 105 receives the selection and outputs to the display the recognition result text linked to that event tag information, which the display then shows (FIG. 11(c)).
  • This process may be performed when the determination result of the score determination unit 104 satisfies a predetermined condition. That is, when neither the reliability of a certain recognition result text nor the corresponding event reliability reaches a predetermined value, it may be difficult to determine accurately which is appropriate. In that case, the result output unit 105 may acquire the recognition result text, the event tag information, and their respective reliabilities from the score determination unit 104, and switch the display according to the user's operation.
  • When the output destination is an external terminal, the event tag information is likewise output in association with the recognition result text.
  • The external terminal displays the recognition result text and the event tag information, and when the event tag information L (L1 to L3) is selected by a user operation on the external terminal (for example, a mouse click), the external terminal displays the recognition result text linked to that event tag information.
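  • A minimal sketch of keeping an event tag linked to its recognition result text so that a display or external terminal can toggle between them on selection; the class and method names are assumptions for illustration:

```python
class LinkedResult:
    def __init__(self, tag: str, text: str, show_tag: bool = True):
        self.tag, self.text, self.show_tag = tag, text, show_tag

    def toggle(self) -> str:
        """Called when the user selects the displayed item (e.g. a mouse click)."""
        self.show_tag = not self.show_tag
        return self.render()

    def render(self) -> str:
        return self.tag if self.show_tag else self.text

item = LinkedResult(tag="(laughter)", text="hahaha")
print(item.render())  # -> "(laughter)"
print(item.toggle())  # -> "hahaha"
```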
  • FIG. 12 is a block diagram showing the functional configuration of the speech recognition device 100a according to the modification of the present disclosure. As shown in the figure, the speech recognition device 100a includes a correction unit 106 in addition to the functional configuration of the speech recognition device 100 in FIG.
  • the correction unit 106 is a part that corrects the recognition result text or event tag information of the result output that is output from the result output unit 105 and displayed on the display.
  • This display displays the recognition result text and event tag information, and the correction unit 106 accepts the correction portion indicated by the pointer or the like (part of the recognition result text or event tag information) according to the user's operation.
  • One or a plurality of correction candidates corresponding to the corrected portion are displayed to the user on the display. For example, the correction candidates are displayed in a pulldown.
  • the result output unit 105 associates the recognition result text and event tag information for the same utterance with each other and manages (stores) them as correction candidates.
  • the recognition result text includes text converted into kanji, text in hiragana, text in katakana, and other converted symbols or text.
  • The correction unit 106 switches the display to the correction candidate selected by the user, and the result output unit 105 outputs that candidate to the display.
  • the correction candidates are recognition result texts and events recognized by the speech recognition unit 102 and the non-language speech recognition unit 103 .
  • the candidate corrections may include text converted to kanji, plain hiragana text, plain katakana text, or other converted symbols or text.
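  • A minimal sketch of how the correction unit 106 might hold correction candidates per displayed item and apply the user's selection; the structure and identifiers are assumptions for illustration:

```python
class CorrectionUnit:
    def __init__(self):
        # Candidates per displayed item: recognition result text variants
        # (kanji / hiragana / katakana) plus recognized event tags.
        self.candidates = {}

    def register(self, item_id, candidates):
        self.candidates[item_id] = candidates

    def correct(self, item_id, selected_index):
        """Return the candidate the user selected from the pulldown."""
        return self.candidates[item_id][selected_index]

unit = CorrectionUnit()
unit.register("utt-3", ["(laughter)", "mother", "hahaha"])
print(unit.correct("utt-3", 2))  # -> "hahaha"
```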
  • FIG. 13 is a diagram showing a specific example of the correction.
  • FIG. 13(a) shows utterance content
  • FIG. 13(b) is a diagram showing an example of a correction screen by the user.
  • the correction unit 106 moves the pointer P according to user's operation.
  • the correction unit 106 displays a correction candidate B when the correction portion indicated by the pointer P is selected.
  • the result output unit 105 outputs the candidate as a result output to the display.
  • The correction candidates B include "mother" (the misrecognized text) as well as the text "hahaha" recognized as-is, and either can be selected.
  • the event tag information is explained as the correction target, but of course the recognition result text may also be the correction target.
  • FIG. 14 is a block diagram showing the functional configuration of a speech recognition device 100b of another modification. This speech recognition device 100b has a tabulation unit 107 in addition to the functional configuration of the speech recognition device 100.
  • This tabulation unit 107 is a part that tabulates the frequency of occurrence of events recognized by the non-verbal speech recognition unit 103 determined by the score determination unit 104 .
  • the tabulation unit 107 tabulates the types of events such as laughter and nod determined by the score determination unit 104 and their occurrence frequencies.
  • The tabulation unit 107 may tabulate by time period, paragraph, topic, or speaker. It may also delimit the tallies by a long pause, a change of paragraph, or the like. As a result, a user analyzing the speech recognition result can judge which utterances are important, and how important, from the amount of laughter or nodding for each classification such as time period or topic.
  • The tabulation unit 107 treats a predetermined character string obtained by the speech recognition of the speech recognition unit 102 (for example, "by the way") as a topic delimiter.
  • Speakers can also be similarly distinguished by the speech recognition unit 102 and the non-verbal speech recognition unit 103 based on speakers distinguished by source separation or speech channels.
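  • A hedged sketch of the tabulation described above, counting determined events grouped by topic (delimited by a cue phrase such as "by the way") and by speaker; the record format is an assumption for illustration:

```python
from collections import defaultdict

def tabulate(records, topic_cues=("by the way",)):
    """records: [(speaker, text_or_None, event_or_None)] in utterance order."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[(topic, speaker)][event]
    topic = 0
    for speaker, text, event in records:
        if text and any(cue in text.lower() for cue in topic_cues):
            topic += 1                    # start a new topic at the cue phrase
        if event:
            counts[(topic, speaker)][event] += 1
    return counts

counts = tabulate([
    ("A", "I am --- desu.", "laughter"),
    ("B", None, "nod"),
    ("A", "by the way, ---", None),
    ("B", None, "nod"),
])
print(dict(counts[(0, "B")]))  # -> {'nod': 1}
```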
  • FIG. 15 is a graph showing the frequency of event types for each topic.
  • the speech recognition device 100b can provide information on the frequency of occurrence of each event for each topic in order to provide this graph.
  • a table may be used instead of the graph.
  • the occurrence frequency of each event for each speaker may be provided. Accordingly, it is possible to obtain a positive or negative analysis result of being a topic or speaker for each topic or speaker.
  • time zone A time period in which many laughter sounds or nods occur is regarded as a positive time period, and a time period in which there are few laughter sounds or nods or many sighs is regarded as a negative time period, and this can be used as an analysis result.
  • As described above, the speech recognition apparatus 100 of the present disclosure includes the speech acquisition unit 101 functioning as a sound information acquisition unit that acquires sound information, the speech recognition unit 102 that acquires a recognition result text for user A's voice (first sound information) among the sound information, the non-verbal speech recognition unit 103 that determines that user B's voice (second sound information), which is associated with user A's voice, is the occurrence of a non-verbal event (for example, a nod), and the result output unit 105 that outputs event tag information indicating the event in association with the recognition result text.
  • The event tag information is additional information such as a mark indicating laughter, which is one of the non-verbal sound recognition results. Such additional information may be represented by colors as well as by symbols.
  • the result output unit 105 in the speech recognition apparatus 100 outputs the event tag information in association with the recognition result text when the occurrence frequency of the event satisfies a predetermined condition.
  • For example, when the frequency of occurrence of laughter is at or above a predetermined number, a mark indicating laughter is added to the recognition result text. This makes it possible to intuitively grasp the atmosphere of the conversation: the atmosphere differs between one or two laughs and many more.
  • The result output unit 105 adds user B's event tag information (information indicating an event) to the recognition result text for a sound information group (user A's utterance and user B's utterance) that satisfies a predetermined condition and includes user A's utterance.
  • the predetermined condition is that the sound information group is divided into sentence units, paragraph units, or topic units.
  • The result output unit 105 adds the event tag information to the end of user A's recognition result text when an event occurs for another speaker such as user B. That is, for utterances by different speakers, the event tag information is added to the end of the recognition result text. This makes it possible to intuitively grasp the atmosphere of the conversation.
  • the speech recognition device 100a further includes a correction unit 106 that corrects the recognition result text or the information indicating the event.
  • the voice recognition unit 102 performs voice recognition processing on the voices of user A and user B (the first sound information and the second sound information) to obtain respective recognition result texts.
  • the non-verbal speech recognition unit 103 performs non-linguistic speech recognition processing on the speech of user A and user B (the first sound information and the second sound information) to determine the occurrence of each event.
  • the correction unit 106 corrects the sound information using the recognition result text or event tag information.
  • the recognition result text and event tag information can be corrected, and correct recognition results can be obtained.
  • The speech recognition unit 102 and the non-verbal speech recognition unit 103 perform speech recognition processing and determine whether an event has occurred. The result output unit 105 then acquires the recognition result text and the event tag information according to the determination result of the score determination unit 104, and outputs, for example, the recognition result text of the laughter together with the event tag information of user B's laughter to the display.
  • the operating user who sees this display can switch the display between the event tag information of user B and the recognition result text. That is, when the operating user performs a switching operation (for example, selection by a pointer), the result output unit 105 performs output control for switching display between the event tag information and the recognition result.
  • When the output destination is not the display but an external terminal, the result output unit 105 outputs user B's event tag information and the recognition result text to the external terminal. This output corresponds to output control that enables switching between the event tag information and the recognition result text. On the external terminal, the event tag information is displayed, and the display can be switched between the event tag information and the recognition result text according to the user's operation of the external terminal.
  • In the above embodiments, the determination of reliability and the output of results are performed within the speech recognition apparatus 100, but these processes may be delegated to an external terminal. That is, instead of producing the result output itself, the speech recognition apparatus 100 may output to the external terminal the recognition result text from the speech recognition process and its reliability, together with the recognition result (event, etc.) from the non-verbal speech recognition process and its reliability.
  • the external terminal can obtain the recognition result text and the like based on the information.
  • The sound information processing unit includes the score determination unit 104, which evaluates the recognition results of the speech recognition unit 102 and the non-verbal speech recognition unit 103, and the result output unit 105, which processes and outputs the recognition result text of the speech recognition unit 102 according to that determination. That is, based on the determination result of the score determination unit 104, the result output unit 105 does not output the non-verbal sound portion of the recognition result text and outputs only the language portion.
  • The speech recognition unit 102 derives the verbal sound reliability, that is, the reliability that the sound information is language (the reliability of the recognition result text), and the non-verbal speech recognition unit 103 derives the non-verbal sound reliability, that is, the reliability that the sound information is a non-verbal sound.
  • the score determination unit 104 determines the recognition results (recognition result text and each event) by the speech recognition unit 102 and the non-language speech recognition unit 103 based on these reliability levels (the verbal sound reliability level and the non-verbal sound reliability level). judge.
  • The event reliabilities indicating the non-verbal sound reliability are the reliability that the speech waveform signal (sound information) is a non-verbal sound and the reliability that it is not a non-verbal sound; that is, they correspond to the positive event and the negative event, respectively.
  • the score determination unit 104 performs determination processing by weighting at least one of the reliability of each of the positive event and the negative event.
  • the nonverbal language is at least one of laughter, coughing, nodding, sneezing, and keyboard sounds. It is not limited to these.
  • The speech recognition unit 102 performs speech recognition in a predetermined speech recognition unit (sentence, clause, word, etc.), and the non-verbal speech recognition unit 103 performs non-verbal speech recognition in time units corresponding to that speech recognition unit.
  • the score determination unit 104 may determine the recognition result in a determination unit different from the voice recognition unit. For example, speech recognition and non-language speech recognition may be performed on a word-by-word basis, and score determination may be made on a sentence-by-sentence basis.
  • each functional block may be implemented using one device physically or logically coupled, or directly or indirectly using two or more physically or logically separate devices (e.g. , wired, wireless, etc.) and may be implemented using these multiple devices.
  • a functional block may be implemented by combining software in the one device or the plurality of devices.
  • Functions include judging, determining, calculating, computing, processing, deriving, investigating, searching, confirming, receiving, transmitting, outputting, accessing, resolving, selecting, choosing, establishing, comparing, assuming, expecting, regarding, broadcasting, notifying, communicating, forwarding, configuring, reconfiguring, allocating, mapping, assigning, and the like, but are not limited to these.
  • For example, a functional block (component) that performs the transmitting function is called a transmitting unit or a transmitter.
  • the implementation method is not particularly limited.
  • the speech recognition devices 100, 100a, and 100b may function as computers that perform processing of the speech recognition method of the present disclosure.
  • the speech recognition device 100 will be described below, the same applies to the speech recognition devices 100a and 100b.
  • FIG. 16 is a diagram showing an example of the hardware configuration of the speech recognition device 100 according to an embodiment of the present disclosure.
  • the speech recognition device 100 described above may be physically configured as a computer device including a processor 1001, a memory 1002, a storage 1003, a communication device 1004, an input device 1005, an output device 1006, a bus 1007, and the like.
  • the term "apparatus” can be read as a circuit, device, unit, or the like.
  • the hardware configuration of the speech recognition apparatus 100 may be configured to include one or more of each device shown in the figure, or may be configured without some of the devices.
  • Each function of the speech recognition apparatus 100 is realized by causing the processor 1001 to perform calculations, controlling communication by the communication device 1004, and controlling at least one of reading and writing of data in the memory 1002 and the storage 1003.
  • the processor 1001 for example, operates an operating system and controls the entire computer.
  • the processor 1001 may be configured by a central processing unit (CPU) including an interface with peripheral devices, a control device, an arithmetic device, registers, and the like.
  • the speech recognition unit 102 and the non-verbal speech recognition unit 103 described above may be implemented by the processor 1001 .
  • the processor 1001 also reads programs (program codes), software modules, data, etc. from at least one of the storage 1003 and the communication device 1004 to the memory 1002, and executes various processes according to them.
  • As the program, a program that causes a computer to execute at least part of the operations described in the above embodiments is used.
  • the speech recognition unit 102 may be implemented by a control program stored in the memory 1002 and running on the processor 1001, and other functional blocks may be implemented similarly.
  • The processor 1001 may be implemented by one or more chips. The program may be transmitted from a network via a telecommunication line.
  • The memory 1002 is a computer-readable recording medium, and may be composed of at least one of, for example, ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), and RAM (Random Access Memory).
  • the memory 1002 may also be called a register, cache, main memory (main storage device), or the like.
  • the memory 1002 can store executable programs (program code), software modules, etc. for implementing a speech recognition method according to an embodiment of the present disclosure.
  • the storage 1003 is a computer-readable recording medium, for example, an optical disc such as a CD-ROM (Compact Disc ROM), a hard disk drive, a flexible disc, a magneto-optical disc (for example, a compact disc, a digital versatile disc, a Blu-ray disk), smart card, flash memory (eg, card, stick, key drive), floppy disk, magnetic strip, and/or the like.
  • Storage 1003 may also be called an auxiliary storage device.
  • the storage medium described above may be, for example, a database, server, or other suitable medium including at least one of memory 1002 and storage 1003 .
  • the communication device 1004 is hardware (transmitting/receiving device) for communicating between computers via at least one of a wired network and a wireless network, and is also called a network device, a network controller, a network card, a communication module, or the like.
  • The communication device 1004 may include, for example, a high-frequency switch, a duplexer, a filter, and a frequency synthesizer in order to realize at least one of frequency division duplex (FDD) and time division duplex (TDD).
  • the voice acquisition unit 101 and the like described above may be implemented by the communication device 1004 .
  • the voice acquisition unit 101 may be physically or logically separated into a transmitting unit and a receiving unit.
  • the input device 1005 is an input device (for example, keyboard, mouse, microphone, switch, button, sensor, etc.) that receives input from the outside.
  • the output device 1006 is an output device (eg, display, speaker, LED lamp, etc.) that outputs to the outside. Note that the input device 1005 and the output device 1006 may be integrated (for example, a touch panel).
  • Each device such as the processor 1001 and the memory 1002 is connected by a bus 1007 for communicating information.
  • the bus 1007 may be configured using a single bus, or may be configured using different buses between devices.
  • The speech recognition device 100 may also include hardware such as a microprocessor, a digital signal processor (DSP), an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), and an FPGA (Field Programmable Gate Array), and part or all of each functional block may be implemented by such hardware.
  • processor 1001 may be implemented using at least one of these pieces of hardware.
  • notification of information is not limited to the aspects/embodiments described in the present disclosure, and may be performed using other methods.
  • For example, notification of information may be implemented by physical layer signaling (e.g., DCI (Downlink Control Information), UCI (Uplink Control Information)), higher layer signaling (e.g., RRC (Radio Resource Control) signaling, MAC (Medium Access Control) signaling, broadcast information (MIB (Master Information Block), SIB (System Information Block))), other signals, or a combination thereof.
  • RRC signaling may also be called an RRC message, and may be, for example, an RRC connection setup message, an RRC connection reconfiguration message, or the like.
  • Input/output information may be stored in a specific location (for example, memory) or managed using a management table. Input/output information and the like may be overwritten, updated, or appended. The output information and the like may be deleted. The entered information and the like may be transmitted to another device.
  • The determination may be made by a value represented by one bit (0 or 1), by a true/false value (Boolean: true or false), or by numerical comparison (for example, comparison with a predetermined value).
  • Notification of predetermined information is not limited to being performed explicitly; it may be performed implicitly (for example, by not notifying the predetermined information).
  • Software, whether referred to as software, firmware, middleware, microcode, hardware description language, or by any other name, should be interpreted broadly to include instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, and the like.
  • software, instructions, information, etc. may be transmitted and received via a transmission medium.
  • For example, when software is transmitted from a website, server, or other remote source using wired technology (coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), etc.) and/or wireless technology (infrared, microwave, etc.), these wired and/or wireless technologies are included within the definition of a transmission medium.
  • data, instructions, commands, information, signals, bits, symbols, chips, etc. may refer to voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, light fields or photons, or any of these. may be represented by a combination of
  • The terms "judging" and "determining" used in this disclosure may encompass a wide variety of actions.
  • "Judging" and "determining" may include, for example, deeming that judging, calculating, computing, processing, deriving, investigating, looking up, searching, or inquiring (e.g., looking up in a table, a database, or another data structure) constitutes "judging" or "determining".
  • "Judging" and "determining" may include deeming that receiving (e.g., receiving information), transmitting (e.g., transmitting information), input, output, or accessing (e.g., accessing data in memory) constitutes "judging" or "determining".
  • "Judging" and "determining" may include deeming that resolving, selecting, choosing, establishing, comparing, and the like constitutes "judging" or "determining".
  • In other words, "judging" and "determining" may include deeming that some action constitutes "judging" or "determining".
  • "Judging" ("determining") may be read as "assuming", "expecting", "considering", and the like.
  • The terms "connected" and "coupled", and any variations thereof, mean any direct or indirect connection or coupling between two or more elements, and can include the presence of one or more intermediate elements between two elements that are "connected" or "coupled" to each other. The coupling or connection between elements may be physical, logical, or a combination thereof. For example, "connection" may be read as "access".
  • As used in this disclosure, two elements may be considered to be "connected" or "coupled" to each other by using at least one of one or more wires, cables, and printed electrical connections and, as some non-limiting and non-exhaustive examples, by using electromagnetic energy having wavelengths in the radio frequency, microwave, and light (both visible and invisible) regions.
  • any reference to elements using the "first,” “second,” etc. designations used in this disclosure does not generally limit the quantity or order of those elements. These designations may be used in this disclosure as a convenient method of distinguishing between two or more elements. Thus, reference to a first and second element does not imply that only two elements can be employed or that the first element must precede the second element in any way.
  • The statement "A and B are different" may mean "A and B are different from each other".
  • The term may also mean that "A and B are each different from C".
  • Terms such as “separate,” “coupled,” etc. may also be interpreted in the same manner as “different.”
  • DESCRIPTION OF SYMBOLS: 100... speech recognition device; 101... speech acquisition unit; 102... speech recognition unit; 103... non-verbal speech recognition unit; 104... score determination unit; 105... result output unit; 106... correction unit; 107... aggregation unit.

Abstract

The purpose of the present invention is to provide a sound recognition device capable of ascertaining the mood of a conversation at, for example, a meeting. A sound recognition device 100 comprises: a sound acquisition unit 101 that functions as an audio information acquisition unit which acquires audio information; a sound recognition unit 102 that acquires a recognition result text of a sound of a user A (first audio information) out of the audio information; a nonverbal sound recognition unit 103 that determines that a sound of a user B (second audio information) with respect to (associated with) the sound of the user A (first audio information) is the occurrence of a nonverbal event (for example, a nod); and a result output unit 105 that outputs event tag information indicative of an event in association with the recognition result text.

Description

Speech recognition device
The present invention relates to a speech recognition device that performs speech recognition.
Patent Document 1 describes a conference support system that recognizes text data representing speech from speech segments while distinguishing between speech segments and non-speech segments contained in audio data of a conference or the like.
JP 2018-45208 A
However, such a system cannot recognize laughter and the like in a meeting, and therefore cannot convey the atmosphere of the meeting. It is thus difficult to analyze the quality of the meeting, such as its atmosphere.
An object of the present invention is to provide a speech recognition device that makes it possible to grasp the atmosphere of a conversation in a meeting or the like.
A speech recognition device of the present invention includes: a sound information acquisition unit that acquires sound information; and a sound information processing unit that performs speech recognition processing on first sound information among the sound information to acquire a recognition result, determines that second sound information related to the first sound information is the occurrence of a nonverbal event, and outputs information indicating the event in association with the recognition result.
According to the present invention, it is possible to obtain speech recognition results that make it easy to grasp the atmosphere of a conversation in a meeting or the like.
FIG. 1 is a block diagram showing the functional configuration of a speech recognition device 100 according to the present disclosure.
FIG. 2 is an explanatory diagram showing the reliability of a speech recognition result and the reliability of each recognized event.
FIG. 3 is a diagram showing an example of processing results for a certain utterance.
FIG. 4 is a diagram showing an outline of determination processing when the determination interval is changed.
FIG. 5 is a flowchart showing the operation of the speech recognition device 100.
FIG. 6 is a diagram showing processing when the utterance is in English.
FIG. 7 is a diagram showing an example of output results in the process of adding event tag information.
FIG. 8 is a diagram showing an example of processing determination for user A and user B.
FIG. 9 is a diagram showing an example of output results in another example.
FIG. 10 is a diagram showing an example of output results in yet another example.
FIG. 11 is a diagram illustrating an example in which event tag information and recognition result text are associated with each other.
FIG. 12 is a block diagram showing the functional configuration of a speech recognition device 100a according to a modification of the present disclosure.
FIG. 13 is a diagram showing a specific example of correction.
FIG. 14 is a block diagram showing the functional configuration of a speech recognition device 100b according to another modification.
FIG. 15 is a graph showing the frequency of event types for each topic.
FIG. 16 is a diagram showing an example of the hardware configuration of the speech recognition device 100 according to an embodiment of the present disclosure.
An embodiment of the present disclosure will be described with reference to the accompanying drawings. Where possible, the same parts are denoted by the same reference numerals, and overlapping descriptions are omitted.
FIG. 1 is a block diagram showing the functional configuration of the speech recognition device 100 according to the present disclosure. As shown in the figure, the speech recognition device 100 includes a speech acquisition unit 101, a speech recognition unit 102, a non-verbal speech recognition unit 103 (non-verbal sound recognition unit), a score determination unit 104, and a result output unit 105. Each of these components will be described below.
The speech acquisition unit 101 is a part that acquires speech in a meeting, lecture, or the like. For example, the speech acquisition unit 101 is a microphone. It is not limited to this, and may be a part that acquires an audio signal transmitted by wire or wirelessly. The speech acquisition unit 101 detects speech segments from the speech waveform signal and outputs each segment to the speech recognition unit 102 and the non-verbal speech recognition unit 103.
The speech recognition unit 102 is a part that performs speech recognition processing, using a known language model and acoustic model, on the verbal or non-verbal sounds in the speech segments output from the speech acquisition unit 101, acquires a recognition result text for each recognition unit, and derives the reading and the reliability of the recognition result text. The speech recognition unit 102 outputs the recognition result text, its reading, and its reliability per utterance, sentence, phrase, word, kana or phoneme, or time unit.
This reliability is information indicating how much the recognition result of the speech recognition processing can be trusted. It is generally expressed as a value between 0 and 1, but is not limited to this; it may be expressed as an integer or on a scale of 0 to 100, and it may also be a normalized value. In the present disclosure, the reliability of a speech recognition result is obtained based on the reliabilities stored in the language model and the acoustic model, but the method is not limited to this; other known reliability derivation methods, such as those used in end-to-end speech recognition, may be used.
The non-verbal speech recognition unit 103 is a part that performs recognition processing for non-verbal sounds, using a known recognition model for non-verbal sounds, on the verbal or non-verbal sounds in the speech segments output from the speech acquisition unit 101. The non-verbal speech recognition unit 103 generates a reliability for each event corresponding to a type of non-verbal sound, for each recognition time corresponding to the recognition unit used by the speech recognition unit 102. For example, when the speech recognition unit 102 performs recognition word by word, the utterance time of each recognized word becomes the recognition unit of the non-verbal speech recognition unit 103.
An event indicates a type of non-verbal sound, and events include a positive event and a negative event. In the present disclosure, non-verbal sounds include voice-based non-verbal sounds such as laughter, backchannel sounds, nodding sounds (affirmative or negative can be distinguished), sighs, sneezes, and coughs, as well as other sounds such as keyboard sounds and music such as BGM. For example, a positive event indicates that the sound is laughter, and a negative event indicates that it is not laughter. The non-verbal speech recognition unit 103 recognizes these non-verbal sounds and generates reliabilities for the positive event and the negative event.
The reliability of an event is information indicating how much the recognition result of the non-verbal speech recognition processing can be trusted. This reliability is generally expressed as a value between 0 and 1, but is not limited to this; it may be expressed as an integer or on a scale of 0 to 100, and it may also be a normalized value. This reliability is obtained based on the reliabilities stored in the recognition model for recognizing non-verbal sounds, but is not limited to this; other known reliability derivation methods may be used.
FIG. 2 is an explanatory diagram showing the reliability of a speech recognition result text and the reliability of each event. As shown in the figure, assume that there is an utterance 「あははは」 (uttered as "ahahaha"). This indicates laughter.
Because the speech recognition unit 102 tries to recognize the speech as verbal sound, it recognizes it as the text data 「あ母派」 (converted into Japanese kanji) with the reading "ahahaha". The speech recognition unit 102 derives the reliability of the recognized recognition result text using the acoustic model and the language model.
Meanwhile, the non-verbal speech recognition unit 103 derives the reliability of each event. In FIG. 2, it derives reliabilities according to the presence or absence of laughter. That is, it calculates the reliability that 「あははは」 ("ahahaha") is not laughter and the reliability that it is laughter. In FIG. 2, the calculated reliabilities are 0.3 for "no laughter" and 0.7 for "laughter".
In addition, the reliability of the text data being laughter may be calculated. Reliabilities of other types of non-verbal sounds, such as the presence or absence of a cough or of keyboard sounds, may also be calculated.
The non-verbal speech recognition unit 103 has recognition models for recognizing laughter, coughs, and keyboard sounds, and can calculate the respective reliabilities based on these recognition models. Naturally, it may also have recognition models for recognizing backchannel sounds, nodding sounds, and sneezes. It may also have a recognition model that outputs the reliability of each of a plurality of events.
The score determination unit 104 is a part that determines which of the recognition result text and the event is more appropriate, based on the reliability of the recognition result text for the verbal sound recognized by the speech recognition unit 102 and the reliability of the event recognized by the non-verbal speech recognition unit 103.
The detailed processing will be explained using FIG. 2. In FIG. 2, it is assumed that 「あははは」 ("ahahaha") is input as speech and processed by the speech recognition unit 102 and the non-verbal speech recognition unit 103. The score determination unit 104 compares the reliability of the recognition result text 「あ母派」 of the speech recognition unit 102 (the Japanese text into which "ahahaha" was recognized; here, an example of misrecognition) with the reliabilities of the positive event and the negative event (with/without laughter) of the non-verbal speech recognition unit 103. In FIG. 2, the speech recognition unit 102 outputs a reliability of 0.3 for the recognition result text, and the non-verbal speech recognition unit 103 outputs a reliability of 0.3 for the negative event (no laughter) and 0.7 for the positive event (with laughter).
The score determination unit 104 selects the recognition result text or event with the highest reliability. In FIG. 2, since the reliability of the positive event (with laughter), 0.7, is the highest, 「あははは」 ("ahahaha") is determined to be an event indicating laughter.
Note that the score determination unit 104 may determine the recognition result text or event based on the highest reliability as described above, or may use values adjusted by weighting each reliability. For example, the score determination unit 104 may determine which event the utterance corresponds to based on a value obtained by multiplying the reliability of the positive event or the negative event by a predetermined coefficient. More specifically, the score determination unit 104 may determine the non-verbal sound by multiplying the reliability of the positive event by 2 and comparing the result with the reliability of the negative event.
The score determination unit 104 may also determine whether the positive event is appropriate for the non-verbal sound by multiplying the reliability of the positive event by 0.7, subtracting 0.1, and comparing the result with a threshold. These coefficients and thresholds may be stored in advance in a memory or the like as fixed values, or an input unit may be provided so that they can be input from the outside. Furthermore, the given coefficients and thresholds may be varied according to a predetermined formula. In addition, the score determination unit 104 is not limited to comparing the two reliabilities of the positive event and the negative event, and may compare three or more values. For example, the presence of laughter and the presence of a cough may be treated as positive events, the absence of laughter and the absence of a cough as negative events, and the presence of keyboard sounds as a noise event, and the reliabilities of these three kinds of events may be compared.
Such weighting adjustments are determined according to, for example, the attributes or types of the users who use the speech recognition device 100 of the present disclosure, or the content of the meeting. For example, in a meeting whose content tends to provoke laughter, laughter is easy to recognize, but there are also meetings where this is not the case, and laughter is then hard to recognize as laughter. For such meetings or users, the adjustment described above enables accurate recognition.
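As a concrete illustration of the comparison described above, the following is a minimal sketch in Python that is not taken from the patent itself: the function name, the weighting dictionary, and the floor value are assumptions introduced only to show how a highest-(adjusted-)reliability judgment with optional coefficients and a threshold-like offset could be organized.

```python
# Hypothetical sketch of the kind of score judgment the score determination
# unit 104 performs; weights, floor, and label names are illustrative only.

def judge_segment(text_confidence, event_confidences, weights=None, floor=0.0):
    """Pick the most plausible interpretation of one recognition unit.

    text_confidence: reliability of the recognition result text (e.g. 0.3)
    event_confidences: dict of event label -> reliability,
        e.g. {"laughter": 0.7, "no_laughter": 0.3}
    weights: optional per-label multiplicative coefficients
    floor: optional value subtracted after weighting
    """
    weights = weights or {}
    candidates = {"text": text_confidence}
    for label, conf in event_confidences.items():
        candidates[label] = conf * weights.get(label, 1.0) - floor
    # The judgment is simply the candidate with the highest adjusted score.
    best = max(candidates, key=candidates.get)
    return best, candidates[best]


if __name__ == "__main__":
    # Example corresponding to the "ahahaha" utterance: text 0.3, laughter 0.7.
    label, score = judge_segment(0.3, {"laughter": 0.7, "no_laughter": 0.3})
    print(label, score)  # -> laughter 0.7
```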
The result output unit 105 selects a highly reliable recognition result text based on the determination result of the score determination unit 104, or adds event tag information corresponding to the event, and outputs the result. When outputting, the result output unit 105 receives the information on the event determined by the score determination unit 104 and acquires the event tag information corresponding to the event from a storage unit or the like, which stores the event tag information in advance. Event tag information is, for example, information that graphically represents a non-verbal sound; in the case of laughter, it is a mark representing laughter as a figure. Event tag information may also be, for example, predefined text information. The output may be sent to an external terminal or displayed on a display screen.
FIG. 3 is a diagram showing an example of processing results for a certain utterance. FIG. 3 shows, as the example processing results, the recognition result text, the recognition result morphemes, the reliability of the recognition result text, the event reliabilities for laughter, cough, and keyboard sound, the determination result, the result output, and the supplemental output. It also shows the reliabilities, result output, and so on in units of words.
The following utterance content is acquired, and the recognition results by the speech recognition unit 102 are obtained.
Utterance content: 「私は、あのー、ははは、なんか、(cough: gohho)、(keyboard sound: kakakaka)、いい。」
The utterance is in Japanese and is spoken as "watashiwa ano- hahaha nanka (gohho) kakakaka ii". In Japanese, the particle 「は」 that follows the subject is pronounced "wa".
Recognition result text: 「私 は あのー 母派 なんか ゴッホ かかかか いい」
This is the recognition result text converted into Japanese, in which the cough and the keyboard sound are mixed in. Described in romaji, it is segmented into watashi, wa, ano-, hahaha, nanka, gohho, kakakaka, and ii and converted into Japanese.
Here, for the utterance 「私は」 (watashiwa), the recognition results 「私」 (watashi) and 「は」 (wa) are obtained, with recognition result text reliabilities of 0.95 and 0.91, respectively. Meanwhile, the reliabilities of each event derived by the non-verbal speech recognition unit 103 for 「私」 (watashi) are as follows.
Reliability of the laughter positive event: 0.23
Reliability of the laughter negative event: 0.90
Reliability of the cough positive event: 0.21
Reliability of the cough negative event: 0.70
Reliability of the keyboard-sound positive event: 0.15
Reliability of the keyboard-sound negative event: 0.75
The reliabilities of each event derived by the non-verbal speech recognition unit 103 for 「は」 (wa) are as follows.
Reliability of the laughter positive event: 0.35
Reliability of the laughter negative event: 0.85
Reliability of the cough positive event: 0.05
Reliability of the cough negative event: 0.81
Reliability of the keyboard-sound positive event: 0.12
Reliability of the keyboard-sound negative event: 0.85
The score determination unit 104 selects 「私」 (watashi) and 「は」 (wa), the recognition result texts with the highest reliability, based on these reliabilities, and the result output unit 105 outputs 「私は」 (watashiwa).
On the other hand, for the utterance 「あのー」 (ano-) and the recognition result text 「あのー」 (ano-), the speech recognition unit 102 gives a speech recognition reliability of 0.9 and recognizes it as a filler.
For the utterance 「ははは」 (hahaha) and the recognition result text 「母派」 (the text into which hahaha was recognized; reliability: 0.1), the non-verbal speech recognition unit 103 calculates a reliability of 0.7 for the positive event (with laughter). Since the reliability of the positive event (with laughter) is higher than the reliability of the recognition result text and the reliabilities of the other positive events (cough, keyboard sound), the score determination unit 104 determines that the recognition result 「母派」 is laughter. The result output unit 105 outputs, as a supplemental output, event tag information indicating that laughter occurred. Note that the result output unit 105 does not have to output this event tag information.
In the present disclosure, the score determination unit 104 treats a recognition result text that it has determined to be a filler based on the recognition result morphemes as a filler, even if the recognition result text has a high reliability. The result output unit 105 then does not output that recognition result text as a result output. If necessary, event tag information indicating a filler may be output as a supplemental output.
Incidentally, in the example of FIG. 3, the reliability is calculated for each word and the determination is made based on the reliability of each word, but this is not restrictive. The score determination unit 104 may change the recognition unit of the recognition result text output by the speech recognition unit 102 and the non-verbal speech recognition unit 103 into a different determination unit for score determination. For example, in FIG. 3 the speech recognition unit 102 and the non-verbal speech recognition unit 103 calculate the reliability for each word, but the score determination unit 104 may integrate the reliabilities per phrase or per sentence and make the determination based on the integrated reliability. By changing the range of score determination in this way, for example, the position at which the event tag information indicating a non-verbal sound is added can be changed, making the text easier to read. For example, when the score is determined per sentence, the event tag information is added to the end of the sentence.
FIG. 4 is a diagram showing an overview of the processing when the score determination unit is changed to the sentence unit. For convenience of explanation, descriptions of event reliabilities and the like are simplified compared with FIG. 3. Assume that the following utterance content is input and the following recognition result text is obtained.
Utterance content: 「わたしは、あのー、(laughter: ははは)、なんか、(cough: gohho)、(keyboard sound: kakakaka)、いい」
The utterance is in Japanese and is spoken as "watashiwa ano- hahaha nanka (gohho) kakakaka ii". In Japanese, the particle 「は」 following the subject is pronounced "wa".
Recognition result text: 「私は、あのー、母派、なんか、ゴッホ、かかかか、いい」
This is the recognition result text converted into Japanese, in which the cough and the keyboard sound are mixed in. Described in romaji, it is segmented into watashi, wa, ano-, hahaha, nanka, gohho, kakakaka, and ii and converted into Japanese.
Here, the score determination unit 104 performs score determination per sentence. That is, the score determination unit 104 sums the speech recognition reliabilities and the non-verbal speech recognition reliabilities within the recognition result text of one sentence. Taking FIG. 4 as an example, the score determination unit 104 calculates, for the sentence, the total reliability of the recognition result text and the total reliability of each event (the reliabilities of the positive event and the negative event). Based on these totals, it determines the adequacy of the recognition result text and the presence or absence of each event. In the example of FIG. 4, since the totals of the reliabilities of the recognition result text and of the positive events are equal to or greater than predetermined values, the score determination unit 104 determines that this recognition result text includes speech, laughter, a cough, and a keyboard sound.
The score determination unit 104 then outputs to the result output unit 105 the recognition result texts whose reliability is equal to or greater than the predetermined value, namely 「私」, 「は」, 「あのー」, 「なんか」, and 「いい」, together with the event information (laughter, cough, keyboard sound). The result output unit 105 acquires the event tag information from the event information and outputs it together with the recognition result texts whose reliability is equal to or greater than the predetermined value.
In this way, event tag information can be output per sentence and added to the end of the recognition result text, producing text that is easy to read.
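The sentence-unit determination described above can be pictured with the following rough sketch; the data layout, the threshold values, and the bracketed tag strings are assumptions for illustration and are not defined in the disclosure.

```python
# Assumed sketch of aggregating word-level reliabilities into one
# sentence-level judgment, so that event tags are appended at the sentence end.

def judge_sentence(words, text_threshold=0.5, event_threshold=1.0):
    """words: list of dicts like
       {"text": "私", "text_conf": 0.95,
        "events": {"laughter": 0.23, "cough": 0.21, "keyboard": 0.15}}
    Thresholds are illustrative values only."""
    kept_text = [w["text"] for w in words if w["text_conf"] >= text_threshold]
    # Sum each event's positive reliability over the whole sentence.
    totals = {}
    for w in words:
        for label, conf in w["events"].items():
            totals[label] = totals.get(label, 0.0) + conf
    detected = [label for label, total in totals.items() if total >= event_threshold]
    # Event tags go after the sentence text, keeping the transcript readable.
    return "".join(kept_text) + "".join(f"[{label}]" for label in detected)
```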
Next, the operation of the speech recognition device 100 of the present disclosure will be described. FIG. 5 is a flowchart showing the operation of the speech recognition device 100. The speech acquisition unit 101 acquires a speech waveform signal (S101), detects speech segments from the speech waveform signal, and outputs the speech (or other sound) of each segment as an audio signal to the speech recognition unit 102 and the non-verbal speech recognition unit 103 (S102).
The speech recognition unit 102 performs speech recognition processing on the audio signal and outputs the recognition result text, its reading, and its reliability (S103). The non-verbal speech recognition unit 103 performs non-verbal speech recognition processing on the audio signal and outputs the reliability of each event for each recognition target time (S104).
The score determination unit 104 determines, for each recognition target, the adequacy of the recognition result text or the event, based on the reliability of the recognition result text obtained by the speech recognition processing and the reliability of each event recognized by the non-verbal speech recognition (S105).
The result output unit 105 selects an appropriate recognition result text from the recognition result texts based on the determination result, or acquires event tag information, and outputs it (S106).
Through such processing, verbal and non-verbal sounds during a meeting or the like can be recognized.
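The S101 to S106 flow can be summarized with the following schematic sketch; the recognizer objects and their method names are placeholders assumed for illustration, and only the control flow mirrors the description above.

```python
# Minimal end-to-end sketch of the S101-S106 flow. The vad, asr, nonverbal,
# judge, and render arguments are hypothetical components injected by the caller.

def run_pipeline(waveform, vad, asr, nonverbal, judge, render):
    segments = vad.split(waveform)                        # S101-S102: segment detection
    outputs = []
    for seg in segments:
        text, reading, text_conf = asr.recognize(seg)     # S103: recognition text, reading, reliability
        event_confs = nonverbal.recognize(seg)            # S104: per-event reliabilities
        decision = judge(text, text_conf, event_confs)    # S105: text vs. event judgment
        outputs.append(render(decision))                  # S106: text or event tag output
    return outputs
```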
Next, an application example for another language will be described. The above utterance examples were in Japanese, but the same processing is naturally possible in other languages. FIG. 6 is a diagram showing the processing when the utterance is in English. For convenience of explanation, the description is simplified. In FIG. 6, the following utterance is made.
Utterance content: I go (laughter: hahaha) to (cough: off coff cough) (keyboard sound: clatter) school.
Recognition result text: I go ah the her head to Costoco caca grata.
Here, the recognition result texts "I", "go", "to", and "school" recognized by the speech recognition unit 102 have high speech recognition reliability.
On the other hand, the recognition result texts "the her head", "Costoco", and "caca grata" recognized by the speech recognition unit 102 have low recognition result text reliability but high event reliability. For example, for the recognition result text "the her head", the event reliability of laughter is high. This is because the laughter "hahaha" was recognized as if it were verbal sound.
The same applies to the recognition result texts "Costoco" and "caca grata": as a result of attempting to recognize the cough and the keyboard sound as verbal sounds, their speech reliabilities are derived as low. Since these sounds are a cough and keyboard noise, the recognition result texts for them should not be output.
In the example of FIG. 6, the following result is output based on the recognition result text reliabilities and the event reliabilities.
Result output: I go to school.
In this way, non-verbal sounds can be prevented from being output even for English speech recognition.
Next, the process of adding event tag information (a mark such as an image, a symbol, or text) indicating what kind of event a recognized non-verbal sound relates to, when the speech recognition device 100 of the present disclosure recognizes a non-verbal sound, will be described. In the following, where there are expressions and processes peculiar to Japanese, the explanation includes Japanese notation.
FIG. 7 is a diagram showing a specific example. FIG. 7(a) shows the actual utterance content. FIG. 7(b) shows the result output based on the recognition results.
FIG. 7 shows a conversation between users A and B. To simplify the explanation, the content of the conversation is omitted and indicated as 「―――」. FIG. 7(a) shows that user A talks while laughing and that user B gives backchannel responses to that conversation. The speech recognition unit 102 and the non-verbal speech recognition unit 103 can recognize who the speaker is based on, for example, the audio channel or sound source separation.
Written in English, the conversation is, for example, as follows.
User A:  I am Japanese.  (laughter)   I live in Tokyo.
User B:                  yeah (backchannel)          yeah (backchannel)
In FIG. 7, the speech recognition unit 102 and the non-verbal speech recognition unit 103 distinguish the utterances (sound sources) of user A and user B and perform speech recognition processing and non-verbal speech recognition processing on each. The speech recognition unit 102 and the non-verbal speech recognition unit 103 track the elapsed time of each utterance: they perform speech recognition processing and non-verbal speech recognition processing on user A's speech while recognizing the elapsed time of that speech, and likewise perform speech recognition processing and non-verbal speech recognition processing on user B's speech while recognizing the elapsed time of that speech. The result output unit 105 can therefore use these elapsed times to show that user B is giving backchannel responses in response to user A's utterance.
FIG. 8 is a diagram showing an example of processing determination for user A and user B. As shown in the figure, the speech recognition unit 102 and the non-verbal speech recognition unit 103 distinguish user A and user B based on a sound source separation technique and perform speech recognition processing and non-verbal speech recognition processing on each.
As shown in FIG. 8, the speech recognition unit 102 and the non-verbal speech recognition unit 103 that process user A's speech recognize user A's utterance 「です。」 (pronounced "desu", a Japanese verb). This 「です。」 indicates the end of a sentence in Japanese. In Japanese the sentence ends with a verb, but in other languages this is not necessarily the case. In parallel, the speech recognition unit 102 and the non-verbal speech recognition unit 103 that process user B's speech recognize user B's backchannel event 「うん」 (un). The speech recognition unit 102 and the non-verbal speech recognition unit 103 each keep track of the elapsed time; although omitted in FIG. 8, the elapsed time is associated with and managed for each recognition result text. When recognizing the end of a sentence in an English conversation, the judgment is made from the conversation as a whole.
The result output unit 105 outputs user A's 「です。」 (desu) and user B's event tag information (backchannel mark) in accordance with the determination processing of the score determination unit 104. At that time, the result output unit 105 outputs the recognized recognition result text 「です。」 (desu) and the event tag information (backchannel mark) at positions corresponding to the elapsed time of each utterance.
FIG. 7(b) is a diagram showing a specific example of the resulting output. As shown in the figure, the result output unit 105 outputs the backchannel mark in accordance with the elapsed time of user B's utterance. As a result, this backchannel mark corresponds to the position of the recognition result text of user A's 「私は―――です。」 (meaning "I am ―――.").
In this way, by adding event tag information according to the utterances of multiple users and their utterance timing, the atmosphere of the conversation can be expressed accurately.
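One way to realize the time-based alignment described here might look like the following sketch; the tuple layout and tag strings are assumptions for illustration, not part of the disclosure.

```python
# Illustrative sketch of merging a second speaker's event tags into the first
# speaker's transcript by elapsed time, as in FIG. 7(b).

def merge_by_time(tokens_a, events_b):
    """tokens_a: list of (start_time, text) for speaker A's recognized words.
    events_b: list of (start_time, tag) for speaker B's detected events,
    e.g. (3.2, "[backchannel]")."""
    merged = sorted(
        [(t, text) for t, text in tokens_a] + [(t, tag) for t, tag in events_b],
        key=lambda item: item[0],  # stable sort: ties keep A's word before B's tag
    )
    return " ".join(text for _, text in merged)
```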
FIG. 9 is a diagram showing an example of output results in another example. FIG. 9(a) shows the utterance content. The speech recognition unit 102 and the non-verbal speech recognition unit 103 recognize the utterances of user A and user B, respectively. The result output unit 105 outputs the recognition result text and the event tag information in accordance with the determination result of the score determination unit 104.
In FIG. 9(b), the result output unit 105 adds the event tag information to the recognition result text per sentence, that is, per user utterance. In FIG. 9(b), the result output unit 105 adds user B's event tag information (backchannel mark x1) to user A's recognition result text 「----です。」 (meaning "I am ----."). Similarly, to the recognition result text of user A's next utterance 「-----なんだよ。」 (meaning "I think -----."; the verb is not limited to "think" and depends on the conversation), it adds the event tag information of users A and B: laughter mark of user A x1 and backchannel mark of user B x1. As described above, the non-verbal speech recognition unit 103 recognizes the time at which user B's backchannel occurred and can insert the tag into user A's recognition result text in accordance with that time.
In FIG. 9(c), the result output unit 105 outputs, per paragraph (per speaking user), the recognition result text and the event tag information: 「----です。-----なんだよ。」 (meaning "I am ----. I think -----.") together with "laughter mark of user A x1, backchannel mark of user B x2". The result output unit 105 then operates so as to output the event tag information at the end of the recognition result text of user A's utterances.
That is, the result output unit 105 adds to the recognition result text, and outputs, event tag information consisting of a mark indicating that user A laughed and a mark indicating that user B gave backchannel responses. Here, the non-verbal speech recognition unit 103 recognizes that user A laughed once and that user B gave backchannel responses twice, and the result output unit 105 appends to each mark a number indicating that count. In this case, an utterance in a conversation among multiple users and the responses to it are treated as one paragraph unit. A paragraph unit means the unit of the user who uttered the recognition result text, but it may also be determined based on the intervals between utterances.
In the example of FIG. 9(d), similar processing is performed, but the result output unit 105 adds the laughter mark within user A's recognition result text and adds the event tag information that is user B's recognition result (the backchannel mark) to the end of user A's recognition result text.
That is, the result output unit 105 outputs user A's recognition result text and event tag information based on the utterance of the speaker whose speech contains recognition result text, namely user A, and when user B's utterance consists only of an event, it outputs user B's event tag information at the end of user A's recognition result text (or event tag information). When the recognition result of user B's utterance includes recognition result text, the result output unit 105 may output the event tag information together with that recognition result text of user B.
Next, an example of output results in yet another example is shown. FIG. 10 is a diagram showing this output example. According to this figure, the result output unit 105 can recognize topic boundaries based on the utterance contents (or recognition result texts) of user A and user B. For example, when the result output unit 105 detects a character string indicating an intention to change the topic, such as 「ところで」 ("by the way"), it performs processing to put the marks at the end of the topic so far. In FIG. 10, the result output unit 105 judges that one topic ends at the point where user B gives the backchannel response 「ふーん」 ("hmm"), and adds, at the end of that topic, a laughter mark for user A and a backchannel mark for user B. Alternatively, a topic can be estimated or scenes can be divided using a known topic estimation engine or topic segmentation engine.
This also makes it possible to add event tag information per topic. As topics, for example, the explanation period and the question-and-answer period of a meeting can be distinguished.
In FIG. 10(b), a mark corresponding to the number of occurrences, or a number indicating that count, is added; in addition, when an event has occurred a predetermined number of times or more, a mark with an image emphasizing that frequency may be added. For example, if laughter occurred ten times or more, a large smiling-face mark may be added.
FIGS. 7 and 9 show cases where laughter occurred as examples, but the speech recognition device 100 may prepare event tag information (images, marks, text) for each event and for each occurrence frequency in a storage unit (not shown) such as a memory, and change the event tag information according to the frequency. That is, a reference value for the occurrence frequency may be prepared and the event tag information changed accordingly.
For example, a mark indicating laughter may be added when laughter occurs twice or more, and a mark indicating coughing may be added when coughing occurs five times or more. The event tag information may also depend on the frequency: for example, a mark indicating coughing may be added when coughing occurs fewer than five times, and a mark showing a worried face may be added when coughing occurs five times or more. In a meeting or the like, a section with many laughter events (a predetermined frequency or more) may be given an approval mark or the like instead of the laughter mark.
As for keyboard sounds, when keyboard sounds occur a predetermined number of times or less, a mark displaying a warning to that effect may be added. When the volume is above a certain level, a mark indicating an even stronger display (for example, that the keystrokes are loud) may be added. Also, when speech recognition is performed by the speech recognition unit 102 and a certain amount of recognition result text ends up being output (when the reliability of the recognition result text is equal to or greater than a predetermined value), a mark indicating an even stronger warning message may be added.
These reference values or predetermined numbers of times may be predetermined values, or may be determined based on the occurrence frequency of events in the whole conversation. For example, the occurrence frequency of a certain event in all utterances, averaged per unit of time (or per paragraph, per sentence, etc.), may be used as the reference value (or predetermined number of times), and the event tag information may be added when the occurrence frequency is equal to or greater than that value.
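A frequency-dependent choice of event tag, as described above, might be organized roughly as follows; the threshold values and tag strings are invented for illustration, since the disclosure only states that the tag may change with the occurrence count.

```python
# Hypothetical sketch of mapping an event type and its occurrence count in a
# section to a display tag; thresholds and labels are illustrative assumptions.

def pick_tag(event, count, thresholds=None):
    """thresholds: per-event count at which a stronger tag is used."""
    thresholds = thresholds or {"laughter": 10, "cough": 5}
    if count == 0:
        return ""
    strong = count >= thresholds.get(event, float("inf"))
    if event == "laughter":
        return "[big smile]" if strong else f"[laughter x{count}]"
    if event == "cough":
        return f"[worried x{count}]" if strong else f"[cough x{count}]"
    return f"[{event} x{count}]"
```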
The event tag information may also be, instead of a mark, a pictogram such as a smiling face, or a moving image. Instead of adding event tag information, the character color of the recognition result text of the utterance made immediately before may be changed. For example, for a smile, the recognition result text may be given a bright color (for example, yellow) or the background may be given a bright color (the color of a blue sky); for a sigh, the recognition result text may be given a dark color (for example, gray) or the background may be made dark (gray).
In the above, the content of the event tag information is changed according to the type of event, the score determined by the score determination unit 104, and the frequency of the event, but when multiple events occur, the display form of the event tag information may also be changed according to the ratio of each event to all the events.
When a gloomy event such as a sigh occurs, a warning or advice may be displayed, for example, a message prompting the speaker to talk in a brighter voice.
In the present disclosure, the speech recognition device 100 may perform the speech recognition processing and the non-verbal speech recognition processing on the conversation of one or more users in real time, or may perform the speech recognition processing and the non-verbal speech recognition processing on recorded conversation data.
When performing the speech recognition processing and the non-verbal speech recognition processing in real time in a meeting or the like, the speech recognition device 100 may output (display) the result only to a specific person (terminal) or output (display) it so that everyone can see it. In a web conference or the like, if there is a display field associated with each speaker (terminal), the result may be displayed in the field corresponding to that speaker. The result may be displayed only to user A, or may also be displayed on the screens of other participants such as user C or user D.
 つぎに、イベントタグ情報と認識結果テキストとを関連付けておき、それを切り替え表示可能にする事例について説明する。図11は、その例示を示す図である。図11(a)は、実際の発話内容を示し、図11(b)は、結果出力例を示す。図11(b)に示されるとおり、結果出力部105は、音声認識部102が認識した認識結果テキストおよび非言語音声認識部103が認識したイベントタグ情報を出力する。その際、結果出力部105は、イベントタグ情報(笑いマークL1または相槌マークL2、L3)については、認識結果テキストと紐付けて、認識結果出力管理を行う。 Next, an example will be described in which event tag information and recognition result text are associated with each other and can be switched and displayed. FIG. 11 is a diagram showing an example thereof. FIG. 11(a) shows actual utterance content, and FIG. 11(b) shows an example of output of results. As shown in FIG. 11B , the result output unit 105 outputs the recognition result text recognized by the speech recognition unit 102 and the event tag information recognized by the non-language speech recognition unit 103 . At this time, the result output unit 105 performs recognition result output management by associating the event tag information (laughing mark L1 or backtracking marks L2 and L3) with the recognition result text.
 結果出力部105が、結果出力として認識結果テキストおよびイベントタグ情報をディスプレイに表示する場合、その表示を見たユーザの操作により当該イベントタグ情報L(L1~L3)が選択(マウスによるクリック等)されると、結果出力部105は、その選択を受け付け、そのイベントタグ情報の認識結果テキストをディスプレイに出力し、ディスプレイはそれを表示する(図11(c))。 When the result output unit 105 displays the recognition result text and the event tag information on the display as the result output, the event tag information L (L1 to L3) is selected by the operation of the user viewing the display (clicking with a mouse, etc.). Then, the result output unit 105 receives the selection, outputs the recognition result text of the event tag information to the display, and the display displays it (FIG. 11(c)).
 なお、この処理は、スコア判定部104による判定結果が所定条件である場合に、行うようにしてもよい。すなわち、スコア判定部104が、ある認識結果テキストおよびそれに対応するイベント信頼度が所定の数値に達していない場合に、いずれが妥当か正確な判定が困難である場合がある。その場合に、結果出力部105は、スコア判定部104から認識結果テキストとイベントタグ情報とそれぞれの信頼度とを取得して、ユーザ操作により切り替え表示を可能にしてもよい。 Note that this process may be performed when the result of determination by the score determination unit 104 satisfies a predetermined condition. That is, when the score determination unit 104 does not reach a predetermined numerical value for a certain recognition result text and the event reliability corresponding thereto, it may be difficult to accurately determine which is appropriate. In that case, the result output unit 105 may acquire the recognition result text, the event tag information, and the reliability of each from the score determination unit 104, and switch the display by user operation.
 なお、結果出力部105が他の外部端末に出力する場合、認識結果テキストと紐付けて、認識結果出力を行う。当該外部端末は、認識結果テキストおよびイベントタグ情報を表示するとともに、その外部端末のユーザ操作により当該イベントタグ情報L(L1~L3)が選択(マウスによるクリック等)されると、外部端末は、イベントタグ情報に紐付けられた認識結果テキストを表示する。 When the result output unit 105 outputs to another external terminal, it outputs the recognition result with the event tag information linked to the recognition result text. The external terminal displays the recognition result text and the event tag information, and when the event tag information L (L1 to L3) is selected (for example, clicked with a mouse) by a user operation on the external terminal, the external terminal displays the recognition result text linked to that event tag information.
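 For illustration only, and not as part of the disclosed embodiments, the following minimal Python sketch shows one way a result output unit could keep each piece of event tag information linked to its recognition result text and switch what is shown when a tag is selected. All names (ResultOutput, register, on_tag_selected) and the data layout are assumptions introduced here.

    class ResultOutput:
        """Hypothetical helper: links each event tag to its recognized text."""
        def __init__(self):
            # tag_id -> (event label, recognition result text for that segment)
            self._links = {}

        def register(self, tag_id, event_label, recognized_text):
            self._links[tag_id] = (event_label, recognized_text)

        def render_tag(self, tag_id):
            # What is shown by default: a compact mark such as "[laugh]".
            event_label, _ = self._links[tag_id]
            return f"[{event_label}]"

        def on_tag_selected(self, tag_id):
            # Switching operation: show the recognized text behind the mark.
            _, recognized_text = self._links[tag_id]
            return recognized_text

    out = ResultOutput()
    out.register("L1", "laugh", "hahaha")
    print(out.render_tag("L1"))       # [laugh]
    print(out.on_tag_selected("L1"))  # hahaha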
 つぎに、音声認識結果の修正処理について説明する。図12は、本開示の変形における音声認識装置100aの機能構成を示すブロック図である。図に示されるとおり、音声認識装置100aは、図1における音声認識装置100の機能構成に加えて、修正部106を含んで構成されている。 Next, the correction processing of speech recognition results will be explained. FIG. 12 is a block diagram showing the functional configuration of the speech recognition device 100a according to the modification of the present disclosure. As shown in the figure, the speech recognition device 100a includes a correction unit 106 in addition to the functional configuration of the speech recognition device 100 in FIG.
 修正部106は、結果出力部105からディスプレイに出力されて表示された結果出力の認識結果テキストまたはイベントタグ情報を修正する部分である。このディスプレイは認識結果テキストおよびイベントタグ情報を表示しており、修正部106は、ユーザの操作に従って、ポインタ等で示された修正箇所(認識結果テキストまたはイベントタグ情報の一部)を受け付けると、その修正箇所に応じた一または複数の修正候補をユーザにディスプレイで表示する。例えば、プルダウンでその修正候補を表示する。 The correction unit 106 is a part that corrects the recognition result text or event tag information of the result output that is output from the result output unit 105 and displayed on the display. This display displays the recognition result text and event tag information, and the correction unit 106 accepts the correction portion indicated by the pointer or the like (part of the recognition result text or event tag information) according to the user's operation. One or a plurality of correction candidates corresponding to the corrected portion are displayed to the user on the display. For example, the correction candidates are displayed in a pulldown.
 なお、結果出力部105は、同じ発話に対する認識結果テキストおよびイベントタグ情報を対応付けて修正候補として紐付け管理(記憶)をしている。なお、認識結果テキストには、漢字に変換したテキスト、平仮名のままのテキスト、カタカナのままのテキスト、そのほかの変換した記号またはテキストを含むものとする。 It should be noted that the result output unit 105 associates the recognition result text and event tag information for the same utterance with each other and manages (stores) them as correction candidates. Note that the recognition result text includes text converted into kanji, text in hiragana, text in katakana, and other converted symbols or text.
 そして、修正部106は、ユーザにより選択された一の修正候補を切り替えて、結果出力部105が、その修正候補をディスプレイに出力して、表示させる。修正候補は、音声認識部102および非言語音声認識部103により認識された認識結果テキストおよびイベントである。上述の通り、修正候補は、漢字に変換したテキスト、平仮名のままのテキスト、カタカナのままのテキスト、またはそのほかの変換した記号若しくはテキストを含むものとする。 Then, the correction unit 106 switches the one correction candidate selected by the user, and the result output unit 105 outputs and displays the correction candidate on the display. The correction candidates are recognition result texts and events recognized by the speech recognition unit 102 and the non-language speech recognition unit 103 . As described above, the candidate corrections may include text converted to kanji, plain hiragana text, plain katakana text, or other converted symbols or text.
 図13は、その修正の具体例を示す図である。図13(a)は、発話内容を示し、図13(b)は、ユーザによる修正画面の一例を示す図である。図に示されるとおり、修正部106は、ユーザ操作に従って、ポインタPを移動させる。修正部106は、ポインタPで示された修正箇所が選択されると、修正候補Bを表示する。結果出力部105は、ユーザが修正候補Bから任意の候補を選択するとその候補を結果出力としてディスプレイに出力する。図13において、修正候補Bには、「母派」が含まれており、「ははは」をそのまま認識されたテキストも選択可能としている。なお、本開示においては、「母派」は誤認識されたテキストである。 FIG. 13 is a diagram showing a specific example of the correction. FIG. 13(a) shows the utterance content, and FIG. 13(b) is a diagram showing an example of a correction screen operated by the user. As shown in the figure, the correction unit 106 moves the pointer P according to the user's operation. When the correction portion indicated by the pointer P is selected, the correction unit 106 displays correction candidates B. When the user selects an arbitrary candidate from the correction candidates B, the result output unit 105 outputs that candidate to the display as the result output. In FIG. 13, the correction candidates B include "母派", and the text "ははは" recognized as it is can also be selected. Note that, in the present disclosure, "母派" is a misrecognized text.
 図では、イベントタグ情報を修正対象として説明しているが、当然に認識結果テキストを修正対象としてもよい。 In the diagram, the event tag information is explained as the correction target, but of course the recognition result text may also be the correction target.
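 As a further illustration (again, a sketch rather than the disclosed implementation), the correction flow above can be pictured as a per-segment list of interchangeable candidates, one of which is currently displayed. The segment identifiers and candidate strings below are hypothetical.

    # Per-segment correction candidates: the event tag plus recognized-text
    # variants (kanji conversion, hiragana, katakana, misrecognized text).
    correction_candidates = {
        "seg-3": ["[laugh]", "ははは", "ハハハ", "母派"],
    }
    display = {"seg-3": "[laugh]"}  # what is currently shown for each segment

    def list_candidates(segment_id):
        # Candidates offered in the pull-down for the selected segment.
        return correction_candidates.get(segment_id, [])

    def apply_correction(segment_id, chosen):
        if chosen not in correction_candidates.get(segment_id, []):
            raise ValueError("not a registered candidate")
        display[segment_id] = chosen
        return display[segment_id]

    print(list_candidates("seg-3"))
    print(apply_correction("seg-3", "ははは"))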
 つぎに、イベントの発生頻度の集計について説明する。本開示における変形例における音声認識装置100aは、音声認識を行う際、イベントの頻度を集計してもよい。イベントの発生頻度を集計することにより、その会話の重要度、会話の質等を判断することができる。図14は、別の変形例の音声認識装置100bの機能構成を示すブロック図である。この音声認識装置100bは、音声認識装置100の機能構成に加えて、集計部107を備えている。 Next, counting of the occurrence frequency of events will be described. The speech recognition device 100a according to a modification of the present disclosure may count the frequency of events when performing speech recognition. By counting the occurrence frequency of events, the importance of the conversation, the quality of the conversation, and the like can be judged. FIG. 14 is a block diagram showing the functional configuration of a speech recognition device 100b of another modification. This speech recognition device 100b includes a counting unit 107 in addition to the functional configuration of the speech recognition device 100.
 この集計部107は、スコア判定部104が判定した非言語音声認識部103が認識したイベントの発生頻度を集計する部分である。例えば、集計部107は、スコア判定部104において判定された笑い、頷き等のイベントの種類およびその発生頻度を集計する。その際、集計部107は、時間区間、段落ごと、話題ごと、または話者ごとに集約してもよい。また、長いポーズがあったり、段落が替わったり、等で集約してもよい。これにより、音声認識結果を分析するユーザは、時間区間、話題などの分類ごとの笑いまたは頷きの多さに基づいて、どの発話が重要であるか、その重要度を判断することができる。 This tabulation unit 107 is a part that tabulates the frequency of occurrence of events recognized by the non-verbal speech recognition unit 103 determined by the score determination unit 104 . For example, the tabulation unit 107 tabulates the types of events such as laughter and nod determined by the score determination unit 104 and their occurrence frequencies. At that time, the aggregating unit 107 may aggregate by time period, paragraph, topic, or speaker. Also, it may be aggregated by having a long pause, changing paragraphs, or the like. As a result, the user who analyzes the speech recognition result can determine which utterance is important and its importance level based on the amount of laughter or nod for each classification such as time period and topic.
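 As a rough illustration of this counting step (a sketch only; the record format and field names are assumptions, not taken from the disclosure), the tally can be grouped by any key such as topic, speaker, or time interval:

    from collections import Counter, defaultdict

    # Hypothetical event records produced by the score determination step.
    events = [
        {"topic": "topic-1", "speaker": "A", "event": "laugh"},
        {"topic": "topic-1", "speaker": "B", "event": "nod"},
        {"topic": "topic-2", "speaker": "B", "event": "sigh"},
        {"topic": "topic-1", "speaker": "B", "event": "laugh"},
    ]

    def tally(records, key):
        # Count how often each event type occurs per value of the chosen key.
        counts = defaultdict(Counter)
        for record in records:
            counts[record[key]][record["event"]] += 1
        return counts

    print(dict(tally(events, "topic")))    # per-topic event frequencies
    print(dict(tally(events, "speaker")))  # per-speaker event frequencies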
 なお、集計部107は、話題の区切りを判断するにあたって、音声認識部102が音声認識して得た所定の文字列(例えば、「ところで」(by the way))を話題の区切りとして判断するが、当然にそれ以外の方法で話題を判断してもよい。話者についても同様に、音声認識部102および非言語音声認識部103が、音源分離または音声チャネルで区別した話者に基づいて区別できる。 Note that, when judging the delimiters of topics, the counting unit 107 determines a predetermined character string (for example, “by the way”) obtained by speech recognition by the speech recognition unit 102 as the delimiters of topics. Of course, other methods may be used to determine the topic. Speakers can also be similarly distinguished by the speech recognition unit 102 and the non-verbal speech recognition unit 103 based on speakers distinguished by source separation or speech channels.
 図15は、話題ごとのイベント種別の頻度を示したグラフである。音声認識装置100bは、このグラフを提供するために話題ごとに、各イベントの発生頻度の情報を提供することができる。なお、グラフに代えて表にしてもよい。また、話題に代えて、話者ごとの各イベントの発生頻度を提供してもよい。これによって、話題または話者ごとに肯定的または否定的な話題または話者であることの分析結果を得ることができる。 FIG. 15 is a graph showing the frequency of each event type for each topic. To provide this graph, the speech recognition device 100b can provide information on the occurrence frequency of each event for each topic. A table may be used instead of the graph. In addition, instead of per topic, the occurrence frequency of each event may be provided for each speaker. This makes it possible to obtain, for each topic or each speaker, an analysis result indicating whether that topic or speaker is positive or negative.
 同様に、イベントの発生時刻を参照し、時間帯で分類してもよい。笑い声または頷きが多く発生した時間帯を肯定的な時間帯、笑い声または頷きが少なかったり、ため息が多い時間帯を否定的な時間帯とし、これを分析結果とすることができる。 Similarly, it is possible to refer to the time of event occurrence and classify by time zone. A time period in which many laughter sounds or nods occur is regarded as a positive time period, and a time period in which there are few laughter sounds or nods or many sighs is regarded as a negative time period, and this can be used as an analysis result.
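 A minimal sketch of this positive/negative classification, assuming events have already been bucketed per time zone; the decision rule and labels are illustrative and not values given in the disclosure:

    def classify_time_zone(counts):
        # More laughter/nods than sighs -> positive; the reverse -> negative.
        positive = counts.get("laugh", 0) + counts.get("nod", 0)
        negative = counts.get("sigh", 0)
        if positive > negative:
            return "positive"
        if negative > positive:
            return "negative"
        return "neutral"

    per_zone = {
        "10:00-10:10": {"laugh": 5, "nod": 3},
        "10:10-10:20": {"sigh": 4, "nod": 1},
    }
    print({zone: classify_time_zone(c) for zone, c in per_zone.items()})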
 つぎに、本開示の音声認識装置100の作用効果について説明する。 Next, the effects of the speech recognition device 100 of the present disclosure will be described.
 本開示の音声認識装置100は、音情報を取得する音情報取得部として機能する音声取得部101と、音情報のうちユーザAの音声(第1の音情報)の認識結果テキストを取得する音声認識部102と、ユーザAの音声(第1の音情報)に対する(関連する)ユーザBの音声(第2の音情報)が非言語によるイベント(例えば頷き)の発生であることを判断する非言語音声認識部103と、イベントを示すイベントタグ情報を認識結果テキストに関連付けて出力する結果出力部105を備える。 The speech recognition device 100 of the present disclosure includes the voice acquisition unit 101, which functions as a sound information acquisition unit that acquires sound information; the speech recognition unit 102, which acquires the recognition result text of user A's voice (first sound information) in the sound information; the non-verbal speech recognition unit 103, which determines that user B's voice (second sound information) related to user A's voice (first sound information) is the occurrence of a non-verbal event (for example, a nod); and the result output unit 105, which outputs event tag information indicating the event in association with the recognition result text.
 この構成により、イベントの発生を認識結果に関連付けてユーザ等に提示することができる。例えば、非言語音認識結果の一つである笑いを示すマークなどのイベントタグ情報(付加情報)を認識結果テキストに付加したり、認識結果テキストを色づけするなどで加工する。従って、視覚的にイベントの発生を関連付けた場合に、そのイベントタグ情報を受けたユーザは直感的に、また感覚的にそのイベントを認識することができる。本開示においては、ユーザであっても異なっていても、イベントタグ情報を認識結果テキストに付加することで直感的または感覚的にイベント認識を可能にする。 With this configuration, the occurrence of an event can be presented to a user or the like in association with the recognition result. For example, event tag information (additional information) such as a mark indicating laughter, which is one of the non-verbal sound recognition results, is added to the recognition result text, or the recognition result text is processed, for example by coloring it. Therefore, when the occurrence of the event is associated visually, the user who receives the event tag information can recognize the event intuitively and instinctively. In the present disclosure, whether the users are the same or different, adding the event tag information to the recognition result text enables intuitive and instinctive recognition of the event.
 これらマーク等の付加情報は、記号のほか、色で表した情報としてもよい。 Additional information such as these marks may be information represented by colors in addition to symbols.
 また、音声認識装置100における結果出力部105は、イベントの発生頻度が所定条件を満たす場合に、イベントタグ情報を認識結果テキストに関連付けて出力する。 Also, the result output unit 105 in the speech recognition apparatus 100 outputs the event tag information in association with the recognition result text when the occurrence frequency of the event satisfies a predetermined condition.
 例えば、笑いの発生頻度が所定回数以上である場合に、笑いを示すマークを、認識結果テキストに付加する。これによりその会話の場の雰囲気を直感的に把握できる。すなわち、1,2回程度の笑いと、それ以上の笑いが発生した場合とでは、その場の雰囲気は違う。 For example, when the frequency of occurrence of laughter is a predetermined number or more, a mark indicating laughter is added to the recognition result text. This makes it possible to intuitively grasp the atmosphere of the place of conversation. That is, the atmosphere of the place is different between one or two times of laughter and the case of more laughter.
 また、結果出力部105は、ユーザAの発話を含む所定の条件を満たした音情報群(ユーザAの発話およびユーザBの発話)に対する認識結果テキストに、ユーザBのイベントタグ情報(イベントを示す情報)を関連付ける。例えば、所定の条件とは、音情報群が、文単位、段落単位、または話題単位に区分されていることである。文末、段落末、または話題の末尾にイベントタグ情報を付加することで、ひとまとまりとなった会話でそのイベントの把握を容易にすることができる。 In addition, the result output unit 105 associates user B's event tag information (information indicating an event) with the recognition result text for a sound information group (user A's utterance and user B's utterance) that includes user A's utterance and satisfies a predetermined condition. For example, the predetermined condition is that the sound information group is divided into sentence units, paragraph units, or topic units. By adding the event tag information to the end of a sentence, a paragraph, or a topic, the event can easily be grasped for the conversation as a whole.
 また、結果出力部105は、ユーザBなどのイベントが発生した場合に、イベントタグ情報を、ユーザAの認識結果テキストの末尾に付加する。すなわち、話者が異なる発話における認識結果テキストの末尾にイベントタグ情報を付加する。これにより、会話の雰囲気を直感的に把握できる。 In addition, when an event by another user such as user B occurs, the result output unit 105 adds the event tag information to the end of user A's recognition result text. That is, the event tag information is added to the end of the recognition result text of an utterance by a different speaker. This makes it possible to intuitively grasp the atmosphere of the conversation.
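 For illustration, a minimal sketch of attaching an event mark to the end of another speaker's recognized sentence when the occurrence frequency satisfies a condition; the threshold of two laughs and the mark string are hypothetical, not values from the disclosure:

    LAUGH_MARK = "(laughs)"

    def annotate(recognized_text, listener_events, threshold=2):
        # Append the mark only if the listening user laughed often enough.
        laughs = sum(1 for e in listener_events if e == "laugh")
        if laughs >= threshold:
            return recognized_text + " " + LAUGH_MARK
        return recognized_text

    # User A's recognized sentence; user B's non-verbal events during it.
    print(annotate("Then the demo crashed again.", ["laugh", "laugh", "laugh"]))
    print(annotate("Let's move to the next item.", ["laugh"]))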
 また、音声認識装置100aは、認識結果テキストまたはイベントを示す情報を修正する修正部106をさらに備える。音声認識部102は、ユーザAおよびユーザBの音声(第1の音情報および前記第2の音情報)を音声認識処理してそれぞれの認識結果テキストを取得する。また、非言語音声認識部103は、ユーザAおよびユーザBの音声(第1の音情報および前記第2の音情報)を非言語音声認識処理してそれぞれのイベント発生を判断する。修正部106は、これら音情報のそれぞれの認識結果テキストまたはイベントタグ情報を用いて修正する。 The speech recognition device 100a further includes the correction unit 106, which corrects the recognition result text or the information indicating the event. The speech recognition unit 102 performs speech recognition processing on the voices of user A and user B (the first sound information and the second sound information) to acquire the respective recognition result texts. The non-verbal speech recognition unit 103 performs non-verbal speech recognition processing on the voices of user A and user B (the first sound information and the second sound information) to determine the occurrence of the respective events. The correction unit 106 performs the correction using the recognition result text or the event tag information of each of these pieces of sound information.
 これにより、認識結果テキストおよびイベントタグ情報を修正し、正しい認識結果を得ることができる。 As a result, the recognition result text and event tag information can be corrected, and correct recognition results can be obtained.
 また、音声認識装置100において、音声認識部102および非言語音声認識部103は、ユーザAおよびBの両方の音情報(第1の音情報、第2の音情報)に対して音声認識処理およびイベントの発生の判断を行う。そして、結果出力部105は、スコア判定部104の判定結果に従って、認識結果テキストとイベントタグ情報を取得する。そして、結果出力部105は、例えばユーザBの笑いのイベントタグ情報とともにその笑いの認識結果テキストをディスプレイに出力する。このディスプレイを見た操作ユーザは、ユーザBのイベントタグ情報と認識結果テキストを切り替えて表示させることができる。すなわち、結果出力部105は、操作ユーザが切替操作をすると(例えばポインタによる選択)、イベントタグ情報と認識結果との切り替え表示のための出力制御を行う。 In the speech recognition device 100, the speech recognition unit 102 and the non-verbal speech recognition unit 103 perform the speech recognition processing and the determination of event occurrence on the sound information of both user A and user B (the first sound information and the second sound information). The result output unit 105 then acquires the recognition result text and the event tag information according to the determination result of the score determination unit 104. The result output unit 105 outputs, for example, the event tag information of user B's laughter together with the recognition result text of that laughter to the display. The operating user who sees this display can switch the display between user B's event tag information and the recognition result text. That is, when the operating user performs a switching operation (for example, a selection with a pointer), the result output unit 105 performs output control for switching the display between the event tag information and the recognition result.
 出力先がディスプレイではなく、外部端末である場合には、結果出力部105は、ユーザBのイベントタグ情報と認識結果テキストを外部端末に出力する。この出力はイベントタグ情報と認識結果テキストの切り替えを可能にする出力制御に相当する。この外部端末側では、イベントタグ情報を表示するとともに、その外部端末のユーザ操作に従って、イベントタグ情報と認識結果テキストとの表示の切り替えを可能にする。 When the output destination is not the display but the external terminal, the result output unit 105 outputs the event tag information of user B and the recognition result text to the external terminal. This output corresponds to output control that enables switching between event tag information and recognition result text. On the external terminal side, the event tag information is displayed, and the display of the event tag information and the recognition result text can be switched according to the user's operation of the external terminal.
 上記実施形態においては、音声認識装置100内で、信頼度の判定およびそれに応じた結果出力を行っていたが、それら処理を外部端末に依頼してもよい。すなわち、音声認識処理に基づいた結果を出力しないための処理として、音声認識装置100は、音声認識処理による認識結果テキストおよびその信頼度、並びに、非言語音声認識処理による認識結果(イベント等)およびその信頼度を、外部端末に出力する。外部端末は、それら情報に基づいて認識結果テキスト等を得ることができる。 In the above embodiment, the determination of the reliabilities and the output of the corresponding result are performed within the speech recognition device 100, but these processes may be delegated to an external terminal. That is, as processing for not outputting a result based on the speech recognition processing, the speech recognition device 100 outputs, to the external terminal, the recognition result text obtained by the speech recognition processing and its reliability, as well as the recognition result (an event or the like) obtained by the non-verbal speech recognition processing and its reliability. The external terminal can obtain the recognition result text and the like based on this information.
 また、本開示の音声認識装置100において、音情報処理部は、音声認識部102および非言語音声認識部103のそれぞれの認識結果を判定するスコア判定部104と、その判定に従って音声認識部102による認識結果テキストを加工して出力する結果出力部105と、をさらに有する。すなわち、結果出力部105は、スコア判定部104による判定結果に基づいて、認識結果テキストのうち、非言語音の部分を出力しないようにし、言語の部分のみを出力する。 In the speech recognition device 100 of the present disclosure, the sound information processing unit further includes the score determination unit 104, which evaluates the respective recognition results of the speech recognition unit 102 and the non-verbal speech recognition unit 103, and the result output unit 105, which processes and outputs the recognition result text of the speech recognition unit 102 according to that determination. That is, based on the determination result of the score determination unit 104, the result output unit 105 does not output the non-verbal sound portion of the recognition result text and outputs only the language portion.
 これにより非言語音部分の認識結果テキストを出力しないことから読みやすい認識結果テキストを得ることができる。 This makes it possible to obtain easy-to-read recognition result text by not outputting the recognition result text for the non-verbal sound part.
 また、本開示の音声認識装置100において、音声認識部102は、音情報が言語であることに対する言語音信頼度、すなわち認識結果テキストに対する信頼度を導出し、非言語音声認識部103は、音情報が非言語であることに対する非言語音信頼度、すなわちイベントに対する信頼度を導出する。そして、スコア判定部104は、これら信頼度(言語音信頼度および非言語音信頼度)に基づいて、音声認識部102および非言語音声認識部103による認識結果(認識結果テキストおよび各イベント)を判定する。 In the speech recognition device 100 of the present disclosure, the speech recognition unit 102 derives a verbal sound reliability indicating that the sound information is language, that is, the reliability of the recognition result text, and the non-verbal speech recognition unit 103 derives a non-verbal sound reliability indicating that the sound information is non-verbal, that is, the reliability of each event. The score determination unit 104 then evaluates the recognition results of the speech recognition unit 102 and the non-verbal speech recognition unit 103 (the recognition result text and each event) based on these reliabilities (the verbal sound reliability and the non-verbal sound reliability).
 ここで、非言語音信頼度を示すイベントの信頼度は、音声波形信号(音情報)が非言語音であることの信頼度、および音声波形信号(音情報)が非言語音でないことの信頼度を示す。すなわち、肯定イベントおよび否定イベントのそれぞれを示す。 Here, the event reliabilities indicating the non-verbal sound reliability include a reliability that the speech waveform signal (sound information) is a non-verbal sound and a reliability that the speech waveform signal (sound information) is not a non-verbal sound, corresponding to a positive event and a negative event, respectively.
 そして、スコア判定部104は、肯定イベントおよび否定イベントのそれぞれの信頼度の少なくとも一方に対して重み付け処理を行って、判定処理を行う。 Then, the score determination unit 104 performs determination processing by weighting at least one of the reliability of each of the positive event and the negative event.
 このような重み付け処理を行うことで、ユーザの属性若しくは種別、会議の内容に応じた判定を行うことができる。 By performing such a weighting process, it is possible to make a judgment according to the user's attribute or type and the content of the meeting.
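 A minimal sketch of such a weighted comparison (the weights, scores, and decision rule are assumptions chosen for illustration, not the disclosed method):

    def judge(text_conf, event_conf_pos, event_conf_neg, w_pos=1.2, w_neg=1.0):
        # Weight the confidences that the segment is (or is not) a non-verbal
        # event, then compare against the language confidence of the text.
        event_score = w_pos * event_conf_pos - w_neg * event_conf_neg
        return "event" if event_score > text_conf else "text"

    print(judge(text_conf=0.35, event_conf_pos=0.80, event_conf_neg=0.10))  # event
    print(judge(text_conf=0.90, event_conf_pos=0.40, event_conf_neg=0.30))  # text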
 上記において非言語音は、笑い声、咳、頷き、くしゃみ、およびキーボード音の少なくとも一つである。これらに限られるものではない。 In the above, the non-verbal sound is at least one of a laughing voice, a cough, a nod, a sneeze, and a keyboard sound, but is not limited to these.
 また、本開示において、音声認識部102は、所定の音声認識単位(文単位、文節単位、単語単位など)で音声認識を行い、非言語音声認識部103は、所定の音声認識単位に応じた時間単位で非言語音声認識を行う。 Further, in the present disclosure, the speech recognition unit 102 performs speech recognition in predetermined speech recognition units (sentence units, phrase units, word units, and the like), and the non-verbal speech recognition unit 103 performs non-verbal speech recognition in time units corresponding to the predetermined speech recognition units.
 これにより、音声認識の単位に合わせた非言語音の認識を可能にする。 This makes it possible to recognize non-verbal sounds according to the unit of speech recognition.
 スコア判定部104は、音声認識単位とは異なる判定単位で、認識結果を判定してもよい。例えば、単語単位で音声認識および非言語音声認識をして、スコア判定に際しては、文単位としてもよい。 The score determination unit 104 may determine the recognition result in a determination unit different from the voice recognition unit. For example, speech recognition and non-language speech recognition may be performed on a word-by-word basis, and score determination may be made on a sentence-by-sentence basis.
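 As an illustrative sketch of judging on a coarser unit than recognition (here, word-level confidences are averaged over a sentence before the decision; the pooling rule is an assumption):

    def sentence_level_judgement(word_text_confs, word_event_confs):
        # Pool word-level confidences, then decide once per sentence.
        avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
        return "event" if avg(word_event_confs) > avg(word_text_confs) else "text"

    print(sentence_level_judgement([0.2, 0.3, 0.25], [0.7, 0.8, 0.75]))  # event
    print(sentence_level_judgement([0.9, 0.85], [0.3, 0.2]))             # text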
 上記実施形態の説明に用いたブロック図は、機能単位のブロックを示している。これらの機能ブロック(構成部)は、ハードウェアおよびソフトウェアの少なくとも一方の任意の組み合わせによって実現される。また、各機能ブロックの実現方法は特に限定されない。すなわち、各機能ブロックは、物理的または論理的に結合した1つの装置を用いて実現されてもよいし、物理的または論理的に分離した2つ以上の装置を直接的または間接的に(例えば、有線、無線などを用いて)接続し、これら複数の装置を用いて実現されてもよい。機能ブロックは、上記1つの装置または上記複数の装置にソフトウェアを組み合わせて実現されてもよい。 The block diagram used in the description of the above embodiment shows blocks for each function. These functional blocks (components) are realized by any combination of at least one of hardware and software. Also, the method of implementing each functional block is not particularly limited. That is, each functional block may be implemented using one device physically or logically coupled, or directly or indirectly using two or more physically or logically separate devices (e.g. , wired, wireless, etc.) and may be implemented using these multiple devices. A functional block may be implemented by combining software in the one device or the plurality of devices.
 機能には、判断、決定、判定、計算、算出、処理、導出、調査、探索、確認、受信、送信、出力、アクセス、解決、選択、選定、確立、比較、想定、期待、見做し、報知(broadcasting)、通知(notifying)、通信(communicating)、転送(forwarding)、構成(configuring)、再構成(reconfiguring)、割り当て(allocating、mapping)、割り振り(assigning)などがあるが、これらに限られない。たとえば、送信を機能させる機能ブロック(構成部)は、送信部(transmitting unit)や送信機(transmitter)と呼称される。いずれも、上述したとおり、実現方法は特に限定されない。 Functions include, but are not limited to, judging, deciding, determining, calculating, computing, processing, deriving, investigating, looking up (searching), ascertaining, receiving, transmitting, outputting, accessing, resolving, selecting, choosing, establishing, comparing, assuming, expecting, considering, broadcasting, notifying, communicating, forwarding, configuring, reconfiguring, allocating (mapping), assigning, and the like. For example, a functional block (component) that performs transmission is called a transmitting unit or a transmitter. In any case, as described above, the implementation method is not particularly limited.
 例えば、本開示の一実施の形態における音声認識装置100、100aおよび100bは、本開示の音声認識方法の処理を行うコンピュータとして機能してもよい。以下、音声認識装置100ついて説明するが、音声認識装置100aおよび100bについても同様である。図16は、本開示の一実施の形態に係る音声認識装置100のハードウェア構成の一例を示す図である。上述の音声認識装置100は、物理的には、プロセッサ1001、メモリ1002、ストレージ1003、通信装置1004、入力装置1005、出力装置1006、バス1007などを含むコンピュータ装置として構成されてもよい。 For example, the speech recognition devices 100, 100a, and 100b according to the embodiment of the present disclosure may function as computers that perform processing of the speech recognition method of the present disclosure. Although the speech recognition device 100 will be described below, the same applies to the speech recognition devices 100a and 100b. FIG. 16 is a diagram showing an example of the hardware configuration of the speech recognition device 100 according to an embodiment of the present disclosure. The speech recognition device 100 described above may be physically configured as a computer device including a processor 1001, a memory 1002, a storage 1003, a communication device 1004, an input device 1005, an output device 1006, a bus 1007, and the like.
 なお、以下の説明では、「装置」という文言は、回路、デバイス、ユニットなどに読み替えることができる。音声認識装置100のハードウェア構成は、図に示した各装置を1つまたは複数含むように構成されてもよいし、一部の装置を含まずに構成されてもよい。 In the following explanation, the term "apparatus" can be read as a circuit, device, unit, or the like. The hardware configuration of the speech recognition apparatus 100 may be configured to include one or more of each device shown in the figure, or may be configured without some of the devices.
 音声認識装置100における各機能は、プロセッサ1001、メモリ1002などのハードウェア上に所定のソフトウェア(プログラム)を読み込ませることによって、プロセッサ1001が演算を行い、通信装置1004による通信を制御したり、メモリ1002およびストレージ1003におけるデータの読み出しおよび書き込みの少なくとも一方を制御したりすることによって実現される。 Each function of the speech recognition device 100 is realized by loading predetermined software (a program) onto hardware such as the processor 1001 and the memory 1002, whereby the processor 1001 performs computation, controls communication by the communication device 1004, and controls at least one of reading and writing of data in the memory 1002 and the storage 1003.
 プロセッサ1001は、例えば、オペレーティングシステムを動作させてコンピュータ全体を制御する。プロセッサ1001は、周辺装置とのインターフェース、制御装置、演算装置、レジスタなどを含む中央処理装置(CPU:Central Processing Unit)によって構成されてもよい。例えば、上述の音声認識部102および非言語音声認識部103などは、プロセッサ1001によって実現されてもよい。 The processor 1001, for example, operates an operating system and controls the entire computer. The processor 1001 may be configured by a central processing unit (CPU) including an interface with peripheral devices, a control device, an arithmetic device, registers, and the like. For example, the speech recognition unit 102 and the non-verbal speech recognition unit 103 described above may be implemented by the processor 1001 .
 また、プロセッサ1001は、プログラム(プログラムコード)、ソフトウェアモジュール、データなどを、ストレージ1003および通信装置1004の少なくとも一方からメモリ1002に読み出し、これらに従って各種の処理を実行する。プログラムとしては、上述の実施の形態において説明した動作の少なくとも一部をコンピュータに実行させるプログラムが用いられる。例えば、音声認識部102は、メモリ1002に格納され、プロセッサ1001において動作する制御プログラムによって実現されてもよく、他の機能ブロックについても同様に実現されてもよい。上述の各種処理は、1つのプロセッサ1001によって実行される旨を説明してきたが、2以上のプロセッサ1001により同時または逐次に実行されてもよい。プロセッサ1001は、1以上のチップによって実装されてもよい。なお、プログラムは、電気通信回線を介してネットワークから送信されても良い。 The processor 1001 also reads programs (program codes), software modules, data, etc. from at least one of the storage 1003 and the communication device 1004 to the memory 1002, and executes various processes according to them. As the program, a program that causes a computer to execute at least part of the operations described in the above embodiments is used. For example, the speech recognition unit 102 may be implemented by a control program stored in the memory 1002 and running on the processor 1001, and other functional blocks may be implemented similarly. Although it has been described that the above-described various processes are executed by one processor 1001, they may be executed simultaneously or sequentially by two or more processors 1001. FIG. Processor 1001 may be implemented by one or more chips. Note that the program may be transmitted from a network via an electric communication line.
 メモリ1002は、コンピュータ読み取り可能な記録媒体であり、例えば、ROM(Read Only Memory)、EPROM(Erasable Programmable ROM)、EEPROM(Electrically Erasable Programmable ROM)、RAM(Random Access Memory)などの少なくとも1つによって構成されてもよい。メモリ1002は、レジスタ、キャッシュ、メインメモリ(主記憶装置)などと呼ばれてもよい。メモリ1002は、本開示の一実施の形態に係る音声認識方法を実施するために実行可能なプログラム(プログラムコード)、ソフトウェアモジュールなどを保存することができる。 The memory 1002 is a computer-readable recording medium and may be constituted by at least one of, for example, a ROM (Read Only Memory), an EPROM (Erasable Programmable ROM), an EEPROM (Electrically Erasable Programmable ROM), and a RAM (Random Access Memory). The memory 1002 may also be called a register, a cache, a main memory (main storage device), or the like. The memory 1002 can store executable programs (program code), software modules, and the like for implementing the speech recognition method according to an embodiment of the present disclosure.
 ストレージ1003は、コンピュータ読み取り可能な記録媒体であり、例えば、CD-ROM(Compact Disc ROM)などの光ディスク、ハードディスクドライブ、フレキシブルディスク、光磁気ディスク(例えば、コンパクトディスク、デジタル多用途ディスク、Blu-ray(登録商標)ディスク)、スマートカード、フラッシュメモリ(例えば、カード、スティック、キードライブ)、フロッピー(登録商標)ディスク、磁気ストリップなどの少なくとも1つによって構成されてもよい。ストレージ1003は、補助記憶装置と呼ばれてもよい。上述の記憶媒体は、例えば、メモリ1002およびストレージ1003の少なくとも一方を含むデータベース、サーバその他の適切な媒体であってもよい。 The storage 1003 is a computer-readable recording medium, for example, an optical disc such as a CD-ROM (Compact Disc ROM), a hard disk drive, a flexible disc, a magneto-optical disc (for example, a compact disc, a digital versatile disc, a Blu-ray disk), smart card, flash memory (eg, card, stick, key drive), floppy disk, magnetic strip, and/or the like. Storage 1003 may also be called an auxiliary storage device. The storage medium described above may be, for example, a database, server, or other suitable medium including at least one of memory 1002 and storage 1003 .
 通信装置1004は、有線ネットワークおよび無線ネットワークの少なくとも一方を介してコンピュータ間の通信を行うためのハードウェア(送受信デバイス)であり、例えばネットワークデバイス、ネットワークコントローラ、ネットワークカード、通信モジュールなどともいう。通信装置1004は、例えば周波数分割複信(FDD:Frequency Division Duplex)および時分割複信(TDD:Time Division Duplex)の少なくとも一方を実現するために、高周波スイッチ、デュプレクサ、フィルタ、周波数シンセサイザなどを含んで構成されてもよい。例えば、上述の音声取得部101などは、通信装置1004によって実現されてもよい。音声取得部101は、送信部と受信部とで、物理的に、または論理的に分離された実装がなされてもよい。 The communication device 1004 is hardware (a transmitting/receiving device) for performing communication between computers via at least one of a wired network and a wireless network, and is also called, for example, a network device, a network controller, a network card, or a communication module. The communication device 1004 may include a high-frequency switch, a duplexer, a filter, a frequency synthesizer, and the like in order to realize, for example, at least one of frequency division duplex (FDD) and time division duplex (TDD). For example, the voice acquisition unit 101 described above may be realized by the communication device 1004. The voice acquisition unit 101 may be implemented with a transmitting unit and a receiving unit that are physically or logically separated from each other.
 入力装置1005は、外部からの入力を受け付ける入力デバイス(例えば、キーボード、マウス、マイクロフォン、スイッチ、ボタン、センサなど)である。出力装置1006は、外部への出力を実施する出力デバイス(例えば、ディスプレイ、スピーカー、LEDランプなど)である。なお、入力装置1005および出力装置1006は、一体となった構成(例えば、タッチパネル)であってもよい。 The input device 1005 is an input device (for example, keyboard, mouse, microphone, switch, button, sensor, etc.) that receives input from the outside. The output device 1006 is an output device (eg, display, speaker, LED lamp, etc.) that outputs to the outside. Note that the input device 1005 and the output device 1006 may be integrated (for example, a touch panel).
 また、プロセッサ1001、メモリ1002などの各装置は、情報を通信するためのバス1007によって接続される。バス1007は、単一のバスを用いて構成されてもよいし、装置間ごとに異なるバスを用いて構成されてもよい。 Each device such as the processor 1001 and the memory 1002 is connected by a bus 1007 for communicating information. The bus 1007 may be configured using a single bus, or may be configured using different buses between devices.
 また、音声認識装置100は、マイクロプロセッサ、デジタル信号プロセッサ(DSP:Digital Signal Processor)、ASIC(Application Specific Integrated Circuit)、PLD(Programmable Logic Device)、FPGA(Field Programmable Gate Array)などのハードウェアを含んで構成されてもよく、当該ハードウェアにより、各機能ブロックの一部または全てが実現されてもよい。例えば、プロセッサ1001は、これらのハードウェアの少なくとも1つを用いて実装されてもよい。 The speech recognition device 100 also includes hardware such as a microprocessor, a digital signal processor (DSP), an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), and an FPGA (Field Programmable Gate Array). , and part or all of each functional block may be implemented by the hardware. For example, processor 1001 may be implemented using at least one of these pieces of hardware.
 情報の通知は、本開示において説明した態様/実施形態に限られず、他の方法を用いて行われてもよい。例えば、情報の通知は、物理レイヤシグナリング(例えば、DCI(Downlink Control Information)、UCI(Uplink Control Information))、上位レイヤシグナリング(例えば、RRC(Radio Resource Control)シグナリング、MAC(Medium Access Control)シグナリング、報知情報(MIB(Master Information Block)、SIB(System Information Block)))、その他の信号またはこれらの組み合わせによって実施されてもよい。また、RRCシグナリングは、RRCメッセージと呼ばれてもよく、例えば、RRC接続セットアップ(RRC Connection Setup)メッセージ、RRC接続再構成(RRC Connection Reconfiguration)メッセージなどであってもよい。 Notification of information is not limited to the aspects/embodiments described in the present disclosure, and may be performed using other methods. For example, notification of information includes physical layer signaling (e.g. DCI (Downlink Control Information), UCI (Uplink Control Information)), upper layer signaling (e.g. RRC (Radio Resource Control) signaling, MAC (Medium Access Control) signaling, It may be implemented by broadcast information (MIB (Master Information Block), SIB (System Information Block))), other signals, or a combination thereof. RRC signaling may also be called an RRC message, and may be, for example, an RRC connection setup message, an RRC connection reconfiguration message, or the like.
 本開示において説明した各態様/実施形態の処理手順、シーケンス、フローチャートなどは、矛盾の無い限り、順序を入れ替えてもよい。例えば、本開示において説明した方法については、例示的な順序を用いて様々なステップの要素を提示しており、提示した特定の順序に限定されない。 The order of the processing procedures, sequences, flowcharts, etc. of each aspect/embodiment described in the present disclosure may be changed as long as there is no contradiction. For example, the methods described in this disclosure present elements of the various steps using a sample order, and are not limited to the specific order presented.
 入出力された情報等は特定の場所(例えば、メモリ)に保存されてもよいし、管理テーブルを用いて管理してもよい。入出力される情報等は、上書き、更新、または追記され得る。出力された情報等は削除されてもよい。入力された情報等は他の装置へ送信されてもよい。 Input/output information may be stored in a specific location (for example, memory) or managed using a management table. Input/output information and the like may be overwritten, updated, or appended. The output information and the like may be deleted. The entered information and the like may be transmitted to another device.
 判定は、1ビットで表される値(0か1か)によって行われてもよいし、真偽値(Boolean:trueまたはfalse)によって行われてもよいし、数値の比較(例えば、所定の値との比較)によって行われてもよい。 The determination may be made by a value represented by one bit (0 or 1), by a true/false value (Boolean: true or false), or by numerical comparison (for example, a predetermined value).
 本開示において説明した各態様/実施形態は単独で用いてもよいし、組み合わせて用いてもよいし、実行に伴って切り替えて用いてもよい。また、所定の情報の通知(例えば、「Xであること」の通知)は、明示的に行うものに限られず、暗黙的(例えば、当該所定の情報の通知を行わない)ことによって行われてもよい。 Each aspect/embodiment described in the present disclosure may be used alone, may be used in combination, or may be switched in accordance with execution. In addition, notification of predetermined information (for example, notification of "being X") is not limited to explicit notification, and may be performed implicitly (for example, by not performing notification of the predetermined information).
 以上、本開示について詳細に説明したが、当業者にとっては、本開示が本開示中に説明した実施形態に限定されるものではないということは明らかである。本開示は、請求の範囲の記載により定まる本開示の趣旨および範囲を逸脱することなく修正および変更態様として実施することができる。したがって、本開示の記載は、例示説明を目的とするものであり、本開示に対して何ら制限的な意味を有するものではない。 Although the present disclosure has been described in detail above, it is clear to those skilled in the art that the present disclosure is not limited to the embodiments described in the present disclosure. The present disclosure can be practiced with modifications and variations without departing from the spirit and scope of the disclosure as defined by the claims. Accordingly, the description of the present disclosure is for illustrative purposes and is not meant to be limiting in any way.
 ソフトウェアは、ソフトウェア、ファームウェア、ミドルウェア、マイクロコード、ハードウェア記述言語と呼ばれるか、他の名称で呼ばれるかを問わず、命令、命令セット、コード、コードセグメント、プログラムコード、プログラム、サブプログラム、ソフトウェアモジュール、アプリケーション、ソフトウェアアプリケーション、ソフトウェアパッケージ、ルーチン、サブルーチン、オブジェクト、実行可能ファイル、実行スレッド、手順、機能などを意味するよう広く解釈されるべきである。 Software should be interpreted broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executable files, threads of execution, procedures, functions, and the like, regardless of whether it is called software, firmware, middleware, microcode, hardware description language, or by any other name.
 また、ソフトウェア、命令、情報などは、伝送媒体を介して送受信されてもよい。例えば、ソフトウェアが、有線技術(同軸ケーブル、光ファイバケーブル、ツイストペア、デジタル加入者回線(DSL:Digital Subscriber Line)など)および無線技術(赤外線、マイクロ波など)の少なくとも一方を使用してウェブサイト、サーバ、または他のリモートソースから送信される場合、これらの有線技術および無線技術の少なくとも一方は、伝送媒体の定義内に含まれる。 In addition, software, instructions, information, and the like may be transmitted and received via a transmission medium. For example, when software is transmitted from a website, a server, or another remote source using at least one of wired technology (coaxial cable, optical fiber cable, twisted pair, digital subscriber line (DSL), and the like) and wireless technology (infrared, microwave, and the like), at least one of these wired and wireless technologies is included within the definition of a transmission medium.
 本開示において説明した情報、信号などは、様々な異なる技術のいずれかを使用して表されてもよい。例えば、上記の説明全体に渡って言及され得るデータ、命令、コマンド、情報、信号、ビット、シンボル、チップなどは、電圧、電流、電磁波、磁界若しくは磁性粒子、光場若しくは光子、またはこれらの任意の組み合わせによって表されてもよい。 The information, signals, and the like described in the present disclosure may be represented using any of a variety of different technologies. For example, data, instructions, commands, information, signals, bits, symbols, chips, and the like that may be mentioned throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, optical fields or photons, or any combination thereof.
 なお、本開示において説明した用語および本開示の理解に必要な用語については、同一のまたは類似する意味を有する用語と置き換えてもよい。 The terms explained in the present disclosure and the terms necessary for understanding the present disclosure may be replaced with terms having the same or similar meanings.
 本開示で使用する「判断(determining)」、「決定(determining)」という用語は、多種多様な動作を包含する場合がある。「判断」、「決定」は、例えば、判定(judging)、計算(calculating)、算出(computing)、処理(processing)、導出(deriving)、調査(investigating)、探索(looking up、search、inquiry)(例えば、テーブル、データベースまたは別のデータ構造での探索)、確認(ascertaining)した事を「判断」「決定」したとみなす事などを含み得る。また、「判断」、「決定」は、受信(receiving)(例えば、情報を受信すること)、送信(transmitting)(例えば、情報を送信すること)、入力(input)、出力(output)、アクセス(accessing)(例えば、メモリ中のデータにアクセスすること)した事を「判断」「決定」したとみなす事などを含み得る。また、「判断」、「決定」は、解決(resolving)、選択(selecting)、選定(choosing)、確立(establishing)、比較(comparing)などした事を「判断」「決定」したとみなす事を含み得る。つまり、「判断」「決定」は、何らかの動作を「判断」「決定」したとみなす事を含み得る。また、「判断(決定)」は、「想定する(assuming)」、「期待する(expecting)」、「みなす(considering)」などで読み替えられてもよい。 The terms "determining" and "determining" used in this disclosure may encompass a wide variety of actions. "Judgement" and "determination" are, for example, judging, calculating, computing, processing, deriving, investigating, looking up, searching, inquiring (eg, lookup in a table, database, or other data structure); Also, "judgment" and "determination" are used for receiving (e.g., receiving information), transmitting (e.g., transmitting information), input, output, access (accessing) (for example, accessing data in memory) may include deeming that a "judgment" or "decision" has been made. In addition, "judgment" and "decision" are considered to be "judgment" and "decision" by resolving, selecting, choosing, establishing, comparing, etc. can contain. In other words, "judgment" and "decision" may include considering that some action is "judgment" and "decision". Also, "judgment (decision)" may be read as "assuming", "expecting", "considering", or the like.
 「接続された(connected)」、「結合された(coupled)」という用語、またはこれらのあらゆる変形は、2またはそれ以上の要素間の直接的または間接的なあらゆる接続または結合を意味し、互いに「接続」または「結合」された2つの要素間に1またはそれ以上の中間要素が存在することを含むことができる。要素間の結合または接続は、物理的なものであっても、論理的なものであっても、或いはこれらの組み合わせであってもよい。例えば、「接続」は「アクセス」で読み替えられてもよい。本開示で使用する場合、2つの要素は、1またはそれ以上の電線、ケーブルおよびプリント電気接続の少なくとも一つを用いて、並びにいくつかの非限定的かつ非包括的な例として、無線周波数領域、マイクロ波領域および光(可視および不可視の両方)領域の波長を有する電磁エネルギーなどを用いて、互いに「接続」または「結合」されると考えることができる。 The terms "connected," "coupled," or any variation thereof mean any direct or indirect connection or coupling between two or more elements, It can include the presence of one or more intermediate elements between two elements being "connected" or "coupled." Couplings or connections between elements may be physical, logical, or a combination thereof. For example, "connection" may be read as "access". As used in this disclosure, two elements are defined using at least one of one or more wires, cables, and printed electrical connections and, as some non-limiting and non-exhaustive examples, in the radio frequency domain. , electromagnetic energy having wavelengths in the microwave and light (both visible and invisible) regions, and the like.
 本開示において使用する「に基づいて」という記載は、別段に明記されていない限り、「のみに基づいて」を意味しない。言い換えれば、「に基づいて」という記載は、「のみに基づいて」と「に少なくとも基づいて」の両方を意味する。 The term "based on" as used in this disclosure does not mean "based only on" unless otherwise specified. In other words, the phrase "based on" means both "based only on" and "based at least on."
 本開示において使用する「第1の」、「第2の」などの呼称を使用した要素へのいかなる参照も、それらの要素の量または順序を全般的に限定しない。これらの呼称は、2つ以上の要素間を区別する便利な方法として本開示において使用され得る。したがって、第1および第2の要素への参照は、2つの要素のみが採用され得ること、または何らかの形で第1の要素が第2の要素に先行しなければならないことを意味しない。 Any reference to elements using the "first," "second," etc. designations used in this disclosure does not generally limit the quantity or order of those elements. These designations may be used in this disclosure as a convenient method of distinguishing between two or more elements. Thus, reference to a first and second element does not imply that only two elements can be employed or that the first element must precede the second element in any way.
 本開示において、「含む(include)」、「含んでいる(including)」およびそれらの変形が使用されている場合、これらの用語は、用語「備える(comprising)」と同様に、包括的であることが意図される。さらに、本開示において使用されている用語「または(or)」は、排他的論理和ではないことが意図される。 Where "include," "including," and variations thereof are used in this disclosure, these terms are inclusive, as is the term "comprising." is intended. Furthermore, the term "or" as used in this disclosure is not intended to be an exclusive OR.
 本開示において、例えば、英語でのa, anおよびtheのように、翻訳により冠詞が追加された場合、本開示は、これらの冠詞の後に続く名詞が複数形であることを含んでもよい。 In this disclosure, when articles are added by translation, such as a, an and the in English, the disclosure may include that nouns following these articles are plural.
 本開示において、「AとBが異なる」という用語は、「AとBが互いに異なる」ことを意味してもよい。なお、当該用語は、「AとBがそれぞれCと異なる」ことを意味してもよい。「離れる」、「結合される」などの用語も、「異なる」と同様に解釈されてもよい。 In the present disclosure, the term "A and B are different" may mean "A and B are different from each other." The term may also mean that "A and B are different from C". Terms such as "separate," "coupled," etc. may also be interpreted in the same manner as "different."
100…音声認識装置、101…音声取得部、102…音声認識部、103…非言語音声認識部、104…スコア判定部、105…結果出力部、106…修正部、107…集計部。 DESCRIPTION OF SYMBOLS 100... Speech recognition apparatus, 101... Speech acquisition part, 102... Speech recognition part, 103... Non-language speech recognition part, 104... Score determination part, 105... Result output part, 106... Correction part, 107... Aggregation part.

Claims (10)

  1.  音情報を取得する音情報取得部と、
     前記音情報のうち第1の音情報を音声認識処理して認識結果を取得するとともに、前記第1の音情報に関連する第2の音情報が非言語によるイベントの発生であることを判断し、前記イベントを示す情報を前記認識結果に関連付けて出力する音情報処理部と、
    を備える音声認識装置。
    A speech recognition device comprising:
    a sound information acquisition unit that acquires sound information; and
    a sound information processing unit that performs speech recognition processing on first sound information among the sound information to acquire a recognition result, determines that second sound information related to the first sound information is the occurrence of a non-verbal event, and outputs information indicating the event in association with the recognition result.
  2.  前記音情報処理部は、
     前記認識結果に前記イベントを示す付加情報を付加する、
    請求項1に記載の音声認識装置。
    The speech recognition device according to claim 1, wherein
    the sound information processing unit adds additional information indicating the event to the recognition result.
  3.  前記音情報処理部は、
     話者ごとに前記認識結果および前記イベントを示す情報を出力するとともに、
     一の話者による第1の音情報の認識結果に、異なる話者による第2の音情報のイベントを示す付加情報を付加する際、前記付加情報を、前記認識結果に対応付けて出力する、
    請求項2に記載の音声認識装置。
    The speech recognition device according to claim 2, wherein
    the sound information processing unit outputs the recognition result and the information indicating the event for each speaker, and
    when adding, to the recognition result of first sound information uttered by one speaker, additional information indicating an event of second sound information produced by a different speaker, outputs the additional information in association with that recognition result.
  4.  前記音情報処理部は、
     前記認識結果を前記イベントに従って加工する、
    請求項1~3のいずれか一項に記載の音声認識装置。
    The speech recognition device according to any one of claims 1 to 3, wherein
    the sound information processing unit processes the recognition result according to the event.
  5.  前記音情報処理部は、前記イベントの発生頻度が所定条件を満たす場合に、当該イベントを示す情報を前記認識結果に関連付けて出力する、
    請求項1~4のいずれか一項に記載の音声認識装置。
    When the occurrence frequency of the event satisfies a predetermined condition, the sound information processing unit outputs information indicating the event in association with the recognition result.
    A speech recognition device according to any one of claims 1 to 4.
  6.  前記音情報処理部は、
     前記第1の音情報を含む所定の条件を満たした音情報群に対する認識結果に、前記イベントを示す情報を関連付ける、
    請求項1~5のいずれか一項に記載の音声認識装置。
    The speech recognition device according to any one of claims 1 to 5, wherein
    the sound information processing unit associates the information indicating the event with a recognition result for a sound information group that includes the first sound information and satisfies a predetermined condition.
  7.  前記所定の条件とは、前記音情報群が、文単位、段落単位、または話題単位に区分されていることである、
     請求項6に記載の音声認識装置。
    The predetermined condition is that the sound information group is divided into sentence units, paragraph units, or topic units.
    7. The speech recognition device according to claim 6.
  8.  前記音情報処理部は、前記第1の音情報の話者と異なる話者によるイベントが発生した場合に、当該イベントを示す情報を、前記音情報群に対する認識結果の末尾に付加する、
    請求項6または7に記載の音声認識装置。
    When an event by a speaker different from the speaker of the first sound information occurs, the sound information processing unit adds information indicating the event to the end of the recognition result for the sound information group.
    8. The speech recognition device according to claim 6 or 7.
  9.  前記認識結果または前記イベントを示す情報を修正する修正部、
    をさらに備え、
     前記音情報処理部は、
     前記第1の音情報および前記第2の音情報を音声認識処理してそれぞれの認識結果を取得し、前記第1の音情報および前記第2の音情報を非言語音声認識処理してそれぞれのイベント発生を判断し、
     前記修正部は、前記第1の音情報および前記第2の音情報のそれぞれの前記認識結果または前記イベントを示す情報を用いて修正する、
    請求項1~8のいずれか一項に記載の音声認識装置。
    The speech recognition device according to any one of claims 1 to 8, further comprising
    a correction unit that corrects the recognition result or the information indicating the event, wherein
    the sound information processing unit performs speech recognition processing on the first sound information and the second sound information to acquire respective recognition results, and performs non-verbal speech recognition processing on the first sound information and the second sound information to determine the occurrence of respective events, and
    the correction unit performs the correction using the recognition result or the information indicating the event of each of the first sound information and the second sound information.
  10.  前記音情報処理部は、
     前記第2の音情報に対して音声認識処理をして認識結果を取得し、
     前記イベントを示す情報とともに、前記第2の音情報の認識結果を出力し、
     前記イベントを示す情報の提示と、前記第2の音情報の認識結果を提示との切替を可能とする出力制御を行う、
    請求項1~9のいずれか一項に記載の音声認識装置。
     
    The speech recognition device according to any one of claims 1 to 9, wherein the sound information processing unit
    performs speech recognition processing on the second sound information to acquire a recognition result,
    outputs the recognition result of the second sound information together with the information indicating the event, and
    performs output control that enables switching between presentation of the information indicating the event and presentation of the recognition result of the second sound information.
PCT/JP2022/014596 2021-06-01 2022-03-25 Sound recognition device WO2022254909A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2023525437A JPWO2022254909A1 (en) 2021-06-01 2022-03-25

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021-092495 2021-06-01
JP2021092495 2021-06-01

Publications (1)

Publication Number Publication Date
WO2022254909A1 true WO2022254909A1 (en) 2022-12-08

Family

ID=84323055

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/014596 WO2022254909A1 (en) 2021-06-01 2022-03-25 Sound recognition device

Country Status (2)

Country Link
JP (1) JPWO2022254909A1 (en)
WO (1) WO2022254909A1 (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004075168A1 (en) * 2003-02-19 2004-09-02 Matsushita Electric Industrial Co., Ltd. Speech recognition device and speech recognition method
JP2005065252A (en) * 2003-07-29 2005-03-10 Fuji Photo Film Co Ltd Cell phone
JP2012208630A (en) * 2011-03-29 2012-10-25 Mizuho Information & Research Institute Inc Speech management system, speech management method and speech management program
JP2015158582A (en) * 2014-02-24 2015-09-03 日本放送協会 Voice recognition device and program
JP2018045639A (en) * 2016-09-16 2018-03-22 株式会社東芝 Dialog log analyzer, dialog log analysis method, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
UENO, HIROSHI; INOUE, MASASHI: "A text dialogue system that gives individuality to the agreeable responses", IPSJ SIG TECHNICAL REPORTS, INFORMATION PROCESSING SOCIETY OF JAPAN, JP, vol. 2015-NL-221, no. 10, 30 April 2015 (2015-04-30), JP , pages 1 - 9, XP009541669, ISSN: 2188-8779 *

Also Published As

Publication number Publication date
JPWO2022254909A1 (en) 2022-12-08

Similar Documents

Publication Publication Date Title
CN108962282B (en) Voice detection analysis method and device, computer equipment and storage medium
US11450311B2 (en) System and methods for accent and dialect modification
CN108962255B (en) Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
US8386265B2 (en) Language translation with emotion metadata
KR20190046623A (en) Dialog system with self-learning natural language understanding
US10839788B2 (en) Systems and methods for selecting accent and dialect based on context
US11688416B2 (en) Method and system for speech emotion recognition
KR20170047268A (en) Orphaned utterance detection system and method
CN107578770B (en) Voice recognition method and device for network telephone, computer equipment and storage medium
US11093110B1 (en) Messaging feedback mechanism
WO2018093692A1 (en) Contextual dictionary for transcription
US20210232776A1 (en) Method for recording and outputting conversion between multiple parties using speech recognition technology, and device therefor
US10282599B2 (en) Video sentiment analysis tool for video messaging
US11683283B2 (en) Method for electronic messaging
CN111414772A (en) Machine translation method, device and medium
US20220414132A1 (en) Subtitle rendering based on the reading pace
US20210065708A1 (en) Information processing apparatus, information processing system, information processing method, and program
US10468031B2 (en) Diarization driven by meta-information identified in discussion content
WO2022254909A1 (en) Sound recognition device
US20190384466A1 (en) Linking comments to segments of a media presentation
WO2022254912A1 (en) Speech recognition device
US11416530B1 (en) Subtitle rendering based on the reading pace
CN112509570B (en) Voice signal processing method and device, electronic equipment and storage medium
WO2019235100A1 (en) Interactive device
JP2021082125A (en) Dialogue device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22815666

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023525437

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE