CN110914897B - Speech recognition system and speech recognition device

Publication number
CN110914897B
Authority
CN
China
Prior art keywords
information
data
unit
character string
speech recognition
Prior art date
Legal status
Active
Application number
CN201980000774.5A
Other languages
Chinese (zh)
Other versions
CN110914897A (en)
Inventor
菊田敦
越田高广
Current Assignee
Lingyang Electronics Co ltd
Original Assignee
Lingyang Electronics Co ltd
Priority date
Filing date
Publication date
Application filed by Lingyang Electronics Co ltd filed Critical Lingyang Electronics Co ltd
Publication of CN110914897A publication Critical patent/CN110914897A/en
Application granted granted Critical
Publication of CN110914897B publication Critical patent/CN110914897B/en
Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling


Abstract

Provided are a speech recognition system and a speech recognition device capable of improving recognition accuracy. The speech recognition system comprises: an acquisition unit that acquires at least one piece of speech data; an extraction unit that extracts a silence start section and a silence end section included in the speech data, and extracts the arrangement of phonemes and pause sections sandwiched between the silence start section and the silence end section as recognition target data; a detection unit that refers to a character string database to select the phoneme information corresponding to the arrangement included in the recognition target data, and detects a plurality of pieces of character string information and the category IDs associated with the selected phoneme information as candidate data; a calculation unit that refers to a grammar database to generate sentences in which a plurality of candidate data are combined based on the grammar information, and calculates a reliability corresponding to each piece of candidate data included in a sentence; a selection unit that selects evaluation data from the plurality of candidate data according to the reliability; and a generation unit that generates recognition information from the evaluation data.

Description

Speech recognition system and speech recognition device
Technical Field
The present invention relates to a speech recognition system and a speech recognition apparatus.
Background
Conventionally, technologies related to speech recognition have been proposed, such as the cognitive function evaluation device of patent document 1 and the system for grasping utterance contents of patent document 2.
In the cognitive function evaluation device of patent document 1, a phoneme analysis unit receives target data showing temporal variations in the instantaneous sound pressure of a specific phoneme included in the speech of a subject over a target period. The phoneme analysis unit then divides the target period into a plurality of frames and obtains the frequency of the specific phoneme for each of two or more target frames. A feature analysis unit obtains a feature amount from the frequencies of the specific phoneme obtained for the target frames, and an evaluation unit evaluates the cognitive function of the subject based on the feature amount.
Patent document 2 discloses a system for grasping utterance contents based on core words extracted from recorded speech data, along with an indexing method and a method for grasping utterance contents that use the system. Phoneme-based speech recognition is performed on the recorded speech data, the indexed data are stored, and the utterance contents are grasped from the core words using those data, so that the utterance contents can be grasped accurately, easily, and quickly.
Documents of the prior art
Patent document
Patent document 1: japanese laid-open patent publication No. 2018-50847
Patent document 2: japanese laid-open patent publication No. 2015-539364
Disclosure of Invention
Problems to be solved by the invention
Here, technologies related to speech recognition are expected to find application in various fields, but improving recognition accuracy remains a challenge. Methods using phonemes have attracted attention as a way to improve recognition accuracy, yet the accuracy still suffers because the phoneme arrangement obtained from speech data varies.
In this regard, patent document 1 improves accuracy by obtaining a feature amount for the frequency of a specific phoneme based on the speech of a subject and evaluating the subject's cognitive function based on that feature amount. However, the technique disclosed in patent document 1 cannot recognize the content of the speech uttered by the subject.
Further, patent document 2 discloses a technique for grasping utterance contents based on core words. However, with the technique disclosed in patent document 2, recognition accuracy may deteriorate when the utterance contains core words with similar phonemes. Under such circumstances, a speech recognition technique capable of improving recognition accuracy is desired.
The present invention has been made in view of the above problems, and an object of the present invention is to provide a speech recognition system and a speech recognition apparatus capable of improving recognition accuracy.
Means for solving the problems
The speech recognition system according to claim 1 is characterized by comprising: an acquisition unit that acquires at least one piece of speech data; an extraction unit that extracts a silence start section and a silence end section included in the speech data, and extracts the arrangement of phonemes and pause sections sandwiched between the silence start section and the silence end section as recognition target data; a character string database that stores character string information acquired in advance, phoneme information associated with the character string information, and a category ID assigned to the character string information; a detection unit that refers to the character string database to select the phoneme information corresponding to the arrangement included in the recognition target data, and detects a plurality of pieces of character string information and the category IDs associated with the selected phoneme information as candidate data; a grammar database that stores grammar information, acquired in advance, indicating arrangement orders of the category IDs; a calculation unit that refers to the grammar database to generate sentences in which a plurality of candidate data are combined based on the grammar information, and calculates a reliability corresponding to each piece of candidate data included in a sentence; a selection unit that selects evaluation data from the plurality of candidate data according to the reliability; and a generation unit that generates recognition information from the evaluation data.
The speech recognition system according to claim 2 is characterized in that, in claim 1, the extraction unit extracts, from one piece of speech data, a plurality of pieces of recognition target data that differ in the arrangement of phonemes and pause sections.
The speech recognition system according to claim 3 is characterized in that in claim 1 or 2, the calculation means generates a plurality of the sentences, and at least one of the types and combinations of the candidate data is different for each of the plurality of the sentences.
A speech recognition system according to claim 4 is the speech recognition system according to any one of claims 1 to 3, further comprising a reference database that stores the character string information, reference sentences composed of combinations of the character string information, and a threshold value assigned to each piece of character string information, wherein the generation unit comprises: a specification unit that refers to the reference database to specify, among the reference sentences, a 1st reference sentence corresponding to the evaluation data; and a comparison unit that compares the reliability corresponding to the evaluation data with a 1st threshold value assigned to 1st character string information included in the 1st reference sentence, and the generation unit generates the recognition information based on the comparison result of the comparison unit.
The speech recognition system according to claim 5 is characterized in that, in claim 4, the speech recognition system further includes an updating unit that updates the threshold values stored in the reference database based on a plurality of candidate data and a plurality of reliabilities.
The speech recognition system according to claim 6 is characterized in that, in claim 4 or 5, the speech recognition system further includes a reflection unit that acquires an evaluation result of a user who evaluates the recognition information and reflects the evaluation result in the threshold values of the reference database.
The speech recognition system according to claim 7 is characterized in that, in any one of claims 1 to 6, the acquisition means acquires condition information indicating a condition for generating the speech data.
The speech recognition system according to claim 8 is characterized in that, in claim 7, the detection unit filters the contents of the character string database to be referred to, based on the condition information.
The speech recognition system according to claim 9 is characterized in that, in any one of claims 1 to 8, the speech recognition system further includes an output unit that outputs the recognition information including information for controlling a traveling speed of the vehicle.
The speech recognition system according to claim 10 is characterized in that, in any one of claims 1 to 9, the pause section includes at least one of a breath sound and a lip sound.
The speech recognition system according to claim 11 is characterized in that, in any one of claims 1 to 10, the character string information includes languages of two or more countries.
The speech recognition apparatus according to claim 12 is characterized by comprising: an acquisition unit that acquires at least one piece of speech data; an extraction unit that extracts a silence start section and a silence end section included in the speech data, and extracts the arrangement of phonemes and pause sections sandwiched between the silence start section and the silence end section as recognition target data; a character string database that stores character string information acquired in advance, phoneme information associated with the character string information, and a category ID assigned to the character string information; a detection unit that refers to the character string database to select the phoneme information corresponding to the arrangement included in the recognition target data, and detects a plurality of pieces of character string information and the category IDs associated with the selected phoneme information as candidate data; a grammar database that stores grammar information, acquired in advance, indicating arrangement orders of the category IDs; a calculation unit that refers to the grammar database to generate sentences in which a plurality of candidate data are combined based on the grammar information, and calculates a reliability corresponding to each piece of candidate data included in a sentence; a selection unit that selects evaluation data from the plurality of candidate data according to the reliability; and a generation unit that generates recognition information from the evaluation data.
Effects of the invention
According to the 1st to 11th inventions, the extraction unit extracts the arrangement of phonemes and pause sections as the recognition target data, and the detection unit selects the phoneme information corresponding to that arrangement and detects candidate data. Erroneous recognition can therefore be reduced compared with the case where candidate data are detected by considering only the phonemes, without their arrangement, in the recognition target data. This can improve the recognition accuracy.
Further, according to the 1st to 11th inventions, the character string database stores phoneme information corresponding to arrangements of phonemes and pause sections, and character string information associated with that phoneme information. Compared with storing data for pattern matching against entire phoneme sequences, this reduces the data capacity and simplifies data accumulation.
In particular, according to the 2nd invention, the extraction unit extracts a plurality of pieces of recognition target data from one piece of speech data. Therefore, even when speech data whose arrangement of phonemes and pause sections varies is acquired, a drop in recognition accuracy can be suppressed. This can further improve the recognition accuracy.
In particular, according to the 3rd invention, the calculation unit generates a plurality of sentences. That is, even when the candidate data can be combined in a plurality of patterns, sentences corresponding to all the patterns can be generated. Therefore, erroneous recognition can be reduced compared with, for example, search methods such as pattern matching. This can further improve the recognition accuracy.
In particular, according to the 4th invention, the comparison unit compares the reliability with the 1st threshold. By also applying a threshold-based judgment to the evaluation data, which are selected only relatively from the plurality of candidate data, erroneous recognitions can be further reduced. This can further improve the recognition accuracy.
In particular, according to the 5th invention, the updating unit updates the threshold based on the candidate data and the reliabilities. Therefore, compared with always using a predetermined threshold, recognition information matched to the quality of the acquired speech data can be generated. This can expand the range of usable environments.
In particular, according to the 6th invention, the reflection unit reflects the evaluation result in the threshold. Therefore, when the recognition information deviates from what the user recognizes, it can easily be improved. This enables continuous improvement of the recognition accuracy.
In particular, according to the 7th invention, the acquisition unit acquires the condition information. That is, the acquisition unit acquires, as condition information, various conditions such as the surrounding environment at the time the speech data is acquired, noise contained in the speech data, and the type of sound collecting device that picks up the speech. Each unit and each database can therefore be configured in accordance with the condition information. This can improve the recognition accuracy regardless of the environment of use and the like.
In particular, according to the 8 th invention, the detecting unit screens the contents of the character string database to be referred to, based on the condition information. Therefore, by storing different character string information and the like for each condition information in the character string database, it is possible to detect candidate data suitable for each condition information. This can improve the recognition accuracy for each piece of condition information.
In particular, according to the 9th invention, the output unit outputs the recognition information. That is, with the improved recognition accuracy, the invention can be used for driving assistance of a user. This enables application to a wide range of uses.
In particular, according to the 10th invention, the pause section includes at least one of a breath sound and a lip sound. Therefore, even differences between pieces of speech data that are difficult to judge from phonemes alone can be easily judged, and the recognition target data can be extracted. This can further improve the recognition accuracy.
According to the 12th invention, the extraction unit extracts the arrangement of phonemes and pause sections as the recognition target data, and the detection unit selects the phoneme information corresponding to that arrangement and detects candidate data. Erroneous recognition can therefore be reduced compared with the case where candidate data are detected by considering only the phonemes, without their arrangement, in the recognition target data. This can improve the recognition accuracy.
Further, according to the 12th invention, the character string database stores phoneme information corresponding to arrangements of phonemes and pause sections, and character string information associated with that phoneme information. Compared with storing data for pattern matching against entire phoneme sequences, this reduces the data capacity and simplifies data accumulation.
Drawings
Fig. 1 is a schematic diagram showing an example of the configuration of the speech recognition system according to the present embodiment.
Fig. 2 (a) is a schematic diagram showing an example of the configuration of the speech recognition device of the present embodiment, fig. 2 (b) is a schematic diagram showing an example of the function of the speech recognition device of the present embodiment, and fig. 2 (c) is a schematic diagram showing an example of the generation unit of the present embodiment.
Fig. 3 is a schematic diagram showing an example of each function of the voice recognition apparatus according to the present embodiment.
Fig. 4 is a diagram showing an example of a character string database, a grammar database, and a reference database.
Fig. 5 (a) is a flowchart showing an example of the operation of the speech recognition system according to the present embodiment, fig. 5 (b) is a flowchart showing an example of the generation means, and fig. 5 (c) is a flowchart showing an example of the reflection means.
Fig. 6 is a schematic diagram showing an example of the update unit.
Fig. 7 (a) is a flowchart showing an example of the updating means, and fig. 7 (b) is a flowchart showing an example of the setting means.
Fig. 8 is a schematic diagram showing an example of condition information.
Fig. 9 is a schematic diagram showing a modification of the reference database.
Detailed Description
Hereinafter, an example of a speech recognition system and a speech recognition apparatus according to an embodiment of the present invention will be described with reference to the drawings.
(Structure of the Speech recognition System 100)
An example of the configuration of the speech recognition system 100 according to the present embodiment will be described with reference to fig. 1 to 4. Fig. 1 is a schematic diagram showing the overall configuration of a speech recognition system 100 according to the present embodiment.
The speech recognition system 100 generates recognition information corresponding to the user's speech by referring to a character string database and a grammar database that are constructed according to the user's application. The character string database stores character strings (character string information) that the user is expected to speak and the phonemes (phoneme information) corresponding to those character strings. By storing character strings and phonemes suited to the application, recognition information appropriate for that application can be generated, and the range of applications can be extended.
In particular, the inventors have found that classifying the phoneme arrangements (phoneme information) stored in the character string database according to the pause sections included in the speech significantly improves the accuracy of the recognition information for that speech.
The grammar database stores the grammar information necessary for generating sentences composed of character string information. The grammar information includes a plurality of entries indicating the arrangement order of the category IDs associated with each piece of character string information. By referring to the grammar database after detecting character string information based on the phoneme arrangements classified according to the pause sections, the pieces of character string information can easily be combined. This enables the generation of recognition information that takes the grammar of the speech into account. As a result, speech recognition that reflects the content of the speech uttered by the user or the like can be realized with high accuracy.
As shown in fig. 1, the speech recognition system 100 has a speech recognition apparatus 1. In the speech recognition system 100, for example, the speech of a user or the like is collected using the voice collecting device 2 or the like, and recognition information corresponding to the speech is generated using the speech recognition apparatus 1. The recognition information includes, for example, text data obtained by converting the speech into a character string, as well as information for controlling the control device 3 and the like and voice information for responding to the user.
In the speech recognition system 100, the speech recognition apparatus 1 may be connected to the speech collection apparatus 2 and the control apparatus 3 directly or via the public communication network 4, for example. The speech recognition apparatus 1 may be connected to a server 5 or a user terminal 6 owned by a user or the like via the public communication network 4, for example.
< Speech recognition apparatus 1>
Fig. 2 (a) is a schematic diagram showing an example of the configuration of the speech recognition apparatus 1. As the speech recognition apparatus 1, a single-board computer such as Raspberry Pi (registered trademark) or an electronic device such as a Personal Computer (PC) may be used. The speech recognition apparatus 1 includes a casing 10, a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, a storage unit 104, and I/Fs 105 to 107. These components 101 to 107 are connected by an internal bus 110.
The CPU 101 controls the entire voice recognition apparatus 1. The ROM 102 stores operation codes of the CPU 101. The RAM 103 is a work area used when the CPU 101 works. The storage unit 104 stores various information such as a character string database. As the storage unit 104, for example, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or the like can be used in addition to the SD memory card.
The I/F 105 is an interface for transmitting and receiving various information to and from the voice collecting device 2, the control device 3, the public communication network 4, and the like. The I/F 106 is an interface for transmitting and receiving various information to and from an input unit 108 connected according to the application. As the input unit 108, for example, a keyboard can be used, and a user or the like who manages the speech recognition system 100 can input or select various information or control commands for the speech recognition apparatus 1 via the input unit 108. The I/F 107 is an interface for transmitting and receiving various information to and from an output unit 109 connected according to the application. The output unit 109 outputs various information stored in the storage unit 104, the recognition information, the processing status of the speech recognition apparatus 1, and the like. As the output unit 109, a display such as a touch panel may be used; in that case, the output unit 109 may be configured to include the input unit 108. The same interface may be used for, for example, I/F 105 to I/F 107.
Fig. 2 (b) is a schematic diagram showing an example of the function of the speech recognition apparatus 1. The speech recognition device 1 includes an acquisition unit 11, an extraction unit 12, a storage unit 13, a detection unit 14, a calculation unit 15, a selection unit 16, a generation unit 17, and an output unit 18. The speech recognition apparatus 1 may have a reflection unit 19, for example. Each function shown in fig. 2 (b) is realized by the CPU 101 executing a program stored in the storage unit 104 or the like with the RAM 103 as a work area. Further, a part of each function may be realized by using a known speech recognition engine such as Julius or a known general-purpose programming language such as Python, for example, and may perform processing such as extraction and generation of various data. Further, a part of each function may be controlled by artificial intelligence. Here, "artificial intelligence" may be a technique based on any known artificial intelligence technique.
< acquisition section 11>
The acquisition unit 11 acquires at least one piece of speech data. The acquisition unit 11 acquires, as speech data, data obtained by subjecting a speech signal picked up by the voice collecting device 2 or the like to pulse modulation such as PCM (pulse code modulation). The acquisition unit 11 may acquire, for example, a plurality of pieces of speech data at a time, depending on the type of voice collecting device 2.
The acquisition unit 11 may acquire a plurality of pieces of voice data at the same time, for example. In this case, the voice recognition apparatus 1 may be connected to a plurality of voice collecting apparatuses 2, and may be connected to a voice collecting apparatus 2 capable of simultaneously collecting a plurality of voices. The acquisition unit 11 acquires various information (data) from the voice collecting apparatus 2 and the like via, for example, the I/F105 and the I/F106, in addition to the voice data.
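As an illustration of the data format handled here, the following is a minimal Python sketch (Python is mentioned later in this description as a usable general-purpose language) of reading PCM speech data. The WAV container is an assumption for illustration only; the description merely specifies pulse-modulated data such as PCM arriving from the voice collecting device 2.

    import wave

    def acquire_voice_data(path: str) -> bytes:
        """Sketch of the acquisition unit 11: return raw PCM samples.

        Reading from a WAV file is an assumption; in the system described
        here, the PCM data would arrive from the voice collecting device 2,
        possibly as several streams at once.
        """
        with wave.open(path, "rb") as wav:
            return wav.readframes(wav.getnframes())  # raw PCM bytes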
< extraction section 12>
The extraction unit 12 extracts a silence start section and a silence end section included in the speech data. The extraction unit 12 extracts, as the recognition target data, an arrangement of phonemes and pause intervals between the silence start interval and the silence end interval.
The extraction unit 12 extracts, for example, a silent state (silent section) lasting 100 milliseconds to 1 second as the silence start section or the silence end section. The extraction unit 12 assigns phonemes and pause sections to the section between them (the speech section), and extracts the resulting arrangement of phonemes and pause sections as the recognition target data.
The phonemes are ordinary phonemes, including vowels and consonants. A pause section indicates a section shorter than the silence start section and the silence end section, for example, a section of a length similar to that of a phoneme. The extraction unit 12 may set the length of the pause sections after determining the length of each phoneme or the length of the entire recognition target data, and extract the arrangement to which the phonemes and pause sections have been assigned as the recognition target data. That is, the extraction unit 12 may set the length of the pause sections in accordance with the length of the phonemes or the length of the entire recognition target data.
For example, as shown in fig. 3, the extraction unit 12 extracts a silence start section "silB" and a silence end section "silE", and extracts the arrangement "a/k/a/r/i/sp/w/o/sp/ts/u/k/e/t/e" in the speech section ("sp" indicates a pause section) as the recognition target data. The extraction unit 12 may extract a plurality of pieces of recognition target data with mutually different arrangements from one piece of speech data. In this case, speech recognition can be performed in consideration of the variation in how the extraction unit 12 assigns phonemes and pause sections. For example, by extracting between 1 and 5 pieces of recognition target data, the extraction unit 12 can keep the processing time down while improving the recognition accuracy. The extraction unit 12 may also extract, as the recognition target data, an arrangement that includes at least one of the silence start section and the silence end section.
The pause section may include, for example, at least one of a breath sound and a lip sound. That is, the extraction unit 12 may extract a breath sound or a lip sound included in a pause section as part of the recognition target data. In this case, by including breath sounds and lip sounds in the phoneme information stored in the character string database described later, recognition information with higher accuracy can be generated.
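The extraction step might be sketched as follows, under the assumption that an acoustic front end (for example, a speech recognition engine such as Julius) has already labelled each short frame. The frame labels "sil" and "sp" used here for silence and pause sections are hypothetical names, and the sketch assumes the data contains a speech section.

    from typing import List

    def extract_recognition_target(frames: List[str]) -> List[str]:
        """Sketch of the extraction unit 12.

        `frames` is an assumed per-frame labelling such as
        ["sil", "sil", "a", "k", "a", "r", "i", "sp", "w", "o", "sp",
         "ts", "u", "k", "e", "t", "e", "sil", "sil"], where the leading and
        trailing "sil" runs are the silence start/end sections and "sp"
        marks a pause section.
        """
        start = next(i for i, f in enumerate(frames) if f != "sil")
        stop = len(frames) - next(i for i, f in enumerate(reversed(frames))
                                  if f != "sil")
        # The arrangement of phonemes and pause sections sandwiched between
        # the silence start section and the silence end section.
        return frames[start:stop]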
< storage section 13, database >
The storage unit 13 stores various data in the storage unit 104, and retrieves various data from the storage unit 104. The storage unit 13 retrieves various databases stored in the storage unit 104 as necessary.
For example, as shown in fig. 4, the storage unit 104 stores a character string database and a grammar database, and may store a reference database, for example.
The character string database stores character string information acquired in advance, phoneme information associated with the character string information, and a type ID assigned to the character string information. The character string database is used when the detection unit 14 detects candidate data.
The phoneme information includes a plurality of phoneme arrangements that the user is expected to utter (for example, the 1st phoneme information "a/k/a/r/i"). In addition to arrangements delimited by pause sections, a phoneme arrangement such as "h/i/i/t/e" may itself include pause sections; this can be set arbitrarily according to the conditions of use. The phoneme information may also include at least one of the silence start section and the silence end section.
The character string information contains a character string associated with each phoneme arrangement (for example, the 1st character string information "明かり" (akari, "light"), corresponding to the 1st phoneme information). The character string information may therefore use character strings that have no meaning, in addition to meaningful expression elements such as words and morphemes. The character string information may include not only Japanese but also two or more languages, and may include character strings such as numerals and abbreviations used at the place of use. Different phoneme arrangements may also be associated with the same character string information.
The type ID is associated with the character string information and indicates an arrangement position (for example, 1 st type ID "1") where a word or the like of the character string information is located when the word or the like is supposed to be used based on the grammar. For example, when the grammar (sentence) of the voice can be expressed as "object" + "assist word" + "action", the type ID is "1" for the character string information of "object" as the voice, the type ID is "2" for the character string information of "assist word" as the voice, and the type ID is "3" for the character string information of "action" as the voice.
The syntax database stores syntax information indicating the arrangement order of a plurality of type IDs acquired in advance. The grammar database is used when calculating the reliability by the calculation unit 15. For example, in the case of using the 1 st grammatical information "1, 2, 3" as the grammatical information, a sentence indicating "object" + "auxiliary word" + "action" can be generated as a candidate of the voice. The syntax information includes a plurality of arrangement orders (arrangement order of category IDs) such as 1 st syntax information "1, 2, 3", 2 nd syntax information "4, 5, 6", and 3 rd syntax information "2, 1, 3", for example.
The reference database stores character string information acquired in advance, reference sentences composed of combinations of character strings, and the thresholds assigned to the respective pieces of character string information, and may also store, for example, the phoneme information associated with the character string information. The reference database is used when the generation unit 17 generates the recognition information as necessary. The data capacity can be reduced by making the character string information and phoneme information stored in the reference database identical to those stored in the character string database.
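To make the relationships concrete, here is a hedged sketch of the three databases using the example entries that appear in this description. The field layout and Python representation are assumptions, not the actual storage format.

    # (phoneme information, character string information, category ID)
    CHARACTER_STRING_DB = [
        ("a/k/a/r/i",    "明かり", 1),  # "light" (object)
        ("w/o",          "を",    2),  # object particle
        ("ts/u/k/e/t/e", "つけて", 3),  # "turn on" (action)
        ("h/i/i/t/e",    "ひいて", 3),  # an action with similar phonemes
    ]

    # Each grammar information entry is an arrangement order of category IDs.
    GRAMMAR_DB = [
        (1, 2, 3),  # 1st grammar information: object + particle + action
        (3, 1, 2),  # alternative order mentioned in the description
    ]

    # Reference database: threshold assigned to each character string
    # (values taken from the comparison example later in this description).
    REFERENCE_DB = {"明かり": 0.800, "を": 0.900, "つけて": 0.880}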
< detection section 14>
The detection unit 14 refers to the character string database and selects phoneme information corresponding to the arrangement of phonemes included in the recognition target data. Further, the detection unit 14 detects a plurality of character string information and category IDs associated with the selected phoneme information as candidate data.
For example, as shown in fig. 3, the detection unit 14 selects the phoneme information "a/k/a/r/i", "w/o", and "ts/u/k/e/t/e" corresponding to the recognition target data, and detects the character string information and category IDs "明かり/1", "を/2", and "つけて/3" associated with the respective phoneme information as candidate data. The number of candidate data increases with the number of pieces of recognition target data. Note that the phoneme arrangements may be classified in advance by dividing them at pause sections, or may be classified based on phoneme information that includes both phonemes and pause sections.
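Continuing the database sketch above, the detection step might look as follows. Exact matching of phoneme arrangements is an assumption; a real engine would score approximate matches so that similar arrangements (e.g. both "つけて" and "ひいて") surface as candidates.

    from typing import List, Tuple

    def detect_candidates(target: List[str]) -> List[Tuple[str, int]]:
        """Sketch of the detection unit 14: split the recognition target data
        at pause sections and look up each phoneme arrangement in the
        character string database sketched above."""
        chunks, current = [], []
        for token in target:
            if token == "sp":          # pause section: arrangement boundary
                chunks.append("/".join(current))
                current = []
            else:
                current.append(token)
        if current:
            chunks.append("/".join(current))
        return [(text, cid)
                for chunk in chunks
                for phonemes, text, cid in CHARACTER_STRING_DB
                if phonemes == chunk]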
< calculating section 15>
The calculation unit 15 refers to the grammar database and generates a sentence in which a plurality of candidate data are combined based on grammar information. Further, the calculation unit 15 calculates the reliability corresponding to each candidate data included in the sentence.
For example, as shown in fig. 3, the calculation unit 15 matches each category ID included in the 1st grammar information "1, 2, 3" against the category ID of each candidate data "明かり/1", "を/2", and "つけて/3" to generate the sentence "明かり を つけて" ("turn on the light"). If, for example, the grammar information were "3, 1, 2", the sentence "つけて 明かり を" would be generated instead.
The calculation unit 15 calculates the reliabilities "0.982", "1.000", and "0.990" corresponding to the candidate data "明かり/1", "を/2", and "つけて/3". The calculation unit 15 calculates the reliability of each candidate data in the range of 0.000 to 1.000. The calculation unit 15 may also assign a rank indicating priority to each sentence (rank 1 to rank 5 in fig. 3). By setting ranks, sentences classified at or below an arbitrary rank (for example, rank 6 or lower) can be excluded from evaluation. This reduces the number of candidate data selected as the evaluation data described later, and improves the processing speed.
The calculation unit 15 may calculate different reliabilities for the same candidate data when it appears in sentences with different contents. For example, while the reliabilities "0.982", "1.000", and "0.990" are calculated for the candidate data "明かり/1", "を/2", and "つけて/3" contained in sentence 1, the reliabilities "0.942", "1.000", and "0.023" are calculated for the candidate data "明かり/1", "を/2", and "ひいて/3" contained in sentence 2. That is, even for the same candidate data "明かり/1", different reliabilities can be calculated depending on the contents of the sentence and the order of combination.
As the reliability, a relative value corresponding to the types and number of candidate data detected by the detection unit 14 may be used, in addition to preset values. For example, as the number of candidate types for one category ID increases, the reliability can be calculated to be lower.
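The sentence-generation step could be sketched as below, continuing the examples above. The reliability values are supplied externally, so a `scores` mapping is assumed here; in practice they would come from the recognition engine's acoustic scores.

    import itertools
    from typing import Dict, List, Tuple

    def generate_sentences(candidates: List[Tuple[str, int]],
                           scores: Dict[str, float]
                           ) -> List[List[Tuple[str, int, float]]]:
        """Sketch of the calculation unit 15: for every grammar information
        entry whose category IDs can all be filled, emit one sentence per
        combination of candidate data, attaching a reliability to each."""
        by_id: Dict[int, List[str]] = {}
        for text, cid in candidates:
            by_id.setdefault(cid, []).append(text)
        sentences = []
        for order in GRAMMAR_DB:
            if all(cid in by_id for cid in order):
                for combo in itertools.product(*(by_id[c] for c in order)):
                    sentences.append([(t, c, scores.get(t, 0.0))
                                      for t, c in zip(combo, order)])
        return sentences

    # e.g. scores = {"明かり": 0.982, "を": 1.000, "つけて": 0.990, "ひいて": 0.023}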
< selection section 16>
The selection unit 16 selects evaluation data from the plurality of candidate data according to the reliability. The selection unit 16 selects, for example, the candidate data with the highest calculated reliability for each category ID as the evaluation data. For example, among the candidate data "つけて/3" (reliability 0.990) and "ひいて/3" (reliability 0.023), which share the category ID "3", the selection unit 16 selects the candidate data "つけて/3" with the higher reliability as the evaluation data. The selection unit 16 may also select a plurality of candidate data for one category ID as evaluation data; in that case, the generation unit 17 described later may select one of them.
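A sketch of this relative selection, using the (text, category ID, reliability) triples produced by the calculation sketch above:

    from typing import Dict, List, Tuple

    def select_evaluation_data(scored: List[Tuple[str, int, float]]
                               ) -> Dict[int, Tuple[str, float]]:
        """Sketch of the selection unit 16: keep, per category ID, the
        candidate data with the highest reliability."""
        best: Dict[int, Tuple[str, float]] = {}
        for text, cid, rel in scored:
            if cid not in best or rel > best[cid][1]:
                best[cid] = (text, rel)
        return best

    # e.g. [("つけて", 3, 0.990), ("ひいて", 3, 0.023)] -> {3: ("つけて", 0.990)}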
< creation section 17>
The generation unit 17 generates the recognition information from the evaluation data. Besides converting the evaluation data into a text format, the generation unit 17 may, for example, convert the evaluation data into a voice data format, or into a control data format for controlling the control device 3, and output the result as the recognition information. That is, the recognition information includes information for controlling the control device 3 (for example, information for controlling the traveling speed of a vehicle). Known techniques may be used for the conversion into a text, voice data, or control data format, and a database storing each data format may be used as needed.
The generation unit 17 may include, for example, a specification unit 17a and a comparison unit 17b. The specification unit 17a refers to the reference database and specifies, among the reference sentences, the 1st reference sentence corresponding to the evaluation data. For example, when "明かり/1", "を/2", and "つけて/3" are selected as the evaluation data, the specification unit 17a specifies the 1st reference sentence shown in fig. 4. In this case, the character strings matching the candidate data included in the evaluation data are specified as the pieces of character string information (1st character string information) included in the 1st reference sentence.
The comparison unit 17b compares the reliability corresponding to the evaluation data with the threshold (1st threshold) assigned to the 1st character string information. For example, the comparison unit 17b compares the reliabilities "0.982", "1.000", and "0.990" of the evaluation data "明かり", "を", and "つけて" with the 1st thresholds "0.800", "0.900", and "0.880". The generation unit 17 then generates the recognition information based on the comparison result. For example, the generation unit 17 may generate the recognition information when the reliability is equal to or higher than the 1st threshold, or may generate different recognition information depending on whether the reliability is at or above the 1st threshold or below it.
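Under the same sketched reference database, the comparison could be expressed as follows. Treating a sentence as correctly recognized only when every evaluation datum clears its threshold is one of the behaviours the description allows, not the only one.

    from typing import Dict, Tuple

    def compare_with_thresholds(evaluation: Dict[int, Tuple[str, float]]) -> bool:
        """Sketch of the comparison unit 17b: check each evaluation datum's
        reliability against the threshold assigned to the matching character
        string information in the reference database sketched above."""
        return all(rel >= REFERENCE_DB.get(text, 1.0)
                   for text, rel in evaluation.values())

    # e.g. {1: ("明かり", 0.982), 2: ("を", 1.000), 3: ("つけて", 0.990)} -> True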
< output unit 18>
The output unit 18 outputs the recognition information. The output unit 18 outputs the recognition information to the control device 3 and the like via the I/F 105, and may also output it to the output unit 109 via the I/F 107, for example. The output unit 18 also outputs various information (data) other than the recognition information to the control device 3 via, for example, the I/F 105 and the I/F 107.
< reflection section 19>
The reflection unit 19 obtains the evaluation result of the user or the like who evaluates the recognition information, and reflects the evaluation result in the thresholds of the reference database. For example, when the evaluation result for the recognition information is poor (that is, when the recognition information obtained for the speech data deviates from what the user or the like requested), the reflection unit 19 improves the recognition by changing the threshold. In this case, the evaluation result may be reflected in the threshold using, for example, a known machine learning method.
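As one concrete possibility (the description leaves the method open, mentioning known machine-learning approaches), the reflection step could nudge thresholds in response to user feedback. The fixed step size and the direction of adjustment are purely assumptions for illustration.

    def reflect_evaluation(text: str, rated_good: bool, step: float = 0.01) -> None:
        """Sketch of the reflection unit 19: loosen the threshold slightly
        when the user confirms a recognition, tighten it when the user
        rejects one, clamped to the reliability range 0.000-1.000."""
        delta = -step if rated_good else step
        REFERENCE_DB[text] = min(1.0, max(0.0, REFERENCE_DB[text] + delta))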
< Speech Collection apparatus 2>
The speech sound collection device 2 may include a known microphone, and may further include a DSP (digital signal processor), for example. When the speech sound collection device 2 includes a DSP, the speech sound collection device 2 generates data obtained by pulse-modulating a speech sound signal collected by a microphone with PCM or the like, and transmits the data to the speech recognition device 1.
The speech sound collection device 2 may be connected to the speech recognition device 1 directly or via the public communication network 4, for example. In addition, when the voice collecting device 2 has only a microphone, the voice recognition device 1 may generate data that is pulse-modulated.
< control device 3>
The control device 3 is a device that can receive the recognition information from the speech recognition apparatus 1 and be controlled based on it. As the control device 3, for example, an illumination device such as an LED, an in-vehicle device (for example, a device connected directly to a brake system to control the traveling speed of a vehicle), a vending machine capable of changing its display language, a locking device, an audio device, a massage machine, or the like can be used. The control device 3 may be connected to the speech recognition apparatus 1 directly or via the public communication network 4, for example.
< public communication network 4>
The public communication network 4 is the internet or the like connected to the voice recognition apparatus 1 via a communication circuit. The public communication network 4 may also be constituted by a so-called optical fiber communication network. The public communication network 4 is not limited to a wired communication network, and may be realized by a known communication network such as a wireless communication network.
< Server 5>
The server 5 stores the various information described above, for example information transmitted via the public communication network 4. The server 5 may store the same information as the storage unit 104, and may transmit and receive various information to and from the speech recognition apparatus 1 via the public communication network 4. That is, the speech recognition apparatus 1 may use the server 5 in place of the storage unit 104. In particular, by having the server 5 update the above-described databases, the update function and the stored data capacity required in the speech recognition apparatus 1 can be minimized. The speech recognition apparatus 1 can therefore be used while not normally connected to the public communication network 4, connecting only when an update is needed. This can greatly expand the places where the speech recognition apparatus 1 can be used.
< user terminal 6>
The user terminal 6 represents a terminal owned by a user of the speech recognition system 100, for example. As the user terminal 6, a mobile phone (portable terminal) is mainly used, and in addition to this, electronic devices such as a smart phone, a tablet terminal, a wearable terminal, a personal computer, and an IoT (Internet of Things) device, and all user terminals implemented by electronic devices may be used. The user terminal 6 may be connected to the speech recognition apparatus 1 directly, for example, in addition to being connected to the speech recognition apparatus 1 via the public communication network 4, for example. The user or the like may acquire the recognition information from the speech recognition apparatus 1 via the user terminal 6, for example, or may collect speech using the user terminal 6 instead of the speech collection apparatus 2, for example.
(an example of the operation of the speech recognition system 100)
Next, an example of the operation of the speech recognition system 100 according to the present embodiment will be described. Fig. 5 (a) is a flowchart showing an example of the operation of the speech recognition system 100 according to the present embodiment.
< obtaining Unit S110>
First, at least one piece of voice data is acquired (acquisition section S110). The acquisition unit 11 acquires voice data from the voice collecting apparatus 2 and the like. The acquisition unit 11 stores the voice data in the storage unit 104, for example, via the storage unit 13.
< extraction Unit S120>
Next, the recognition target data is extracted (extraction unit S120). The extraction unit 12 takes out the speech data from the storage unit 104 via the storage unit 13, for example, and extracts the silence start section and the silence end section included in the speech data. The extraction unit 12 extracts, as the recognition target data, the arrangement of phonemes and pause sections sandwiched between the silence start section and the silence end section. The extraction unit 12 stores the recognition target data in the storage unit 104 via the storage unit 13, for example. The extraction unit 12 may take out a plurality of pieces of speech data at a time.
The extraction unit 12 extracts a plurality of pieces of recognition target data from, for example, one piece of speech data. In this case, the pieces of recognition target data have different arrangements of phonemes and pause sections (for example, arrangements A to C in fig. 3). The extraction unit 12 extracts the plurality of pieces of recognition target data by setting different conditions, for example, or within the range of variation that arises even under the same condition.
For example, when the rest section includes at least one of breath sounds and lip sounds, the extraction unit 12 may extract, as the identification target data, an array including at least one of breath sounds and lip sounds.
< detecting Unit S130>
Next, candidate data are detected from the recognition target data (detection unit S130). The detection unit 14 takes out the recognition target data from the storage unit 104 via the storage unit 13, for example. The detection unit 14 refers to the character string database and selects the phoneme information corresponding to the arrangement in the recognition target data. Further, the detection unit 14 detects a plurality of pieces of character string information and the category IDs associated with the selected phoneme information as candidate data. The detection unit 14 stores the candidate data in the storage unit 104 via the storage unit 13, for example. The arrangement in the recognition target data may indicate, for example, an arrangement of phonemes sandwiched between a pair of pause sections, and another pause section may also be arranged between such a pair of pause sections.
< calculating Unit S140>
Next, the reliability corresponding to each candidate data is calculated (calculating section S140). The calculation unit 15 extracts the candidate data from the storage unit 104 via the storage unit 13, for example. The calculation unit 15 refers to the grammar database and generates a sentence in which a plurality of candidate data are combined based on grammar information. Further, the calculation unit 15 calculates the reliability corresponding to each candidate data included in the sentence. The calculation unit 15 stores the candidate data and the reliability in the storage unit 104, for example, via the storage unit 13. The generation of the sentence and the calculation of the reliability may be realized by using a known speech recognition engine such as Julius as the calculation unit 15.
The calculation unit 15 can generate a plurality of words in accordance with the type of grammar information in the grammar database. Further, the calculation unit 15 can perform speech recognition suitable for the situation with high accuracy by selecting the type of the grammar information.
< selection Unit S150>
Next, evaluation data is selected according to the reliability (selecting section S150). The selection unit 16 extracts the candidate data and the reliability from the storage unit 104, for example, via the storage unit 13. The selection unit 16 selects, for example, a candidate data for which the highest reliability is calculated for each type ID among the plurality of candidate data, as the evaluation data. The selection unit 16 stores the evaluation data in the storage unit 104, for example, via the storage unit 13.
< creation Unit S160>
Next, the recognition information is generated from the evaluation data (generation unit S160). The generation unit 17 takes out the evaluation data from the storage unit 104 via the storage unit 13, for example. The generation unit 17 converts the evaluation data into arbitrary data using, for example, the known techniques described above, and outputs the result as the recognition information.
For example, as shown in fig. 5 (b), the generation unit S160 may also have a specification unit S161 and a comparison unit S162.
The specification unit S161 specifies the 1 st reference sentence corresponding to the evaluation data. The specification unit 17a specifies the 1 st reference sentence corresponding to the evaluation data among the reference sentences with reference to the reference database.
The comparison unit S162 compares the reliability corresponding to the evaluation data with the 1st threshold assigned to the 1st character string information included in the 1st reference sentence. For example, as shown in fig. 3, the comparison unit 17b may judge the recognition to be correct when the reliability of the evaluation data is equal to or higher than the 1st threshold. The recognition information is then generated based on the judgment (comparison result) of the comparison unit 17b. When the comparison unit 17b determines that the reliability of the evaluation data is below the 1st threshold and the recognition is therefore judged erroneous, the processing may be terminated as it is or repeated from the extraction unit S120, or, for example, recognition information urging the user or the like to speak again may be generated.
< output Unit S170>
Then, the recognition information is output as necessary (output unit S170). The output unit 18 displays the recognition information on the output unit 109 via the I/F 107, and also outputs the recognition information for controlling the control device 3 and the like via, for example, the I/F 105.
< reflection unit S180>
Further, for example, the evaluation result of the user or the like who evaluates the identification information may be acquired and reflected in the threshold value of the reference database (reflecting section S180). In this case, the reflection unit 19 obtains the evaluation result made by the user or the like via the obtaining unit 11. The reflection unit 19 changes the threshold value in accordance with the evaluation value and the like included in the evaluation result so that the comparison result in the comparison unit S162 is improved (the recognition accuracy is improved).
In addition, the reflection unit 19 may reflect the evaluation result to at least one of the character string database and the grammar database, for example, in addition to the evaluation result to the reference database. The calculation unit 15 may reflect the evaluation result in the calculation of the reliability.
This completes the operation of the speech recognition system 100 according to the present embodiment.
According to the speech recognition system 100 of the present embodiment, the extraction unit S120 extracts the arrangement of phonemes and pause sections as the recognition target data, and the detection unit S130 selects the phoneme information corresponding to that arrangement and detects candidate data. Erroneous recognition can therefore be reduced compared with the case where candidate data are detected by considering only the phonemes, without their arrangement, in the recognition target data. This can improve the recognition accuracy.
Further, since the recognition accuracy can be improved, no preliminary voice input is needed to raise the accuracy. Here, a preliminary voice input means a voice uttered to start speech recognition before the speech data is acquired. Using a preliminary voice input can improve recognition accuracy but reduces convenience. In this regard, the speech recognition system 100 of the present embodiment does not require a preliminary voice input, and convenience can thus be improved.
In addition, according to the speech recognition system 100 of the present embodiment, the preliminary speech input may be performed as necessary. This can further improve the recognition accuracy.
Further, according to the speech recognition system 100 of the present embodiment, the character string database stores phoneme information corresponding to the arrangement of phonemes and rest periods, and character string information associated with the phoneme information. Therefore, compared to data stored for pattern matching of the entire phoneme, reduction of data capacity and simplification of data accumulation can be achieved.
In particular, by filtering the character string information stored in the character string database based on the usage environment of the speech recognition system 100, the data capacity can be reduced, and the range of use can be increased without connecting to, for example, the public communication network 4. In addition, the time from acquisition of the voice data to generation of the identification information can be significantly shortened.
Further, according to the speech recognition system 100 of the present embodiment, the extraction unit S120 extracts a plurality of pieces of recognition target data from one piece of speech data. Therefore, even when speech data whose arrangement of phonemes and pause sections varies is acquired, a drop in recognition accuracy can be suppressed. This can further improve the recognition accuracy.
Further, according to the speech recognition system 100 of the present embodiment, the calculation unit S140 generates a plurality of sentences. That is, even when the candidate data can be combined in a plurality of patterns, sentences corresponding to all the patterns can be generated. Therefore, erroneous recognition can be reduced compared with, for example, search methods such as pattern matching. This can further improve the recognition accuracy.
Further, according to the speech recognition system 100 of the present embodiment, the comparison unit S162 compares the reliability with the 1st threshold. By also applying a threshold-based judgment to the evaluation data, which are selected only relatively from the plurality of candidate data, erroneous recognitions can be further reduced. This can further improve the recognition accuracy.
Further, according to the speech recognition system 100 of the present embodiment, the reflection unit S180 reflects the evaluation result in the threshold value. Therefore, when the recognition information deviates from what the user intended, it can easily be improved. This enables continuous improvement of the recognition accuracy.
Further, according to the speech recognition system 100 of the present embodiment, the output unit S170 outputs the recognition information. As described above, the speech recognition system 100 according to the present embodiment can generate recognition information with higher accuracy than conventional systems. Therefore, when the control device 3 or the like is controlled based on the recognition information, erroneous operations of the control device 3 or the like can be significantly suppressed. For example, when vehicle braking is controlled using the speech recognition system 100, accuracy can be achieved to the extent that normal traveling is not hindered. That is, with the improved recognition accuracy, the system can be used for driving assistance. This enables use in a wide range of applications.
In addition, according to the speech recognition system 100 of the present embodiment, the pause section includes at least one of a breath sound and a lip sound. Therefore, even differences in voice data that are difficult to determine from phonemes alone can be easily determined, and the recognition target data can be extracted. This can further improve the recognition accuracy.
According to the speech recognition device 1 of the present embodiment, the extraction unit 12 extracts the arrangement of phonemes and pause sections as recognition target data, and the detection unit 14 selects the phoneme information corresponding to that arrangement and detects candidate data. Therefore, erroneous recognition can be reduced compared with the case where candidate data are detected by considering only the phonemes in the recognition target data. This can improve the recognition accuracy.
Further, according to the speech recognition device 1 of the present embodiment, the character string database stores phoneme information corresponding to the arrangement of phonemes and pause sections, and character string information associated with the phoneme information. Therefore, compared with storing data for pattern matching of entire phoneme sequences, the data capacity can be reduced and data accumulation can be simplified.
(Modification 1 of the configuration of the speech recognition system 100)
Next, Modification 1 of the configuration of the speech recognition system 100 according to the present embodiment will be described. Modification 1 differs from the above-described embodiment in that the generation unit 17 includes an updating unit 17c. Note that configurations identical to those described above will not be described again.
For example, as shown in fig. 6, the updating unit 17c of the generation unit 17 updates the threshold value stored in the reference database based on the candidate data and the reliability. That is, the threshold value can be updated to a value corresponding to the content of the candidate data and the reliability.
The updating unit 17c calculates, for example, an average value of the plurality of reliabilities associated with each type ID, and updates the threshold value based on the calculated average value.
When updating the threshold value, instead of using the calculated average value itself as the new threshold value, a value obtained by multiplying the average value by a predetermined coefficient may be used. Alternatively, the updated threshold value may be obtained by applying one of the four arithmetic operations between the pre-update threshold value and the value obtained by multiplying the average value by the coefficient.
By updating the threshold value according to the content of the candidate data and the reliability, a threshold value suited to the quality of the voice data can be set even when, for example, the voice data is likely to contain noise. Further, even when a plurality of pieces of character string information associated with one type ID are detected and the reliability of each piece is low, it is possible to prevent all the reliabilities from falling below the threshold value.
The updating unit 17c may also, for example, calculate the average of the reliabilities associated with each type ID excluding the lowest reliability. In this case, the updated threshold value tends to be higher than the threshold value before the update. This can reduce erroneous recognition.
The updating unit 17c may also, for example, calculate the average of the reliabilities associated with each type ID excluding both the lowest and the highest reliability. In this case, the updated threshold value tends to be lower than the threshold value before the update, which can improve the recognition rate. Further, fluctuation of the threshold value before and after the update can be suppressed.
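A minimal sketch of this family of update rules follows, assuming the reliabilities for one type ID are available as a list; the function name, the trim parameter, and the default coefficient are illustrative, not taken from the patent.

def update_threshold(reliabilities, coefficient=0.9, trim="none"):
    """Return an updated threshold value for one type ID.

    trim="none"    : plain average times the coefficient (base behavior)
    trim="min"     : drop the lowest reliability, which tends to raise
                     the threshold and reduce erroneous recognition
    trim="min_max" : drop the lowest and highest, which tends to lower
                     the threshold and suppress fluctuation across updates
    """
    values = sorted(reliabilities)
    if trim == "min" and len(values) > 1:
        values = values[1:]
    elif trim == "min_max" and len(values) > 2:
        values = values[1:-1]
    average = sum(values) / len(values)
    return average * coefficient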
(Modification 1 of the operation of the speech recognition system 100)
Next, Modification 1 of the operation of the speech recognition system 100 according to the present embodiment will be described. Fig. 7 (a) is a flowchart showing an example of the updating unit S163 in Modification 1.
As shown in fig. 7 (a), after the selection unit S150 operates, the threshold value stored in the reference database is updated based on the plurality of candidate data and the plurality of reliabilities (updating unit S163). The updating unit 17c, for example, retrieves the candidate data, the reliabilities, and the reference database from the storage unit 104 via the storage unit 13.
For example, as shown in fig. 6, the updating unit 17c calculates the average value "0.940" of the plurality of reliabilities "0.982", "0.942", and "0.897" associated with the type ID "1" included in ranks 1, 2, and 4. The updating unit 17c then uses as the updated threshold value, for example, the value "0.846" obtained by multiplying the calculated average value by a coefficient (e.g., 0.9).
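The arithmetic in this worked example can be checked with a few lines of standalone Python (an illustration of the calculation only):

avg = (0.982 + 0.942 + 0.897) / 3   # reliabilities for type ID "1"
print(f"{avg:.3f}")                 # 0.940
print(f"{avg * 0.9:.3f}")           # 0.846, the updated threshold value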
Thereafter, the above-described specifying unit S161 and subsequent units operate, and the operation of the speech recognition system 100 of the present embodiment ends.
According to the present modification, the updating unit 17c updates the threshold value in the updating unit S163 based on the candidate data and the reliabilities. Therefore, compared with always using a predetermined threshold value, recognition information corresponding to the quality of the acquired voice data can be generated. This can expand the range of usable environments.
(Modification 2 of the operation of the speech recognition system 100)
Next, Modification 2 of the operation of the speech recognition system 100 according to the present embodiment will be described. Modification 2 differs from the above-described embodiment in that a setting unit S190 is provided. Note that configurations identical to those described above will not be described again.
For example, as shown in fig. 7 (b), the setting unit S190 operates after the generation unit S160. The setting unit S190 screens the contents of each database to be referred to based on the recognition information. After the setting unit S190 operates, the acquisition unit S110 operates.
For example, when recognition information indicating a "music pattern" has been generated, the setting unit S190 screens the character string database down to the phoneme information, character string information, and type IDs classified under the "music pattern", and the detection unit 14 refers to this screened content in the subsequent detection unit S130. Therefore, compared with the case where the setting unit S190 is not provided, the phoneme information and the like to be referred to can be limited to specific content. This can significantly improve the recognition accuracy.
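A minimal sketch of such screening follows, assuming each database entry carries a pattern label; all names and entries are illustrative assumptions.

# Hypothetical character string database entries tagged with a pattern label.
CHAR_STRING_DB = [
    # (phoneme information, character string information, type ID, pattern)
    (("p", "l", "ey"), "play", 1, "music pattern"),
    (("s", "t", "aa", "p"), "stop", 2, "driving pattern"),
]

def screen_by_pattern(db, pattern):
    """Keep only the entries classified under the given pattern, so the
    detection unit afterwards refers to a narrowed-down database."""
    return [entry for entry in db if entry[3] == pattern]

music_db = screen_by_pattern(CHAR_STRING_DB, "music pattern")
print(music_db)  # [(('p', 'l', 'ey'), 'play', 1, 'music pattern')]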
(Modification of the acquisition unit S110)
Next, a modification of the acquisition unit S110 of the present embodiment will be described. This modification differs from the above-described embodiment in that the acquisition unit 11 acquires condition information. Note that configurations identical to those described above will not be described again.
In the acquisition unit S110, the acquisition unit 11 acquires condition information indicating the conditions under which the voice data was generated. For example, as shown in fig. 8, the condition information includes environment information, noise information, voice collection device information, user information, and sound characteristic information. In addition, similarly to the setting unit S190 described above, the detection unit 14 may, for example, screen the contents of at least one of the character string database and the grammar database to be referred to based on the condition information. The reflection unit 19 may also, for example, use the condition information when updating the threshold values of the reference database.
The condition information may be generated by the voice collecting device 2, for example, or may be created in advance by the user or the like. The acquisition unit 11 may also acquire part of the voice data as condition information.
The environment information includes information on the installation environment of the voice collecting device 2, and indicates, for example, whether the device is installed outdoors or indoors and the size of the room. By using the environment information, for example, the reflection conditions of speech sounds indoors can be taken into account, which can improve the accuracy of the extracted recognition target data and the like.
The noise information includes information on noise that may be picked up by the voice collecting device 2, such as voices other than the user's and air-conditioning sounds. By using the noise information, unnecessary data included in the voice data can be removed in advance, which can improve the accuracy of the extracted recognition target data and the like.
The voice collection device information includes information on the type, performance, and the like of the voice collecting device 2, such as the number and type of microphones. By using the voice collection device information, a database or the like suited to the circumstances under which the voice data was generated can be selected, which can improve the accuracy of speech recognition.
The user information includes information on the number of users, their nationality, gender, and the like. The sound characteristic information includes information on the volume, sound pressure, speaking habits, articulation, and the like of the voice. By using the user information, the characteristics of the voice data can be defined in advance, which can improve the accuracy of speech recognition.
According to the present modification, the acquisition unit S110 acquires the condition information. That is, the acquisition unit S110 acquires as condition information various conditions such as the surrounding environment at the time the voice data was acquired, the noise included in the voice data, and the type of the voice collecting device 2 that collected the voice. Therefore, each unit and each database can be configured according to the condition information. This can improve the recognition accuracy regardless of the usage environment and the like.
Further, according to the present modification, the detection unit S130 screens the contents of the character string database to be referred to according to the condition information. Therefore, by storing different character string information and the like for each piece of condition information in the character string database, candidate data corresponding to each piece of condition information can be detected. This can improve the recognition accuracy for each piece of condition information.
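The condition information and the screening it enables might be pictured as follows; the ConditionInfo fields, the "condition" tag, and all entries are illustrative assumptions, not the patented data layout.

from dataclasses import dataclass

@dataclass
class ConditionInfo:
    environment: str  # e.g., "indoor" or "outdoor"
    noise: str        # e.g., "air_conditioning"
    device: str       # e.g., number and type of microphones
    user: str         # e.g., number of users, nationality, gender
    sound: str        # e.g., volume, sound pressure, speaking habits

# Hypothetical character string database entries stored per condition.
CHAR_STRING_DB = [
    {"string": "turn on the heater", "type_id": 3, "condition": "indoor"},
    {"string": "watch for traffic", "type_id": 4, "condition": "outdoor"},
]

def screen_by_condition(db, cond):
    """Narrow the character string database to the entries stored for the
    environment given in the condition information."""
    return [e for e in db if e["condition"] == cond.environment]

cond = ConditionInfo("indoor", "air_conditioning", "2 microphones",
                     "1 adult", "low volume")
print(screen_by_condition(CHAR_STRING_DB, cond))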
(Modification of the reference database)
Next, a modification of the reference database of the present embodiment will be described. This modification differs from the above-described embodiment in the content of the information stored in the reference database. Note that configurations identical to those described above will not be described again.
For example, as shown in fig. 9, the reference database stores past evaluation data acquired in advance, reference sentences associated with the past evaluation data, and degrees of correlation between the past evaluation data and the reference sentences.
The generation unit 17 refers to the reference database and selects, from the past evaluation data, 1st evaluation data corresponding to the evaluation data (the dotted frame in "past evaluation data" in fig. 9). The generation unit 17 then acquires, from the reference sentences, the 1st reference sentence corresponding to the 1st evaluation data (the dashed frame in "reference sentence" in fig. 9). Further, the generation unit 17 acquires, from the degrees of correlation, the 1st degree of correlation between the 1st evaluation data and the 1st reference sentence (e.g., "65%" in fig. 9). The 1st evaluation data and the 1st reference sentence may each include a plurality of items.
The generation unit 17 generates recognition information based on the 1st degree of correlation. For example, the generation unit 17 compares the 1st degree of correlation with a threshold value acquired in advance, and generates the recognition information with reference to the 1st reference sentence associated with a 1st degree of correlation exceeding the threshold value.
As the past evaluation data, information that partially or completely matches the evaluation data may be selected, and similar information (including information of the same concept or the like) may also be used. When the evaluation data and the past evaluation data are expressed as combinations of a plurality of character strings, any combination such as noun-verb, noun-adjective, adjective-verb, or noun-noun may be used, for example.
The degree of correlation (the 1st degree of correlation) is expressed in three or more levels, for example as a percentage. When the reference database is constituted by a neural network, for example, the 1st degree of correlation corresponds to a weight variable associated with the selected past evaluation data.
A feature of using the above reference database is that speech recognition can be realized with degrees of correlation set in three or more levels. For example, the degree of correlation can be described by a numerical value from 0 to 100%, but it is not limited to this and may consist of any number of levels as long as it can be described in three or more levels.
Based on the degree of correlation and the like, the 1st reference sentences selected as candidates for the recognition information of the evaluation data can be picked in descending or ascending order of the degree of correlation. Selecting in descending order in this way allows 1st reference sentences that are highly likely to match the situation to be chosen preferentially. On the other hand, 1st reference sentences that are less likely to match the situation are not excluded; they are not discarded and can still be selected as candidates for the recognition information.
Furthermore, even an association with a very low degree of correlation, for example about 1%, is not overlooked and can still be selected. That is, even a very low degree of correlation is treated as a faint sign of relevance, so excessive pruning of candidates and the erroneous recognition caused by discarding them can be suppressed.
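A minimal sketch of this correlation-based selection follows, assuming the reference database is a flat list of triples; all names, entries, and percentages are illustrative assumptions, not the patented structure.

REFERENCE_DB = [
    # (past evaluation data, reference sentence, degree of correlation in %)
    ("turn on music", "play the requested music", 65),
    ("turn on music", "raise the volume", 12),
    ("turn on music", "stop the vehicle", 1),
]

def select_reference_sentences(evaluation_data, threshold):
    """Return reference sentences whose degree of correlation with the
    matching past evaluation data exceeds the threshold, in descending
    order of correlation; lowering the threshold keeps even 1% candidates
    selectable, avoiding over-pruning."""
    matches = [(sentence, corr) for past, sentence, corr in REFERENCE_DB
               if past == evaluation_data]
    matches.sort(key=lambda item: item[1], reverse=True)
    return [sentence for sentence, corr in matches if corr > threshold]

print(select_reference_sentences("turn on music", threshold=50))
# ['play the requested music']
print(select_reference_sentences("turn on music", threshold=0))
# ['play the requested music', 'raise the volume', 'stop the vehicle']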
Although embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, substitutions, and changes can be made without departing from the gist of the invention. These embodiments and modifications thereof are included in the scope and gist of the invention, and are included in the invention described in the claims and their equivalents.
Description of the reference symbols
1: a voice recognition device; 2: a voice collecting device; 3: a control device; 4: a public communication network; 5: a server; 6: a user terminal; 10: a housing; 11: an acquisition unit; 12: an extraction unit; 13: a storage unit; 14: a detection unit; 15: a calculation unit; 16: a selection unit; 17: a generation unit; 17a: a specifying unit; 17b: a comparison unit; 17c: an updating unit; 18: an output unit; 19: a reflection unit; 100: a speech recognition system; 101: a CPU; 102: a ROM; 103: a RAM; 104: a storage unit; 105: an I/F; 106: an I/F; 107: an I/F; 108: an input unit; 109: an output unit; 110: an internal bus; S110: an acquisition unit; S120: an extraction unit; S130: a detection unit; S140: a calculation unit; S150: a selection unit; S160: a generation unit; S161: a specifying unit; S162: a comparison unit; S163: an updating unit; S170: an output unit; S180: a reflection unit; S190: a setting unit.

Claims (12)

1. A speech recognition system, comprising:
an acquisition unit that acquires at least one piece of voice data;
an extraction unit that extracts a silent start section and a silent end section included in the voice data, and extracts an arrangement of phonemes and pause sections sandwiched between the silent start section and the silent end section as recognition target data;
a character string database that stores character string information acquired in advance, phoneme information associated with the character string information, and a type ID assigned to the character string information;
a detection unit that refers to the character string database, selects the phoneme information corresponding to the arrangement included in the recognition target data, and detects a plurality of pieces of the character string information and the type IDs associated with the selected phoneme information as candidate data;
a grammar database storing grammar information indicating an arrangement order of the type IDs acquired in advance;
a calculation unit that generates a sentence in which a plurality of candidate data are combined based on the grammar information with reference to the grammar database, and calculates a reliability corresponding to each of the candidate data included in the sentence;
a selection unit that selects evaluation data from the plurality of candidate data according to the reliability; and
a generation unit that generates recognition information from the evaluation data.
2. The speech recognition system of claim 1,
the extraction unit extracts a plurality of pieces of the recognition target data from one piece of the voice data,
the plurality of pieces of recognition target data have mutually different arrangements of the phonemes and the pause sections.
3. The speech recognition system of claim 1 or 2,
the calculation unit generates a plurality of the sentences,
the plurality of sentences differ from each other in at least one of the types and combinations of the candidate data.
4. The speech recognition system of claim 1 or 2,
the speech recognition system further includes a reference database storing the character string information acquired in advance, reference sentences obtained by combining the character string information, and a threshold value assigned to each piece of the character string information,
the generation unit has:
a specifying unit that refers to the reference database and specifies, among the reference sentences, a 1st reference sentence corresponding to the evaluation data; and
a comparison unit that compares the reliability corresponding to the evaluation data with a 1st threshold value assigned to 1st character string information included in the 1st reference sentence,
and the generation unit generates the recognition information according to the comparison result of the comparison unit.
5. The speech recognition system of claim 4,
the speech recognition system further has an updating unit that updates the threshold value stored in the reference database according to the plurality of candidate data and the plurality of reliabilities.
6. The speech recognition system of claim 4,
the speech recognition system further includes a reflection unit that acquires an evaluation result from a user who evaluates the recognition information and reflects the evaluation result in the threshold value of the reference database.
7. The speech recognition system of claim 1 or 2,
the acquisition unit acquires condition information indicating a condition under which the voice data is generated.
8. The speech recognition system of claim 7,
the detection unit screens the contents of the character string database to be referred to according to the condition information.
9. The speech recognition system of claim 1 or 2,
the speech recognition system further has an output unit that outputs the recognition information,
the recognition information includes information for controlling the traveling speed of a vehicle.
10. The speech recognition system of claim 1 or 2,
the pause section includes at least one of a breath sound and a lip sound.
11. The speech recognition system of claim 1 or 2,
the character string information includes languages of two or more countries.
12. A speech recognition apparatus comprising:
an acquisition unit that acquires at least one piece of voice data;
an extraction unit that extracts a silent start section and a silent end section included in the voice data, and extracts an arrangement of phonemes and pause sections sandwiched between the silent start section and the silent end section as recognition target data;
a character string database that stores character string information acquired in advance, phoneme information associated with the character string information, and a type ID assigned to the character string information;
a detection unit that refers to the character string database, selects the phoneme information corresponding to the arrangement included in the recognition target data, and detects a plurality of pieces of the character string information and the type IDs associated with the selected phoneme information as candidate data;
a grammar database storing grammar information indicating an arrangement order of the type IDs acquired in advance;
a calculation unit that generates a sentence in which a plurality of candidate data are combined based on the grammar information by referring to the grammar database, and calculates a reliability corresponding to each of the candidate data included in the sentence;
a selection unit that selects evaluation data from the plurality of candidate data according to the reliability; and
a generation unit that generates recognition information based on the evaluation data.
CN201980000774.5A 2018-06-18 2019-01-18 Speech recognition system and speech recognition device Active CN110914897B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP2018-115243 2018-06-18
JP2018115243A JP6462936B1 (en) 2018-06-18 2018-06-18 Speech recognition system and speech recognition device
PCT/JP2019/001408 WO2019244385A1 (en) 2018-06-18 2019-01-18 Speech recognition system and speech recognition device

Publications (2)

Publication Number Publication Date
CN110914897A CN110914897A (en) 2020-03-24
CN110914897B true CN110914897B (en) 2023-03-07

Family

ID=65228956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980000774.5A Active CN110914897B (en) 2018-06-18 2019-01-18 Speech recognition system and speech recognition device

Country Status (3)

Country Link
JP (1) JP6462936B1 (en)
CN (1) CN110914897B (en)
WO (1) WO2019244385A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7195947B2 (en) * 2019-01-22 2022-12-26 菱洋エレクトロ株式会社 Speech recognition system and speech recognition device
CN110136715B (en) * 2019-05-16 2021-04-06 北京百度网讯科技有限公司 Speech recognition method and device
CN110322883B (en) * 2019-06-27 2023-02-17 上海麦克风文化传媒有限公司 Voice-to-text effect evaluation optimization method
CN112509609B (en) * 2020-12-16 2022-06-10 北京乐学帮网络技术有限公司 Audio processing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1573924A (en) * 2003-06-20 2005-02-02 PtoPA株式会社 Speech recognition apparatus, speech recognition method, conversation control apparatus, conversation control method
JP2007148118A (en) * 2005-11-29 2007-06-14 Infocom Corp Voice interactive system
JP2012068354A (en) * 2010-09-22 2012-04-05 National Institute Of Information & Communication Technology Speech recognizer, speech recognition method and program
JP2013050742A (en) * 2012-12-11 2013-03-14 Ntt Docomo Inc Speech recognition device and speech recognition method
CN105718503A (en) * 2014-12-22 2016-06-29 卡西欧计算机株式会社 Voice retrieval apparatus, and voice retrieval method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6223155B1 (en) * 1998-08-14 2001-04-24 Conexant Systems, Inc. Method of independently creating and using a garbage model for improved rejection in a limited-training speaker-dependent speech recognition system
JP2003219332A (en) * 2002-01-23 2003-07-31 Canon Inc Program reservation apparatus and method, and program
JP2009116075A (en) * 2007-11-07 2009-05-28 Xanavi Informatics Corp Speech recognition device
JP5310563B2 (en) * 2007-12-25 2013-10-09 日本電気株式会社 Speech recognition system, speech recognition method, and speech recognition program
JP5243886B2 (en) * 2008-08-11 2013-07-24 旭化成株式会社 Subtitle output device, subtitle output method and program
JP5493537B2 (en) * 2009-07-24 2014-05-14 富士通株式会社 Speech recognition apparatus, speech recognition method and program thereof
KR101537370B1 (en) * 2013-11-06 2015-07-16 주식회사 시스트란인터내셔널 System for grasping speech meaning of recording audio data based on keyword spotting, and indexing method and method thereof using the system

Also Published As

Publication number Publication date
CN110914897A (en) 2020-03-24
JP6462936B1 (en) 2019-01-30
WO2019244385A1 (en) 2019-12-26
JP2019219456A (en) 2019-12-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant