WO2016129188A1 - Speech recognition processing device, speech recognition processing method, and program - Google Patents

Speech recognition processing device, speech recognition processing method, and program Download PDF

Info

Publication number
WO2016129188A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech recognition
voice
permutation
speech
recognition result
Prior art date
Application number
PCT/JP2015/086000
Other languages
English (en)
Japanese (ja)
Inventor
久 坂本
Original Assignee
Necソリューションイノベータ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Necソリューションイノベータ株式会社 filed Critical Necソリューションイノベータ株式会社
Priority to JP2016574636A priority Critical patent/JP6429294B2/ja
Publication of WO2016129188A1 publication Critical patent/WO2016129188A1/fr

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • The present invention relates to a speech recognition processing device for recognizing information spoken by a person, a speech recognition processing method, and a program for causing a computer to execute the method.
  • The speech recognition systems of Patent Document 1 and Non-Patent Document 1 hold all of the large amount of data, such as language models and teacher data, needed for recognition.
  • Such a speech recognition system is often operated on a personal computer (PC) or on a terminal device such as a smartphone or tablet, which have become increasingly common in recent years.
  • Although the main storage and auxiliary storage of these terminal devices are growing in capacity, it is still difficult from a capacity standpoint to store all of the large amount of data required by the speech recognition system on the terminal device.
  • To address this, a cloud-type speech recognition service is provided (see Non-Patent Document 2).
  • In this service, the large amount of data needed for speech recognition processing is stored not on the terminal device but on a cloud platform built in a data center. When the service is used, the terminal device connects to the data center over a network and obtains recognition results produced using that large amount of data.
  • In addition, as information processing speeds have increased, a user operating a terminal device can obtain a speech recognition result from the cloud platform almost immediately after inputting voice into the terminal. In this way, the user obtains highly accurate speech recognition results without accumulating large language models and teacher data on the terminal device.
  • the cloud-type voice recognition service disclosed in Non-Patent Document 2 solves the problems of the techniques disclosed in Patent Document 1 and Non-Patent Document 1.
  • However, this cloud-type speech recognition service handles only short inputs, such as word-level utterances and relatively short conversational sentences, and is therefore not suited to recognition processing of long conversational passages composed of multiple sentences.
  • An object of the present invention is to provide a speech recognition processing device, a speech recognition processing method, and a program that enable recognition processing of long conversational sentences.
  • According to an aspect of the present invention, a speech recognition processing apparatus includes: voice sampling means that acquires input voice as voice data; voice dividing means that divides the voice data into a plurality of voice data pieces and assigns a permutation number to each voice data piece in accordance with the order in which the voice was input; storage means that stores the permutation numbers; recognition request transmission/reception means that distributes the voice data pieces to a plurality of preset communication ports while associating the permutation numbers with them, transmits the voice data pieces to a voice recognition server via a network, and, upon receiving a voice recognition result from the voice recognition server through a communication port, assigns to the received result the permutation number associated with that communication port and stores the voice recognition result in the area of the storage means in which the matching permutation number is stored; recognition result aggregating means that generates a recognition result sentence in which the voice recognition results stored in the storage means together with the permutation numbers are arranged according to the permutation numbers; and display means that displays the generated recognition result sentence.
  • A speech recognition processing method according to another aspect is a speech recognition processing method performed by an information processing apparatus, in which: input speech is acquired as speech data; the speech data is divided into a plurality of speech data pieces; a permutation number is assigned to each speech data piece in accordance with the order in which the speech was acquired; the permutation numbers are stored in storage means; the speech data pieces are distributed to a plurality of preset communication ports while the permutation numbers are associated with them, and are transmitted to a speech recognition server via a network; a speech recognition result, which is the result of recognition processing of a speech data piece by the speech recognition server, is received from the speech recognition server through a communication port; the permutation number associated with that communication port is assigned to the received speech recognition result, and the speech recognition result is stored in the area of the storage means in which the matching permutation number is stored; a recognition result sentence is generated in which the speech recognition results stored in the storage means together with the permutation numbers are arranged according to the permutation numbers; and the generated recognition result sentence is displayed.
  • A program according to another aspect causes a computer to execute: a procedure for acquiring input audio as audio data; a procedure for dividing the audio data into a plurality of audio data pieces and assigning a permutation number to each audio data piece in accordance with the order in which the audio was acquired; a procedure for storing the permutation numbers in storage means; a procedure for distributing the audio data pieces to a plurality of preset communication ports while associating the permutation numbers with them and transmitting the audio data pieces to a speech recognition server via a network; a procedure for assigning, to each speech recognition result received through a communication port, the permutation number associated with that communication port and storing the speech recognition result in the area of the storage means in which the matching permutation number is stored; a procedure for generating a recognition result sentence in which the speech recognition results stored in the storage means together with the permutation numbers are arranged according to the permutation numbers; and a procedure for displaying the generated recognition result sentence.
  • FIG. 1 is a block diagram for explaining the configuration of the speech recognition processing apparatus of the present embodiment.
  • FIG. 2 is a diagram showing a configuration example of data stored in the permutation number storage means shown in FIG.
  • FIG. 3 is a diagram showing another configuration example of data stored in the permutation number storage means shown in FIG.
  • FIG. 4 is a flowchart showing an operation procedure performed by the speech recognition processing apparatus according to the present embodiment.
  • FIG. 5 is a flowchart showing the detailed operation of step S02 shown in FIG.
  • FIG. 6 is a flowchart showing the detailed operation of step S05 shown in FIG.
  • FIG. 7 is a block diagram for explaining the configuration of the speech recognition processing apparatus according to the first embodiment.
  • FIG. 8 is a diagram illustrating a configuration of data stored in the permutation number storage unit in the first embodiment.
  • FIG. 9 is a diagram showing the transmission contents of the recognition request transmission / reception means in the first embodiment.
  • FIG. 10 is a diagram illustrating the contents received by the recognition request transmission / reception unit according to the first embodiment.
  • FIG. 11 is a diagram showing an example of data stored in the fields of the permutation number storage unit in the first embodiment.
  • FIG. 12 is a diagram illustrating an example of the screen of the display unit shown in FIG. 7 in the first embodiment.
  • FIG. 13 is a block diagram showing another configuration example of the speech recognition processing apparatus of this embodiment.
  • FIG. 1 is a block diagram for explaining the configuration of the speech recognition processing apparatus of this embodiment.
  • the voice recognition processing device 1 is an information processing device that outputs information in which a conversation made by the speaker 4 is converted to text so that the viewer 5 can view it.
  • the speech recognition processing device 1 may be a desktop or notebook PC, or a portable information terminal such as a PDA (Personal Digital Assistant) smaller than the PC.
  • the number of speakers 4 and viewers 5 may be plural.
  • the voice recognition processing device 1 is connected via a network 6 to a voice recognition server 3 that provides a cloud type voice recognition service.
  • the cloud type speech recognition service is a cloud type speech recognition service disclosed in Non-Patent Document 2, for example.
  • the speech recognition processing device 1 includes a permutation number storage unit 13, a recognition request transmission / reception unit 14, and a control unit 30.
  • The control unit 30 includes a memory (not shown) that stores a computer program (hereinafter simply referred to as a program) and a CPU (Central Processing Unit) (not shown) that executes processing according to the program.
  • the control unit 30 includes a voice collection unit 11, a voice division unit 12, a recognition result aggregation unit 15, and a recognition result display unit 16.
  • That is, the voice collection unit 11, the voice division unit 12, the recognition result aggregation unit 15, and the recognition result display unit 16 are realized virtually, in software, within the voice recognition processing device 1.
  • A microphone is connected to the voice sampling means 11 and a display unit is connected to the recognition result display means 16, although their illustration is omitted. The following description assumes that a display unit is connected to the recognition result display unit 16 as the output destination for the recognition result sentence, but a printer may be used instead.
  • Alternatively, the voice collection unit 11, the voice division unit 12, the recognition result aggregation unit 15, and the recognition result display unit 16 shown in FIG. 1 may be realized by dedicated integrated circuits, such as ASICs (Application Specific Integrated Circuits), each specialized for its function.
  • In speech recognition, processing must keep pace with the speed at which speech is input, so information processing speed is important; implementing each function as a dedicated circuit can improve the overall information processing speed.
  • The voice sampling means 11 receives the speech uttered by one or more speakers 4 as digital data continuously input via a microphone (not shown), and acquires it as continuous information such as stream data.
  • The following description assumes that the voice collection unit 11 outputs the acquired voice data to the voice division unit 12 without processing it, but the voice data may instead be processed before being output.
  • Examples of such processing include noise cancellation for removing noise and filtering for extracting only the frequency band corresponding to human speech.
  • the voice dividing unit 12 analyzes the voice data acquired by the voice sampling unit 11, and divides the voice data into voice data pieces that are smaller units.
  • One dividing method is to detect parts in which no human voice is present (silent parts) or breathing parts, and to extract the data before and after them as voice data pieces.
  • In other words, the audio data in the region sandwiched between two detected parts corresponds to one audio data piece.
  • As a method of determining whether voice information is present, it is conceivable, for example, to monitor the frequency band of human speech (e.g., about 200 Hz to about 4 kHz), or to record in advance sound collected while no human voice is present as an environmental sound and to determine that no voice information is present when the input matches that environmental sound.
  • the method for detecting the presence or absence of audio information is not limited to the method described here, and other methods may be used.
  • the voice dividing means 12 assigns a permutation number indicating the order of appearance of the voice data to the divided voice data pieces.
  • the voice dividing means 12 assigns permutation numbers in order from the first voice data piece of the voice data received from the voice sampling means 11. Therefore, the permutation numbers assigned to the audio data pieces are in the order of input to the audio sampling means 11.
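  • As an illustration of the division and numbering described above, the following Python sketch splits 16-bit PCM samples at sustained low-energy runs (standing in for breathing and silent parts) and numbers the resulting pieces in input order. The sampling rate, frame length, and dBFS threshold are illustrative assumptions, not values given in this description; only the 0.5-second minimum pause follows the first embodiment described later.

```python
import math
from typing import List, Tuple

SAMPLE_RATE = 16000      # assumed sampling rate (Hz); not specified in this description
FRAME_LEN = 320          # 20 ms frames at 16 kHz (illustrative)
SILENCE_DB = -35.0       # illustrative level threshold relative to full scale (dBFS);
                         # the embodiment cites an absolute level of 60 dB instead
MIN_SILENCE_SEC = 0.5    # minimum pause length, taken from the first embodiment

def frame_db(frame: List[int]) -> float:
    """RMS level of one frame of 16-bit samples, in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame)) or 1e-9
    return 20.0 * math.log10(rms / 32768.0)

def split_into_pieces(samples: List[int]) -> List[List[int]]:
    """Split the audio at sustained low-energy runs (breathing / silent parts)."""
    min_silent_frames = int(MIN_SILENCE_SEC * SAMPLE_RATE / FRAME_LEN)
    pieces, current, silent_run = [], [], 0
    for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = samples[i:i + FRAME_LEN]
        if frame_db(frame) < SILENCE_DB:
            silent_run += 1
            if silent_run == min_silent_frames and current:
                pieces.append(current)   # close the piece in front of the pause
                current = []
        else:
            silent_run = 0
            current.extend(frame)
    if current:
        pieces.append(current)
    return pieces

def assign_permutation_numbers(pieces: List[List[int]],
                               max_so_far: int) -> Tuple[List[Tuple[int, List[int]]], int]:
    """Number the pieces in input order, continuing from the stored maximum."""
    numbered = []
    for piece in pieces:
        max_so_far += 1
        numbered.append((max_so_far, piece))
    return numbered, max_so_far
```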
  • When the recognition request transmission/reception means 14 receives from the voice division means 12 a pair consisting of a voice data piece divided by the voice division means 12 and the permutation number assigned to that piece, it transmits a voice recognition request containing the voice data piece to the voice recognition server 3. At that time, the recognition request transmission/reception means 14 transmits a plurality of voice recognition requests to the voice recognition server 3 in parallel. This is described in detail below.
  • the number of communication ports (communication channels) for transmitting / receiving data to / from the voice recognition server 3 is preset.
  • the number of communication ports is determined by the information processing capability of the voice recognition server 3 that is a data transmission / reception destination.
  • a plurality of communication ports are set to be usable in the recognition request transmission / reception means 14.
  • That is, the recognition request transmission/reception unit 14 has a plurality of logically usable communication ports, assigns each pair of permutation number and voice data piece passed from the voice division unit 12 to one of these communication ports, and holds the information associating each communication port with its permutation number.
  • The recognition request transmission/reception means 14 can transmit a plurality of voice recognition requests in parallel by sending a recognition request containing a voice data piece to the voice recognition server 3 through each communication port. The communication ports need not be synchronized with one another, and transmission can be performed asynchronously.
  • the number of voice recognition requests that can be made at one time may be fixedly set in the recognition request transmission / reception means 14 or may be set freely by a setting file or the like.
  • When the recognition request transmission/reception unit 14 receives a voice recognition result from the voice recognition server 3 through a communication port, it assigns to the received voice recognition result the permutation number that had been associated with that communication port.
  • the recognition request transmission / reception unit 14 stores the speech recognition result and the permutation number in the permutation number storage unit 13 in association with each other.
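  • The following Python sketch illustrates the parallel, asynchronous requests described above. A pool of five workers stands in for the five communication ports of the first embodiment; the permutation number stays attached to each in-flight request so that a result arriving on any "port" can be filed under the correct number. The recognize callable is a hypothetical stand-in for the network request to the speech recognition server and is supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

NUM_PORTS = 5  # number of preset communication ports (five in the first embodiment)

def send_requests(numbered_pieces, recognize, result_store):
    """Send each (permutation number, voice data piece) pair in parallel.

    `recognize(piece) -> text` is supplied by the caller and stands in for the
    network request to the speech recognition server. Each worker plays the
    role of one communication port: the permutation number stays attached to
    the in-flight request, so a result arriving on any port is filed under
    the correct number even when results come back out of order.
    """
    with ThreadPoolExecutor(max_workers=NUM_PORTS) as ports:
        futures = {ports.submit(recognize, piece): number
                   for number, piece in numbered_pieces}
        for future in as_completed(futures):          # arrival order may differ
            number = futures[future]                  # number tied to this request
            result_store[number] = future.result()    # file result by permutation number
```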
  • the permutation number storage means 13 records the permutation numbers assigned to the audio data pieces divided by the audio dividing means 12.
  • FIG. 2 is a diagram showing a configuration example of data stored in the permutation number storage means.
  • the storage area of T1301 shown in FIG. 2 is a field in which the maximum value of the permutation numbers assigned to the audio data pieces divided by the audio dividing means 12 is recorded.
  • 0 is recorded in the field T1301 of the permutation number storage means 13 as the initial value of the maximum permutation number.
  • the initial stage is when the speech recognition processing program of this embodiment is started.
  • When assigning a permutation number, the voice division means 12 reads the current maximum permutation number from the permutation number storage means 13, assigns the value obtained by adding 1 to it to the next voice data piece, and then records the updated maximum permutation number in the permutation number storage means 13.
  • the permutation number storage means 13 stores the speech recognition result received by the recognition request transmission / reception means 14 in pairs with the permutation number.
  • FIG. 3 is a diagram showing another configuration example of data stored in the permutation number storage means shown in FIG.
  • the storage area of T1311 shown in FIG. 3 is a field for storing the number (permutation number) assigned to the audio data pieces divided by the audio dividing means 12.
  • the storage area of T1312 shown in FIG. 3 is a field for storing the speech recognition result received by the recognition request transmission / reception means 14.
  • the permutation number storage means 13 is not limited to the data structure described above, and may be realized by a database or the like so that data can be referred to and recorded as described above.
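  • A minimal in-memory sketch of the permutation number storage described above is shown below; the field names T1301, T1311, and T1312 follow the figures, and a plain Python dictionary is an illustrative substitute for whatever table or database is actually used.

```python
class PermutationStore:
    """In-memory stand-in for the permutation number storage means 13."""

    def __init__(self):
        self.max_permutation = 0   # field T1301: maximum permutation number, initially 0
        self.results = {}          # field T1311 (number) -> field T1312 (recognition result)

    def next_number(self) -> int:
        """Read the current maximum, add 1, record the new maximum, and return it."""
        self.max_permutation += 1
        return self.max_permutation

    def store_result(self, number: int, text: str) -> None:
        """File a speech recognition result under its permutation number."""
        self.results[number] = text
```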
  • The recognition result aggregating unit 15 reads from the permutation number storage unit 13 the speech recognition results received by the recognition request transmission/reception unit 14 from the speech recognition server 3, together with the permutation numbers associated with them, arranges the speech recognition results in permutation number order, and creates a recognition result sentence composed of a certain number of words or syllables.
  • the recognition result aggregating unit 15 periodically searches the permutation number storage unit 13 to determine whether a predetermined number or more of speech recognition results are stored from the smallest permutation number.
  • If they are, the recognition result sentence is created by connecting those voice recognition results in order, and the created recognition result sentence is output for display as the recognition result.
  • the recognition result aggregation means 15 deletes the connected speech recognition result and the record of the permutation number from the data stored in the permutation number storage means 13.
  • the number of speech recognition results for confirming a sentence may be fixedly set in the recognition result aggregating unit 15 or may be freely set by a setting file or the like.
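  • The aggregation step can be sketched as follows, reusing the PermutationStore sketch above. RESULTS_PER_SENTENCE corresponds to the configurable number of consecutive results needed to confirm a sentence (three in the first embodiment); the polling interval is an illustrative assumption.

```python
import time

RESULTS_PER_SENTENCE = 3   # consecutive results needed to confirm one sentence (embodiment)
POLL_INTERVAL_SEC = 0.2    # illustrative polling interval

def try_aggregate(store, count=RESULTS_PER_SENTENCE):
    """Join `count` consecutive results starting from the smallest stored
    permutation number, delete those records, and return the sentence;
    return None if the results are not yet stored consecutively."""
    if not store.results:
        return None
    start = min(store.results)
    wanted = [start + i for i in range(count)]
    if not all(n in store.results for n in wanted):
        return None
    return " ".join(store.results.pop(n) for n in wanted)

def aggregate_loop(store, display):
    """Periodic check corresponding to the recognition result aggregating means 15."""
    while True:
        sentence = try_aggregate(store)
        if sentence is not None:
            display(sentence)
        else:
            time.sleep(POLL_INTERVAL_SEC)
```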
  • When the recognition result display unit 16 receives the recognition result sentence from the recognition result aggregation unit 15, it outputs the recognition result sentence as a character string to a display unit (not shown) so that the viewer 5 can view it.
  • The output may be displayed in a window by a GUI (Graphical User Interface), or written to a file or the like.
  • processing for converting all output sentences into “Hiragana” or “Katakana” may be performed, or processing for converting a part or all of them into Roman characters or the like may be performed.
  • FIG. 4 is a flowchart showing the operation procedure of the speech recognition processing apparatus of this embodiment.
  • Step S01 The voice collecting means 11 continuously receives voice information from one or more speakers 4 from a microphone (not shown) as digital data, and acquires it as continuous information such as stream data.
  • Step S02 The voice dividing unit 12 detects breathing and silent parts in the voice data collected by the voice collecting unit 11, and divides the voice data before and after those parts. Subsequently, the voice dividing means 12 assigns a permutation number to each divided voice data piece, registers the permutation numbers in the permutation number storage means 13, and passes the divided voice data pieces together with their assigned permutation numbers, as pairs, to the recognition request transmission/reception means 14.
  • step S02 shown in FIG. 4 will be described in detail with reference to FIG.
  • Step S0201 The voice dividing means 12 detects breathing and silent parts from the collected voice data.
  • Step S0202 The voice dividing unit 12 divides the voice data before and after the detected breathing and silence parts to create a voice data piece.
  • Step S0203 The voice dividing unit 12 assigns a permutation number to each of the voice data pieces in the divided order. Then, the voice dividing unit 12 acquires the current maximum value of the permutation number from the field T1301 of the permutation number storage unit 13, increments the value by 1, and records it in the field T1301 of the permutation number storage unit 13.
  • Step S0204 The voice dividing unit 12 assigns the permutation number assigned in step S0203 to the divided voice data piece and passes it to the recognition request transmitting / receiving unit 14.
  • Step S03 The recognition request transmission / reception means 14 requests voice recognition by transmitting a plurality of pieces of voice data divided by the voice division means 12 to the voice recognition server 3 asynchronously and in parallel.
  • the recognition request transmission / reception unit 14 has a plurality of communication ports, associates the communication port used for transmission of the recognition request with the permutation number passed from the voice division unit 12, and holds information on the correspondence.
  • Step S04 Upon receiving the speech recognition result from the speech recognition server 3, the recognition request transmission / reception means 14 searches the permutation number in the field T1311 of the permutation number storage means 13 using the permutation number held in Step S03. The speech recognition result is stored in the field T1312 of the record whose value matches.
  • Step S05 The recognition result aggregating unit 15 periodically checks the storage state of speech recognition results in the permutation number storage unit 13, and when speech recognition results are stored consecutively for a certain length, connects those results to create a recognition result sentence.
  • step S05 shown in FIG. 4 will be described in detail with reference to FIG.
  • Step S0501 The recognition result aggregating unit 15 periodically checks the storage state of speech recognition results in the permutation number storage unit 13 and looks for a state in which a predetermined number of speech recognition results are registered consecutively, starting from the smallest permutation number.
  • Step S0502 The recognition result aggregating means 15 acquires a plurality of permutation number (field T1311) and speech recognition result (field T1312) pairs found in step S0501 from the permutation number storage means 13, and performs speech recognition in the order of the permutation numbers. The results are rearranged and the speech recognition results are connected to generate a recognition result sentence.
  • Step S0503 The recognition result aggregating unit 15 deletes the record storing the speech recognition result and the permutation number acquired in step S0502 from the permutation number storage unit 13.
  • Step S0504 The recognition result aggregating unit 15 passes the recognition result sentence generated in step S0502 to the recognition result display unit 16.
  • This completes the detailed description of the operation in step S05.
  • Step S06 The recognition result display means 16 displays the recognition result sentence generated by the recognition result aggregating means 15 so that the viewer 5 can view it.
  • the recognition result display means 16 outputs a recognition result sentence to a display unit (not shown).
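  • Tying the helper sketches above together, the following illustrates one pass through steps S01 to S06; the captured microphone samples (step S01) and the recognize callable are assumed to be supplied by the caller.

```python
def run_once(samples, recognize, display):
    """One pass corresponding to steps S01 to S06 (audio capture assumed done)."""
    store = PermutationStore()
    pieces = split_into_pieces(samples)                     # S02: divide at pauses
    numbered = [(store.next_number(), p) for p in pieces]   # S02: number in input order
    send_requests(numbered, recognize, store.results)       # S03-S04: parallel requests
    while True:                                             # S05-S06: assemble and display
        sentence = try_aggregate(store)
        if sentence is None:
            break
        display(sentence)
```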
  • the speech recognition processing method by the speech recognition processing apparatus of the present embodiment will be specifically described using examples. Detailed description of the same configuration as that shown in FIG. 1 is omitted.
  • FIG. 7 is a block diagram showing a configuration example of the voice recognition processing apparatus of the present embodiment.
  • the voice recognition processing apparatus 1 of the present embodiment has a configuration in which a program for executing the above-described voice recognition processing method is stored in advance in a memory (not shown) in the control unit 30 in a general PC.
  • a microphone 21 is connected to the voice sampling means 11 of the voice recognition processing device 1 as a device for inputting voice.
  • a display unit 22 is connected to the recognition result display means 16 of the speech recognition processing device 1 as a device for displaying the recognition result sentence.
  • the network 6 is a network including the Internet.
  • the voice recognition processing device 1 and the voice recognition server 3 use TCP (Transmission Control Protocol) / IP (Internet Protocol) as a communication protocol.
  • Each of the speech recognition processing device 1 and the speech recognition server 3 stores in advance the terminal identification information of its own device and of the counterpart device.
  • the recognition request transmission / reception means 14 has five communication ports that can simultaneously transmit a voice recognition request.
  • the recognition result aggregating unit 15 creates a recognition result sentence when three voice recognition results are continuously obtained.
  • In the example sentence used below, a comma indicates breathing in the actual utterance, and a period indicates a silent part.
  • Step S01 The voice sampling means 11 continuously receives voice data from the microphone 21 as digital data, and obtains it as stream data.
  • Step S02 The voice dividing unit 12 detects breathing and silent parts in the collected voice data ("Today is sunny."), and divides the voice data before and after them. Subsequently, the voice dividing means 12 assigns a permutation number to each divided voice data piece, registers the permutation numbers in the permutation number storage means 13, and passes the divided voice data pieces together with their assigned permutation numbers, as pairs, to the recognition request transmission/reception means 14.
  • Step S02 will now be described in detail with reference to FIG. 5.
  • Step S0201 The voice dividing means 12 detects breathing and silent parts from the collected voice data ("Today is sunny."). In this embodiment, these correspond to the punctuation marks of the sentence representing the voice data. As the detection method, when the level of the voice data in the 200 Hz to 4 kHz band stays below 60 decibels for 0.5 seconds or more, that part is determined to be breathing or silence.
  • Step S0202 The voice dividing unit 12 divides the voice data before and after the detected breathing and silent parts to create voice data pieces. In this embodiment, the data is divided into a voice data piece "Today", a voice data piece "sunny", and a voice data piece "is".
  • Step S0203 The voice dividing unit 12 assigns a permutation number to each of the voice data pieces in the divided order. Then, the voice dividing unit 12 acquires the current maximum value of the permutation number from the field T1301 of the permutation number storage unit 13, increments the value by 1, and records it in the field T1301 of the permutation number storage unit 13.
  • Step S0204 The voice dividing means 12 assigns the permutation number numbered in step S0203 to the divided voice data piece, passes the pair to the recognition request transmission/reception means 14, and records the permutation number information in the permutation number storage means 13.
  • In this embodiment, the voice division means 12 assigns permutation number 1 to the voice data piece "Today", permutation number 2 to the voice data piece "sunny", and permutation number 3 to the voice data piece "is".
  • the state of the permutation number storage means 13 at this time is shown in FIG.
  • Step S03 The recognition request transmission / reception means 14 requests the voice recognition by transmitting a plurality of pieces of voice data divided by the voice dividing means 12 to the voice recognition server 3 asynchronously and in parallel.
  • the recognition request transmission / reception means 14 has five communication ports, and holds the communication port used for transmission of the recognition request and the permutation number passed from the voice division means 12 as shown in FIG. .
  • FIG. 9 is a diagram showing a state in which a recognition request is sent from the voice recognition processing device to the voice recognition server for each communication port.
  • ports 1 to 3 represent communication port numbers, and “port 1: permutation number 1” means that permutation number 1 is held in association with communication port 1.
  • Communication ports 4 and 5 are omitted from the figure. Since the speech recognition processing device 1 is realized by executing a program on the PC, each communication port and its corresponding permutation number are recorded in memory (not shown) allocated by the PC.
  • Step S04 The recognition request transmission / reception means 14 receives the speech recognition result from the speech recognition server 3 as shown in FIG.
  • FIG. 10 is a diagram illustrating a state in which the recognition result is returned from the voice recognition server to the voice recognition processing apparatus for each communication port. Comparing FIG. 9 and FIG. 10, it can be seen that the recognition result corresponding to the recognition request is returned from the voice recognition server 3 to the same communication port.
  • The recognition request transmission/reception means 14 then searches field T1311 of the permutation number storage means 13 using the permutation number held in step S03, and stores the speech recognition result in field T1312 of the record whose permutation number matches, as shown in FIG. 11.
  • the order of arrival of the speech recognition results is the order of permutation number 2, permutation number 3, and permutation number 1, and the speech recognition results are stored in the permutation number storage means 13 in that order.
  • Step S05 The recognition result aggregating unit 15 periodically checks the storage state of speech recognition results in the permutation number storage unit 13 and finds a data string in which three speech recognition results are stored consecutively. The recognition result aggregating unit 15 then connects the results to create the recognition result sentence "Today is sunny". When joining them, the recognition result aggregation means 15 inserts a blank character between the speech recognition results.
  • The operation of step S05 will be described in detail with reference to FIG. 6.
  • Step S0501 The recognition result aggregating means 15 periodically searches the storage state of speech recognition results in the permutation number storage means 13 and finds a state in which three consecutive speech recognition results are registered starting from the smallest permutation number, that is, the records with permutation numbers 1, 2, and 3.
  • Step S0502 The recognition result aggregating unit 15 acquires a plurality of permutation number (field T1311) and speech recognition result (field T1312) pairs found in step S0501 from the permutation number storage unit 13.
  • In this example, the recognition result aggregating unit 15 acquires "Today" from the record with permutation number 1, "sunny" from the record with permutation number 2, and "is" from the record with permutation number 3. Thereafter, the recognition result aggregating unit 15 rearranges the speech recognition results according to the order of the permutation numbers and generates the recognition result sentence "Today is sunny" by connecting them, inserting a space between each pair of speech recognition results.
  • Step S0503 The recognition result aggregating unit 15 deletes the record storing the speech recognition result and the permutation number acquired in step S0502 from the permutation number storage unit 13. In this case, the records with permutation numbers 1, 2, and 3 correspond.
  • Step S0504 The recognition result aggregating unit 15 passes the recognition result sentence "Today is sunny" generated in step S0502 to the recognition result display unit 16.
  • This completes the detailed description of the operation of step S05.
  • Step S06 The recognition result display means 16 outputs the recognition result sentence "Today is sunny" generated by the recognition result aggregation means 15 to the result display area 2201 of the display unit 22, as shown in FIG. 12, so that the user can view it.
  • FIG. 12 is an example of a display screen.
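  • Replaying this worked example with the sketches above gives the following usage example. For readability the pieces follow English word order here (piece 1 "Today", piece 2 "is", piece 3 "sunny"), whereas the figures follow the Japanese word order of the original utterance; the results are stored in the arrival order 2, 3, 1 but assembled in permutation number order.

```python
store = PermutationStore()
for _ in range(3):
    store.next_number()          # permutation numbers 1, 2 and 3 assigned at division time

store.store_result(2, "is")      # result for permutation number 2 arrives first
store.store_result(3, "sunny")   # result for permutation number 3 arrives second
store.store_result(1, "Today")   # result for permutation number 1 arrives last

print(try_aggregate(store))      # -> "Today is sunny"
```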
  • In steps S03 to S05 of the first embodiment, the speech recognition result for a recognition request sent later may be delivered before the result for a recognition request sent earlier. A method for processing the next recognition request in this case is described below.
  • While port 1 is still waiting for its speech recognition result, the recognition request transmission/reception means 14 assigns permutation number 2 to the result received through port 2 and permutation number 3 to the result received through port 3, and stores each speech recognition result together with its permutation number in the permutation number storage means 13.
  • When the recognition request transmission/reception means 14 receives the next pair of voice data piece and permutation number to be recognized from the voice division means 12, it associates that pair with a communication port that is not waiting for a recognition result.
  • the recognition request transmitting / receiving unit 14 associates each of the four voice data pieces with each of the ports 2 to 5. That is, the recognition request transmission / reception means 14 sequentially distributes the next speech data pieces to be recognized to the unused ports 2 to 5 without waiting for the reception of the speech recognition result of the port 1.
  • In this way, each speech data piece to be recognized next is associated in turn with a communication port that is not in use, so information processing can be performed efficiently.
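  • The port reuse described above can be sketched as follows; the Port class and the selection helper are illustrative names, not taken from this description. A piece that arrives while port 1 is still waiting is simply handed to any idle port.

```python
class Port:
    """Illustrative model of one communication port of the transmission/reception means 14."""

    def __init__(self, port_id: int):
        self.port_id = port_id
        self.pending = None   # permutation number the port is waiting on, or None if idle

def pick_idle_port(ports):
    """Return a port that is not waiting for a recognition result, if any."""
    for port in ports:
        if port.pending is None:
            return port
    return None

ports = [Port(i) for i in range(1, 6)]   # five preset communication ports
ports[0].pending = 1                     # port 1 still waiting on permutation number 1

for number in (4, 5, 6, 7):              # the next voice data pieces to be recognized
    port = pick_idle_port(ports)
    if port is not None:
        port.pending = number            # hand the piece to this idle port and send it
```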
  • As described above, in the present embodiment, the voice data of a long conversational passage is divided into word-level voice data pieces that the cloud-type voice recognition service can recognize, each voice data piece is recognized using the cloud-type voice recognition service, and the obtained speech recognition results are arranged in their original order to output a recognition result sentence for the long conversation. Therefore, a user can obtain speech recognition results that convert a long conversation composed of a plurality of sentences into character information, without storing the large amount of data necessary for recognition processing on a terminal device such as a PC, smartphone, or tablet.
  • The speech recognition processing device of the present invention has been described above in concrete terms for ease of understanding, but the speech recognition processing device may also be an information processing device configured as shown in FIG. 13.
  • FIG. 13 is a block diagram showing another configuration example of the speech recognition processing apparatus of the present embodiment.
  • the speech recognition processing apparatus includes a storage unit 33, a communication unit 34, and a control unit 30.
  • Each of the communication unit 34 and the storage unit 33 illustrated in FIG. 13 corresponds to the recognition request transmission / reception unit 14 and the permutation number storage unit 13 illustrated in FIG. 1.
  • a program for causing a computer to execute the speech recognition processing method described in this embodiment may be stored in a computer-readable recording medium. In this case, by installing the program from the recording medium into another information processing apparatus, it is possible to cause the other information processing apparatus to execute the above information processing method.
  • a user can obtain a speech recognition result of a long conversation sentence without storing a large amount of data necessary for speech recognition processing in his / her terminal device.
  • The present invention can be applied, for example, to supporting communication by displaying what a speaker says as text in situations where a person with a hearing impairment needs to follow the surrounding conversation in everyday life. Moreover, by replacing the speech recognition processing with translation processing, the present invention can be applied to supporting communication with speakers of other languages.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present invention relates to a speech recognition processing device that comprises: voice collecting means (11) for acquiring input speech as speech data; voice dividing means (12) for dividing the speech data into a plurality of speech data pieces and assigning a permutation number to each speech data piece in input order; storage means (13) for storing the permutation numbers; recognition request transmission/reception means (14) for distributing the speech data pieces to a plurality of preset communication ports while associating the permutation numbers with them and transmitting the speech data pieces to a speech recognition server via a network, and, upon receiving speech recognition results from the speech recognition server through the communication ports, assigning the permutation numbers associated with the communication ports to the received speech recognition results and storing the speech recognition results in the storage means together with the matching permutation numbers; recognition result aggregating means (15) for generating a recognition result sentence composed of the speech recognition results stored together with the permutation numbers, arranged in permutation number order; and display means (16) for displaying the generated recognition result sentence.
PCT/JP2015/086000 2015-02-10 2015-12-24 Speech recognition processing device, speech recognition processing method, and program WO2016129188A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2016574636A JP6429294B2 (ja) 2015-02-10 2015-12-24 Speech recognition processing device, speech recognition processing method, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2015023836 2015-02-10
JP2015-023836 2015-02-10

Publications (1)

Publication Number Publication Date
WO2016129188A1 true WO2016129188A1 (fr) 2016-08-18

Family

ID=56614333

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/086000 WO2016129188A1 (fr) 2015-12-24 2015-02-10 Speech recognition processing device, speech recognition processing method, and program

Country Status (2)

Country Link
JP (1) JP6429294B2 (fr)
WO (1) WO2016129188A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006011066A (ja) * 2004-06-25 2006-01-12 Nec Corp 音声認識/合成システム、同期制御方法、同期制御プログラム、および同期制御装置
JP2008107624A (ja) * 2006-10-26 2008-05-08 Kddi Corp 文字起こしシステム
JP2014056258A (ja) * 2008-08-29 2014-03-27 Mmodal Ip Llc 片方向通信を使用する分散型音声認識
JP2012190088A (ja) * 2011-03-09 2012-10-04 Nec Corp 音声記録装置、方法及びプログラム
JP2013015726A (ja) * 2011-07-05 2013-01-24 Yamaha Corp 音声記録サーバ装置及び音声記録システム

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019090917A (ja) * 2017-11-14 2019-06-13 株式会社情報環境デザイン研究所 音声テキスト化装置、方法、及びコンピュータプログラム
JP2020184007A (ja) * 2019-05-07 2020-11-12 株式会社チェンジ 情報処理装置、音声テキスト化システム、音声テキスト化方法および音声テキスト化プログラム
CN113053380A (zh) * 2021-03-29 2021-06-29 海信电子科技(武汉)有限公司 服务器及语音识别方法
CN113053380B (zh) * 2021-03-29 2023-12-01 海信电子科技(武汉)有限公司 服务器及语音识别方法

Also Published As

Publication number Publication date
JPWO2016129188A1 (ja) 2017-11-09
JP6429294B2 (ja) 2018-11-28

Similar Documents

Publication Publication Date Title
CN112115706B (zh) 文本处理方法、装置、电子设备及介质
EP3271917B1 (fr) Communication de métadonnées identifiant un orateur actuel
JP6771805B2 (ja) 音声認識方法、電子機器、及びコンピュータ記憶媒体
CN105895103B (zh) 一种语音识别方法及装置
JP6327848B2 (ja) コミュニケーション支援装置、コミュニケーション支援方法およびプログラム
CN108683937A (zh) 智能电视的语音交互反馈方法、系统及计算机可读介质
JP2015176099A (ja) 対話システム構築支援装置、方法、及びプログラム
US9196253B2 (en) Information processing apparatus for associating speaker identification information to speech data
JP2018045001A (ja) 音声認識システム、情報処理装置、プログラム、音声認識方法
JP6622165B2 (ja) 対話ログ分析装置、対話ログ分析方法およびプログラム
CN106713111B (zh) 一种添加好友的处理方法、终端及服务器
JP6429294B2 (ja) 音声認識処理装置、音声認識処理方法およびプログラム
CN110119514A (zh) 信息的即时翻译方法、装置和系统
CN110232921A (zh) 基于生活服务的语音操作方法、装置、智能电视及系统
WO2019123854A1 (fr) Dispositif de traduction, procédé de traduction et programme
CN114168710A (zh) 一种会议记录的生成方法、装置、系统、设备及存储介质
JP6507010B2 (ja) ビデオ会議システムと音声認識技術を組み合わせた装置および方法
WO2018198807A1 (fr) Dispositif de traduction
CN106873798B (zh) 用于输出信息的方法和装置
JP2018055022A (ja) 音声認識システム、情報処理装置、プログラム
KR20160131730A (ko) 자연어 처리 시스템, 자연어 처리 장치, 자연어 처리 방법 및 컴퓨터 판독가능 기록매체
JP2004348552A (ja) 音声文書検索装置および方法およびプログラム
US20200243092A1 (en) Information processing device, information processing system, and computer program product
CN112632241A (zh) 智能会话的方法、装置、设备和计算机可读介质
CN111582708A (zh) 医疗信息的检测方法、系统、电子设备及计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15882061

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016574636

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15882061

Country of ref document: EP

Kind code of ref document: A1