WO2019054199A1 - Information processing device and information processing method - Google Patents
Information processing device and information processing method
- Publication number
- WO2019054199A1 (PCT/JP2018/032323)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- token
- content
- information processing
- voice
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/018—Audio watermarking, i.e. embedding inaudible data in the audio signal
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/043—Distributed expert systems; Blackboards
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04H—BROADCAST COMMUNICATION
- H04H20/00—Arrangements for broadcast or for distribution combined with broadcast
- H04H20/28—Arrangements for simultaneous broadcast of plural pieces of information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Definitions
- the present technology relates to an information processing apparatus and an information processing method, and more particularly to an information processing apparatus and an information processing method capable of improving the convenience of an audio AI assistance service used in cooperation with content.
- voice AI assistance services are rapidly spreading. For example, by using the voice AI assistance service, when the end user asks "Where am I?", the answer "You are in Central Park." is returned based on the current position of the end user (see, for example, Patent Document 1).
- the present technology has been made in view of such a situation, and is intended to improve the convenience of the audio AI assistance service used in cooperation with content.
- An information processing apparatus according to a first aspect of the present technology is an information processing apparatus including an insertion unit that inserts a token related to use of an audio AI assistance service linked to content into an audio stream of the content.
- the information processing apparatus may be an independent apparatus or an internal block constituting one apparatus. Further, an information processing method according to a first aspect of the present technology is an information processing method corresponding to the above-described information processing apparatus according to the first aspect of the present technology.
- a token relating to the use of the audio AI assistance service linked to the content is inserted into the audio stream of the content.
- An information processing apparatus according to a second aspect of the present technology is an information processing apparatus including a detection unit that detects, from an audio stream of content, a token related to use of an audio AI assistance service linked to the content.
- the information processing apparatus may be an independent apparatus or an internal block constituting one apparatus. Further, an information processing method according to a second aspect of the present technology is an information processing method corresponding to the above-described information processing apparatus according to the second aspect of the present technology.
- a token relating to the use of the audio AI assistance service linked to the content is detected from the audio stream of the content.
- voice AI assistance services are rapidly spreading.
- In this type of service, speech recognition is performed on audio data detected or collected by an application running on a device having a voice detection and sound collection function (for example, a smart speaker) or a mobile device having a microphone function (for example, a smartphone or a tablet computer). Then, based on the speech recognition result obtained in this way, the service answers the end user's question or the like.
- A representative example of this type of voice AI assistance service is Alexa (registered trademark) provided on Amazon Echo (registered trademark), consisting of the Alexa Voice Service (AVS) and the Alexa Skills Kit (ASK).
- In the Alexa Skills Kit, the part that is actually executed, that is, a package or API (Application Programming Interface) group that defines, for example, what kind of speech to respond to, which word is used as a parameter, which function to execute, and how to return the resulting answer to Alexa, is called a skill.
- For example, an end user speaks the following words toward a local device having voice detection and sound collection functions, such as a smart speaker: "Alexa, ask Anime Facts for a fact".
- The first word, "Alexa", is called the wake word; when the microphone of the local device detects this word, the device starts communication with the server on the cloud side, and the subsequent words are sent to the cloud server as audio data.
- The next word, "ask", is called a launch phrase, and tells the cloud server that what follows is the skill name. In this example, "Anime Facts" is the skill name.
- Besides "ask", the launch phrase can be, for example, "tell", "launch", "load", "begin", "open", "start", and so on; these launch-phrase words therefore cannot be used as skill names.
- Another method is to use a conjunction to indicate the skill name. For example, even when saying "Alexa, can you give me a fact from Anime Facts", recognizing the word "from" makes it possible to determine that "Anime Facts" is the skill name.
- The final "for a fact" is called the utterance, and the Alexa Skills Kit establishes the correspondence between an utterance and the process, procedure, or function to be actually executed. That is, when "for a fact" is said, the server on the cloud side determines which process, procedure, or function the utterance "for a fact" should be connected to.
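The invocation structure just described (wake word, launch phrase, skill name, utterance) can be illustrated with a small parser. This is a hypothetical sketch fitted to the single example phrase in the text, not the actual Alexa Voice Service logic; the constant names and the heuristic that the utterance starts at "for" are assumptions.

```python
# Hypothetical sketch of splitting an Alexa-style invocation into its parts.
# WAKE_WORD, LAUNCH_PHRASES, and the "for"-based boundary are assumptions
# fitted to the example phrase, not real AVS behavior.
WAKE_WORD = "Alexa"
LAUNCH_PHRASES = {"ask", "tell", "launch", "load", "begin", "open", "start"}

def parse_invocation(text):
    words = text.replace(",", "").split()
    if not words or words[0] != WAKE_WORD:
        return None  # no wake word: the device ignores the audio entirely
    if len(words) < 3 or words[1].lower() not in LAUNCH_PHRASES:
        return None  # no launch phrase: not a skill invocation
    rest = words[2:]
    if "for" in rest:  # crude boundary: the utterance starts at "for"
        i = rest.index("for")
        skill, utterance = " ".join(rest[:i]), " ".join(rest[i:])
    else:
        skill, utterance = " ".join(rest), ""
    return {"wake_word": WAKE_WORD, "launch_phrase": words[1],
            "skill": skill, "utterance": utterance}

parsed = parse_invocation("Alexa, ask Anime Facts for a fact")
```

For the example phrase, `parsed` separates the skill name "Anime Facts" from the utterance "for a fact", which the cloud server would then map to a function.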
- the present technology makes it possible to improve the convenience of the audio AI assistance service when such a service is used in conjunction with content such as a CM or a program.
- FIG. 1 is a block diagram showing a configuration example of a content / voice AI cooperation system to which the present technology is applied.
- the content-speech AI cooperation system 1 of FIG. 1 is a system for distributing contents, and it is possible to use the speech AI assistance service in cooperation with the distributed contents.
- the content / voice AI cooperation system 1 includes a server device 10, a broadcast system 11, a net distribution system 12, a client device 20, a voice processing device 30, and a server device 40.
- the client device 20 and the voice processing device 30 installed in the viewer's house can be connected, via the Internet 50, to the net distribution system 12 and the server device 40 installed in a data center or the like, and can exchange various data.
- the server device 10 stores contents to be distributed.
- the content to be distributed is, for example, content such as a CM or a program.
- the programs include, for example, dramas, news, shopping channels, animations, sports and the like.
- the server device 10 processes a stream of content to be distributed, and supplies the stream to the broadcast system 11 or the net distribution system 12 according to the content distribution method.
- the broadcast system 11 includes one or more broadcast servers and the like.
- the broadcast system 11 performs processing (for example, modulation processing) according to a predetermined broadcast system on the content supplied from the server device 10, and transmits the resulting data as a broadcast wave from the antenna of a transmission station.
- the net distribution system 12 includes one or more communication servers and the like.
- the net distribution system 12 processes the content supplied from the server device 10 according to a predetermined communication method, and distributes (streams) the resulting data (packets) via the Internet 50.
- the client device 20 is configured as, for example, a fixed receiver such as a television receiver or a personal computer, or a mobile receiver such as a smartphone, a mobile phone, or a tablet computer.
- the client device 20 receives and processes a broadcast wave transmitted from the broadcast system 11, thereby reproducing content and outputting video and audio such as a CM or a program.
- the client device 20 receives and processes data distributed from the Internet distribution system 12 via the Internet 50 to reproduce content, and outputs video and audio such as CMs and programs.
- the voice processing device 30 is, for example, a speaker connectable to a network such as a home LAN (Local Area Network), and is also referred to as a smart speaker or a home agent.
- This type of speaker can function, for example, as a user interface of an audio AI assistance service, or can perform audio operations on devices such as lighting fixtures and air conditioners, in addition to music reproduction.
- the voice processing device 30 can provide the voice AI assistance service to the end user (the viewer of the content) by working alone or in cooperation with the server device 40 on the cloud side.
- here, the voice AI assistance service means, for example, a function or service that appropriately answers or acts on an end user's question or request by combining processing such as voice recognition processing and natural language analysis processing.
- the functions for realizing the voice AI assistance service are, for example, a sound collection module and a voice recognition module; all of these functions may be implemented in the voice processing device 30 on the local side, or some of the functions may be implemented in the server device 40 on the cloud side.
- the server device 40 is installed in a data center or the like, and has a function for providing a voice AI assistance service, various databases, and the like. In response to the request from the voice processing device 30, the server device 40 performs processing regarding the voice AI assistance service, and returns the processing result to the voice processing device 30 via the Internet 50.
- the content / voice AI cooperation system 1 is configured as described above.
- in FIG. 1, one client device 20 (for example, a television receiver) and one voice processing device 30 (for example, a smart speaker) are shown, but the client device 20 and the voice processing device 30 can be installed at each viewer's home. Further, in the viewer's home, the client device 20 and the voice processing device 30 are assumed to be installed in the same room, but may be installed in different rooms.
- similarly, one server device 10 and one server device 40 are shown, but a plurality of these server devices may be provided, for example, for each function or for each business operator.
- in the following description, the client device 20 is described as being provided on the reception side (viewer side) with respect to the server device 10 provided on the transmission side (broadcast station side), and the voice processing device 30 as being provided on the local side with respect to the server device 40 provided on the cloud side.
- for example, suppose that a CM of the hamburger chain XYZ, reproduced by a client device 20 such as a television receiver, intentionally includes in its audio a voice message that supplements the contents of the CM, such as "Service A, ask Hamburger restaurant XYZ 'What's XYZ Burger'", forcing the voice AI assistance service to answer this question.
- the question is not limited to being spoken in the audio of the CM itself; for example, it may also be asked by an application provided via broadcasting in association with the CM.
- "intentionally" here means that there is no consent of the viewer.
- in either case, the voice AI assistance service is given the question (the way of speaking) without the viewer asking for it.
- note that the contents of the CM are assumed to have been approved by some authority, censorship institution, or the like.
- in this case, the voice AI assistance service may explain in detail information that the viewer does not want to know, which is likely to be an unwanted intrusion. In addition, the viewer is likely to be annoyed if the viewer's profile information is stored as though the viewer were interested in the content of this CM.
- for this reason, it may be desirable to limit the audio AI assistance service so that it responds only to questions actually uttered by the end user.
- as a coping method in such a case, there is, for example, a method of pre-registering the voice model of the end user and identifying the uttering user of a conversation (identifying questions from the user whose voice is to be recognized).
- for a voice AI assistance service without such a speaker identification function, a conceivable method is to manage, as a blacklist, a list of questions that should not be reacted to (for example, a list of text strings): the speech of the CM is recognized, and when the recognized question is included in the blacklist, it is not processed.
- however, the blacklist to be managed may become huge, and holding the blacklist for a certain period, or indefinitely, while immediately performing matching evaluation on every question (for example, a real-time database search) is not realistic.
- here, the holding period of the blacklist means, for example, the period during which the question may be thrown at the service by an end user.
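The blacklist approach described above, which the text rejects as unrealistic, could be sketched as follows. All names and the single list entry are illustrative assumptions; the point is that a real deployment would need one entry per prohibited question in every CM ever aired, held for as long as viewers might repeat them, which is exactly the scaling problem noted above.

```python
# Illustrative sketch of the rejected blacklist approach: hold the set of
# questions the service must not react to, and drop any recognized text
# that matches. The entry below is the example question from the text.
blacklist = {
    "service a, ask hamburger restaurant xyz what's xyz burger",
}

def should_discard(recognized_text):
    # Normalize and look up the recognized question in the blacklist.
    return recognized_text.strip().lower() in blacklist
```

Even this exact-match lookup is optimistic: recognition output varies in wording, so real matching would be fuzzier and costlier, and the set itself only ever grows.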
- note that the above-mentioned method of identifying the uttering user of a conversation is implemented, for example, by Google Home (registered trademark), which is another representative example of the voice AI assistance service.
- therefore, in the present technology, it is proposed to insert a token for prohibiting or permitting speech recognition processing by the speech AI assistance service linked to the content into the audio stream of the content as an audio watermark.
- then, a watermark detection function is implemented in the voice processing device 30 provided on the local side as the sound collection device for the audio AI assistance service, or in the server device 40 provided on the cloud side that analyzes the collected audio stream.
- any method for the audio watermark may be used as long as a necessary and sufficient token can be superimposed on the target audio stream.
- FIG. 2 is a diagram showing an example of a speech recognition process prohibition token embedded as an audio watermark in a baseband audio stream.
- it is assumed that the audio stream of a CM or program contains audio that, after the speech recognition processing of the speech AI assistance service is performed, should not be passed on to subsequent processing as a valid speech recognition result.
- in this case, the server apparatus 10 on the transmitting side decodes the audio stream into a baseband audio stream, and the audio WM insertion module inserts the token generated by the token generator (the speech recognition process prohibition token) into the baseband audio stream as an audio watermark.
- the speech recognition process prohibition token inserted as an audio watermark can be inserted not only by the transmitting server apparatus 10 but also by the receiving client apparatus 20; therefore, the cases where the audio watermark is inserted on the transmission side and on the reception side are described below.
- FIG. 3 is a block diagram showing a first example of the configuration of the content-voice AI cooperation system 1 according to the first embodiment.
- the content / voice AI cooperation system 1 of FIG. 3 includes a server device 10A, a client device 20A, and a voice processing device 30A.
- in FIG. 3, processing on the audio stream is mainly described, but processing on the video stream is also performed in the server device 10A, the client device 20A, and the like.
- the server device 10A includes a CM / program bank 101, an audio decoder 102, a token generator 103, an audio WM insertion module 104, and an audio encoder 105.
- the CM / program bank 101 stores a large number of contents such as CMs and programs.
- of the streams of CMs and programs to be distributed (hereinafter referred to as CM / program streams), the CM / program bank 101 supplies the audio stream of a CM or program (hereinafter referred to as a CM / program audio stream) to the audio decoder 102.
- the audio obtained from the CM / program audio stream may include audio for which the speech recognition process should be prohibited.
- the audio decoder 102 decodes the CM / program audio stream supplied from the CM / program bank 101, and supplies the baseband CM / program audio stream obtained as a result of the decoding to the audio WM insertion module 104.
- the token generator 103 generates a speech recognition process prohibition token based on the token generation data and supplies the token to the audio WM insertion module 104. Also, the speech recognition process prohibition token is notified to the audio WM detection module 302 of the speech processing device 30A.
- the token generation data is, for example, data for generating a token that prevents the service from reacting to the question even if a specific voice flows in the CM of the hamburger chain XYZ; it is determined by, for example, the operator of the voice AI assistance service or another business operator.
- as the method of notifying the speech recognition process prohibition token, in addition to notification via communication over the Internet 50, notification via broadcast, or provision via a recording medium such as a semiconductor memory or an optical disc, for example, may be used. In short, it is sufficient that the speech recognition process prohibition token generated by the token generator 103 is notified to the audio WM detection module 302 of the speech processing device 30A, and the method of notification is arbitrary.
- the audio WM insertion module 104 inserts (encodes), as an audio watermark, the speech recognition process prohibition token supplied from the token generator 103 into the baseband CM / program audio stream supplied from the audio decoder 102, and supplies the result to the audio encoder 105.
- the audio encoder 105 encodes a baseband CM / program audio stream supplied from the audio WM insertion module 104 (a stream in which a speech recognition process prohibition token is inserted as an audio watermark on the transmitting side).
- the server device 10A sends the CM / program audio stream obtained as a result of the encoding by the audio encoder 105 to the broadcast system 11 or the net delivery system 12 according to the delivery method of the content.
- the broadcast system 11 processes the CM / program stream sent from the server device 10A (a stream in which the speech recognition process prohibition token is inserted as an audio watermark on the transmission side), and transmits the data obtained as a result of the processing as a broadcast wave.
- the net distribution system 12 processes the CM / program stream sent from the server device 10A (a stream in which the speech recognition process prohibition token is inserted as an audio watermark on the transmission side), and distributes the data (packets) obtained as a result of the processing via the Internet 50.
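The transmission-side flow just described (decode to a baseband stream, insert the prohibition token as an audio watermark, re-encode) can be sketched with a toy watermark. Real audio watermarking superimposes the token inaudibly by methods the text deliberately leaves open; here the token bits simply overwrite the least significant bits of PCM samples, purely to illustrate the insert/detect round trip, and all names and the token value are invented.

```python
# Toy stand-in for audio watermark insertion and detection. The token bits
# are written into sample LSBs; a real scheme would hide them inaudibly.
PROHIBIT_TOKEN = b"NO_ASR"  # assumed token value, not from the patent

def embed_token(samples, token=PROHIBIT_TOKEN):
    # Spread the token's bits, LSB-first per byte, over the first samples.
    bits = [(byte >> i) & 1 for byte in token for i in range(8)]
    out = list(samples)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit  # overwrite the sample's LSB
    return out

def extract_token(samples, length=len(PROHIBIT_TOKEN)):
    # Read the LSBs back and reassemble the token bytes.
    bits = [s & 1 for s in samples[: length * 8]]
    data = bytearray()
    for i in range(length):
        byte = 0
        for j in range(8):
            byte |= bits[i * 8 + j] << j
        data.append(byte)
    return bytes(data)

pcm = [100] * 64          # stands in for baseband audio samples
marked = embed_token(pcm)  # what the audio WM insertion module produces
```

The detector on the receiving side recovers `PROHIBIT_TOKEN` from `marked` while the remaining samples are untouched, mirroring the requirement that any watermarking method is acceptable as long as a necessary and sufficient token survives the audio path.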
- the client device 20A receives the CM / program stream distributed by the broadcast system 11 or the net distribution system 12.
- the client device 20A is configured to include an audio decoder 201 and an audio speaker 202.
- the audio decoder 201 decodes the CM / program audio stream received from the broadcast system 11 or the net distribution system 12, and supplies the resultant baseband CM / program audio stream to the audio speaker 202.
- the audio speaker 202 outputs audio corresponding to the baseband CM / program audio stream supplied from the audio decoder 201.
- although only the CM / program audio stream is described here to simplify the description, in the client device 20A the CM / program video stream is also decoded by a video decoder, and the video of the CM or program corresponding to the baseband CM / program video stream is displayed on the display.
- the speech processing device 30A includes a sound collection module 301, an audio WM detection module 302, and a speech recognition module 303 as the function of the speech AI assistance service. Also, the sound collection module 301 includes an audio microphone 311.
- the audio microphone 311 of the sound collection module 301, the audio WM detection module 302, and the voice recognition module 303 constitute a processing sequence that responds to the voice input from the client device 20A or the voice input from the viewer 2.
- the audio WM detection module 302 holds, in advance, the speech recognition process prohibition token notified from (the token generator 103 of) the server device 10A.
- the audio microphone 311 picks up the sound output from the audio speaker 202 of the client device 20A, and supplies the resultant audio stream to the audio WM detection module 302 and the voice recognition module 303.
- the audio WM detection module 302 detects the audio watermark inserted in the audio stream supplied from the audio microphone 311, and determines whether the voice recognition process prohibition token notified from the server device 10A has been inserted as the audio watermark.
- the speech recognition module 303 performs speech recognition processing on the audio stream supplied from the audio microphone 311.
- when the audio WM detection module 302 determines that the speech recognition process prohibition token is not inserted, the speech recognition module 303 supplies the speech recognition result to the subsequent processing unit that performs the subsequent processing.
- the subsequent processing unit performs subsequent processing related to the voice AI assistance service based on the voice recognition result supplied from the voice recognition module 303.
- on the other hand, when the audio WM detection module 302 determines that the speech recognition process prohibition token is inserted, the voice recognition module 303 does not pass the voice recognition result to the subsequent processing unit.
- the audio microphone 311 also picks up the voice of the viewer 2's speech, and supplies the audio stream obtained as a result to the audio WM detection module 302 and the voice recognition module 303.
- the audio WM detection module 302 detects the audio watermark inserted in the audio stream supplied from the audio microphone 311, and determines whether the voice recognition process prohibition token notified from the server device 10A has been inserted as the audio watermark.
- since no audio watermark is inserted in the voice of the viewer 2's speech, the audio WM detection module 302 always determines that the speech recognition process prohibition token is not inserted.
- the speech recognition module 303 performs speech recognition processing on the audio stream supplied from the audio microphone 311.
- since the audio WM detection module 302 always determines that the speech recognition process prohibition token is not inserted, the speech recognition module 303 supplies the speech recognition result to the subsequent processing unit that performs the subsequent processing. Therefore, the subsequent processing unit always performs the subsequent processing related to the voice AI assistance service based on the voice recognition result supplied from the voice recognition module 303.
- although the description here assumes, for convenience, that the voice processing device 30A on the local side performs all the processing of the voice AI assistance service, a part of the processing may be performed by the server device 40 on the cloud side.
- in that case, the voice AI assistance service is realized by cooperation of the voice processing device 30A on the local side and the server device 40 on the cloud side.
- basically, one type of speech recognition process prohibition token is sufficient (for example, a token for preventing reaction to the question even if a specific voice flows in the CM of the hamburger chain XYZ), but several types may be defined and used as needed.
- FIG. 4 is a flowchart showing the flow of processing on the transmission side executed by the server device 10A and the broadcast system 11 or the net distribution system 12.
- FIG. 5 is a flowchart showing the flow of processing on the receiving side executed by the client device 20A and the voice processing device 30A.
- in step S101, the CM / program bank 101 sends out the stored CM / program stream. Of the CM / program stream, the CM / program audio stream is sent to the audio decoder 102.
- step S102 the token generator 103 generates a speech recognition process prohibition token based on the token generation data.
- as the speech recognition process prohibition token, for example, a token is generated that prevents reaction even if the voice "Service A, ask Hamburger restaurant XYZ 'What's XYZ Burger'" flows in the CM of the hamburger chain XYZ.
- the speech recognition process prohibition token is notified in advance to the audio WM detection module 302 of the speech processing device 30A via communication or the like.
- step S103 the audio decoder 102 decodes the CM / program audio stream transmitted in the process of step S101. As a result of this decoding, a baseband CM / program audio stream is obtained.
- in step S104, the audio WM insertion module 104 inserts (encodes), as an audio watermark, the speech recognition process prohibition token obtained in step S102 into the baseband CM / program audio stream obtained in step S103.
- in step S105, the audio encoder 105 encodes the baseband CM / program audio stream into which the audio watermark was inserted in step S104.
- although only the CM / program audio stream is described here, in the server device 10A, processing such as multiplexing with other streams such as the CM / program video stream is performed as necessary.
- the CM / program stream obtained by the server device 10A (a stream in which the speech recognition process prohibition token has been inserted as an audio watermark on the transmission side) is delivered to the broadcast system 11 or the net distribution system 12 according to the content distribution method.
- in the case of broadcast distribution, the broadcast system 11 processes the CM / program stream sent from the server device 10A and sends out the data obtained as a result of the processing as a broadcast wave.
- in the case of net distribution, the net distribution system 12 processes the CM / program stream sent from the server device 10A and distributes the data obtained as a result of the processing via the Internet 50.
- the CM / program stream distributed by the broadcast system 11 or the net distribution system 12 is received by the client device 20A in FIG.
- the client device 20A the CM / program stream is processed, and the CM / program audio stream is input to the audio decoder 201.
- the client device 20A adjusts the audio output volume of the audio speaker 202 so that the volume output from the audio speaker 202 is sufficient (S201).
- the audio speaker 202 is controlled such that the audio microphone 311 incorporated in the audio processing device 30A can pick up the sound.
- alternatively, the client device 20A may instruct the viewer 2 to adjust (turn up) the volume.
- This instruction may be made, for example, by voice from the audio speaker 202, or a message to that effect may be presented on the screen.
- step S202 the audio decoder 201 decodes the CM / program audio stream. As a result of this decoding, a baseband CM / program audio stream is obtained.
- step S203 the audio speaker 202 outputs audio corresponding to the baseband CM / program audio stream obtained in the process of step S202.
- although only the CM / program audio stream is described here to simplify the description, in the client device 20A the CM / program video stream is also decoded by the video decoder, and the video of the CM or program corresponding to the baseband CM / program video stream is displayed on the display.
- the audio output from the audio speaker 202 of the client device 20A is collected by the audio microphone 311 of the audio processing device 30A.
- an audio stream corresponding to the voice collected by the audio microphone 311 is supplied to the audio WM detection module 302 and the voice recognition module 303. It is assumed that the speech recognition process prohibition token has been notified to the audio WM detection module 302 in advance from the server device 10A via communication or the like.
- step S301 the audio WM detection module 302 detects an audio watermark inserted in the audio stream according to the audio collected by the audio microphone 311 (audio output from the client device 20A).
- step S302 the voice recognition module 303 performs voice recognition processing on an audio stream according to the voice collected by the audio microphone 311 (voice output from the client device 20A).
- In step S303, based on the detection result obtained in the process of step S301, the audio WM detection module 302 determines whether the voice recognition process prohibition token notified from the server device 10A has been inserted as an audio watermark in the audio stream.
- If it is determined in step S303 that no voice recognition process prohibition token is inserted as an audio watermark, the process proceeds to step S304.
- In step S304, in accordance with the determination result of the process of step S303, the speech recognition module 303 passes the speech recognition result obtained in the process of step S302 to the subsequent processing.
- On the other hand, when it is determined in step S303 that a speech recognition process prohibition token is inserted as an audio watermark, the process of step S304 is skipped. That is, in this case, the speech recognition result of the audio stream is regarded as invalid, and the speech recognition result is not passed on to the subsequent processing (the speech recognition result is discarded).
- In this manner, in the voice processing apparatus 30A, when the voice recognition process prohibition token is inserted in the audio stream, the voice recognition result of the audio stream is invalidated.
- When the viewer 2 speaks (S11), the following processing is performed in the voice processing device 30A. That is, the voice of the viewer 2's utterance is collected by the audio microphone 311 of the voice processing device 30A.
- an audio stream corresponding to the voice collected by the audio microphone 311 (a voice of the speech of the viewer 2) is supplied to the audio WM detection module 302 and the voice recognition module 303. It is assumed that the speech recognition process prohibition token has been notified to the audio WM detection module 302 in advance from the server device 10A.
- In step S306, the audio WM detection module 302 detects an audio watermark in the audio stream corresponding to the sound collected by the audio microphone 311.
- Here, since no audio watermark is inserted in the voice of the viewer 2's utterance, the audio WM detection module 302 cannot detect the speech recognition process prohibition token.
- In step S307, the speech recognition module 303 performs speech recognition processing on the audio stream corresponding to the speech collected by the audio microphone 311.
- In step S308, since no speech recognition process prohibition token is inserted in the audio stream, the speech recognition module 303 regards the speech recognition result of the audio stream as valid and passes it to the subsequent processing.
- In this manner, in the voice processing device 30A, when the viewer 2 speaks, the speech recognition process prohibition token is not detected, so the speech recognition result by the speech recognition module 303 is valid and the subsequent processing is performed.
- the flow of the audio AI processing according to the viewer's utterance has been described above.
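The token-gated flow described above (steps S301 to S308 for the TV audio, and S306 to S308 for the viewer's utterance) can be sketched as follows; the detector and the token value are simplified stand-ins, since the actual watermark modulation scheme is not specified at this point.

```python
# Sketch of the token-gated speech recognition flow (steps S301-S308).
# detect_watermark() stands in for the audio WM detection module; a real
# implementation would demodulate the watermark from the audio samples.

PROHIBIT_TOKEN = "speech-recognition-prohibit"  # notified in advance by the server


def detect_watermark(audio_stream):
    # Placeholder: return the token string embedded in the stream, or None.
    return audio_stream.get("watermark")


def recognize_speech(audio_stream):
    # Placeholder for the voice recognition module.
    return audio_stream.get("transcript")


def process_collected_audio(audio_stream, prohibit_token=PROHIBIT_TOKEN):
    """Return the recognition result, or None when it must be discarded."""
    token = detect_watermark(audio_stream)       # step S301 / S306
    result = recognize_speech(audio_stream)      # step S302 / S307
    if token == prohibit_token:                  # step S303
        return None                              # result invalidated (S304 skipped)
    return result                                # passed to subsequent processing


# TV audio carries the prohibition token -> the result is discarded.
tv_audio = {"watermark": PROHIBIT_TOKEN, "transcript": "service A, do something"}
# The viewer's own utterance carries no watermark -> the result is valid.
viewer_audio = {"watermark": None, "transcript": "service A, what's the weather?"}

print(process_collected_audio(tv_audio))      # None
print(process_collected_audio(viewer_audio))  # service A, what's the weather?
```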
- In the case where the transmitting side performs the watermark insertion described above, the token need not be limited to one that forcibly invalidates the processing of the speech recognition result; for example, a token that first confirms the intention of the viewer 2 can also be used. That is, in this case, two types of tokens are prepared: one is a token that forcibly invalidates the processing of the speech recognition result, and the other is a token that asks the viewer 2 whether the processing may proceed, immediately before invalidating the processing of the speech recognition result.
- When the latter token is detected by the audio WM detection module 302 of the voice AI assistance service, for example, a confirmation message such as "May the voice AI assistance service be used with the audio of this CM?" is output by voice from the voice processing device 30A, and the intention of the viewer 2 is thereby confirmed.
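As a sketch, the dispatch between the two token types might look like the following; the token values and the `ask_viewer` callback (standing in for the spoken confirmation message) are illustrative assumptions, not names from the text.

```python
# Sketch of handling the two token types: a forced-invalidation token and a
# token that first asks the viewer whether processing may proceed.
from enum import Enum


class Token(Enum):
    PROHIBIT = "prohibit"   # forcibly invalidates the recognition result
    CONFIRM = "confirm"     # asks the viewer before invalidating


def handle_recognition(token, recognition_result, ask_viewer):
    """ask_viewer() stands in for the spoken confirmation message."""
    if token is Token.PROHIBIT:
        return None                                   # always discarded
    if token is Token.CONFIRM:
        # e.g. "May the voice AI assistance service be used with this CM's audio?"
        return recognition_result if ask_viewer() else None
    return recognition_result                         # no token: result is valid


print(handle_recognition(Token.CONFIRM, "buy this product", lambda: True))
```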
- the watermark insertion process is performed by the server apparatus 10 on the transmission side (broadcasting station side), but may be performed by the client apparatus 20 on the reception side (for example, a television receiver).
- When the process of inserting a watermark is performed by the client device 20 on the receiving side, it can be realized, for example, by executing an application such as a broadcast-accompanying application.
- When the server device 10 on the transmitting side performs the process of inserting a watermark, the same audio (for example, the audio of a CM or program) is delivered to all viewers, so control cannot be performed in accordance with the intention of each individual viewer. However, by adopting a configuration in which the client device 20 on the receiving side executes an application and inserts the watermark, for example, the following becomes feasible.
- The viewer's intention can be reflected in whether the processing of the speech recognition result of the voice AI assistance service continues, and personalization can be performed.
- the intention of the viewer can be confirmed, for example, by displaying a confirmation message as shown in FIG.
- FIG. 8 is a block diagram showing a second example of the configuration of the content-voice AI cooperation system 1 according to the first embodiment.
- the content / voice AI cooperation system 1 of FIG. 8 includes a server device 10B, a client device 20B, and a voice processing device 30B.
- The server device 10B includes a CM / program bank 101, a token generator 103, and an application generator 111.
- In the server device 10B, an application generator 111 is newly provided in place of the audio decoder 102, the audio WM insertion module 104, and the audio encoder 105, as compared with the server device 10A described above.
- The application generator 111 generates an application based on the application generation data. In addition, when generating the application, the application generator 111 embeds the speech recognition process prohibition token generated by the token generator 103 as a hard-coded value.
- the server device 10B sends the application generated by the application generator 111 to the broadcast system 11 or the net distribution system 12 according to the distribution method of the application.
- the broadcast system 11 sends out, as a broadcast wave, data of at least one of a CM / program stream sent from the server device 10B and an application. Also, the net distribution system 12 distributes, via the Internet 50, data of at least one of the CM / program stream and the application sent from the server device 10B.
- the client device 20B receives the CM / program stream and the application distributed by the broadcast system 11 or the net distribution system 12.
- the client device 20B includes an audio decoder 201, an audio speaker 202, an application execution environment 211, and an audio WM insertion module 212.
- the client device 20B of FIG. 8 is newly provided with an application execution environment 211 and an audio WM insertion module 212.
- the application execution environment 211 executes an application received from the broadcast system 11 or the net distribution system 12.
- the application execution environment 211 acquires the speech recognition process prohibition token and supplies the token to the audio WM insertion module 212.
- The audio WM insertion module 212 inserts (encodes), as an audio watermark, the speech recognition process prohibition token supplied from the application execution environment 211 into the baseband CM / program audio stream supplied from the audio decoder 201, and supplies the result to the audio speaker 202.
- the audio speaker 202 outputs audio corresponding to a baseband CM / program audio stream (a stream in which a speech recognition process prohibition token is inserted as an audio watermark on the receiving side) supplied from the audio WM insertion module 212.
- Since the speech processing device 30B of FIG. 8 has the same configuration as the speech processing device 30A of FIG. 3, the description thereof is omitted here.
- the voice processing device 30B on the local side may cooperate with the server device 40 on the cloud side so that part of the processing of the voice AI assistance service is performed by the server device 40.
- FIG. 9 is a flowchart showing a flow of processing on the transmission side executed by the server device 10B and the broadcast system 11 or the net distribution system 12.
- FIG. 10 is a flowchart showing the flow of processing on the receiving side executed by the client device 20B and the voice processing device 30B.
- In step S111, the CM / program bank 101 sends the CM / program stream stored therein to the broadcast system 11 or the net distribution system 12.
- the voice corresponding to the CM / program audio stream includes voice for which the voice recognition process should be prohibited.
- In step S112, the token generator 103 generates a speech recognition process prohibition token based on the token generation data.
- In step S113, the application generator 111 generates an application based on the application generation data.
- Here, the speech recognition process prohibition token obtained in the process of step S112 can be embedded in the application as a hard-coded value.
- Note that instead of embedding the speech recognition process prohibition token in hard code, the application may acquire the token from (the token generator 103 of) the server device 10B on the transmitting side via the Internet 50.
- In step S114, the application generator 111 sends the application obtained in the process of step S113 to the broadcast system 11 or the net distribution system 12.
- The CM / program stream and the application obtained by the server device 10B are sent out to the broadcast system 11 or the net delivery system 12 according to the delivery method of the content.
- The broadcast system 11 processes the CM / program stream and the application sent from the server device 10B, and sends out the data obtained as a result of the processing as a broadcast wave.
- Also, the net distribution system 12 processes the CM / program stream and the application sent from the server device 10B, and delivers the data obtained as a result of the processing via the Internet 50.
- Here, the CM / program stream and the application may be multiplexed in the same broadcast stream, or the CM / program stream may be distributed via broadcasting and the application via communication.
- In the latter case, the client device 20B on the receiving side accesses the net distribution system 12 via the Internet 50 immediately before or simultaneously with the start of the CM or program to acquire the application.
- the CM / program stream and application distributed by the broadcast system 11 or the net distribution system 12 are received by the client device 20B.
- In the client device 20B, the CM / program stream is processed, and the CM / program audio stream is input to the audio decoder 201.
- an application is input to the application execution environment 211.
- In step S211, the audio decoder 201 decodes the CM / program audio stream. As a result of this decoding, a baseband CM / program audio stream is obtained.
- In step S213, the application execution environment 211 executes the application.
- Since the speech recognition process prohibition token is hard-coded in the application, the application execution environment 211 can acquire the speech recognition process prohibition token.
- Here, for example, by displaying the confirmation message 251 shown in FIG. 7 described above, the application can perform the audio watermark insertion processing not unconditionally but only after the intention of the viewer 2 has been confirmed.
- For example, when the viewer 2 operates the "NG button", the application execution environment 211 accepts the watermark insertion instruction (S214). In this case, the process of inserting an audio watermark is performed.
- Note that the intention confirmation may instead be performed in advance through an initial setting menu or the like, and the viewer intention information may be stored in an initial setting database so that the application executed by the application execution environment 211 can refer to it.
- For example, a menu item such as "voice AI assistance service self-use restriction" may be added, and self-use of the voice AI assistance service may be confirmed through a dialog such as that shown in FIG. 7.
- In this way, by referring to the initial setting database, the application can perform watermark insertion control based on the viewer intention information instead of displaying the confirmation message 251 shown in FIG. 7 each time.
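A minimal sketch of this control logic, assuming a simple key-value store for the initial setting database and a stand-in for the FIG. 7 dialog (neither API is specified in the text):

```python
# Sketch: the broadcast application decides whether to insert the prohibition
# watermark based on viewer intention stored in an initial-setting database,
# falling back to the on-screen confirmation dialog when no preference exists.
# All names here are illustrative assumptions.

settings_db = {}   # stands in for the receiver's initial-setting database


def show_confirmation_dialog():
    # Stand-in for the FIG. 7 dialog; a real app would render UI and wait.
    return True    # viewer chose to restrict self-use (insert the watermark)


def should_insert_watermark(viewer_id):
    if viewer_id in settings_db:              # preference set via the menu
        return settings_db[viewer_id]
    choice = show_confirmation_dialog()       # ask once, then remember it
    settings_db[viewer_id] = choice
    return choice


print(should_insert_watermark("viewer-2"))    # dialog shown the first time
print(should_insert_watermark("viewer-2"))    # answered from the database
```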
- Audio watermarks may be inserted in sections of all commercials and programs.
- In step S212, the audio WM insertion module 212 inserts (encodes), as an audio watermark, the speech recognition process prohibition token obtained in the process of step S213 into the baseband CM / program audio stream obtained in the process of step S211.
- Further, in the client device 20B, the audio output volume of the audio speaker 202 is adjusted so that the volume output from the audio speaker 202 is sufficient (S215).
- That is, the audio speaker 202 is controlled such that the audio microphone 311 built in the audio processing device 30B can pick up the sound.
- In step S216, the audio speaker 202 outputs audio corresponding to the baseband CM / program audio stream (a stream in which the speech recognition process prohibition token is inserted as an audio watermark on the receiving side) obtained in the process of step S212.
- Note that only the CM / program audio stream is described here to simplify the description; in the client device 20B, the CM / program video stream is also decoded by the video decoder, and the image of the CM or program corresponding to the baseband CM / program video stream is displayed on the display.
- the audio output from the audio speaker 202 of the client device 20B is collected by the audio microphone 311 of the audio processing device 30B.
- In the voice processing device 30B, when the speech recognition process prohibition token is not detected, the speech recognition result is passed to the subsequent processing (S314).
- On the other hand, when the speech recognition process prohibition token is detected, the speech recognition result is not passed to the subsequent processing.
- As described above, in the first embodiment, a voice recognition process prohibition token is inserted as an audio watermark by the server apparatus 10 on the transmitting side or the client apparatus 20 on the receiving side, and the voice processing apparatus 30 on the local side or the server apparatus 40 on the cloud side detects it, so that the voice AI assistance service can be used after the legitimacy of the speech recognition target data has been confirmed. As a result, a more practical voice AI assistance service can be provided.
- Furthermore, it is possible to avoid the cost of maintaining the wording for which speech recognition processing is prohibited as a blacklist in a database and checking in real time, in the voice AI assistance service, whether collected speech matches that wording.
- This is because, when the blacklist is updated frequently and its data volume is large, this cost may put pressure on operating costs, which in turn is highly likely to lead to degradation of the performance of the voice AI assistance service.
- In addition, it is possible to prevent the use of services that are meaningless (or unwanted) for the viewer who views the CM or program reproduced by the client device 20 such as a television receiver or mobile receiver.
- a plurality of types of tokens may be prepared, or a configuration may be implemented such that an application executed by the client device 20 on the receiving side (for example, the television receiver side) performs a process of inserting a watermark.
- For example, on the transmitting side (for example, a broadcasting station or the voice AI assistance service entity), the audio streams of all CM and program sections, except for the CM and program sections in which the processing of the speech recognition result should always be effective, may be decoded into baseband audio streams, and the speech recognition process prohibition token generated by the token generator 103 may be inserted as an audio watermark.
- Conversely, only in the sections in which the processing of the speech recognition result should be effective, the audio stream of the speech may be decoded into a baseband audio stream, and a speech recognition process permission token may be inserted as an audio watermark. That is, in contrast to the speech recognition process prohibition token described above, the speech recognition process permission token is a token that, when included in the collected speech, allows the subsequent processing based on the speech recognition result of the audio stream to continue.
- a television broadcast such as a CM or a program presents how the viewer should utter to the audio AI assistance service.
- For example, in a voice AI assistance service such as Alexa (registered trademark), the character string obtained by combining the launch phrase, the skill name, and the utterance can become very long; for example, it is assumed that the viewer is prompted to utter something like "ask, Drama Facts, for any private information on the casts of XXX DRAMA by XXX CHANNEL".
- Similarly, when the concatenated string including the activation phrase becomes very long, it is assumed, for example, that the viewer is prompted to utter something like "ask, shoppingApp, my personal account number is 1234567890". In this example, it is assumed that all or part of the utterance (for example, the "1234567890" part) is generated by an application executed by the client device 20 (for example, a television receiver or the like) in the viewer's home.
- Furthermore, it must be ensured that the token itself is not viewed or falsified before it reaches the sound collection module of the voice AI assistance service, or on the way to the subsequent processing of the voice AI assistance service.
- In addition, the message itself may need to be concealed on the route from the token generator to the subsequent processing of the voice AI assistance service.
- the present technology proposes, as a second embodiment, that a parameter delivered to an audio AI assistance service linked to content is inserted into an audio stream of content as an audio watermark.
- An audio watermark detection function is implemented in the voice processing device 30 provided on the local side as the sound collection device for the voice AI assistance service, or in the server device 40 provided on the cloud side that analyzes the collected audio stream.
- any method for the audio watermark may be used as long as a necessary and sufficient token can be superimposed on the target audio stream.
- FIG. 11 is a diagram showing an example of a service delivery parameter embedded as an audio watermark in a baseband audio stream.
- Here, for example, it is assumed that an instruction is given to utter "ask, Drama Facts, for any private information on the casts of XXX DRAMA by XXX CHANNEL" as a character string indicating how the viewer should speak to the voice AI assistance service.
- In this case, before transmitting the stream of the target CM or program, the server apparatus 10 on the transmitting side decodes the audio stream of a certain time section of the CM or program to obtain a baseband audio stream. Then, the server device 10 causes the audio WM insertion module to insert the token (service delivery parameter) generated by the token generator into the baseband audio stream as an audio watermark.
- Here, for example, a service delivery parameter of "ask, DramaFacts, for any private information on the casts of XXXDRAMA by XXXCHANNEL" is generated and inserted as an audio watermark into the baseband audio stream. It should be noted that the service delivery parameter is inserted into the baseband audio stream repeatedly, several times over.
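To make the repeated insertion concrete, here is a deliberately simplified sketch that embeds the parameter's bytes several times into baseband PCM samples. Real audio watermarks use robust, inaudible modulation rather than the LSB coding shown here, which is only meant to illustrate the repetition and its recovery.

```python
# Simplified illustration of repeatedly embedding a service delivery
# parameter into baseband PCM samples. LSB coding is NOT a realistic
# watermark; it is used only to make the repetition visible.

def embed_repeated(samples, payload, repeats=3):
    bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
    out = list(samples)
    pos = 0
    for _ in range(repeats):                  # insert the token several times
        for bit in bits:
            out[pos] = (out[pos] & ~1) | bit  # overwrite the LSB of one sample
            pos += 1
    return out


def extract_once(samples, payload_len, offset=0):
    bits = [s & 1 for s in samples[offset: offset + payload_len * 8]]
    return bytes(
        sum(bits[i * 8 + j] << j for j in range(8)) for i in range(payload_len)
    )


token = b"ask, DramaFacts, ..."    # service delivery parameter (truncated)
audio = [1000] * 2000              # stand-in baseband samples
marked = embed_repeated(audio, token)
# Any of the repeated copies can be decoded, e.g. the second one:
print(extract_once(marked, len(token), offset=len(token) * 8))
```

Repeating the payload matters because the detector may start listening mid-section; any intact copy suffices for recovery.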
- In addition, in order to prevent the content (message) of the token from being viewed or tampered with, the message can be encrypted, or a signature for it can be generated, and then inserted as an audio watermark.
- Here, the content (message) of the token, "ask, DramaFacts, for any private information on the casts of XXXDRAMA by XXXCHANNEL", is stored in the Message element. Then, by applying, for example, XML encryption or an XML signature to the message stored in the Message element, the content of the token can be concealed and tampering can be prevented.
- FIG. 13 shows an example in which an XML signature is applied to the message stored in the above-mentioned Message element.
- the XML signature is a type of electronic signature attached to electronic data such as an XML (Extensible Markup Language) document.
- The URI="" attribute value of the ds:Reference element indicates that the entire Message element is to be signed.
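A real implementation would attach an enveloped XML-DSig ds:Signature whose ds:Reference URI="" covers the whole Message element. The following sketch, using only the standard library, substitutes an HMAC over the serialized Message element for the XML signature; it illustrates the same integrity check but is not XML-DSig, and the shared key is an illustrative assumption.

```python
# Simplified stand-in for the XML signature applied to the Message element.
# An HMAC over the serialized Message element illustrates the integrity
# check; XML-DSig with certificates would be used in practice.
import hmac
import hashlib
import xml.etree.ElementTree as ET

SECRET = b"shared-key"   # illustrative; XML-DSig would normally use public keys


def build_signed_token(message_text):
    msg = ET.Element("Message")
    msg.text = message_text
    digest = hmac.new(SECRET, ET.tostring(msg), hashlib.sha256).hexdigest()
    envelope = ET.Element("Token")
    envelope.append(msg)
    ET.SubElement(envelope, "Signature").text = digest
    return ET.tostring(envelope, encoding="unicode")


def verify_token(xml_text):
    envelope = ET.fromstring(xml_text)
    msg = envelope.find("Message")
    expected = hmac.new(SECRET, ET.tostring(msg), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope.findtext("Signature"))


token_xml = build_signed_token(
    "ask, DramaFacts, for any private information on the casts of "
    "XXXDRAMA by XXXCHANNEL")
print(verify_token(token_xml))                                     # True
print(verify_token(token_xml.replace("DramaFacts", "EvilSkill")))  # False
```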
- Since the service delivery parameter can be inserted as an audio watermark not only by the server apparatus 10 on the transmission side but also by the client apparatus 20 on the reception side, the configuration in which the audio watermark is inserted on the transmission side and the configuration in which it is inserted on the reception side will each be described below.
- FIG. 14 is a block diagram showing a first example of the configuration of the content-voice AI cooperation system 1 according to the second embodiment.
- the content / voice AI cooperation system 1 of FIG. 14 includes a server device 10C, a client device 20C, and a voice processing device 30C.
- the server device 10C includes a CM / program bank 101, an audio decoder 102, a token generator 103, an audio WM insertion module 104, and an audio encoder 105.
- the token generator 103 generates a service delivery parameter based on the token generation data, and supplies it to the audio WM insertion module 104.
- The token generation data is, for example, data for generating a token (service delivery parameter) such as "ask, DramaFacts, for any private information on the casts of XXXDRAMA by XXXCHANNEL", and is determined by, for example, the broadcasting station, the voice AI assistance service entity, or another operator.
- The audio WM insertion module 104 inserts (encodes), as an audio watermark, the service delivery parameter supplied from the token generator 103 into the baseband CM / program audio stream supplied from the audio decoder 102, and supplies the result to the audio encoder 105.
- the audio encoder 105 encodes a baseband CM / program audio stream supplied from the audio WM insertion module 104 (a stream in which a service delivery parameter is inserted as an audio watermark on the transmitting side).
- The client device 20C is configured to include an audio decoder 201 and an audio speaker 202, as in the client device 20A shown in FIG. 3.
- the voice processing device 30C is configured to include a sound collection module 301, an audio WM detection module 302, and a voice recognition module 303 as a function of the voice AI assistance service.
- the sound collection module 301 includes an audio microphone 311.
- the audio microphone 311 picks up the wake word uttered by the viewer 2 or the sound output from the audio speaker 202 of the client device 20.
- The sound collection module 301 activates the voice AI assistance service when it recognizes, based on the sound collected by the audio microphone 311, the wake word uttered by the viewer 2, and enables the detection of the service delivery parameter by the audio WM detection module 302.
- the audio WM detection module 302 detects an audio watermark inserted in the audio stream from the audio microphone 311, and determines whether a service delivery parameter is inserted as an audio watermark.
- When the service delivery parameter is inserted, the audio WM detection module 302 supplies the service delivery parameter as a speech recognition result to the subsequent processing unit that performs the subsequent processing.
- the subsequent processing unit performs the subsequent processing on the voice AI assistance service based on the voice recognition result supplied from the audio WM detection module 302.
- On the other hand, when no service delivery parameter is inserted, the audio WM detection module 302 prevents the voice recognition result from being delivered to the subsequent processing unit.
- the speech recognition module 303 performs speech recognition processing on the audio stream supplied from the audio microphone 311. In the configuration shown in FIG. 14, the speech recognition module 303 is not necessarily required.
- Note that, in order to have the viewer 2 utter the wake word, for example, by displaying the voice instruction message 261 as shown in FIG. 15 in the client device 20C, it is possible to prompt the viewer 2 to utter the wake word for activating the voice AI assistance service.
- Although the description here assumes that the voice processing device 30C on the local side performs all the processing of the voice AI assistance service, part of the processing of the voice AI assistance service may be performed by the server device 40 on the cloud side. In that case, the voice AI assistance service is realized by cooperation between the voice processing device 30C on the local side and the server device 40 on the cloud side.
- Although the token generator 103 is described as being included in the server device 10C in FIG. 14, the token generator 103 may be included in a device other than the server device 10C.
- FIG. 16 is a flowchart showing the flow of processing on the transmission side executed by the server device 10C and the broadcast system 11 or the net distribution system 12.
- FIG. 17 is a flowchart showing a flow of processing on the receiving side executed by the client device 20C and the voice processing device 30C.
- In step S121, the CM / program bank 101 sends out a CM / program stream.
- the CM / program audio stream is sent to the audio decoder 102.
- In step S122, the token generator 103 generates a service delivery parameter as a token based on the token generation data.
- As the service delivery parameter, for example, a character string (message) indicating how the viewer 2 should speak to the voice AI assistance service, such as "ask, Drama Facts, for any private information on the casts of XXX DRAMA by XXX CHANNEL", is generated.
- Also, an XML signature or the like may be applied to this message so that the content of the token can be concealed and tampering can be prevented.
- In step S123, the audio decoder 102 decodes the CM / program audio stream transmitted in the process of step S121 to obtain a baseband CM / program audio stream.
- In step S125, the audio WM insertion module 104 inserts (encodes) the service delivery parameter obtained in the process of step S122, as an audio watermark, into the baseband CM / program audio stream obtained in the process of step S123.
- In step S124, the audio encoder 105 encodes the baseband CM / program audio stream into which the audio watermark has been inserted, obtained in the process of step S125.
- Note that only the CM / program audio stream is described here to simplify the explanation, but in the server device 10C, it is multiplexed with other streams such as the CM / program video stream and processed as necessary.
- the CM / program stream (a stream in which the service delivery parameter is inserted as an audio watermark on the transmitting side) obtained by the server device 10C is transmitted to the broadcast system 11 or the net distribution system 12 according to the content distribution method.
- The CM / program stream distributed by the broadcast system 11 or the net distribution system 12 is received by the client device 20C.
- In the client device 20C, the CM / program stream is processed, and the CM / program audio stream is input to the audio decoder 201.
- the client device 20C adjusts the audio output volume of the audio speaker 202 so that the volume output from the audio speaker 202 is sufficient (S221).
- the client device 20C instructs the viewer 2 to speak a wake word (for example, "Service A") for activating the audio AI assistance service (S222).
- For example, the voice instruction message 261 (FIG. 15), which reads "Please say 'Service A' if you want to know the private information of the cast of this program", is displayed in the section of the CM or program in which the audio watermark is inserted into the audio stream. Then, the viewer 2 who has confirmed this display utters the wake word (S21).
- In step S223, the audio decoder 201 decodes the CM / program audio stream to obtain a baseband CM / program audio stream.
- In step S224, the audio speaker 202 outputs audio corresponding to the baseband CM / program audio stream obtained in the process of step S223.
- Note that only the CM / program audio stream is described here to simplify the description; in the client device 20C, the CM / program video stream is also decoded by the video decoder, and the image of the CM or program corresponding to the baseband CM / program video stream is displayed on the display.
- The wake word uttered by the viewer 2 and the audio output from the audio speaker 202 of the client device 20C are collected by the audio microphone 311 of the audio processing device 30C.
- In step S322, the sound collection module 301 recognizes the wake word uttered by the viewer 2 from the audio stream corresponding to the sound collected by the audio microphone 311.
- When the wake word is recognized, the sound collection module 301 activates the voice AI assistance service and enables the detection of the service delivery parameter (S323).
- With the detection of the service delivery parameter enabled, the processing of step S321 by the audio WM detection module 302 is started.
- In step S321, the audio WM detection module 302 detects the audio watermark inserted in the audio stream from the audio microphone 311.
- In step S324, based on the detection result obtained in the process of step S321, the audio WM detection module 302 determines whether a service delivery parameter is inserted as an audio watermark in the audio stream.
- If it is determined in step S324 that a service delivery parameter is inserted as an audio watermark, the process proceeds to step S325.
- In step S325, the audio WM detection module 302 passes the service delivery parameter obtained in the process of step S321 to the subsequent processing as a speech recognition result.
- On the other hand, when it is determined in step S324 that no service delivery parameter is inserted as an audio watermark, the process of step S325 is skipped. That is, in this case, the speech recognition result of the audio stream is regarded as invalid, and the speech recognition result is not passed to the subsequent processing (nothing is done).
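Steps S321 to S325 can be sketched as follows; the wake word check and the watermark detector are placeholders, since only the control flow is specified here.

```python
# Sketch of steps S321-S325: watermark detection is enabled only after the
# wake word is recognized, and a detected service delivery parameter is
# handed over as the speech recognition result.

WAKE_WORD = "service a"


def detect_watermark(audio):
    return audio.get("watermark")          # service delivery parameter or None


def assistance_service(collected_audio):
    if WAKE_WORD not in collected_audio.get("transcript", "").lower():
        return None                        # service not activated (S322 fails)
    parameter = detect_watermark(collected_audio)   # S321, enabled by S323
    if parameter is not None:              # S324
        return parameter                   # S325: passed on as the result
    return None                            # no parameter: do nothing


audio = {
    "transcript": "Service A",
    "watermark": "ask, DramaFacts, for any private information on the casts "
                 "of XXXDRAMA by XXXCHANNEL",
}
print(assistance_service(audio))
```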
- That is, here, the character string (message) of, for example, "ask, Drama Facts, for any private information on the casts of XXX DRAMA by XXX CHANNEL" is passed to the subsequent processing as the speech recognition result. Therefore, for example, it is possible to avoid a situation in which the viewer 2 who wants to use the voice AI assistance service cannot remember the phrase because it is too long.
- the watermark insertion process is performed by the server apparatus 10 on the transmission side (broadcasting station side), but may be performed by the client apparatus 20 on the reception side (for example, a television receiver).
- When the process of inserting a watermark is performed by the client device 20 on the receiving side, it can be realized, for example, by executing an application such as a broadcast-accompanying application.
- When the server device 10 on the transmitting side performs the process of inserting a watermark, the same audio (for example, the audio of a CM or program) is delivered to all viewers, so control cannot be performed in accordance with the intention of each individual viewer. However, by adopting a configuration in which the client device 20 on the receiving side executes an application and inserts the watermark, for example, the following becomes feasible.
- For example, viewer-specific attribute information (for example, the viewer's account information and the like necessary for the purchase of a product and the like) can be included in the service delivery parameter.
- FIG. 18 is a block diagram showing a second example of the configuration of the content-voice AI cooperation system 1 according to the second embodiment.
- In FIG. 18, the content-voice AI cooperation system 1 includes a server device 10D, a client device 20D, and a voice processing device 30D.
- In FIG. 18, the server device 10D includes a CM/program bank 101 and an application generator 111.
- The application generator 111 generates an application based on application generation data. The application generated here has a token generator function (a function equivalent to that of the token generator 103 described above).
- The server device 10D sends the application generated by the application generator 111 to the broadcast system 11 or the net distribution system 12, according to the distribution method of the application.
- In FIG. 18, the client device 20D includes an audio decoder 201, an audio speaker 202, an application execution environment 211, and an audio WM insertion module 212.
- The application execution environment 211 executes the application received from the broadcast system 11 or the net distribution system 12. Since the application has a token generator function, the token (service delivery parameter) generated by the application is supplied to the audio WM insertion module 212.
- The audio WM insertion module 212 inserts (encodes), as an audio watermark, the service delivery parameter generated by the application running in the application execution environment 211 into the baseband CM/program audio stream supplied from the audio decoder 201, and supplies the result to the audio speaker 202.
- The audio speaker 202 outputs audio corresponding to the baseband CM/program audio stream supplied from the audio WM insertion module 212 (a stream into which the service delivery parameter has been inserted as an audio watermark on the receiving side).
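The data path just described (an application-generated token encoded into the decoded baseband audio before it reaches the speaker) can be sketched as follows. The LSB embedding used here is purely illustrative, not the scheme of the present technology; an actual system would use a robust audio watermark such as the one defined in ATSC A/334 (Non-Patent Document 1).

```python
# Illustrative stand-in for the audio WM insertion module 212 and the
# corresponding detection: the token bytes (with a 2-byte length header)
# are placed in the least-significant bits of 16-bit PCM samples.

def insert_watermark(pcm: list[int], token: str) -> list[int]:
    data = len(token.encode()).to_bytes(2, "big") + token.encode()
    bits = [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]
    if len(bits) > len(pcm):
        raise ValueError("audio segment too short for token")
    # overwrite each sample's LSB with one payload bit
    return [(s & ~1) | b for s, b in zip(pcm, bits)] + pcm[len(bits):]

def detect_watermark(pcm: list[int]) -> str:
    bits = [s & 1 for s in pcm]
    def to_bytes(bs):
        return bytes(sum(b << (7 - i) for i, b in enumerate(bs[k:k + 8]))
                     for k in range(0, len(bs), 8))
    length = int.from_bytes(to_bytes(bits[:16]), "big")
    return to_bytes(bits[16:16 + 8 * length]).decode()
```

A real watermark must survive loudspeaker-to-microphone transmission, which LSB coding does not; the sketch only shows where the insertion sits in the pipeline.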
- the audio processing device 30D of FIG. 18 has the same configuration as the audio processing device 30C of FIG. 14, so the description thereof is omitted here.
- the voice processing device 30D on the local side may cooperate with the server device 40 on the cloud side so that part of the processing of the voice AI assistance service is performed by the server device 40.
- Note that, although the wake word is to be uttered by the viewer 2 toward the voice processing device 30D, the client device 20D can, for example, display the speech instruction message 271 shown in FIG. 19 to prompt the viewer 2 to utter the wake word for activating the voice AI assistance service.
- FIG. 20 is a flowchart showing a flow of processing on the transmission side executed by the server device 10D and the broadcast system 11 or the net distribution system 12.
- FIG. 21 is a flowchart showing a flow of processing on the receiving side executed by the client device 20D and the voice processing device 30D.
- In step S131, the CM/program bank 101 sends the CM/program stream to the broadcast system 11 or the net distribution system 12.
- In step S133, the application generator 111 generates an application based on application generation data. Here, the application has a token generator function (a function equivalent to that of the token generator 103 described above). Note that part of the service delivery parameter (for example, common information other than viewer-specific attribute information) may be hard-coded into the application.
- In step S134, the application generator 111 sends the application obtained in the process of step S133 to the broadcast system 11 or the net distribution system 12.
- The CM/program stream and the application obtained by the server device 10D are sent out by the broadcast system 11 or the net distribution system 12 according to the distribution method of the content.
- The CM/program stream and the application distributed by the broadcast system 11 or the net distribution system 12 are received by the client device 20D. In the client device 20D, the CM/program audio stream is input to the audio decoder 201, and the application is input to the application execution environment 211.
- In step S231, the audio decoder 201 decodes the CM/program audio stream to obtain a baseband CM/program audio stream.
- In step S233, the application execution environment 211 executes the application. Since the application has a token generator function, it can generate and obtain the service delivery parameter as a token.
- Here, a character string "ask, shoppingApp, my personal account number is 1234567890" (a message indicating what the viewer 2 should say to the voice AI assistance service) is generated as the service delivery parameter.
- The application executed in the application execution environment 211 acquires viewer-specific attribute information relating to the privacy of the viewer 2 (for example, the account number "1234567890") from a database (for example, a database in which viewer-specific information has been set through the initial setting menu of the client device 20), and generates the service delivery parameter based on that information.
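The token generator function held by the application can be sketched minimally as below. The settings store and its key names are illustrative stand-ins; the actual application would read the viewer-specific attribute information from the client device's initial-setting database.

```python
# Hypothetical sketch of the application's token generator function:
# hard-coded common information plus viewer-specific attribute information
# are combined into the service delivery parameter.

HARDCODED_SKILL = "shoppingApp"  # common information, hard-coded in the app

def generate_service_delivery_parameter(settings: dict) -> str:
    # viewer-specific attribute information (relates to the viewer's privacy)
    account = settings["account_number"]
    return f"ask, {HARDCODED_SKILL}, my personal account number is {account}"
```

Because the parameter is assembled on the receiving side, it can carry per-viewer values that the transmitting side could never embed in a common broadcast stream.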
- At this time, for example, the application can display the confirmation message 251 shown in FIG. 7 described above, and perform the audio watermark insertion processing only after the intention of the viewer 2 has been confirmed, rather than inserting the watermark unconditionally.
- Note that the intention confirmation may be performed in advance, with the viewer's intention information stored in the initial-setting database and reused. Alternatively, the audio watermark insertion processing may be executed unconditionally, without performing the confirmation process of step S234.
- In step S232, the audio WM insertion module 212 inserts (encodes), as an audio watermark, the service delivery parameter obtained in the process of step S233 into the baseband CM/program audio stream obtained in the process of step S231.
- At this time, the audio output volume of the audio speaker 202 is adjusted so that the volume output from the audio speaker 202 is sufficient (S235). Further, the client device 20D instructs the viewer 2 to utter a wake word (for example, "Service A") for activating the voice AI assistance service (S236).
- Here, for example, the speech instruction message 271 (FIG. 19), which reads "If you want to purchase the item introduced in this program, just say 'Service A'", is displayed during the section of the program's audio stream in which the audio watermark is inserted. The viewer 2 who sees this display then utters the wake word (S31).
- In step S237, the audio speaker 202 outputs audio corresponding to the baseband CM/program audio stream obtained in the process of step S232 (a stream into which the service delivery parameter has been inserted as an audio watermark on the receiving side).
- Note that, to simplify the description, only the CM/program audio stream has been described; in the client device 20D, however, the CM/program video stream is also decoded by a video decoder, and the image of the CM or program corresponding to the baseband CM/program video stream is displayed on the display.
- the wake word uttered by the viewer 2 and the sound output from the audio speaker 202 of the client device 20D are collected by the audio microphone 311 of the audio processing device 30D.
- In steps S331 to S335, as in steps S321 to S325 of FIG. 17, when the wake word uttered by the viewer 2 is recognized, the voice AI assistance service is activated and detection of the service delivery parameter is enabled, and it is determined whether the service delivery parameter has been inserted as an audio watermark into the audio stream from the audio microphone 311.
- When it is determined that the service delivery parameter has been inserted as the audio watermark, the service delivery parameter is passed to the subsequent processing as the speech recognition result (S335).
- On the other hand, when it is determined that the service delivery parameter has not been inserted, no speech recognition result is passed to the subsequent processing.
- In this way, in the audio processing device 30D, when a character string (message) such as "ask, shoppingApp, my personal account number is 1234567890" is inserted in the audio stream as the service delivery parameter, this message is passed to the subsequent processing as the speech recognition result. This avoids, for example, situations where the viewer 2 using the voice AI assistance service cannot remember a phrase because it is too long, or is forced to utter content related to privacy or security.
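The receiving-side decision in steps S331 to S335 can be sketched as follows. This is a schematic of the control flow only; the function names and boolean inputs are illustrative, standing in for the wake word recognition and watermark detection modules of the voice processing device 30D.

```python
from typing import Optional

# Hypothetical sketch of the detection-side gating: service delivery
# parameter detection is enabled only after the wake word is recognized,
# and a detected parameter replaces the viewer's utterance as the
# "speech recognition result" passed to the subsequent processing.

def handle_captured_audio(wake_word_heard: bool,
                          watermark_payload: Optional[str]) -> Optional[str]:
    if not wake_word_heard:
        return None               # voice AI assistance service stays inactive
    if watermark_payload is not None:
        return watermark_payload  # S335: pass the parameter downstream
    return None                   # no parameter detected: nothing is passed on
```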
- As described above, in the second embodiment, the server apparatus 10 on the transmitting side or the client apparatus 20 on the receiving side inserts a service delivery parameter as an audio watermark, and the voice processing apparatus 30 on the local side or the server apparatus 40 on the cloud side detects this service delivery parameter. As a result, even when the viewer cannot correctly utter an instructed phrase, or when the phrase includes content that is difficult to utter aloud, the voice AI assistance service can be used accurately and securely, and a more practical voice AI assistance service can be provided.
- Further, the utterance of the wake word can be treated as confirmation of the viewer's intention to use the voice AI assistance service, so that use of the service starts only after the viewer's consent has been obtained.
- Here, when the service delivery parameter is not inserted as an audio watermark, the viewer may, for example, have to utter content that is too long to speak correctly. On the other hand, when the service delivery parameter is inserted as the audio watermark, the viewer who has seen the speech instruction message 261 only needs to utter the wake word, and can therefore speak accurately.
- Likewise, when the service delivery parameter is not inserted as an audio watermark and the content of the utterance includes the viewer's private information, the utterance may be overheard by others. On the other hand, when the service delivery parameter is inserted as the audio watermark, the viewer who has seen the speech instruction message 271 of FIG. 19 only needs to utter the wake word, and does not need to utter viewer-specific attribute information.
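Configuration (9) of the present technology notes that the delivered parameter may be encrypted, or may carry a signature for tampering detection. One possible realization of the signature variant, sketched with HMAC-SHA256 (an illustrative choice; the document does not mandate an algorithm, and the shared key shown is a placeholder):

```python
import hashlib
import hmac

SHARED_KEY = b"key-shared-with-the-assistance-service"  # illustrative only

def sign(parameter: str) -> str:
    # append a tamper-detection tag to the service delivery parameter
    tag = hmac.new(SHARED_KEY, parameter.encode(), hashlib.sha256).hexdigest()
    return f"{parameter}|{tag}"

def verify(signed: str) -> str:
    parameter, tag = signed.rsplit("|", 1)
    expected = hmac.new(SHARED_KEY, parameter.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("tampering detected")
    return parameter
```

The verifying side (the voice AI assistance service) would hold the same key and discard any parameter whose tag does not match.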
- In the above description, the token is inserted into the audio stream as an audio watermark; however, the audio watermark is merely an example, and other methods may be used to embed the token. For example, the token may be embedded using fingerprint (Finger Print) information, which is a feature amount extracted from the audio stream of content such as a CM or a program.
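The fingerprint alternative can be sketched minimally: instead of embedding the token in the audio, a feature vector is extracted from the content audio and matched against a table that maps fingerprints to tokens. The frame-energy feature below is purely illustrative, not a standardized fingerprint.

```python
import math

# Toy fingerprint: per-frame energy, quantized against the mean so that
# small level changes do not alter the fingerprint.

def fingerprint(samples: list[float], frames: int = 8) -> tuple:
    n = len(samples) // frames
    energies = [sum(s * s for s in samples[i * n:(i + 1) * n])
                for i in range(frames)]
    mean = sum(energies) / frames
    return tuple(1 if e > mean else 0 for e in energies)

token_table = {}  # fingerprint -> token, prepared on the broadcaster side

# toy audio segment: loud first half, quiet second half
segment = [math.sin(i / 20) * (1.0 if i < 800 else 0.1) for i in range(1600)]
token_table[fingerprint(segment)] = "ask, shoppingApp, ..."
```

On the detection side, the voice processing device would compute the same fingerprint from the collected audio and look the token up, rather than decoding a watermark payload.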
- Details of audio watermarks are given, for example, in Non-Patent Documents 1 and 2 below, which define audio watermarks in ATSC (Advanced Television Systems Committee) 3.0, one of the next-generation terrestrial broadcasting standards.
- Non-Patent Document 1: ATSC Standard: Audio Watermark Emission (A/334)
- Non-Patent Document 2: ATSC Standard: Content Recovery in Redistribution Scenarios (A/336)
- Further, the application is not limited to an application developed with a markup language such as HTML5 (HyperText Markup Language 5) or a script language such as JavaScript (registered trademark), that is, an application executed on a browser; it may be, for example, an application developed in a programming language such as Java (registered trademark).
- Further, the application executed by the client device 20 is not limited to one obtained via broadcast, and may be obtained from a server on the Internet 50 via communication. Further, the content described above is not limited to a CM or a program, and can include any content such as music, video, electronic books, games, and advertisements. Furthermore, a CM or a program may be all or part of a service or a channel.
- Although the hardware configuration of the client device 20 has not been specifically described above, for example, the following configuration can be adopted. That is, since the client device 20 is configured, for example, as a television receiver, it can include, in addition to the audio decoder 201 and the audio speaker 202, for example, a CPU (Central Processing Unit), a memory, a tuner, a demultiplexer, a video decoder, a display, a communication I/F, and the like.
- Similarly, although the hardware configuration of the voice processing device 30 has not been specifically described, for example, the following configuration can be adopted. That is, since the voice processing device 30 is configured, for example, as a smart speaker, it can include, in addition to the audio microphone 311, for example, a CPU, a memory, a speaker, a communication I/F, and the like.
- In the above description, the client device 20 and the voice processing device 30 are configured as separate devices, but they may be integrated into a single device (a bundled device). For example, by providing the function of the voice processing device 30 as a voice processing module and including it in the function of the client device 20, the two can be configured as a bundled device.
- Further, the server device 10, the client device 20, the voice processing device 30, and the server device 40 can each be regarded as an information processing device.
- In the above description, the client device 20 is described as a fixed receiver such as a television receiver or a mobile receiver such as a smartphone, but the client device 20 may be a head mounted display (HMD) or the like. Further, the client device 20 may be, for example, a device mounted in a car such as an in-vehicle television, a set top box (STB: Set Top Box), a game machine, or the like. That is, the client device 20 may be any device capable of playing back or recording content.
- Although the broadcast system of the broadcast system 11 has not been specifically mentioned, it is possible to adopt, for example, ATSC (in particular, ATSC 3.0), the system adopted in the United States and elsewhere; ISDB (Integrated Services Digital Broadcasting), the system adopted in Japan and elsewhere; or DVB (Digital Video Broadcasting), the system adopted in European countries and elsewhere. Further, the transmission path for distribution via broadcast may be, in addition to terrestrial broadcasting, satellite broadcasting using a broadcasting satellite (BS: Broadcasting Satellite), a communications satellite (CS: Communications Satellite), or the like, or wired broadcasting such as cable television (CATV).
- The names used herein are examples; in practice, other names may be used. However, differences in these names are formal differences and do not change the substance of what they designate. For example, the wake word described above may be referred to as an activation keyword, a command word, or the like.
- FIG. 24 is a diagram showing an example of a hardware configuration of a computer that executes the series of processes described above according to a program.
- In the computer 1000, a central processing unit (CPU) 1001, a read-only memory (ROM) 1002, and a random access memory (RAM) 1003 are mutually connected by a bus 1004.
- An input / output interface 1005 is further connected to the bus 1004.
- An input unit 1006, an output unit 1007, a recording unit 1008, a communication unit 1009, and a drive 1010 are connected to the input / output interface 1005.
- the input unit 1006 includes a keyboard, a mouse, a microphone and the like.
- the output unit 1007 includes a display, a speaker, and the like.
- the recording unit 1008 includes a hard disk, a non-volatile memory, and the like.
- the communication unit 1009 includes a network interface or the like.
- the drive 1010 drives a removable recording medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.
- In the computer 1000 configured as described above, the CPU 1001 loads the program stored in the ROM 1002 or the recording unit 1008 into the RAM 1003 via the input/output interface 1005 and the bus 1004 and executes it, whereby the series of processes described above is performed.
- the program executed by the computer 1000 can be provided by being recorded on, for example, a removable recording medium 1011 as a package medium or the like. Also, the program can be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
- the program can be installed in the recording unit 1008 via the input / output interface 1005 by attaching the removable recording medium 1011 to the drive 1010. Also, the program can be received by the communication unit 1009 via a wired or wireless transmission medium and installed in the recording unit 1008. In addition, the program can be installed in advance in the ROM 1002 or the recording unit 1008.
- the processing performed by the computer according to the program does not necessarily have to be performed chronologically in the order described as the flowchart. That is, the processing performed by the computer according to the program includes processing executed in parallel or separately (for example, parallel processing or processing by an object). Further, the program may be processed by one computer (processor) or may be distributed and processed by a plurality of computers.
- Note that the present technology can also have the following configurations.
- (1) An information processing apparatus comprising: an insertion unit that inserts, into an audio stream of content, a token related to the use of a voice AI assistance service linked to the content.
- (2) The information processing apparatus according to (1), wherein the token is a token for prohibiting or permitting speech recognition processing by the voice AI assistance service on the audio stream of the content.
- (3) The information processing apparatus according to (1), wherein the token is a parameter delivered to the voice AI assistance service.
- (4) The information processing apparatus according to any one of (1) to (3), further comprising a generation unit that generates the token, wherein the insertion unit inserts the token generated by the generation unit into an audio stream of content to be distributed.
- (5) The information processing apparatus according to (4), wherein the insertion unit inserts the token, as an audio watermark, into the audio stream of the content distributed via broadcast or via communication.
- (6) The information processing apparatus according to any one of (1) to (3), further comprising an execution unit that executes an application having a function of generating the token, wherein the insertion unit inserts the token generated by the running application into an audio stream of content to be reproduced.
- (7) The information processing apparatus according to (6), wherein the insertion unit inserts the token generated by the application distributed via broadcast or via communication, as an audio watermark, into the audio stream of the content distributed via broadcast or via communication.
- (8) The information processing apparatus according to (2), wherein the token is notified in advance to the side that detects the token inserted in the audio stream of the content.
- (9) The information processing apparatus according to (3), wherein the parameter is encrypted, or a signature for tampering detection is added to it.
- (10) An information processing method of an information processing apparatus, the method comprising: the information processing apparatus inserting, into an audio stream of content, a token related to the use of a voice AI assistance service linked to the content.
- (11) An information processing apparatus comprising: a detection unit that detects, from an audio stream of content, a token related to the use of a voice AI assistance service linked to the content.
- (12) The information processing apparatus according to (11), wherein the token is a token for prohibiting speech recognition processing by the voice AI assistance service on the audio stream of the content.
- (13) The information processing apparatus according to (12), further comprising a speech recognition unit that performs speech recognition processing on the audio stream of the content, wherein the detection unit invalidates the speech recognition result obtained by the speech recognition processing when the token notified in advance is detected from the audio stream of the content.
- (14) The information processing apparatus according to (11), wherein the token is a token for permitting speech recognition processing by the voice AI assistance service on the audio stream.
- (15) The information processing apparatus according to (14), further comprising a speech recognition unit that performs speech recognition processing on the audio stream of the content, wherein the detection unit passes the speech recognition result obtained by the speech recognition processing to subsequent processing when the token notified in advance is detected from the audio stream of the content.
- (16) The information processing apparatus according to (11), wherein the token is a parameter delivered to the voice AI assistance service.
- (17) The information processing apparatus according to (16), wherein the detection unit passes the parameter to subsequent processing when the parameter is detected from the audio stream of the content.
- (18) The information processing apparatus according to (16) or (17), wherein the detection unit detects the token inserted in the audio stream of the content when a wake word of the voice AI assistance service is uttered by a viewer viewing the content.
- (19) The information processing apparatus according to any one of (11) to (18), further comprising a sound collection unit that collects the sound of the content output from another information processing apparatus that reproduces the content distributed via broadcast or via communication, wherein the detection unit detects the token inserted as an audio watermark in the audio stream of the sound of the content collected by the sound collection unit.
- (20) An information processing method of an information processing apparatus, the method comprising: the information processing apparatus detecting, from an audio stream of content, a token related to the use of a voice AI assistance service linked to the content.
- SYMBOLS: 1 content-voice AI cooperation system; 10, 10A, 10B, 10C, 10D server device; 11 broadcast system; 12 net distribution system; 20, 20A, 20B, 20C, 20D client device; 30, 30A, 30B, 30C, 30D voice processing device; 40 server device; 50 Internet; 101 CM/program bank; 102 audio decoder; 103 token generator; 104 audio WM insertion module; 105 audio encoder; 111 application generator; 201 audio decoder; 202 audio speaker; 211 application execution environment; 212 audio WM insertion module; 301 sound collection module; 302 audio WM detection module; 303 speech recognition module; 311 audio microphone; 1000 computer; 1001 CPU
Abstract
Description
2. Embodiments of the present technology
(1) First embodiment: selection of recognition targets for the voice AI assistance using WM
(A) Configuration in which watermark insertion is performed on the transmitting side
(B) Configuration in which watermark insertion is performed on the receiving side
(2) Second embodiment: utterance supplementation for the voice AI assistance using WM
(C) Configuration in which watermark insertion is performed on the transmitting side
(D) Configuration in which watermark insertion is performed on the receiving side
3. Modifications
4. Computer configuration
FIG. 1 is a block diagram showing a configuration example of a content-voice AI cooperation system to which the present technology is applied.
FIG. 2 is a diagram showing an example of a speech recognition processing prohibition token embedded, as an audio watermark, in a baseband audio stream.
FIG. 3 is a block diagram showing a first example of the configuration of the content-voice AI cooperation system 1 according to the first embodiment.
Next, with reference to the flowcharts of FIGS. 4 and 5, the flow of content-voice AI cooperation processing when watermark insertion is performed on the transmitting side will be described.
Next, with reference to the flowchart of FIG. 6, the flow of voice AI processing in response to a viewer utterance will be described.
FIG. 8 is a block diagram showing a second example of the configuration of the content-voice AI cooperation system 1 according to the first embodiment.
Next, with reference to the flowcharts of FIGS. 9 and 10, the flow of CM/program-voice AI cooperation when watermark insertion is performed on the receiving side will be described.
FIG. 11 is a diagram showing an example of a service delivery parameter embedded, as an audio watermark, in a baseband audio stream.
FIG. 14 is a block diagram showing a first example of the configuration of the content-voice AI cooperation system 1 according to the second embodiment.
Next, with reference to the flowcharts of FIGS. 16 and 17, the flow of content-voice AI cooperation processing when watermark insertion is performed on the transmitting side will be described.
FIG. 18 is a block diagram showing a second example of the configuration of the content-voice AI cooperation system 1 according to the second embodiment.
Next, with reference to the flowcharts of FIGS. 20 and 21, the flow of content-voice AI cooperation processing when watermark insertion is performed on the receiving side will be described.
In the above description, the token is inserted into the audio stream as an audio watermark; however, the audio watermark is merely an example, and other methods may be used to embed the token. For example, the token may be embedded using fingerprint (Finger Print) information, which is a feature amount extracted from the audio stream of content such as a CM or a program.
Non-Patent Document 2: ATSC Standard: Content Recovery in Redistribution Scenarios (A/336)
In the above description, a broadcast-accompanying broadcast application executed by a browser was described as an example of the application executed in the application execution environment 211 of the client device 20; however, it may be another application, such as a native application executed in an OS (Operating System) environment (presentation control environment).
Although the hardware configuration of the client device 20 has not been specifically described above, for example, the following configuration can be adopted. That is, since the client device 20 is configured, for example, as a television receiver, it can include, in addition to the audio decoder 201 and the audio speaker 202, for example, a CPU (Central Processing Unit), a memory, a tuner, a demultiplexer, a video decoder, a display, a communication I/F, and the like.
Although the broadcast system of the broadcast system 11 has not been specifically mentioned, it is possible to adopt, for example, ATSC (in particular, ATSC 3.0), the system adopted in the United States and elsewhere; ISDB (Integrated Services Digital Broadcasting), the system adopted in Japan and elsewhere; or DVB (Digital Video Broadcasting), the system adopted in European countries and elsewhere. Further, the transmission path for distribution via broadcast may be, in addition to terrestrial broadcasting, satellite broadcasting using a broadcasting satellite (BS: Broadcasting Satellite), a communications satellite (CS: Communications Satellite), or the like, or wired broadcasting such as cable television (CATV).
The names used in this specification are examples; in practice, other names may be used. However, differences in these names are formal differences and do not change the substance of what they designate. For example, the wake word described above may be referred to as an activation keyword, a command word, or the like.
An information processing apparatus comprising: an insertion unit that inserts, into an audio stream of content, a token related to the use of a voice AI assistance service linked to the content.
(2)
The information processing apparatus according to (1), wherein the token is a token for prohibiting or permitting speech recognition processing by the voice AI assistance service on the audio stream of the content.
(3)
The information processing apparatus according to (1), wherein the token is a parameter delivered to the voice AI assistance service.
(4)
The information processing apparatus according to any one of (1) to (3), further comprising a generation unit that generates the token, wherein the insertion unit inserts the token generated by the generation unit into an audio stream of content to be distributed.
(5)
The information processing apparatus according to (4), wherein the insertion unit inserts the token, as an audio watermark, into the audio stream of the content distributed via broadcast or via communication.
(6)
The information processing apparatus according to any one of (1) to (3), further comprising an execution unit that executes an application having a function of generating the token, wherein the insertion unit inserts the token generated by the running application into an audio stream of content to be reproduced.
(7)
The information processing apparatus according to (6), wherein the insertion unit inserts the token generated by the application distributed via broadcast or via communication, as an audio watermark, into the audio stream of the content distributed via broadcast or via communication.
(8)
The information processing apparatus according to (2), wherein the token is notified in advance to the side that detects the token inserted in the audio stream of the content.
(9)
The information processing apparatus according to (3), wherein the parameter is encrypted, or a signature for tampering detection is added to it.
(10)
An information processing method of an information processing apparatus, the method comprising: the information processing apparatus inserting, into an audio stream of content, a token related to the use of a voice AI assistance service linked to the content.
(11)
An information processing apparatus comprising: a detection unit that detects, from an audio stream of content, a token related to the use of a voice AI assistance service linked to the content.
(12)
The information processing apparatus according to (11), wherein the token is a token for prohibiting speech recognition processing by the voice AI assistance service on the audio stream of the content.
(13)
The information processing apparatus according to (12), further comprising a speech recognition unit that performs speech recognition processing on the audio stream of the content, wherein the detection unit invalidates the speech recognition result obtained by the speech recognition processing when the token notified in advance is detected from the audio stream of the content.
(14)
The information processing apparatus according to (11), wherein the token is a token for permitting speech recognition processing by the voice AI assistance service on the audio stream.
(15)
The information processing apparatus according to (14), further comprising a speech recognition unit that performs speech recognition processing on the audio stream of the content, wherein the detection unit passes the speech recognition result obtained by the speech recognition processing to subsequent processing when the token notified in advance is detected from the audio stream of the content.
(16)
The information processing apparatus according to (11), wherein the token is a parameter delivered to the voice AI assistance service.
(17)
The information processing apparatus according to (16), wherein the detection unit passes the parameter to subsequent processing when the parameter is detected from the audio stream of the content.
(18)
The information processing apparatus according to (16) or (17), wherein the detection unit detects the token inserted in the audio stream of the content when a wake word of the voice AI assistance service is uttered by a viewer viewing the content.
(19)
The information processing apparatus according to any one of (11) to (18), further comprising a sound collection unit that collects the sound of the content output from another information processing apparatus that reproduces the content distributed via broadcast or via communication, wherein the detection unit detects the token inserted as an audio watermark in the audio stream of the sound of the content collected by the sound collection unit.
(20)
An information processing method of an information processing apparatus, the method comprising: the information processing apparatus detecting, from an audio stream of content, a token related to the use of a voice AI assistance service linked to the content.
Claims (20)
- An information processing apparatus comprising: an insertion unit that inserts, into an audio stream of content, a token related to the use of a voice AI assistance service linked to the content.
- The information processing apparatus according to claim 1, wherein the token is a token for prohibiting or permitting speech recognition processing by the voice AI assistance service on the audio stream of the content.
- The information processing apparatus according to claim 1, wherein the token is a parameter delivered to the voice AI assistance service.
- The information processing apparatus according to claim 1, further comprising a generation unit that generates the token, wherein the insertion unit inserts the token generated by the generation unit into an audio stream of content to be distributed.
- The information processing apparatus according to claim 4, wherein the insertion unit inserts the token, as an audio watermark, into the audio stream of the content distributed via broadcast or via communication.
- The information processing apparatus according to claim 1, further comprising an execution unit that executes an application having a function of generating the token, wherein the insertion unit inserts the token generated by the running application into an audio stream of content to be reproduced.
- The information processing apparatus according to claim 6, wherein the insertion unit inserts the token generated by the application distributed via broadcast or via communication, as an audio watermark, into the audio stream of the content distributed via broadcast or via communication.
- The information processing apparatus according to claim 2, wherein the token is notified in advance to the side that detects the token inserted in the audio stream of the content.
- The information processing apparatus according to claim 3, wherein the parameter is encrypted, or a signature for tampering detection is added to it.
- An information processing method of an information processing apparatus, the method comprising: the information processing apparatus inserting, into an audio stream of content, a token related to the use of a voice AI assistance service linked to the content.
- An information processing apparatus comprising: a detection unit that detects, from an audio stream of content, a token related to the use of a voice AI assistance service linked to the content.
- The information processing apparatus according to claim 11, wherein the token is a token for prohibiting speech recognition processing by the voice AI assistance service on the audio stream of the content.
- The information processing apparatus according to claim 12, further comprising a speech recognition unit that performs speech recognition processing on the audio stream of the content, wherein the detection unit invalidates the speech recognition result obtained by the speech recognition processing when the token notified in advance is detected from the audio stream of the content.
- The information processing apparatus according to claim 11, wherein the token is a token for permitting speech recognition processing by the voice AI assistance service on the audio stream.
- The information processing apparatus according to claim 14, further comprising a speech recognition unit that performs speech recognition processing on the audio stream of the content, wherein the detection unit passes the speech recognition result obtained by the speech recognition processing to subsequent processing when the token notified in advance is detected from the audio stream of the content.
- The information processing apparatus according to claim 11, wherein the token is a parameter delivered to the voice AI assistance service.
- The information processing apparatus according to claim 16, wherein the detection unit passes the parameter to subsequent processing when the parameter is detected from the audio stream of the content.
- The information processing apparatus according to claim 17, wherein the detection unit detects the token inserted in the audio stream of the content when a wake word of the voice AI assistance service is uttered by a viewer viewing the content.
- The information processing apparatus according to claim 11, further comprising a sound collection unit that collects the sound of the content output from another information processing apparatus that reproduces the content distributed via broadcast or via communication, wherein the detection unit detects the token inserted as an audio watermark in the audio stream of the sound of the content collected by the sound collection unit.
- An information processing method of an information processing apparatus, the method comprising: the information processing apparatus detecting, from an audio stream of content, a token related to the use of a voice AI assistance service linked to the content.
Priority Applications (9)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201880057831.9A CN111052231B (zh) | 2017-09-15 | 2018-08-31 | 信息处理设备和信息处理方法 |
EP18856338.1A EP3683792A4 (en) | 2017-09-15 | 2018-08-31 | INFORMATION PROCESSING DEVICE AND INFORMATION PROCESSING PROCESS |
AU2018333668A AU2018333668B2 (en) | 2017-09-15 | 2018-08-31 | Information processing device and information processing method |
MX2020002591A MX2020002591A (es) | 2017-09-15 | 2018-08-31 | Aparato de procesamiento de la informacion y metodo de procesamiento de informacion. |
US16/645,058 US11600270B2 (en) | 2017-09-15 | 2018-08-31 | Information processing apparatus and information processing method |
JP2019541990A JP7227140B2 (ja) | 2017-09-15 | 2018-08-31 | 情報処理装置、情報処理方法、音声処理装置、及び音声処理方法 |
SG11202001429XA SG11202001429XA (en) | 2017-09-15 | 2018-08-31 | Information processing apparatus and information processing method |
CA3075249A CA3075249A1 (en) | 2017-09-15 | 2018-08-31 | Information processing apparatus and information processing method |
KR1020207006277A KR102607192B1 (ko) | 2017-09-15 | 2018-08-31 | 정보 처리 장치, 및 정보 처리 방법 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2017-177754 | 2017-09-15 | ||
JP2017177754 | 2017-09-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019054199A1 true WO2019054199A1 (ja) | 2019-03-21 |
Family
ID=65722792
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2018/032323 WO2019054199A1 (ja) | 2017-09-15 | 2018-08-31 | 情報処理装置、及び情報処理方法 |
Country Status (10)
Country | Link |
---|---|
US (1) | US11600270B2 (ja) |
EP (1) | EP3683792A4 (ja) |
JP (1) | JP7227140B2 (ja) |
KR (1) | KR102607192B1 (ja) |
CN (1) | CN111052231B (ja) |
AU (1) | AU2018333668B2 (ja) |
CA (1) | CA3075249A1 (ja) |
MX (1) | MX2020002591A (ja) |
SG (1) | SG11202001429XA (ja) |
WO (1) | WO2019054199A1 (ja) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020128552A1 (ja) * | 2018-12-18 | 2020-06-25 | 日産自動車株式会社 | 音声認識装置、音声認識装置の制御方法、コンテンツ再生装置、及びコンテンツ送受信システム |
JP2020185618A (ja) * | 2019-05-10 | 2020-11-19 | 株式会社スター精機 | 機械動作方法,機械動作設定方法及び機械動作確認方法 |
WO2021100555A1 (ja) * | 2019-11-21 | 2021-05-27 | ソニーグループ株式会社 | 情報処理システム、情報処理装置、情報処理方法及びプログラム |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20240038249A1 (en) * | 2022-07-27 | 2024-02-01 | Cerence Operating Company | Tamper-robust watermarking of speech signals |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003044069A (ja) * | 2001-07-19 | 2003-02-14 | Samsung Electronics Co Ltd | 音声認識による誤動作の防止及び音声認識率の向上が可能な電子機器及び方法 |
JP2005338454A (ja) * | 2004-05-27 | 2005-12-08 | Toshiba Tec Corp | 音声対話装置 |
JP2008305371A (ja) * | 2007-05-08 | 2008-12-18 | Softbank Bb Corp | 分散処理により膨大なコンテンツの検査を行う装置と方法、およびコンテンツの検査結果にもとづいて利用者間の自律的なコンテンツ流通とコンテンツ利用を制御するコンテンツ配信システム |
JP2013160883A (ja) * | 2012-02-03 | 2013-08-19 | Yamaha Corp | 通信端末、プログラム、コンテンツサーバおよび通信システム |
JP2016004270A (ja) | 2014-05-30 | 2016-01-12 | アップル インコーポレイテッド | 手動始点/終点指定及びトリガフレーズの必要性の低減 |
US20160358614A1 (en) * | 2015-06-04 | 2016-12-08 | Intel Corporation | Dialogue system with audio watermark |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7720249B2 (en) * | 1993-11-18 | 2010-05-18 | Digimarc Corporation | Watermark embedder and reader |
US6937984B1 (en) * | 1998-12-17 | 2005-08-30 | International Business Machines Corporation | Speech command input recognition system for interactive computer display with speech controlled display of recognized commands |
US9955205B2 (en) | 2005-06-10 | 2018-04-24 | Hewlett-Packard Development Company, L.P. | Method and system for improving interactive media response systems using visual cues |
US7983441B2 (en) * | 2006-10-18 | 2011-07-19 | Destiny Software Productions Inc. | Methods for watermarking media data |
JP5042799B2 (ja) * | 2007-04-16 | 2012-10-03 | Sony Corporation | Voice chat system, information processing device, and program |
JP5332602B2 (ja) | 2008-12-26 | 2013-11-06 | Yamaha Corporation | Service providing device |
JP2010164992A (ja) * | 2010-03-19 | 2010-07-29 | Toshiba Tec Corp | Voice dialogue device |
JP5982791B2 (ja) * | 2011-11-16 | 2016-08-31 | Sony Corporation | Information processing device, information processing method, information providing device, and information providing system |
CN104956436B (zh) | 2012-12-28 | 2018-05-29 | Socionext Inc. | Device with voice recognition function and voice recognition method |
US9548053B1 (en) | 2014-09-19 | 2017-01-17 | Amazon Technologies, Inc. | Audible command filtering |
US9924224B2 (en) * | 2015-04-03 | 2018-03-20 | The Nielsen Company (Us), Llc | Methods and apparatus to determine a state of a media presentation device |
US10079024B1 (en) * | 2016-08-19 | 2018-09-18 | Amazon Technologies, Inc. | Detecting replay attacks in voice-based authentication |
US10395650B2 (en) * | 2017-06-05 | 2019-08-27 | Google Llc | Recorded media hotword trigger suppression |
2018
- 2018-08-31 JP JP2019541990A patent/JP7227140B2/ja active Active
- 2018-08-31 CN CN201880057831.9A patent/CN111052231B/zh active Active
- 2018-08-31 MX MX2020002591A patent/MX2020002591A/es unknown
- 2018-08-31 WO PCT/JP2018/032323 patent/WO2019054199A1/ja unknown
- 2018-08-31 EP EP18856338.1A patent/EP3683792A4/en active Pending
- 2018-08-31 US US16/645,058 patent/US11600270B2/en active Active
- 2018-08-31 KR KR1020207006277A patent/KR102607192B1/ko active IP Right Grant
- 2018-08-31 SG SG11202001429XA patent/SG11202001429XA/en unknown
- 2018-08-31 AU AU2018333668A patent/AU2018333668B2/en active Active
- 2018-08-31 CA CA3075249A patent/CA3075249A1/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003044069A (ja) * | 2001-07-19 | 2003-02-14 | Samsung Electronics Co Ltd | Electronic device and method capable of preventing malfunction caused by voice recognition and improving the voice recognition rate |
JP2005338454A (ja) * | 2004-05-27 | 2005-12-08 | Toshiba Tec Corp | Voice dialogue device |
JP2008305371A (ja) * | 2007-05-08 | 2008-12-18 | Softbank Bb Corp | Device and method for inspecting massive content through distributed processing, and content distribution system for controlling autonomous content distribution and content use among users based on the inspection results |
JP2013160883A (ja) * | 2012-02-03 | 2013-08-19 | Yamaha Corp | Communication terminal, program, content server, and communication system |
JP2016004270A (ja) | 2014-05-30 | 2016-01-12 | Apple Inc. | Reduced need for manual start/endpoint designation and trigger phrases |
US20160358614A1 (en) * | 2015-06-04 | 2016-12-08 | Intel Corporation | Dialogue system with audio watermark |
Non-Patent Citations (1)
Title |
---|
See also references of EP3683792A4 |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020128552A1 (ja) * | 2018-12-18 | 2020-06-25 | Nissan Motor Co., Ltd. | Voice recognition device, control method of voice recognition device, content reproducing device, and content transmission/reception system |
US11922953B2 (en) | 2018-12-18 | 2024-03-05 | Nissan Motor Co., Ltd. | Voice recognition device, control method of voice recognition device, content reproducing device, and content transmission/reception system |
JP2020185618A (ja) * | 2019-05-10 | 2020-11-19 | Star Seiki Co., Ltd. | Machine operation method, machine operation setting method, and machine operation checking method |
WO2021100555A1 (ja) * | 2019-11-21 | 2021-05-27 | Sony Group Corporation | Information processing system, information processing device, information processing method, and program |
Also Published As
Publication number | Publication date |
---|---|
AU2018333668B2 (en) | 2023-12-21 |
JPWO2019054199A1 (ja) | 2020-10-22 |
MX2020002591A (es) | 2020-07-13 |
KR20200053486A (ko) | 2020-05-18 |
AU2018333668A1 (en) | 2020-03-26 |
EP3683792A4 (en) | 2020-11-11 |
KR102607192B1 (ko) | 2023-11-29 |
SG11202001429XA (en) | 2020-04-29 |
EP3683792A1 (en) | 2020-07-22 |
US20200211549A1 (en) | 2020-07-02 |
JP7227140B2 (ja) | 2023-02-21 |
CA3075249A1 (en) | 2019-03-21 |
US11600270B2 (en) | 2023-03-07 |
CN111052231B (zh) | 2024-04-12 |
CN111052231A (zh) | 2020-04-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7477547B2 (ja) | Receiving device and information processing method | |
JP7227140B2 (ja) | Information processing device, information processing method, audio processing device, and audio processing method | |
TWI665659B (zh) | Audio decoding device, audio decoding method, and audio encoding method | |
JP7020799B2 (ja) | Information processing device and information processing method | |
US20200082816A1 (en) | Communicating context to a device using an imperceptible audio identifier | |
US20090265022A1 (en) | Playback of multimedia during multi-way communications | |
KR102586630B1 (ko) | Receiving device, transmitting device, and data processing method | |
JP6569793B2 (ja) | Transmitting device and transmitting method |
US11197048B2 (en) | Transmission device, transmission method, reception device, and reception method | |
US11438650B2 (en) | Information processing apparatus, information processing method, transmission apparatus, and transmission method | |
JP6457938B2 (ja) | Receiving device, receiving method, and transmitting method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application |
Ref document number: 18856338 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2019541990 Country of ref document: JP Kind code of ref document: A |
|
ENP | Entry into the national phase |
Ref document number: 3075249 Country of ref document: CA |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 2018333668 Country of ref document: AU Date of ref document: 20180831 Kind code of ref document: A |
|
ENP | Entry into the national phase |
Ref document number: 2018856338 Country of ref document: EP Effective date: 20200415 |