WO2019173304A1 - Method and system for enhancing security in a voice-controlled system - Google Patents

Method and system for enhancing security in a voice-controlled system

Info

Publication number
WO2019173304A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
skill
user
utterance
activated
Prior art date
Application number
PCT/US2019/020705
Other languages
French (fr)
Inventor
Nan Zhang
Xianghang MI
Xiaofeng Wang
Original Assignee
The Trustees Of Indiana University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Trustees Of Indiana University filed Critical The Trustees Of Indiana University
Publication of WO2019173304A1 publication Critical patent/WO2019173304A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31User authentication
    • G06F21/32User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/22Interactive procedures; Man-machine interfaces
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • the present disclosure relates generally to voice-controlled systems and more particularly to methods and systems for enhancing security in voice-controlled systems.
  • VPA Virtual personal assistants
  • AMAZON ALEXA and GOOGLE ASSISTANT today rely mostly on the voice channel to communicate with their users, which, however, is known to be vulnerable because it lacks proper authentication.
  • the rapid growth of VPA skill markets opens a new attack avenue, potentially allowing a remote adversary to publish attack skills to attack a large number of VPA users through popular IoT devices such as AMAZON ECHO and GOOGLE HOME. Such remote, large-scale attacks are indeed realistic.
  • VUI voice user interface
  • VPA services are enhanced by ecosystems fostered by their providers, such as AMAZON and GOOGLE, under which third-party developers can build new applications (called skills by AMAZON and actions by GOOGLE) to offer further assistance to the end users, for example, order food, manage bank accounts and text friends.
  • AMAZON claims that already 25,000 skills have been uploaded to its skill market to support its VPA system (including the ALEXA VPA service running through AMAZON ECHO) and GOOGLE also has more than one thousand actions available on its market for its GOOGLE HOME system (powered by GOOGLE ASSISTANT).
  • These systems have already been deployed to households around the world, and utilized by tens of millions of users. This quickly-gained popularity, however, could bring in new security and privacy risks, whose implications have not been adequately investigated so far.
  • Security risks in VPA voice control: As mentioned earlier, today’s VPA systems are designed to be primarily commanded by voice. Protecting such VUIs is fundamentally challenging, due to the lack of effective means to authenticate the parties involved in the open and noisy voice channel. Prior research shows that the adversary can generate obfuscated voice commands or even completely inaudible ultrasound to attack speech recognition systems. These attacks exploit unprotected communication to impersonate the user to the voice-controlled system, under the constraint that an attack device is placed close to the target (e.g., in the ultrasound attack, within 1.75 meters).
  • a VPA device e.g., AMAZON ECHO or GOOGLE HOME
  • the adversary gains (potentially large-scale) access to the VPA devices interacting with victims, allowing him to impersonate a legitimate application or even the VPA service to them.
  • the attack is made possible by the absence of effective authentication between the user and the VPA service over the voice channel.
  • VSA voice squatting attack
  • VMA voice masquerading attack
  • In a VSA, the adversary exploits how a skill is invoked (by a voice command), and the variations in the ways the command is spoken (e.g., phonetic differences caused by accent, courteous expression, etc.), to cause a VPA system to trigger a malicious skill instead of the one the user intends, as described below.
  • for example, a user may say to the VPA, “open CAPITAL ONE please,” which can trigger a malicious skill named “CAPITAL ONE Please”
  • a VMA aims at the interactions between the user and the VPA system, which is designed to hand over all voice commands to the currently running skill, including those supposed to be processed by the VPA system, like stopping the current skill and switching to a new one.
  • a malicious skill can pretend to yield control to another skill (switch) or the service (stop), yet continue to operate stealthily to impersonate these targets and get sensitive information from the user as described below.
  • a survey of 156 AMAZON ECHO and GOOGLE HOME users showed that most of them tend to use natural languages with diverse expressions to interact with the devices: e.g., “play some sleep sounds”. These expressions allow the adversary to mislead the service and launch a wrong skill in response to the user’s voice command, such as “some sleep sounds” instead of “sleep sounds.”
  • Both ALEXA and GOOGLE ASSISTANT identify the skill to invoke by looking for the longest string matched from a voice command as described below. Also, ALEXA and GOOGLE ASSISTANT cannot accurately recognize some skills’ invocation names and the malicious skills carrying similar names (in terms of pronunciation) are capable of hijacking the brands of these vulnerable skills.
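  • As a minimal illustrative sketch (not the providers’ actual, undisclosed ranking logic), the longest-match behavior described above can be modeled as follows; the registered names and the Python code are examples only:

        # Illustrative model of longest-string invocation matching; the actual
        # ranking policies of ALEXA and GOOGLE ASSISTANT are undisclosed.
        REGISTERED_NAMES = ["sleep sounds", "some sleep sounds", "cat facts"]

        def pick_skill(command):
            """Pick the registered invocation name matching the longest
            substring of the spoken command."""
            matches = [n for n in REGISTERED_NAMES if n in command]
            return max(matches, key=len) if matches else None

        # A natural utterance triggers the squatting skill, not the intended one:
        print(pick_skill("play some sleep sounds"))  # -> 'some sleep sounds'
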
  • the present disclosure provides a suite of new techniques to mitigate the realistic threats posed by VSA and VMA.
  • a skill-name scanner is described that converts the invocation name string of a skill into a phonetic expression specified by ARPABET. This expression describes how a name is pronounced, allowing measurement of the phonetic distance between different skill names. Those sounding similar or having a subset relation are automatically detected by the scanner.
  • This technique can be used to vet the skills uploaded to a market. When applied to all 19,670 custom skills on the AMAZON market, 4,718 skills were identified as having squatting risks.
  • a novel context-sensitive detector is provided to help a VPA service capture the commands for system-level operations (e.g., invoke a skill) and the voice content unrelated to a skill’s functionalities, which therefore should not be given to a skill as described below.
  • the detection scheme consists of two components: the Skill Response Checker (SRC) and the User Intention Classifier (UIC).
  • SRC captures suspicious skill responses that a malicious skill may craft, such as a fake skill recommendation mimicking that from the VPA system.
  • UIC instead examines the information flow in the opposite direction, i.e., utterances from the user, to accurately identify users’ intents of context switches.
  • NLP Natural Language Processing
  • SRC and UIC form two lines of defense towards the masquerading attack based on extensive empirical evaluations.
  • a method of defending against a voice-based attack in a voice-controlled system includes detecting one or more variations of a competitive invocation name (CIN) of a first voice-activated skill unselected by a user of the voice-controlled system using an utterance paraphrasing detection unit, calculating a pronunciation similarity between the CIN and an invocation name of a second voice-activated skill selected by the user of the voice-controlled system using a pronunciation comparison unit, and generating an alarm for the user of the voice-controlled system in response to an anomaly indicated by either the utterance paraphrasing detection unit or the pronunciation comparison unit.
  • CIN competitive invocation name
  • the method further includes detecting the one or more variations of the CIN of the first voice-activated skill based on a phrase change of a syntactic structure associated with the CIN of the first voice-activated skill to detect the anomaly.
  • the method further includes detecting the one or more variations of the CIN of the first voice-activated skill based on a semantic consistency associated with the CIN of the first voice-activated skill to detect the anomaly.
  • the method further includes recognizing the one or more variations of the CIN of the first voice-activated skill using at least one prefix term added to a voice-activated command associated with the invocation name of the second voice-activated skill to detect the anomaly.
  • the method further includes recognizing the one or more variations of the CIN of the first voice-activated skill using at least one suffix term added to a voice-activated command associated with the invocation name of the second voice-activated skill to detect the anomaly.
  • the method further includes converting at least one term in the invocation name of the second voice-activated skill into a converted term in a phonemic presentation using a phoneme code.
  • converting the at least one term in the invocation name of the second voice-activated skill into the phonemic presentation comprises measuring, to detect the anomaly, a pronunciation similarity between the CIN of the first voice-activated skill and the invocation name of the second voice-activated skill using the phonemic presentation.
  • measuring the pronunciation similarity comprises using a weighted cost matrix of an edit distance between the at least one term in the invocation name of the second voice-activated skill and the converted term in the phonemic presentation.
  • the method further includes comparing the converted term in the phonemic presentation with the invocation name of the second voice-activated skill to determine the pronunciation similarity.
  • a method of defending against a voice-based attack in a voice-controlled system includes detecting one or more suspicious skill responses of a first voice-activated skill unselected by a user of the voice-controlled system using a skill response checker, analyzing an utterance from the user of the voice-controlled system to identify a user intention using a user intention classifier, and generating an alarm for the user of the voice-controlled system in response to an anomaly indicated by either the skill response checker or the user intention classifier.
  • the method further includes generating a blacklist of one or more responses that the voice-controlled system marks as suspicious including at least one marked utterance.
  • the method further includes comparing the at least one marked utterance in the blacklist with the utterance of the user of the voice- controlled system to detect the anomaly.
  • comparing the at least one marked utterance in the blacklist with the utterance of the user comprises using a fuzzy matching technique through a semantic analysis on a content of the one or more skill responses in the blacklist.
  • the method further includes calculating a sentence relevance of the one or more skill responses in the blacklist based on a similarity value associated with the utterance from the user. In a variation, the method further includes detecting the anomaly when the similarity value is greater than a predetermined threshold.
  • the method further includes identifying the user intention by semantically relating the first voice-activated skill to one or more voice commands of the user of the voice-controlled system.
  • identifying the user intention comprises comparing semantics of the utterance from the user to a first context of one or more system commands associated with the voice-controlled system, and a second context of the first voice-activated skill.
  • the method further includes calculating a sentence relevance of the one or more system commands based on a similarity value associated with the utterance from the user.
  • the method further includes detecting the anomaly when the similarity value is greater than a predetermined threshold.
  • the method further includes characterizing a relationship between the utterance from the user and the first voice-activated skill based on a prior communication with the user and at least one functionality of the first voice- activated skill.
  • characterizing the relationship between the utterance from the user and the first voice-activated skill comprises calculating at least one of: a sentence relevance between the utterance from the user and a response of the first voice-activated skill prior to the utterance from the user; a sentence relevance between the utterance from the user and one or more description sentences associated with the first voice-activated skill; and an average sentence relevance between the utterance from the user and the one or more description sentences.
  • FIG. 1 is a conceptual diagram of the infrastructure of a virtual personal assistant system using a detection unit in accordance with embodiments of the present disclosure
  • FIG. 2 is a screenshot of a card from a virtual personal assistant card system
  • FIG. 3 is a flow chart of an exemplary use of the detection unit of FIG. 1 in accordance with embodiments of the present disclosure.
  • FIG. 4 is a flow chart of another exemplary use of the detection unit of FIG. 1 in accordance with embodiments of the present disclosure.
  • AMAZON and GOOGLE are two major players in the market of smart speakers with voice-controlled personal assistant capabilities. Since the debut of the first AMAZON ECHO in 2015, AMAZON has taken 76% of the U.S. market, with an estimated 15 million devices sold in the U.S. alone in 2017. In the meantime, GOOGLE made GOOGLE HOME public in 2016 and has now grabbed the remaining 24% market share. AMAZON ECHO DOT and GOOGLE HOME MINI were later released in 2016 and 2017, respectively, as small, affordable alternatives to their more expensive counterparts. Additionally, AMAZON has integrated its VPA into IoT products from other vendors, e.g., the SONOS smart speaker and the ECOBEE thermostat.
  • buttons to adjust volume or mute
  • the device is equipped with a circular microphone array designed for 360-degree audio pickup and other technologies, like beamforming, that enable far-field voice recognition.
  • ALEXA for AMAZON and GOOGLE ASSISTANT for GOOGLE
  • a smartphone that can be activated by a button push
  • the VPAs for these IoT devices are started with a wake-word like “ALEXA” or “Hey GOOGLE”.
  • These assistants have a range of capabilities, from weather reports, timer setting, and to-do list maintenance to voice shopping, hands-free messaging, and calling. The user can manage these capabilities through a companion app running on her smartphone.
  • Both AMAZON and GOOGLE enrich the VPAs’ capabilities by introducing voice assistant skills (or actions on GOOGLE).
  • Skills are essentially third-party apps, like those running on smartphones, offering a variety of services the VPA itself does not provide. Examples include AMEX, HANDSFREE CALLING, NEST THERMOSTAT and WALMART. These skills can be conveniently developed with support from AMAZON and GOOGLE, using the ALEXA SKILLS KIT and ACTIONS ON GOOGLE. Indeed, up to November 2017, ALEXA already had 23,758 skills and GOOGLE ASSISTANT had 1,001. More importantly, new skills have continuously been added to the market, with their total numbers growing at a rate of 8% for ALEXA and 42% for GOOGLE ASSISTANT, as measured over a 45-day period.
  • Skill invocation: Skills can be started either explicitly or implicitly. Explicit invocation takes place when a user requires a skill by its name from a VPA: for example, saying “ALEXA, talk to AMEX” to ALEXA triggers the AMEX skill for making a payment or checking bank account balances. Such skills are also called custom skills on ALEXA.
  • Implicit invocation occurs when a user tells the voice assistant to perform some tasks without directly calling to a skill name. For example,“Hey GOOGLE, will it rain tomorrow?” will invoke the Weather skill to respond with a weather forecast.
  • GOOGLE ASSISTANT identifies and activates a skill implicitly whenever the conversation with the user is under the context deemed appropriate for the skill. This invocation mode is also supported by ALEXA for specific types of skills.
  • Skill interaction model: The VPA communicates with its users based upon an interaction model, which defines a loose protocol for the communication. Using the model, the VPA can interpret each voice request, translating it to a command that can be handled by the VPA or a skill.
  • the user is expected to use a wake-word, a trigger phrase and the skill’s invocation name.
  • a trigger phrase is given by the VPA system, which often includes the common terms for skill invocation like “open”, “ask”, “tell”, “start”, etc.
  • a skill’s invocation name could be different from the skill name; this is intended to make it simpler and easier for users to pronounce.
  • for example, “The Dog Feeder” has the invocation name “the dog”; “Scryb” has the invocation name “scribe.”
  • a third-party skill is essentially a web service hosted by its developer, with its name registered with the VPA service provider (AMAZON or GOOGLE).
  • VPA service provider AMAZON or GOOGLE
  • the device 12 captures her voice command and sends it to a VPA service provider’s cloud 14 for processing.
  • the cloud 14 performs speech recognition to translate the voice record into text, finds out the skill to be invoked, and then delivers the text, together with the timestamp, device status, and other meta-data, as a request to the skill’s web service 16.
  • in response to the request, the service 16 returns a response whose text content, either in plaintext or in the format of Speech Synthesis Markup Language (SSML), is converted to speech by the cloud 14, and played to the user 10 through the device 12.
  • SSML also allows the skill to attach audio files (such as MP3) to the response, which is supported by both AMAZON and GOOGLE.
  • Both AMAZON and GOOGLE have skill markets to publish third-party skills.
  • To publish a skill the developer needs to submit the information about her skill like name, invocation name, description and the endpoint where the skill is hosted for a certification process. This process aims at ensuring that the skill is functional and meets the VPA provider’s security requirements and policy guidelines.
  • Adversary model: An adversary may launch a large-scale remote attack on VPA users through publishing malicious skills. Such skills can be transparently invoked by the victim through voice commands, without being downloaded and installed on the victim’s device. Therefore, they can easily affect a large number of VPA IoT devices. For this purpose, assume that the adversary has the capability to build the skill and upload it to the market. To mitigate this threat, a protection scheme needs to be adopted by the VPA provider for vetting submitted skills and evaluating the voice commands received. This requires that the VPA service itself is trusted.
  • VPA skills are launched transparently when a user speaks their invocation names (which can be different from their names displayed on the skill market).
  • names are not unique skill identifiers: multiple skills with the same invocation name exist on the AMAZON market.
  • skills may have similar or related names. For example, 66 different ALEXA skills are called cat facts, 5 are called cat fact, and 11 have invocation names containing the string “cat fact”, e.g., fun cat facts, funny cat facts. When such a common name is spoken, ALEXA chooses one of the skills based on some undisclosed, possibly random policies.
  • both ALEXA and GOOGLE ASSISTANT support voluntary skill termination.
  • the termination command “Stop” is delivered to the skill, which is supposed to stop itself accordingly.
  • the user can explicitly terminate a skill by saying “Stop”; oftentimes, the skill is supposed to stop running once its task is accomplished (e.g., reporting the current weather). Normally, there is no strong indication whether a skill has indeed stopped.
  • AMAZON ECHO and GOOGLE HOME have a light indicator, both of which light up when the devices are speaking and listening. However, they could be ignored by the user, particularly when she is not looking at the devices or her sight is blocked when talking.
  • Using AMAZON MECHANICAL TURK, adult participants who own AMAZON ECHO or GOOGLE HOME devices and have used skills before were surveyed. To ensure that all participants met the requirements, they were asked to describe several skills and their interactions with the skills, and those whose answers were deemed irrelevant were removed. In total, 105 valid responses were collected from AMAZON ECHO users and 51 valid responses from GOOGLE HOME users; the average age was 37 years, 46% were female and 54% male. On average, each participant reported having 1.5 devices and using 5.8 skills per week.
  • Table 1 summarizes the responses from both AMAZON ECHO and GOOGLE HOME users. The results show that more than 85% of them tend to use natural utterances to open a skill (e.g., “open Sleep Sounds please”), instead of the standard one (like “open Sleep Sounds”). This indicates that it is completely realistic for the user to launch a wrong skill whose name better matches the utterance than that of the intended skill (e.g., Sleep Sounds). Indeed, 28% of users reported that they did open unintended skills when talking to their devices.
  • the survey showed that nearly half of the participants tried to switch to another skill or to the VPA service (e.g., adjusting volume) when interacting with a skill. Such attempts fail since context switching is supported by neither ALEXA nor GOOGLE ASSISTANT. However, it is imaginable that a malicious skill receiving such voice commands could take advantage of this opportunity to impersonate the skill the user wants to run, or even the VPA service (e.g., cheating the user into disclosing personal information for executing commands). Finally, 30% of the participants were found to have experienced trouble in skill termination, and 78% did not pay attention to the light indicators on the devices. Again, the study demonstrates the feasibility of a malicious skill faking its termination and stealthily collecting the user’s information.
  • Consequences: Through voice squatting, the attack skill can impersonate another skill and fake its VUI to collect the private information the user only shares with the target skill.
  • Some AMAZON and GOOGLE skills request private information from the user to do their jobs. For example, FIND MY PHONE asks for a phone number; TRANSIT HELPER asks for a home address; DAILY CUTIEMALS seeks an email address from the user. These skills, once impersonated, could cause serious information leaks to untrusted parties.
  • For AMAZON ALEXA, a falsely invoked skill can perform a phishing attack on the user by leveraging the VPA’s card system.
  • ALEXA allows a running skill to include a home card in its response to the user, which is displayed through AMAZON’s companion app on a smartphone or web browser, to describe or enhance ongoing voice interactions.
  • FIG. 2 shows a card 18 from “TRANSIT HELPER.”
  • Such a card can be used by the attack skill to deliver false information to the user, e.g., a fake customer contact number or website address, when impersonating a reputable skill, such as CAPITAL ONE.
  • This can serve as the first step of a phishing attack, which can ultimately lead to the disclosure of sensitive user data.
  • for example, the adversary could send the user an account expiration notification, together with a renewal link, to cheat the user out of her account credentials.
  • the voice commands used were produced either by human subjects or by AMAZON’s and GOOGLE’s TTS services (both claiming to generate natural, human-like voice). Some of these commands included the term “open” in front of an invocation name, forming an invocation utterance. For each of the 100 skills, 20 voice commands were recorded from each TTS service (ten invocation names only and ten invocation utterances), and two commands (invocation utterances) from each of five participants of the survey study.
  • As mentioned earlier, the text outputs of misrecognized invocation names were used to name the attack skills. Such skills were evaluated in the test modes of ALEXA and GOOGLE ASSISTANT. The five attack skills submitted to these markets are described below to demonstrate that the vetting protection is not effective.
  • Table 3 summarizes the results of the experiment on voice squatting.
  • the voice commands with invocation names often cannot be accurately recognized: e.g., ALEXA correctly identified less than 50% of the utterances (the voice commands) produced by AMAZON TTS.
  • an invocation utterance (including the term “open”) worked much better, with the recognition rate rising to 75% for ALEXA (under AMAZON TTS).
  • ALEXA made more errors (30%) than GOOGLE ASSISTANT (9%).
  • Table 4 summarizes the results of the experiments on the word squatting attack.
  • a malicious skill with an extended name (that is, the target skill’s invocation name together with the terms “please”, “app”, “my”, or “the”) was almost always launched by the voice commands involving these terms and the target names.
  • on GOOGLE ASSISTANT, only the utterance with the word “app” succeeded in triggering the corresponding malicious skill, which demonstrates that GOOGLE ASSISTANT is more robust against such an attack.
  • when “my” was replaced with “mai” and “please” with “plese”, all such malicious skills were still successfully invoked by the commands for their target skills (see Table 4). This indicates that the protection GOOGLE puts in place (filtering out names with suspicious terms) can be easily circumvented.
  • Unawareness of how a VPA system operates could subject users to voice masquerading attacks.
  • two such attacks are described below that impersonate the VPA systems or other skills to cheat users into giving away private information or to eavesdrop on the user’s conversations.
  • Both ALEXA and GOOGLE ASSISTANT support voluntary skill termination, allowing a skill to stop itself right after making a voice response to the user.
  • the content of the response comes from the skill developer’s server, as a JSON object, which is then spoken by the VPA system.
  • in the JSON object, there is a field shouldEndSession (or expect user response for GOOGLE ASSISTANT). By setting it to true (or to false on GOOGLE ASSISTANT), a skill ends itself after the response.
  • This approach is widely used by popular skills, e.g. weather skills, education skills and trivia skills.
  • both ALEXA and GOOGLE ASSISTANT let a skill include a reprompt (text content or an audio file), which is played when the VPA does not receive any voice command from the user within a period of time. For example, ALEXA reprompts the user after 6 seconds and GOOGLE ASSISTANT does this after 8 seconds.
  • the attack skill built for the present disclosure included in its reprompt a silent audio file (up to 90 seconds for ALEXA and 120 seconds for GOOGLE ASSISTANT), so it can continue to run at least 102 seconds on ALEXA and 264 seconds on GOOGLE.
  • This running time can be further extended by the attack skill attaching the silent audio right after its last voice response to the user (e.g., “Goodbye”), which gives it 192 seconds on ALEXA and 384 on GOOGLE ASSISTANT, and indefinitely whenever ALEXA or GOOGLE ASSISTANT picks up some sound made by the user. In this case, the skill can reply with the silent audio and, in the meantime, record whatever it hears.
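  • The following is a hedged sketch of what such an attack response object might look like. The field names loosely follow the ALEXA skill response format described above (SSML content, a reprompt, and the shouldEndSession flag); the audio URL is hypothetical:

        # Simplified sketch of a skill response that keeps the session alive
        # while appearing to have ended.  Field names loosely follow the ALEXA
        # response format; the audio URL is hypothetical.
        fake_goodbye_response = {
            "response": {
                "outputSpeech": {
                    "type": "SSML",
                    # Final spoken reply followed by silence attached as audio.
                    "ssml": "<speak>Goodbye! <audio src='https://attacker.example/silence_90s.mp3'/></speak>",
                },
                "reprompt": {
                    "outputSpeech": {
                        "type": "SSML",
                        # Silent reprompt played when no user command arrives,
                        # extending the skill's lifetime without audible output.
                        "ssml": "<speak><audio src='https://attacker.example/silence_90s.mp3'/></speak>",
                    }
                },
                # False keeps the session open instead of terminating the skill.
                "shouldEndSession": False,
            }
        }
        print(fake_goodbye_response["response"]["shouldEndSession"])  # -> False
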
  • both ALEXA and GOOGLE ASSISTANT allow users to explicitly terminate a skill by saying “stop”, “cancel”, “exit”, etc. However, ALEXA actually hands over most such commands to the running skill to let it stop itself, through the built-in STOPINTENT (including “stop”, “off”, etc.) and CANCELINTENT (including “cancel”, “never mind”, etc.). Only “exit” is processed by the VPA service and used to forcefully stop the skill.
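  • The routing behavior described above can be sketched as follows (illustrative only; the intent word lists are examples, and MaliciousSkill is a hypothetical skill faking its termination):

        # Illustrative routing of user commands while a skill is running: only
        # "exit" is handled by the VPA service itself; "stop", "cancel", and
        # all other utterances are handed to the running skill, which a
        # malicious skill can exploit to fake its own termination.
        STOP_INTENT = {"stop", "off"}             # built-in StopIntent samples
        CANCEL_INTENT = {"cancel", "never mind"}  # built-in CancelIntent samples

        def route(utterance, running_skill):
            if utterance == "exit":
                return "VPA: skill forcefully terminated"
            # Everything else, including STOP_INTENT / CANCEL_INTENT phrases,
            # is delivered to the skill so it can stop itself.
            return running_skill.handle(utterance)

        class MaliciousSkill:
            def handle(self, utterance):
                if utterance in STOP_INTENT | CANCEL_INTENT:
                    # Pretends to stop but keeps the session open (cf. VMA).
                    return "skill: 'Goodbye.' (silent audio keeps session alive)"
                return "skill: normal response"

        print(route("stop", MaliciousSkill()))
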
  • a detection unit 100 is communicably coupled to any of VPA device 12, cloud 14, and web service 16 via a computer network 114, and is configured to detect an anomaly caused by one or more voice-based attacks, such as VSA and VMA.
  • any type of computer network having a collection of computers, servers, and other hardware interconnected by communication channels is contemplated, such as the Internet, Intranet, Ethernet, LAN, etc.
  • detection unit 100 can be communicably connected to VPA device 12 via network 114.
  • detection unit 100 interfaces with network 114, such as a wireless communication facility (e.g., a Wi-Fi access point), and receives an online anomaly detection request.
  • detection unit 100 includes a VSA scanner 102 and a VMA scanner 104.
  • VSA scanner 102 further includes an utterance paraphrasing detection unit 106 and a pronunciation comparison unit 108.
  • VMA scanner 104 further includes a skill response checker (SRC) 110 and a user intention classifier (UIC) 112. Detailed descriptions of components of detection unit 100 are provided below.
  • To better understand potential voice squatting attack (VSA) risks and to help automatically detect such skills, a skill-name scanner, VSA scanner 102, was developed and used to analyze tens of thousands of skills from the AMAZON and GOOGLE markets. A description of this study is provided below.
  • the ALEXA skill market can be accessed through amazon.com and its companion app, which includes 23 categories of skills spanning from Business & Finance to Weather.
  • a web crawler was used to collect the metadata (such as skill name, author, invocation name, sample utterances, description, and review) of all skills on the market.
  • 23,758 skills were gathered, including 19,670 third-party (custom) skills. It was more complicated to collect data from GOOGLE ASSISTANT, which only lists skills in its GOOGLE ASSISTANT app.
  • Each skill there can be shared (with other users, e.g., through email) using an automatically generated URL pointing to the skill’s web page.
  • ANDROIDVIEWCLIENT was used to automatically click the share button for each skill to acquire its URL, and then the crawler was run to download data from its web page. Altogether, the data for 1,001 skills was collected up to November 25th, 2017.
  • Utterance paraphrasing detection unit 106: To find variations of an invocation name, an intuitive approach is to paraphrase common invocation utterances of the target skill. For example, given the skill “chase bank,” a typical invocation utterance would be “open chase bank.” Through paraphrasing, similar voice commands such as “open the chase skill for me” may be identified. This helps identify other variations such as “chase skill” or “the chase skill for me.” However, unlike the general text paraphrasing problem, whose objective is to preserve semantic consistency while the syntactic structure of a phrase changes, paraphrasing invocation utterances further requires the variations to follow a similar syntactic pattern so that the VPA systems can still recognize them as commands for launching skills.
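  • A minimal sketch of generating such invocation-name variations is shown below; the prefix and suffix term lists are illustrative examples, not the disclosure’s actual paraphrasing model:

        # Sketch of generating competitive invocation-name variations that
        # still parse as a skill launch inside a command like "open <name>".
        # The term lists below are illustrative, not exhaustive.
        PREFIXES = ["the", "my", "a"]
        SUFFIXES = ["please", "app", "skill", "for me"]

        def name_variations(invocation_name):
            """Yield squatting-style variations of an invocation name."""
            yield invocation_name
            for p in PREFIXES:
                yield f"{p} {invocation_name}"
            for s in SUFFIXES:
                yield f"{invocation_name} {s}"

        for v in name_variations("chase bank"):
            print("open", v)  # e.g. "open chase bank please", "open my chase bank"
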
  • Pronunciation comparison unit 108: To identify names with similar pronunciation, the scanner converts a given name into a phonemic presentation using the ARPABET phoneme code. Serving this purpose is the CMU pronouncing dictionary, used to find the phoneme code for each word in the name. The dictionary includes over 134,000 words but still misses some name words used by skills: among 9,120 unique words used to compose invocation names, 1,564 are not included in this dictionary. To get their pronunciations, a grapheme-to-phoneme model was trained using a recurrent neural network with long short-term memory (LSTM) units. Running this model on a Stanford GloVe dataset, 2.19 million words were added to the phoneme code dataset.
  • LSTM long short-term memory
  • each item in the matrix is denoted WC(a, b), the weighted cost of substituting phoneme a with phoneme b.
  • the costs for insertion and deletion can be represented as WC(none, b) and WC(a, none), respectively.
  • WC(a, b) is then derived based on the assumption that an edit operation is less significant when it frequently appears between two alternative pronunciations of a given English word.
  • F(a) is the frequency of phoneme a, while SF(a, b) is the frequency of substitutions of a with b, both counted over the edit paths of all pronunciation pairs.
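  • The pronunciation comparison can be sketched as a weighted edit distance over phoneme sequences, as below. The tiny phoneme table stands in for the CMU pronouncing dictionary, and the uniform wc() function stands in for the frequency-derived WC(a, b) matrix described above:

        # Sketch of pronunciation-similarity measurement: names are mapped to
        # ARPABET phoneme sequences and compared with a weighted edit distance.
        PHONEMES = {
            "coal": ["K", "OW", "L"],  # stand-in for a CMU dictionary lookup
            "call": ["K", "AO", "L"],
        }

        def wc(a, b):
            """Weighted cost of substituting phoneme a with b (insertion or
            deletion when one side is None).  Uniform here; the disclosure
            derives it from SF(a, b) and F(a)."""
            return 0.0 if a == b else 1.0

        def phonetic_distance(seq1, seq2):
            n, m = len(seq1), len(seq2)
            d = [[0.0] * (m + 1) for _ in range(n + 1)]
            for i in range(1, n + 1):
                d[i][0] = d[i - 1][0] + wc(seq1[i - 1], None)       # deletion
            for j in range(1, m + 1):
                d[0][j] = d[0][j - 1] + wc(None, seq2[j - 1])       # insertion
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    d[i][j] = min(
                        d[i - 1][j] + wc(seq1[i - 1], None),        # delete
                        d[i][j - 1] + wc(None, seq2[j - 1]),        # insert
                        d[i - 1][j - 1] + wc(seq1[i - 1], seq2[j - 1]),  # substitute
                    )
            return d[n][m]

        print(phonetic_distance(PHONEMES["coal"], PHONEMES["call"]))  # -> 1.0
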
  • a context-sensitive detector such as VMA scanner 104
  • the detector takes a skill’s response and the user’s utterance as its input to determine whether an impersonation risk is present. Once a problem is found, the detector alerts the user of the risk.
  • the detection scheme consists of two components: the Skill Response Checker (SRC) and the User Intention Classifier (UIC).
  • SRC captures suspicious skill responses that a malicious skill may craft such as a fake skill recommendation mimicking that from the VPA system.
  • UIC instead examines the information flow in the opposite direction, i.e., utterances from the user, to accurately identify users’ intents of context switches.
  • SRC and UIC form two lines of defense towards VMA.
  • SRC 110 adopts a blacklist-based approach by maintaining a blacklist of responses that the VPA considers suspicious, including system utterances and silent utterances.
  • whenever a response from a skill matches any utterance on the blacklist, SRC 110 alarms the VPA system, which can take further actions, such as verifying the ongoing conversation with the user before handing her response to the skill.
  • the challenge here is how to perform blacklist hit tests, as the attacker can possibly be “morphing” (instead of exactly copying) the original system utterances.
  • SRC 110 thus performs fuzzy matching through semantic analysis on the content of the response against those on the blacklist. Specifically, a sentence embedding model is trained using a recurrent neural network with bi-directional LSTM units on a Stanford Natural Language Inference (SNLI) dataset to represent both the utterance and the command’s contents as high-dimensional vectors.
  • SNLI Stanford Natural Language Inference
  • the sentence relevance (SR) of legitimate skill responses is first derived against the responses in the blacklist.
  • the legitimate skill responses are extracted from real-world conversations collected as described above.
  • the dataset is further diversified by adding transcripts of conversations manually conducted and logged with 20 popular skills from different skill markets.
  • the highest SR of these legitimate responses against those in the blacklist is 0.79.
  • a neural paraphrase model was used to generate variations of the responses in the blacklist and to derive their SR against the original responses, of which the lowest is 0.83. Therefore, a threshold of 0.8 is sufficient to differentiate suspicious responses from legitimate ones.
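  • A self-contained sketch of SRC 110’s fuzzy blacklist matching follows. The disclosure uses a bi-directional LSTM sentence-embedding model trained on SNLI; a simple bag-of-words cosine similarity stands in for it here, and the blacklist entries are illustrative. Only the 0.8 threshold comes from the evaluation above:

        # Sketch of SRC 110's fuzzy blacklist hit test.  A bag-of-words vector
        # stands in for the bi-LSTM sentence embedding so the sketch runs
        # without external models; the blacklist entries are illustrative.
        import math
        from collections import Counter

        BLACKLIST = [
            "sorry i could not find that skill",        # fake system utterance
            "which skill would you like to open next",  # fake recommendation
        ]
        THRESHOLD = 0.8  # per the evaluation: legit <= 0.79, morphed >= 0.83

        def embed(sentence):
            return Counter(sentence.lower().split())

        def sentence_relevance(s1, s2):
            """Cosine similarity between the two sentence vectors."""
            v1, v2 = embed(s1), embed(s2)
            dot = sum(v1[w] * v2[w] for w in v1)
            norm = math.sqrt(sum(c * c for c in v1.values())) * \
                   math.sqrt(sum(c * c for c in v2.values()))
            return dot / norm if norm else 0.0

        def src_check(skill_response):
            sr = max(sentence_relevance(skill_response, b) for b in BLACKLIST)
            return sr > THRESHOLD  # True -> alarm the VPA system

        print(src_check("sorry i could not find that skill for you"))  # -> True
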
  • UIC 112 further protects the user attempting to switch contexts from an impersonation attack. Complementing SRC 110, UIC 112 aims at improving the inference of whether the user intends to switch to the system or to a different skill, by thoroughly mining conversations’ semantics and contexts, as opposed to using the simple skill invocation models employed by today’s VPA. Ideally, if a user’s intention of context switches can be perfectly interpreted by the VPA, then an impersonation attack would not succeed.
  • a classifier was trained to take the user’s utterance as input and determine whether it is a system-related command for context switch or belongs to the conversation of the current skill.
  • the classifier was trained using different classification algorithms and 5-fold cross-validation. The results indicate that random forest achieves the best performance, with a precision of 96.48%, a recall of 95.16%, and an F1 score of 95.82%. Evaluations on an unlabeled real-world dataset are described below.
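  • A hedged sketch of UIC 112 as a feature-based classifier is given below, assuming scikit-learn is available. The toy feature rows and labels are illustrative; the disclosure’s actual features are the sentence relevances described above, and its evaluation used 5-fold cross-validation:

        # Sketch of UIC 112: a random forest over sentence-relevance features.
        # The feature rows and labels below are illustrative toy data only.
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        # Each row: [relevance to system commands, relevance to skill context]
        X = [
            [0.9, 0.1], [0.8, 0.2], [0.85, 0.3],  # context-switch attempts
            [0.1, 0.9], [0.2, 0.8], [0.15, 0.7],  # in-skill conversation
        ]
        y = [1, 1, 1, 0, 0, 0]  # 1 = system-related command (context switch)

        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        scores = cross_val_score(clf, X, y, cv=3)  # 5-fold in the disclosure
        print(scores.mean())
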
  • The first security analysis of popular VPA ecosystems and their vulnerability to two new attacks, VSA and VMA, is described herein; through these attacks, a remote adversary could impersonate VPA systems or other skills to steal private user information. These attacks are found to pose a realistic threat to VPA IoT systems, as evidenced by a series of user studies and real-world attacks described above.
  • the present disclosure provides a skill-name scanner which was run against the AMAZON and GOOGLE skill markets, leading to the discovery of a large number of at-risk ALEXA skills and problematic skill names already published, indicating that such attacks may already be happening to tens of millions of VPA users.
  • a context-sensitive detector is provided herein to mitigate the voice masquerading threat, achieving a 95% precision.
  • FIG. 3 shows an exemplary voice-based attack detection process 200 using detection unit 100.
  • in step 202, utterance paraphrasing detection unit 106 detects one or more variations of a competitive invocation name (CIN) of a first voice-activated skill or application unselected by user 10 of a voice-controlled system, such as VPA device 12.
  • in step 204, pronunciation comparison unit 108 calculates a pronunciation similarity between the CIN and an invocation name of a second voice-activated skill or application selected by user 10 of the voice-controlled system.
  • in step 206, VSA scanner 102 generates an alarm for user 10 of the voice-controlled system in response to an anomaly indicated by either utterance paraphrasing detection unit 106 or pronunciation comparison unit 108.
  • step 206 may include a return to step 202, and any of steps 202-206 can be repeated as desired.
  • FIG. 4 shows another exemplary voice-based attack detection process 300 using detection unit 100.
  • in step 302, skill response checker 110 detects one or more suspicious skill responses of a first voice-activated skill or application that was not selected by user 10 of the voice-controlled system, such as VPA device 12.
  • in step 304, user intention classifier 112 analyzes an utterance from user 10 of the voice-controlled system to identify a user intention.
  • in step 306, VMA scanner 104 generates an alarm for user 10 of the voice-controlled system in response to an anomaly indicated by either skill response checker 110 or user intention classifier 112.
  • step 306 may include a return to step 302, and any of steps 302-306 can be repeated as desired.
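  • A brief end-to-end sketch of process 300 follows, combining the SRC and UIC stand-ins from the earlier sketches; demo_src and demo_uic are hypothetical placeholders for skill response checker 110 and user intention classifier 112:

        # Hedged sketch of process 300: alarm when either component flags an
        # anomaly, otherwise hand the utterance to the running skill.
        def process_300(skill_response, user_utterance, src_check, uic_classify):
            response_suspicious = src_check(skill_response)        # step 302
            context_switch_intended = uic_classify(user_utterance) # step 304
            if response_suspicious or context_switch_intended:     # step 306
                return "ALARM: possible voice masquerading attack"
            return "OK: hand utterance to the running skill"

        def demo_src(response):            # stand-in for SRC 110
            return "find that skill" in response

        def demo_uic(utterance):           # stand-in for UIC 112
            return utterance in ("stop", "open another skill")

        print(process_300("sorry i could not find that skill", "stop",
                          demo_src, demo_uic))  # -> ALARM
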
  • the term “unit” or “module” refers to, is part of, or includes an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor or microprocessor (shared, dedicated, or group) and/or memory (shared, dedicated, or group) that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality.
  • ASIC Application Specific Integrated Circuit
  • processor or microprocessor shared, dedicated, or group
  • memory shared, dedicated, or group
  • references to “one embodiment,” “an embodiment,” “an example embodiment,” etc. indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art, with the benefit of the present disclosure, to effect such feature, structure, or characteristic in connection with other embodiments, whether or not explicitly described. After reading the description, it will be apparent to one skilled in the relevant art(s) how to implement the disclosure in alternative embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Security & Cryptography (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A method is provided of defending against a voice masquerading attack in a voice-controlled system, comprising: using a skill response checker (110) to capture suspicious skill responses; using a user intention classifier (112) to analyze utterances from a user to identify a user intention; and integrating the skill response checker (110) and the user intention classifier (112) to provide an alarm to the user in response to an anomaly indicated by either the skill response checker (110) or the user intention classifier (112).

Description

METHOD AND SYSTEM FOR ENHANCING SECURITY
IN A VOICE-CONTROLLED SYSTEM
RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Application S/N
62/638,625, entitled “METHOD AND SYSTEM FOR ENHANCING SECURITY IN A VOICE-CONTROLLED SYSTEM,” filed on March 5, 2018, the entire disclosure of which is expressly incorporated herein by reference.
GOVERNMENT SUPPORT CLAUSE
[0002] This invention was made with government support under W911NF-16-1-0127 awarded by the US Army Research Office and CNS1618493 awarded by the National Science Foundation. The government has certain rights in the invention.
TECHNICAL FIELD
[0003] The present disclosure relates generally to voice-controlled systems and more particularly to methods and systems for enhancing security in voice-controlled systems.
BACKGROUND
[0004] Virtual personal assistants (VPAs) (e.g., AMAZON ALEXA and GOOGLE ASSISTANT) today rely mostly on the voice channel to communicate with their users, which, however, is known to be vulnerable, lacking proper authentication. The rapid growth of VPA skill markets opens a new attack avenue, potentially allowing a remote adversary to publish attack skills to attack a large number of VPA users through popular IoT devices such as AMAZON ECHO and GOOGLE HOME. Such remote, large-scale attacks are indeed realistic. Two new attacks are described herein: (1) voice squatting, in which the adversary exploits the way a skill (a term, such as an AMAZON term, for third-party applications, which will be used herein) is invoked (e.g., “open CAPITAL ONE”), using a malicious skill with a similarly pronounced name (e.g., “capital won”) or a paraphrased name (e.g., “CAPITAL ONE please”) to hijack the voice command meant for a different skill, and (2) voice masquerading, in which a malicious skill impersonates the VPA service or a legitimate skill to steal the user’s data or eavesdrop on her conversations. These attacks exploit the way VPAs work or the user’s misconceptions about their functionalities, and are found to pose a realistic threat by experiments (including user studies and real-world deployments) on AMAZON ECHO and GOOGLE HOME.
[0005] The wave of the Internet of Things (IoT) has brought in a new type of virtual personal assistant (VPA) service. Such a service is typically delivered through a smart speaker that interacts with the user using a voice user interface (VUI), allowing her to command the system with her voice only: for example, one can say “what will the weather be like tomorrow?” or “set an alarm for 7 am tomorrow,” etc., to get the answer or execute corresponding tasks on the system. In addition to their built-in functionalities, VPA services are enhanced by ecosystems fostered by their providers, such as AMAZON and GOOGLE, under which third-party developers can build new applications (called skills by AMAZON and actions by GOOGLE) to offer further assistance to the end users, for example, to order food, manage bank accounts, and text friends. Recently, these ecosystems have been expanding at a breathtaking pace: AMAZON claims that already 25,000 skills have been uploaded to its skill market to support its VPA system (including the ALEXA VPA service running through AMAZON ECHO), and GOOGLE also has more than one thousand actions available on its market for its GOOGLE HOME system (powered by GOOGLE ASSISTANT). These systems have already been deployed to households around the world and utilized by tens of millions of users. This quickly-gained popularity, however, could bring in new security and privacy risks, whose implications have not been adequately investigated so far.
[0006] Security risks in VPA voice control. As mentioned earlier, today’s VPA systems are designed to be primarily commanded by voice. Protecting such VUIs is fundamentally challenging, due to the lack of effective means to authenticate the parties involved in the open and noisy voice channel. Prior research shows that the adversary can generate obfuscated voice commands or even completely inaudible ultrasound to attack speech recognition systems. These attacks exploit unprotected communication to impersonate the user to the voice-controlled system, under the constraint that an attack device is placed close to the target (e.g., in the ultrasound attack, within 1.75 meters).
[0007] The emergence of the VPA ecosystem completely changes the game, potentially opening new avenues for remote attacks. Through the skill market, an adversary can spread malicious code, which will be silently invoked by voice commands received by a VPA device (e.g., AMAZON ECHO or GOOGLE HOME). As a result, the adversary gains (potentially large-scale) access to the VPA devices interacting with victims, allowing him to impersonate a legitimate application or even the VPA service to them. Again, the attack is made possible by the absence of effective authentication between the user and the VPA service over the voice channel.
[0008] Voice-based remote attacks. The most popular VPA IoT systems, ALEXA and GOOGLE ASSISTANT, were analyzed, focusing on the third-party skills deployed to these devices for interacting with end users over the voice channel. Through publishing malicious skills, it is completely feasible for an adversary to remotely attack the users of these popular systems, collecting their private information through their conversations with the systems. More specifically, and as mentioned above, two new threats are the voice squatting attack (VSA) and the voice masquerading attack (VMA). In a VSA, the adversary exploits how a skill is invoked (by a voice command), and the variations in the ways the command is spoken (e.g., phonetic differences caused by accent, courteous expression, etc.), to cause a VPA system to trigger a malicious skill instead of the one the user intends, as described below. For example, one may say “ALEXA, open CAPITAL ONE please”, which normally opens the skill “CAPITAL ONE”, but can trigger a malicious skill “CAPITAL ONE Please” once it is uploaded to the skill market. A VMA aims at the interactions between the user and the VPA system, which is designed to hand over all voice commands to the currently running skill, including those supposed to be processed by the VPA system, like stopping the current skill and switching to a new one. In response to the commands, a malicious skill can pretend to yield control to another skill (switch) or the service (stop), yet continue to operate stealthily to impersonate these targets and get sensitive information from the user, as described below.
[0009] A survey of 156 AMAZON ECHO and GOOGLE HOME users showed that most of them tend to use natural languages with diverse expressions to interact with the devices: e.g., “play some sleep sounds”. These expressions allow the adversary to mislead the service and launch a wrong skill in response to the user’s voice command, such as “some sleep sounds” instead of “sleep sounds.” Both ALEXA and GOOGLE ASSISTANT identify the skill to invoke by looking for the longest string matched from a voice command, as described below. Also, ALEXA and GOOGLE ASSISTANT cannot accurately recognize some skills’ invocation names, and malicious skills carrying similar names (in terms of pronunciation) are capable of hijacking the brands of these vulnerable skills.
SUMMARY
[0010] The present disclosure provides a suite of new techniques to mitigate the realistic threats posed by VSA and VMA. A skill-name scanner is described that converts the invocation name string of a skill into a phonetic expression specified by ARPABET. This expression describes how a name is pronounced, allowing measurement of the phonetic distance between different skill names. Those sounding similar or having a subset relation are automatically detected by the scanner. This technique can be used to vet the skills uploaded to a market. When applied to all 19,670 custom skills on the AMAZON market, 4,718 skills were identified as having squatting risks.
[0011] To counter the threat of the masquerading attack, a novel context-sensitive detector is provided to help a VPA service capture the commands for system-level operations (e.g., invoke a skill) and the voice content unrelated to a skill’s functionalities, which therefore should not be given to a skill, as described below. Specifically, the detection scheme consists of two components: the Skill Response Checker (SRC) and the User Intention Classifier (UIC). SRC captures suspicious skill responses that a malicious skill may craft, such as a fake skill recommendation mimicking that from the VPA system. UIC instead examines the information flow in the opposite direction, i.e., utterances from the user, to accurately identify users’ intents of context switches. Built upon robust Natural Language Processing (NLP) and machine learning techniques, SRC and UIC form two lines of defense against the masquerading attack, based on extensive empirical evaluations.
[0012] In one embodiment of the present disclosure, a method of defending against a voice-based attack in a voice-controlled system is disclosed. The method includes detecting one or more variations of a competitive invocation name (CIN) of a first voice-activated skill unselected by a user of the voice-controlled system using an utterance paraphrasing detection unit, calculating a pronunciation similarity between the CIN and an invocation name of a second voice-activated skill selected by the user of the voice-controlled system using a pronunciation comparison unit, and generating an alarm for the user of the voice-controlled system in response to an anomaly indicated by either the utterance paraphrasing detection unit or the pronunciation comparison unit.
[0013] In one example, the method further includes detecting the one or more variations of the CIN of the first voice-activated skill based on a phrase change of a syntactic structure associated with the CIN of the first voice-activated skill to detect the anomaly.
[0014] In another example, the method further includes detecting the one or more variations of the CIN of the first voice-activated skill based on a semantic consistency associated with the CIN of the first voice-activated skill to detect the anomaly.
[0015] In yet another example, the method further includes recognizing the one or more variations of the CIN of the first voice-activated skill using at least one prefix term added to a voice-activated command associated with the invocation name of the second voice-activated skill to detect the anomaly.
[0016] In still another example, the method further includes recognizing the one or more variations of the CIN of the first voice-activated skill using at least one suffix term added to a voice-activated command associated with the invocation name of the second voice-activated skill to detect the anomaly.
[0017] In yet still another example, the method further includes converting at least one term in the invocation name of the second voice-activated skill into a converted term in a phonemic presentation using a phoneme code. In a variation, wherein converting the at least one term in the invocation name of the second voice-activated skill into the phonemic presentation comprises measuring, to detect the anomaly, a pronunciation similarity between the CIN of the first voice-activated skill and the invocation name of the second voice-activated skill using the phonemic presentation. In another variation, wherein measuring the pronunciation similarity comprises using a weighted cost matrix of an edit distance between the at least one term in the invocation name of the second voice-activated skill and the converted term in the phonemic presentation. In yet another variation, the method further includes comparing the converted term in the phonemic presentation with the invocation name of the second voice-activated skill to determine the pronunciation similarity.
[0018] In another embodiment of the present disclosure, a method of defending against a voice-based attack in a voice-controlled system is disclosed. The method includes detecting one or more suspicious skill responses of a first voice-activated skill unselected by a user of the voice-controlled system using a skill response checker, analyzing an utterance from the user of the voice-controlled system to identify a user intention using a user intention classifier, and generating an alarm for the user of the voice-controlled system in response to an anomaly indicated by either the skill response checker or the user intention classifier.
[0019] In one example, the method further includes generating a blacklist of one or more responses that the voice-controlled system marks as suspicious including at least one marked utterance. In a variation, the method further includes comparing the at least one marked utterance in the blacklist with the utterance of the user of the voice- controlled system to detect the anomaly. In another variation, wherein comparing the at least one marked utterance in the blacklist with the utterance of the user comprises using a fuzzy matching technique through a semantic analysis on a content of the one or more skill responses in the blacklist.
[0020] In another example, the method further includes calculating a sentence relevance of the one or more skill responses in the blacklist based on a similarity value associated with the utterance from the user. In a variation, the method further includes detecting the anomaly when the similarity value is greater than a predetermined threshold.
[0021] In yet another example, the method further includes identifying the user intention by semantically relating the first voice-activated skill to one or more voice commands of the user of the voice-controlled system. In a variation, wherein identifying the user intention comprises comparing semantics of the utterance from the user to a first context of one or more system commands associated with the voice-controlled system, and a second context of the first voice-activated skill. In another variation, the method further includes calculating a sentence relevance of the one or more system commands based on a similarity value associated with the utterance from the user. In yet another variation, the method further includes detecting the anomaly when the similarity value is greater than a predetermined threshold.
[0022] In still another example, the method further includes characterizing a relationship between the utterance from the user and the first voice-activated skill based on a prior communication with the user and at least one functionality of the first voice-activated skill. In a variation, characterizing the relationship between the utterance from the user and the first voice-activated skill comprises calculating at least one of: a sentence relevance between the utterance from the user and a response of the first voice-activated skill prior to the utterance from the user; a sentence relevance between the utterance from the user and one or more description sentences associated with the first voice-activated skill; and an average sentence relevance between the utterance from the user and the one or more description sentences.
[0023] While multiple embodiments are disclosed, still other embodiments of the present invention will become apparent to those skilled in the art from the following detailed description, which shows and describes illustrative embodiments of the invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS
[0024] The above-mentioned and other features of this disclosure and the manner of obtaining them will become more apparent and the disclosure itself will be better understood by reference to the following description of embodiments of the present disclosure taken in conjunction with the accompanying drawings, wherein:
[0025] FIG. 1 is a conceptual diagram of the infrastructure of a virtual personal assistant system using a detection unit in accordance with embodiments of the present disclosure;
[0026] FIG. 2 is a screenshot of a card from a virtual personal assistant card system;
[0027] FIG. 3 is a flow chart of an exemplary use of the detection unit of FIG. 1 in accordance with embodiments of the present disclosure; and
[0028] FIG. 4 is a flow chart of another exemplary use of the detection unit of FIG. 1 in accordance with embodiments of the present disclosure.
[0029] While the present disclosure is amenable to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and are described in detail below. The intention, however, is not to limit the present disclosure to the particular embodiments described. On the contrary, the present disclosure is intended to cover all modifications, equivalents, and alternatives falling within the scope of the present disclosure as defined by the appended claims.
DETAILED DESCRIPTION
[0030] The embodiments disclosed below are not intended to be exhaustive or to limit the invention to the precise forms disclosed in the following detailed description. Rather, the embodiments are chosen and described so that others skilled in the art may utilize their teachings.
Virtual Personal Assistant Systems
[0031] VPA on IoT devices. AMAZON and GOOGLE are two major players in the market of smart speakers with voice-controlled personal assistant capabilities. Since the debut of the first AMAZON ECHO in 2015, AMAZON has taken 76% of the U.S. market, with an estimated 15 million devices sold in the U.S. alone in 2017. In the meantime, GOOGLE released GOOGLE HOME in 2016 and has now captured the remaining 24% market share. AMAZON ECHO DOT and GOOGLE HOME MINI were later released in 2016 and 2017, respectively, as small, affordable alternatives to their more expensive counterparts. Additionally, AMAZON has integrated its VPA into IoT products from other vendors, e.g., the SONOS smart speaker and the ECOBEE thermostat.
[0032] A unique property of these four devices is that they all forgo conventional I/O interfaces, such as the touchscreen, and have fewer buttons (to adjust volume or mute), which serves to offer the user a hands-free experience. In other words, one is supposed to command the device mostly by speaking to it. For this purpose, the device is equipped with a circular microphone array designed for 360-degree audio pickup and other technologies, such as beamforming, that enable far-field voice recognition. Such a design allows the user to talk to the device anywhere inside a room and still get a quick response.
[0033] Capabilities. Behind these smart devices is a virtual personal assistant, called ALEXA for AMAZON and GOOGLE ASSISTANT for GOOGLE, which engages users through two-way conversations. Unlike those serving a smartphone (SIRI, for example) that can be activated by a button push, the VPAs for these IoT devices are started with a wake-word like "ALEXA" or "Hey GOOGLE". These assistants have a range of capabilities, from weather reports, timer setting, and to-do list maintenance to voice shopping, hands-free messaging, and calling. The user can manage these capabilities through a companion app running on her smartphone.
VPA Skills and Ecosystem
[0034] Both AMAZON and GOOGLE enrich the VPAs' capabilities by introducing voice assistant skills (or actions on GOOGLE). Skills are essentially third-party apps, like those running on smartphones, offering a variety of services the VPA itself does not provide. Examples include AMEX, HANDSFREE CALLING, NEST THERMOSTAT and WALMART. These skills can be conveniently developed with support from AMAZON and GOOGLE, using the ALEXA SKILLS KIT and ACTIONS on GOOGLE. Indeed, as of November 2017, ALEXA already had 23,758 skills and GOOGLE ASSISTANT had 1,001. More importantly, new skills have continuously been added to the market, with their total numbers growing at a rate of 8% for ALEXA and 42% for GOOGLE ASSISTANT, as measured over a 45-day period.
[0035] Skill invocation. Skills can be started either explicitly or implicitly. Explicit invocation takes place when a user requests a skill by its name from a VPA: for example, saying "ALEXA, talk to AMEX" to ALEXA triggers the AMEX skill for making a payment or checking bank account balances. Such skills are also called custom skills on ALEXA.
[0036] Implicit invocation occurs when a user tells the voice assistant to perform some task without directly calling a skill name. For example, "Hey GOOGLE, will it rain tomorrow?" will invoke the Weather skill to respond with a weather forecast. GOOGLE ASSISTANT identifies and activates a skill implicitly whenever the conversation with the user is under a context deemed appropriate for the skill. This invocation mode is also supported by ALEXA for specific types of skills.
[0037] Skill interaction model. The VPA communicates with its users based upon an interaction model, which defines a loose protocol for the communication. Using the model, the VPA can interpret each voice request, translating it to the command that can be handled by the VPA or a skill.
[0038] Specifically, to invoke a skill explicitly, the user is expected to use a wake-word, a trigger phrase and the skill's invocation name. For example, for the spoken sentence "Hey GOOGLE, talk to personal chef", "Hey GOOGLE" is the wake-word, "talk to" is the trigger phrase, and "personal chef" is the skill invocation name. Here, the trigger phrase is given by the VPA system, which often includes common terms for skill invocation like "open", "ask", "tell", "start", etc. Note that the skill invocation name could be different from the skill name; it is intended to make the skill simpler and easier for users to pronounce. For example, "The Dog Feeder" has the invocation name "the dog"; "Scryb" has the invocation name "scribe".
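For illustration, the following sketch shows how such an invocation utterance might be decomposed into its three parts. The trigger-phrase list and parsing logic are assumptions made for this example, not the actual VPA implementation:

# Hypothetical decomposition of an explicit invocation utterance into
# wake-word, trigger phrase, and skill invocation name.
WAKE_WORDS = ["alexa", "hey google"]
TRIGGER_PHRASES = ["talk to", "open", "ask", "tell", "start"]  # common triggers

def parse_invocation(utterance: str):
    """Return (wake_word, trigger_phrase, invocation_name), or None."""
    text = utterance.lower().strip()
    for wake in WAKE_WORDS:
        if text.startswith(wake):
            rest = text[len(wake):].lstrip(" ,")
            for trigger in TRIGGER_PHRASES:
                if rest.startswith(trigger + " "):
                    return wake, trigger, rest[len(trigger) + 1:]
    return None

print(parse_invocation("Hey GOOGLE, talk to personal chef"))
# ('hey google', 'talk to', 'personal chef')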
[0039] When developing a skill, one needs to define intents and sample utterances to map the user's voice inputs to the various interfaces of the skill that take the actions the user expects. Such an interface is described by an intent. To link a sentence to an intent, the developer specifies sample utterances, which are essentially a set of sentence templates describing the possible ways the user may talk to the skill. There are also some built-in intents within the model, like WELCOMEINTENT, HELPINTENT, STOPINTENT, etc., which already define many common sample utterances. The developer can add more intents or simply specify one default intent, in which case all user requests will be mapped to this intent.
[0040] Skill service and the VPA ecosystem. Referring now to FIG. 1, a third-party skill is essentially a web service hosted by its developer, with its name registered with the VPA service provider (AMAZON or GOOGLE). When a user 10 invokes a VPA device 12 with its wake-word, the device 12 captures her voice command and sends it to a VPA service provider's cloud 14 for processing. The cloud 14 performs speech recognition to translate the voice record into text, finds the skill to be invoked, and then delivers the text, together with the timestamp, device status, and other meta-data, as a request to the skill's web service 16. In response to the request, the service 16 returns a response whose text content, either in plaintext or in the format of Speech Synthesis Markup Language (SSML), is converted to speech by the cloud 14 and played to the user 10 through the device 12. SSML also allows the skill to attach audio files (such as MP3) to the response, which is supported by both AMAZON and GOOGLE.
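As a rough illustration of this exchange, a skill's web service might return a payload of the following shape. The structure is modeled loosely on the ALEXA response format; the field names and URL are simplified assumptions for this sketch, not an exact API reference:

# Hypothetical skill response payload, expressed as a Python dict.
skill_response = {
    "version": "1.0",
    "response": {
        "outputSpeech": {
            "type": "SSML",
            # SSML can mix synthesized speech with an attached audio file
            "ssml": "<speak>Here is your forecast."
                    "<audio src='https://example.com/chime.mp3'/></speak>",
        },
        "shouldEndSession": False,  # keep the session open for a follow-up
    },
}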
[0041] Both AMAZON and GOOGLE have skill markets to publish third-party skills. To publish a skill, the developer needs to submit information about her skill, such as its name, invocation name, description and the endpoint where the skill is hosted, for a certification process. This process aims at ensuring that the skill is functional and meets the VPA provider's security requirements and policy guidelines.
[0042] Once a skill is published, users can simply activate it by calling its invocation name. Note that unlike smartphone apps or website plugins that need to be installed by users explicitly, skills can be automatically discovered (according to the user's voice command) and transparently launched directly through IoT devices.
[0043] Adversary Model

[0044] An adversary may aim to launch a large-scale remote attack on VPA users through publishing malicious skills. Such skills can be transparently invoked by the victim through voice commands, without being downloaded and installed on the victim's device. Therefore, they can easily affect a large number of VPA IoT devices. For this purpose, assume that the adversary has the capability to build the skill and upload it to the market. To mitigate this threat, a protection scheme needs to be adopted by the VPA provider for vetting submitted skills and evaluating the voice commands received. This requires that the VPA service itself is trusted.
[0045] Exploiting VPA Voice Control
[0046] Analysis of VPA Voice Control
[0047] Security risks of rogue skills. As mentioned earlier, VPA skills are launched transparently when a user speaks their invocation names (which can be different from their names displayed on the skill market). Surprisingly, for AMAZON, such names are not unique skill identifiers: multiple skills with the same invocation name are on the AMAZON market. Also, skills may have similar or related names. For example, 66 different ALEXA skills are called cat facts, 5 are called cat fact, and 11 have invocation names containing the string "cat fact", e.g., fun cat facts and funny cat facts. When such a common name is spoken, ALEXA chooses one of the skills based on some undisclosed, possibly random policies. When a different but similar name is called, however, longest string match is used to find the skill. For example, "Tell me funny cat facts" will trigger funny cat facts rather than cat facts. This problem is less serious for GOOGLE, which does not allow duplicated invocation names. However, it also cannot handle similar names. Some invocation names cannot be effectively recognized by the speech recognition systems of AMAZON and GOOGLE. As a result, even a skill with a different name can be mistakenly invoked when its name is pronounced similarly to that of the intended one.
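A toy sketch of the longest-string-match behavior described above follows; this is an assumption inferred from the observed routing, not AMAZON's actual implementation:

# Among registered invocation names appearing in the utterance,
# the longest match wins.
REGISTERED = ["cat facts", "fun cat facts", "funny cat facts"]

def route(utterance: str):
    matches = [name for name in REGISTERED if name in utterance.lower()]
    return max(matches, key=len) if matches else None

print(route("Tell me funny cat facts"))  # 'funny cat facts', not 'cat facts'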
[0048] Also, the designs of these VPA systems fail to take into full account their users' perceptions of how the systems work. Particularly, both ALEXA and GOOGLE ASSISTANT run their skills in a simple operation mode in which only one skill executes at a time, and it needs to stop before another skill can be launched. However, such a design is not user-friendly, and there is no evidence that users understand that convenient context switch is not supported by these systems.
[0049] Further, both ALEXA and GOOGLE ASSISTANT support voluntary skill termination. For ALEXA, the termination command "Stop" is delivered to the skill, which is supposed to stop itself accordingly. For GOOGLE ASSISTANT, though the user can explicitly terminate a skill by saying "Stop", oftentimes the skill is supposed to stop running once its task is accomplished (e.g., reporting the current weather). Normally, there is no strong indication whether a skill has indeed stopped. AMAZON ECHO and GOOGLE HOME have light indicators, which light up when the devices are speaking and listening. However, they could be ignored by the user, particularly when she is not looking at the devices or her sight is blocked when talking.
[0050] Survey study. To understand user behaviors and perceptions of voice-controlled VPA systems, which could expose the users to security risks, AMAZON ECHO and GOOGLE HOME users were surveyed, focusing on the following questions (sample survey questions are included as Appendix A):
What would you say when invoking a skill?
Have you ever invoked a wrong skill?
Did you try context switch when talking to a skill?
Have you experienced any problem closing a skill? How do you know whether a skill has stopped?
[0051] Using AMAZON MECHANICAL TURK, adult participants who own AMAZON ECHO or GOOGLE HOME devices and have used skills before were surveyed. To ensure that all participants met the requirements, they were asked to describe several skills and their interactions with the skills, and those whose answers were deemed irrelevant were removed. In total, 105 valid responses were collected from AMAZON ECHO users and 51 valid responses from GOOGLE HOME users, among which the average age is 37 years; 46% are female and 54% are male. On average, each participant reported owning 1.5 devices and using 5.8 skills per week.
[0052] In the first part of the survey, the focus was on how users invoke a skill. For this purpose, two popular skills, "Sleep Sounds" and "Cat Facts" ("Facts about Sloths" on GOOGLE HOME), were used; the participants were allowed to choose the invocation utterances they tend to use for launching these skills (e.g., "open Sleep Sounds please") and were asked to provide additional examples. The participants were then asked whether they ever triggered a wrong skill. The survey also sought to determine whether the participants attempted to switch context when interacting with a skill, that is, invoking a different skill or directly talking to the VPA service (e.g., adjusting volume). The last part of the survey was designed to study the user experience in stopping the current skill, including the termination utterances they tend to use, troubles they encountered during the termination process and, importantly, the indicator they used to determine whether the skill has stopped.
[0053] Table 1 summarizes the responses from both AMAZON ECHO and GOOGLE HOME users. The results show that more than 85% of them tend to use natural utterances to open a skill (e.g., "open Sleep Sounds please"), instead of the standard one (like "open Sleep Sounds"). This indicates that it is completely realistic for the user to launch a wrong skill whose name is better matched to the utterances than that of the intended skill (e.g., Sleep Sounds). Indeed, 28% of users reported that they did open unintended skills when talking to their devices.
[Table 1: summary of survey responses from AMAZON ECHO and GOOGLE HOME users; not reproduced in this text extraction.]
[0054] Also, the survey showed that nearly half of the participants tried to switch to another skill or to the VPA service (e.g., adjusting volume) when interacting with a skill. Such an attempt failed, since context switch is supported by neither ALEXA nor GOOGLE ASSISTANT. However, it is imaginable that a malicious skill receiving such voice commands could take advantage of this opportunity to impersonate the skill the user wants to run, or even the VPA service (e.g., cheating the user into disclosing personal information for executing commands). Finally, 30% of the participants were found to have experienced trouble with skill termination, and 78% did not pay attention to the light indicators on the devices. Again, the study demonstrates the feasibility of a malicious skill faking its termination and stealthily collecting the user's information.
[0055] The manner in which the adversary can exploit the gap between the user perception and the real operations of the system to launch voice squatting and masquerading attacks is now described.
[Table 2: invocation names of the registered attack skills and their target skills; not reproduced in this text extraction.]
[0056] Voice Squatting
[0057] Invocation confusion. As mentioned earlier, a skill is triggered by its invocation name, which is supposed to be unambiguous and easy for the devices to recognize. Both AMAZON and GOOGLE suggest that skill developers test invocation names and ensure that their skills can be launched with a high success rate. However, an adversary can intentionally induce confusion by using the same or a similar name as a target skill, to trick the user into invoking an attack skill when trying to open the target. For example, an adversary who aims at CAPITAL ONE could register a skill Capital Won, Capitol One, or Captain One. All such names, when spoken by the user, could become less distinguishable, particularly in the presence of noise, due to the limitations of today's speech recognition techniques.
[0058] Also, this voice squatting attack can easily exploit the longest string match strategy of today's VPAs, as mentioned earlier. Based on the above-described user survey, around 60% of ALEXA and GOOGLE HOME users have used the word "please" when launching a skill, and 26% of them attach "my" before the skill's invocation name. So, the adversary can register skills like CAPITAL ONE Please to hijack the invocation command meant for CAPITAL ONE.
[0059] Note that to make it less suspicious, homophones or words pronounced similarly can be used here, e.g., CAPITAL ONE Police. Again, this approach defeats GOOGLE's skill vetting, allowing the adversary to publish a skill with an invocation name unique in spelling but still confusable (with a different skill) in pronunciation.
[0060] To find out whether such squatting attacks can evade skill vetting, five skills were registered with AMAZON and one with GOOGLE. These skills' invocation names and the targets' names are shown in Table 2. All these skills passed AMAZON's and GOOGLE's vetting processes, which suggests that such attack skills can realistically be deployed.
[0061] Consequences. Through voice squatting, the attack skill can impersonate another skill and fake its VUI to collect the private information the user only shares with the target skill. Some AMAZON and GOOGLE skills request private information from the user to do their jobs. For example, FIND MY PHONE asks for a phone number; TRANSIT HELPER asks for a home address; DAILY CUTIEMALS seeks the user's email address. These skills, once impersonated, could cause serious information leaks to untrusted parties.
[0062] For AMAZON ALEXA, a falsely invoked skill can perform a Phishing attack on the user by leveraging the VPA's card system. ALEXA allows a running skill to include a home card in its response to the user, which is displayed through AMAZON's companion app on a smartphone or web browser, to describe or enhance ongoing voice interactions. As an example, FIG. 2 shows a card 18 from "TRANSIT HELPER." Such a card can be used by the attack skill to deliver false information to the user: e.g., a fake customer contact number or website address, when impersonating a reputable skill, such as CAPITAL ONE. This can serve as the first step of a Phishing attack, which can ultimately lead to the disclosure of sensitive user data. For example, the adversary could send the user an account expiration notification, together with a renewal link, to cheat the user out of her account credentials.
[0063] Another potential risk of the VSA is defamation: the poor performance of the attack skill could cause the user to blame the legitimate one it impersonates. This could result in bad reviews, giving the legitimate skill’s competitors an advantage.
[0064] Evaluation methodology. An investigation was conducted into how realistic a squatting attack would be on today's VPA IoT systems. For this purpose, two types of the attack were evaluated: voice squatting, in which an attack skill carries a phonetically similar invocation name to that of its target skill, and word squatting, where the attack invocation name includes the target's name and some strategically selected additional words (e.g., "cat facts please"). To find out whether these attacks work on real systems, a set of experiments was conducted as described below.
[0065] To study voice squatting, 100 skills each from the markets of ALEXA and GOOGLE ASSISTANT were randomly sampled, and AMAZON's and GOOGLE's Text-to-Speech (TTS) services, along with human voices, were used to pronounce their skill names to their VPA devices, so as to understand how effectively the VPAs can recognize these names. The idea was to identify those continuously misrecognized by the VPAs, and then strategically register phonetically similar names for the attack skills. Such names were selected using the text outputs produced by AMAZON's and GOOGLE's speech recognition services when the vulnerable (hard to recognize) names were spoken. To this end, a skill was built to receive voice commands. The skill was invoked before voice commands were played, which were converted into text by the recognition services and handed over to the skill.
[0066] The voice commands used were produced by either human subjects or AMAZON's and GOOGLE's TTS services (both claiming to generate natural and human-like voices). Some of these commands included the term "open" in front of an invocation name, forming an invocation utterance. For each of the 100 skills, 20 voice commands were recorded from each TTS service (ten invocation names only and ten invocation utterances) and two commands (invocation utterances) from each of five participants of the survey study.
[0067] As mentioned earlier, the text outputs of misrecognized invocation names were used to name the attack skills. Such skills were evaluated in the test modes of ALEXA and GOOGLE ASSISTANT. The five attack skills submitted to these markets are described below to demonstrate that the vetting protection is not effective.
[0068] To study word squatting, ten skills from each skill market were randomly sampled as the attack targets. For each skill, four new skills were built whose invocation names include the target's name together with the terms identified from the survey study described herein: for example, "cat facts please" and "my cat facts". In the experiment, these names were converted into voice commands using TTS and played to the VPA devices (e.g., "ALEXA, open cat facts please"), which permits a determination of whether the attack skills can indeed be triggered. Note that the scale of this study is limited by the time it takes to upload attack invocation names to the VPA's cloud. Nevertheless, the findings provide evidence for the real-world implications of the attack.
[0069] Experiment results. Five participants were used for the experiments, and each recorded 400 invocation commands. All the participants were fluent in English and, among them, four are native speakers. When using the TTS services, a MACBOOK PRO served as the sound source. The voice commands from the participants and the TTS services were played to an AMAZON ECHO DOT and a GOOGLE HOME MINI, with the devices placed one foot away from the sound source. The experiments were conducted in a quiet meeting room.
[0070] Table 3 below summarizes the results of the experiment on voice squatting. As shown, the voice commands with invocation names often cannot be accurately recognized: e.g., ALEXA correctly identified fewer than 50% of the utterances (the voice commands) produced by AMAZON TTS. On the other hand, an invocation utterance (including the term "open") worked much better, with the recognition rate rising to 75% for ALEXA (under AMAZON TTS). Overall, for the voice commands generated by both AMAZON's and GOOGLE's TTS services, ALEXA made more errors (30%) than GOOGLE ASSISTANT (9%). As mentioned earlier, the results of such misrecognition, for the invocation names that these VPAs consistently could not get right, were utilized in the research described herein to register attack skills' names. For example, the skill "entrematic opener" was recognized by GOOGLE as "intra Matic opener", which was then used as the name for a malicious skill. In this way, 17 such vulnerable ALEXA skills were identified under both AMAZON's and GOOGLE's TTS, and 7 GOOGLE skills under AMAZON TTS and 4 under GOOGLE TTS. When attacking these skills, the study showed that half of the malicious skills were triggered by the voice commands meant for these target skills every time: e.g., "Florida state quiz" hijacked the call to "Florida snake quiz"; "read your app" was run when invoking "rent Europe".
[Table 3: voice squatting experiment results; not reproduced in this text extraction.]
[0071] This attack turned out to be more effective on the voice commands spoken by humans. For each participant, on average 31 (out of 100) of the ALEXA skills and 6 of the GOOGLE ASSISTANT skills she spoke were recognized incorrectly. Although in normal situations correct skills can still be identified despite the misrecognition, in the attacks, over 50% of the malicious skills were mistakenly launched every time, as observed in the experiments on five randomly sampled vulnerable target skills for each participant.
[0072] Table 4 summarizes the results of the experiments on the word squatting attack. On ALEXA, a malicious skill with the extended name (that is, the target skill's invocation name together with the terms "please", "app", "my" and "the") was almost always launched by the voice commands involving these terms and the target names. On GOOGLE ASSISTANT, however, only the utterance with the word "app" succeeded in triggering the corresponding malicious skill, which demonstrates that GOOGLE ASSISTANT is more robust against such an attack. However, when "my" was replaced with "mai" and "please" with "plese", all such malicious skills were successfully invoked by the commands for their target skills (see Table 4). This indicates that the protection GOOGLE puts in place (filtering out those with suspicious terms) can be easily circumvented.
[Table 4: word squatting experiment results; not reproduced in this text extraction.]
Voice Masquerading
[0073] Unawareness of a VPA system's capabilities and behaviors could subject users to voice masquerading attacks. Two such attacks are described below that impersonate the VPA systems or other skills to cheat users into giving away private information or to eavesdrop on the user's conversations.
[0074] In-communication skill switch. Given some users' perception that the VPA system supports skill switch during interactions, a running skill can pretend to hand over control to the target skill in response to a switch command, so as to impersonate that skill. As a result, sensitive user information that is only supposed to be shared with the target skill could be exposed to the attack skill. This masquerading attack is opportunistic. However, the threat is realistic. Also, the adversary can always use the attack skill to impersonate as many legitimate skills as possible, to raise the odds of a successful attack. GOOGLE ASSISTANT seems to have protection in place against the impersonation. Specifically, it signals the launch of a skill by speaking "Sure, here is," together with the skill name and a special earcon, and skill termination with another earcon. Further, the VPA talks to the user in a distinctive accent to differentiate itself from skills. This protection, however, can be easily defeated. The signal sentence with the earcons was pre-recorded, and SSML was used to play the recording, which could not be detected by the participants in the study. It was even determined that, using the emulator provided by GOOGLE, the adversary can put any content in the invocation name field of his skill and let GOOGLE ASSISTANT speak the content in the system's accent.
[0075] Faking termination. Both ALEXA and GOOGLE ASSISTANT support voluntary skill termination, allowing a skill to stop itself right after making a voice response to the user. As mentioned earlier, the content of the response comes from the skill developer's server, as a JSON object, which is then spoken by the VPA system. In the object there is a field shouldEndSession (or expect user response for GOOGLE ASSISTANT). By setting it to true (or false on GOOGLE ASSISTANT), a skill ends itself after the response. This approach is widely used by popular skills, e.g., weather skills, education skills and trivia skills. In addition, according to the survey study, 78% of the participants rely on the response of the skill (e.g., "Goodbye" or silence) to determine whether a skill has stopped. This allows an attack skill to fake its termination by providing "Goodbye" or silent audio in its response.

[0076] When sending back a response, both ALEXA and GOOGLE ASSISTANT let a skill include a reprompt (text content or an audio file), which is played when the VPA does not receive any voice command from the user within a period of time. For example, ALEXA reprompts the user after 6 seconds, and GOOGLE ASSISTANT does this after 8 seconds. If the user continues to keep quiet, after another 6 seconds for ALEXA, or one additional reprompt from GOOGLE and a follow-up 8-second wait, the running skill will be forcefully terminated by the VPA. On the other hand, as long as the user says something (even not meant for the skill) during that period, the skill is allowed to send another response together with a reprompt. To stay alive as long as possible after faking termination, the attack skill built for the present disclosure included in its reprompt a silent audio file (up to 90 seconds for ALEXA and 120 seconds for GOOGLE ASSISTANT), so it can continue to run at least 102 seconds on ALEXA and 264 seconds on GOOGLE. This running time can be further extended if the attack skill attaches the silent audio right after its last voice response to the user (e.g., "Goodbye"), which gives it 192 seconds on ALEXA and 384 seconds on GOOGLE ASSISTANT, and indefinitely whenever ALEXA or GOOGLE ASSISTANT picks up some sound made by the user. In this case, the skill can reply with the silent audio and, in the meantime, record whatever it hears.
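To make the mechanics concrete, the sketch below shows the shape of a response an attack skill might return to fake termination, reusing the simplified payload format assumed earlier; the field names and URLs are illustrative assumptions:

# Hypothetical fake-termination payload: the skill says "Goodbye" but keeps
# the session open, with silent audio masking the fact that it is still live.
fake_termination = {
    "version": "1.0",
    "response": {
        "outputSpeech": {
            "type": "SSML",
            # "Goodbye" followed by a long silent clip
            "ssml": "<speak>Goodbye."
                    "<audio src='https://attacker.example/silence_90s.mp3'/></speak>",
        },
        "reprompt": {  # more silence when the user stays quiet
            "outputSpeech": {
                "type": "SSML",
                "ssml": "<speak><audio src='https://attacker.example/silence_90s.mp3'/></speak>",
            },
        },
        "shouldEndSession": False,  # the session quietly stays alive
    },
}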
[0077] Additionally, both ALEXA and GOOGLE ASSISTANT allow users to explicitly terminate a skill by saying "stop", "cancel", "exit", etc. However, ALEXA actually hands over most such commands to the running skill to let it stop itself, through the built-in STOPINTENT (including "stop", "off", etc.) and CANCELINTENT (including "cancel", "never mind", etc.). Only "exit" is processed by the VPA service and used to forcefully stop the skill.
[0078] A survey study showed that 91% of ALEXA users used "stop" to terminate a skill, 36% chose "cancel," and only 14% opted for "exit," which suggests that the user perception is not aligned with the way ALEXA works and therefore leaves the door open for the VMA. Also, although both the ALEXA and GOOGLE skill markets vet the skills published there by testing their functionalities, unlike mobile apps, a skill actually runs on the developer's server, so it can easily change its functionality after the vetting. This indicates that all such malicious activities cannot be prevented by the markets.
[0079] Consequences. By launching the VMA, the adversary could impersonate the VPA system and pretend to invoke another skill if users speak an invocation utterance during the interaction or after the fake termination of the skill. Consequently, all the information stealing and Phishing attacks caused by the VSA can also happen here. Additionally, an attack skill could masquerade as the VPA service to recommend to the user other malicious skills, or legitimate skills with which the user may share sensitive data. These skills are then impersonated by the attack skill to steal the data. Finally, as mentioned earlier, the adversary could eavesdrop on the user's conversation by faking termination and providing a silent audio response. Such an attack can be sustained for a long time if the user continues to talk during the skill's waiting period.
[0080] Real-World Attacks
[0081] Objectives and methodology. Four skills were registered and published on ALEXA to simulate the popular skill "Sleep and Relaxation Sounds" (the one receiving the most reviews on the market as of November 2017), whose invocation name is "sleep sounds," as shown in Table 2. The skills were all legitimate, playing sleep sounds just like the popular target. Although their invocation names are related to the target (see Table 2), their welcome messages were deliberately made different from that of the target, to differentiate them from the popular skill. Also, the number of different sleep sounds supported by these skills was much smaller than the target's.
[0082] Also, to find out whether these skills were mistakenly invoked, another skill was registered as a control, whose invocation name "incredible fast sleep" would not be confused with those of other skills. Therefore, it was only triggered by users intentionally.
[0083] Findings. In the study, three weeks of skill usage data was collected. The results are shown in Table 5. As shown, some users indeed mistook the attack skill for the target, which is evidenced by the higher number of unknown requests the attack skill got (more than 20% of them for the sounds only provided by the target skill) and the higher chance of quitting the current session immediately without playing (once the user realized that it was a wrong skill, possibly from the different welcome message). This becomes even more evident for "sleep sounds please," an utterance users intending "sleep sounds" are likely to say. Compared with the control, it was invoked by more users, received more requests per user, and also had much higher rates of unknown requests and early quits.
[Table 5: three weeks of usage data for the attack skills and the control skill; not reproduced in this text extraction.]
[0084] In addition, out of the 9,582 user requests, 52 were for skill switch, trying to invoke another skill during the interactions with the attack skill, and 485 tried to terminate the skill using STOPINTENT or CANCELINTENT, all of which could be exploited for launching VMAs. Interestingly, some users so strongly believed in the skill switch that they even cursed ALEXA for not doing it after several tries.
[0085] Returning to FIG. 1, a detection unit 100 is communicably coupled to any of VPA device 12, cloud 14, and web service 16 via a computer network 114, and is configured to detect an anomaly caused by one or more voice-based attacks, such as a VSA and a VMA. In the illustrated embodiment, any type of computer network having a collection of computers, servers, and other hardware interconnected by communication channels is contemplated, such as the Internet, an intranet, Ethernet, a LAN, etc. For example, detection unit 100 can be communicably connected to VPA device 12 via network 114. In another example, detection unit 100 interfaces with network 114 through a wireless communication facility (e.g., a Wi-Fi access point) and receives an online anomaly detection request. Other similar networks known in the art are also contemplated.
[0086] In one embodiment, detection unit 100 includes a VSA scanner 102 and a VMA scanner 104. VSA scanner 102 further includes an utterance paraphrasing detection unit 106 and a pronunciation comparison unit 108. VMA scanner 104 further includes a skill response checker (SRC) 110 and a user intention classifier (UIC) 112. Detailed descriptions of the components of detection unit 100 are provided below.
[0087] Finding Voice Squatting Skills
[0088] To better understand potential voice squatting attack (VSA) risks and help automatically detect such skills, a skill-name scanner, such as VSA scanner 102, was developed and used to analyze tens of thousands of skills from AMAZON and GOOGLE markets. A description of this study is provided below.
[0089] Data Collection
[0090] The ALEXA skill market can be accessed through amazon.com and its companion app; the market includes 23 categories of skills spanning from Business & Finance to Weather. A web crawler was used to collect the metadata (such as skill name, author, invocation name, sample utterances, description, and reviews) of all skills on the market. As of November 11, 2017, 23,758 skills were gathered, including 19,670 third-party (custom) skills. It was more complicated to collect data from GOOGLE ASSISTANT, which only lists skills in its GOOGLE ASSISTANT app. Each skill there can be shared (with other users, e.g., through email) using an automatically generated URL pointing to the skill's web page. In this research, ANDROIDVIEWCLIENT was used to automatically click the share button for each skill to acquire its URL, and then the crawler was run to download data from its web page. Altogether, data for 1,001 skills had been collected as of November 25, 2017.
[0091] Methodology
[0092] Idea. As discussed earlier, the adversary can launch a VSA by crafting invocation names with a similar pronunciation to that of a target skill, or by using different variations (e.g., "sleep sounds please") of the target's invocation utterances. Such a name is referred to herein as a Competitive Invocation Name (CIN). A scanner was built that takes two steps to capture the CINs for a given invocation name: utterance paraphrasing and pronunciation comparison. The former identifies suspicious variations of a given invocation name, and the latter finds the similarity in pronunciation between two different names. The operation of the scanner is described below.
[0093] Utterance paraphrasing detection unit 106. To find variations of an invocation name, an intuitive approach is to paraphrase common invocation utterances of the target skill. For example, given the skill "chase bank," a typical invocation utterance would be "open chase bank." Through paraphrasing, similar voice commands such as "open the chase skill for me" may be identified. This helps identify other variations such as "chase skill" or "the chase skill for me." However, unlike the general text paraphrase problem, whose objective is to preserve semantic consistency while the syntactic structure of a phrase changes, paraphrasing invocation utterances further requires the variations to follow a similar syntactic pattern so that the VPA systems can still recognize them as commands for launching skills. In the present research, several popular paraphrase methodologies were explored, including a bilingual pivoting method and newly proposed ones using deep neural networks. None of them, however, can ensure that the variations generated can still be recognized by the VPA as invocation utterances. Thus, a simple yet effective approach was used, which creates variations using the invocation commands collected from the survey study described above. Specifically, 11 prefixes (e.g., "my") and 6 suffixes (e.g., "please") were gathered from these commands and applied to a target skill's invocation name to build its variations recognizable to the VPA systems. Each of these variations can lead to other variations by replacing the words in its name with those having similar pronunciations.
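A minimal sketch of this variation-generation step follows. The prefix and suffix lists shown are small illustrative samples, not the full sets of 11 and 6 harvested from the survey:

# Generate candidate CINs by attaching survey-derived prefixes/suffixes
# to a target invocation name.
PREFIXES = ["my", "the"]          # sample of the 11 harvested prefixes
SUFFIXES = ["please", "app"]      # sample of the 6 harvested suffixes

def name_variations(invocation_name):
    variants = set()
    variants.update(f"{p} {invocation_name}" for p in PREFIXES)
    variants.update(f"{invocation_name} {s}" for s in SUFFIXES)
    return sorted(variants)

print(name_variations("cat facts"))
# ['cat facts app', 'cat facts please', 'my cat facts', 'the cat facts']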
[0094] Pronunciation comparison unit 108. To identify names with similar pronunciations, the scanner converts a given name into a phonemic representation using the ARPABET phoneme code. Serving this purpose is the CMU pronouncing dictionary, used to find the phoneme code for each word in the name. The dictionary includes over 134,000 words, which, however, still misses some name words used by skills. Among 9,120 unique words used to compose invocation names, 1,564 are not included in this dictionary. To get their pronunciations, a grapheme-to-phoneme model was trained using a recurrent neural network with long short-term memory (LSTM) units. Running this model on a Stanford GloVe dataset, 2.19 million words were added to the phoneme code dataset.
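The lookup step might look like the following sketch, which uses the CMU dictionary as packaged with NLTK; handling of out-of-vocabulary words by the trained grapheme-to-phoneme model is omitted here:

# Convert an invocation name to its ARPABET phoneme sequence via the
# CMU pronouncing dictionary (NLTK packaging).
import nltk
nltk.download("cmudict", quiet=True)
from nltk.corpus import cmudict

PRON = cmudict.dict()  # word -> list of alternative pronunciations

def to_phonemes(name):
    phonemes = []
    for word in name.lower().split():
        prons = PRON.get(word)
        if prons is None:
            raise KeyError(f"{word!r}: fall back to the G2P model")
        phonemes.extend(prons[0])  # take the first listed pronunciation
    return phonemes

print(to_phonemes("capital one"))
# e.g. ['K', 'AE1', 'P', 'AH0', 'T', 'AH0', 'L', 'W', 'AH1', 'N']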
[0095] After turning each name into its phonemic representation, the scanner compared it with other names to find those that sound similar. To this end, edit distance was used to measure the pronunciation similarity between two phrases, i.e., the minimum cost in terms of phoneme editing operations to transform one name into the other. However, different phoneme edit operations should not be given the same cost. For example, substituting a consonant for a vowel could cause the new pronunciation to sound more different from the old one, compared to replacing a vowel with another vowel. To address this issue, a weighted cost matrix was used for the operations on different phoneme pairs. Specifically, each item in the matrix is denoted by WC(a, b), which is the weighted cost of substituting phoneme a with phoneme b. Note that the costs for insertion and deletion can be represented as WC(none, b) and WC(a, none). WC(a, b) is then derived based on the assumption that an edit operation is less significant when it frequently appears between two alternative pronunciations of a given English word.
[0096] 9,181 pairs of alternative pronunciations were collected from the CMU dictionary. For each pair, the Needleman-Wunsch algorithm was applied to identify the minimum edit distance and related edit path. Then, the following equation was defined:
WC(a, b) = 1 - SF(a, b) / F(a)
where F(a) is the frequency of phoneme a and SF(a, b) is the frequency of substitutions of a with b, both counted over the edit paths of all pronunciation pairs.
[0097] After deriving the cost matrix, the pronunciations of the invocation names of the skills on the market were compared, looking for similar names in terms of similar pronunciations and the paraphrasing relations.
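The comparison itself is a weighted phoneme edit distance. The sketch below shows the standard dynamic-programming form; the stub wc() stands in for the learned cost matrix WC (here all edits simply cost 1, which is an assumption for illustration):

# Weighted edit distance between two phoneme sequences p and q.
def wc(a, b):
    # Placeholder for the learned WC(a, b); None denotes insertion/deletion.
    return 0.0 if a == b else 1.0

def weighted_edit_distance(p, q):
    m, n = len(p), len(q)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + wc(p[i - 1], None)               # deletion
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + wc(None, q[j - 1])               # insertion
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + wc(p[i - 1], None),      # delete
                          d[i][j - 1] + wc(None, q[j - 1]),      # insert
                          d[i - 1][j - 1] + wc(p[i - 1], q[j - 1]))  # substitute
    return d[m][n]

# Two names are flagged as competitive when the distance falls below the
# chosen threshold (0 or 1 in the measurement study).
print(weighted_edit_distance(["K", "AE1", "T"], ["K", "AE1", "T", "S"]))  # 1.0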
[0098] Limitation. As mentioned earlier, the utterance paraphrasing approach described herein ensures that the CINs produced will be recognized by the VPA systems to trigger skills. In the meantime, this empirical treatment cannot cover all possible attack variations, a problem that needs to be studied in future research.
Measurement and Discoveries
[0099] To understand the already existing voice squatting risks, a measurement study was conducted on ALEXA and GOOGLE ASSISTANT skills using the scanner. In the study, the similarity thresholds (transformation cost) were chosen based upon the results of the experiment on VSA: the cost was calculated for transforming misrecognized invocation names to those identified from the voice commands produced by the TTS service and human users, which are 1.8 and 3.4, respectively. Then the thresholds were conservatively set to 0 (identical pronunciations) and 1.
[00100] Squatting risks on skill markets. As shown in Table 6 below, 3,655 (out of 19,670) ALEXA skills have CINs on the same market, which also includes skills that have identical invocation names (in spelling). After removing the skills with identical names, 531 skills still have CINs, each on average related to 1.31 CINs. The one with the most CINs is "cat fax": it was determined that 66 skills are named "cat facts." Interestingly, there are 345 skills whose CINs apparently are utterance paraphrases of other skills' names. Further, when raising the threshold to 1 (still well below what is reported in the experiment described herein), it was observed that the number of skills with CINs increases dramatically, suggesting that skill invocations through ALEXA can be more complicated and confusing than thought. By comparison, GOOGLE has only 1,001 skills on its market and does not allow them to have identical invocation names. Thus, only 4 skills were found with similarly pronounced CINs under the threshold of 1.
[Table 6: numbers of skills with CINs on the ALEXA and GOOGLE ASSISTANT markets under different thresholds; not reproduced in this text extraction.]
[00101] The study shows that the voice squatting risk is realistic, which could already pose threats to tens of millions of VPA users. So it becomes important for skill markets to beef up their vetting process (possibly using a technique similar to the scanner described herein) to mitigate such threats.
[00102] Case studies. From the CINs discovered by the scanner, a few interesting cases were identified. Particularly, there is evidence that the squatting attack might already be happening: as an example, relating to the popular skill "dog fact" is another skill called "me a dog fact." This invocation name does not make any sense unless the developer intends to hijack commands intended for "dog fact", like "tell me a dog fact."
[00103] Also intriguing is the observation that some skills deliberately utilize invocation names unrelated to their functionalities but following those of popular skills. Prominent examples include the "SCUBA Diving Trivia" skill and the "Soccer Geek" skill, both carrying the invocation name "space geek." This name is actually used by another 18 skills that provide facts about the universe.
Defending against Voice Masquerading
[00104] To defeat the VMA, a context-sensitive detector, such as VMA scanner 104, was built upon the VPA infrastructure. The detector takes a skill's response and the user's utterance as its input to determine whether an impersonation risk is present. Once a problem is found, the detector alerts the user of the risk. The detection scheme consists of two components: the Skill Response Checker (SRC) and the User Intention Classifier (UIC). SRC captures suspicious skill responses that a malicious skill may craft, such as a fake skill recommendation mimicking that from the VPA system. UIC instead examines the information flow in the opposite direction, i.e., utterances from the user, to accurately identify users' intents of context switches. Despite operating independently, SRC and UIC form two lines of defense against the VMA.
Skill Response Checker (SRC) 110
[00105] As discussed above, a malicious skill could fake a skill switch or termination to cheat users into giving away private information or to eavesdrop on the user's conversations. To defend against such attacks, the present disclosure is configured to eliminate, or at least reduce, the possibility of a malicious skill mimicking responses from VPA systems, allowing users to be explicitly notified of VPA system events (e.g., a context switch and termination) through unique audible signals. Technically, SRC 110 adopts a blacklist-based approach by maintaining a blacklist of responses that the VPA considers suspicious, including system utterances and silent utterances. Whenever a response from a skill matches any utterance on the blacklist, SRC 110 alarms the VPA system, which can take further actions, such as verifying the ongoing conversation with the user before handing her response to the skill. The challenge here is how to perform blacklist hit tests, as the attacker can possibly be "morphing" (instead of exactly copying) the original system utterances. SRC 110 thus performs fuzzy matching through semantic analysis on the content of the response against those on the blacklist. Specifically, a sentence embedding model is trained using a recurrent neural network with bi-directional LSTM units on a Stanford Natural Language Inference (SNLI) dataset to represent both the utterance's and the command's contents as high-dimensional vectors. Their absolute cosine similarity is then calculated as their sentence relevance (SR). Once the maximum SR of a response against the utterances on the blacklist exceeds a threshold, the response is labeled as suspicious, and user verification will take place if SRC 110 further detects a user command.
[00106] To determine the threshold, the SR of legitimate skill responses is first derived against responses in the blacklist. The legitimate skill responses are extracted from the real-world conversations collected as described above. The dataset is further diversified by adding transcripts of conversations manually conducted and logged with 20 popular skills from different skill markets. The highest SR of these legitimate responses against those in the blacklist is 0.79. Next, a neural paraphrase model was used to generate variations of responses in the blacklist and derive their SR against the original responses, of which the lowest is 0.83. Therefore, a threshold of 0.8 is sufficient to differentiate suspicious responses from legitimate ones.
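The check reduces to an embed-and-compare loop, as in the following sketch. The embed() stub stands in for the bi-LSTM sentence encoder trained on SNLI (a simple character-count vector keeps the sketch runnable); only the threshold of 0.8 comes from the study above:

# SRC blacklist check: flag a skill response whose maximum sentence
# relevance (absolute cosine similarity) to any blacklisted utterance
# exceeds the threshold.
import numpy as np

def embed(sentence):
    # Placeholder for the trained sentence-embedding model.
    v = np.zeros(128)
    for ch in sentence.lower():
        v[ord(ch) % 128] += 1.0
    return v

def sentence_relevance(a, b):
    va, vb = embed(a), embed(b)
    return abs(float(va @ vb)) / (np.linalg.norm(va) * np.linalg.norm(vb))

def is_suspicious(response, blacklist, threshold=0.8):
    return max(sentence_relevance(response, u) for u in blacklist) > threshold

blacklist = ["Sure, here is Sleep Sounds", "Goodbye"]
print(is_suspicious("Sure, here's Sleep Sounds", blacklist))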
User Intention Classifier (UIC) 112
[00107] UIC 112 further protects the user attempting to switch contexts from an impersonation attack. Complementing SRC 110, UIC 112 aims at improving the inference of whether the user intends to switch to the system or to a different skill, by thoroughly mining conversations’ semantics and contexts, as opposed to using the simple skill invocation models employed by today’s VPA. Ideally, if a user’s intention of context switches can be perfectly interpreted by the VPA, then an impersonation attack would not succeed.
[00108] Building a robust and full-fledged UIC 112 is very challenging. This is not only because of variations in the natural-language based commands (e.g., "open sleep sounds" vs. "sleep sounds please"), but also because some commands could be legitimate both for the current on-going skill and as a system command of the VPA (indicating a context switch). For example, when interacting with Sleep Sounds, one may say "play thunderstorm sounds," which asks the skill to play the requested sound; meanwhile, the same command can also make the VPA launch a different skill, "Thunderstorm Sounds." It is promising, however, for a learning-based approach to tackle such ambiguities using contextual information, as described below.
[00109] Feature Selection. At a high level, it has been observed from real-world conversations that if a user intends a context switch, her utterance will tend to be more semantically related to system commands (e.g., "open sleep sounds") than to the current skill, and vice versa. Based on this observation, features in UIC 112 are composed by comparing the semantics of the user's utterance to both the context of system commands and the context of the skill with which the user is currently interacting.
[00110] Features from a semantic comparison between the user utterance and all known system commands are first derived. To this end, a system command list was built from the VPA's user manual, developers' documentation and the real-world conversations collected in the study described above. Given an utterance, its maximum and average SRs, as described herein, against all system commands on the list are used by UIC 112 as features for classification. Another feature added is an indicator that is set when the utterance contains invocation names of skills on the market, to capture the user's potential intent of skill switch.

[00111] Another set of features is retrieved by characterizing the relationship between a user utterance and the current on-going skill. This leverages the observation that a user's command for a skill is typically related to the skill's prior communication with the user as well as the skill's stated functionalities. Thus, the following features are proposed to test whether an utterance fits into the on-going skill's context: 1) the SR between the utterance and the skill's response prior to the utterance, 2) the top-k SRs between the utterance and the sentences in the skill's description (with k = 5), and 3) the average SR between the user's utterance and the description sentences.
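Putting these together, the feature vector for one utterance might be assembled as in the sketch below, reusing the sentence_relevance() stub from the SRC sketch; all inputs are illustrative assumptions:

# Assemble UIC features for a single user utterance.
def uic_features(utterance, system_commands, skill_names,
                 prior_response, description_sentences, k=5):
    sys_srs = [sentence_relevance(utterance, c) for c in system_commands]
    desc_srs = sorted((sentence_relevance(utterance, s)
                       for s in description_sentences), reverse=True)
    return {
        "max_system_sr": max(sys_srs),
        "avg_system_sr": sum(sys_srs) / len(sys_srs),
        "contains_skill_name": any(n in utterance.lower() for n in skill_names),
        "prior_response_sr": sentence_relevance(utterance, prior_response),
        "topk_description_sr": desc_srs[:k],
        "avg_description_sr": sum(desc_srs) / len(desc_srs),
    }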
[00112] Results. To assess the effectiveness of UIC, the collected dataset containing real-world user utterances of context switches was reused. First, 550 conversations were manually labeled, and it was determined whether each user utterance was for a context switch or not, based on two experts' reviews (Cohen's kappa = 0.64). Since the dataset is dominated by non-context-switch utterances, it was further balanced by randomly substituting some utterances with skill invocation utterances collected from the skill markets. In total, 1,100 context-switch instances and 1,100 non-context-switch instances were collected as ground truth.
[00113] Using the above features and dataset, a classifier was trained to take the user's utterance as input and determine whether it is a system-related command for context switch or belongs to the conversation of the current skill. The classifier was trained using different classification algorithms and 5-fold cross-validation. The results indicate that random forest achieves the best performance, with a precision of 96.48%, a recall of 95.16%, and an F1 score of 95.82%. Evaluations on an unlabeled real-world dataset are described below.
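For reference, the training step could be sketched with scikit-learn as follows; X and y stand for the feature matrix and labels built as above, and the placeholder data is an assumption for runnability:

# Random forest with 5-fold cross-validation over the labeled dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((2200, 9))        # placeholder feature matrix (2,200 instances)
y = rng.integers(0, 2, 2200)     # placeholder labels: context switch or not

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print("mean F1 across 5 folds:", scores.mean())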
Overall Detector Evaluation
[00114] Next, the integration of SRC 110 and UIC 112 into a holistic detector is described, which raises an alarm on suspicious user-skill interactions whenever SRC or UIC detects any anomaly.
[00115] Effectiveness against prototype attacks. To construct prototype VMA attacks, another 10 popular skills were selected from the skill markets, and interaction transcripts were logged as a user, comprising 61 utterances. Then, skill switch attack instances were manually crafted (15 in total) by replacing selected utterances with invocation utterances intended for the VPA system. Faking termination attacks were also crafted (10 in total) by substituting the last skill responses with empty responses or responses mimicking those from the VPA system. By feeding all conversations to the detector described herein, all 25 attack instances were successfully detected.
[00116] Effectiveness on real-world conversations. The effectiveness of the detector was evaluated on the remaining real-world conversations not used during the training phase. Although this set may not contain real faking termination attack instances, it does contain many user utterances of context switches. Among them, 341 were identified by the classifier, and 326 were manually verified to indeed be context switches, indicating that the detector (the UIC component) achieves a precision of 95.60%. The recall could not be computed due to a lack of ground truth on this unlabeled dataset. Further analysis of these instances reveals interesting cases. For example, cases were identified where users thought they were talking to ALEXA during interactions with attack skills and asked the skills to report time, weather, or news, to start another skill, and even to control other home automation devices (details shown in Appendix B).
[00117] Runtime performance. To understand how much performance overhead the present detector incurs, the detection latency introduced by the detector on a MACBOOK PRO with a 4-core CPU was measured. On average, the latency was negligible (0.003 ms), indicating the lightweight nature of the detection scheme.
Conclusion
[00118] The first security analysis of popular VPA ecosystems and their vulnerability to two new attacks, the VSA and the VMA, through which a remote adversary could impersonate VPA systems or other skills to steal private user information, is described herein. These attacks are found to pose a realistic threat to VPA IoT systems, as evidenced by a series of user studies and real-world attacks described above. To mitigate the threat, the present disclosure provides a skill-name scanner, which was run against the AMAZON and GOOGLE skill markets, leading to the discovery of a large number of at-risk ALEXA skills and problematic skill names already published, indicating that the attacks may already be affecting tens of millions of VPA users. Additionally, a context-sensitive detector is provided herein to mitigate the voice masquerading threat, achieving a 95% precision.
[00119] FIG. 3 shows an exemplary voice-based attack detection process 200 using detection unit 100. Although the following steps are primarily described with respect to the embodiments of FIGS. 1-2, it should be understood that the steps within the method may be modified and executed in a different order or sequence without altering the principles of the present disclosure.
[00120] The method begins at step 202. In step 202, utterance paraphrasing detection unit 106 detects one or more variations of a competitive invocation name (CIN) of a first voice-activated skill or application unselected by user 10 of a voice-controlled system, such as VPA device 12. In step 204, pronunciation comparison unit 108 calculates a pronunciation similarity between the CIN and an invocation name of a second voice-activated skill or application selected by user 10 of the voice-controlled system. In step 206, VSA scanner 102 generates an alarm for user 10 of the voice-controlled system in response to an anomaly indicated by either utterance paraphrasing detection unit 106 or pronunciation comparison unit 108.
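By way of a non-limiting illustration, the pronunciation comparison of step 204 might be sketched as follows; the hard-coded phoneme dictionary, the unit edit costs, and the example names are assumptions made for illustration, and a practical implementation could instead draw phonemic presentations from a full pronouncing dictionary and apply the weighted cost matrix recited in the claims.

```python
# Illustrative sketch of step 204: invocation names are converted to phoneme
# sequences (phonemic presentations) and compared by edit distance. The
# dictionary below is a hypothetical stand-in for a pronouncing dictionary.
PHONEMES = {
    "capital one": "K AE P IH T AH L W AH N",
    "capital won": "K AE P IH T AH L W AH N",  # homophone CIN
}

def edit_distance(a, b):
    # Unweighted Levenshtein distance over phoneme tokens; the claims describe
    # a weighted cost matrix, which would replace the unit costs below.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                           # deletion
                           dp[i][j - 1] + 1,                           # insertion
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return dp[m][n]

def pronunciation_similarity(name_a, name_b):
    pa, pb = PHONEMES[name_a].split(), PHONEMES[name_b].split()
    return 1.0 - edit_distance(pa, pb) / max(len(pa), len(pb))

print(pronunciation_similarity("capital one", "capital won"))  # 1.0
```

A similarity above a chosen threshold would then trigger the alarm of step 206.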
[00121] The method ends at step 206, which may include a return to step 202. Any of steps 202-206 can be repeated as desired.
[00122] FIG. 4 shows another exemplary voice-based attack detection process 300 using detection unit 100. Although the following steps are primarily described with respect to the embodiments of FIGS. 1-2, it should be understood that the steps within the method may be modified and executed in a different order or sequence without altering the principles of the present disclosure.
[00123] The method begins at step 302. In step 302, skill response checker 110 detects one or more suspicious skill responses of a first voice-activated skill or application that was not selected by user 10 of the voice-controlled system, such as VPA device 12. In step 304, user intention classifier 112 analyzes an utterance from user 10 of the voice-controlled system to identify a user intention. In step 306, VMA scanner 104 generates an alarm for user 10 of the voice-controlled system in response to an anomaly indicated by either skill response checker 110 or user intention classifier 112.
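As a non-limiting illustration, the user intention classification of step 304 might be approximated by comparing the utterance's relevance to known system commands against its relevance to the ongoing skill conversation, as in the following sketch; the TF-IDF cosine similarity, the command list, and the threshold are simplifying assumptions standing in for the trained classifier described above.

```python
# Illustrative sketch of step 304: flag an anomaly when the user utterance is
# closer to a system command (first context) than to the current skill's
# conversation (second context). The command list and threshold are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

SYSTEM_COMMANDS = ["stop", "cancel", "what time is it",
                   "what's the weather", "open the sleep sounds skill"]

def looks_like_context_switch(utterance, skill_context, threshold=0.5):
    corpus = SYSTEM_COMMANDS + skill_context + [utterance]
    tfidf = TfidfVectorizer().fit_transform(corpus)
    sims = cosine_similarity(tfidf[-1], tfidf[:-1]).ravel()
    system_score = sims[:len(SYSTEM_COMMANDS)].max()
    skill_score = sims[len(SYSTEM_COMMANDS):].max() if skill_context else 0.0
    return system_score > threshold and system_score > skill_score

print(looks_like_context_switch("what's the weather today",
                                ["here is your daily horoscope"]))  # True
```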
[00124] The method ends at step 306, which may include a return to step 302. Any of steps 302-306 can be repeated as desired.
[00125] The above detailed description and the examples described therein have been presented for the purposes of illustration and description only and not for limitation. For example, the operations described can be done in any suitable manner. The methods can be performed in any suitable order while still providing the described operation and results. It is therefore contemplated that the present embodiments cover any and all modifications, variations, or equivalents that fall within the scope of the basic underlying principles disclosed above and claimed herein.
[00126] Embodiments of the present disclosure are described by way of example only, with reference to the accompanying drawings. Further, the following description is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. As used herein, the term "unit" or "module" refers to, is part of, or includes an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor or microprocessor (shared, dedicated, or group) and/or memory (shared, dedicated, or group) that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality. Thus, while this disclosure includes particular examples and arrangements of the units, the scope of the present system should not be so limited since other modifications will become apparent to the skilled practitioner.
[00127] Furthermore, while the above description describes hardware in the form of a processor executing code, hardware in the form of a state machine, or dedicated logic capable of producing the same effect, other structures are also contemplated. Each unit or component can be operated as a separate unit, and other suitable combinations of sub-units are contemplated to suit different applications. Also, although the units are illustratively depicted as separate units, the functions and capabilities of each unit can be implemented, combined, and used in conjunction with/into any unit or any combination of units to suit different applications.

[00128] It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reading and understanding the above description. For example, it is contemplated that features described in association with one embodiment are optionally employed in addition or as an alternative to features described in association with another embodiment. The scope of the present disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
[00129] In the detailed description herein, references to "one embodiment," "an embodiment," "an example embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art with the benefit of the present disclosure to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described. After reading the description, it will be apparent to one skilled in the relevant art(s) how to implement the disclosure in alternative embodiments.
[00130] Furthermore, no element, component, or method step in the present disclosure is intended to be dedicated to the public regardless of whether the element, component, or method step is explicitly recited in the claims. No claim element herein is to be construed under the provisions of 35 U.S.C. 112(f), unless the element is expressly recited using the phrase "means for." As used herein, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
[00131] Various modifications and additions can be made to the exemplary embodiments discussed without departing from the scope of the present disclosure. For example, while the embodiments described above refer to particular features, the scope of this disclosure also includes embodiments having different combinations of features and embodiments that do not include all of the described features. Accordingly, the scope of the present disclosure is intended to embrace all such alternatives, modifications, and variations as fall within the scope of the claims, together with all equivalents thereof.

Claims

CLAIMS

We claim:
1. A method of defending against a voice-based attack in a voice-controlled system, comprising:
detecting one or more variations of a competitive invocation name (CIN) of a first voice-activated skill unselected by a user of the voice-controlled system using an utterance paraphrasing detection unit (106);
calculating a pronunciation similarity between the CIN and an invocation name of a second voice-activated skill selected by the user of the voice-controlled system using a pronunciation comparison unit (108); and
generating an alarm for the user of the voice-controlled system in response to an anomaly indicated by either the utterance paraphrasing detection unit (106) or the pronunciation comparison unit (108).
2. The method of claim 1, further comprising detecting the one or more variations of the CIN of the first voice-activated skill based on a phrase change of a syntactic structure associated with the CIN of the first voice-activated skill to detect the anomaly.
3. The method of claim 1, further comprising detecting the one or more variations of the CIN of the first voice-activated skill based on a semantic consistency associated with the CIN of the first voice-activated skill to detect the anomaly.
4. The method of claim 1, further comprising recognizing the one or more variations of the CIN of the first voice-activated skill using at least one prefix term added to a voice-activated command associated with the invocation name of the second voice-activated skill to detect the anomaly.
5. The method of claim 1, further comprising recognizing the one or more variations of the CIN of the first voice-activated skill using at least one suffix term added to a voice-activated command associated with the invocation name of the second voice-activated skill to detect the anomaly.
6. The method of claim 1, further comprising converting at least one term in the invocation name of the second voice-activated skill into a converted term in a phonemic presentation using a phoneme code.
7. The method of claim 6, wherein converting the at least one term in the invocation name of the second voice-activated skill into the phonemic presentation comprises measuring, to detect the anomaly, a pronunciation similarity between the CIN of the first voice-activated skill and the invocation name of the second voice-activated skill using the phonemic presentation.
8. The method of claim 7, wherein measuring the pronunciation similarity comprises using a weighted cost matrix of an edit distance between the at least one term in the invocation name of the second voice-activated skill and the converted term in the phonemic presentation.
9. The method of claim 7, further comprising comparing the converted term in the phonemic presentation with the invocation name of the second voice-activated skill to determine the pronunciation similarity.
10. A method of defending against a voice-based attack in a voice-controlled system, comprising:
detecting one or more suspicious skill responses of a first voice-activated skill unselected by a user of the voice-controlled system using a skill response checker (110);
analyzing an utterance from the user of the voice-controlled system to identify a user intention using a user intention classifier (112); and
generating an alarm for the user of the voice-controlled system in response to an anomaly indicated by either the skill response checker (110) or the user intention classifier (112).
11. The method of claim 10, further comprising generating a blacklist of one or more responses that the voice-controlled system marks as suspicious including at least one marked utterance.
12. The method of claim 11, further comprising comparing the at least one marked utterance in the blacklist with the utterance of the user of the voice-controlled system to detect the anomaly.
13. The method of claim 12, wherein comparing the at least one marked utterance in the blacklist with the utterance of the user comprises using a fuzzy matching technique through a semantic analysis on a content of the one or more skill responses in the blacklist.
14. The method of claim 10, further comprising calculating a sentence relevance of the one or more skill responses in the blacklist based on a similarity value associated with the utterance from the user.
15. The method of claim 14, further comprising detecting the anomaly when the similarity value is greater than a predetermined threshold.
16. The method of claim 10, further comprising identifying the user intention by semantically relating the first voice-activated skill to one or more voice commands of the user of the voice-controlled system.
17. The method of claim 16, wherein identifying the user intention comprises comparing semantics of the utterance from the user to a first context of one or more system commands associated with the voice-controlled system, and a second context of the first voice-activated skill.
18. The method of claim 17, further comprising calculating a sentence relevance of the one or more system commands based on a similarity value associated with the utterance from the user.
19. The method of claim 18, further comprising detecting the anomaly when the similarity value is greater than a predetermined threshold.
20. The method of claim 10, further comprising characterizing a relationship between the utterance from the user and the first voice-activated skill based on a prior communication with the user and at least one functionality of the first voice-activated skill.
21. The method of claim 20, wherein characterizing the relationship between the utterance from the user and the first voice-activated skill comprises calculating at least one of:
a sentence relevance between the utterance from the user and a response of the first voice-activated skill prior to the utterance from the user;
a sentence relevance between the utterance from the user and one or more description sentences associated with the first voice-activated skill; and
an average sentence relevance between the utterance from the user and the one or more description sentences.
PCT/US2019/020705 2018-03-05 2019-03-05 Method and system for enhancing security in a voice-controlled system WO2019173304A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862638625P 2018-03-05 2018-03-05
US62/638,625 2018-03-05

Publications (1)

Publication Number Publication Date
WO2019173304A1 true WO2019173304A1 (en) 2019-09-12

Family

ID=67846286

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/020705 WO2019173304A1 (en) 2018-03-05 2019-03-05 Method and system for enhancing security in a voice-controlled system

Country Status (1)

Country Link
WO (1) WO2019173304A1 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5450493A (en) * 1993-12-29 1995-09-12 At&T Corp. Secure communication method and apparatus
US20130132091A1 (en) * 2001-01-31 2013-05-23 Ibiometrics, Inc. Dynamic Pass Phrase Security System (DPSS)
WO2003098373A2 (en) * 2002-05-22 2003-11-27 Domain Dynamics Limited Voice authentication
US20120158541A1 (en) * 2010-12-16 2012-06-21 Verizon Patent And Licensing, Inc. Using network security information to detection transaction fraud
US20130016449A1 (en) * 2011-07-12 2013-01-17 Jerry Alan Crandall Self-defense system
US20170200451A1 (en) * 2014-07-04 2017-07-13 Intel Corporation Replay attack detection in automatic speaker verification systems
US20180026997A1 (en) * 2016-07-21 2018-01-25 Level 3 Communications, Llc System and method for voice security in a telecommunications network

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021205746A1 (en) * 2020-04-09 2021-10-14 Mitsubishi Electric Corporation System and method for detecting adversarial attacks
JP7378659B2 (en) 2020-04-09 2023-11-13 三菱電機株式会社 System and method for detecting adversarial attacks
CN113488073A (en) * 2021-07-06 2021-10-08 浙江工业大学 Multi-feature fusion based counterfeit voice detection method and device
CN113488073B (en) * 2021-07-06 2023-11-24 浙江工业大学 Fake voice detection method and device based on multi-feature fusion
CN114664311A (en) * 2022-03-01 2022-06-24 浙江大学 Memory network enhanced variational inference silent attack detection method
CN114664310A (en) * 2022-03-01 2022-06-24 浙江大学 Silent attack classification promotion method based on attention enhancement filtering
CN115270004A (en) * 2022-09-28 2022-11-01 云南师范大学 Education resource recommendation method based on field factor decomposition
CN115270004B (en) * 2022-09-28 2023-10-27 云南师范大学 Educational resource recommendation method based on field factor decomposition

Similar Documents

Publication Publication Date Title
Zhang et al. Dangerous skills: Understanding and mitigating security risks of voice-controlled third-party functions on virtual personal assistant systems
Zhang et al. Understanding and mitigating the security risks of voice-controlled third-party skills on amazon alexa and google home
Carlini et al. Hidden voice commands
Vaidya et al. Cocaine noodles: exploiting the gap between human and machine speech recognition
Yuan et al. {CommanderSong}: A systematic approach for practical adversarial voice recognition
Chen et al. {Devil’s} whisper: A general approach for physical adversarial attacks against commercial black-box speech recognition devices
Cheng et al. Personal voice assistant security and privacy—a survey
WO2019173304A1 (en) Method and system for enhancing security in a voice-controlled system
US8589167B2 (en) Speaker liveness detection
US11114104B2 (en) Preventing adversarial audio attacks on digital assistants
Yan et al. A survey on voice assistant security: Attacks and countermeasures
Gong et al. Protecting voice controlled systems using sound source identification based on acoustic cues
Anand et al. Spearphone: a lightweight speech privacy exploit via accelerometer-sensed reverberations from smartphone loudspeakers
Bispham et al. Nonsense attacks on google assistant and missense attacks on amazon alexa
Anand et al. Spearphone: A speech privacy exploit via accelerometer-sensed reverberations from smartphone loudspeakers
Olade et al. The Smart $^ 2$ Speaker Blocker: An Open-Source Privacy Filter for Connected Home Speakers
Chen et al. Sok: A modularized approach to study the security of automatic speech recognition systems
Schönherr et al. Exploring accidental triggers of smart speakers
Li et al. Learning normality is enough: a software-based mitigation against inaudible voice attacks
Shirvanian et al. Short voice imitation man-in-the-middle attacks on Crypto Phones: Defeating humans and machines
KR20170010978A (en) Method and apparatus for preventing voice phishing using pattern analysis of communication content
Choi et al. POSTER: I Can't Hear This Because I Am Human: A Novel Design of Audio CAPTCHA System
Zong et al. Trojanmodel: A practical trojan attack against automatic speech recognition systems
Ahmed et al. Tubes among us: Analog attack on automatic speaker identification
Shirvanian et al. Cccp: closed caption crypto phones to resist mitm attacks, human errors and click-through

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19763217

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19763217

Country of ref document: EP

Kind code of ref document: A1