FI127920B - Online multimodal information transfer method, related system and device - Google Patents

Online multimodal information transfer method, related system and device Download PDF

Info

Publication number
FI127920B
FI127920B FI20165708A
Authority
FI
Finland
Prior art keywords
user
data
speech
text
terminal
Prior art date
Application number
FI20165708A
Other languages
Finnish (fi)
Swedish (sv)
Other versions
FI20165708A (en)
Inventor
Martti Pitkänen
Robert Parts
Pirjo Huuhka-Pitkänen
Original Assignee
Aplcomp Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aplcomp Oy filed Critical Aplcomp Oy
Priority to FI20165708A priority Critical patent/FI127920B/en
Publication of FI20165708A publication Critical patent/FI20165708A/en
Application granted granted Critical
Publication of FI127920B publication Critical patent/FI127920B/en


Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 - Network architectures or network communication protocols for network security
    • H04L 63/08 - Network architectures or network communication protocols for network security for authentication of entities
    • H04L 63/0861 - Network architectures or network communication protocols for network security for authentication of entities using biometrical features, e.g. fingerprint, retina-scan
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/30 - Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F 21/31 - User authentication
    • G06F 21/32 - User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/32 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L 9/3226 - Cryptographic mechanisms or cryptographic arrangements including means for verifying the identity or authority of a user of the system, using a predetermined code, e.g. password, passphrase or PIN
    • H04L 9/3231 - Biological data, e.g. fingerprint, voice or retina

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Method (600) for multimodal information transfer in online environment preferably involving substantially real-time communication between two or more users, to be performed by a system comprising at least one server apparatus, comprising obtaining (604) digital sound data comprising speech captured by a first user terminal from a speech source and associated with a claim of an identity of a first user, said digital sound data preferably incorporating both air microphone and contact microphone based data, preferably verifying (606) on the basis of the digital sound data and reference voiceprint data linked with the first user and characterizing the voice of the first user, whether the sound data has been uttered by the first user to authenticate the speech source as the first user, converting (608) the sound data into corresponding text, optionally further translating the text into another language preferably being the language selected by the first user or a second user utilizing a second terminal, and providing (610) the text, optionally including said translated text, to at least one user terminal, preferably including said first terminal and/or said second terminal of the second user, for reproduction preferably via a display of the concerned terminal. A related system and terminal are also presented.

Description

FIELD OF THE INVENTION

Generally, the invention pertains to computers and related communications infrastructures. In particular, however not exclusively, the present invention concerns multimodal communication via an online environment incorporating verbal and textual modalities.
BACKGROUND
Access control in conjunction with e.g. network services or physical resources may imply user identification, which can be generally based on a variety of different approaches. For example, three categories may be considered including anonymous, standard and strong identification. In the anonymous case, the service users do not have to be and are not identified. Standard, or ‘normal’, identification may refer to what the requestor for access knows, such as a password, or bears, such as a physical security token. Such a token may include a password-generating device (e.g.
SecurID™), a list of one-time passwords, a smart card and a reader, or a one-time password transmitted to a mobile terminal. Further, strong identification may be based on a biometric property, particularly a biometrically measurable property, of a user, such as a fingerprint or retina, or a security token the transfer of which between persons is difficult, such as a mobile terminal including a PKI (Public Key Infrastructure) certificate requiring entering a PIN (Personal Identification Number) code upon each instance of use.
On the other hand, network service-related authentication, i.e. reliable identification, may also be implemented on several levels, e.g. on four levels, potentially including unnecessary, weak, strongish, and strong authentication, wherein the strongish authentication, being stronger than weak, thus resides between the weak and strong options. If the user may remain anonymous, authentication is unnecessary. Weak authentication may refer to the use of a single standard-category identification means such as a user ID/password pair, for example. Instead, strongish authentication may apply at least two standard identification measures utilizing different techniques. With strong authentication, at least one of the identification measures should be strong.
Notwithstanding the various advancements that have taken place during the last years in the context of user and service identification, authentication, and related secure data transfer, some defects still remain therewith and are next briefly and non-exhaustively reviewed together with useful general background information.
Roughly, access control methods for network services include push and pull methods. In pull methods, a user may first identify oneself anonymously to a network service providing a login screen in return. The user may then type in the user ID and a corresponding password, whereupon he/she may directly access the service or be funneled into the subsequent authentication phase. In push methods, a network server may first transmit information to the e-mail address of the user in order to authorize accessing the service. Preferably only the user knows the password of the e-mail account.
Various applications of information capturing, information transfer and, specifically, communication such as verbal communication, essentially speech, involve different user identification, authentication and service quality challenges.
For instance, in a variety of online environments, with reference to IP (Internet
Protocol) networks such as the Internet, or communication/computer networks in general, voice or textual (‘chat’) communication services may suffer from poor or practically non-existent verifiability of the identity of a user e.g. at a remote end of a connection.
In some use scenarios, e.g. a speech signal that has been captured using an ordinary air microphone contains so much ambient noise, or otherwise deviates so far from the original speech emitted, containing e.g. various artifacts due to lossy encoding or transfer, that even the intelligibility of the speech is severely limited. Thereupon, related speaker identification or verification by another entity such as a remote party of the communication connection may turn out to be impossible or at least difficult, even if the remote entity knows the person indicated as the speaker upfront, has talked to them earlier and could, in the case of more ideal conditions, really verify their identity with sufficient reliability merely through listening to them.
Still, one might consider e.g. multimodal scenarios wherein speech of a user is converted into text, optionally even translated into another language, and potentially finally transformed back to speech using speech synthesis. Obviously, the features characterizing the user’s speech for identification or authentication purposes, considering e.g. formant frequencies of a vocal tract, are no longer derivable from the speech signal reproduced at the remote end by means of some generic speech synthesizer. Accordingly, the origin of the speech cannot be manually or even technically verified at the receiving end by relying on e.g. recalled voice characteristics of the claimed speaker and comparing them with the characteristics of the actually received signal. Instead, some other assurance of the identity of the speaker should be obtained.
Simple login credentials (user id, password) are insufficient for many modern security-driven applications such as banking applications or other applications of an essentially private and/or financial nature.
The effect of captured ambient noises or quality issues arising from speech encoding or transfer may already ruin the text conversion or translation accuracy even prior to the potential synthesis. Besides, conversion and translation are computation-heavy procedures, the execution of which easily causes capacity problems in connection with e.g. ordinary user terminals and especially mobile-type devices relying on portable power sources such as rechargeable batteries.
As one additional concrete example, IP calls are often performed in noisy, potentially mobile or otherwise dynamic environments and the detrimental effects of ambient noise have to be mitigated somehow. It can be done by using advanced audio (air) microphone technology in connection with the terminals and e.g. complex noise-reducing algorithms at least on a server side, which, however, involves an inherent trade-off between the authenticity and intelligibility of the signal. Extensive processing of a speech signal to reduce e.g. noise and increase intelligibility thereof unfortunately tends to also alter the original speech content including various delicate properties characterizing the speaker and thus otherwise finding use in voice-based authentication.
Publication FI126129 discloses an electronic system for authenticating a user of an electronic service. The system maintains a subset of cues for determining whether a voice response is uttered by an existing user on the basis of sound data. The sound data indicative of the voice responses uttered to the represented cues are preferably matched as concatenated against a concatenated voiceprint established based on the voiceprints linked with the represented cues and the existing user.
SUMMARY OF THE INVENTION
The objective is to at least alleviate one or more problems described hereinabove regarding the usability, quality, archiving, capacity, communication and security issues, such as identification or authentication, associated with electronic services, such as online services, for information transfer and especially communication.
The objective is achieved by the system, device and method in accordance with the present invention.
In one embodiment, an electronic system for multimodal information transfer in online environment preferably involving substantially real-time communication between two or more users, said system comprising at least one server apparatus provided with a processing device and memory entity for processing and storing instructions and other data, respectively, and a data transfer interface for receiving and sending data via a network of the online environment, preferably the Internet, the instructions being configured to cause, when executed by the processing device, the at least one server apparatus to

obtain digital sound data comprising speech captured by a first user terminal from a speech source and associated with a claim of an identity of a first user, said digital sound data incorporating both air microphone and contact microphone based data, said contact microphone based data being utilized in segmenting the speech of the source included at least in said air microphone based data of the digital sound data from further sound data including background noises,

preferably verify on the basis of the digital sound data and reference voiceprint data linked with the first user and characterizing the voice of the first user, whether the sound data has been uttered by the first user to authenticate the speech source as the first user,

convert the sound data into corresponding text, optionally further translating the text into another language preferably being the language selected by the first user or a second user utilizing a second terminal, and

provide the text, optionally including said translated text, to at least one user terminal, preferably including said first terminal and/or said second terminal of the second user, for reproduction preferably via a display of the concerned terminal.
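Purely as an illustration of the obtain-verify-convert-provide chain outlined above, the following Python sketch stubs out the individual stages. All class and function names (SoundData, verify_speaker, speech_to_text, translate, handle_utterance) are placeholders invented for this sketch, and the actual verification, conversion and translation logic is intentionally omitted.

```python
# Minimal sketch of the server-side flow described above. All names are
# illustrative placeholders; they are not taken from the patent.
from dataclasses import dataclass

@dataclass
class SoundData:
    air_samples: list        # air microphone based samples
    contact_samples: list    # contact microphone based samples
    claimed_user_id: str
    target_language: str | None = None

def verify_speaker(sound: SoundData, voiceprint_db: dict) -> bool:
    """Placeholder: compare the utterance against the stored voiceprint."""
    return sound.claimed_user_id in voiceprint_db    # real matching omitted

def speech_to_text(sound: SoundData) -> str:
    """Placeholder for the speech-to-text engine."""
    return "<transcribed text>"

def translate(text: str, target_language: str) -> str:
    """Placeholder for the language translation stage."""
    return f"<{target_language} translation of: {text}>"

def handle_utterance(sound: SoundData, voiceprint_db: dict) -> dict:
    # Verify the claimed identity, convert to text, optionally translate,
    # and return the text for delivery to the concerned terminal(s).
    if not verify_speaker(sound, voiceprint_db):
        return {"authenticated": False, "text": None}
    text = speech_to_text(sound)
    if sound.target_language:
        text = translate(text, sound.target_language)
    return {"authenticated": True, "text": text}
```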
In one embodiment, in addition to the text, corresponding digital speech data including or derived from the obtained digital sound data, optionally alternatively or additionally including speech data derived from the translated text utilizing text-to-speech synthesis, is provided to the first terminal and/or second terminal of the second user for audible reproduction.
Optionally, the system may be configured to apply the voiceprint data in adapting text-to-speech synthesis of the text, optionally adapting one or more parameters of a synthesis model reconstructing the characteristics of the human voice production mechanism, such as the vocal tract.
In an embodiment, the system is configured to store, for a number of users including said first user, a plurality of personal voiceprints, each of which is linked with a dedicated visual, audiovisual or audio cue, for challenge-response authentication of the users, the cues being user-selected, user-provided or user-created.
In a related embodiment, the system is configured to

pick, upon receipt of an authentication request associated with a claim of an identity of an existing user of said number of users, a subset of cues for which there are voiceprints of the first user stored, and provide the cues for representation to the speech source as a challenge,

receive digital sound data indicative of the voice responses uttered by the speech source to the represented cues, the voice responses represented by the sound data being captured preferably simultaneously utilizing an air microphone and contact microphone optionally integrated in a common dual-microphone apparatus, said sound data thus preferably including both air microphone-based data and contact microphone-based data,

determine on the basis of the digital sound data, the represented cues and voiceprints linked therewith and the first user, whether the response has been uttered by the first user of said number of users, wherein the sound data indicative of the voice responses uttered to the represented cues are optionally matched as concatenated against a concatenated voiceprint established based on the voiceprints linked with the represented cues and the first user, and,

provided that this seems to be the case, elevate the authentication status of the user as the first user, preferably regarding at least the current communication session.
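A minimal sketch of such a cue-based challenge-response round is given below, assuming each enrolled cue is stored as a fixed-length feature vector. The feature extraction, similarity measure and threshold are illustrative assumptions made for this example, not the matching procedure of the embodiments.

```python
# Sketch: pick a random subset of enrolled cues as a challenge and match the
# concatenated responses against the concatenated enrolled voiceprints.
import random
import numpy as np

def pick_challenge(enrolled_cue_ids, k=3):
    """Dynamically select a subset of cues the claimed user has voiceprints for."""
    return random.sample(enrolled_cue_ids, k=min(k, len(enrolled_cue_ids)))

def extract_features(sound_segment) -> np.ndarray:
    """Placeholder feature extraction (a real system might use e.g. MFCC statistics)."""
    return np.asarray(sound_segment, dtype=float)

def verify_concatenated(responses, challenge, voiceprints, threshold=0.8):
    """responses/voiceprints: dicts mapping cue id -> per-cue sample/feature vector."""
    test = np.concatenate([extract_features(responses[c]) for c in challenge])
    ref = np.concatenate([np.asarray(voiceprints[c], dtype=float) for c in challenge])
    n = min(len(test), len(ref))                      # crude alignment for the sketch
    score = float(np.dot(test[:n], ref[:n]) /
                  (np.linalg.norm(test[:n]) * np.linalg.norm(ref[:n]) + 1e-9))
    return score >= threshold
```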
In preferred embodiments, the system is configured to authenticate the speech source, preferably at regular or irregular (e.g. random or triggered by predefined event(s)) intervals, during and substantially after the start of the multimodal information transfer procedure utilizing text-independent speaker verification. This mode of authentication preferably takes place in the background from a user perspective, thus omitting the need for any particular user input from the speech source for execution, in contrast to the preferably initially (upon call/connection start-up, for example) executed text-dependent speaker verification involving the user more closely in the process.
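Such background, text-independent re-verification at irregular intervals could, for example, be scheduled roughly as in the following sketch; the interval bounds, session structure and verification callback are illustrative assumptions only.

```python
# Sketch of background re-verification at random intervals during a session.
import random
import time

def background_verify(session, verify_fn, min_gap=20.0, max_gap=90.0):
    """Re-verify the speaker until the session ends or verification fails.

    session: dict with "active" flag, "user_id", a "recent_audio" callable
    returning the last few seconds of speech, and an "on_failure" callback.
    """
    while session["active"]:
        time.sleep(random.uniform(min_gap, max_gap))   # irregular interval
        recent_audio = session["recent_audio"]()
        if not verify_fn(recent_audio, session["user_id"]):
            session["on_failure"]()                    # e.g. notify user, limit access
            break
```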
In some embodiments, the system may further comprise at least one element selected from the group consisting of: voice activity detector configured to utilize contact microphone portion of the digital sound data to separate between speech segments such as sentences in the digital sound data or specifically in the air microphone portion thereof, noise reducer or canceller for removing noise from the digital sound data or specifically from the air microphone portion thereof, noise reducer or canceller utilizing contact microphone based data for the removal, adaptive speech-to-text conversion platform configured to apply the voiceprint data in adapting the function of the platform and/or adapting the input speech, dictation platform, log or database for storing the text optionally in minutes, memo, statement or doctor’s certificate format, online education or learning platform for providing digital learning material comprising the text and/or corresponding digital speech data in original and/or translated language to a number of terminals, online communication, meeting, or collaboration platform for transferring the text and/or corresponding digital speech data in original and/or translated language to the second terminal, online examination platform wherein the digital sound data obtained from the first user contains responses to exam questions by the first user, online digital camera data management module configured to receive digital camera data from a user terminal and store the camera data and/or analysis data derived therefrom as associated with related user id, online store, banking or customer service platform configured to enable speech and/or textual communication between a customer type first or second user and a customer servant type second or first user, respectively, and honeypot platform configured to transfer a user connection failing speaker verification based authentication to an isolated resource mimicking the original system.
Preferably, the aforesaid dynamically selected (e.g. randomly or pseudo-randomly selected) subset contains two or more cues. Most preferably, however, not all the cues having a voiceprint of the claimed user associated therewith are included in the subset.
In some embodiments, the sound data is received from a mobile (terminal) device. The mobile device advantageously incorporates at least one integral or separate/removable (e.g. wiredly or wirelessly connected) microphone for capturing voice and encoding it into digital format, more advantageously at least one air microphone and at least one contact microphone. Preferably, the system maintains or has access to information linking service/application users, or user id’s, and mobile devices or mobile identities, e.g. IMEI code or
IMSI code (or other smart card), respectively, together. Optionally, a mobile phone number could be utilized for the purpose.
Optionally, the cues are at least indicated to the user via a first terminal device such as a laptop or desktop computer. Service data in general and/or the cues may be provided as browser data such as web page data. Preferably such first terminal device includes or is at least connected to a display, a projector and/or a loudspeaker with necessary digital-to-analogue conversion means for the purpose. Optionally, the sound data is obtained via a second terminal device, preferably via the aforementioned mobile device like a cellular phone, typically a smartphone, or a communications-enabled PDA/tablet, configured to capture the sound signal incorporating the user’s voice (uttering the response to the cues) and convert it into digital sound data forwarded towards the system.
In some embodiments, the mobile device may be provided with a message, such as an SMS message, triggered by the system in order to verify that the user requiring voice-based authentication has the mobile device with him/her. For example, the user may have logged in to an electronic service using a certain user id that is associated with the mobile device. Such association may be dynamically controlled in the service settings by the user, for instance. In response to the message, the user has to trigger sending a reply, optionally via the same mobile device or via the first terminal, optionally provided with a secret such as a password, or other acknowledgement linkable by the system with the user (id).
In some embodiments, the cues may be represented visually and/or audibly utilizing e.g. a web browser at the first user terminal. Preferably, but not necessarily, the user provides the response using the second terminal such as a mobile terminal. The first terminal may refer to e.g. a desktop or laptop computer that may be personal or in a wider use. The second terminal, particularly if being a mobile terminal such as a smartphone, is typically a personal device associated with a certain user only, or at least a rather limited group of users.
The system may be configured to link or associate the first and second terminals together relative to the ongoing session and authentication task. As a result, actions taken utilizing the second terminal may be linked with activity or response at the first terminal, e.g. browser thereat, by the system. Some examples of applicable linking techniques can be found in publication WO2015/059365 (“AUDIOVISUAL ASSOCIATIVE AUTHENTICATION METHOD AND RELATED SYSTEM”) and FI126129B.
In some embodiments, the determination tasks may include a number of mapping, feature extraction, and/or comparison actions according to predetermined logic by which the match between the obtained sound data and existing voiceprint data relative to the indicated existing user is confirmed, i.e. the authentication is considered successful in the light of such voice-based authentication factor. In the case of no match,
i.e. failed voice-related authentication, the authentication status may remain as is or be lowered (or access completely denied).
In some embodiments, elevating the gained current authentication status in connection with successful voice-based speaker (user) verification may include at least one action selected from the group consisting of: enabling service access, enabling a new service feature, enabling the use of a new application, enabling a new communication method, and enabling the (user) adjustment of service settings or preferences.
In some embodiments, a visual cue defines a graphical image that is rendered on a display device for perception and visual inspection by the user. The image may define or comprise a graphical pattern, drawing or e.g. a digital photograph. Preferably, the image is complex enough so that the related (voice) association the user has also bears the necessary complexity and/or length in view of sound data analysis (too short or too simple a voice input/voiceprint renders making reliable determinations difficult).
In some embodiments, an audiovisual cue includes a video clip or video file with associated integral or separate sound file(s). Alternatively or additionally, an audiovisual cue may incorporate at least one graphical image and related sound.
Generally, video and audiovisual cues are indicated by e.g. a screenshot or other descriptive graphical image, and/or text, shown in the service UI. The image or a dedicated UI feature (e.g. button symbol) may then be utilized to activate the video playback by the user through clicking or otherwise selecting the image/feature, for instance. Alternatively, e.g. video cue(s) may play back automatically, optionally repeatedly.
In some embodiments, the audio cue includes sound typically in the form of at least one sound file that may be e.g. monophonic or stereophonic. The sound may represent music, sound scenery or landscape (e.g. jungle sounds, waterfall, city or traffic sounds, etc.), various noises or e.g. speech.
An audio cue may, despite its non-graphical/invisible nature, still be associated with an image represented via the service UI. The image used to indicate an audio cue is preferably at least substantially the same (i.e. non-unique) with all audio cues, but anyhow enables visualizing an audio cue in the UI among e.g. visual or audiovisual cues, the cues being optionally rendered as a horizontal sequence of images (typically one image per cue), of the overall challenge. As with video or audiovisual cues, the image may be active and selecting, or 'clicking' it, advantageously then triggers the audible reproduction of the cue.
Alternatively or additionally, a common UI feature such as an icon may be provided to trigger sequential reproduction of all audio and, optionally, audiovisual cues.
In some embodiments and in the light of the foregoing, basically all the cues may be indicated in a (horizontal) row or column, or using another configuration, via the service UI.
Visually distinguishable, clear ordering of the cues is advantageous as the user may immediately realize the corresponding, correct order of cue-specific (sub-)responses in his/her overall voice response.
Video, audiovisual and/or audio cues may at least have a representative, generic or characterizing, graphical image associated with them as discussed above, while graphical (image) cues are preferably shown as such.
In some embodiments, at least one cue is selected or provided, optionally created, by the user himself/herself. A plurality of predetermined cues may be offered by the system to the user for review via the service UI wherefrom the user may select one or more suitable, e.g. the most memorable, cues to be associated with voiceprints. Preferably, a plurality of cues is associated with each user.
A voiceprint, i.e. a voice-based fingerprint, may be determined for a cue based on a user’s sound, or specifically voice, sample recorded and audibly exhibiting the user’s association (preferably brainworm) relating to each particular cue. A voiceprint of the present invention thus advantageously characterizes, or is used to characterize, both the user (utterer) and the spoken message (the cue or substantive personal association with the cue) itself. Recording may be effectuated using the audio input features available in a terminal device, such as a microphone, analogue-to-digital conversion means, encoder, etc.
With different users, a number of the same or similar cues may be generally utilized. Obviously, the voiceprints associated with them are and shall be personal.
In some embodiments, the established service connection (access) is maintained based on a number of security measures the outcome of which is used to determine the future of the service connection, i.e. to let it remain, terminate it, or change it, for example. In some scenarios, fingerprint methodology may be applied, some examples of which have been described in the aforementioned WO2015/059365 and FI126129B.
In some embodiments, the system is location-aware, advantageously in the sense that it utilizes explicit or implicit location information, or indication of location, to authenticate the user. The location information may be absolute (e.g.
coordinates), relative (e.g. with respect to some other location such as “50 meters west/away from compound X”), and/or contextual (e.g. work, home, club, bowling center, base station, warehouse, gate A, door B). A number of predetermined allowed and/or non-allowed/blocked locations may be associated with each user of the arrangement. For example, the location may refer to at least one element selected from the group consisting of: address, network address, sub-network, IP (Internet Protocol) address, IP sub-network, cell, cell-ID, street address, building or estate, access control terminal, access controlled physical resource, one or more coordinates, GPS coordinates, GLONASS coordinates, district, town, country, continent, distance to a predetermined location, maximum range from a predetermined location, and direction from a predetermined location. Each of the aforesaid addresses may further refer to an address range. The location information may be at least partially gathered based on data obtained from a user terminal, e.g. mobile terminal.
Accordingly, the estimated location of the user, based on the information obtained and indicative of the location of the user (or of an associated user terminal), may be utilized as an additional authentication factor. In some embodiments, the information may indicate e.g. a base station currently used by the terminal. In some embodiments, the information may indicate or identify e.g. an access control terminal, other access control device and/or authentication device via which an access request or other user, or user device, input has been received. As the locations of such network, authentication or access control elements are known, also the location of the user may be estimated (deemed substantially the same, for example). In some embodiments, the location information may indicate e.g. a certain building or estate (e.g. via name or other ID). The user may have been detected, in addition to or instead of related access control data, based on e.g. camera data or other surveillance data, for instance.
Failed location-based authentication may result in a failed overall authentication (denied access), or alternatively, a limited functionality such as limited access to the service may be provided. The same applies to other potential authentication factors. Each authentication factor may be associated with a characterizing weight (effect) in the authentication process.
In some embodiments, the system may be configured to transmit a code, preferably as browser data such as web page data, during a communication session associated with a predetermined user of the service for visualization and subsequent input by the user. Further, the system may be configured to receive data indicative of the inputted code and of the location of the terminal device applied for transmitting the data, determine on the basis of the data and predetermined locations associated with the user whether the user is currently in an allowed location, and, provided that this seems to be the case on the basis of the data, raise the gained authentication status of the user regarding at least the current communication session. Preferably the data is received from a mobile (terminal) device. Optionally, the code is indicated to the user via a first terminal device such as a laptop or desktop computer. Instead of a code dedicated for the purpose, e.g. the aforesaid temporary id such as socket id may be utilized in this context as well.
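For illustration only, such a code-plus-location check could be sketched as below; the code format, the haversine distance comparison and the allowed-range threshold are assumptions made for this example and are not prescribed by the embodiment.

```python
# Sketch: a short code shown via the browser is echoed back from the mobile
# device together with its reported location, which is compared against the
# user's allowed locations.
import math
import secrets

def issue_code() -> str:
    return secrets.token_hex(3)          # e.g. six hex characters shown to the user

def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def location_allowed(reported, allowed_locations, max_range_km=0.5):
    lat, lon = reported
    return any(haversine_km(lat, lon, a_lat, a_lon) <= max_range_km
               for a_lat, a_lon in allowed_locations)

def check_response(expected_code, received_code, reported_location, allowed_locations):
    return received_code == expected_code and location_allowed(
        reported_location, allowed_locations)
```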
A certain location may be associated with a certain user by “knowing” the user, which may refer to optionally automatically profiling and learning the user via monitoring one’s habits such as location and optionally movements.
As a result, a number of common, or allowed, locations may be determined and subsequently utilized for authentication purposes. Additionally or alternatively, the user may manually register a number of allowed locations for utilizing the solution in the arrangement. Generally, in various
embodiments of the present invention, by knowing the user and/or his/her gear and utilizing the related information, such as location information, in connection with access control, conducting automated attacks such as different dictionary attacks against the service may be made more futile.
In some scenarios, the location of the user (terminal) and/or data route may be estimated, e.g. by the system, based on transit delay and/or round-trip delay. For example, delays relating to data packets may be compared with delays associated with a number of e.g. location-wise known references such as reference network nodes, which may include routers, servers, switches, firewalls, terminals, etc.
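A very rough sketch of such delay-based estimation is given below; the nearest-reference rule and the example delay values are purely illustrative assumptions.

```python
# Sketch: estimate a coarse region by comparing a measured round-trip delay
# with typical delays to location-wise known reference nodes.
def estimate_region(measured_rtt_ms: float, reference_rtts: dict) -> str:
    """reference_rtts: mapping of region/reference name -> typical round-trip delay in ms."""
    return min(reference_rtts, key=lambda region: abs(reference_rtts[region] - measured_rtt_ms))

# Usage example (illustrative values only):
# estimate_region(42.0, {"Helsinki": 40.0, "Stockholm": 55.0, "Frankfurt": 75.0})
# -> "Helsinki"
```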
Yet in a further, either supplementary or alternative embodiment, the electronic service is a cloud service (running in a cloud). Additionally or alternatively, the service may arrange a virtual desktop and/or remote desktop for the user, for instance.
In an additional embodiment, an electronic terminal device for multimodal information transfer in online environment preferably involving substantially real-time communication between two or more users, comprises a processing device and memory entity for processing and storing instructions and other data, respectively, air microphone and contact microphone for capturing audio data, a display for visualizing data, and a data transfer interface for receiving and sending data via a network of the online environment, preferably the Internet, the instructions being configured to cause, when executed by the processing device, the device to

establish digital sound data based on signals captured via said air microphone and contact microphone, said signals comprising speech from a speech source associated with a claim of an identity of a first user, the contact microphone based data being utilized in segmenting the speech of the source included at least in said air microphone based data of said digital sound data from further sound data including background noises,

transmit the digital sound data to a network entity for processing, preferably including speaker verification, said network entity optionally including a number of servers,

receive text resulting from speech-to-text conversion and optional translation of the digital sound data from the network entity, optionally additionally receiving text resulting from speech-to-text conversion and/or text-to-speech synthesized speech data resulting from text conversion, language translation and synthesis, of a speech input by a remote second user connected to the terminal device via a second terminal device and the intermediate network entity, and

render the received text visually on the display for user review, optionally further audibly reproducing the received synthesized speech data indicative of the speech input of the remote party.
In a further embodiment, a method for multimodal information transfer in online environment preferably involving substantially real-time communication between two or more users, to be performed by a system comprising at least one server apparatus, comprises

obtaining digital sound data comprising speech captured by a first user terminal from a speech source and associated with a claim of an identity of a first user, said digital sound data incorporating both air microphone and contact microphone based data, wherein said contact microphone based data is utilized in segmenting the speech of the source included at least in said air microphone based data of said digital sound data from further sound data including background noises,

preferably verifying on the basis of the digital sound data and reference voiceprint data linked with the first user and characterizing the voice of the first user, whether the sound data has been uttered by the first user to authenticate the speech source as the first user,

converting the sound data into corresponding text, optionally further translating the text into another language preferably being the language selected by the first user or a second user utilizing a second terminal, and

providing the text, optionally including said translated text, to at least one user terminal, preferably including said first terminal and/or said second terminal of the second user, for reproduction preferably via a display of the concerned terminal.
In some system, method and device embodiments as alluded to hereinbefore, and also as an independent asset fully separate therefrom, a multi-microphone such as a dual-microphone apparatus may be provided to transform the speech of a user into electrical form. The dual-microphone apparatus may incorporate a first microphone comprising an air microphone, e.g. mouth microphone, for receiving pressure waves emitted from the mouth of the user and converting them into an electrical first signal, a second microphone comprising a contact microphone, such as a throat microphone, to be arranged in skin contact with the user to receive vibration conveyed via the skin tissue and converting it into an electrical second signal, and a connection unit for supplying the first and second signals to an external device, such as the aforesaid electronic device, first terminal, second terminal or generally a terminal device, as separate signals or as a common signal established based on the first and second signals, via a wired or wireless transmission path, optionally via electrical conductors or a radio interface, respectively.
In some embodiments, a headset device comprises the dual-microphone apparatus and at least one ear speaker, preferably comprising an ear pad speaker with a head band or an ear-fitting headphone. Optionally, a multi-speaker headphone may be included.
In some embodiments, the air microphone may comprise a close-speaking microphone.
In some embodiments, the contact microphone may comprise a throat microphone.
In some embodiments, the connection unit may be configured to establish a multi-channel signal, optionally a stereo signal, wherein for each microphone there is preferably a dedicated channel allocated.
In some embodiments, the first and second signals may be combined or translated into a common signal that is indicative of the pressure waves and vibration captured by the first and second microphones, respectively.
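As an illustration of the two alternatives (a dedicated channel per microphone versus a common signal), the following sketch stacks or mixes synchronized air and contact microphone samples; the simple weighted sum is an assumption made for this example, not the combination method of the embodiments.

```python
# Sketch: form either a two-channel signal or a single common signal from
# synchronized air and contact microphone data.
import numpy as np

def to_multichannel(air: np.ndarray, contact: np.ndarray) -> np.ndarray:
    """Return shape (2, n): one dedicated channel per microphone."""
    n = min(len(air), len(contact))
    return np.stack([air[:n], contact[:n]], axis=0)

def to_common_signal(air: np.ndarray, contact: np.ndarray, w_contact: float = 0.4) -> np.ndarray:
    """Return a single signal mixing both microphones (illustrative weighting)."""
    n = min(len(air), len(contact))
    return (1.0 - w_contact) * air[:n] + w_contact * contact[:n]
```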
Generally, processing of the microphone signal(s) may take place at the microphone apparatus, connected external device and/or external system such as an embodiment of the (online) system described herein.
The previously presented considerations concerning the various embodiments of the system may be flexibly applied to the embodiments of the device or method mutatis mutandis, and vice versa, as will be appreciated by a skilled person.
The utility of the present invention follows from a plurality of issues depending on each particular embodiment.
Communication over communication networks, e.g. speech or video conversation over an IP network between users, possibly using different languages, may be translated substantially in real-time by passing the voice stream through a translating mechanism involving e.g. speech-to-text conversion, language translation of the text, and optionally further synthesizing the translated text. In the synthesis, recorded (voiceprinted, for example) voice characteristics of the users may be applied to adapt the synthesis for improved authenticity of the synthetic speech. In some embodiments, the users may thus communicate using their own languages both for speaking and listening, while the system takes care of the necessary language translation and synthesis.
However, it is alternatively or additionally equally possible to construct a system wherein the speech synthesis phase is omitted, i.e. the comment of the remote party is seen as translated text (in the viewer’s own/selected language) and optionally heard using the original language/signal of the speaking person. Still, the input of each speaker may be speech in their preferred/own language.
Additionally, it may be text as well.
In various embodiments, a multimodal communication UI may be provided via a native client application or e.g. a web-based, browser-operable solution. The UI may include a number of windows or generally views for text, video and/or some presentation material (e.g. digital documents). The text may include e.g.
original text typed in by the local and/or remote user, text resulting from speech-to-text conversion of the local and/or remote user, and/or text resulting from language translation based on the input (text or speech) by the local and/or remote user.
Authentication, speech processing such as segmentation or noise reduction, text conversion, translation, synthesis, etc. require considerable memory capacity (e.g. various vocabularies of/between multiple languages) and processing power, whereupon at least selectively outsourcing them flexibly to a network system of potentially several interconnected and possibly independently operated sub-systems appears to be a feasible approach.
Speaker verification procedures may be applied based on text-dependent and text-independent approaches. A text-dependent solution as described herein may be utilized e.g. in the beginning of the communication task or related activity. The text-independent solution may be conveniently executed basically throughout the communication session in a substantially real-time fashion and transparently from a user perspective. Only upon failure, the user may be notified and e.g.
access rights removed or limited.
The principles set forth herein by the various embodiments promote the usability of a myriad of applications. In addition to translating e.g. IP calls in real-time, e.g. different dictation or archiving solutions may be cultivated, with reference to doctor’s certificates or lawyer’s statements, wherein a person may dictate the essential substance of the document, which is then automatically converted into text and optionally translated for addition in a digital document (template), for example. A meeting, chat, negotiation or e.g. consultation type information transfer scenario may be easily documented using e.g. the original and/or translated speech-based text. Customer service having regard to various online services such as e-commerce or banking services may be cultivated by the suggested conversion and translation services. Online learning arrangements and related examinations may benefit from the present invention in view of language translation and authentication aspects, for instance. Both speech (original or translated) and text (original and/or translated) may be simultaneously provided to a participant to guarantee maximum absorbance of provided information. In some embodiments, user terminals may even be connected or provided with wide-angle, such as 360-degree, cameras that additionally monitor visually the environment of a person taking an online exam and provide the captured image data to a remote terminal or system for archiving and verification. For example, it may be deduced from the video image that there are no additional persons, crib sheets or other cheating tools in the user’s environment. In all these applications, speaker verification that can be executed based on the obtained sound signal may add to the security of the solution and raise the authentication level of participants.
In some embodiments, the documents established based on speech-to-text conversion and possible translation may be further digitally signed by the digital signatures of the concerned user(s) and/or other parties such as legal person(s). Different speaker verification procedures described herein may then be conveniently harnessed also into signature-related authentication, non-repudiation and/or integrity checking tasks, for instance.
With various signing or other applications involving authentication, a desired assurance level may be achieved by relying on a number of authentication factors for the purpose. In addition to or instead of a speaker verification and preferably associative memory (brainworms/earworms, see below) based factor, one or more other factors including but not limited to physical tokens such as mobile devices and e.g. SMS or similar messages transmitted thereto, location, and/or password(s) may be exploited, with reference to e.g. the aforementioned WO2015/059365 and FI126129B setting forth options for implementing a scalable multi-factor solution in more detail.
In some embodiments, preferably when speaker verification fails, the system in accordance with the present invention may be configured to switch over the associated connection to a honeypot mode wherein the communication by the speaker in question is re-routed to an isolated resource, such as a server, advantageously still residing behind the system firewall (to prevent easy detection of such switchover based on e.g. network addresses), the honeypot resource then mimicking, in terms of its functionality, the original resource such as a web or generally network server serving the users by means of information transfer/communication, conversion, translation, or other processing tasks. The honeypot resource may be configured to monitor the connection and user/terminal behind the connection in order to dig up details such as digital addresses or other information potentially identifying the fraudster or the objectives of fraudulent activity. The honeypot resource may continue communicating with the concerned user/user terminal to be able to receive such critical information as feedback therefrom.
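One possible shape of such a switchover is sketched below; the request structure, the logged fields and the routing function are illustrative placeholders rather than the patent's implementation.

```python
# Sketch: route requests from a connection that failed speaker verification to
# an isolated honeypot resource that mimics the original service and logs details.
class HoneypotResource:
    def __init__(self):
        self.log = []

    def handle(self, request: dict) -> dict:
        # Record whatever identifying details the connection reveals.
        self.log.append({"addr": request.get("addr"), "payload": request.get("payload")})
        return {"status": "ok"}          # mimic the original service's response

def route_request(request: dict, verified: bool, real_service, honeypot: HoneypotResource) -> dict:
    """Serve verified users normally; silently divert unverified ones to the honeypot."""
    return real_service(request) if verified else honeypot.handle(request)
```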
Having regard to text-dependent speaker verification, the associative memory of users and also a phenomenon relating to a memory concept often
referred to as brainworms, or earworms, regarding things and related associations one seems to remember, basically reluctantly but still with ease (e.g. songs that are stuck inside one’s mind/one cannot get out of his/her head), can be cleverly harnessed into utilization in the context of authentication together with voice recognition, referring especially to speaker verification in this case (claimed identity verification). One rather fundamental biometric property, i.e. voice, is indeed exploited as an authentication factor together with features of speech (i.e. voice input message content) recognition. The security risks arising from e.g. spoofing-type attacks, such as spoofing by imitation in which an attacker could use e.g. recorded speech or a speech vocoder imitating the speech of the target person, may be reduced as the subset of cues used as a challenge, or ‘prompt’, for receiving the corresponding voice responses from a speaker can be dynamically, e.g. randomly or pseudo-randomly, selected (including determining the number of cues, the cues themselves, and the order of cues) upon each verification round.
Also other factors, e.g. location data indicative of the location of the user (terminal), may be applied for authentication purposes.
Rather regularly, people manage to associate different things like sounds, images, videos, etc. together autonomously or automatically and recall such, potentially complex and/or lengthy (advantageous properties in connection with authentication, particularly if the related voice inputs and fingerprints exhibit similar characteristics in conjunction with the present invention) association easily after many years, even if the association as such was originally subconscious or on some occasions even undesired as the person in question sees it. By the present solution, device and/or service users may be provided with an authentication challenge as a number of cues such as images, videos and/or sounds for which they will themselves initially determine the correct response they want to utilize in the future during authentication. Instead of hard-to-remember numerical or character-based code strings, the user may simply associate each challenge with a first associative, personal response that comes to mind and apply that memory image in the forthcoming authentication events based on voice recognition, as for each cue a voiceprint is recorded indicative of the correct response, whereupon the user is required to repeat the voice response upon authentication when the cue is represented to him/her as a challenge.
In various embodiments of the present invention, audio responses provided to the indicated cues are preferably concatenated to construct a temporally longer response that is then compared with a collective voiceprint that is also formed by combining, or ‘concatenating’, the voiceprints of the individual cues provided in the challenge. Accordingly, the reliability of the speaker verification action may be elevated in contrast to e.g. separate comparisons between the voiceprints behind the cues and the given responses. Rather often individual brainworm-type associations really are relatively compact, i.e. a word or sentence that is associated with a cue is short or shortish, even shorter than one second in duration, so that it is pleasantly easy to remember and utter, whereupon utilizing several voiceprints and responses associated with them together, in a linked or ‘concatenated’ manner, may turn out beneficial to obtain enough data in view of a single matching procedure.
In practice, each voiceprint incorporates a model, template, feature vector(s), and/or other data capturing the characteristics of the training data (enrollment data) provided by a user for the particular cue. During the subsequent testing phase, the person claiming the identity of that user is supposed to provide the same input, meaning repeating the same word or sentence using the same voice as was done during the training. Recording parameters and conditions do change (used equipment/microphone, background/environmental noise, changed physiological voice characteristics due to illness, aging, natural fluctuations and inaccuracies in human voice production, etc.) but these may be compensated for by using appropriate, e.g. contemporary, normalization and possibly other techniques. The model/voiceprint type approach taken into use here thus involves aspects of text-dependent speaker verification as the lexicons in the enrollment and testing phases correspond to each other, and the speaker whose identity is to be verified is assumed to be a cooperative person who, by default, tries to repeat the same words or sentences during the testing phase as were uttered during the training phase for the same cue.
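By way of example only, the enrollment/testing split could be sketched as follows, using a crude mean/variance summary in place of a real speaker model; the normalization scheme and threshold are illustrative assumptions.

```python
# Sketch: enroll a voiceprint from feature frames, then score a test utterance
# against it with simple normalization.
import numpy as np

def enroll(feature_frames: np.ndarray) -> dict:
    """feature_frames: (num_frames, num_features) from the enrollment utterance."""
    return {"mean": feature_frames.mean(axis=0),
            "std": feature_frames.std(axis=0) + 1e-6}

def score(voiceprint: dict, test_frames: np.ndarray) -> float:
    """Average normalized deviation; lower means a closer match."""
    z = (test_frames - voiceprint["mean"]) / voiceprint["std"]
    return float(np.mean(np.abs(z)))

def verify(voiceprint: dict, test_frames: np.ndarray, threshold: float = 1.5) -> bool:
    return score(voiceprint, test_frames) <= threshold
```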
Further, it has been found that with many terminals, or generally air/mouth/close-speaking microphones, especially stop consonants are problematic in view of the sound analysis such as speaker verification, as their subtle features potentially facilitating voiceprint generation (i.e. enrollment in the terminology of speaker verification) and subsequent detection (i.e. testing phase), particularly when the sound samples are short, typically only one uttered word per cue, are lost in the conversion process. Also background noise possibly present in the use scenarios of the present invention may render acquiring reliable sound data for voiceprint generation or matching difficult and force using complex noise cancellation solutions. Accordingly, the use of throat microphones, or potentially other types of contact microphones to be provided against a solid vibration medium, provides a supplementary or alternative technique to tackle the problem. The throat microphone captures vibrations from the throat of the wearer by transducers contacting his/her neck.
By the application of a contact microphone, such as a throat microphone, also speech detection using VAD (voice activity detector) may be enhanced. The contact microphone data may temporally indicate the segments of true speech input from the synchronized air (audio) microphone data as the contact microphone is more immune to ambient noise. Yet, the VAD and/or contact microphone signal may be applied for speech extraction or segmentation (e.g. speech segment vs. segment of noise/silence) e.g. for speaker verification.
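A minimal sketch of such contact-microphone-assisted voice activity detection is given below; the frame length and energy threshold are illustrative assumptions.

```python
# Sketch: frames where the contact microphone energy exceeds a threshold are
# treated as speech, and the same frame mask is applied to the synchronized
# air microphone signal.
import numpy as np

def frame_energy(signal: np.ndarray, frame_len: int = 320) -> np.ndarray:
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames.astype(float) ** 2).mean(axis=1)

def speech_mask_from_contact(contact: np.ndarray, frame_len: int = 320,
                             rel_threshold: float = 0.1) -> np.ndarray:
    energy = frame_energy(contact, frame_len)
    return energy > rel_threshold * energy.max()

def extract_speech(air: np.ndarray, contact: np.ndarray, frame_len: int = 320) -> np.ndarray:
    mask = speech_mask_from_contact(contact, frame_len)
    keep = np.repeat(mask, frame_len)
    n = min(len(air), len(keep))
    return air[:n][keep[:n]]
```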
Speech-to-text conversion may further benefit from the availability of contact microphone data as the contact microphone captures certain sounds such as phonemes better than air microphones, as described in more detail elsewhere herein. The accuracy of the conversion and potential subsequent language translation may thus be elevated by the presence of both microphone data types.
Yet, speech-to-text conversion may be configured to adapt its function and/or the incoming speech signal based on the known personal voice or speech characteristics of the current speaker. For example, the available voiceprints may be utilized for the purpose. If a user tends to pronounce e.g. certain consonants in a unique manner (e.g. louder/quieter, longer/shorter, or spectrally otherwise unique), such tendency may be well reflected by the available speaker voice data such as a voice model including e.g. the voiceprints. Accordingly, the effect of such pronunciation may be taken into account by compensating (processing such as offsetting) the speech prior to the conversion accordingly, and/or through adjusting speech recognition parameters for the same.
A noise reducer or canceller may be applied to utilize the contact microphone signal. For example, the air microphone signal may be more severely attenuated or generally processed to clear it of ambient noise during the segments corresponding to minimal signal input from the contact microphone.
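This complementary attenuation idea could be sketched as follows; the gain floor and the energy threshold are again illustrative assumptions made for this example.

```python
# Sketch: attenuate the air microphone signal during frames where the contact
# microphone carries little energy (likely no speech).
import numpy as np

def attenuate_nonspeech(air: np.ndarray, contact: np.ndarray, frame_len: int = 320,
                        rel_threshold: float = 0.1, floor_gain: float = 0.1) -> np.ndarray:
    n_frames = min(len(air), len(contact)) // frame_len
    out = air[:n_frames * frame_len].astype(float).copy()
    c = contact[:n_frames * frame_len].astype(float).reshape(n_frames, frame_len)
    energy = (c ** 2).mean(axis=1)
    quiet = energy <= rel_threshold * energy.max()       # frames with minimal contact input
    gains = np.where(np.repeat(quiet, frame_len), floor_gain, 1.0)
    return out * gains
```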
Yet, a technically feasible, security-enhancing procedure is offered for linking a number of terminals together from the standpoint of an electronic service, related authentication and an ongoing session.
Still, electronic devices such as computers and mobile terminals may be cultivated with the suggested, optionally self-contained, authentication technology that enables authentication on the operating system level, which is basically hardware manufacturer independent. The outcome of an authentication procedure may be mapped into a corresponding resource or feature (access) control action in the device. Generally, (further) overall access to the device resources (e.g. graphical UI and/or operating system) or access to specific application(s) or application feature(s) may be controlled by the authentication procedure, which may also be considered a next- or new-generation, identity-based password mechanism. Indeed, the suggested authentication technology may be utilized to supplement or replace the traditional ‘PIN’-type numeric or character code based authentication/access control methods.
Reverting to voice capturing, both the air microphone and the contact microphone may indeed be simultaneously and collectively applied to provide a higher quality (e.g. more authentic, comprehensive and/or less noisy) electrical and preferably specifically digital sound signal than is possible with either microphone technology used in isolation.
The microphone signals complement each other and enable constructing a common signal or a corresponding representation of the originally provided voice responses, which is more comprehensive and characterizing than a single air microphone signal or throat microphone signal.
A contact microphone such as a throat microphone may be superior in detecting e.g. stop consonants in contrast to an air microphone. In any case, a contact microphone is relatively insensitive to ambient noise. On the other hand,
e.g. nasal cavity-based or lips/tongue-based sounds are usually captured more authentically using air microphones. However, air microphones also capture ambient noise very effectively. A combined solution and e.g. a common signal adopting features from both air and contact microphone signals may be made somewhat noise-free while representing the different characteristics of the original speech particularly well.
In some embodiments, e.g. the contact microphone signal may be utilized for voice activity detection (VAD) in addition to or instead of producing an authentic speech signal. Detection may be based e.g. on signal amplitude/magnitude and/or power. Accordingly, durations that are quiet in the contact microphone signal may also be filtered out from the air microphone signal as there is a good likelihood that such durations consist mainly of background/ambient noise captured by the air microphone. The VAD may be further applied for detecting speech segments such as sentences in the microphone signal(s) based on e.g. detected pauses or noise-only periods therein.
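By way of a hedged illustration only (the frame length, hop size and relative energy threshold below are illustrative assumptions rather than values taken from this disclosure), such contact-microphone-driven filtering of the synchronized air microphone signal could be sketched as follows:

```python
import numpy as np

def contact_vad_mask(contact, frame_len=400, hop=200, rel_threshold=0.1):
    """Frame-wise voice activity mask derived from contact-microphone energy.

    Frames whose short-term energy falls below rel_threshold * peak energy are
    treated as silence/noise; all numeric values are illustrative assumptions."""
    n_frames = 1 + max(0, (len(contact) - frame_len) // hop)
    energies = np.array([
        np.sum(contact[i * hop:i * hop + frame_len] ** 2.0)
        for i in range(n_frames)
    ])
    active = energies > rel_threshold * energies.max()
    # Expand the frame-wise decision back to a sample-wise mask.
    mask = np.zeros(len(contact), dtype=bool)
    for i, is_active in enumerate(active):
        if is_active:
            mask[i * hop:i * hop + frame_len] = True
    return mask

def filter_air_by_contact(air, contact):
    """Zero out air-microphone samples during contact-microphone silence,
    discarding segments that likely contain only ambient noise."""
    mask = contact_vad_mask(contact)
    n = min(len(air), len(mask))
    return np.where(mask[:n], air[:n], 0.0)
```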
The expression “a number of” refers herein to any positive integer starting from one (1), e.g. to one, two, or three.
The expression “a plurality of” refers herein to any positive integer starting from two (2), e.g. to two, three, or four.
The expression “data transfer” may refer to transmitting data, receiving data, or both, depending on the role(s) of a particular entity under analysis relative to a data transfer action, i.e. a role of a sender, a role of a recipient, or both.
The terms “electronic service” and “electronic application” are herein utilized interchangeably unless otherwise clearly indicated.
The terms “a” and “an” do not denote a limitation of quantity, but denote the presence of at least one of the referenced item.
The terms “first” and “second” do not denote any order, quantity, or importance, but rather are used to distinguish one element from another.
Different embodiments of the present invention are disclosed in the dependent claims.
BRIEF DESCRIPTION OF THE RELATED DRAWINGS
Next, the invention is described in more detail with reference to the appended drawings, in which
Fig. 1a illustrates one applicable context of the present invention via both block and signaling diagram approaches relative to an embodiment thereof.
Fig. 1b is a block diagram representing an embodiment of selected internals of the system or device according to the present invention.
Fig. 1c illustrates scenarios involving embodiments of an electronic device in accordance with the present invention.
Fig. 1d illustrates an embodiment of an analyzer in accordance with the present invention for use in speaker verification.
Fig. 2a represents one example of service or device UI view in connection with user authentication.
Fig. 2b represents a further example of service or device UI view in connection with user authentication.
Fig. 2c represents one further example of service or device UI view in connection with user authentication.
Fig. 2d represents one further example of service or device UI view in connection with user authentication.
Fig. 3 is a flow chart disclosing an embodiment of voiceprint-based authentication.
Fig. 4 illustrates an embodiment of a dual microphone apparatus and headset applicable in connection with the present invention.
Fig. 5 is a block diagram representing an embodiment of the dual microphone/headset apparatus.
Fig. 6 illustrates an embodiment of a method in accordance with the present invention and related potential UI aspects of an applicable terminal device.
DETAILED DESCRIPTION
Figure 1a illustrates an embodiment and potential context of the present invention with a particular focus on voiceprint creation and authentication utilizing speaker verification.
It shall be generally mentioned here that even though speaker verification is indeed a preferred feature in this and many other embodiments of the present invention, a skilled reader will appreciate the fact that some feasible embodiments may omit such a feature while still adopting other beneficial, possibly optional, features such as the utilization of multiple different microphone signals (e.g. air and contact) in the concerned solution.
The shown embodiment may be generally related, by way of example only, to the provision of a network-based or particularly online type electronic service such as a communication, banking, shopping, virtual desktop, archiving or document delivery service. Entity 102 refers to a service user (recipient) and associated terminal devices such as a desktop or laptop computer 102a and/or a mobile device 102b utilized for accessing the service in the role of a client, for instance. The device(s) preferably provide access to a network 108 such as the Internet. The mobile device, such as a mobile phone (e.g. a smartphone) or a PDA (personal digital assistant), may preferably be wirelessly connected to a compatible network, such as a cellular network. Preferably the Internet may be accessed via the mobile device as well. The terminal device(s) may comprise a browser. Entity 106 refers to a system or network arrangement of a number of at least functionally connected devices such as servers. The communication between the entities
102 and 106 may take place over the Internet and underlying technologies, for example. Optionally, the entity 106 is also functionally connected to a mobile network.
Indeed, in the context of the shown embodiment of the present invention, the user 102 is preferably associated with a first terminal device 102a such as a desktop or laptop computer, a thin-client or a tablet/hand-held computer provided with network 108 access, typically Internet access. Yet, the user
102 may optionally have a second terminal 102b such as a mobile communications device with him/her, advantageously being a smartphone or a corresponding device with an applicable mobile subscription or other wireless connectivity enabling the device to transfer data e.g. between local applications and the Internet. Many contemporary and forthcoming higher-end mobile terminals qualifying as smartphones bear the necessary capabilities for both e-mail and web surfing purposes among various other sophisticated
features including e.g. a camera with an optional optical code reader application, e.g. for QR ™ or other matrix codes. In most cases, such devices support a plurality of wireless communication technologies such as cellular and wireless local area network (WLAN) type technologies. A number of different, usually downloadable or carrier-provided (e.g. provided on a memory card), software solutions, e.g. client applications, may be run on these ‘smart’ terminal devices.
The potential users of the provided system include different network service providers, operators, cloud operators, virtual and/or remote desktop service providers, application/software manufacturers, financial institutions, companies, and individuals in the role of a service provider, intermediate entity, or end user, for example. The invention is thus generally applicable in a wide variety of different use scenarios and applications.
In some embodiments, the service 106 may include e.g. a customer portal service and the service data may correspondingly include customer portal data. Through the portal, the user 102 may inspect the available general data, company or other organization-related data or personal data such as data about rental assets, estate or other targets. Service access in general, and access to certain features or sections thereof, may require authentication. Multi-level authentication may be supported such that each level can be mapped to predetermined user rights regarding the service features. The rights may define the authentication level and optionally also user-specific rules for service usage and thereby allow feature usage, exclude feature usage, or limit the feature usage (e.g. allow related data inspection but prevent data manipulation), for instance.
Initially, at 127 the system 106 may be ramped up and configured to offer predetermined service to the users, which may also include creation of user accounts, definition of related user rights, and provision of necessary authentication mechanism(s). Then, the user 102 may execute necessary registration procedures via his/her terminal(s) and establish a service user account cultivated with mandatory or optional information such as user id, service password, e-mail address, personal terminal identification data (e.g. mobile phone number, IMEI code, IMSI code), and especially voiceprints in the light of the present invention. This obviously bi-directional information transfer between the user/user device(s)
102 and the system/service 106, requiring related activities to be performed at both ends, is indicated by items 128, 130 in the figure.
Figure 2a visualizes the voiceprint creation in the light of possible related user experience. A number of potential cues, such as graphical elements, 202 may be first indicated to the user via the service or device (mutatis mutandis) UI 200. These suggested cues may be selected, e.g. randomly or pseudo-randomly, from a larger group of available preset cues by the system. Advantageously, the user naturally links at least some cues with certain associations based on e.g. his/her memories and potentially brainworms so that the association is easy to recall and preferably unambiguous (only one association per cue; for example, upon seeing a graphical representation of a cruise ship, the user always comes up with a memory relating to a trip to the Caribbean, whereupon the natural association is ‘Caribbean’, which is then that user’s voice response to the cue of a cruise ship).
Further information 204 such as the size of a captured sound file may also be shown. The user may optionally select a subset of all the indicated cues and/or provide (upload) cues of his/her own to the system for use during authentication in the future. There is preferably a minimum size defined for the subset, i.e. the number of cues each user should be associated with. That could be three, five, six, nine, ten, or twelve cues, for example. Further, the sound sample to be used for creating the voiceprint, and/or as at least part of a voiceprint, may have a defined minimum acceptable duration in terms of e.g. seconds and/or tenths of seconds.
As mentioned hereinearlier, the cues may be visual, audible, or a combination of both. Regarding the user-associated cues, the user may then input, typically utter, his/her voice response based on which the system determines the voiceprints, preferably at least one dedicated voiceprint corresponding to each cue in the subset.
In some embodiments, the training or ‘enrollment’ phase during which a voiceprint is generated based on a user’s voice input to the cue he/she selected may comprise repeated uttering of basically the same (but in practice always more or less fluctuating due to the nature of human speech production mechanisms and e.g. environmental issues) input for a predetermined number of times having regard to a single cue and/or until the other used input criteria
are met. Based on such repeated input, a common voiceprint associated with the cue may be established using predefined combination logic. The logic may incorporate averaging and/or other types of merging of features extracted from different instances of the repeated input.
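A minimal sketch of one possible combination logic follows, assuming frame-wise feature matrices (e.g. cepstral coefficients) have already been extracted per repetition; reducing them to mean-and-variance statistics is merely one of the merging options mentioned above, not the definitive implementation:

```python
import numpy as np

def cue_voiceprint(feature_matrices):
    """Combine repeated enrollment utterances for a single cue.

    feature_matrices: list of (n_frames, n_features) arrays, one per repetition.
    Returns a simple statistical template: mean and standard deviation of the
    pooled frames (one of several possible merging strategies)."""
    pooled = np.vstack(feature_matrices)
    return {"mean": pooled.mean(axis=0), "std": pooled.std(axis=0) + 1e-9}

def template_distance(features, voiceprint):
    """Variance-normalized distance of a test utterance to the stored template."""
    z = (features - voiceprint["mean"]) / voiceprint["std"]
    return float(np.mean(np.sum(z ** 2, axis=1)))
```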
A voiceprint associated with a cue preferably characterizes both the voice and the spoken sound, or message (semantic meaning of the input voice), of the captured voice input used for establishing it. In other words, the same message later uttered by another user does not match the voiceprint of the first user during the voice authentication phase even though the first user uttered the very same message to establish the voiceprint. On the other hand, a message uttered by the first user does not match a voiceprint established based on another message uttered by the first user. Preferably, for speech input purposes, including e.g. voiceprint creation or matching, both air microphone and contact microphone signals are captured and utilized as described in further detail hereinelsewhere.
For voice characterization to be used in establishing the voiceprints, the system may be configured to extract a number of features describing the properties of the user’s vocal tract, for example, and e.g. related formant frequencies from the voice input. Such frequencies typically indicate the personal resonance frequencies of the vocal tract of the speaker.
Next, reverting to Fig. 1a and switching over (indicated by the broken line in the figure) to a scenario in which the user has already set up a service account and wishes to authenticate to reach a desired authentication status within the service 106 (in terms of speaker verification, one could say the execution switches over from enrollment to the testing phase), at 132 the user 102 may trigger the browser in the (first) terminal and control it to connect to the target electronic service 106. Accordingly, the user may now log into the service 106 using his/her service credentials or provide at least some means of identification thereto as indicated by items 132, 134.
The service side may optionally allocate e.g. a dynamic id and deliver it at 136 to the browser that indicates the id and optionally other information such as domain, user id/name, etc. to the user via a display at item 138.
Figure 2b illustrates an embodiment of potential service or device UI features at this stage through a snapshot of UI view 200B. The dynamic id may be shown both as a numeric code and as embedded in a matrix type QR code at 208 to the user. Items 206 indicate the available authentication elements or factors, whereas data at 210 implies the current authentication level of the session with related information.
With reference to Fig. 1a again, at 140 the code may optionally be read by the second terminal such as a mobile terminal of the user. Then the mobile device may be configured, using the same or another predetermined application, to transfer an indication of the obtained data, such as the dynamic id and data identifying the terminal or an entity such as a smart card therein, to the system 106, whereupon the particular first and second terminals could be linked to the same ongoing service (authentication) session at 142. Some feasible embodiments can be found in the aforementioned publications
WO2015/059365 and FI126129B.
At 144, the system 106 fetches a number (potentially dynamically changing according to predetermined logic) of cues associated with the user account initially indicated/used in the session and for which voiceprints are available.
The cues may be basically randomly selected (and order-wise also randomly represented to the user). The cues are indicated (transferred) to the browser in the terminal that then represents them to the user at 146 e.g. via a display and/or audio reproduction means depending on the nature of the cues. E.g. Ajax ™ (Asynchronous JavaScript and XML) and PHP (Hypertext Preprocessor) may be utilized for terminal-side browser control. The cues may mutually be of the same or of mixed type (e.g. one graphical image cue, one audio cue, and one video cue optionally with an audio track).
As the user 102 perceives the cues as an authentication challenge, he/she provides the voice response, preferably via the second terminal at 148, to the service 106 via a client application that may be the same application used for transferring the dynamic id forward. The client-side application for the task may be a purpose-specific Java ™ application, for example. In Figure 2c, four graphical (image) cues are indicated at 212 in the service or device UI view 200C (browser view). Also visible in the figure is a plurality of service features at 214, some of which are greyed out, i.e. non-active features, due to the current insufficient authentication level. Indeed, a service or particularly a service application or UI feature may
be potentially associated with a certain minimum security level required for access.
Automatic expiration time for the session may also be indicated via the UI.
Preferably, a session about to expire or expired may be renewed by repeated/new authentication.
In Figure 1a, at 150 the service 106 analyzes the obtained user responses (typically one response per cue) relative to the cues against the voiceprints using predetermined matching technique(s) and/or algorithms. In primary embodiments, the input order of responses corresponding to individual cues in the overall response should match the order in which the cues were represented in the service UI (e.g. in a row, from left to right). In some other embodiments, the system 106 may, however, be configured to analyze whether the order of responses matches the order of cues given, or at least to try different ordering(s). Optionally, the system 106 may be configured to rearrange the voice responses to the cues to obtain e.g. a better voiceprint matching result during the analysis.
Optionally, concatenation of two or more responses to establish a single response of longer duration from the standpoint of matching is utilized. Longer duration translates into more available data, which may improve the matching and the resulting classification (verification decision). Naturally, the voiceprints associated with the used cues and the claimed user identity should be concatenated as well, in a manner compatible with the concatenated responses, to enable appropriate comparison between the two.
Indeed, concatenation may herein refer to at least conceptually joining temporally associated (typically successive) data elements together by linking or ‘chaining’, or by other suitable action, to form a single entity or ensemble. One key point here is that due to the concatenation, notwithstanding its actual implementation, the (concatenated) voice responses provided to the represented cues and the (concatenated) personal voiceprints linked with the cues and the claimed identity of an existing (enrolled) user can be validly compared with each other as they should temporally match each other, i.e. extend over substantially the same period of time, as is clear to a skilled person in view of achieving relevant matching results.
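As a hedged sketch of the concatenation principle only (the dictionary-based data layout and the assumption of pre-extracted feature matrices are illustrative choices, not taken from this disclosure):

```python
import numpy as np

def concatenate_for_matching(responses, voiceprints, cue_order):
    """Join cue-specific responses and the matching voiceprint material in the
    same temporal order so that they can be compared as one longer entity.

    responses: dict cue_id -> (n_frames, n_features) features of the uttered response
    voiceprints: dict cue_id -> (n_frames, n_features) reference feature material
    cue_order: the order in which the cues were represented to the user."""
    test = np.vstack([responses[c] for c in cue_order])
    reference = np.vstack([voiceprints[c] for c in cue_order])
    return test, reference
```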
When the response (in practice e.g. feature(s) derived therefrom) matches the voiceprints sufficiently according to predetermined logic, the voice authentication procedure, or ‘speaker verification’ procedure, may be considered successful, and the authentication level may be scaled (typically raised) accordingly at 152. On the other hand, if the voice-based authentication fails (non-match), the authentication status may be left intact or actually lowered, for instance. The outcome of such an authentication procedure is signaled to the user (preferably at least to the first terminal, potentially both) for review e.g. as an authentication status message via the service UI at 154. New features may be made available to the user in the service UI.
Figure 2d depicts a possible service or device UI view 200D after successful voice authentication. An explicit indication of the outcome of the authentication procedure is provided at 218 by an authentication status message and, as an implicit indication thereof, more service features 216 have been made available to the user (no longer greyed out, which the user will immediately recognize).
In some embodiments, on the basis of the terminal location, the system 106 may introduce a further factor, i.e. a location-based factor, to the authentication procedure and verify whether the current location of the terminal in question matches predetermined location information defining a number of allowed locations and/or banned locations in the light of the service and/or document access.
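A simple sketch of such a location-based factor check is given below; the coordinate representation, the fixed zone radius and the helper names are illustrative assumptions only:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def location_factor_ok(terminal_pos, allowed, banned, radius_km=0.5):
    """True if the terminal lies inside some allowed zone and in no banned zone.

    terminal_pos: (lat, lon) of the terminal; allowed/banned: lists of (lat, lon)."""
    in_allowed = any(haversine_km(*terminal_pos, *p) <= radius_km for p in allowed)
    in_banned = any(haversine_km(*terminal_pos, *p) <= radius_km for p in banned)
    return in_allowed and not in_banned
```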
Fig. 1c illustrates a few other scenarios involving the embodiments of an electronic device 102c in accordance with the present invention. The device 102c may be self-contained in the sense that it can locally take care of the authentication procedure based on program logic and data stored thereat. Optionally, the data such as voiceprint data may still be updated e.g. periodically or upon fulfillment of another triggering condition from or to a remote source such as a server 106 via a communications connection possibly including one or more communication networks 108 in between.
Optionally, the device 102c is registered with a remote service such as a network-based service that maintains a database of devices, associated users and/or related voiceprints. Some feasible techniques for implementing user and/or device enrollment for authentication and/or other solutions have been provided e.g. in publication WO2012/045908A1
“ARRANGEMENT AND METHOD FOR ACCESSING A NETWORK SERVICE” describing different features of the ZEFA ™ authentication mechanism along with various supplementary security and communications related features. Depending on the embodiment, the personal voiceprints may be generated at the device 102c or on the network side (e.g.
server 106) from the voice input provided by the user in question.
In some embodiments, the device 102c may incorporate a plurality of functionally (communications-wise) connected elements that are physically separate/separable and may even have their own dedicated housings, etc. For example, an access control panel or terminal providing a UI to a subject of authentication (person) may be communications-wise connected, e.g. by a wired or wireless link, to an access control computer and/or actuator, which may also take care of one or more task(s) relating to the authentication and/or related access control procedures.
The device 102c may include e.g. a computer device (e.g. laptop or desktop), or a portable user terminal device, such as a smartphone, tablet or other mobile or even wearable device.
The device 102c may generally be designated as a personal device, or used, typically alternately, by several authorized persons, such as multiple family members or team members at work, and be thus configured to store personal voiceprints of each user, not just of a single user. Storing personal voiceprints of a single user only is often sufficient in the case of a truly personal device. The authentication procedure suggested herein may be utilized to control the provision of (further) access to the resources and feature(s) such as application(s) or application feature(s), for instance, in or at the device and/or at least accessible via the device by means of a communications connection to a remote party such as a remote terminal or a remote network-based service and related entities, typically incorporating at least one server.
The device 102c may thus be, include or implement at least part of an access control device.
In some embodiments, the access control device 102c may include or be at least connected to a particular controllable physical asset or entity 140a, 140b, such as a door, fence, gate, window, or a latch providing
access to a certain associated physical location, such as a space (e.g. a room, compound or building), or a physical, restricted resource such as a container, safe, diary or even briefcase internals potentially containing valuable and/or confidential material. Particularly, the device 102c and the suggested authentication logic provided thereat may be at least functionally connected to an (electrically controllable) locking or unlocking mechanism of such an asset/entity. Yet, the asset/entity may bear data transfer capability to communicate with external entities regarding e.g. the authentication task and/or the outcome thereof as already contemplated hereinbefore.
Figure 1b shows, at 109A, a block diagram illustrating the selected internals of an embodiment of the device 102, 102c or system 106 presented herein. The system 106 may incorporate a plurality of at least functionally connected servers, and typically indeed at least one device such as a server or a corresponding entity with the necessary communications, computational and memory capacity is included in the system. A skilled person will naturally realize that e.g. terminal devices such as a mobile terminal (e.g. a smartphone, tablet, phablet, or wearable device such as a wristop device) or a desktop type computer (terminal) utilized in connection with the present invention could generally include the same or similar elements. In some embodiments, also a number of terminals, e.g. the aforesaid first and/or second terminal, may be included in the system 106 itself. Correspondingly, devices 102c applied in connection with the present invention may in some embodiments be implemented as single-housing stand-alone devices, whereas in some other embodiments they may include or consist of two or more functionally connected elements potentially even provided with their own housings (e.g. an access control terminal unit at a door connected to a near-by or more distant access control computer via a wired and/or wireless communication connection).
The utilized device(s) or generally the entities in question are typically provided with one or more processing devices capable of processing instructions and other data, such as one or more microprocessors, micro-controllers, DSPs (digital signal processors), programmable logic chips, etc. The processing device, or ‘entity’, 120 may thus, as a functional entity, comprise a plurality of mutually co-operating processors and/or a number of sub-processors connected to a central processing unit, for instance. The
processing device 120 may be configured to execute the code stored in a memory 122, which may refer to instructions and other data relative to the software logic and software architecture for controlling the device 102, 102c or (device(s) of) system 106.
The processing device 120 may at least partially execute and/or manage the execution of the authentication tasks including speaker verification based on the instructions, and thereby implement at least part of an authentication platform or entity. Other implemented platforms or entities included may execute e.g. speech-to-text conversion and optionally language translation, which may generally be selected from commonly available contemporary solutions or be based on proprietary technology.
Also the memory entity 122 may be constituted by one or more physical memory chips or other memory elements, optionally integral with the processing device 120. The memory 122 may indeed store program code for authentication and potentially other applications/tasks such as (speech-to-)text conversion and translation, and other data such as a voiceprint repository, user contact information, electronic documents, service data, vocabularies, etc.
The memory 122 may further refer to and include other storage media such as a preferably detachable memory card, a floppy disc, a CD-ROM, or a fixed storage medium such as a hard drive. The memory 122 may be non-volatile, e.g. ROM (Read Only Memory), and/or volatile, e.g. RAM (Random Access Memory), by nature. Software (product) may be provided on a carrier medium such as a memory card, a memory stick, an optical disc (e.g. CD-ROM or DVD), or some other memory carrier.
The UI (user interface) 124, 124B may comprise a display, a touchscreen, or a data projector 124, and a keyboard/keypad or other applicable user (control) input entity 124B, such as a touch screen, a number of separate keys, buttons, knobs, switches, a touchpad, a joystick, or a mouse, configured to provide the user of the system with practicable data visualization/reproduction and input/device control means, respectively.
The UI 124 may include one or more loudspeakers, earphones, and associated circuitry such as D/A (digital-to-analogue) converter(s) for
sound output, and/or sound capturing elements 124B such as microphone(s) with an A/D converter for sound input (obviously the device capturing voice input from the user has at least one microphone, preferably both an air/mouth microphone and a contact/throat microphone), or external loudspeaker(s), earphones, and/or microphone(s) may be utilized thereat, for which purpose the UI 124, 124B preferably contains suitable wired or wireless (e.g. Bluetooth) interfacing means.
In some embodiments, a camera or specifically a video camera, optionally a substantially wide-angle or even 360-degree view angle covering camera, may be included or at least functionally connected to record image data about the environment for e.g. security, verification and/or surveillance purposes. The camera may optionally be remotely controllable in terms of its alignment and viewing direction by the system, for example. For the remote, optionally real-time or pre-programmed, control, suitable servo motor(s) may be applied among other options.
The device 102, 102c/system 106 may further comprise a data interface 126 such as a number of wired and/or wireless transmitters, receivers, and/or transceivers for communication with other devices such as terminals and/or network infrastructure(s). For example, an integrated or a removable network adapter may be provided. Non-limiting examples of the generally applicable technologies include WLAN (Wireless LAN, wireless local area network), LAN, WiFi, Ethernet, USB (Universal Serial Bus), GSM (Global System for Mobile Communications), GPRS (General Packet
Radio Service), EDGE (Enhanced Data rates for Global Evolution), UMTS (Universal Mobile Telecommunications System), WCDMA (wideband code division multiple access), CDMA2000, PDC (Personal Digital Cellular), PHS (Personal Handy-phone System), and Bluetooth. Some technologies may be supported by the elements of the system as such whereas some others (e.g.
cell network connectivity) are provided by external, functionally connected entities.
It is clear to a skilled person that the device 102, 102c or system 106 may comprise numerous additional functional and/or structural elements for providing advantageous communication, processing or other features, whereupon this disclosure is not to be construed as limiting the
presence of the additional elements in any manner. Entity 125 refers to such additional element(s) found useful depending on the embodiment.
At 109B, potential functional or logical entities implemented by the device
102c or system 106 (mostly by processing element(s) 120, memory element(s) 122 and communications element(s) 126) for voice authentication are indicated.
Profiler 110 may establish the cue-associated and cue-specific voiceprints for the users based on the voice input by the users. The input may include speech or generally voice samples originally captured by the user terminal(s) and funneled to the profiler 110 for voiceprint generation including e.g. feature parameter extraction. Element 112 refers to a voiceprint repository 112 that may contain a number of databases or other data structures for maintaining the personal voiceprints determined for the cues based on voice input by the user(s).
A voiceprint associated with a cue may be established using any suitable method and stored accordingly. For example, a number of features (parameters) or feature vectors may be extracted from the captured input and used as such and/or utilized as a basis for establishing a higher-level cue-specific user model.
For example, LP (linear prediction) coefficients and/or related coefficients such as MFCCs (mel-frequency cepstral coefficients) or LPCCs (linear predictive cepstral coefficients) may be determined and utilized in voiceprints.
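As a hedged illustration of such feature extraction (assuming the third-party librosa library, a 16 kHz sampling rate and 13 coefficients, none of which are mandated by this disclosure):

```python
import librosa

def extract_mfcc(path, n_mfcc=13):
    """Load a voice response and compute MFCC frames for voiceprint material.

    The choice of librosa and of 13 coefficients is an illustrative assumption;
    the disclosure only requires that LP- or cepstrum-related features be extracted."""
    signal, sr = librosa.load(path, sr=16000, mono=True)
    # librosa returns (n_mfcc, n_frames); transpose to (n_frames, n_mfcc).
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T
```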
In one embodiment, a voiceprint is or includes a reference template established from the feature vectors obtained based on the extracted features, typically short-term spectral feature vectors. Later, during the testing phase, similar entities such as vectors are determined from the voice responses to the cue prompts, whereupon template matching may take place. Alignment procedures such as dynamic time warping (DTW) may be applied in connection with template-based matching.
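A minimal DTW alignment between a test feature sequence and a stored reference template might look like the following sketch (an unconstrained textbook formulation; practical systems would typically add path constraints and more careful normalization):

```python
import numpy as np

def dtw_distance(test, template):
    """Dynamic time warping distance between two (n_frames, n_features) sequences.

    Plain O(N*M) dynamic programming for illustration; the result is normalized
    by the combined sequence length."""
    n, m = len(test), len(template)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(test[i - 1] - template[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)
```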
Alternatively, a neural network-based model may be established using extracted features such as the aforementioned LP coefficients or related cepstral coefficients as input thereto.
On the other hand, e.g. an HMM (hidden Markov model) or some preferred derivative or relative thereof such as a GMM (Gaussian mixture model) may be used to establish each associated voiceprint. HMMs etc. can reasonably model statistical variation in spectral features.
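A hedged sketch of such a GMM-based voiceprint follows, here using scikit-learn's GaussianMixture as a stand-in; the component count and diagonal covariances are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_voiceprint(enrollment_frames, n_components=8):
    """Fit a diagonal-covariance GMM to the enrollment feature frames of one cue.

    enrollment_frames: (n_frames, n_features) array of e.g. cepstral features."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return gmm.fit(enrollment_frames)

def gmm_log_likelihood(gmm, test_frames):
    """Average per-frame log-likelihood of a test response under the voiceprint."""
    return float(np.mean(gmm.score_samples(test_frames)))
```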
In addition, a more general voiceprint or user model, not solely tied to any specific cue, may be established based on the voice data obtained during enrollment and optionally afterwards, if adaptation is allowed also based on testing phase utterances.
Yet, for scoring and normalization purposes a reference model may be determined using a set of imposter test utterances (i.e. utterances provided to the same cue(s) by other persons that may in practical circumstances include other enrolled users for convenience).
Based on the foregoing, voiceprint data are obviously preferably personal (user account or user id related) and characterize the correct voice response to each cue (in the cue subset used for authenticating that particular user).
Voiceprint data may indicate, as already alluded to hereinbefore, e.g.
fundamental frequency data, vocal tract resonance(s) data, duration/temporal data, loudness/intensity data, etc. Voiceprint data may indicate personal (physiological) properties of the user 102 and characteristics of received sample data (thus advantageously characterizing also the substance or message and semantic content of the input) obtained during the voiceprint generation procedure. In that sense, the voice recognition engine, or ‘speaker verification engine’, used in accordance with the present invention may also incorporate characteristics of speech recognition.
Analyzer 114 may take care of substantially real-time matching or generally analysis of voice input and already existing voiceprints during authentication. The analyzer 114 thus accepts or rejects the identity claim of the user. Such analysis may include a number of comparisons according to predetermined logic for figuring out whether the speaker/utterer really is the user initially indicated to the system. In some embodiments, profiler 110 and analyzer 114 may be logically implemented by a common entity due to e.g. similarities between the executed associated tasks. Authentication entity 116 may be such an entity or it 116 may at least generally control the
execution of authentication procedure(s), determine cues for an authentication task, raise/lower permanent or session-specific authentication levels based on the outcome thereof, and control e.g. data transfer with terminal devices and network infrastructure(s) including various elements.
As mentioned hereinbefore, multiple cues (a so-called subset) are preferably dynamically selected from the potential ones, i.e. the ones having a voiceprint associated therewith for the claimed user, for each speaker verification round. The selection may be random or pseudo-random. The utilized logic may, for example, be configured to keep track of the utilization frequency of different cues so that different cues are more or less alternately and evenly selected. In this sense, the embodiments of the present invention have some common features with so-called ‘random digit strings’ or ‘randomized phrase prompting’ type speaker verification, the present solution, however, being based on user-selected cues and related natural and fully personal memory associations instead of e.g. digits or predefined phrases.
Also an indication of the duration of each voice response used to establish a corresponding voiceprint may be stored together with the voiceprint and cue data so that the selection may be based on the duration data. For example, it may be ascertained that the selected cues correspond to voiceprints associated with a total, combined duration exceeding a predefined minimum threshold enabling reliable verification. A larger number of shorter-duration-associated cues or a smaller number of longer-duration-associated cues may be selected as a result so that the minimum duration criterion is reached. The minimum overall duration may be e.g. a few seconds such as five seconds depending on the embodiment.
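One possible selection routine combining the usage-frequency balancing and the minimum total duration criterion discussed above could be sketched as follows (the five-second default and the tie-breaking scheme are assumptions made for illustration):

```python
import random

def select_cues(cue_durations, min_total_s=5.0, usage_counts=None):
    """Pick a pseudo-random cue subset whose enrolled response durations sum to
    at least min_total_s seconds, favouring the least-used cues.

    cue_durations: dict cue_id -> duration (s) of the enrolled voice response.
    usage_counts: optional dict cue_id -> how often the cue has been used."""
    usage_counts = usage_counts or {}
    # Least-used cues first; ties broken randomly to keep the challenge unpredictable.
    ordered = sorted(cue_durations, key=lambda c: (usage_counts.get(c, 0), random.random()))
    chosen, total = [], 0.0
    for cue in ordered:
        chosen.append(cue)
        total += cue_durations[cue]
        if total >= min_total_s:
            break
    random.shuffle(chosen)  # randomize the presentation order as well
    return chosen
```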
The suggested strategy for dynamically picking up cues having voiceprints associated therewith for the identity verification of the claimant can thus be considered as an innovative evolution of phrase prompting type and text-dependent speaker verification solutions, whereupon existing tools such as feature extraction, modeling, matching and scoring methods used in such are also generally applicable in connection with the embodiments of the present solution.
In some embodiments, the analyzer 114 may additionally or alternatively, having regard to any of the aforementioned features, include e.g. a VAD.
In some embodiments, the analyzer 114 may further additionally or alternatively include an audio signal processing platform for processing e.g. air and contact microphone-based sound data into a common sound signal or specifically a speech signal, adapting air microphone-based data based on the contact microphone data, converting sound data into text, translating the text, and/or synthesizing the (translated) text or providing related synthesis or synthesized data enabling synthesis or related audible reproduction. The VAD may be utilized in the processing. Data characterizing the speaker (user) e.g. in the form of voiceprints may be applied in adapting the conversion and/or synthesis processes, for example.
Figure 1d illustrates one embodiment of the analyzer 114 configured to perform speaker verification tasks in connection with the authentication procedure of the present invention.
At 151, sound data indicative of captured sound input is received, whereas numeral 160 refers to obtaining an indication of a claimed user identity so that corresponding personal voiceprint data such as user model(s) may be retrieved from the repository for the selected cues. A number of features may optionally be extracted from the data at 153.
Item 155 refers to matching/comparison actions including e.g. template matching, nearest-neighbor matching and/or other non-parametric matching, or alternatively parametric model matching such as HMM matching. Preferably, sound data representing several, preferably all, voice responses uttered (one response per cue) are concatenated to establish a sound data entity corresponding to a longer duration of voice or generally sound input. Also the related voiceprint data (voiceprints associated with the particular cues in question for which voice responses have been given by the claimant and potentially the general user model data mentioned hereinbefore) is concatenated in a compatible manner so that their mutual comparison becomes possible. For example, cue-specific HMMs or other models based on voice input gathered from the claimed user during the enrollment phase may be combined or chained for the comparison.
The used scoring method, providing e.g. a probability of a match between the claimed identity and the voice input from the claimant, may involve different normalization methods to cope with dynamic conditions such as environmental conditions like background noise etc. For example,
parameter-domain equalization such as blind equalization may be considered. Alternatively, likelihood ratio (detection) based scoring may be applied, utilizing, besides a determined probability of the responses truly representing the claimed identity, also the probability of the responses representing other persons. Generally, H-norm, Z-norm or T-norm score normalization may be additionally or alternatively utilized.
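By way of illustration only, likelihood-ratio scoring against a background (imposter) model combined with Z-norm score normalization could be sketched as below; the exact normalization chosen in a given deployment remains an open design choice:

```python
import numpy as np

def likelihood_ratio_score(target_ll, background_ll):
    """Log-likelihood ratio: claimed-speaker model vs. imposter/background model."""
    return target_ll - background_ll

def z_norm(raw_score, imposter_scores):
    """Z-normalize a raw score against scores produced by imposter utterances
    for the same claimed model."""
    mu = np.mean(imposter_scores)
    sigma = np.std(imposter_scores) + 1e-9
    return (raw_score - mu) / sigma
```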
In some embodiments, several models and related classifiers may be utilized in parallel so that score fusion involving combining classifiers providing uncorrelated results becomes possible.
At 156, a decision is made about the outcome of the verification procedure based on predefined fixed or adaptive threshold(s) set for the established score(s). The result 158 (often binary: identity claim accept/reject) is signaled forward.
In some embodiments, a throat microphone, preferably in contact with the skin surrounding the larynx, may be used to capture the voice responses in addition to or instead of a mouth or ‘close-speaking’ type air microphone. The resulting signal (hereafter called throat speech) is quite similar to normal speech. Due to its proximity to the speech production system, speech recorded from a throat microphone is clean and, in contrast to the mouth microphone, is not affected by environmental noise.
In some embodiments, voiceprints may be adapted based on the voice responses obtained during verification actions (testing phase). When the claimant has been deemed as the claimed identity, the associated voiceprints used during the verification round in question, and optionally also voiceprints and/or more general user model(s) unused during the verification round, could be updated to increasingly resemble the latest response, for instance.
Figure 3 discloses, by way of example only, a method flow diagram in accordance with an embodiment of voiceprint-based authentication.
At 302, the device and/or system of the present invention is obtained and configured, for example through loading and execution of related software,
for managing the electronic service and related authentication mechanism(s).
Further, for users willing or obliged to use voice authentication, the voiceprints shall be established as described earlier in this text using a preferred training/enrollment procedure. For example, the device/system may be trained by the user such that the user utters the desired response (association) to each cue in his/her preferred and/or at least partially machine-selected (sub-)set of cues, whereupon the system extracts or derives the voiceprints based on the voice input. Further, the user may be asked to provide some general or specific voice input that is not directly associated with any voiceprint. Using that voice input, the system may generally model the user-specific voice and/or speech features to be later applied in voice-based authentication and voiceprint matching, for example.
At 304, an indication of a required authentication, such as a voice authentication request, is received from a user via a feasible UI such as an access control terminal, a digital service UI (e.g. browser-based UI) or e.g. via a dedicated application. The request may be associated with a certain user whose voiceprints are available. The request may identify such a user identity by a user ID, for example. Procedures potentially incorporating linking the first and second terminals of the user relative to the current service session have already been discussed in this text. Naturally, e.g. in the case of a single-user personal, self-contained device comprising personal voiceprints only for the particular user, such user identity indication is not necessary.
At 306, a number of cues (for which a voiceprint is available for the indicated user) are determined or selected preferably from a larger group thereof. The selection may be random, alternating (subsequent selections preferably contain different cue(s)), and/or following some other logic. The number of cues per authentication operation may be dynamically selected by the system/device as well. For example, if a previous voice authentication procedure regarding the same user identity failed, the next one could contain more (or fewer) cues, and potentially vice versa. Also the status of other authentication factor(s) may be configured to affect the number. For example, if the user has already been authenticated using some other authentication factor or element, e.g. location, the number of cues
could be scaled lower than in a situation wherein the overall authentication status of the user is weaker.
At 308, the cues are represented to the user via a user device utilized for service access, a stand-alone user device, or e.g. an access control (terminal) device. For example, at least an indication of the cues may be transmitted by a remote system to the (first) user terminal, potentially with instructions regarding visual and/or audible reproduction thereof e.g. via a browser. Preferably, the cues are represented in an easily noticeable and recognizable order so that the responses thereto may be provided as naturally as possible following the same order. For example, graphical cues may be represented in a series extending from left to right via the service or application UI, and the user may provide the voice responses acknowledging each cue in the same, natural order, advantageously without a need to provide any separate, explicit control command for identifying the target cue during the voice input stage.
The user may utter the response to each cue one after another by just keeping a brief pause in between so that cue-specific responses may be distinguished from each other (and associated with the proper cue) in the overall response afterwards by the terminal or the system based on the pauses having e.g. reduced or at least different signal energy or power in contrast to speech-containing portions. The uttered response per cue may be just a single word or, alternatively, a complete sentence or at least several successively pronounced words or generally voices (e.g. exclamations). Alternatively, the user may explicitly indicate via the UI, through cue-specific icon/symbol selection, for instance, to which cue he/she is next providing the voice response.
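A hedged sketch of such pause-based segmentation of the overall response into cue-specific responses is given below; the frame length, minimum pause duration and relative energy threshold are illustrative assumptions:

```python
import numpy as np

def split_responses_by_pauses(signal, sr, n_expected, frame_s=0.025,
                              pause_s=0.4, rel_threshold=0.05):
    """Split one continuous recording into cue-specific responses at long pauses.

    Frames whose energy stays below rel_threshold * peak energy for at least
    pause_s seconds are treated as separators between responses."""
    frame_len = max(1, int(frame_s * sr))
    n_frames = len(signal) // frame_len
    energy = np.array([np.sum(signal[i * frame_len:(i + 1) * frame_len] ** 2.0)
                       for i in range(n_frames)])
    quiet = energy < rel_threshold * energy.max()
    min_pause_frames = int(pause_s / frame_s)

    segments, start, run = [], None, 0
    for i, q in enumerate(quiet):
        if not q and start is None:
            start, run = i, 0           # speech begins
        elif q and start is not None:
            run += 1                    # count consecutive quiet frames
            if run >= min_pause_frames:
                segments.append(signal[start * frame_len:(i - run + 1) * frame_len])
                start, run = None, 0
        elif not q:
            run = 0                     # speech continues, reset pause counter
    if start is not None:
        segments.append(signal[start * frame_len:])
    # Return the segments only if they can be associated one-to-one with the cues.
    return segments if len(segments) == n_expected else None
```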
Indeed, at 310, the voice response to the challenge formed by the cues, such as graphical images, videos, and/or audio files, is provided by the user and potentially forwarded via the terminal to a remote analyzing entity such as the authentication system. The sound signal data forwarded may include digital sound samples, such as so-called raw or PCM (pulse-code modulation) samples, or e.g. a more heavily parameterized compressed representation of the captured voice.
At 312, the obtained voice response data is analyzed against the corresponding personal (user-specific) voiceprints of the represented cues. The analysis tasks may include different matching and comparison actions following a selected logic as explained hereinbefore. For example, a preferred classification algorithm may be exploited, potentially followed by additional quality checking rules determining whether the obtained matching score (e.g. a probability figure) was sufficient to acknowledge the current user as the claimed one. The logic may apply fixed threshold(s) for making decisions (successful authentication, failed authentication), or alternatively dynamic criteria and/or so-called normalization procedures may be applied. For instance, if e.g. heavy background noise is detected in the obtained sound data, the criteria could be loosened.
Preferably, the analysis and matching phase incorporates concatenation of several responses to several cues together to establish a response of longer duration. Accordingly, the corresponding voiceprints (associated with the claimed user identity and with the same cues for which responses have now been gathered) shall be combined for the matching/comparison phase. A voiceprint is typically associated with a single cue only, but there may also be voiceprint(s) or user model(s) to be used in matching that correspond to several cues or to the associated user in general (they may generally characterize the user, such as vocal tract properties, fundamental frequency, etc.). Such general voiceprints may be established based on e.g. all enrollment data provided by the user as original response/training data having regard to the cues he/she has selected, for example, and/or based on additional voice input potentially not relating to the cues.
At 314, based on the outcome of the speaker verification process, the authentication status or level associated with the user is updated accordingly (raised, lowered, or left as is). The user may be provided with access to new location(s) or source(s) (which typically takes place only if the authentication status is raised).
At 316, the method execution is ended.
A computer program, comprising a code means adapted, when run on a computer, to execute an embodiment of the desired method steps in
accordance with the present invention, may be provided. A carrier medium such as an optical disc, a floppy disc, or a memory card, or another non-transitory carrier medium comprising the computer program may further be provided. The program may be delivered over a communication network or generally over a communication channel.
Fig. 4 illustrates, at 400, an embodiment of an essentially dual microphone apparatus preferably integrated in a headset apparatus applicable in connection with the present invention. Duality in this case refers to the integration of both an air microphone 410 and a contact microphone 412, preferably a throat microphone, in the same accessory. Fig. 5 is, in turn, a block diagram 500 representing selected internals of a dual microphone apparatus of e.g. Fig. 4. The microphones 410, 412 may be active simultaneously or one at a time, which is preferably user-selectable.
The contact microphone 412 may be positioned against the skin of the neck of the user 402, thus substantially corresponding to the location of the throat and larynx, for example, to capture the vibration emanating from the vocal tract due to an ongoing voice production process. For the purpose, the contact microphone 412 may comprise one or more sensors.
In turn, the air microphone 410 preferably comprises a mouth microphone (close-speaking microphone) that may be positioned by an arm 418, for example, next to or in front of the mouth of the user 402. Accordingly, sound pressure waves emanating via the mouth may be effectively captured.
The actual technology underlying the sensors used for capturing the vibration/pressure waves may be selected by a skilled person to best fit each particular use scenario and application of the present invention having regard to e.g. size, shape, sensitivity, frequency response, and directivity/directional pattern.
Preferably, the first 410 and second 412 microphones comprise or are connected to suitable fastening or positioning means to enable locating and aligning the microphone as desired. These means may include e.g. a band or earpiece 416, an arm 418, a band or strap 408, adhesive surfaces and/or different clips for enabling attachment to clothing or skin. These may be user-adjustable as to their effective size or shape.
Preferably, the apparatus 400, 500 comprises at least one ear speaker 104 for converting the signal received from e.g. an external device to audible pressure waves (sound). Optionally or alternatively, a number of internally generated signals such as status signals may be converted to audible form. In some embodiments, dual or stereo headphones may be considered where there is a dedicated speaker for each ear. The speaker 104 may be of ear pad or in-ear type, for instance.
In addition to ear speaker 104, or instead thereof, a speaker configured to emit sound generally to the environment may be included.
For the speaker(s), the apparatus 400, 500 preferably contains an amplifier 246 to amplify the D/A (digital-to-analogue) converted 544 signal to a desired level, which may be controllable by a UI (user interface) element such as a button, control wheel, or other UI element accessible by the user 402.
Indeed, depending on the embodiment, the apparatus 400, 500 may contain a UI such as at least one button, control wheel, touchscreen, touchpad, slide switch, (turn) switch, indicator light (e.g. LED), voice (command) control, vibration element (e.g. an electrical motor) etc. in connection with e.g. a connection unit 420 or a suitable wire/cable.
The electronics 532, which are preferably at least partly included in the connection unit 420, may further incorporate a power source such as at least one (disposable) battery, a rechargeable battery or at least a connector for an external power source. In wired embodiments, power/current may be at least partially provided via conductors (wire/cable) from an external device or other source. Wireless power supply or charging is possible as well, whereupon the apparatus may include a wireless power receiving element such as a coil for inductive coupling.
Accordingly, a headset comprising the dual microphone apparatus and at least one speaker may be provided. In some embodiments, the dual microphone or headset may be integrated with headwear such as a helmet or other garment, preferably removably.
The contact microphone 412 may in some embodiments include several, at least two, sensor elements (412, 413), e.g. on both sides or otherwise symmetrically or asymmetrically positioned having regard to the throat or larynx of the user 402.
One or more elements of the dual microphone or headset apparatus 400, 500 may establish at least part of a body 540, or of a body part in cases where the apparatus 400, 500 can be considered to contain multiple body portions connected together e.g. via wires or cables. Preferably, the apparatus 400, 500 contains one or more body parts incorporating or connecting to at least one of the aforesaid air and contact microphones and/or the speaker(s) e.g. via a preferably user-adjustable joint or sliding mechanism.
The body (parts) may be configured to host, protect and/or support other elements such as the aforesaid microphone, speaker or additional elements. The body may further have an acoustic function (e.g. sound box) in view of the microphone or speaker. The body may contain a substantially solid or rigid, but optionally still reasonably flexible or elastic from the standpoint of e.g. wearing or bedding, piece such as a housing, arm or band,
e.g. element 416 for placement against the head or neck (e.g. throat) of the user 402. Additionally or alternatively, the body may contain a strap or band, such as item 408, for neck or chin placement with a desired level of flexibility and/or elasticity.
Yet, the apparatus 400, 500 may include various conductors (signal, earth, current, phase, etc.), corresponding wires or cables that functionally, e.g. electrically, connect the various elements of the apparatus together, regarding e.g. the connection unit, microphone(s) and speaker(s). In some embodiments, a mere cable may be used to connect an element such as the in-ear speaker 404, the air microphone 410, or the contact microphone to another element such as the connection unit 420 without more rigid body part(s).
Preferably, an A/D transducer 542 is included to convert the microphone signals into digital form. A pre-amplifier (as part of amplifier block 546) may be provided to amplify the electrical microphone signals.
Yet, the apparatus 400, 500 may include a sound codec to encode the microphone signals, or a common signal derived based thereon, in accordance with a desired standard or encoding scheme e.g. for compression purposes.
A sound decoder may be included for decoding received signals for audible reproduction via speakers. The codec(s) may be implemented through a number of ASIC circuits (application-specific integrated circuits) or a more general processing device 534. These elements and e.g. the aforesaid D/A transducer 244 and/or speaker amplifier may be included in the connection unit 520 or implemented separately therefrom.
The apparatus 400, 500 may include, e.g. in the connection unit 520, a wireless transmitter, receiver or transceiver 538 for the transfer of electrical, or digital, sound signals (which may in practice be coded data signals) relative to external device(s) 530 such as a mobile terminal, or another potentially personal terminal device such as a tablet, phablet, computer, laptop, or an entertainment electronics device, e.g. a game console, a portable music or multimedia player, a (smart) television, wearable electronics (e.g. a wristop computer, smart clothing or smart goggles), a tuner-amplifier/stereo, or a corresponding device. Remote systems such as the system 106 may be functionally connected to the apparatus 400, 500 e.g. via the device 530.
The wireless technology used may follow a selected standard and e.g. radio frequency band. Bluetooth ™ or a variant thereof is applicable, as is WLAN (wireless LAN) or a cellular solution. The transmission path between the described apparatus and the external device 530 may be direct or contain a number of intermediate elements such as network infrastructure elements.
Instead of or in addition to radio frequencies, e.g. infrared frequencies or optical data transmission is possible. For wireless communication, the apparatus 400, 500 may further include one or more antennae, optionally in-molded within the body part(s) or other elements of the apparatus.
In some embodiments, the apparatus 400, 500 contains at least a wire, cable 414 and/or connector 520, to wiredly connect to the external device 530 such as a mobile terminal (e.g. smartphone or tablet) for electrical or particularly digital audio signal transfer and e.g. control data signal transfer. This wired interface may follow a selected standard such as the USB (universal serial bus) standard. The connector 520 may receive a cable or directly the compatible counterpart (matching connector, e.g. male vs. female connectors in the general shape of a protrusion or recess) of the external device 530.
In some embodiments, at least part of the connection unit 420 and related electronics 532 may be provided in a dedicated separate housing that is connected e.g. to the microphones 410, 412 wirelessly and/or wiredly, using e.g. cables or low-power wireless communication such as Bluetooth low energy ™. The connection to the external device 530 such as a mobile terminal may then be wireless or wired as described above, using e.g. a cable or connector 520.
In some embodiments, the apparatus 400, 500 is configured to determine a common signal based on the first and second microphone signals, e.g. in the connection unit 420. Alternatively, such processing could take place at the device 530 or e.g. a remote system 106 connected thereto. The common signal may be an electrical/digital signal established from the first and second signals, preferably by means of signal processing. For the purpose, the apparatus 400, 500 may apply the processing device 534 comprising e.g. a microprocessor, signal processor, microcontroller, related memory 536 e.g. for storing control software and/or sound signal data, and/or other circuitry. The device 534 may execute different coding, decoding and/or other processing tasks such as filtering, voice (activity) detection or noise or echo detection/cancelling.
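By way of a non-limiting illustration only, the following minimal Python sketch shows one conceivable way of deriving such a common signal from the air-microphone and contact-microphone signals through a frame-wise, energy-weighted mix; the frame length and the weighting rule are assumptions made for the example and are not mandated by the present disclosure.

import numpy as np

def common_signal(air, contact, frame_len=512):
    # Mix two mono float signals frame by frame; each frame is weighted by the
    # relative energies of the two channels (purely illustrative rule).
    n = min(len(air), len(contact))
    air = np.asarray(air, dtype=np.float32)[:n]
    contact = np.asarray(contact, dtype=np.float32)[:n]
    out = np.zeros(n, dtype=np.float32)
    for start in range(0, n, frame_len):
        a = air[start:start + frame_len]
        c = contact[start:start + frame_len]
        ea = float(np.mean(a ** 2)) + 1e-12   # frame energy, air channel
        ec = float(np.mean(c ** 2)) + 1e-12   # frame energy, contact channel
        w = ec / (ea + ec)                    # share given to the contact channel
        out[start:start + len(a)] = w * c + (1.0 - w) * a
    return out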
In embodiments where the locally captured first and second microphone signals are used to determine the components of a multi-channel signal, e.g. a stereo signal, each microphone signal, or a signal derived therefrom, may have been optionally allocated a dedicated channel (e.g. left or right). These channels may be jointly or independently coded, using e.g. different codecs per channel.
The audio codecs applied in connection with the present invention may include any suitable codec, e.g. a selected codec under the Bluetooth A2DP (audio distribution profile) definition. The supported codec(s) may include at least one element selected from the group consisting of: aptX or a derivative thereof, SBC (subband codec), mp3 (MPEG-1 audio layer 3), AAC (advanced audio coding), ATRAC (adaptive transform acoustic coding), and CELP (code-excited linear prediction) or a derivative thereof (e.g. ACELP, algebraic CELP).
In some embodiments, the apparatus 400, 500 may be configured to establish further information or metadata such as control or synchronization information for transmission to the external device 530 for (mutual) synchronization of the microphone signals and/or indicating parameters used in capturing or coding them (e.g. microphone properties, background noise properties as captured using at least one microphone, such as level or stationarity), for optimized further processing of the signals/common signal or for other purposes. As mentioned hereinbefore, further processing may take place in connected remote systems 106 as well.
Fig. 6 illustrates, at 600, an embodiment of a method in accordance with the present invention and at 620, related possible UI aspects of a terminal device.
Item 602 refers to different preparatory activities as discussed hereinbefore with reference to e.g. user account creation, downloading and installation of applicable client software, setting up a network service/server side functionalities, configuring the terminal(s) and the server(s), etc.
At 604, sound data preferably based on captured air and contact microphone signals is obtained, optionally originally captured using a dual microphone as set forth hereinearlier. A user terminal may first locally capture the signals and them transmit them, optionally e.g. as components of a stereo signal, to the online server/service for processing. At least some processing may also take place at the terminal itself with reference to e.g. segmentation possibly relying on a VAD and/or contact microphone signal. Nevertheless, the sound data preferably includes speech of a user utilizing the terminal.
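A minimal sketch of such terminal-side packing is given below in Python, assuming the two captures are float arrays scaled to [-1, 1]; the channel assignment (channel 0 for the air microphone, channel 1 for the contact microphone) is merely an illustrative convention.

import numpy as np

def pack_stereo_pcm(air, contact):
    # Pack the air- and contact-microphone captures as the two components of a
    # 16-bit interleaved stereo PCM buffer for onward transmission.
    n = min(len(air), len(contact))
    stereo = np.stack([air[:n], contact[:n]], axis=1)          # shape (n, 2)
    pcm16 = (np.clip(stereo, -1.0, 1.0) * 32767.0).astype(np.int16)
    return pcm16.tobytes()                                      # interleaved samples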
In the beginning or set-up phase of a communication session, the aforementioned item 604 may refer to communication of sound data including speech intended specifically for text-dependent speaker verification to identify and/or authenticate the concerned speaker as a claimed user.
Later on, the item 604 may refer to communication of sound data including speech primarily targeted, by the speaker, to one or more remote participants (users) of the preferably substantially real-time type online communication session, unless it is a question of a single-user application such as a dictation or archiving/documentation application. Still, the same sound data dominantly targeted to archiving and/or other users may be utilized, by the system, for text-independent, potentially transparent (executed in the background, possibly substantially concealed from the user), speaker verification as described hereinbefore.
In various embodiments, sound/speech transfer may be technically realized by streaming utilizing e.g. a dedicated stateful IP protocol, typically via a native client application. Alternatively, input sound/speech may be composed into batches at a user terminal and forwarded in blocks by using a stateless standard protocol, like HTTPS, using e.g. a web browser based, on-demand loaded application. However, these are only examples of feasible technologies and a person skilled in the art may validly end up using some alternative options depending on the use context and e.g. available communication connections.
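For the batch alternative, a hypothetical Python sketch is shown below; the endpoint URL, the bearer-token header and the multipart field name are assumptions made for illustration only, and the requests library is assumed to be available.

import requests

def post_block(block, session_token, url="https://service.example.com/api/upload"):
    # Forward one composed block of sound data to the online service over
    # stateless HTTPS; returns True when the server acknowledges the block.
    resp = requests.post(
        url,
        headers={"Authorization": "Bearer " + session_token},
        files={"audio": ("block.wav", block, "audio/wav")},
        timeout=10,
    )
    return resp.ok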
Item 606 refers to speaker verification of one or more types occurring at one or more stages of the process. As explained above, the initial verification may be text-dependent as thoroughly reviewed hereinbefore with reference to earworms, audio cues, voiceprints, etc. An example of one feasible implementation is shown in the method diagram of Fig. 3.
Subsequently, text-independent speaker verification may be executed e.g. at intervals in the background to authenticate the speaker also during the ongoing communication session, not only in the beginning.
The dashed loop-back arrow after item 606 reflects the fact that once the possible initial authentication involving text-dependent speaker verification has been executed, the execution may revert to capturing sound data 604, this time for archiving and/or delivery to the remote participants, for example, instead of text-dependent speaker verification. Accordingly, the item 606 has been rasterized in the figure to highlight especially the optional and e.g. intermittent nature of subsequent text-independent speaker verification activities. In some embodiments, though, even the text-independent speaker verification could be omitted having regard to at least one user. Accordingly, speaker verification 606, if used at all, could then be solely text-independent.
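The following schematic Python sketch illustrates one possible arrangement of an initial text-dependent check followed by intermittent text-independent re-verification during the session; the session object, the verification callables and the re-check period are hypothetical placeholders rather than elements defined in this disclosure.

import time

RECHECK_PERIOD_S = 60.0  # assumed interval for background re-verification

def run_session(session, verify_text_dependent, verify_text_independent):
    # Initial authentication (text-dependent, cf. item 606).
    if not verify_text_dependent(session.next_audio()):
        raise PermissionError("initial speaker verification failed")
    last_check = time.monotonic()
    while session.active():
        segment = session.next_audio()    # sound data for archiving/delivery (item 604)
        session.forward(segment)
        if time.monotonic() - last_check >= RECHECK_PERIOD_S:
            # Transparent background check; may be omitted for some users.
            if not verify_text_independent(segment):
                session.flag_suspect()
            last_check = time.monotonic()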
Item 608 refers to speech-to-text conversion. As discussed hereinbefore, e.g. a VAD may be applied in enhancing the conversion result through proper speech segmentation and/or separation from silent or noise periods. In particular, contact microphone based data may be exploited by the VAD. Yet, the contact microphone based data may be configured to improve, besides the performance of possible speaker verification tasks, also the conversion, as providing e.g. a more complete input signal for the speech recognition. Further elements possibly utilized in the conversion include speaker voice/speech model data, optionally including or derived from the voiceprints used for speaker verification. Such data is also applicable in enhancing the optional speech synthesis 618 by adapting the synthesis model with the characterizing vocal features of the speaker so that the synthesized speech preferably increasingly resembles the actual voice and particularly the speech of the speaker.
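A minimal sketch of an energy-threshold VAD operating on the contact-microphone channel is given below; the frame length and threshold are illustrative assumptions, and a practical detector would typically be more elaborate.

import numpy as np

def contact_vad(contact, rate=16000, frame_ms=30, threshold=1e-4):
    # Yield (start_sample, end_sample) spans of the contact-microphone signal
    # whose short-term energy exceeds the threshold, i.e. likely speech.
    frame = int(rate * frame_ms / 1000)
    active, start = False, 0
    for i in range(0, len(contact) - frame + 1, frame):
        chunk = np.asarray(contact[i:i + frame], dtype=np.float64)
        energy = float(np.mean(chunk ** 2))
        if energy >= threshold and not active:
            active, start = True, i
        elif energy < threshold and active:
            active = False
            yield (start, i)
    if active:
        yield (start, len(contact))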
The resulting text may be subjected to translation at 614. The target language may be user-selectable e.g. via the UI. In the case of single-user applications such as dictation or archiving type applications, the speech sources themselves shall logically select the target language whereas in the case of a multi-user scenario, any or each user to receive text or synthesized speech derived therefrom may preferably select a desired language, typically the language the person is best familiar with.
Item 618 refers to possible creation of synthesized speech data (e.g. samples or otherwise coded digital speech to be converted into audio at the receiving terminal) from the translated text.
Item 610 refers to provision (transmission by the system for visual or audible reproduction by the receiving terminal) of original digital sound/speech, synthesized digital speech, original text and/or translated text to one or more concerned users such as the user (speech source) that has provided the particular speech input in question and one or more potential remote users that are communicating with the speech source.
Even the speech source could be optionally provided with an option (via the UI) to listen to the synthesized version of the input speech to get a grasp of the listening experience at a remote end of the communication session in terms of e.g. synthesis performance and intelligibility. Likewise, the receiving user could be provided with a UI feature enabling at least temporarily switching over to listening to the original speech instead of the available synthesized, translated version.
The text resulting from the speech-to-text conversion may be provided, in the original language, to the speech source for review via the UI of the terminal and the communication software running thereon. In case the text contains conversion errors, the UI may be provided with a user-accessible feature such as a selection feature to indicate the presence of the error and/or manually correct the error therein by typing via a keyboard or touchscreen, for example. If a correction is made, the previous conversion result may be generally replaced with the corrected one. This applies both to dictation/archiving applications and to communication sessions involving several users. The correction may be conveyed to the system (server) and, e.g. therethrough, to the potential remote terminals of other users.
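One conceivable, purely illustrative shape for conveying such a correction from the terminal to the system, which may then replace the earlier result and relay it to the remote terminals, is sketched below; the field names are assumptions made for the example.

import json

def correction_message(session_id, segment_id, corrected_text):
    # Encode a user correction of one converted/translated text segment.
    return json.dumps({
        "type": "correction",
        "session": session_id,
        "segment": segment_id,
        "text": corrected_text,
    })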
A similar approach may be generally applied to a translation result, which may be inspected and optionally corrected by the speech source, whereupon the corrected translated text may be forwarded to the server and potential remote terminals/users.
Corrective actions by the users may be utilized by the system in adapting the conversion and/or translation activities, for example.
In some embodiments, user input such as a keypress may be awaited, optionally for a predefined duration, to acknowledge the correctness of speech-to-text conversion result prior to providing the text forward for archiving, translation, communication and/or other purposes.
Item 612 refers to potential execution of further tasks such as logging of converted and/or translated text in a memo, minutes, certificate, form, e-mail, or some other document.
Yet, e.g. in online education or security-critical applications, camera data obtained via user terminal(s) may be inspected either by a monitoring person such as an exam supervisor or a customer servant at a bank, by means of their user terminals, or by automated analysis logic based on computer vision/image-based pattern recognition executed at a network server, for instance.
The obtained digital records such as logs, minutes, doctor’s certificates, etc. may be digitally signed. For signing, the concerned user(s) may be authenticated by the speaker verification and/or using some alternative means.
The aforementioned PKI scheme is one option for implementing the signing procedure.
Yet, fraudulent activities such as identity thefts may be recognized through a honeypot scheme as discussed hereinbefore.
The method execution is ended at 616. The dashed loop-back arrow going back to item 604 indicates the fact that the method items shown may in practical implementations be repeatedly executed, even in overlapping or parallel fashion. The information is preferably processed in segments or batches at least in applications involving substantially real-time communication between multiple users.
For example, a terminal or server-based VAD or other solution, even an explicit user input monitoring feature (a PTT, push-to-talk, feature provided via the terminal side UI), may be applied for segmenting the input sound data containing speech into sentences, for example, whereupon the segments may be processed one at a time including e.g. conversion, translation, review, correction, forwarding, logging, etc. Accordingly, the communication/conversion/processing delay may be kept reasonable, while the processed segments still cover some meaningful ensemble of information instead of e.g. isolated words.
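A schematic per-segment pipeline of this kind is sketched below in Python; all of the called hooks (speech_to_text, review, translate, text_to_speech, forward, log) are hypothetical placeholders standing for the activities discussed above, not concrete interfaces of the described system.

def process_segment(audio_segment, ctx):
    # One segment (e.g. a sentence) passes through conversion, optional review,
    # optional translation/synthesis, forwarding and logging.
    text = ctx.speech_to_text(audio_segment)               # cf. item 608
    text = ctx.review(text) or text                        # optional user correction
    out = {"original": text}
    if ctx.target_language:
        out["translated"] = ctx.translate(text, ctx.target_language)     # cf. item 614
    if ctx.synthesize:
        out["speech"] = ctx.text_to_speech(out.get("translated", text))  # cf. item 618
    ctx.forward(out)                                       # cf. item 610
    ctx.log(out)                                           # cf. item 612
    return out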
At 620, a display view of selected UI features is shown by way of example only. Indeed, a user terminal may run a client application, such as a native application or e.g. a browser based solution, implementing a graphical UI for user input and output.
A presentation view 622 may be provided for sharing digital content such as documents, optionally contracts or educational material, online among the participants of the session. A web page or frame could be shown in the case of e.g. customer service or some other online type (web) embodiment of the present invention where also the depicted remaining elements could essentially belong to the same web page or at least web site. Identity information regarding the remote user, such as a friend, colleague, interlocutor or customer servant, may be shown at 624, optionally including e.g. a user name or other ID data, an avatar, and/or a (real-time) video image of the remote user and optionally the associated environment in the case of a monitoring or surveillance camera as deliberated hereinbefore. Obviously, such a view 624 may be unnecessary in the case of a single-user application such as a dictation or documentation application.
A number of different text views, windows, or other objects containing textual information 626 may be provided. Local end originated direct text input, remote end originated direct text input, local end originated and converted text (resulting from speech-to-text conversion of local end sound/speech input, preferably taking place in the remote server), remote end originated converted text, local end originated, converted and translated text, and/or remote end originated, converted and translated text may be rendered via the UI.
In some embodiments, also translated text resulting from direct textual input could be shown via the UI of at least the remote end terminal.
A text segment, such as a sentence or other processing (conversion, translation, and/or synthesis) unit of information, may be shown as a clearly discrete or otherwise visually easily distinguishable element, separated by e.g. line feed(s) or empty spaces from the other segments. Also different colors and e.g. fonts could be used for the purpose.
Preferably, there is a sufficient level of synchronization arranged between e.g. the shown text and the reproduced corresponding audio (speech). The system may be configured to take care of the synchronization, and/or the terminal client(s) may synchronize the output by means of e.g. time codes or other sync data provided with the actual payload data by the system.
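By way of illustration only, a payload item carrying such sync data could take the following shape; the field names and units are assumptions made for the example.

def payload_with_sync(segment_id, text, audio_bytes, start_ms, duration_ms):
    # Bundle one text/audio segment with time codes so the receiving client
    # can align the rendered text with the reproduced speech.
    return {
        "segment": segment_id,
        "text": text,
        "audio": audio_bytes,
        "sync": {"start_ms": start_ms, "duration_ms": duration_ms},
    }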
Consequently, a skilled person may, on the basis of this disclosure and general knowledge, apply the provided teachings in order to implement the scope of the present invention as defined by the appended claims in each particular use case with necessary modifications, deletions, and additions. For example, the HTML5 hypertext mark-up language standard includes application program interfaces for camera, voice recording and geolocation. These HTML features could be utilized in connection with the present invention e.g. instead of a (native) client, e.g. Java, application and/or QR reader as described hereinbefore.

Claims (10)

1. An electronic system (106, 109A) for multimodal information transfer in an online environment, preferably including substantially real-time communication between two or more users, said system comprising at least one server device provided with a processing means (120) and a memory entity (122) for processing and storing, respectively, instructions and other data, and a data transfer interface (126) for receiving and sending data via the network of the online environment, preferably the Internet, wherein the instructions, when executed by the processing means, are configured to cause the at least one server device to

obtain digital sound data comprising speech captured by a first user terminal from a speech source and associated with an identity claim of a first user, said digital sound data including data based on both an air microphone and a contact microphone, preferably a throat microphone, characterized in that said contact microphone based data is utilized in segmenting the speech of the source from other sound data, including background noise, at least regarding the air microphone based data comprised in said digital sound data,

preferably verify, based on the digital sound data and a reference voiceprint associated with the first user and characterizing the voice of the first user, whether the sound data has been produced by the first user, in order to authenticate the speech source as the first user,

convert the sound data into corresponding text, further optionally translating the text into another language, preferably a language selected by the first user or by a second user utilizing a second terminal, and

deliver the text, optionally including said translated text, to at least one user terminal, preferably including said first terminal and/or said second terminal of the second user, for reproduction preferably via the display of the terminal in question.

2. The system of claim 1, configured to deliver, in addition to the text, corresponding digital speech data that contains or is derived from the obtained digital sound data, optionally including speech derived from the translated text utilizing text-to-speech synthesis, to the first terminal and/or the second terminal of the second user for audible reproduction.

3. The system of claim 2, configured to apply voiceprint data in adapting the text-to-speech synthesis of the text, optionally one or more parameters of a synthesis model reconstructing the characteristics of a human voice production mechanism such as the vocal tract.

4. The system of any preceding claim, configured to store (122, 112, 200), for a plurality of users including said first user, a number of personal voiceprints (204), each associated with a specific visual, audiovisual or auditory cue (202) for challenge-response authentication of the users, wherein the cues are preferably user-selected, user-provided or user-created.

5. The system of claim 4, configured to perform text-dependent speaker verification on the speech source in order to authenticate the source as the first user, wherein the system is preferably configured to

pick (116, 200C, 142, 144), in connection with an authentication task relating to a claim regarding the identity (160) of the first user, a subset of cues (212) for which stored voiceprints (212) of the first user exist, and deliver the cues to be presented as a challenge to the speech source claiming the identity,

receive (126, 148, 150) digital sound data indicative of voice responses produced by the speech source to the presented cues, said data preferably including data based on both an air microphone and a contact microphone,

determine (114, 151, 153, 155, 156, 158), based on the digital sound data, the presented cues and the associated voiceprints of the first user, whether the response has been produced by the first user, and provided that this appears to be the case,

elevate (116, 152, 200D, 218, 216) the authentication status of the speech source as the first user, preferably until a subsequent authentication procedure.

6. The system of any preceding claim, configured to authenticate the speech source, preferably at intervals, during and substantially after the start of the multimodal information transfer procedure by utilizing text-independent speaker verification, wherein said authentication preferably takes place in the background from the standpoint of the user, removing the need to provide any specific user input from the speech source.

7. The system of any preceding claim, further comprising at least one element selected from the group consisting of: a voice activity detector configured to utilize the contact microphone portion of the digital sound data to distinguish speech segments, such as sentences, in the digital sound data or particularly in the air microphone portion thereof; a noise reducer or canceller for removing noise from the digital sound data or particularly from the air microphone portion thereof, wherein the noise reducer or canceller utilizes contact microphone based data to perform the removal; an adaptive speech-to-text conversion platform configured to apply voiceprint data in adapting the operation of the platform and/or the input speech; a dictation platform, log or database for storing the text, optionally in a minutes, memo, statement or doctor's certificate format; an online teaching or learning platform for delivering digital learning material comprising the text and/or corresponding digital speech data in the original and/or translated language to a plurality of terminals; an online communication, meeting or collaboration platform for transferring the text and/or corresponding digital speech data in the original and/or translated language to a second terminal; an online examination platform wherein the digital sound data obtained from the first user includes answers of the first user to examination questions; an online management module of digital camera data configured to receive digital camera data from a user terminal and to store the camera data and/or analysis data derived therefrom in connection with the identity of the user in question; an online commerce, banking or customer service platform configured to enable speech and text communication between a customer type first or second user and a customer servant type second or first user, respectively; and a honeypot platform configured to transfer a user connection that has failed authentication based on speaker verification to an isolated resource mimicking the original system.

8. An electronic terminal device (102, 102c, 109A) for multimodal information transfer in an online environment, preferably including substantially real-time communication between two or more users, said device comprising a processing means (120) and a memory entity (122) for processing and storing, respectively, instructions and other data, an air microphone and a contact microphone (124B) for capturing audio data, a display (124) for visualizing data, and a data transfer interface (126) for receiving and sending data via the network of the online environment, preferably the Internet, wherein the instructions, when executed by the processing means, are configured to cause the device to

establish digital sound data based on signals captured via said air microphone and contact microphone, said signals comprising speech from a speech source associated with an identity claim of a first user, characterized in that said contact microphone based data is utilized in segmenting the speech of the source from other sound data, including background noise, at least regarding the air microphone based data comprised in said digital sound data,

send the digital sound data to a network entity (106) for processing, preferably including speaker verification,

receive, via a second terminal device and an intermediate network entity, text obtained from speech-to-text conversion of the speech input of a second, remote user communicating with the terminal in question, and optionally translated, from digital sound data coming from the network entity, further optionally receive text resulting from the speech-to-text conversion and/or text-to-speech synthesized speech data originating from the conversion of the text, language translation and synthesis, and

present the received text visually on the display for review by the user, further optionally audibly reproduce the received synthesized speech data indicative of the speech input of the remote party.

9. The terminal of claim 8, comprising a voice activity detector configured to detect silent or noise phases in the digital sound data by utilizing the contact microphone signal.

10. A method (600) for multimodal information transfer in an online environment, preferably including substantially real-time communication between two or more users, to be performed by a system comprising at least one server device, the method comprising

obtaining (604) digital sound data comprising speech captured by a first user terminal from a speech source and associated with an identity claim of a first user, said digital sound data including data based on both an air microphone and a contact microphone, characterized in that said contact microphone based data is utilized in segmenting the speech of the source from other sound data, including background noise, at least regarding the air microphone based data comprised in said digital sound data,

preferably verifying (606), based on the digital sound data and a reference voiceprint associated with the first user and characterizing the voice of the first user, whether the sound data has been produced by the first user, in order to authenticate the speech source as the first user,

converting (608) the sound data into corresponding text, further optionally translating the text into another language, preferably a language selected by the first user or by a second user utilizing a second terminal, and

delivering (610) the text, optionally including said translated text, to at least one user terminal, preferably including said first terminal and/or said second terminal of the second user, for reproduction preferably via the display of the terminal in question.

11. A computer program comprising code means adapted, when run on a computer, to perform the method steps of claim 10.

12. A carrier medium comprising a computer program according to claim 11.
FI20165708A 2016-09-20 2016-09-20 Online multimodal information transfer method, related system and device FI127920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
FI20165708A FI127920B (en) 2016-09-20 2016-09-20 Online multimodal information transfer method, related system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
FI20165708A FI127920B (en) 2016-09-20 2016-09-20 Online multimodal information transfer method, related system and device

Publications (2)

Publication Number Publication Date
FI20165708A FI20165708A (en) 2018-03-21
FI127920B true FI127920B (en) 2019-05-15

Family

ID=61865742

Family Applications (1)

Application Number Title Priority Date Filing Date
FI20165708A FI127920B (en) 2016-09-20 2016-09-20 Online multimodal information transfer method, related system and device

Country Status (1)

Country Link
FI (1) FI127920B (en)

Also Published As

Publication number Publication date
FI20165708A (en) 2018-03-21

Similar Documents

Publication Publication Date Title
EP3272101B1 (en) Audiovisual associative authentication method, related system and device
EP3412014B1 (en) Liveness determination based on sensor signals
Yan et al. The catcher in the field: A fieldprint based spoofing detection for text-independent speaker verification
US8589167B2 (en) Speaker liveness detection
US10665244B1 (en) Leveraging multiple audio channels for authentication
US8812319B2 (en) Dynamic pass phrase security system (DPSS)
US10623403B1 (en) Leveraging multiple audio channels for authentication
Sahidullah et al. Robust voice liveness detection and speaker verification using throat microphones
WO2021200082A1 (en) In-ear liveness detection for voice user interfaces
US11170089B2 (en) Methods and systems for a voice ID verification database and service in social networking and commercial business transactions
JP7120313B2 (en) Biometric authentication device, biometric authentication method and program
US11900730B2 (en) Biometric identification
Turner et al. Attacking speaker recognition systems with phoneme morphing
Zhang et al. Volere: Leakage resilient user authentication based on personal voice challenges
US20200411014A1 (en) User authentication with audio reply
Shirvanian et al. Quantifying the breakability of voice assistants
Shirvanian et al. Short voice imitation man-in-the-middle attacks on Crypto Phones: Defeating humans and machines
FI127920B (en) Online multimodal information transfer method, related system and device
Zhang et al. A continuous liveness detection for voice authentication on smart devices
Zhang et al. LiVoAuth: Liveness Detection in Voiceprint Authentication with Random Challenges and Detection Modes
Phipps et al. Securing voice communications using audio steganography
FI126129B (en) Audiovisual associative authentication method and equivalent system
Li et al. Toward Pitch-Insensitive Speaker Verification via Soundfield
Anand et al. Motion Sensor-based Privacy Attack on Smartphones
Turner Security and privacy in speaker recognition systems

Legal Events

Date Code Title Description
FG Patent granted

Ref document number: 127920

Country of ref document: FI

Kind code of ref document: B