GB2584827A - Multilayer set of neural networks - Google Patents


Info

Publication number
GB2584827A
GB2584827A
Authority
GB
United Kingdom
Prior art keywords
voice call
words
call
score
fraudulent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1907476.4A
Other versions
GB201907476D0 (en)
Inventor
Charles Herrema Simon
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ai First Ltd
Original Assignee
Ai First Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ai First Ltd filed Critical Ai First Ltd
Priority to GB1907476.4A
Publication of GB201907476D0
Publication of GB2584827A
Legal status: Withdrawn (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/1822: Parsing for meaning understanding
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00: Network architectures or network communication protocols for network security
    • H04L 63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1441: Countermeasures against malicious traffic
    • H04L 63/1483: Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00: Network architectures or network communication protocols for network security
    • H04L 63/30: Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information
    • H04L 63/304: Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information intercepting circuit switched data communications
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00: Network architectures or network communication protocols for network security
    • H04L 63/30: Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information
    • H04L 63/306: Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information intercepting packet switched data communications, e.g. Web, Internet or IMS communications
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M 3/00: Automatic or semi-automatic exchanges
    • H04M 3/22: Arrangements for supervision, monitoring or testing
    • H04M 3/2281: Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 12/00: Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W 12/80: Arrangements enabling lawful interception [LI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 20/00: Payment architectures, schemes or protocols
    • G06Q 20/38: Payment protocols; Details thereof
    • G06Q 20/40: Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
    • G06Q 20/401: Transaction verification
    • G06Q 20/4016: Transaction verification involving fraud or risk level assessment in transaction processing
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/18: Artificial neural networks; Connectionist approaches
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M 2203/00: Aspects of automatic or semi-automatic exchanges
    • H04M 2203/55: Aspects of automatic or semi-automatic exchanges related to network data storage and management
    • H04M 2203/551: Call history
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M 2203/00: Aspects of automatic or semi-automatic exchanges
    • H04M 2203/55: Aspects of automatic or semi-automatic exchanges related to network data storage and management
    • H04M 2203/555: Statistics, e.g. about subscribers but not being call statistics
    • H04M 2203/556: Statistical analysis and interpretation
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M 2203/00: Aspects of automatic or semi-automatic exchanges
    • H04M 2203/55: Aspects of automatic or semi-automatic exchanges related to network data storage and management
    • H04M 2203/558: Databases
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M 2203/00: Aspects of automatic or semi-automatic exchanges
    • H04M 2203/60: Aspects of automatic or semi-automatic exchanges related to security aspects in telephonic communication systems
    • H04M 2203/6027: Fraud preventions
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M 2203/00: Aspects of automatic or semi-automatic exchanges
    • H04M 2203/60: Aspects of automatic or semi-automatic exchanges related to security aspects in telephonic communication systems
    • H04M 2203/6054: Biometric subscriber identification
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M 3/00: Automatic or semi-automatic exchanges
    • H04M 3/42: Systems providing special services or facilities to subscribers
    • H04M 3/42025: Calling or Called party identification service
    • H04M 3/42034: Calling party identification service
    • H04M 3/42059: Making use of the calling party identifier

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Signal Processing (AREA)
  • Technology Law (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Continuous speech is analysed to detect whether a voice call is fraudulent, malicious or nuisance. Words 202 are identified in speech from the caller and the recipient, and the words are processed with an analysis engine 200 to identify their context. A score is generated that indicates a risk of the call being nuisance, fraudulent or malicious. If the score exceeds a predetermined threshold then an action is taken during the voice call, such as ending the call, recording the call, or warning the user. In an embodiment a neural network is used where a first layer 204 detects concepts 206, a second layer 210 detects intentions, a third layer 212 detects conversational intents and a fourth layer 214 generates a score. An assistant layer 208 may provide past concepts and past intentions to the second and third layers to provide better context.

Description

MULTILAYER SET OF NEURAL NETWORKS
The present disclosure relates to monitoring voice calls, in particular, but not exclusively, to monitoring telephone calls and analysing the call for potential fraudulent, malicious and/or nuisance activity.
Speech recognition systems are known in the art. Such systems enable automatic conversion from spoken words into text which may be used by or displayed on an electronic device.
Speech recognition systems may be used to initiate pre-determined actions. Such systems may initiate an action upon recognition of a key word or words. A speech recognition system may be used to detect specific answers, such as "yes" or "no", to set questions as part of an automated call answering system. Such systems are often operable to execute an action for each incoming call.
Fraudulent schemes conducted through telephone calls have increased in number over time. Fraudulent callers often rely on scripts comprising a number of methods of convincing a potential victim into providing them with details such as bank account details or the like. These scripts are constantly changing over time in order to avoid detection by authorities and to find ever more convincing methods of deceiving potential victims.
One prior art system is shown in US2016/0150092.
According to a first aspect of the invention there is provided a system arranged to analyse continuous speech within a voice call to determine if the voice call is fraudulent, malicious and/or nuisance. The system will typically be arranged to use processing circuitry to perform one or more of the following: i) digitise audio signals containing the continuous speech, from the caller and the recipient of the voice call, to generate audio data and to identify words contained within the audio data; ii) process the identified words with an analysis engine which is arranged to identify the context in which the words are used; iii) generate a score, according to a set of metrics, based on the analysis performed by the analysis engine. The calculated score, based upon the analysis performed by the analysis engine, may be indicative of the risk that the telephone call is fraudulent, malicious or nuisance.
The system may be arranged to execute an action, during the voice call, upon the calculated score reaching a pre-determined threshold, and may be arranged to allow the metrics to be updated as the voice call is analysed.
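The loop of identifying words, scoring them against a set of metrics and acting when a threshold is reached can be sketched as follows. This is a minimal illustration only: the names (`RISK_THRESHOLD`, `monitor_call`) and the toy per-word weights are assumptions for the example, not part of the patent, and the real system scores context rather than bare keywords.

```python
# Illustrative only: a keyword-weight table stands in for the analysis engine.
RISK_THRESHOLD = 0.8
SUSPICIOUS_WORDS = {"password": 0.4, "urgent": 0.3, "transfer": 0.3}

def score_words(words):
    """Toy metric: accumulate per-word risk weights, capped at 1.0."""
    return min(1.0, sum(SUSPICIOUS_WORDS.get(w.lower(), 0.0) for w in words))

def monitor_call(word_stream, on_alert):
    """Re-score after every identified word; act as soon as the threshold is hit."""
    seen = []
    for word in word_stream:
        seen.append(word)
        score = score_words(seen)
        if score >= RISK_THRESHOLD:
            on_alert(score)  # e.g. warn the user, record or end the call
            return score
    return score_words(seen)

alerts = []
final = monitor_call(
    "please transfer the money urgent give me your password".split(),
    alerts.append,
)
print(final, len(alerts))
```

Note that the action fires during the call, at the moment the running score crosses the threshold, rather than after the call has ended.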
Such a system is believed advantageous, in part, because it is able to process both sides of the voice call which is believed to allow for more accurate analysis of the content of the call.
Conveniently, the system is arranged such that the calculated score depends on historic content from the voice call in addition to the identified words.
The system may be arranged such that the calculated score includes any of the following factors in its calculation: the occurrence of predetermined words, phrases and/or sentences; the order of words, phrases and/or sentences; the intonation, or change of intonation, in the voice call; volume change(s) within the voice call; frequency of change of intonation; repetition of change of intonation; frequency of change of volume; repetition of change of volume; frequency of key words; frequency of words within a given topic; repetition of words; recognition of a voice (eg biometric) profile; length of the call; repeated calls from the same voice profile; time of day of the call; the day of the week; etc. The executed action may be any one or more of: terminating the voice call; forwarding the voice call to one or more telephone numbers stored in a database; recording at least a portion of the voice call; and indicating to a user that the voice call may be fraudulent.
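One hedged way to picture how several of the listed factors might feed a single score is a weighted sum over normalised factor values. The weights and factor names below are invented for illustration; the patent does not give a formula.

```python
# Assumed illustrative weights; each factor value is normalised to [0, 1].
FACTOR_WEIGHTS = {
    "keyword_frequency": 0.4,
    "intonation_changes": 0.2,
    "volume_changes": 0.1,
    "known_fraud_voice_profile": 0.3,
}

def risk_score(factors):
    """Weighted sum of factor values; result lies in [0, 1] as weights sum to 1."""
    return sum(FACTOR_WEIGHTS[name] * value for name, value in factors.items())

score = risk_score({
    "keyword_frequency": 0.9,
    "intonation_changes": 0.5,
    "volume_changes": 0.0,
    "known_fraud_voice_profile": 1.0,
})
print(round(score, 2))  # 0.4*0.9 + 0.2*0.5 + 0.3*1.0 = 0.76
```

Because the factors are kept separate from the combination rule, individual weights can be tuned without reworking how each factor is measured.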
The digitised audio signals may include any of the following: letters, words, numbers and key tones.
Conveniently the system is arranged to compare the digitised audio signals against data stored in at least one database.
The system may be arranged such that the database is automatically updated on the analysis of previous telephone calls.
The system may be arranged such that the or each database is operable to be manually updated.
According to a second aspect of the invention there is provided a machine readable medium containing instructions which when executed cause a mobile telephone to operate with the system of the first aspect.
According to a third aspect of the invention there is provided a machine readable data carrier containing instructions which when read by a computer cause that computer to provide the system of the first aspect.
According to a fourth aspect of the invention there is provided a method arranged to analyse continuous speech within a voice call to determine if the voice call is fraudulent, malicious and/or nuisance, the method comprising at least one of the following steps, using processing circuitry arranged to: 1) digitise audio signals containing the continuous speech, from the caller and the recipient of the voice call, to generate audio data and to identify words contained within the audio data; 2) process the identified words with an analysis engine which is arranged to identify the context in which the words are used; and 3) generate a score, according to a set of metrics, based on the analysis performed by the analysis engine; wherein the calculated score is indicative of the risk that the telephone call is fraudulent, malicious or nuisance; and the method may be arranged to execute an action, during the voice call, upon the calculated score reaching a pre-determined threshold.
Conveniently, the method is arranged to allow update of the metrics as the voice call is analysed.
The machine readable medium referred to in any of the above aspects of the invention may be any of the following: a CDROM; a DVD ROM/RAM (including -R/-RW or +R/+RW); a hard drive; a memory (including a USB drive, an SD card, a compact flash card or the like); a transmitted signal (including an Internet download, ftp file transfer or the like); a wire; etc. Features described in relation to any of the above aspects of the invention may be applied, mutatis mutandis, to any of the other aspects of the invention.
The skilled person will appreciate that many portions of the technology described herein may be provided by software, hardware and/or firmware, or any combination thereof.
Example embodiments will now be described with reference to the accompanying drawings, in which: Figure 1 schematically shows components of an example system; Figure 2 shows a diagrammatic representation of a system arranged to determine the probability that a telephone call is fraudulent, malicious or the like; Figure 3 schematically shows further details of an analysis engine of Figure 2; and Figure 4 shows a flow chart of the system shown in Figure 1. The embodiments described herein relate to a system arranged to determine the risk that a telephone call is fraudulent, malicious, nuisance or the like.
It is believed that the call monitoring system will be used personally, but it may also be used by businesses, linguists and fraud prevention authorities.
In the embodiment being described and shown in Figure 1, a system 100 comprises a first server 102 and a second server 104 to which a remote processing device 106a, b, c, d, e can make a connection across a network 108. In the following text, it is convenient to refer to the processing devices as 106 with no following letter if the context supports this reference, or with a following letter if a specific reference is being made to one of the illustrated devices.
An analysis engine 105 is provided in at least one of the server(s) 102, 104.
The skilled person will appreciate that the system 100 might contain a number of servers 102, 104 other than the two shown in the embodiment being described. There may for instance be a single server 104 or more than two servers 102, 104. The servers 102, 104 need not be located close to one another and may be separated by significant distances.
In the embodiment being described, the network 108 is a Wide Area Network (WAN) and indeed is the Internet. The skilled person will appreciate that there may be other WANs, etc. that would be suitable for implementing the technologies described herein.
The processing device 106c is a smartphone such as an Apple™ iPhone™, a Samsung™ Galaxy™, or any other such phone including those running Android™, iOS™, Windows™ Mobile, Blackberry™, etc. To aid understanding of the embodiment being described, some of the internal components are schematically shown.
Processing circuitry 110 is connected to an internal memory 112 which contains instructions to cause the processing circuitry to function and also contains data including an internal database 114. The internal database 114 may be used should a portion of the system be implemented on the processing circuitry 110 and in such embodiments the internal database 114 may act as the database 107 described elsewhere.
The processing circuitry 110 is also connected to a speaker 116, or other such sound producer, and a microphone 118, or other sound detector. The skilled person will appreciate that the smartphone 106c is useable as a telephone to provide a voice call with the speaker 116 and the microphone 118 being used to generate sound and input sound respectively.
The processing devices 106a, 106b are also provided and are shown with access to the network 108. The devices 106a, 106b may also be smartphones but alternatively they may be tablets (such as an iPad™, or other Android™ tablet, etc), a laptop computer, a desktop computer, a television (including a smart television), or any other such suitable device capable of making a voice call in addition to providing an embodiment.
The skilled person will also appreciate that voice calls can be made over networks other than the TCP/IP protocol used by the Internet 108. The skilled person will appreciate the further networks, but by way of further example Figure 1 shows a PSTN (Public Switched Telephone Network) 120 to which a telephone 122 is connected. Further, a mobile telephone network 124 shows a further example, connecting two further processing apparatus 106d, e to the smartphone 106c. The mobile telephone network may be any suitable mobile telephone network such as GSM (Global System for Mobile communication), GPRS (General Packet Radio Service), UMTS (Universal Mobile Telecommunications System), LTE (Long Term Evolution) or the like.
As the skilled person will appreciate from the following, embodiments of this technology allow a voice call to be monitored regardless of the underlying transport mechanism for that voice call. Figure 1 attempts to exemplify a selection of underlying transport technologies but the figure is not intended to be exhaustive. Indeed, embodiments may be arranged to work with an analogue transport mechanism rather than the packetised transport mechanisms shown in Figure 1.
In use, the system of Figure 1 is arranged to analyse continuous speech within a voice call, commonly referred to as a telephone call. However, as will be evident from Figure 1, the voice call may be provided by a variety of carrier technologies and may for example be a Skype™ or WhatsApp™ voice call or the like.
When a voice call is connected, the processing circuitry 110 is arranged to make a data connection, across the network 108 (or otherwise) to the server(s) 102, 104. Additionally, the processing circuitry 110 is arranged to monitor the speaker 116 and the microphone 118 of the processing unit 106 being used and to digitise the audio signals sent to the speaker 116 or generated at the microphone 118, and so generate audio data representing the audio call (step 400). The voice call proceeds without an impact on the user and audio data is generated which contains a digitised version of the continuous speech.
However, whilst the voice call is proceeding the processing circuitry 110 transmits the digitised speech (ie the audio data) of the voice call to the server(s) 102, 104 for ongoing processing (step 402). In other embodiments the processing device 106 may be arranged to perform some or all of the processing that is now described. Indeed, it is conceivable that some embodiments may transmit the voice call to the server(s) 102, 104 before the voice call is digitised.
It is convenient for the functionality of the processing device 106, such as the smartphone 106c, to be provided by software running on that processing device. In the case of a smartphone 106c this may be loaded on to the smartphone 106c as an App. The skilled person will also appreciate that the functionality could also be provided in the firmware, or hardware of the processing devices 106, or a combination of software (eg an App), firmware and/or hardware.
The server(s) 102, 104 are arranged to receive the digitised voice call. That is, the server(s) 102, 104 are arranged to receive the digitised audio signals generated by the processing circuitry 110 from the voice call.
The analysis engine 105 within the server(s) 102, 104 is arranged to process the digitised audio signals received from the processing device 106.
In the embodiment being described, the digitised audio is first processed by the analysis engine 105 to identify individual words within the audio data 300 (step 404). That is, the embodiment being described may be thought of as generating a transcription of the digitised audio 300 which transcription is subsequently processed as described below in part to identify the context in which the words are used. As such, the embodiment being described comprises a transcription engine 302 which is arranged to generate the words that are contained within the audio data 300.
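The transcription stage just described, audio data in and a word stream out, can be sketched as below. `fake_speech_to_text` is a hypothetical stand-in for a real speech-to-text engine, which the patent does not specify.

```python
def fake_speech_to_text(audio_chunk):
    # Placeholder: a real implementation would run ASR on the raw samples.
    return audio_chunk["transcript"]

def words_from_audio(audio_chunks):
    """Yield individual words from successive chunks of digitised audio."""
    for chunk in audio_chunks:
        for word in fake_speech_to_text(chunk).split():
            yield word

chunks = [
    {"transcript": "I want you"},
    {"transcript": "to provide account information"},
]
result = list(words_from_audio(chunks))
print(result)
```

Making this a generator matters for the described system: words can be fed to the analysis engine as the call proceeds, rather than after it ends.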
In the described embodiment, the analysis engine 105 comprises a neural network 200, further details of which are shown in Figure 2. The analysis engine 105 also comprises or has access to a database which contains data to be used in the analysis performed by the analysis engine, described hereinafter. Additionally, in the embodiment being described, the analysis engine 105 comprises additional components, shown in Figure 3 and described in more detail below, which in combination with the neural network 200, generate a score as to whether the voice call in progress is likely to be fraudulent, malicious and / or nuisance.
First Layer.
The words 202 identified within the audio data are input to a first layer 204 of the neural network 200. The first layer 204 is arranged to generate, as an output 206, a vector which specifies concepts contained in the input words 202. Each concept is created with the name of the most frequent word used to generate the concept. Here, a concept may be thought of as a label for what the words relate to. For example, should the words be discussing a bank account (ie relate to a bank account) then the concept may be labelled 'bank-account'.
In training, text is provided to the neural network 200 containing concepts that are expected to occur within voice calls analysed by the system 100. Accordingly, the skilled person will appreciate that the selection of the training data will determine the domain of voice calls that the system 100 can process.
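As a rough illustration of the first layer's role, a keyword lookup below stands in for the trained network, with invented concept labels following the 'bank-account' naming convention described above; the real layer learns this mapping from training text.

```python
# Assumed illustrative concept vocabulary; not taken from the patent.
CONCEPT_KEYWORDS = {
    "bank-account": {"bank", "account", "iban", "sort-code"},
    "credentials": {"password", "pin", "login"},
}

def concept_vector(words):
    """Return, per concept (in sorted label order), the fraction of its keywords seen."""
    seen = {w.lower() for w in words}
    return [
        len(seen & keywords) / len(keywords)
        for _, keywords in sorted(CONCEPT_KEYWORDS.items())
    ]

vec = concept_vector("give me your bank account password".split())
print(vec)
```

The output is a fixed-length vector over concepts, which is the shape of signal the later layers consume.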
Assistant Layer.
The neural network 200 also comprises an assistant layer 208, which may be thought of as a supervisory layer.
The assistant layer 208 is in charge of time management and is arranged to assign weights to the concepts 206 identified by the first layer 204. In the embodiment being described, the weights are assigned to the concepts dependent on the time that has elapsed since the concept occurred in the voice call. For example, if a concept is raised early in a voice call and then not used again, that concept is likely to be of less importance. However, if a concept is repeatedly used within the voice call then it is likely to be of greater importance and is given a higher weight. Thus, it will be seen that the assistant layer 208 allows the score that is being generated to depend on the historic content of the voice call.
In some embodiments, including that being described, the assistant layer 208 is arranged to assign greater weights to predetermined words. Here, the predetermined words may be words that have been found to be likely used in a fraudulent, malicious and/or nuisance voice call. The presence of such predetermined words may be thought of as increasing the likelihood that such a call is fraudulent, malicious and/or nuisance. Thus, the use of an increased weighting in such embodiments attempts to increase the accuracy of the system.
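The assistant layer's weighting could be modelled, purely as an assumption since the patent gives no formula, as time decay combined with a repetition term and a boost for predetermined high-risk concepts:

```python
import math

# Assumed high-risk concept set and half-life; illustrative values only.
HIGH_RISK = {"credentials"}

def concept_weight(seconds_since_last_use, occurrences, concept, half_life=60.0):
    """Weight decays with elapsed time, grows with repetition, boosted if high-risk."""
    decay = 0.5 ** (seconds_since_last_use / half_life)
    repetition = math.log1p(occurrences)
    boost = 2.0 if concept in HIGH_RISK else 1.0
    return decay * repetition * boost

# A concept mentioned once, five minutes ago, matters less...
stale = concept_weight(300, 1, "bank-account")
# ...than a high-risk concept repeated moments ago.
fresh = concept_weight(5, 4, "credentials")
print(stale < fresh)
```

Keeping this bookkeeping outside the network, as the text argues, means the network itself need not learn to track elapsed time.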
It can be seen, from Figure 2, that the assistant layer 208 may be viewed as a layer running alongside the other layers of neurons. An advantage of an embodiment using the assistant layer 208 is that it frees the neural network from tracking the weights related to time and use of a concept, which helps to reduce the number of intermediate neurons.
In some embodiments, including the one being described, the assistant layer 208 is arranged to input concepts into a second layer 210, described hereinafter, which concepts have not been generated by the first layer 204 but which have appeared in previous sentences. Embodiments employing such an arrangement have been found to help the NN 200 include interpretations that depend on previous words and to modify the meaning of future sentences.
The concepts that are input to the second layer 210 are typically of reduced weight when compared to the concepts output from the first layer 204. This reduced weight is appropriate because concepts generated by the assistant layer 208 relate to parts of the voice call that are earlier in the call (ie further in the past) and therefore carry less importance than the concepts output from the first layer 204, which are more immediate. The assistant layer, in the embodiment being described, is arranged to manage the weight of each concept taking into account how frequently it is used in the final intents determined by later layers, as described below.
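A sketch of how past concepts might be merged, at reduced weight, with the first layer's current concepts before reaching the second layer. The discount value and concept names are illustrative assumptions:

```python
PAST_DISCOUNT = 0.5  # illustrative: past concepts count for half

def merge_concepts(current, past):
    """Combine current concept weights with discounted weights from earlier in the call."""
    merged = dict(current)
    for concept, weight in past.items():
        merged[concept] = merged.get(concept, 0.0) + PAST_DISCOUNT * weight
    return merged

current = {"bank-account": 1.0}
past = {"credentials": 0.8, "bank-account": 0.4}
merged = merge_concepts(current, past)
print(merged)
```

A concept mentioned both now and earlier ('bank-account' here) ends up weighted above either occurrence alone, matching the repetition argument in the text.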
Second Layer.
The concepts 206 generated by the first layer 204, together with those supplied by the assistant layer 208, are input to the second layer 210.
The outputs of the second layer 210 are what may be thought of as the intents; ie the second layer 210 is trained to detect intentions derived from the voice call as it is being processed (step 410).
The second layer 210 is trained using full statements that are input to the first layer 204. Training of the first layer 204 is de-activated whilst the second layer 210 is being trained.
Embodiments that arrange the second layer 210 to determine intent in this manner are advantageous as this aids the understanding of complex sentences. The second layer 210 is arranged such that it is able to detect multiple intents within a portion of the voice call that is being analysed. Here the portion of a voice call may be a sentence, or sentences that are close to one another, such as neighbouring sentences.
If multiple intents are detected as being possible then the second layer 210 does not choose which intent should be processed and outputs all of the detected intents to a third layer 212 of the neural network 200. It is believed that embodiments arranged in this manner, including the embodiment being described, are advantageous over the prior art, which can sometimes force selection of a 'best intent', thereby reducing the effectiveness of such systems.
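The point about not forcing a single 'best intent' can be pictured as multi-label thresholding rather than an argmax over intent activations. The intent names, activations and threshold below are invented for illustration:

```python
INTENT_THRESHOLD = 0.5  # assumed value

def detect_intents(activations):
    """Return every intent above threshold, not just the single strongest one."""
    return {intent for intent, a in activations.items() if a >= INTENT_THRESHOLD}

activations = {
    "request-credentials": 0.81,
    "create-urgency": 0.66,
    "small-talk": 0.12,
}
detected = detect_intents(activations)
print(sorted(detected))
```

An argmax here would discard 'create-urgency' entirely; passing both intents downstream preserves information for the third layer to resolve.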
Third Layer.
The assistant layer 208 is further arranged to input past intents that do not occur in the current portion of the voice call that is being processed.
The embodiment being described is arranged such that the current portion of the voice call is the present sentence that is being processed, which is built up as described below. However, other embodiments may be arranged differently; for example, the current portion may be a predetermined number of words, a predetermined time window, or the like.
Thus, in the embodiment being described, the output of the transcription engine 302 provides information on the current portion, which is based around the sentence being spoken. The system may be arranged to take the current text as the current portion and to concatenate new text onto the current portion unless it is determined that a new sentence has begun.
For example, we could have as input: Current text: I; Current text: I want you (concatenated onto 'I'); Current text: I want you to provide (concatenated onto 'I want you'); Current text: I want you to provide account information (concatenated onto 'I want you to provide').
Once it is deemed that a sentence has finished, the system, in the embodiment being described, is arranged to start building up a new sentence. For example: Current text: This is important; Current text: This is important to move (concatenated onto 'This is important'); Current text: This is important to move forward (concatenated onto 'This is important to move').
Thus, the system, in the embodiment being described, is arranged to analyse the current text and yet use information derived from previous occurrences of the 'current text'. In the described embodiment, the system is thus arranged to analyse the current sentence using information, via the assistant layer 208, derived from previous sentences.
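The 'current text' examples above amount to accumulating fragments into a sentence and starting afresh when the sentence ends. A naive sketch follows; the patent does not specify how sentence boundaries are detected, so a trailing full stop stands in for that decision here:

```python
def build_sentences(fragments):
    """Concatenate fragments into the current sentence; start a new one at a boundary."""
    sentences, current = [], ""
    for fragment in fragments:
        current = (current + " " + fragment).strip()
        if current.endswith("."):  # naive stand-in for end-of-sentence detection
            sentences.append(current)
            current = ""
    if current:  # keep any unfinished sentence
        sentences.append(current)
    return sentences

frags = [
    "I want you",
    "to provide account information.",
    "This is important",
    "to move forward.",
]
sents = build_sentences(frags)
print(sents)
```

Each completed sentence would then be analysed as the current portion, while earlier sentences feed the assistant layer as history.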
As with the past concepts that are input, by the assistant layer 208, into the second layer 210, these past intents arise from earlier in the voice call. Accordingly, the neural network is arranged to reduce the weight applied to the past intents when compared to the current intents generated by the second layer 210.
The third layer 212 is arranged and trained to manage intents inside a conversation and the layer is trained using complete conversations.
In an example conversation between a scammer and a user, the conversation may be along the following lines: Scammer: I need you to give me more information to protect your account.
User: Like what? Scammer: Your user and password is more than enough to let me help you.
In the embodiment being described, the third layer is arranged to use the last sentence as the current sentence for analysis, but will use previous concepts like "more information" and "your account" which have been picked up in earlier parts of the conversation.
Fourth Layer.
It is conceivable that the third layer 212 of the NN identifies intents that have dependencies between them, are in conflict or the like. In such examples, the output of the third layer 212 may be thought of as being in conflict and making little sense. As such, the fourth layer 214 is used to try and resolve ambiguities in the output of the third layer 212.
Embodiments utilising a fourth layer, such as the one being described, are advantageous because they allow the NN 200 to analyse voice calls where the context of a conversation changes quickly, etc. The training of the fourth layer is performed using complex conversations where the intents contained in the conversation are mixed, change quickly and the like.
In the embodiment being described, the analysis engine 105 is arranged such that the metrics upon which the score is calculated can be updated during a call. Such an arrangement helps the system react to new techniques used to try and trick users and allows the system to be more responsive than other embodiments which do not allow such rapid change. The skilled person will appreciate that a portion of the analysis engine is provided by a Neural Network 200, which ordinarily has a training phase and a use phase; as such, a Neural Network cannot ordinarily be updated in real time (here, real time means within the context of an ongoing call).
The skilled person will appreciate that a voice call, such as a telephone call, will ordinarily last from a few seconds to a few minutes. As such, real time, in the context of the embodiments described herein, is on the order of roughly any of the following: a few seconds, 30 seconds, a minute, 5 minutes, 10 minutes.
In order to allow the update of the metrics during the call, the embodiment being described is arranged to allow the Neural Network 200 that is being accessed to be changed during execution of the method (ie during run-time). That is, the embodiment being described comprises a plurality of neural networks, at least one of which can be trained as the system is executed and at least one of which can be used to analyse the voice call.
To allow switching between neural networks 200, multiple threads are run at any one instant and the system is arranged such that different neural networks may be accessed by each of the threads.
Further, data generated by the analysis of the voice call is captured within the database 107 for each voice call that is analysed, and such an arrangement allows the neural network 200 to be changed and the next portion of the text to be analysed.
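The run-time switching between neural networks described above may be sketched, purely for illustration, as follows. The `ActiveModel` holder, the lambda stand-ins for trained networks and the training thread are all illustrative assumptions, not part of the described embodiment:

```python
# Hypothetical sketch of swapping the neural network used for analysis at
# run-time: one thread analyses the call with the currently published
# network while another trains a replacement, which is then published.
import threading

class ActiveModel:
    """Holds a reference to the network currently used for analysis."""
    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._model

    def swap(self, new_model):
        # Publish the newly trained network; in-flight analyses keep the
        # reference they already obtained, the next portion uses the new one.
        with self._lock:
            self._model = new_model

def analyse(active, portion):
    return active.get()(portion)

active = ActiveModel(lambda text: 0.1)   # stand-in "network": text -> score
before = analyse(active, "hello")

def retrain():
    active.swap(lambda text: 0.9)        # replacement network trained mid-call

t = threading.Thread(target=retrain)
t.start(); t.join()
after = analyse(active, "hello")
print(before, after)  # → 0.1 0.9
```

Because the analysis thread only ever reads the currently published reference, each new portion of text can be analysed by whichever network was most recently installed, as the embodiment requires.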
As is illustrated with the aid of Figure 3, there is provided a score generation unit 304 which is arranged to generate a final score of the analysis performed on the ongoing voice call.
The analysis engine 105, in addition to the neural network 200, also comprises a biometric unit 306 which also makes an input into the score generation unit 304. The biometric unit 306 processes the digitised audio signal 300 rather than the words that are input into the neural network 200. Being a digitised version of the audio signal, as opposed to the words, the audio signal 300 still comprises tone, etc. which can be used to further generate a score of the likelihood of the voice call being fraudulent, malicious and/or nuisance.
If the final score, generated by the score generation unit 304, exceeds a predetermined threshold then the system is arranged to perform a predetermined action. The skilled person will appreciate that the final score may be thought of as a confidence level that the call is fraudulent, malicious and / or nuisance. In one embodiment, the system is arranged such that as analysis of the call proceeds the score can be incremented by processing of the analysis engine. Other embodiments may be arranged to decrement the score, or perform other manipulations.
The embodiment being described is arranged such that the fourth layer 214 of the neural network 200 outputs a score to the score generation unit 304. The skilled person will appreciate that many factors could influence whether it can be determined that the call should be held to be fraudulent, malicious and/or nuisance.
In the embodiment being described, the neural network 200 is arranged such that: * A higher score is given if predetermined words are detected within the voice call. The skilled person will appreciate that certain words are more likely to be used in a fraudulent, malicious and / or nuisance call than others.
* The second layer 210 of the neural network 200 is arranged, as described above, to detect concepts. In the embodiment being described, the neural network 200 is arranged to increase the score as a function of the number of words located in each concept. Here, it can be helpful to think of the concept as a topic and there may be many words that fall into that topic. For example, the topic / concept might be 'bank account' and words that fall into that concept might be any of the following examples: number, account, bank, sort code, branch, etc. It is likely that as the number of words increases the importance of that topic also increases and thus, if that topic is relevant to the call being fraudulent, malicious and / or nuisance then the score should be increased.
* In the embodiment being described the neural network is also arranged such that the age of the topic within the voice call also has an influence on the score. In particular, the system is arranged such that as the age of the topic increases within the voice call then the importance of that topic is consequently reduced and that topic does not have such a large influence on the score.
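The two influences described above, a topic's score contribution growing with the number of matched words and shrinking with the topic's age within the call, may be sketched as follows. The linear weighting and decay rate are illustrative assumptions only:

```python
# Illustrative sketch: score contribution of one topic grows with the number
# of words matched to it and is attenuated as the topic ages within the call.

def topic_contribution(word_count, age, base=1.0, decay=0.1):
    """Contribution of one topic to the score: proportional to matched
    words, reduced as the topic ages (age measured e.g. in sentences)."""
    freshness = max(0.0, 1.0 - decay * age)
    return base * word_count * freshness

# 'bank account' topic: 4 matched words, mentioned 2 sentences ago
print(topic_contribution(4, 2))             # → 3.2
# the same topic much later in the call counts for less
print(round(topic_contribution(4, 8), 2))   # → 0.8
```

With this shape, an old topic eventually stops influencing the score altogether, matching the behaviour described for the embodiment.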
Embodiments may be arranged to adjust the score being generated by the analysis performed by the analysis engine together with the rest of the system according to any of the following factors. Each of these factors may be thought of as being a metric against which a score can be generated for the voice call being processed and as such, a set of metrics is provided.
1) The occurrence of predetermined words, phrases and sentences.
Including, and in particular, formal introductory language and formal language constructions that convey the impression of authority from the incoming side of the call, which can often be used in fraudulent, malicious or nuisance calls.
This may also include citing organisations which are commonly used as narrative 'frames' within telephone frauds, such as the banks, the police or the tax authorities.
2) The order of words, phrases and sentences within any conversation.
3) Intonation and change of intonation during the conversation/s within the voice call, which may include a diminished level of contractions, indicating a higher incidence of formal language.
4) The volume and change in volume of spoken words, phrases and sentences. Telephone fraud depends on creating false realities, usually within limited time frames, and defining elements of stress. These include both implied and overt threats.
5) The frequency of the change in intonation during the conversation. 6) The repetition of the change in intonation during the conversation.
7) The frequency in the change in volume during the conversation.
8) The repetition in the change in volume during the conversation.
9) The frequency of key words, phrases and sentences. Key words and phrases may include commands and demands, as the fraudster uses stress to push the victim towards the intended outcome of divulging personal or financial information, or transferring funds into a criminal-controlled account.
10) The repetition of key words, phrases and sentences. Repetition is a key trope, as it helps drive the victim towards the intended outcome.
11) The recognition of a particular voice during a conversation. The biometric unit 306 is arranged such that it can recognise, via a biometric profile of known fraudsters, the voice of that fraudster. As such, if that biometric profile is recognised a high score may be assigned to the call being fraudulent, malicious and / or nuisance. Indeed, some embodiments may be arranged such that recognition of such a biometric profile is enough in itself to take the score to the predetermined threshold for an action to be taken.
Similarly, the biometric unit 306 may also be arranged to recognise a 'friendly' biometric profile which may be used to reduce the likelihood of the score reaching the pre-determined threshold for action to be taken.
12) The length of a call is a metric, as telephone fraud often takes several hours to complete.
13) A repeat of a call from the recognised voiceprint and/or CLI (Caller Line Identifier; eg typically the telephone number), where any key metrics have been registered towards a threat analysis.
14) The time of day for the call. Telephone fraud narratives are often inconsistent with the hours kept by legitimate organisations.
15) The day of the week for the call, as above.
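The factors enumerated above form the set of metrics against which the call is scored, and the embodiment allows that set to be updated during the call. Purely as a sketch, with metric names and weights that are illustrative assumptions rather than values from the embodiment, this might look like:

```python
# Illustrative sketch of holding a set of metrics and accumulating a
# per-call score from them. Keeping the metrics in a plain mapping means
# they can be updated while the call is still in progress.

class MetricSet:
    def __init__(self, weights):
        self.weights = dict(weights)   # metric name -> weight

    def update(self, name, weight):
        """Update (or add) a metric mid-call."""
        self.weights[name] = weight

    def score(self, observations):
        """observations: metric name -> observed value for this call."""
        return sum(self.weights.get(name, 0) * value
                   for name, value in observations.items())

metrics = MetricSet({"hot_words": 10, "formal_language": 5, "out_of_hours": 20})
obs = {"hot_words": 3, "formal_language": 4}   # counts observed so far
print(metrics.score(obs))                      # → 50

metrics.update("voiceprint_match", 100)        # new technique learned mid-call
obs["voiceprint_match"] = 1
print(metrics.score(obs))                      # → 150
```

The mid-call `update` corresponds to the arrangement, described earlier, in which the metrics upon which the score is calculated can change while the call is analysed.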
The fourth layer 214 of the neural network 200 is arranged to output a final score of the analysis performed by the neural network 200 to the score generation unit 304. The score from the neural network 200 is then combined with the score from the biometric unit 306.
Should the score output from the score generation unit 304 exceed a threshold then the call is deemed to be fraudulent, malicious and/or nuisance and the action is executed.
In the embodiment being described, this action is termination of the voice call. However, in other embodiments, the predetermined action may be any one or more of the following: i. terminating the voice call; ii. forwarding the voice call to one or more telephone numbers stored in a database; iii. recording at least a portion of the voice call; iv. storing the call line identification number in at least one database; and v. indicating to a user that the voice call may be fraudulent.
There now follows an example of how the system 100 is used to analyse a voice call between a user of a smartphone 106c (who may be termed a target) and a person (ie a scammer) trying to defraud that user. The example illustrates how the example neural network 200 described above is used to analyse the speech/text within the voice call and assigns a score both to specific words, to sentence constructions and to intentions.
These assigned scores are accumulated to determine an overall risk that the voice call is fraudulent, malicious and/or nuisance. Such calls (ie those that are fraudulent, malicious and/or nuisance) may be thought of as being problematic voice calls.
'HMRC' (Her Majesty's Revenue & Customs service) scams are now commonplace, where the target is called by a scammer pretending to work for HMRC.
The scammer may initiate the voice call with the following sentence: 'Good afternoon. I am calling from Her Majesty's Revenue service to discuss fraudulent activity we have discovered on your account.' While the call is in progress, the system 100 is arranged to score the voice call on a variety of different metrics as described in relation to the described embodiments above. In the particular example being described the voice call is scored on the following metrics: 1. Time of day - Is the time of day appropriate to the caller's purported identity? For example, banks and government organisations will not generally call out of office hours. Thus, if a call is made out of hours, a score indicating higher suspicion is assigned (which may be a higher or a lower score depending upon whether the score is incremented or decremented towards the predetermined threshold).
2. Length of conversation -Scam calls are frequently transacted over longer periods of time. Thus, longer voice calls are, in this example, scored with more relevance.
3. Biometrics -Over time the voice print of an incoming call can be matched against a database of known scammers' voices. Thus, if a voice print is matched to an existing profile, then the voice call, in this example, may be scored with more relevance.
4. Formal language - To convey authority scammers may use more formal language, including the use of fewer contractions, i.e. 'We have' rather than 'we've' and 'I am' rather than 'I'm'. Also, formal introductions such as 'Good morning/afternoon/evening.' may be used. As such, in the example being described the system is arranged to provide a score of more relevance if such formal language is used within the voice call.
5. 'Hot' words & phrases -These may include organisations used in the creation of false realities, such as banks, or government bodies.
Therefore, in the example being described, the system is arranged to provide a score of more relevance if such words and/or phrases are used within the voice call.
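The 'formal language' metric of point 4, in which a diminished level of contractions indicates a more formal register, may be sketched as follows. The contraction pattern and the list of full forms checked are illustrative assumptions only:

```python
# Hypothetical sketch of the 'formal language' metric: a lower proportion of
# contractions ("we've", "I'm") relative to full forms ("we have", "I am")
# suggests the more formal register described above.
import re

CONTRACTION = re.compile(r"\b\w+'(?:ve|m|re|ll|s|d|t)\b", re.IGNORECASE)

def formality_signal(text):
    """Return True if the text uses full forms rather than contractions."""
    words = text.split()
    if not words:
        return False
    contractions = len(CONTRACTION.findall(text))
    full_forms = sum(text.lower().count(p) for p in ("we have", "i am", "do not"))
    return full_forms > contractions

print(formality_signal("We have discovered fraudulent activity. I am calling from the bank."))
# → True
print(formality_signal("We've spotted something, I'm calling about it."))
# → False
```

A real implementation would of course use a richer language model than a fixed phrase list, but the comparison between contracted and full forms captures the signal the example relies on.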
Thus, noting the above points 1 to 5, the example being described is arranged to score the sentence as follows: 'Good afternoon' - Formal language. Score of 5 added. Total score 5.
'I am calling from' - The formal language continues. Score of 5 added. Total score 10.
'Her Majesty's Revenue service' - A 'hot' signifier. Score of 10 added. Total score 20.
'to discuss' - Formal language continues. Score of 10 added. Total score 30.
'fraudulent activity' - More formal language. Score of 20 added. Total score 50.
'we have discovered on your account' - This reveals the intention within the sentence. On their own these words might be lower-scoring, but in the context of the words preceding them they score highly because they reveal a clear intention to create a false reality and, in this example, they score 50, taking the total score to 100.
At this point, it will be seen that a score of 100 has been accumulated and the call is terminated; where, in this example, 100 is the threshold score for termination.
If biometrics and time of day were also active factors, the scores may accumulate more rapidly.
In the above, the example has been given where the score starts low and is added to until the pre-determined threshold is reached. The skilled person will appreciate that other systems are possible. For example, the score could start high and be decremented until the pre-determined threshold is reached. There may be other scoring systems that are suitable to reach the pre-determined threshold.
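The incremental scoring walk-through above may be restated as the following sketch. The fragment/score pairs and the threshold of 100 are taken directly from the example; the function itself is an illustrative assumption:

```python
# Sketch of the example: each fragment of the opening sentence contributes
# its score until the termination threshold (100 in the example) is reached.

FRAGMENT_SCORES = [
    ("Good afternoon", 5),                       # formal language
    ("I am calling from", 5),                    # formal language continues
    ("Her Majesty's Revenue service", 10),       # 'hot' signifier
    ("to discuss", 10),                          # formal language continues
    ("fraudulent activity", 20),                 # more formal language
    ("we have discovered on your account", 50),  # clear intention revealed
]

THRESHOLD = 100

def analyse_call(fragments, threshold=THRESHOLD):
    total = 0
    for text, score in fragments:
        total += score
        if total >= threshold:
            return total, "terminate"   # the predetermined action
    return total, "continue"

print(analyse_call(FRAGMENT_SCORES))   # → (100, 'terminate')
```

The decrementing variant mentioned above would simply start `total` high and subtract each contribution, terminating once the total falls to the threshold.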
In the embodiment being described, processing is described as being performed at a specified location, such as on the processing units 106, or the server(s) 102, 104. However, the skilled person will appreciate that in other embodiments processing may be performed at locations other than described here.

Claims (7)

  1. A system arranged to analyse continuous speech within a voice call to determine if the voice call is fraudulent, malicious and / or nuisance, the method comprising: using processing circuitry arranged to: digitise audio signals containing the continuous speech, from the caller and the recipient of the voice call to generate audio data and to identify words contained within the audio data; processing the identified words with an analysis engine which is arranged to process the words to identify the context in which the words are used; to generate a score according to the analysis performed by the analysis engine according to a set of metrics; wherein the calculated score, based upon the analysis performed by the analysis engine, is indicative of the risk that the telephone call is fraudulent, malicious, nuisance; and execute an action, during the voice call, upon the calculated score reaching a pre-determined threshold.
  2. The system of claim 1 wherein the method is arranged to allow update of the metrics as the voice call is analysed.
  3. The system of claim 1 or 2 which is arranged such that the calculated score depends on historic content from the voice call in addition to the identified words.
  5. The system of any preceding claim in which the calculated score is additionally calculated according to any of the following: 1) The occurrence of predetermined words, phrases and sentences; 2) The order of words, phrases and sentences within any conversation; 3) Intonation and change of intonation during the voice call; 4) The volume and change in volume of spoken words, phrases and sentences; 5) The frequency of the change in intonation during the conversation; 6) The repetition of the change in intonation during the conversation; 7) The frequency in the change in volume during the conversation; 8) The repetition in the change in volume during the conversation; 9) The frequency of key words, phrases and sentences; 10) The repetition of key words, phrases and sentences; 11) The recognition of a particular voice during a conversation; 12) The length of a call; 13) A repeat of a call from the recognised voiceprint and/or CLI (ie Caller Line Identifier); 14) The time of day for the call; and 15) The day of the week for the call.
  6. The system of any previous claim wherein the executed action comprises any one or more of: i. Terminating the voice call; ii. Forwarding the voice call to one or more telephone numbers stored in a database; iii. Recording at least a portion of the voice call; and iv. Indicating to a user that the voice call may be fraudulent.
  7. The system of any previous claim wherein the digitised audio signals include any of the following: i. Letters; ii. Words; iii. Numbers; iv. Key tones.
  8. The system of claim 7 wherein the system is arranged to compare the digitised audio signals against data stored in at least one database.
  9. The system of any previous claim wherein the database is automatically updated on the analysis of previous telephone calls.
  10. The system of any previous claim wherein one or more databases are operable to be manually updated.
  11. A machine readable medium containing instructions which when executed cause a mobile telephone to operate with the system of any of claims 1 to 10.
  12. A machine readable data carrier containing instructions which when read by a computer cause that computer to provide the system of any of claims 1 to 10.
  13. A method arranged to analyse continuous speech within a voice call to determine if the voice call is fraudulent, malicious and / or nuisance, the method comprising: using processing circuitry arranged to: digitise audio signals containing the continuous speech, from the caller and the recipient of the voice call to generate audio data and to identify words contained within the audio data; processing the identified words with an analysis engine which is arranged to process the words to identify the context in which the words are used; to generate a score according to the analysis performed by the analysis engine according to a set of metrics; wherein the calculated score, based upon the analysis performed by the analysis engine, is indicative of the risk that the telephone call is fraudulent, malicious, nuisance; and execute an action, during the voice call, upon the calculated score reaching a pre-determined threshold.
  14. The method of claim 13 wherein the method is arranged to allow update of the metrics as the voice call is analysed.
  15.
A system arranged to analyse continuous speech within a voice call to determine if the voice call is fraudulent, malicious and / or nuisance, the method comprising: using processing circuitry arranged to: receive audio data containing digitised audio signals containing the continuous speech, from the caller and the recipient of the voice call; process the received audio data to identify words contained within the audio data; processing the identified words with an analysis engine which is arranged to process the words to identify the context in which the words are used; to generate a score according to the analysis performed by the analysis engine according to a set of metrics; wherein the calculated score, based upon the analysis performed by the analysis engine, is indicative of the risk that the telephone call is fraudulent, malicious, nuisance; and execute an action, during the voice call, upon the calculated score reaching a pre-determined threshold.
  16. A machine readable medium containing instructions which when read by a computer cause a processing circuitry of that computer to: digitise audio signals containing continuous speech, from a caller and a recipient of a voice call to generate audio data and to identify words contained within the audio data; processing the identified words with an analysis engine which is arranged to process the words to identify the context in which the words are used; to generate a score according to the analysis performed by the analysis engine according to a set of metrics; wherein the calculated score, based upon the analysis performed by the analysis engine, is indicative of the risk that the telephone call is fraudulent, malicious, nuisance; and execute an action, during the voice call, upon the calculated score reaching a pre-determined threshold.
GB1907476.4A 2019-05-28 2019-05-28 Multilayer set of neural networks Withdrawn GB2584827A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1907476.4A GB2584827A (en) 2019-05-28 2019-05-28 Multilayer set of neural networks

Publications (2)

Publication Number Publication Date
GB201907476D0 GB201907476D0 (en) 2019-07-10
GB2584827A true GB2584827A (en) 2020-12-23

Family

ID=67385384

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1907476.4A Withdrawn GB2584827A (en) 2019-05-28 2019-05-28 Multilayer set of neural networks

Country Status (1)

Country Link
GB (1) GB2584827A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143665B (en) * 2019-10-15 2023-07-14 支付宝(杭州)信息技术有限公司 Qualitative method, device and equipment for fraud

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080020092A1 (en) * 2004-04-30 2008-01-24 Arata Suenaga Method For Improving Keeping Quality Of Food And Drink
US20160105457A1 (en) * 2013-08-30 2016-04-14 Bank Of America Corporation Risk Identification
US20160104476A1 (en) * 2014-10-09 2016-04-14 International Business Machines Corporation Cognitive Security for Voice Phishing Activity
US20180020092A1 (en) * 2016-07-13 2018-01-18 International Business Machines Corporation Detection of a Spear-Phishing Phone Call

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113301210A (en) * 2021-04-16 2021-08-24 珠海高凌信息科技股份有限公司 Method and device for preventing harassing call based on neural network and electronic equipment
CN113301210B (en) * 2021-04-16 2023-05-23 珠海高凌信息科技股份有限公司 Method and device for preventing harassment call based on neural network and electronic equipment
US20220377171A1 (en) * 2021-05-19 2022-11-24 Mcafee, Llc Fraudulent call detection
US11882239B2 (en) * 2021-05-19 2024-01-23 Mcafee, Llc Fraudulent call detection

Also Published As

Publication number Publication date
GB201907476D0 (en) 2019-07-10


Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)