RU2790946C1

RU2790946C1 - Method and system for analyzing voice calls for social engineering detection and prevention

Info

Publication number: RU2790946C1
Application number: RU2022103928A
Authority: RU
Inventors: Иван Александрович Оболенский; Кирилл Евгеньевич Вышегородцев; Дмитрий Николаевич Губанов; Илья Владимирович Богданов
Original assignee: Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк)
Filing date: 2022-02-16
Publication date: 2023-02-28

Abstract

FIELD: computer technology.

SUBSTANCE: processing of data from incoming audio calls to classify the presence of fraudulent activity. The claimed effect is achieved by a computer-implemented method for analysing the dialogue during audio calls to detect fraudulent activity, performed by a processor and comprising the following steps: receiving an incoming audio stream from the calling party; processing the incoming audio stream using at least one machine-learning model, which includes: converting the incoming audio stream into vector form; comparing the vector form of the audio stream with previously stored vectors characterizing fraudulent activity; transcribing the audio stream to analyse the dialogue of the calling party for at least the semantic composition of the information and the pattern of the dialogue; and classifying the incoming audio stream based on the processing.

EFFECT: increased efficiency and accuracy of recognizing fraudulent activity in incoming audio calls due to the combined analysis of the audio stream and the semantics of the dialogue pattern.

15 cl, 5 dwg

Description

FIELD OF TECHNOLOGY

[0001] This technical solution relates to the field of computing, in particular to processing data from incoming audio calls in order to classify whether they contain signs of fraudulent activity.

BACKGROUND OF THE INVENTION

[0002] Analyzing audio streams for subsequent classification is a fairly common approach in various fields of technology and business. The growth in cybercrime is felt especially often in the financial sector, where it harms both the well-being of customers and the reputation of financial institutions. The technique most commonly used by fraudsters in phone calls is social engineering, in which the client is misled and induced to perform certain actions themselves, which as a rule lead to the theft of funds.

[0003] One example of a solution aimed at combating fraudulent activity is a method for determining the risk score of a call, which analyzes the caller's speech and classifies it for the presence of predefined triggers that indicate the caller's intentions (US 20170142252 A1, 18 May 2017).

[0004] Another approach detects a change in the caller's voice, or synthetic speech produced by a robot or bot, by extracting from the audio track characteristic features that indicate the synthetic nature of the sound (US 10944864 B2, 9 March 2021).

[0005] The main disadvantage of the known solutions is the lack of an integrated approach that allows a multifaceted analysis of the audio stream to identify a number of characteristics; in particular, none of them, in addition to analyzing the audio component of the dialogue, transcribes the audio in order to process the caller's dialogue pattern. A further disadvantage is the lack of automated means of protecting the subscriber from fraudulent actions during incoming calls, and of automatically collecting fraudulent audio streams.

SUMMARY OF THE INVENTION

[0006] The technical problem solved by the claimed invention is improving the efficiency of recognizing fraudulent activity.

[0007] The technical result is an increase in the efficiency and accuracy of recognizing fraudulent activity in incoming audio calls, owing to the combined analysis of the audio stream and the semantics of the dialogue pattern.

[0008] The claimed technical result is achieved by a computer-implemented method for analyzing the dialogue during audio calls to detect fraudulent activity, performed by a processor and comprising the steps of:

- receiving an incoming audio stream from the calling party;

- converting the incoming audio stream into vector form;

- processing the converted audio stream using a first machine learning model, during which the vector form of the audio stream is compared with previously stored vectors characterizing fraudulent activity;

- transcribing the audio stream and then processing it using a second machine learning model, which analyzes the dialogue of the calling party, the analysis covering:

the semantic composition of the information and the dialogue pattern, where the dialogue pattern analysis includes analysis of the words used in the conversation, of how phrases are constructed, and of the order in which phrases follow one another;

the presence and duration of pauses in the dialogue of the incoming audio stream;

- classifying the incoming audio stream based on the processing performed by the first and second machine learning models.
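
The analysis of the presence and duration of pauses named in the steps above can be illustrated with a minimal sketch. It assumes the audio is available as a list of float samples and uses a simple frame-energy threshold; the function name `find_pauses` and all parameter values are hypothetical choices for illustration and are not taken from the claimed method.

```python
from typing import List, Tuple


def find_pauses(samples: List[float], sample_rate: int,
                frame_ms: int = 20, energy_threshold: float = 1e-3,
                min_pause_s: float = 0.3) -> List[Tuple[float, float]]:
    """Return (start_seconds, duration_seconds) for each low-energy run."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    n_frames = len(samples) // frame_len
    pauses = []
    run_start = None  # index of the first silent frame in the current run

    # One extra iteration acts as a sentinel that closes a trailing run.
    for i in range(n_frames + 1):
        if i < n_frames:
            frame = samples[i * frame_len:(i + 1) * frame_len]
            energy = sum(x * x for x in frame) / len(frame)
            silent = energy < energy_threshold
        else:
            silent = False

        if silent and run_start is None:
            run_start = i
        elif not silent and run_start is not None:
            duration = (i - run_start) * frame_len / sample_rate
            if duration >= min_pause_s:
                start = run_start * frame_len / sample_rate
                pauses.append((start, duration))
            run_start = None
    return pauses
```

The resulting list of pauses could then be passed to the second model as additional features alongside the transcript; how those features are weighted is left open here.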

[0009] In one particular embodiment of the method, the semantic analysis of the transcribed dialogue identifies words characteristic of fraudulent activity.

[0010] In another particular embodiment of the method, the incoming audio stream is additionally analyzed for at least one of: tonality, emotiveness, prosody, or combinations thereof.

[0011] In another particular embodiment of the method, the vector form of the incoming audio stream is analyzed for features selected from the group: voice alteration, synthetic voice generation, overlay of a background audio stream, or combinations thereof.

[0012] In another particular embodiment of the method, the outgoing audio stream is additionally analyzed.

[0013] In another particular embodiment of the method, the outgoing and incoming audio streams are separated.

[0014] In another particular embodiment of the method, at least one parameter of the incoming audio stream is additionally analyzed, selected from the group: timbre pitch, sound intensity, speech intensity, duration of word pronunciation, aspiration, glottalization, palatalization, type of consonant-vowel junction, or combinations thereof.

[0015] In another particular embodiment of the method, the incoming audio stream is additionally analyzed for extraneous noise.

[0016] In another particular embodiment, the method is performed on a user device, which is a smartphone, tablet, or computer.

[0017] In another particular embodiment of the method, a synthetic outgoing voice audio stream is generated when the incoming audio track is received.

[0018] In another particular embodiment of the method, the outgoing audio stream is generated until the incoming audio track has been classified.

[0019] In another particular embodiment of the method, the synthetic audio stream is generated from a voice sample of the device user.

[0020] In another particular embodiment of the method, when the incoming audio stream is classified as fraudulent, its vector representation is saved.

[0021] In another particular embodiment of the method, when the incoming audio stream is classified as fraudulent, a status message is generated and shown on the device display.

[0022] The claimed technical result is also achieved by a system for analyzing the dialogue during audio calls to detect fraudulent activity, comprising at least one processor and at least one memory storing machine-readable instructions that, when executed by the processor, implement the above method.

BRIEF DESCRIPTION OF THE DRAWINGS

[0023] FIG. 1 illustrates the general scheme of the claimed solution.

[0024] FIG. 2A illustrates a flowchart of the general process of analyzing a call's audio stream.

[0025] FIG. 2B illustrates a flowchart of the process of analyzing an audio stream for synthetic alterations.

[0026] FIG. 3 illustrates a flowchart of the process of generating a synthetic outgoing audio stream for conducting the dialogue.

[0027] FIG. 4 illustrates the general scheme of a computing device.

IMPLEMENTATION OF THE INVENTION

[0028] FIG. 1 shows the general scheme (100) of the claimed solution. The solution is based on a hardware and software complex implemented on one or more computing devices, for example on the smartphone (111) of the user (110) or on a device connected to it, which can process incoming audio calls from a third-party subscriber (120). Audio calls include, for example, telephone calls and calls made over the Internet via messengers (WhatsApp, Viber, Telegram, Facebook Messenger, etc.), including video calls.

[0029] Incoming audio calls from subscribers (120) are passed to further processing (200), performed by software logic running on a computing device such as the smartphone (111). The processing (200) is performed by one or more machine learning models trained to process the incoming audio stream (audio track) and analyze it for the risk of fraudulent activity by the subscriber (120).

[0030] FIG. 2A shows the steps of the method (200) for processing an audio stream, performed when an incoming call is received. In the first step (201), the audio call is received and the incoming audio stream is captured. Capture can be carried out by call-recording means widely known in the art, for example specialized software (Voice Recorder, Cube ACR, etc.). The captured audio stream is processed in parallel, so that the audio component and the semantics of the dialogue are analyzed simultaneously.

[0031] The audio stream obtained at step (201) is converted to a vector format (an embedding) at step (202) and then passed to a machine learning model at step (203), which checks for a match with previously recorded voice embeddings of fraudsters. The conversion of the input audio stream can be performed using the IBM Audio Embedding Generator technology (https://developer.ibm.com/technologies/artificial-intelligence/models/max-generator/).

[0032] Previously known vector representations of audio streams for which fraudulent activity was recorded can be stored in a database (DB). The DB of these embeddings can be hosted on a remote server, to which the smartphone (111) connects during the audio call. The DB can also be replicated directly on the device (111) itself.

[0033] At step (204), based on the processing of the embedding by the machine learning model that classifies the incoming audio stream, a decision is made about the nature of the caller's audio call. If the comparison of embeddings finds a match above the threshold set for classification by the machine learning model, the audio call is classified as fraudulent (step 210). Otherwise, the audio call is classified as safe (step 220).
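
The threshold decision of step (204) can be sketched as follows. This is a minimal illustration, assuming embeddings are plain lists of floats and that the match score is cosine similarity; the function names and the threshold value 0.85 are hypothetical and are not specified in the description.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_call(embedding: Sequence[float],
                  fraud_embeddings: Sequence[Sequence[float]],
                  threshold: float = 0.85) -> str:
    """Label a call by its best match against stored fraud embeddings."""
    best = max((cosine_similarity(embedding, stored)
                for stored in fraud_embeddings), default=0.0)
    # Above the threshold: step 210 (fraudulent); otherwise step 220 (safe).
    return "fraudulent" if best > threshold else "safe"
```

For example, `classify_call([1.0, 0.0], [[0.9, 0.1]])` yields a similarity of about 0.99 and therefore the label "fraudulent", while an orthogonal query would be labelled "safe".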

[0034] Examples of such a model include a model based on the support vector machine method, a model based on linear or nonlinear regression, or a model based on the k-nearest neighbors method. One embodiment searches for the single nearest record using the Euclidean distance between vectors. Another embodiment may use the Mahalanobis distance. Particular embodiments may also use cosine distance, the Pearson correlation coefficient, the Minkowski distance of order r, and so on.
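
Several of the measures listed above can be written down directly. The sketch below shows the Euclidean distance, the Minkowski distance of order r, the Pearson correlation coefficient, and a single-nearest-record search. The function names are illustrative assumptions; the Mahalanobis distance is omitted because it additionally requires an inverse covariance matrix estimated from the stored embeddings.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minkowski(a: Vector, b: Vector, r: float) -> float:
    """Minkowski distance of order r (r=1 Manhattan, r=2 Euclidean)."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)


def pearson(a: Vector, b: Vector) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)


def nearest_record(query: Vector, records: Sequence[Vector],
                   distance: Callable[[Vector, Vector], float] = euclidean
                   ) -> Tuple[int, float]:
    """Return (index, distance) of the stored record closest to the query."""
    if not records:
        raise ValueError("no stored records to compare against")
    distances = [distance(query, rec) for rec in records]
    best = min(range(len(records)), key=distances.__getitem__)
    return best, distances[best]
```

Passing a different `distance` callable to `nearest_record` switches the metric without changing the search, which mirrors the way the description treats the metrics as interchangeable embodiments.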

[0035] In parallel with step (202), the audio stream is transcribed at step (205), that is, the incoming audio stream is converted to text. This can be done by various known algorithms that convert an audio track to text, for example Speech-To-Text technologies. A machine learning model may also be used to perform the transcription.

[0036] The analysis of audio streams also uses an algorithm that separates the voices of the interlocutors in a multi-voice dialogue and cleans the audio tracks of noise and other artifacts, yielding a clearer audio signal. For example, this can be done with approaches based on NMF (non-negative matrix factorization) of the original or transformed signal, with convolutional neural networks, with "Cone of Silence" models, and with other approaches.
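
As an illustration of the NMF-based approach, the sketch below factorizes a small non-negative matrix V (standing in for a magnitude spectrogram, rows as frequency bins and columns as time frames) into W (spectral templates) and H (their activations over time) using the standard Lee-Seung multiplicative updates for the squared error. The helper names, the toy data, and the iteration count are assumptions; a practical system would more likely use an optimized library and would still need a step that assigns templates to speakers.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def squared_error(v: Matrix, w: Matrix, h: Matrix) -> float:
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v: Matrix, rank: int, iterations: int = 200,
        seed: int = 0, eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Factorize non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(m)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(rank)] for i in range(n)]
    return w, h


# Toy "spectrogram": two sources active in alternating time frames.
V = [[1.0, 0.0, 1.0, 0.0],
     [2.0, 0.0, 2.0, 0.0],
     [0.0, 3.0, 0.0, 3.0],
     [0.0, 1.0, 0.0, 1.0]]
W, H = nmf(V, rank=2)
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative, which is what allows each column of W to be read as the spectral shape of one source.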

[0037] The transcribed audio stream is analyzed at step (206) to classify the dialogue pattern of the calling subscriber (120). The classification can use natural language processing (NLP) technologies, including those based on machine learning. Using a trained model, step (206) analyzes the text data in order to assign it to classes that characterize fraudulent behavior, for example classes that indicate social engineering. Examples of social engineering include phrases in which the client (110) is urged to urgently transfer their money to someone else's account, asked to give the full card number, told to take out a loan, or asked for the CVV code, a confirmation code, a code from an SMS, and the like.
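
The example phrases above can be turned into a small rule-based detector. This is only a stand-in for the trained model of step (206): it assumes an English transcript, and the pattern names and regular expressions are hypothetical illustrations rather than patterns taken from the description.

```python
import re
from typing import List

# Hypothetical trigger patterns modelled on the example phrases in the text.
TRIGGER_PATTERNS = {
    "urgent_transfer": re.compile(
        r"\burgent(ly)?\b.*\btransfer\b|\btransfer\b.*\burgent(ly)?\b"),
    "card_number": re.compile(r"\bcard number\b"),
    "take_loan": re.compile(r"\btake (out )?a loan\b"),
    "secret_code": re.compile(
        r"\bcvv\b|\bconfirmation code\b|\bcode from (the |an? )?(sms|text)\b"),
}


def find_triggers(transcript: str) -> List[str]:
    """Return the sorted names of all trigger patterns found in the text."""
    text = transcript.lower()
    return sorted(name for name, pattern in TRIGGER_PATTERNS.items()
                  if pattern.search(text))
```

A rule list like this is easy to inspect but brittle against paraphrase, which is one reason the description relies on a model trained on confirmed fraudulent dialogues instead.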

[0038] "Class" or "classes" means at least a class containing data on fraudsters or a class containing data on non-fraudsters. The classification may also be fuzzy, for cases where an unambiguous classification is impossible: two classes (fraudster, non-fraudster); three classes (fraudster, non-fraudster, unknown); or several classes (type A fraudster, type B fraudster, and so on).
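
The class schemes described above can be sketched as mappings from model scores to labels. The function names and the cut-off values are assumptions chosen for illustration; the description does not fix any particular thresholds.

```python
from typing import Dict


def label_two_class(fraud_probability: float, threshold: float = 0.5) -> str:
    """Two classes: fraudster / non-fraudster."""
    return "fraudster" if fraud_probability >= threshold else "non-fraudster"


def label_three_class(fraud_probability: float,
                      low: float = 0.3, high: float = 0.7) -> str:
    """Three classes: scores between the cut-offs are reported as unknown."""
    if fraud_probability >= high:
        return "fraudster"
    if fraud_probability <= low:
        return "non-fraudster"
    return "unknown"


def label_multi_class(scores: Dict[str, float]) -> str:
    """Several classes: pick the class with the highest score."""
    if not scores:
        raise ValueError("no class scores provided")
    return max(scores, key=scores.get)
```

The three-class variant makes the "unknown" outcome explicit, so that borderline calls are not forced into either of the two definite classes.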

[0039] The output of the model at step (206) is the classification of the dialogue pattern at step (207). A pattern means, in particular, the words used in the conversation, how phrases are constructed, the order in which phrases follow one another, and so on. The classification model is trained on example dialogues with confirmed fraudulent activity, in particular on patterns that enable the subsequent classification of data when input audio streams are processed.

[0040] The dialogue pattern analysis model at step (206) is trained to characterize the degree of confidence in the statement that the direct source of the text data is or is not a fraudster. The model can make this assessment by identifying stable semantic constructions of speech, typical utterances, and patterns of the overall meaning of the dialogue, analyzing them jointly, and comparing the text by proximity to them. Based on the model's classification at step (207), the incoming audio call is assigned to fraudulent activity (210) or to safe activity (220).

[0041] Additionally, when the method (200) is performed, the audio stream is analyzed using an emotive-prosodic model (a model that analyzes emotiveness and prosody), which makes it possible at least to characterize the degree of confidence in the statement that the direct source of the audio recording is or is not a fraudster, based on at least one of the following: identifying general immanent properties of language that express a person's psychological (emotional) state and experience while making a fraudulent call; identifying features common to fraudsters' pronunciation, such as pitch, strength/intensity, duration, aspiration, glottalization, palatalization, the type of consonant-vowel junction, and other features secondary to the main articulation of a sound; accent, intonation in general, and other speech features; and features of the background accompanying the speech, elements of extraneous noise, and the like. The key feature of the model is that it can identify and analyze the common features of audio tracks that contain elements of fraudulent actions, dialogues, and other information indicating, to one degree or another, fraudulent activity.

[0042] This model is trained on examples of audio streams previously marked as fraudulent, based on feedback from victims of fraud schemes. The database can also be expanded through data augmentation or by generating fraudulent dialogues independently. Such generation can be done by recording dialogues that actively use fraudsters' techniques and methods, whether identified from available data or devised independently.
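
Data augmentation of the kind mentioned above can be as simple as rescaling a recording and adding noise. The sketch below assumes samples are floats in the range [-1, 1]; the function name `augment` and its parameters are hypothetical, and the description does not state which augmentations are actually used.

```python
import random
from typing import List, Optional


def augment(samples: List[float], gain: float = 1.0,
            noise_std: float = 0.0, seed: Optional[int] = None) -> List[float]:
    """Return a copy of the waveform with gain and Gaussian noise applied."""
    rng = random.Random(seed)
    augmented = []
    for x in samples:
        y = gain * x + rng.gauss(0.0, noise_std)
        # Clip to the valid sample range so the result is still playable.
        augmented.append(max(-1.0, min(1.0, y)))
    return augmented
```

Calling `augment` several times with different gains, noise levels, and seeds yields multiple training variants of one labelled recording.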

[0043] When an incoming audio call is classified, a status notification can be generated and displayed on the screen of the smartphone (111). A vibration signal, transmission of information to an external device connected to the smartphone (for example, a smart watch), and other types of alert can also be used to inform the user (110) of the status of the incoming call.

[0044] FIG. 2B shows a flowchart of the steps for additional processing of audio calls when they are converted into vector form at step (202). The additional processing is performed at step (230) using several machine learning models that detect particular alterations of the audio stream. At step (230), the audio stream is analyzed for voice alteration (231), synthetic voice generation (232), the presence of an overlaid background (233), and the presence of extraneous noise (234).

[0045] At steps (231, 232), the model analyzes whether the voice of the caller (120) has been altered by software, for example, using Deep Fake Voice algorithms, voice cloning algorithms, and the like. The model assesses whether the input audio track corresponds to a natural recording of a human voice and its surroundings, or whether it contains additional electronic processing, elements of artificially generated sound, or full or partial synthesis of the recording. This detection can be based on identifying synthetic features and machine artifacts that arise in artificially generated human speech. Examples of such features and artifacts include unnatural monotony of speech, creaks in pronunciation, heavy interference, and so on. The model makes it possible at least to characterize the probability that a natural recording has been intentionally distorted or that the recording was artificially generated. One example of implementing the model's functionality is the analysis of a graphical representation of the spectrograms of the audio recording or the use of "transformer" architectures, for example, based on neural networks. This example does not limit other particular embodiments of the functionality of the above machine learning model.
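To make the spectrogram-based analysis more concrete, the following is a minimal sketch of computing a magnitude spectrogram and one crude "monotony" indicator from it. It is an illustration only: the description does not disclose the actual features or model, and the names `magnitude_spectrum`, `spectrogram`, and `dominant_bin_spread`, as well as the frame sizes, are assumptions. A plain DFT is used instead of an FFT to keep the example dependency-free.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the DFT bins 0..n/2 of one Hann-windowed frame."""
    n = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        spectrum.append(abs(acc))
    return spectrum


def spectrogram(samples, frame_size=64, hop=32):
    """List of magnitude spectra for overlapping frames of the signal."""
    return [
        magnitude_spectrum(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, hop)
    ]


def dominant_bin_spread(spec):
    """Variance of the strongest frequency bin across frames.

    Assumed heuristic: a value near zero means the dominant frequency
    barely moves over time, one possible sign of unnaturally monotone speech.
    """
    peaks = [max(range(len(f)), key=f.__getitem__) for f in spec]
    mean = sum(peaks) / len(peaks)
    return sum((p - mean) ** 2 for p in peaks) / len(peaks)


# A perfectly steady tone: the dominant bin never changes between frames.
tone = [math.sin(2 * math.pi * 8 * i / 64) for i in range(256)]
spec = spectrogram(tone)
spread = dominant_bin_spread(spec)
```

In practice such hand-crafted indicators would only be inputs to a trained model; the spectrogram itself could equally be passed as an image to a neural network, as the paragraph suggests.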

[0046] At step (233), the incoming audio stream is analyzed for an overlaid background, for example, one that simulates the sound activity of an office, a call center, etc. Fraudsters may use this approach to mask the audio track and conceal the actual place from which the call is made, which can also be established from extraneous noise during the call. The trained model at step (233) analyzes artifacts inherent in synthetic audio signals that are uncharacteristic of a real environment.

[0047] At step (234), the audio track of the incoming call is analyzed for extraneous noise; for example, synthesized speech typically exhibits crackling in the recording, interference, and the like. To provide this functionality, the model can also perform the analysis by comparing spectrograms or by another principle that makes it possible to identify audio data "uncharacteristic" of an ordinary call.

[0048] The model applied at step (230) makes it possible to combine and analyze, superadditively (synergistically), at least any two outputs of the applied models. Its distinctive feature is that it analyzes the outputs of the preceding models jointly and yields more reliable estimates of the presence of fraudulent elements in the audio recording than using the outputs of the models individually or a simple aggregation such as computing the mean, taking the maximum, and the like. This effect can be achieved by combining several outputs into a common numerical vector (an ordered sequence) and using neural networks as the classifier, obtaining characteristic objects of each class through the support vector method or the k-nearest neighbors method, or building ensembles or boosting of decision trees.
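The fusion described in [0048] can be sketched as follows: the outputs of the individual detectors are placed into one ordered vector, and a classifier is applied to that vector rather than to a mean or maximum of the scores. The k-nearest neighbors method is used here because the paragraph names it as one option; the detector names, the toy labeled vectors, and the functions `fuse` and `knn_classify` are hypothetical.

```python
import math
from collections import Counter

# Fixed order of detector outputs in the fused vector (assumed names that
# mirror steps 231-234).
MODEL_ORDER = ("voice_change", "synthetic_voice", "background_overlay", "noise")


def fuse(outputs):
    """Combine per-model scores into one ordered numerical vector."""
    return [outputs[name] for name in MODEL_ORDER]


def knn_classify(train, query, k=3):
    """Label a fused vector by majority vote of its k nearest neighbors."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy labeled examples, invented for illustration only.
TRAIN = [
    ([0.9, 0.8, 0.1, 0.2], "fraud"),
    ([0.8, 0.9, 0.7, 0.6], "fraud"),
    ([0.1, 0.2, 0.9, 0.8], "fraud"),
    ([0.1, 0.1, 0.2, 0.1], "legit"),
    ([0.2, 0.1, 0.1, 0.3], "legit"),
    ([0.1, 0.3, 0.2, 0.2], "legit"),
]

call_scores = {
    "voice_change": 0.85,
    "synthetic_voice": 0.75,
    "background_overlay": 0.2,
    "noise": 0.1,
}
label = knn_classify(TRAIN, fuse(call_scores))
```

Because the classifier sees the whole vector, it can learn that particular combinations of detector outputs are suspicious even when their average is unremarkable, which is the advantage over simple aggregation that the paragraph claims.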

[0049] The result of running one or more models at step (230) is an additional classification of the incoming audio call as involving fraudulent activity (210) or not (220).

[0050] FIG. 3 shows a particular case of performing a method (300) for protecting a subscriber (110) from fraudulent actions during incoming calls. When an incoming call is received at step (301) by the device of the user (110), for example, a smartphone (111), a synthetic outgoing audio stream is activated at step (302), which plays the role of a robotic interlocutor (bot) on the side of the user (110). Special software activates a predefined dialogue algorithm for the incoming audio call. This is needed in order to collect data and analyze the incoming call from the subscriber (120) for fraudulent activity. The synthetic audio track (audio stream) outgoing from the user (110) can be generated by cloning or synthesizing from a voice sample of the user (110). Various known solutions for generating audio data from given samples can also be used for this, for example, AI Voice Generator or similar solutions.

[0051] At step (303), the audio track of the incoming audio call captured by the bot passes through the processing steps of the method (200) described above. The software bot can be built on voice assistant technologies using machine learning models in order to capture incoming phrases and generate the corresponding voice responses. At step (304), the final classification of the incoming call takes place and a notification of the call status is generated for the user (110), for example, by displaying it on the screen of the smartphone (111). The bot can conduct the dialogue for a specified amount of time needed to classify the incoming call. The time range may vary depending on the dialogue of the subscriber (120), and also on whether one or more machine learning models are triggered when performing the classification method shown in FIG. 2A-2B and a definite judgment is reached, depending on the set threshold value for classifying the call type.
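The time-limited dialogue with a classification threshold can be sketched as a loop that stops as soon as the running estimate is decisive. This is a sketch under assumptions: the description gives no concrete thresholds or time limits, so the values below, the function name `classify_during_dialogue`, and the shape of the score stream are all hypothetical.

```python
def classify_during_dialogue(score_stream, fraud_threshold=0.9,
                             legit_threshold=0.1, max_seconds=60.0):
    """Stop the bot dialogue once the call can be classified.

    score_stream yields (elapsed_seconds, fraud_probability) pairs, where
    the probability is assumed to come from the models of method (200).
    Returns (label, elapsed_seconds_at_last_processed_score).
    """
    last_elapsed = 0.0
    for elapsed, fraud_probability in score_stream:
        if elapsed > max_seconds:
            break  # time budget for the bot dialogue is exhausted
        last_elapsed = elapsed
        if fraud_probability >= fraud_threshold:
            return "fraud", elapsed
        if fraud_probability <= legit_threshold:
            return "legitimate", elapsed
    return "undetermined", last_elapsed


# The estimate grows as the caller keeps talking; the loop stops at 15 s.
result = classify_during_dialogue(
    [(5.0, 0.4), (10.0, 0.7), (15.0, 0.93), (20.0, 0.99)]
)
```

An "undetermined" outcome would then be handled by whatever policy the deployment chooses, for example, handing the call over to the user with a neutral status notification.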

[0052] The claimed method can also be used to collect vector representations of fraudulent voice tracks, dialogue patterns, and other information that is accumulated and used for subsequent training of the machine learning models, as well as for building stop lists that identify fraudsters.
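A stop list built from stored vector representations could work roughly as sketched below: vectors of calls classified as fraudulent are saved, and a new call is flagged when its vector is sufficiently similar to a stored one. Cosine similarity and the 0.95 threshold are assumptions made for illustration; the description does not state which similarity measure is used, and the `StopList` class is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


class StopList:
    """Stores vectors of confirmed fraudulent calls for later comparison."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.vectors = []

    def add(self, vector):
        self.vectors.append(list(vector))

    def matches(self, vector):
        return any(
            cosine_similarity(stored, vector) >= self.threshold
            for stored in self.vectors
        )


stop_list = StopList()
stop_list.add([1.0, 0.0, 0.0])                # vector of a call marked fraudulent
known = stop_list.matches([0.99, 0.05, 0.0])  # nearly the same direction
unknown = stop_list.matches([0.0, 1.0, 0.0])  # orthogonal vector
```

A linear scan is enough for a sketch; a large stop list would normally use an approximate nearest-neighbor index instead.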

[0053] FIG. 4 shows a general view of a computing device (400) suitable for performing the methods (200, 300). The device (400) may be, for example, a server or another type of computing device that can be used to implement the claimed technical solution, including a smartphone, tablet, laptop, computer, etc. The device (400) may also be part of a cloud computing platform.

[0054] In general, the computing device (400) contains, connected by a common data exchange bus, one or more processors (401), memory such as RAM (402) and ROM (403), input/output interfaces (404), input/output devices (405), and a networking device (406).

[0055] The processor (401) (or several processors, or a multi-core processor) may be selected from the range of devices currently in wide use, for example, from Intel™, AMD™, Apple™, Samsung Exynos™, MediaTEK™, Qualcomm Snapdragon™, and the like. A graphics processor, for example, from Nvidia, AMD, Graphcore, etc., may also be used as the processor (401).

[0056] The RAM (402) is random access memory intended for storing machine-readable instructions, executable by the processor (401), for performing the necessary logical data processing operations. The RAM (402) typically contains the executable instructions of the operating system and of the associated software components (applications, program modules, etc.).

[0057] The ROM (403) is one or more persistent data storage devices, for example, a hard disk drive (HDD), a solid state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

[0058] Various types of I/O interfaces (404) are used to organize the operation of the components of the device (400) and of externally connected devices. The choice of interfaces depends on the specific implementation of the computing device; they may be, without limitation: PCI, AGP, PS/2, IrDa, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, Display Port, RJ45, RS232, etc.

[0059] To enable user interaction with the computing device (400), various I/O means (405) are used, for example, a keyboard, a display (monitor), a touch display, a touchpad, a joystick, a mouse, a light pen, a stylus, a touch panel, a trackball, speakers, a microphone, augmented reality devices, optical sensors, a tablet, indicator lights, a projector, a camera, biometric identification means (retina scanner, fingerprint scanner, voice recognition module), etc.

[0060] The networking means (406) enables the device (400) to transmit data over an internal or external computer network, for example, an intranet, the Internet, a LAN, etc. One or more of the means (406) may be, without limitation: an Ethernet card, a GSM modem, a GPRS modem, an LTE modem, a 5G modem, a satellite communication module, an NFC module, a Bluetooth and/or BLE module, a Wi-Fi module, etc.

[0061] Additionally, satellite navigation means may be used as part of the device (400), for example, GPS, GLONASS, BeiDou, Galileo.

[0062] The submitted application materials disclose preferred examples of implementing the technical solution and should not be construed as limiting other, particular embodiments that do not go beyond the scope of the requested legal protection and that are obvious to those skilled in the relevant art.

Claims

1. A computer-implemented method for analyzing dialogue during audio calls to detect fraudulent activity, performed by a processor and comprising the steps of:

- receiving an incoming audio stream from the calling party;

- converting the incoming audio stream into vector form;

- processing the converted audio stream using a first machine learning model, during which the vector form of the audio stream is compared with previously stored vectors characterizing fraudulent activity;

- transcribing the audio stream and subsequently processing it using a second machine learning model that analyzes the dialogue of the calling party, wherein said analysis covers:

the semantic composition of the information and the dialogue pattern, wherein the analysis of the dialogue pattern includes an analysis of the words used in the conversation, an analysis of the construction of phrases, and an analysis of the sequence in which phrases follow one another;

the presence and duration of pauses in the dialogue of the incoming audio stream;

- classifying the incoming audio stream based on the processing performed by the first and second machine learning models.

2. The method according to claim 1, characterized in that during the semantic analysis of the transcribed dialogue, words inherent in fraudulent activity are identified.

3. The method according to claim 1, characterized in that the incoming audio stream is additionally analyzed for at least one of: tonality, emotiveness, prosody, or combinations thereof.

4. The method according to claim 1, characterized in that the vector form of the incoming audio stream is analyzed for the presence of features selected from the group: voice change, synthetic voice formation, background audio stream overlay, or combinations thereof.

5. The method according to claim 1, characterized in that the outgoing audio stream is additionally analyzed.

6. The method according to claim 5, characterized in that the outgoing and incoming audio streams are separated.

7. The method according to claim 1, characterized in that at least one parameter of the incoming audio stream is additionally analyzed, selected from the group: pitch, sound intensity, speech intensity, word pronunciation duration, aspiration, glottalization, palatalization, consonant-to-vowel junction type, or combinations thereof.

8. The method according to claim 1, characterized in that the presence of extraneous noise in the incoming audio stream is additionally analyzed.

9. The method according to claim 1, characterized in that it is performed on the user's device, which is a smartphone, tablet or computer.

10. The method according to claim 9, characterized in that upon receipt of the incoming audio track, a synthetic outgoing voice audio stream is generated.

11. The method according to claim 10, characterized in that the generation of the outgoing audio stream is performed before the classification of the input audio track.

12. The method according to claim 10, characterized in that the generation of a synthetic audio stream is based on the voice sample of the user of the device.

13. The method according to claim 1, characterized in that when classifying an incoming audio stream as fraudulent, its vector representation is saved.

14. The method according to claim 12, characterized in that when the incoming audio stream is classified as fraudulent, a status message is generated and displayed on the display of the device.

15. A system for analyzing dialogue during audio calls to detect fraudulent activity, comprising at least one processor and at least one memory storing machine-readable instructions that, when executed by the processor, implement the method according to any one of claims 1-14.