CN112559724A - Method and system for preventing malicious search chat robot vulnerability - Google Patents

Method and system for preventing malicious search chat robot vulnerability Download PDF

Info

Publication number
CN112559724A
Authority
CN
China
Prior art keywords
class
user
features
conversation
monitoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110000300.XA
Other languages
Chinese (zh)
Other versions
CN112559724B (en)
Inventor
路林林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Suoxinda Data Technology Co ltd
Original Assignee
Shenzhen Suoxinda Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Suoxinda Data Technology Co ltd filed Critical Shenzhen Suoxinda Data Technology Co ltd
Priority to CN202110000300.XA priority Critical patent/CN112559724B/en
Publication of CN112559724A publication Critical patent/CN112559724A/en
Application granted granted Critical
Publication of CN112559724B publication Critical patent/CN112559724B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a system for preventing malicious searching for chat robot vulnerabilities, wherein the method comprises the following steps: receiving a dialogue request of a user; extracting personal information of the user from the dialogue request, and storing the personal information into a user identity database; monitoring the chatting process in real time; monitoring whether a specific event occurs within a certain number of conversation rounds; and adopting a corresponding conversation strategy for the user based on the monitoring result. The invention can identify clients who maliciously collect robot vulnerabilities and adopt a corresponding conversation strategy in time.

Description

Method and system for preventing malicious search chat robot vulnerability
Technical Field
The invention belongs to the field of computers, and particularly relates to a method and a system for preventing malicious searching for chat robot vulnerabilities.
Background
Chat robots (chatbots) are widely used natural language processing products in the business world. Many companies use chat robots to partially or even completely replace or assist human customer service. In the design of a chat robot, certain words are placed on a sensitive-word list; once a sensitive word appears in a conversation with a client, the chat robot generally deflects the client's question rather than answering it directly, replying for example "I cannot answer that" or "please switch to a human agent". These sensitive words are incorporated into the conversation design manually; they are numerous, difficult to cover completely, and easy to miss.
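For illustration only, a minimal sketch of such a sensitive-word gate is given below; the word list, the canned reply, and the function names are invented placeholders rather than any particular product's design.

```python
# Minimal sketch of a sensitive-word gate as described above.
# The word list and canned replies are illustrative placeholders.
SENSITIVE_WORDS = {"example_sensitive_word", "example_celebrity_name"}

def guarded_reply(user_utterance: str, generate_reply) -> str:
    """Deflect the question if it hits the sensitive list; otherwise fall back
    to the normal reply generator passed in by the caller."""
    if any(word in user_utterance.lower() for word in SENSITIVE_WORDS):
        return "Sorry, I cannot answer that. Please switch to a human agent."
    return generate_reply(user_utterance)
```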
However, through long, multi-round conversations with the robot, a conversant can elicit many obvious errors from the chat robot; the robot does not detect that the conversant is maliciously searching for its vulnerabilities, does not suspend the conversation, and is kept chatting by the conversant.
Disclosure of Invention
Aiming at the defects in the prior art, the method and system for preventing malicious searching for chat robot vulnerabilities provided by the invention can identify clients who maliciously search for robot vulnerabilities and adopt corresponding conversation strategies in time.
To this end, in a first aspect, the present invention provides a method for preventing a vulnerability of a malicious search chat robot, comprising the following steps:
receiving a dialogue request of a user;
extracting personal information of the user from the conversation request, and storing the personal information into a user identity database;
monitoring the chatting process in real time;
monitoring whether a specific event occurs within a certain number of conversation rounds;
and adopting a corresponding conversation strategy for the user based on the monitoring result.
Wherein the monitoring whether a specific event occurs within a certain number of conversation rounds comprises:
counting the number of occurrences of specific features within a certain number of conversation rounds.
Wherein the specific features comprise seven classes of features.
Wherein the specific features specifically include:
a first class of features, comprising: the number of occurrences of positive expressions querying opinions or preferences, such as "you like", "you love", and "you think";
a second class of features, comprising: the number of occurrences of negative expressions querying opinions or preferences, such as "you hate", "you dislike", and "hate";
a third class of features, comprising: the number of occurrences of sensitive words, star names, and celebrity names in the conversation;
a fourth class of features, comprising: the maximum number of consecutive occurrences of sensitive words, star names, and celebrity names in the conversation;
a fifth class of features, comprising: the cumulative number of times the chat robot fails to find an appropriate reply (e.g., the chat robot asks the client to consult a human agent or directly indicates that it does not understand);
a sixth class of features, comprising: the cumulative number of times the chat robot fails to find an appropriate reply divided by the duration of the conversant's single continuous conversation;
a seventh class of features, comprising: the percentage of the conversant's business-inquiry sentences among all chat sentences.
Further, the monitoring of whether a specific event occurs within a certain number of conversation rounds identifies the normal class and the malicious class by detecting abnormal samples.
Further, wherein the identifying normal and malicious classes by detecting anomalous samples comprises:
assuming $D$ represents the number of classes of the specific features, the conversation records of $n$ clients form $n$ sample data $X = \{x_1, x_2, \ldots, x_n\}$; the feature dimension of each sample is $D$, and the $i$-th sample is recorded as $x_i$, $i = 1, 2, \ldots, n$. Each discrete feature of each sample is dithered (a small random number is added to it), and the dithered data set is recorded as $X'$.
Step 1, $d$ of the $D$ features are selected, giving $m = \binom{D}{d}$ combinations in total; accordingly, the original data set is divided into $m$ subsets, denoted as $X'_j$, wherein $j = 1, 2, \ldots, m$.
Step 2, in each subset $X'_j$, the anomaly score $s_{ij}$ of each sample $x_i$ is calculated.
Step 3, the total anomaly score of each sample is calculated as
$$\mathrm{score}(x_i) = \frac{1}{m} \sum_{j=1}^{m} s_{ij},$$
wherein $m = \binom{D}{d}$, $i = 1, 2, \ldots, n$, and $s_{ij} \in [0, 1]$; as a result, $\mathrm{score}(x_i) \in [0, 1]$.
Step 4, a threshold is set as $\theta$; the set of outliers is then
$$A = \{\, x_i \mid \mathrm{score}(x_i) > Q_\theta \,\},$$
wherein $i = 1, 2, \ldots, n$, and $Q_\theta$ represents the $\theta$ quantile of the set $\{\mathrm{score}(x_1), \ldots, \mathrm{score}(x_n)\}$.
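As a worked instance of step 1 and step 3 above (the numbers are chosen purely for illustration): with $D = 7$ feature classes and $d = 2$ features selected per subset,
$$m = \binom{7}{2} = 21, \qquad \mathrm{score}(x_i) = \frac{1}{21} \sum_{j=1}^{21} s_{ij} \in [0, 1].$$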
Further, wherein step 2 comprises:
(1) one of the $d$ features of $X'_j$ is randomly selected, and a value of that feature is randomly selected as a boundary for segmentation, dividing the data set $X'_j$ into 2 classes; the number of data points in each class is recorded and stored in a list $c_1$. Then a feature is randomly selected again, a value of that feature is randomly selected as a boundary for segmentation, and $X'_j$ is divided into 4 classes; the number of data points in each class is recorded as $c_2$. This is iterated until every data point is classified into its own class, wherein $c_k$ denotes the set of class sizes after the $k$-th segmentation;
(2) starting from the first segmentation, if in $c_k$ the class occupied by $x_i$ has size 1, i.e. $x_i$ forms a class by itself, then the total number of segmentations of $x_i$ in $X'_j$ is recorded as $p_i = k$; the segmentation counts of all samples are then recorded as $p = (p_1, p_2, \ldots, p_n)$, wherein $p_{\min} = \min_i p_i$ and $p_{\max} = \max_i p_i$; the anomaly score of $x_i$ is then
$$s_{ij} = 1 - \frac{p_i - p_{\min}}{p_{\max} - p_{\min}}.$$
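The following Python sketch illustrates the random-partition scoring of step 2 described above; it returns the per-subset scores corresponding to $s_{ij}$. All function and variable names are ours, and details the text leaves open (for example, drawing the boundary uniformly between the feature's observed minimum and maximum, and capping the number of splits) are assumptions rather than the patent's prescription.

```python
import numpy as np
from collections import Counter

def subset_anomaly_scores(S: np.ndarray, rng=None, max_splits: int = 10_000) -> np.ndarray:
    """Random-partition anomaly scores for one feature subset S (n samples x d features).

    Each split picks a random feature and a random boundary value; a sample's class
    label is the tuple of its side-of-boundary decisions so far.  p[i] is the split
    count after which sample i is alone in its class, and the score is
    s_i = 1 - (p_i - p_min) / (p_max - p_min), so fewer splits -> higher score.
    """
    rng = rng or np.random.default_rng()
    n, d = S.shape
    labels = [() for _ in range(n)]   # class label = history of split decisions
    p = [None] * n                    # split count at which each sample is isolated
    k = 0
    while any(v is None for v in p) and k < max_splits:
        k += 1
        f = int(rng.integers(d))                              # random feature
        boundary = rng.uniform(S[:, f].min(), S[:, f].max())  # random split value
        labels = [lab + (bool(S[i, f] > boundary),) for i, lab in enumerate(labels)]
        sizes = Counter(labels)
        for i, lab in enumerate(labels):
            if p[i] is None and sizes[lab] == 1:              # sample i is now isolated
                p[i] = k
    p = np.array([v if v is not None else max_splits for v in p], dtype=float)
    if p.max() == p.min():            # degenerate case: everyone isolated at the same step
        return np.zeros(n)
    return 1.0 - (p - p.min()) / (p.max() - p.min())
```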
Further, the adopting, based on the monitoring result, a corresponding dialogue strategy for the user includes:
if the behavior is determined to be the behavior of searching for chat robot vulnerabilities, the user information corresponding to the samples in the outlier set $A$ is submitted, and the conversation is transferred to manual processing.
Further, the personal information includes one or more of the user's IP address, device code, WeChat ID, or QQ number.
In a second aspect, the present invention further provides a system for preventing a vulnerability of a malicious search chat robot, which implements the method described above, and includes:
a request receiving module, for receiving a dialogue request of a user;
an information extraction module, for extracting personal information of the user from the dialogue request and storing the personal information into a user identity database;
a real-time monitoring module, for monitoring the chatting process in real time;
an event judging module, for monitoring whether a specific event occurs within a certain number of conversation rounds; and
a strategy selection module, for adopting a corresponding conversation strategy for the user based on the monitoring result.
Compared with the prior art, the chat robot of the invention is more intelligent: it can identify clients who maliciously collect chat robot vulnerabilities and stop the chat robot's replies in time, thereby avoiding the public-opinion harm caused by repeated wrong answers and protecting the company or platform that uses the chat robot.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 is a flow chart illustrating a method for preventing vulnerability of a malicious search chat robot according to an embodiment of the present invention;
FIG. 2 is a flow diagram illustrating analysis of a data set according to an embodiment of the invention; and
fig. 3 is a block diagram illustrating a system for preventing a vulnerability of a malicious search chat robot according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present invention to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could also be referred to as a second element, and similarly, a second element could also be referred to as a first element, without departing from the scope of embodiments of the present invention.
Alternative embodiments of the present invention are described in detail below with reference to the accompanying drawings.
In the conversation design of the chat robot, a function of identifying conversants who maliciously collect the robot's errors is added. This functional module extracts features from the conversation between the conversant and the robot and marks malicious conversants using an unsupervised machine learning method. It then returns the conversant's information (IP address, device code, etc.) to the module that generates content replies. Once a malicious conversant is identified, the chat robot's subsequent replies to that conversant are set to "contact a human agent" or other replies that divert the conversant's attention, instead of being generated in the usual chat-content manner.
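Purely as an illustration of the reply switch described above, the sketch below shows one way the identification result could gate the reply generator; the class name, method names, and reply wording are assumptions, not the patent's exact design.

```python
# Illustrative reply-policy switch: once a conversant is flagged as malicious,
# subsequent replies divert attention instead of generating chat content as usual.
class ReplyPolicy:
    def __init__(self, generate_reply):
        self.generate_reply = generate_reply   # normal chat-content generator
        self.flagged = set()                   # identifiers of flagged malicious conversants

    def flag_malicious(self, user_id: str, info: dict) -> None:
        """Record a conversant marked by the anomaly detector, together with the
        information (IP address, device code, ...) returned to the reply module."""
        self.flagged.add(user_id)
        # in a full system `info` would also be submitted for manual handling

    def reply(self, user_id: str, utterance: str) -> str:
        if user_id in self.flagged:
            return "Please contact a human agent for further assistance."
        return self.generate_reply(utterance)
```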
Embodiment One,
Referring to fig. 1, the invention discloses a method for preventing a vulnerability of a malicious search chat robot, comprising the following steps:
receiving a dialogue request of a user;
extracting personal information of the user from the conversation request, and storing the personal information into a user identity database; the personal information includes one or more of the user's IP address, device code, WeChat ID, or QQ number (a sketch of this step follows the list);
monitoring the chatting process in real time;
monitoring whether a specific event occurs within a certain number of conversation rounds;
and adopting a corresponding conversation strategy for the user based on the monitoring result.
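A hypothetical sketch of the information-extraction and storage step above; the request field names and the in-memory stand-in for the user identity database are illustrative assumptions.

```python
# Illustrative extraction of personal information from a dialogue request and
# storage into a (here, in-memory) user identity database.
from dataclasses import dataclass, asdict

@dataclass
class UserIdentity:
    ip_address: str | None = None
    device_code: str | None = None
    wechat_id: str | None = None
    qq_number: str | None = None

def extract_and_store(request: dict, identity_db: dict) -> UserIdentity:
    """Pull the personal fields out of a dialogue request and key them by session id."""
    identity = UserIdentity(
        ip_address=request.get("ip"),
        device_code=request.get("device_code"),
        wechat_id=request.get("wechat_id"),
        qq_number=request.get("qq"),
    )
    identity_db[request.get("session_id", "unknown")] = asdict(identity)
    return identity
```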
Embodiment Two,
On the basis of the above embodiments, the embodiments of the present invention may include the following:
after the chatting process is monitored in real time, the monitoring of whether a specific event occurs within a certain number of conversation rounds comprises:
counting the number of occurrences of specific features within a certain number of conversation rounds.
In one application scenario, the specific features in the embodiment of the present invention include seven classes of features. Further, the specific features include the following (a counting sketch is given after the list):
the first class of features includes: the number of occurrences of positive expressions querying opinions or preferences, such as "you like", "you love", and "you think";
the second class of features includes: the number of occurrences of negative expressions querying opinions or preferences, such as "you hate", "you dislike", and "hate";
the third class of features includes: the number of occurrences of sensitive words, star names, and celebrity names in the conversation;
the fourth class of features includes: the maximum number of consecutive occurrences of sensitive words, star names, and celebrity names in the conversation;
the fifth class of features includes: the cumulative number of times the chat robot fails to find an appropriate reply (e.g., the chat robot asks the client to consult a human agent or directly indicates that it does not understand);
the sixth class of features includes: the frequency with which the chat robot fails to find an appropriate reply, i.e. the cumulative number of such failures divided by the duration of the conversant's continuous conversation;
the seventh class of features includes: the percentage of the conversant's business-inquiry content among all chat content.
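One possible way to count these seven feature classes from a single client's chat log is sketched below; the keyword lists, the business-inquiry terms, and the log format are illustrative assumptions, not part of the patent.

```python
# Hypothetical counting of the seven feature classes from one client's chat log.
# Keyword lists and the business-inquiry terms would be tuned to the deployment.
POSITIVE_PROBES = ("you like", "you love", "you think")
NEGATIVE_PROBES = ("you hate", "you dislike", "hate")
SENSITIVE_TERMS = ("example_sensitive_word", "example_star_name")
FALLBACK_REPLIES = ("please consult a human agent", "i do not understand")
BUSINESS_TERMS = ("account", "order", "refund")

def _hit(text, terms):
    return any(term in text.lower() for term in terms)

def _max_consecutive(flags):
    """Longest run of consecutive True values."""
    best = run = 0
    for f in flags:
        run = run + 1 if f else 0
        best = max(best, run)
    return best

def extract_features(turns, duration_seconds):
    """turns: list of (speaker, text) pairs, speaker in {'user', 'bot'}."""
    user_texts = [t for s, t in turns if s == "user"]
    bot_texts = [t for s, t in turns if s == "bot"]
    sensitive_flags = [_hit(t, SENSITIVE_TERMS) for t in user_texts]
    f1 = sum(_hit(t, POSITIVE_PROBES) for t in user_texts)   # class 1
    f2 = sum(_hit(t, NEGATIVE_PROBES) for t in user_texts)   # class 2
    f3 = sum(sensitive_flags)                                # class 3
    f4 = _max_consecutive(sensitive_flags)                   # class 4
    f5 = sum(_hit(t, FALLBACK_REPLIES) for t in bot_texts)   # class 5
    f6 = f5 / duration_seconds if duration_seconds else 0.0  # class 6
    f7 = (sum(_hit(t, BUSINESS_TERMS) for t in user_texts)
          / len(user_texts)) if user_texts else 0.0          # class 7
    return [f1, f2, f3, f4, f5, f6, f7]
```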
Further, the monitoring of whether a specific event occurs within a certain number of conversation rounds identifies the normal class and the malicious class by detecting abnormal samples.
Further, wherein the identifying normal and malicious classes by detecting anomalous samples comprises:
assuming $D$ represents the number of classes of the specific features, the conversation records of $n$ clients form $n$ sample data $X = \{x_1, x_2, \ldots, x_n\}$; the feature dimension of each sample is $D$, and the $i$-th sample is recorded as $x_i$, $i = 1, 2, \ldots, n$. Each discrete feature of each sample is dithered (a small random number is added to it), and the dithered data set is recorded as $X'$.
Step 1, $d$ of the $D$ features are selected, giving $m = \binom{D}{d}$ combinations in total; accordingly, the original data set is divided into $m$ subsets, denoted as $X'_j$, wherein $j = 1, 2, \ldots, m$.
Step 2, in each subset $X'_j$, the anomaly score $s_{ij}$ of each sample $x_i$ is calculated.
Step 3, the total anomaly score of each sample is calculated as
$$\mathrm{score}(x_i) = \frac{1}{m} \sum_{j=1}^{m} s_{ij},$$
wherein $m = \binom{D}{d}$, $i = 1, 2, \ldots, n$, and $s_{ij} \in [0, 1]$; as a result, $\mathrm{score}(x_i) \in [0, 1]$.
Step 4, a threshold is set as $\theta$; the set of outliers is then
$$A = \{\, x_i \mid \mathrm{score}(x_i) > Q_\theta \,\},$$
wherein $i = 1, 2, \ldots, n$, and $Q_\theta$ represents the $\theta$ quantile of the set $\{\mathrm{score}(x_1), \ldots, \mathrm{score}(x_n)\}$.
Further, wherein step 2 comprises:
(1) one of the $d$ features of $X'_j$ is randomly selected, and a value of that feature is randomly selected as a boundary for segmentation, dividing the data set $X'_j$ into 2 classes; the number of data points in each class is recorded and stored in a list $c_1$. Then a feature is randomly selected again, a value of that feature is randomly selected as a boundary for segmentation, and $X'_j$ is divided into 4 classes; the number of data points in each class is recorded as $c_2$. This is iterated until every data point is classified into its own class, wherein $c_k$ denotes the set of class sizes after the $k$-th segmentation;
(2) starting from the first segmentation, if in $c_k$ the class occupied by $x_i$ has size 1, i.e. $x_i$ forms a class by itself, then the total number of segmentations of $x_i$ in $X'_j$ is recorded as $p_i = k$; the segmentation counts of all samples are then recorded as $p = (p_1, p_2, \ldots, p_n)$, wherein $p_{\min} = \min_i p_i$ and $p_{\max} = \max_i p_i$; the anomaly score of $x_i$ is then
$$s_{ij} = 1 - \frac{p_i - p_{\min}}{p_{\max} - p_{\min}}.$$
Further, the adopting, based on the monitoring result, a corresponding dialogue strategy for the user includes:
if the behavior is determined to be the behavior of searching for chat robot vulnerabilities, the user information corresponding to the samples in the outlier set $A$ is submitted, and the conversation is transferred to manual processing.
Further, the personal information includes one or more of the user's IP address, device code, WeChat ID, or QQ number.
After monitoring whether a specific event occurs within a certain number of conversation rounds, the embodiment of the invention adopts a corresponding conversation strategy for the user based on the monitoring result, comprising the following:
if the behavior is determined to be the behavior of searching for chat robot vulnerabilities, the user information is submitted and the conversation is transferred to manual processing.
Embodiment Three,
On the basis of the above embodiments, the embodiments of the present invention may further include the following:
in an application scenario, after the data set of 7 features is acquired, the next step is to find the outliers, that is, the malicious clients; in other words, the data set formed by the 7 features within a certain number of conversation rounds over a certain period is analyzed, as shown in fig. 2, which may specifically include:
1. Given $n$ sample data $X = \{x_1, x_2, \ldots, x_n\}$ with feature dimension 7, the $i$-th sample is recorded as $x_i$, $i = 1, 2, \ldots, n$. Each discrete feature of each sample is dithered, i.e. a small random number within a given interval is added to it; the dithered data set is recorded as $X'$. The purpose of dithering is to prevent data points from overlapping. Here, $D = 7$.
2. In $X'$, $d$ of the features are selected, giving $m = \binom{D}{d}$ combinations in total; accordingly, the original data set is divided into $m$ subsets, denoted as $X'_j$, wherein $j = 1, 2, \ldots, m$.
3. In each subset $X'_j$, the anomaly score $s_{ij}$ of each sample $x_i$ is calculated:
A. One of the $d$ features of $X'_j$ is randomly selected, and a boundary is randomly selected for segmentation, dividing the data set $X'_j$ into 2 classes; the number of data points in each class is recorded as $c_1$. Then a feature is randomly extracted again, a boundary is randomly selected for segmentation, and $X'_j$ is divided into 4 classes; the number of data points in each class is recorded as $c_2$. This is iterated until every data point is classified into its own class, wherein $c_k$ denotes the set of class sizes after the $k$-th segmentation.
B. The anomaly score $s_{ij}$ of sample $x_i$ is calculated: starting from the first segmentation, if in $c_k$ the class occupied by $x_i$ has size 1, i.e. $x_i$ forms a class by itself, then the total number of segmentations of $x_i$ in $X'_j$ is recorded as $p_i = k$; the segmentation counts of all samples are recorded as $p = (p_1, p_2, \ldots, p_n)$, wherein $p_{\min} = \min_i p_i$ and $p_{\max} = \max_i p_i$; the anomaly score of $x_i$ is then
$$s_{ij} = 1 - \frac{p_i - p_{\min}}{p_{\max} - p_{\min}}.$$
4. The total anomaly score of each sample is calculated as
$$\mathrm{score}(x_i) = \frac{1}{m} \sum_{j=1}^{m} s_{ij},$$
wherein $m = \binom{D}{d}$, $i = 1, 2, \ldots, n$, and $s_{ij} \in [0, 1]$; as a result, $\mathrm{score}(x_i) \in [0, 1]$.
5. A threshold is set as $\theta$; generally 95% may be selected. The set of outliers is then
$$A = \{\, x_i \mid \mathrm{score}(x_i) > Q_\theta \,\},$$
wherein $i = 1, 2, \ldots, n$, and $Q_\theta$ represents the $\theta$ quantile of the set $\{\mathrm{score}(x_1), \ldots, \mathrm{score}(x_n)\}$.
Specifically, the data set according to the embodiment of the present invention originally has D (here 7) features, and d features are selected at a time (d may be chosen as 2, 3, 4, etc.); each selection forms a subset of the original data set in which the number of samples is unchanged but there are fewer features.
In each subset, one feature is selected at a time and a split point for that feature is selected, for segmentation. For example, the feature "age" is selected and 45 is taken as the split point: samples with age < 45 are classified into one class, and samples with age ≥ 45 into another.
After many segmentations, each point ends up in a class of its own. The logic here is: normal points are densely distributed, while outliers lie far from the normal points. An outlier can therefore be isolated into its own class after only a few segmentations, whereas a normal point, surrounded by many other points, requires many segmentations. The fewer segmentations required, the greater the probability that the sample is anomalous.
The number of segmentations $p_i$ is normalized as $(p_i - p_{\min})/(p_{\max} - p_{\min})$, so that the normalized values all lie between 0 and 1: a value of 0 corresponds to the sample requiring the minimum number of segmentations, and a value of 1 to the maximum.
Further, the anomaly score = 1 − the normalized segmentation count. A larger anomaly score should indicate a more anomalous sample, but after normalization a smaller value indicates a more anomalous sample, since fewer segmentations are required; the embodiment of the present invention therefore uses 1 minus the normalized segmentation count as the anomaly score.
Through the above operation, the abnormality score is calculated for each point, and the value range of the abnormality score is 0 to 1. The closer it is to 0, the more normal it is. The closer to 1, the more abnormal.
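For a small worked instance of this normalization (the numbers are invented for illustration): suppose three samples in one subset are isolated after $p = (3, 10, 12)$ segmentations, so $p_{\min} = 3$ and $p_{\max} = 12$. The anomaly scores are then $s_1 = 1 - \frac{3-3}{12-3} = 1$, $s_2 = 1 - \frac{10-3}{12-3} \approx 0.22$, and $s_3 = 1 - \frac{12-3}{12-3} = 0$; the sample isolated after the fewest segmentations receives the highest (most anomalous) score.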
The embodiment of the invention further aggregates each sample's anomaly scores over the m subsets. Specifically, the anomaly scores are summed and divided by m; since the sum lies within [0, m], dividing by m normalizes it.
The total abnormal score of each sample is between 0 and 1 through the normalization process. And the closer to 1, the more abnormal.
A threshold is set, and the quantile of the anomaly scores corresponding to that threshold is found. For example, if the threshold is set to 95%, all samples whose anomaly scores fall within the lowest 95% are treated as normal. Assuming there are 100 points and the anomaly scores of 95 of them are all less than a certain value, then that value is the 95% quantile, and the remaining points whose anomaly scores are greater than this value are the outliers.
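Putting the pieces together, the sketch below forms the feature subsets, averages the per-subset scores, and applies the quantile threshold. The per-subset scorer is passed in as a parameter (for example, the random-partition scorer sketched earlier for step 2); the stand-in scorer in the demo and all names are ours.

```python
import numpy as np
from itertools import combinations

def detect_outliers(X: np.ndarray, d: int, subset_scorer, threshold: float = 0.95):
    """X: n x D feature matrix (after dithering); d: features per subset;
    subset_scorer: callable mapping an (n x d) array to n per-subset scores in [0, 1].
    Returns indices of samples whose averaged score exceeds the `threshold` quantile."""
    n, D = X.shape
    feature_sets = list(combinations(range(D), d))      # m = C(D, d) subsets
    total = np.zeros(n)
    for fs in feature_sets:
        total += subset_scorer(X[:, list(fs)])
    score = total / len(feature_sets)                   # average over the m subsets, in [0, 1]
    cutoff = np.quantile(score, threshold)              # e.g. the 95% quantile
    return np.where(score > cutoff)[0], score

# Demo: n clients, D = 7 features, subsets of d = 2 features.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 7))
    X[:3] += 6                                          # three artificially abnormal clients

    def dummy_scorer(S):
        # stand-in for the random-partition scorer: distance from the subset mean,
        # rescaled to [0, 1] so that far-away (abnormal) points score near 1
        dist = np.linalg.norm(S - S.mean(axis=0), axis=1)
        return (dist - dist.min()) / (dist.max() - dist.min() + 1e-12)

    outliers, _ = detect_outliers(X, d=2, subset_scorer=dummy_scorer)
    print(outliers)
```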
Embodiment Four,
As shown in fig. 3, the present invention also provides a system for preventing a vulnerability of a malicious search chat robot, which includes:
a request receiving module, for receiving a dialogue request of a user;
an information extraction module, for extracting personal information of the user from the dialogue request and storing the personal information into a user identity database;
a real-time monitoring module, for monitoring the chatting process in real time;
an event judging module, for monitoring whether a specific event occurs within a certain number of conversation rounds; and
a strategy selection module, for adopting a corresponding conversation strategy for the user based on the monitoring result.
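Purely as an illustration of how these five modules might be wired together (class and method names are assumptions, and the module implementations are injected by the caller):

```python
# Illustrative wiring of the five modules named above into one request pipeline.
class AntiProbeChatSystem:
    def __init__(self, extractor, monitor, judge, policy):
        self.extractor = extractor      # information extraction module
        self.monitor = monitor          # real-time monitoring module
        self.judge = judge              # event judging module
        self.policy = policy            # strategy selection module

    def handle_request(self, request: dict) -> str:
        """Request receiving module: drives one dialogue turn through the pipeline."""
        identity = self.extractor.extract_and_store(request)
        self.monitor.record(identity, request["utterance"])
        event_detected = self.judge.check(identity)   # specific event in recent rounds?
        return self.policy.respond(identity, request["utterance"], event_detected)
```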
Embodiment Five,
The disclosed embodiments provide a non-volatile computer storage medium having stored thereon computer-executable instructions that may perform the method steps as described in the embodiments above.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of an element does not in some cases constitute a limitation on the element itself.
The foregoing describes preferred embodiments of the present invention; it is intended to illustrate, not to limit, the spirit and scope of the invention, which includes all modifications, substitutions, and alterations falling within the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for preventing malicious searching for chat robot vulnerabilities, characterized by comprising the following steps:
receiving a dialogue request of a user;
extracting personal information of a user from the dialogue request, and storing the personal information into a user identity database, wherein the personal information comprises one or more of the user's IP address, device code, WeChat ID, or QQ number;
monitoring the chatting process in real time;
monitoring whether a specific event occurs within a certain number of conversation rounds;
and adopting a corresponding conversation strategy for the user based on the monitoring result.
2. The method of claim 1, wherein said monitoring whether a specific event occurs within a certain number of conversation rounds comprises:
counting the number of occurrences, or the frequency, of specific features within a certain number of conversation rounds.
3. The method of claim 2, wherein the specific features comprise seven classes of features.
4. The method according to claim 3, wherein the specific feature specifically comprises:
a first class of features, comprising: the number of occurrences of positive expressions querying opinions or preferences;
a second class of features, comprising: the number of occurrences of negative expressions querying opinions or preferences;
a third class of features, comprising: the number of occurrences of sensitive words, star names, and celebrity names in the conversation;
a fourth class of features, comprising: the maximum number of consecutive occurrences of sensitive words, star names, and celebrity names in the conversation;
a fifth class of features, comprising: the cumulative number of times the chat robot fails to find an appropriate reply;
a sixth class of features, comprising: the frequency with which the chat robot fails to find an appropriate reply;
a seventh class of features, comprising: the percentage of the conversant's business-inquiry sentences among all chat sentences.
5. The method of claim 1, wherein the monitoring of whether a specific event occurs within a certain number of conversation rounds identifies the normal class and the malicious class by detecting abnormal samples.
6. The method of claim 5, wherein said identifying a normal class and a malicious class by detecting anomalous samples comprises:
assuming $D$ represents the number of classes of the specific features, the conversation records of $n$ clients form $n$ sample data $X = \{x_1, x_2, \ldots, x_n\}$; the feature dimension of each sample is $D$, and the $i$-th sample is recorded as $x_i$, $i = 1, 2, \ldots, n$; each discrete feature of each sample is dithered (a small random number is added to it), and the dithered data set is recorded as $X'$;
step 1, $d$ of the $D$ features are randomly selected, giving $m = \binom{D}{d}$ combinations in total; accordingly, the data set is divided into $m$ subsets, denoted as $X'_j$, wherein $j = 1, 2, \ldots, m$;
step 2, in each subset $X'_j$, the anomaly score $s_{ij}$ of each sample $x_i$ is calculated;
step 3, the total anomaly score of each sample is calculated as
$$\mathrm{score}(x_i) = \frac{1}{m} \sum_{j=1}^{m} s_{ij},$$
wherein $m = \binom{D}{d}$, $i = 1, 2, \ldots, n$, and $s_{ij} \in [0, 1]$; as a result, $\mathrm{score}(x_i) \in [0, 1]$;
step 4, a threshold is set as $\theta$; the set of outliers is then
$$A = \{\, x_i \mid \mathrm{score}(x_i) > Q_\theta \,\},$$
wherein $i = 1, 2, \ldots, n$, and $Q_\theta$ represents the $\theta$ quantile of the set $\{\mathrm{score}(x_1), \ldots, \mathrm{score}(x_n)\}$.
7. The method of claim 6, wherein step 2 comprises:
one of the $d$ features of $X'_j$ is randomly selected, and a value of that feature is randomly selected as a boundary for segmentation, dividing the data set $X'_j$ into 2 classes; the number of data points in each class is recorded and stored in a list $c_1$; then a feature is randomly selected again, a value of that feature is randomly selected as a boundary for segmentation, and $X'_j$ is divided into 4 classes; the number of data points in each class is recorded as $c_2$; this is iterated until every data point is classified into its own class, wherein $c_k$ denotes the set of class sizes after the $k$-th segmentation.
8. The method of claim 7, wherein said step 2 further comprises:
starting from the first segmentation, if in $c_k$ the class occupied by $x_i$ has size 1, i.e. $x_i$ forms a class by itself, then the total number of segmentations of $x_i$ in $X'_j$ is recorded as $p_i = k$; the segmentation counts of all samples are then recorded as $p = (p_1, p_2, \ldots, p_n)$, wherein $p_{\min} = \min_i p_i$ and $p_{\max} = \max_i p_i$; the anomaly score of $x_i$ is then
$$s_{ij} = 1 - \frac{p_i - p_{\min}}{p_{\max} - p_{\min}}.$$
9. The method of claim 1, wherein said employing a corresponding dialog strategy to the user based on the monitoring results comprises:
if the behavior is determined to be the behavior of searching for chat robot vulnerabilities, the user information corresponding to the samples in the outlier set $A$ is submitted, and the conversation is transferred to manual processing.
10. A system for preventing malicious searching for chat robot vulnerabilities, which implements the method according to any one of claims 1 to 9, and comprises:
a request receiving module, for receiving a dialogue request of a user;
an information extraction module, for extracting personal information of the user from the dialogue request and storing the personal information into a user identity database;
a real-time monitoring module, for monitoring the chatting process in real time;
an event judging module, for monitoring whether a specific event occurs within a certain number of conversation rounds; and
a strategy selection module, for adopting a corresponding conversation strategy for the user based on the monitoring result.
CN202110000300.XA 2021-01-02 2021-01-02 Method and system for preventing malicious search chat robot vulnerability Active CN112559724B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110000300.XA CN112559724B (en) 2021-01-02 2021-01-02 Method and system for preventing malicious search chat robot vulnerability

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110000300.XA CN112559724B (en) 2021-01-02 2021-01-02 Method and system for preventing malicious search chat robot vulnerability

Publications (2)

Publication Number Publication Date
CN112559724A true CN112559724A (en) 2021-03-26
CN112559724B CN112559724B (en) 2021-06-22

Family

ID=75035125

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110000300.XA Active CN112559724B (en) 2021-01-02 2021-01-02 Method and system for preventing malicious search chat robot vulnerability

Country Status (1)

Country Link
CN (1) CN112559724B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106559321A (en) * 2016-12-01 2017-04-05 竹间智能科技(上海)有限公司 The method and system of dynamic adjustment dialog strategy
CN110096578A (en) * 2019-04-08 2019-08-06 厦门快商通信息咨询有限公司 A kind of brush amount user identification method, device and the server of intelligent customer service
CN111488577A (en) * 2019-01-29 2020-08-04 北京金睛云华科技有限公司 Vulnerability exploiting method and device based on artificial intelligence
CN111653262A (en) * 2020-08-06 2020-09-11 上海荣数信息技术有限公司 Intelligent voice interaction system and method
CN111966799A (en) * 2020-07-27 2020-11-20 厦门快商通科技股份有限公司 Intelligent customer service method, customer service robot, computer equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106559321A (en) * 2016-12-01 2017-04-05 竹间智能科技(上海)有限公司 The method and system of dynamic adjustment dialog strategy
CN111488577A (en) * 2019-01-29 2020-08-04 北京金睛云华科技有限公司 Vulnerability exploiting method and device based on artificial intelligence
CN110096578A (en) * 2019-04-08 2019-08-06 厦门快商通信息咨询有限公司 A kind of brush amount user identification method, device and the server of intelligent customer service
CN111966799A (en) * 2020-07-27 2020-11-20 厦门快商通科技股份有限公司 Intelligent customer service method, customer service robot, computer equipment and storage medium
CN111653262A (en) * 2020-08-06 2020-09-11 上海荣数信息技术有限公司 Intelligent voice interaction system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LING Jie et al.: "A Survey of Edge Computing Security Technologies", Big Data (《大数据》) *

Also Published As

Publication number Publication date
CN112559724B (en) 2021-06-22

Similar Documents

Publication Publication Date Title
CN110209790B (en) Question-answer matching method and device
CN109660533B (en) Method and device for identifying abnormal flow in real time, computer equipment and storage medium
US10635521B2 (en) Conversational problem determination based on bipartite graph
US20200125896A1 (en) Malicious software recognition apparatus and method
WO2022048170A1 (en) Method and apparatus for conducting human-machine conversation, computer device, and storage medium
US20190295098A1 (en) Performing Real-Time Analytics for Customer Care Interactions
CN113379301A (en) Method, device and equipment for classifying users through decision tree model
CN112667792B (en) Man-machine dialogue data processing method and device, computer equipment and storage medium
CN114860742A (en) Artificial intelligence-based AI customer service interaction method, device, equipment and medium
CN113850077A (en) Topic identification method, device, server and medium based on artificial intelligence
CN111510566B (en) Method and device for determining call label, computer equipment and storage medium
CN112559724B (en) Method and system for preventing malicious search chat robot vulnerability
CN113657773A (en) Method and device for testing speech technology, electronic equipment and storage medium
CN115374793B (en) Voice data processing method based on service scene recognition and related device
CN110674839B (en) Abnormal user identification method and device, storage medium and electronic equipment
US20220246153A1 (en) System and method for detecting fraudsters
CN110990554B (en) Content processing method, device, electronic equipment and medium
US11064072B1 (en) Caller state determination
CN110909538A (en) Question and answer content identification method and device, terminal equipment and medium
CN112148864B (en) Voice interaction method and device, computer equipment and storage medium
CN114884740B (en) AI-based intrusion protection response data processing method and server
CN114756401B (en) Abnormal node detection method, device, equipment and medium based on log
CN112800321B (en) Ambiguous post identification method based on keyword retrieval and computer equipment
CN110826339B (en) Behavior recognition method, behavior recognition device, electronic equipment and medium
WO2024027127A1 (en) Fault detection method and apparatus, and electronic device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant