CN116484230A - Method for identifying abnormal business data and training method of AI digital person - Google Patents


Info

Publication number
CN116484230A
Authority
CN
China
Prior art keywords
data
business
service
dimension
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310735268.9A
Other languages
Chinese (zh)
Other versions
CN116484230B (en)
Inventor
李伟
王英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4u Beijing Technology Co ltd
Original Assignee
4u Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4u Beijing Technology Co ltd filed Critical 4u Beijing Technology Co ltd
Priority to CN202310735268.9A priority Critical patent/CN116484230B/en
Publication of CN116484230A publication Critical patent/CN116484230A/en
Application granted granted Critical
Publication of CN116484230B publication Critical patent/CN116484230B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The application provides a method for identifying abnormal business data and a training method for AI digital persons. The method for identifying abnormal business data comprises: acquiring a plurality of business data and extracting business scene features from the plurality of business data; performing cluster analysis on the business scene features using a clustering algorithm to obtain a plurality of business scene feature classes; determining a dimension matrix of each of the plurality of business scene feature classes based on the business data under that class; and identifying, based on the dimension matrix, whether abnormal business data exists in the business scene feature class corresponding to the dimension matrix, and rejecting the abnormal business data if it exists. The application addresses the technical problem in the prior art that abnormal data in the business data used to train an AI digital person causes the trained AI digital person to respond inaccurately.

Description

Method for identifying abnormal business data and training method of AI digital person
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for identifying abnormal service data, and a training method and an apparatus for AI digital persons.
Background
AI digital persons are virtual characters created using artificial intelligence techniques that are highly realistic in appearance, motion, and speech. Through AI algorithms and techniques, AI digital persons can simulate the appearance, behavior, and manner of communication of humans, making them visually and audibly indistinguishable from real humans.
An AI digital person can serve as a digital employee of an enterprise, for example as a professional customer service agent, a front-desk administrator, or a sales host, providing services such as content distribution, brand marketing, and sales conversion. It can be deployed in various terminal scenarios, such as PC, app, mini program, and VR/MR, to meet the diverse needs of different industries, improve data interaction capability, and help enterprises grow their marketing.
However, although current AI digital person interaction technology uses machine learning algorithms and natural language processing so that AI digital persons can understand and respond to users' questions or interactions, this interaction capability is usually based on training a large model such as ChatGPT, and its ability to give enterprise-specific responses is limited.
To address this, the prior art provides a technical scheme that trains an enterprise-exclusive AI digital person on the enterprise's business data, so that its responses to user queries are more accurate, more rigorous, and better matched to the enterprise's actual situation.
However, enterprises generate many types of business data, and the business logic is complex. How to process the business data and reject abnormal data is therefore a technical problem that currently needs to be solved.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides a method and a device for identifying abnormal business data and a training method and a device for AI digital persons, which are used for at least solving the technical problem that the response of trained AI digital persons is inaccurate due to the fact that abnormal data exist in the business data for training the AI digital persons in the prior art.
According to an aspect of an embodiment of the present application, there is provided a method for identifying abnormal service data, including: acquiring a plurality of service data, and extracting service scene characteristics from the plurality of service data; performing cluster analysis on the service scene characteristics by adopting a clustering algorithm to obtain a plurality of service scene characteristic classes; determining a dimension matrix of each business scene feature class based on the business data under each business scene feature class in the plurality of business scene feature classes, wherein the dimension matrix represents the distribution condition of parameter values corresponding to different parameter dimensions of each business scene feature class; identifying whether abnormal service data exists in the service scene feature class corresponding to the dimension matrix based on the dimension matrix, and eliminating the abnormal service data under the condition that the abnormal service data exists.
According to another aspect of the embodiments of the present application, there is also provided a training method for AI digital persons, including: acquiring service data; preprocessing the service data based on the method for identifying abnormal service data; training the AI digital person based on the preprocessed business data.
According to still another aspect of the embodiments of the present application, there is further provided an apparatus for identifying abnormal service data, including: the extraction module is configured to acquire a plurality of business data and extract business scene features from the business data; the clustering module is configured to perform clustering analysis on the service scene characteristics by adopting a clustering algorithm to obtain a plurality of service scene characteristic classes; a determining module configured to determine a dimension matrix of each of the plurality of service scene feature classes based on respective service data under the service scene feature class, wherein the dimension matrix represents a distribution of parameter values corresponding to different parameter dimensions of the each service scene feature class; the processing module is configured to identify whether abnormal business data exists in the business scene feature class corresponding to the dimension matrix based on the dimension matrix, and reject the abnormal business data when the abnormal business data exists.
According to still another aspect of the embodiments of the present application, there is also provided an AI digital person training apparatus, including: the acquisition module is configured to acquire service data; the device for identifying abnormal service data as described above is configured to pre-process the service data; a training module configured to train the AI digital person based on the preprocessed business data.
In the embodiment of the application, a plurality of service data are acquired, and service scene characteristics are extracted from the plurality of service data; performing cluster analysis on the service scene characteristics by adopting a clustering algorithm to obtain a plurality of service scene characteristic classes; determining a dimension matrix of each of the plurality of business scenario feature classes based on respective business data under each of the business scenario feature classes; identifying whether abnormal service data exists in the service scene feature class corresponding to the dimension matrix based on the dimension matrix, and eliminating the abnormal service data under the condition that the abnormal service data exists, so that the technical problem that the trained AI digital person cannot respond accurately due to the fact that abnormal data exists in the service data for training the AI digital person in the prior art is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of a method of identifying anomalous business data in accordance with an embodiment of the application;
FIG. 2 is a flow chart of another method of identifying anomalous business data in accordance with an embodiment of the application;
FIG. 3 is a flow chart of a method of clustering business data according to an embodiment of the present application;
FIG. 4 is a flow chart of a method of determining a dimension matrix according to an embodiment of the present application;
FIG. 5 is a flow chart of a method of identifying and culling abnormal business data according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus for identifying abnormal service data according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an electronic device suitable for implementing embodiments of the present disclosure.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise. Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description. Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate. In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values. It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Example 1
The embodiment of the application provides a method for identifying abnormal service data. As shown in FIG. 1, the method comprises the following steps:
Step S102, acquiring a plurality of business data and extracting business scene features from the business data.
First, a series of related business data, such as sales records, customer behavior data, supply chain data, etc., for an enterprise is obtained from a plurality of business data sources. The data may exist in structured form, such as database tables, or unstructured form, such as text, log files, etc.
Next, features related to the business scenario are extracted from the acquired plurality of business data. These features may be numerical, discrete, or textual data that describe different business conditions and behaviors.
In this example, data is obtained from different service data sources, and multiple data sources are integrated, so that more comprehensive and diversified data can be obtained, and service scenes can be better described. In addition, by extracting key features, the most representative and important information in the business scenario can be captured, while secondary and irrelevant data is ignored, thereby simplifying the complexity of subsequent analysis.
Step S104, performing cluster analysis on the service scene features using a clustering algorithm to obtain a plurality of service scene feature classes.
For each unclassified business scene feature, after its feature vector is obtained, the distance between the feature vector and each cluster center of the existing business scene feature classes is calculated. If the distance between the feature vector and the nearest cluster center is greater than or equal to a preset distance threshold, a new business scene feature class is created, the unclassified business scene feature corresponding to the feature vector is classified into the new class, and its feature vector is taken as the cluster center of the new class. If the distance between the feature vector and the nearest cluster center is smaller than the preset distance threshold, the unclassified business scene feature corresponding to the feature vector is assigned to the business scene feature class of the nearest cluster center.
According to the embodiment, the distance between the unclassified business scene features and the existing clustering center is calculated, and the classification decision is carried out according to the preset distance threshold, so that the business scene features can be more accurately classified into corresponding business scene feature classes. This helps to improve the accuracy and reliability of classification. In addition, by performing distance calculation and classification judgment between the existing cluster center and the unclassified business scene features, a large amount of business scene feature data can be efficiently processed. This helps to reduce the time cost of computation and processing and improves data processing efficiency.
Step S106, determining a dimension matrix of each business scene feature class based on the respective business data under each business scene feature class in the plurality of business scene feature classes.
For each business scene feature class, parameter distribution analysis is performed on the service data under that class to determine how the different parameter dimensions are distributed within the class. From this analysis, a dimension matrix of each business scene feature class can be constructed; the dimension matrix reflects the distribution of the parameter values corresponding to the different parameter dimensions in the class. In particular, for each parameter dimension, the frequency with which each parameter value occurs in the service data can be calculated, and from these frequencies the distribution of each parameter dimension in each business scene feature class can be determined.
According to the embodiment, the distribution condition of different parameter dimensions in each business scene feature class can be known in depth by carrying out parameter distribution analysis on each business data under each business scene feature class. This helps to obtain the preference, distribution range and possible abnormal situation of parameter values in the service scene feature class, and further understand the characteristics and features of the service scene feature class. In addition, a dimension matrix of each business scene feature class can be constructed by analyzing the distribution condition of different parameter dimensions in each business scene feature class. The dimension matrix reflects the distribution condition of parameter values corresponding to different parameter dimensions in the service scene feature class. This provides a basis for subsequent data analysis and decision making, and allows better understanding and comparison of the parameter distribution differences between different traffic scenario feature classes.
Step S108, identifying whether abnormal service data exists in the service scene feature class corresponding to the dimension matrix based on the dimension matrix, and eliminating the abnormal service data under the condition that the abnormal service data exists.
Each service data is mapped to the dimension matrix of its service scene feature class to obtain mapped service data. The matching degree between the mapped service data and each parameter dimension in the dimension matrix is then calculated, and whether the service data corresponding to the mapped service data is abnormal service data is identified based on the matching degree.
According to the embodiment, by mapping each service data to the dimension matrix of the service scene feature class, the original service data can be converted into mapped service data in a specific service scene. Such a mapping may better reflect the relationship and interaction between data and business scenario features. In addition, the matching degree of the data on different parameter dimensions can be evaluated by calculating the matching degree of the mapped business data and each parameter dimension in the dimension matrix. This helps to determine if the data is consistent with the expected distribution and features in the business scenario feature class, further analyzing the reliability and accuracy of the data.
Example 2
The embodiment of the application provides another method for identifying abnormal service data. As shown in FIG. 2, the method comprises the following steps:
Step S202, clustering the service data to obtain service scene feature classes.
In some embodiments, the clustering method may be as shown in FIG. 3 and includes the following steps:
Step S2022, extracting the service scene features from the service data within a preset time period.
First, a preset time period for extracting the service scene features is determined. The time period may be set to the last month, quarter, year, etc., according to specific needs. Then, service data within the preset time period, including user behavior records, transaction data, log files, and the like, is acquired from the corresponding data sources so as to cover the service scene of interest. Next, the service scene to be extracted is determined, which may be, for example, a certain behavior pattern of users, the usage of a certain product, or the interaction flow of a certain service. Finally, the service scene features are extracted from the service data using statistical indicators, time-series patterns, association rules, machine learning algorithms, and the like. For example, for user behavior data, features such as user access frequency, browsing time, and interaction path can be extracted; for transaction data, features such as transaction amount, transaction time, and commodity category can be extracted.
Step S2024, obtaining a feature vector for each unclassified service scene feature extracted from the service data.
Each unclassified business scene feature is converted into a feature vector by a feature extraction method. For example, for text features, natural language processing techniques may be applied to convert text into word vector representations. For numerical features, the value may be used directly as an element of the feature vector. The feature vectors may be generated using mathematical calculations, algorithmic transformations, or other suitable methods.
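As an illustration of this conversion, the following is a minimal sketch rather than the method of the application. It assumes each business scene feature is a flat mapping of field names to values, copies numeric fields directly into the vector, and replaces text fields with bag-of-words counts over a small fixed vocabulary as a simple stand-in for word vectors. The function name, field names, and vocabulary are hypothetical.

```python
from typing import List, Mapping, Sequence, Union

# Hypothetical toy vocabulary; a real system would more likely use learned word vectors.
VOCABULARY: Sequence[str] = ("refund", "order", "delivery", "invoice")


def to_feature_vector(
    feature: Mapping[str, Union[int, float, str]],
    vocabulary: Sequence[str] = VOCABULARY,
) -> List[float]:
    """Turn one business scene feature into a flat numeric vector.

    Numeric fields are copied as-is. Text fields are replaced by
    bag-of-words counts over `vocabulary` (a stand-in for word vectors).
    Fields are visited in sorted key order so that vectors are comparable.
    """
    vector: List[float] = []
    for key in sorted(feature):
        value = feature[key]
        if isinstance(value, str):
            tokens = value.lower().split()
            vector.extend(float(tokens.count(word)) for word in vocabulary)
        else:
            vector.append(float(value))
    return vector


# Hypothetical feature with two numeric fields and one text field.
example = {"access_frequency": 12, "browse_minutes": 3.5, "query": "refund order refund"}
example_vector = to_feature_vector(example)
```

Visiting the fields in a fixed order keeps the vector positions aligned across features, which the later distance calculation relies on.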
Step S2026, calculating the distance between the feature vector and the nearest of the cluster centers of the business scene feature classes.
For the feature vector of each unclassified business scene feature, when no cluster center exists yet, the feature vector of any business scene feature in the business scene feature class is taken as the cluster center of that class. The distance from the feature vector of the unclassified business scene feature to this cluster center is calculated and taken as the distance between the nearest cluster center and the feature vector.
When cluster centers exist, the distance between the feature vector of the unclassified business scene feature and the cluster center of each business scene feature class is calculated, for example using a distance metric such as Euclidean distance or Manhattan distance, and the cluster center nearest to the feature vector is determined. For example, the distance values are compared to find the minimum, which is taken as the distance between the nearest cluster center and the feature vector.
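The nearest-center search can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and feature vectors are assumed to be equal-length sequences of numbers.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u: Vector, v: Vector) -> float:
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def nearest_center(
    vector: Vector,
    centers: Sequence[Vector],
    metric: Callable[[Vector, Vector], float] = euclidean,
) -> Tuple[int, float]:
    """Return (index, distance) of the cluster center closest to `vector`."""
    if not centers:
        raise ValueError("at least one cluster center is required")
    distances = [metric(vector, center) for center in centers]
    index = min(range(len(distances)), key=distances.__getitem__)
    return index, distances[index]
```

Passing the metric as a parameter makes it easy to compare Euclidean and Manhattan distance on the same feature vectors.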
Step S2028, determining the business scene feature class based on the distance.
When the distance is greater than or equal to a preset distance threshold, a new business scene feature class is created, the unclassified business scene feature is assigned to the new class, and its feature vector is taken as the cluster center of the new class. When the distance is smaller than the preset distance threshold, the unclassified business scene feature is assigned to the business scene feature class corresponding to the cluster center closest to the feature vector.
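Steps S2026 and S2028 together amount to a single-pass, threshold-based clustering loop. The sketch below is one possible reading under stated assumptions: Euclidean distance is used, the first feature vector seeds the first cluster center, and cluster centers are not updated after creation, since the text does not describe an update step. The function name and the threshold value are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cluster_by_threshold(
    vectors: Sequence[Sequence[float]], threshold: float
) -> Tuple[List[int], List[List[float]]]:
    """Assign each vector to the nearest center, or open a new class.

    Returns (labels, centers). A vector whose distance to the nearest
    center is >= threshold starts a new class and becomes its center.
    """
    centers: List[List[float]] = []
    labels: List[int] = []
    for vector in vectors:
        if centers:
            distances = [math.dist(vector, center) for center in centers]
            nearest = min(range(len(centers)), key=distances.__getitem__)
            if distances[nearest] < threshold:
                labels.append(nearest)
                continue
        # No center exists yet, or the nearest one is too far away.
        centers.append(list(vector))
        labels.append(len(centers) - 1)
    return labels, centers


vectors = [[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.0, 10.2], [0.9, 0.0]]
labels, centers = cluster_by_threshold(vectors, threshold=1.0)
```

Because centers are fixed once created, the result depends on the order in which features arrive; this is a property of the sketch, not a claim about the application.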
Step S204, determining a dimension matrix of each of the plurality of service scenario feature classes based on the respective service data under each service scenario feature class.
In an exemplary embodiment, the method for determining the dimension matrix may be as shown in FIG. 4 and includes the following steps:
Step S2042, performing parameter distribution analysis on the service data.
For each service scene feature class, parameter distribution analysis is performed on the service data under that class. The distribution of each parameter dimension in the class is calculated, for example the frequency or proportion of each parameter value.
Specifically, for each service scene feature class, the service data under the class is retrieved. For each parameter dimension, the distribution of that dimension within the class is calculated, for example by counting the number of occurrences or the proportion of each parameter value in the class.
For example, assume a parameter dimension W1 that contains the parameter values A, B, and C. Within the service scene feature class, the occurrences under dimension W1 are counted: parameter value A occurs a times, parameter value B occurs b times, and parameter value C occurs c times. The frequency of parameter value A is then a/(a+b+c), the frequency of parameter value B is b/(a+b+c), and the frequency of parameter value C is c/(a+b+c).
For each parameter dimension, distribution data of the different parameter values in that dimension is constructed. Each parameter value and its frequency or proportion form a data pair, and the pairs together form the distribution data of the parameter dimension. Taking dimension W1 as an example, the constructed distribution data may be expressed as {(A, frequency of A), (B, frequency of B), (C, frequency of C)}.
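The frequency calculation for a single parameter dimension can be sketched as follows. This is a minimal sketch: the function name is hypothetical, and the sample counts (4, 3, and 3 occurrences) are chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def value_distribution(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Map each parameter value to its relative frequency, count / total."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


# Illustrative observations for dimension W1: A occurs 4 times, B 3 times, C 3 times.
w1_values = ["A"] * 4 + ["B"] * 3 + ["C"] * 3
w1_distribution = value_distribution(w1_values)
```

The resulting mapping plays the role of the (value, frequency) pairs described for dimension W1.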
Step S2044, constructing a dimension matrix.
A dimension matrix is constructed for each service scene feature class based on the result of the parameter distribution analysis. The dimension matrix is a multidimensional array in which each row corresponds to a parameter dimension and each element represents the distribution of a parameter value in that dimension.
Specifically, the size of the dimension matrix, that is, its number of rows and columns, is determined first. The number of rows is the number of parameter dimensions, and the number of columns is the number of parameter values in each parameter dimension. Then, for each parameter dimension, its parameter values and their distribution are added to the dimension matrix as one row of data. For example, assume the following parameter dimensions and their parameter values: dimension W1 with parameter values A, B, and C; dimension W2 with parameter values X, Y, and Z. Assume further that, in the business scene feature class, the parameter distribution of dimension W1 is {(A, 0.4), (B, 0.3), (C, 0.3)} and the parameter distribution of dimension W2 is {(X, 0.2), (Y, 0.5), (Z, 0.3)}. The dimension matrix constructed from these distributions is shown in Table 1:
TABLE 1
Parameter dimension    Distribution of parameter values
W1                     (A, 0.4)    (B, 0.3)    (C, 0.3)
W2                     (X, 0.2)    (Y, 0.5)    (Z, 0.3)
The dimension matrix has 2 rows, corresponding to the two parameter dimensions W1 and W2, and 3 columns, since each parameter dimension has 3 parameter values.
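The construction of the dimension matrix for the W1/W2 example can be sketched as follows. The sketch assumes each service data record is a mapping from parameter dimension to parameter value, and it stores the matrix as a nested dictionary, which also tolerates dimensions with different numbers of values. The function names and sample records are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Mapping, Sequence

DimensionMatrix = Dict[str, Dict[str, float]]


def build_dimension_matrix(
    records: Sequence[Mapping[str, str]], dimensions: Sequence[str]
) -> DimensionMatrix:
    """Build one row per parameter dimension: value -> relative frequency."""
    matrix: DimensionMatrix = {}
    for dimension in dimensions:
        counts = Counter(r[dimension] for r in records if dimension in r)
        total = sum(counts.values())
        matrix[dimension] = (
            {value: count / total for value, count in sorted(counts.items())}
            if total
            else {}
        )
    return matrix


def as_rows(matrix: DimensionMatrix) -> List[List[float]]:
    """Dense view of the matrix: one list of frequencies per dimension."""
    return [list(row.values()) for row in matrix.values()]


# Ten hypothetical records reproducing the example distributions.
records = [{"W1": a, "W2": b} for a, b in zip("AAAABBBCCC", "XXYYYYYZZZ")]
matrix = build_dimension_matrix(records, ["W1", "W2"])
```

For these records, `as_rows(matrix)` yields two rows of three frequencies each, matching the 2-by-3 shape described in the text.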
Step S2046, normalization processing.
Normalization is performed on the dimension matrix so that the weights of the different parameter dimensions are balanced. Each distribution value in a parameter dimension may be divided by the sum or the maximum of all distribution values in that dimension, so that every value lies between 0 and 1.
For each parameter dimension in the dimension matrix, the sum or maximum of all distribution values in that dimension is calculated and used as the denominator for normalization. Then, each parameter value in the dimension is traversed, and its distribution value is divided by the denominator to obtain the normalized distribution value. Finally, the distribution values of each parameter dimension in the dimension matrix are updated to the normalized values.
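The row-wise normalization can be sketched as follows. This is a minimal sketch: the function name and the `method` parameter are hypothetical, and the handling of an all-zero row is an assumption, since the text does not cover that case.

```python
from typing import Dict


def normalize_row(row: Dict[str, float], method: str = "sum") -> Dict[str, float]:
    """Divide every distribution value in one dimension by the row sum or row max."""
    if method == "sum":
        denominator = sum(row.values())
    elif method == "max":
        denominator = max(row.values(), default=0.0)
    else:
        raise ValueError("method must be 'sum' or 'max'")
    if denominator == 0:
        # Assumption: leave an empty or all-zero row unchanged.
        return dict(row)
    return {value: weight / denominator for value, weight in row.items()}


# Raw counts for a hypothetical dimension W2, normalized both ways.
w2_counts = {"X": 2.0, "Y": 5.0, "Z": 3.0}
by_sum = normalize_row(w2_counts, method="sum")
by_max = normalize_row(w2_counts, method="max")
```

Dividing by the sum makes each row add up to 1, while dividing by the maximum scales the most frequent value to exactly 1; both keep all values between 0 and 1.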
Step S206, identifying whether abnormal business data exists in the business scene feature class corresponding to the dimension matrix based on the dimension matrix.
In some embodiments, the method for identifying abnormal business data may be as shown in FIG. 5 and includes the following steps:
Step S2062, mapping each service data to the corresponding dimension matrix.
Service data comprising a plurality of parameter values is acquired. For each parameter dimension of the service data, the corresponding parameter dimension is found in the dimension matrix to obtain the mapped service data.
Step S2064, calculating, for each mapped service data, the degree of matching with each parameter dimension in the dimension matrix.
The cosine similarity between the mapped service data and each parameter dimension in the dimension matrix is calculated. Specifically, for each parameter dimension, the cosine similarity between the mapped service data and that dimension is calculated, and the cosine similarities of all parameter dimensions are summed. The sum is divided by the total number of parameter dimensions to obtain the mean cosine similarity, which represents the average similarity of the mapped service data to all parameter dimensions in the dimension matrix. The mean is compared with a preset similarity threshold. If the mean is smaller than the preset similarity threshold, the service data corresponding to the mapped service data is identified as abnormal service data.
In this embodiment, by calculating the mean value of cosine similarity, the similarity between the mapped service data and each parameter dimension in the dimension matrix may be comprehensively considered. If the average value is smaller than the preset similarity threshold value, the mapped service data and the dimension matrix are indicated to have low overall similarity, and the service data may be indicated to have abnormality in the service scene feature class. Therefore, whether the business data is abnormal or not can be judged by setting a preset similarity threshold value, and corresponding abnormal processing or analysis is carried out.
Specifically, a parameter dimension vector is calculated first. For a given parameter dimension, a vector is constructed from the distribution of the parameter values over that dimension. For example, if the parameter dimension in the dimension matrix is the place and the distribution of the parameter value "Beijing" in that dimension is 0.4, the distribution is constructed as a parameter dimension vector, e.g., [0, 0, 0, 0.4, 0, 0, 0, ...], where the length of the vector equals the total number of parameter values and the remaining positions are 0.
The mapped business data vector is then calculated. The mapped business data is represented as a vector in which, according to the value of the business data in the parameter dimension, the corresponding position is set to 1 and the remaining positions are set to 0. For example, if the parameter value of the mapped business data is "city A", the corresponding vector is denoted as [1, 0, 0, ...].
Next, the cosine similarity is calculated. The cosine similarity between the mapped business data vector and the parameter dimension vector is calculated using the cosine similarity formula. This step is repeated until the cosine similarity between the mapped business data vector and every other parameter dimension vector has been calculated.
Finally, the cosine similarities are summed and the mean is calculated. The cosine similarities of all parameter dimensions are summed, and the sum is divided by the total number of parameter dimensions to obtain the mean cosine similarity.
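The four steps above can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the embodiment. It assumes that the dimension matrix is stored as a nested dictionary (parameter dimension, then parameter value, then distribution) and that the parameter dimension vector holds the full distribution over the values of that dimension. The names `cosine_similarity`, `matching_degree`, and the sample data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matching_degree(record, dimension_matrix):
    """Mean cosine similarity between a record and every parameter dimension.

    record:           {dimension: parameter value}            (assumed layout)
    dimension_matrix: {dimension: {parameter value: share}}   (assumed layout)
    """
    if not dimension_matrix:
        raise ValueError("dimension_matrix must contain at least one dimension")
    similarities = []
    for dimension, distribution in dimension_matrix.items():
        values = list(distribution)
        # Parameter dimension vector: the distribution of each parameter value.
        dimension_vector = [distribution[v] for v in values]
        # Mapped business data vector: 1 at the record's value, 0 elsewhere.
        one_hot = [1.0 if record.get(dimension) == v else 0.0 for v in values]
        similarities.append(cosine_similarity(one_hot, dimension_vector))
    # Sum over all dimensions, then divide by the number of dimensions.
    return sum(similarities) / len(similarities)


if __name__ == "__main__":
    # Hypothetical dimension matrix for one business scene feature class.
    matrix = {
        "place": {"Beijing": 0.4, "Shanghai": 0.3, "Shenzhen": 0.3},
        "channel": {"app": 0.8, "web": 0.2},
    }
    print(matching_degree({"place": "Beijing", "channel": "app"}, matrix))
    print(matching_degree({"place": "Lhasa", "channel": "fax"}, matrix))
```

Under these assumptions, a record whose values are all common in the class scores close to 1, while a record whose values never occur in the class scores 0.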
For example, the degree of matching may be calculated according to the following formula:
where S denotes the matching degree, A denotes the vector of the mapped business data, B_ij denotes the vector of the j-th parameter value of the i-th parameter dimension, P_ij denotes the distribution of the j-th parameter value of the i-th parameter dimension, N denotes the total number of parameter dimensions, and M_i denotes the number of parameter values of the i-th parameter dimension.
Step S2066, judging whether the business data sample is abnormal according to the calculated matching degree.
A threshold may be set, and whether the business data is abnormal is determined by comparing the matching degree with the threshold. For example, when the mean cosine similarity is smaller than the preset similarity threshold, the business data corresponding to the mapped business data is identified as abnormal business data.
Step S208, processing the abnormal business data.
According to the judgment result, the abnormal business data sample is marked or eliminated. In some embodiments, the following checks may also be performed to troubleshoot and repair abnormal business data:
Check whether the data source is abnormal or erroneous. Problems such as data source configuration errors, data transmission interruptions, and data format errors may exist and need to be checked and repaired. Check whether the system configuration associated with the business data is correct. Configuration parameter errors, system version issues, and the like may exist, in which case the relevant configuration needs to be confirmed and adjusted. Check whether the components on which the business data depends, such as databases and service interfaces, are working properly. Problems such as abnormal database connections, interface response timeouts, and service unavailability may exist and need to be resolved. Check whether the business logic contains problems or abnormal situations. Code logic errors, data processing errors, and the like may exist, requiring code analysis and repair. Monitor and analyze the data related to the business data with monitoring tools and log analysis tools. System logs, error logs, performance logs, and the like may be reviewed to discover anomalies and potential problems. Through these troubleshooting and repair measures, the problem of abnormal business data can be resolved, system faults can be repaired, and normal operation of the business can be ensured.
This embodiment purifies the business data by identifying and eliminating abnormal data, thereby improving the accuracy of training the AI digital person. Abnormal data may include errors, noise, outliers, or non-canonical data that may negatively affect the trained model. Eliminating such data helps the trained model capture data patterns that are real and representative.
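The judging and processing steps (S2066 and S208) can be sketched as follows. This is an illustrative sketch only. It assumes that a matching degree has already been computed for each record. The function name `split_by_threshold` and the `"abnormal"` marker key are hypothetical.

```python
def split_by_threshold(records, scores, similarity_threshold):
    """Separate records into kept and rejected lists by matching degree.

    A record is treated as abnormal when its score is strictly smaller
    than the preset similarity threshold, as in step S2066.
    """
    if len(records) != len(scores):
        raise ValueError("records and scores must have the same length")
    kept, rejected = [], []
    for record, score in zip(records, scores):
        if score < similarity_threshold:
            # Mark a copy of the abnormal record instead of mutating the input.
            rejected.append({**record, "abnormal": True})
        else:
            kept.append(record)
    return kept, rejected


if __name__ == "__main__":
    # Hypothetical records and matching degrees.
    records = [{"id": 1}, {"id": 2}, {"id": 3}]
    scores = [0.9, 0.2, 0.5]
    kept, rejected = split_by_threshold(records, scores, similarity_threshold=0.5)
    print(kept)
    print(rejected)
```

Returning the rejected records instead of discarding them keeps both options described above available: the caller can either drop them or pass them on for troubleshooting.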
Example 3
The embodiment of the application provides a training method of an AI digital person, which comprises the following steps:
First, business data is acquired.
Then, the business data is preprocessed.
For example, a plurality of business data are acquired, and business scene features are extracted from the plurality of business data; performing cluster analysis on the service scene characteristics by adopting a clustering algorithm to obtain a plurality of service scene characteristic classes; determining a dimension matrix of each business scene feature class based on the business data under each business scene feature class in the plurality of business scene feature classes, wherein the dimension matrix represents the distribution condition of parameter values corresponding to different parameter dimensions of each business scene feature class; identifying whether abnormal service data exists in the service scene feature class corresponding to the dimension matrix based on the dimension matrix, and eliminating the abnormal service data under the condition that the abnormal service data exists. In this embodiment, the method for preprocessing the service data is similar to the methods in embodiments 1 and 2, and will not be described here again.
Finally, the AI digital person is trained based on the preprocessed business data.
In this embodiment, because the AI digital person is trained on the preprocessed business data, it can acquire more accurate understanding and response capabilities.
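The clustering and dimension matrix steps of the preprocessing can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: Euclidean distance is assumed as the distance measure, each cluster center stays fixed at the first feature vector assigned to it, and business data records are plain dictionaries. The function names and sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def threshold_cluster(feature_vectors, distance_threshold):
    """Assign each feature vector to the nearest cluster center, or open a
    new class when the nearest center is at or beyond the distance threshold."""
    centers, labels = [], []
    for vec in feature_vectors:
        if centers:
            distances = [math.dist(vec, center) for center in centers]
            nearest = min(range(len(centers)), key=distances.__getitem__)
            if distances[nearest] < distance_threshold:
                labels.append(nearest)
                continue
        # No center is close enough: this vector becomes a new cluster center.
        centers.append(tuple(vec))
        labels.append(len(centers) - 1)
    return labels, centers


def build_dimension_matrix(records, dimensions):
    """For each parameter dimension, compute the relative occurrence
    frequency of every parameter value among the given records."""
    matrix = {}
    for dim in dimensions:
        counts = Counter(r[dim] for r in records if dim in r)
        total = sum(counts.values())
        matrix[dim] = {v: c / total for v, c in counts.items()} if total else {}
    return matrix


def dimension_matrices_per_class(records, labels, dimensions):
    """Group records by cluster label and build one dimension matrix per class."""
    grouped = defaultdict(list)
    for record, label in zip(records, labels):
        grouped[label].append(record)
    return {label: build_dimension_matrix(group, dimensions)
            for label, group in grouped.items()}


if __name__ == "__main__":
    # Hypothetical scene feature vectors and the records they came from.
    features = [(0, 0), (0.1, 0), (5, 5), (5.1, 5)]
    records = [{"place": "Beijing"}, {"place": "Beijing"},
               {"place": "Shanghai"}, {"place": "Beijing"}]
    labels, centers = threshold_cluster(features, distance_threshold=1.0)
    print(labels, centers)
    print(dimension_matrices_per_class(records, labels, ["place"]))
```

Keeping the center fixed follows the description that the feature vector of a newly created class serves as its cluster center. A variant that updates the center as members are added would be a different design choice.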
Example 4
The embodiment of the application provides a device for identifying abnormal service data, as shown in fig. 6, including: the extraction module 62, the clustering module 64, the determination module 66, and the processing module 68.
The extraction module 62 is configured to acquire a plurality of business data and extract business scene features from the plurality of business data. The clustering module 64 is configured to perform cluster analysis on the business scene features by adopting a clustering algorithm to obtain a plurality of business scene feature classes. The determining module 66 is configured to determine a dimension matrix of each of the plurality of business scene feature classes based on the respective business data under that business scene feature class, wherein the dimension matrix represents the distribution of parameter values corresponding to different parameter dimensions of each business scene feature class. The processing module 68 is configured to identify, based on the dimension matrix, whether abnormal business data exists in the business scene feature class corresponding to the dimension matrix, and to reject the abnormal business data if it exists.
It should be noted that: the device for identifying abnormal service data provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the device for identifying abnormal service data provided in the above embodiment and the method embodiment for identifying abnormal service data belong to the same concept, and detailed implementation processes of the device are shown in the method embodiment, which is not described herein.
The embodiment of the application also provides a training device for an AI digital person, comprising: an acquisition module configured to acquire business data; the device for identifying abnormal business data as described above, configured to preprocess the business data; and a training module configured to train the AI digital person based on the preprocessed business data.
Example 5
Fig. 7 shows a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure. It should be noted that the electronic device shown in fig. 7 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device includes a Central Processing Unit (CPU) 1001 that can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for system operation are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, as well as a speaker and the like; a storage section 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1010 as needed, so that a computer program read out therefrom is installed into the storage section 1008 as needed.
In particular, according to embodiments of the present disclosure, the processes described below with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 1009, and/or installed from the removable medium 1011. When the computer program is executed by the Central Processing Unit (CPU) 1001, the various functions defined in the methods and apparatus of the present application are performed. In some embodiments, the electronic device may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device.
The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the foregoing embodiments. For example, the electronic device may implement the steps of the method embodiments described above.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present application.
In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis. For a portion that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal device may be implemented in other manners. The apparatus embodiments described above are merely exemplary. For example, the division of the units is merely a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations shall also fall within the scope of protection of the present application.

Claims (10)

1. A method of identifying anomalous business data, comprising:
acquiring a plurality of service data, and extracting service scene characteristics from the plurality of service data;
performing cluster analysis on the service scene characteristics by adopting a clustering algorithm to obtain a plurality of service scene characteristic classes;
determining a dimension matrix of each business scene feature class based on the business data under each business scene feature class in the plurality of business scene feature classes, wherein the dimension matrix represents the distribution condition of parameter values corresponding to different parameter dimensions of each business scene feature class;
identifying whether abnormal service data exists in the service scene feature class corresponding to the dimension matrix based on the dimension matrix, and eliminating the abnormal service data under the condition that the abnormal service data exists.
2. The method of claim 1, wherein performing cluster analysis on the service scene features using a clustering algorithm comprises:
acquiring feature vectors of each unclassified business scene feature in the business scene features, and calculating distances between a cluster center closest to the feature vector and the feature vector in a plurality of cluster centers of the business scene feature classes;
under the condition that the distance is greater than or equal to a preset distance threshold value, a new service scene feature class is established, unclassified service scene features corresponding to the feature vectors are classified into the new service scene feature class, and the feature vectors are used as clustering centers of the new service scene feature class;
and classifying the unclassified business scene features corresponding to the feature vectors into business scene feature classes corresponding to the cluster centers closest to the feature vectors under the condition that the distance is smaller than the preset distance threshold.
3. The method of claim 1, wherein determining a dimension matrix for each of the plurality of business scenario feature classes based on respective business data under each of the business scenario feature classes comprises:
acquiring different parameter dimensions of each service scene feature class, wherein the different parameter dimensions are used for describing the distribution condition of the parameter values from different dimensions;
for each parameter dimension, carrying out parameter distribution analysis on each business data under each business scene feature class, and determining the distribution condition of different parameter values of each parameter dimension under each business scene feature class;
constructing the dimension matrix of each business scenario feature class based on the distribution condition.
4. A method according to claim 3, wherein performing parameter distribution analysis on the respective service data under each service scene feature class, and determining the distribution of different parameter values of each parameter dimension under each service scene feature class comprises:
calculating the occurrence frequency of different parameter values of each service data under each service scene feature class under the corresponding parameter dimension;
and determining the distribution condition of different parameter values of each parameter dimension under each service scene characteristic class based on the occurrence frequency.
5. The method of claim 1, wherein identifying whether abnormal business data exists in a business scenario feature class corresponding to the dimension matrix based on the dimension matrix comprises:
mapping the service data to a dimension matrix of a service scene feature class corresponding to the service data aiming at each service data in the plurality of service data to obtain mapped service data;
and calculating the matching degree of the mapped service data and each parameter dimension in the dimension matrix, and identifying whether the service data corresponding to the mapped service data is abnormal service data or not based on the matching degree.
6. The method of claim 5, wherein calculating a degree of matching of the mapped service data with each parameter dimension in the dimension matrix and identifying whether service data corresponding to the mapped service data is the abnormal service data based on the degree of matching comprises:
calculating cosine similarity between the mapped service data and each parameter dimension in the dimension matrix, calculating the mean value of the cosine similarity, and taking the mean value of the cosine similarity as the matching degree;
and when the average value of the cosine similarity is smaller than a preset similarity threshold value, identifying the business data corresponding to the mapped business data as the abnormal business data.
7. A method of training an AI digital person, comprising:
acquiring service data;
preprocessing the service data based on the method of any one of claims 1 to 6;
training the AI digital person based on the preprocessed business data.
8. An apparatus for identifying abnormal business data, comprising:
the extraction module is configured to acquire a plurality of business data and extract business scene features from the business data;
the clustering module is configured to perform clustering analysis on the service scene characteristics by adopting a clustering algorithm to obtain a plurality of service scene characteristic classes;
a determining module configured to determine a dimension matrix of each of the plurality of service scene feature classes based on respective service data under the service scene feature class, wherein the dimension matrix represents a distribution of parameter values corresponding to different parameter dimensions of the each service scene feature class;
the processing module is configured to identify whether abnormal business data exists in the business scene feature class corresponding to the dimension matrix based on the dimension matrix, and reject the abnormal business data when the abnormal business data exists.
9. An AI digital person training device, comprising:
the acquisition module is configured to acquire service data;
the apparatus for identifying anomalous business data according to claim 8, configured to pre-process the business data;
a training module configured to train the AI digital person based on the preprocessed business data.
10. A computer-readable storage medium, on which a program is stored, characterized in that the program, when run, causes a computer to perform the method of any one of claims 1 to 6.
CN202310735268.9A 2023-06-20 2023-06-20 Method for identifying abnormal business data and training method of AI digital person Active CN116484230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310735268.9A CN116484230B (en) 2023-06-20 2023-06-20 Method for identifying abnormal business data and training method of AI digital person

Publications (2)

Publication Number Publication Date
CN116484230A true CN116484230A (en) 2023-07-25
CN116484230B CN116484230B (en) 2023-09-01

Family

ID=87219915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310735268.9A Active CN116484230B (en) 2023-06-20 2023-06-20 Method for identifying abnormal business data and training method of AI digital person

Country Status (1)

Country Link
CN (1) CN116484230B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492394A (en) * 2018-10-25 2019-03-19 平安科技(深圳)有限公司 The recognition methods of abnormal traffic request and terminal device
CN110457175A (en) * 2019-07-08 2019-11-15 阿里巴巴集团控股有限公司 Business data processing method, device, electronic equipment and medium
CN112035325A (en) * 2020-09-01 2020-12-04 中国银行股份有限公司 Automatic monitoring method and device for text robot
CN115170027A (en) * 2022-07-04 2022-10-11 上海东普信息科技有限公司 Data analysis method, device, equipment and storage medium
CN115238815A (en) * 2022-08-10 2022-10-25 中国工商银行股份有限公司 Abnormal transaction data acquisition method, device, equipment, medium and program product
US20230121044A1 (en) * 2021-10-15 2023-04-20 Nvidia Corporation Techniques for determining dimensions of data
CN116227989A (en) * 2023-01-05 2023-06-06 深圳市中京政通科技有限公司 Multidimensional business informatization supervision method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHUN-XIAO NIE: "Cluster structure in the correlation coefficient matrix can be characterized by abnormal eigenvalues", 《PHYSICA A:STATISTICAL MECHANICS AND ITS APPLICATIONS》 *
梁耘等: "基于分裂-合并策略改进多特征聚类算法的风电机组故障分析", 《可再生能源》, no. 10 *

Also Published As

Publication number Publication date
CN116484230B (en) 2023-09-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant