CN115811735A - Information identification method, model training method, related device and electronic equipment - Google Patents

Information identification method, model training method, related device and electronic equipment

Info

Publication number
CN115811735A
Authority
CN
China
Prior art keywords
fraud
feature
dimension
call
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111085163.0A
Other languages
Chinese (zh)
Inventor
张宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile IoT Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile IoT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile IoT Co Ltd
Priority to CN202111085163.0A
Publication of CN115811735A
Legal status: Pending

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention provides an information identification method, a model training method, a related device and electronic equipment. The method comprises the following steps: acquiring M call characteristics of a number to be identified in M dimensions; inputting the M call characteristics into a target model to execute a first detection operation to obtain an identification result, wherein the identification result is used for representing whether the number to be identified is a fraud number; wherein the first detecting operation comprises: for each dimension, carrying out fraud analysis on the M call features based on model parameters of the target model in the dimension to obtain fraud probability values of the number to be identified in the dimension, and determining the identification result based on the M fraud probability values of the number to be identified in the M dimensions and feature weight information corresponding to the M dimensions. The embodiment of the invention can improve the identification accuracy of the fraudulent number.

Description

Information identification method, model training method, related device and electronic equipment
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to an information identification method, a model training method, a related device and electronic equipment.
Background
With the rapid development of the communication industry, telecommunication calls are widely used, and users can make such calls through telephone numbers. At present, telephone numbers are also used to commit telecommunication fraud, so detecting whether a telephone number is a fraudulent number is an important task.
In the related art, historical user call tickets are generally collected, and traffic, voice and short message records are classified based on identification rules for telecommunication fraud, so that telephone numbers involved in telecommunication fraud are identified and a risk warning is given to the user. However, the recognition accuracy of this detection method is low.
Disclosure of Invention
The embodiment of the invention provides an information identification method, a model training method, a related device and electronic equipment, and aims to solve the problem that the identification accuracy rate of a fraud number is low in the prior art.
In a first aspect, an embodiment of the present invention provides an information identification method, where the method includes:
acquiring M conversation characteristics of a number to be identified on M dimensions, wherein M is a positive integer;
inputting the M call characteristics into a target model to execute a first detection operation to obtain an identification result, wherein the identification result is used for representing whether the number to be identified is a fraud number;
wherein the first detecting operation comprises: for each dimension, carrying out fraud analysis on the M call features based on model parameters of the target model in the dimension to obtain fraud probability values of the number to be identified in the dimension, and determining the identification result based on the M fraud probability values of the number to be identified in the M dimensions and feature weight information corresponding to the M dimensions.
In a second aspect, an embodiment of the present invention provides a model training method, where the method includes:
acquiring a first training sample set, wherein the first training sample set comprises M call characteristic samples of number samples in M dimensions, and M is a positive integer;
performing feature statistics on the first training sample set to determine a fraud feature set in the first training sample set;
inputting the M call feature samples and the fraud feature set into a target model to perform a second detection operation, wherein the second detection operation comprises: for each dimension, performing fraud analysis on the M call feature samples based on model parameters of the target model in the dimension and the fraud feature set to obtain first probability values of fraud features existing in the dimension of the M call feature samples, determining feature distribution containing fraud features in the M call feature samples based on the fraud feature set, and determining a target estimation function based on the M first probability values and the feature distribution corresponding to the M dimensions;
updating model parameters of the target model based on the target estimation function.
In a third aspect, an embodiment of the present invention provides an information identifying apparatus, where the apparatus includes:
the first acquisition module is used for acquiring M conversation characteristics of the number to be identified on M dimensions, wherein M is a positive integer;
the first detection operation module is used for inputting the M call characteristics into a target model to execute a first detection operation to obtain an identification result, and the identification result is used for representing whether the number to be identified is a fraud number;
wherein the first detecting operation comprises: for each dimension, carrying out fraud analysis on the M call features based on model parameters of the target model in the dimension to obtain fraud probability values of the number to be identified in the dimension, and determining the identification result based on the M fraud probability values of the number to be identified in the M dimensions and feature weight information corresponding to the M dimensions.
In a fourth aspect, an embodiment of the present invention provides a model training apparatus, where the apparatus includes:
the second acquisition module is used for acquiring a first training sample set, wherein the first training sample set comprises M call characteristic samples of number samples in M dimensions, and M is a positive integer;
the characteristic counting module is used for carrying out characteristic counting on the first training sample set so as to determine a fraud characteristic set in the first training sample set;
a second detection operation module, configured to input the M call feature samples and the fraud feature set to a target model to perform a second detection operation, where the second detection operation includes: for each dimension, performing fraud analysis on the M call feature samples based on model parameters of the target model in the dimension and the fraud feature set to obtain first probability values of fraud features of the M call feature samples in the dimension, determining feature distribution of fraud features in the M call feature samples based on the fraud feature set, and determining a target estimation function based on the M first probability values corresponding to the M dimensions and the feature distribution;
and the updating module is used for updating the model parameters of the target model based on the target estimation function.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the above information identification method or the steps of the above model training method.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the information identification method or, when executed, implements the steps of the model training method.
In the embodiment of the invention, M call features of the number to be identified in M dimensions are obtained; the M call features are input into a target model to execute a first detection operation to obtain an identification result, wherein the identification result is used for representing whether the number to be identified is a fraud number; wherein the first detection operation comprises: for each dimension, carrying out fraud analysis on the M call features based on model parameters of the target model in the dimension to obtain a fraud probability value of the number to be identified in the dimension, and determining the identification result based on the M fraud probability values of the number to be identified in the M dimensions and feature weight information corresponding to the M dimensions. In this way, the target model can examine the call features of the number to be identified across the M dimensions to determine whether it is a fraud number, so that more user features are taken into account and the identification accuracy of fraud numbers can be improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic flowchart of an information identification method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a model training method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an information identification apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a model training apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First, an information identification method provided by an embodiment of the present invention is explained below.
It should be noted that the information identification method provided by the embodiment of the invention relates to the technical field of artificial intelligence. The method can be executed by the information identification device of the embodiment of the invention. The information identification apparatus may be configured in any electronic device to execute the information identification method, and the electronic device may be a server or a terminal, which is not limited specifically here.
Referring to fig. 1, a schematic flow chart of an information identification method provided by an embodiment of the present invention is shown.
As shown in fig. 1, the method may include the steps of:
Step 101, obtaining M call features of a number to be identified in M dimensions.
Wherein M is a positive integer.
In this step, the number to be identified may be a subscriber number used for telecommunication. It may be the unique identification number corresponding to a Subscriber Identity Module (SIM) card, the unique identification number corresponding to a User Identity Module (UIM) card, or the unique identification number corresponding to another card, which is not specifically limited herein.
The number to be identified may be a subscriber number of a new generation communication, such as a subscriber number of 5G, or may be a subscriber number of another generation communication, such as a subscriber number of 2G, 3G, or 4G, which is not limited herein.
The M dimensions may refer to the dimensions involved in the call of the user number, including but not limited to the location where the user number is in the call, the length of time spent in that location, the length of the call, the frequency of the dialed call, the traffic consumption during the call, and the call time.
The M call features of the number to be identified in the M dimensions may be features identified from the M dimensions based on call data of the number to be identified, which may be specifically represented by table 1.
Table 1. M call features of a user number in M dimensions
Location of call | Length of dwell time | Duration of call | Dialing call frequency | Consumption of flow | Communication time
Position A       | 1 day                | 1 hour           | 5 times per day        | 50M                 | 12 o'clock in the evening
Optionally, the obtaining M call features of the number to be identified in M dimensions includes:
performing feature coding on the call data of the number to be identified in the M dimensions to obtain the M call features.
The call data of the number to be identified can be expressed as features using a plurality of feature coding modes, including but not limited to lexical features, syntactic features and specific content features. For example, for the data in the dimension of the location where the user number is during a call, the location data in the call data can be coded with lexical features to obtain the call feature in the location dimension.
Step 102, inputting the M call features to a target model to execute a first detection operation to obtain an identification result, wherein the identification result is used for representing whether the number to be identified is a fraud number. The first detection operation comprises: for each dimension, carrying out fraud analysis on the M call features based on model parameters of the target model in the dimension to obtain a fraud probability value of the number to be identified in the dimension, and determining the identification result based on the M fraud probability values of the number to be identified in the M dimensions and feature weight information corresponding to the M dimensions.
In this step, the M call features may be input to the target model to perform a first detection operation, so as to analyze the M call features through the target model, and obtain an identification result of whether the number to be identified is a fraudulent number.
The target model may be a machine learning model that includes M Gaussian mixture models. Each Gaussian mixture model may be established for one dimension of the call data, that is, for one column of data. For each dimension, the data in the corresponding column may fall into multiple components; for example, the call duration dimension may include three components: 0 to 20 minutes, 20 to 60 minutes, and more than 60 minutes. Each Gaussian mixture model may therefore include multiple components, each of which may be an independent univariate Gaussian distribution.
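As a minimal sketch of the per-dimension model described above, the following Python code evaluates the density of a univariate Gaussian mixture. The function names and the parameter values for the call-duration dimension are hypothetical and chosen only for illustration; they are not taken from the embodiment.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, stds):
    """Density of a univariate Gaussian mixture: sum_k w_k * N(x; mean_k, std_k)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))


# Hypothetical parameters for the call-duration dimension (in minutes),
# one component per band: 0-20, 20-60, and more than 60 minutes.
weights = [0.70, 0.25, 0.05]
means = [8.0, 35.0, 90.0]
stds = [5.0, 10.0, 20.0]

# Mixture density of a 60-minute call under these assumed parameters.
density = mixture_pdf(60.0, weights, means, stds)
```

How a mixture density is mapped to a fraud probability value is not spelled out at this point in the description, so the sketch stops at the density itself.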
For each dimension, a weight is assigned to the component corresponding to the dimension in the Gaussian mixture model, and the M weights of the components corresponding to the M dimensions constitute the feature weight information corresponding to the M dimensions. The feature weight information may be preset, or may be obtained by preprocessing the call data of a large number of user numbers and counting the weights of the components corresponding to the different dimensions.
Specifically, for each dimension, fraud analysis may be performed on the M call features based on the model parameters of the target model in the dimension to obtain a fraud probability value of the number to be identified in the dimension, that is, for each dimension, a gaussian mixture model corresponding to the dimension may be obtained, the M call features are input into the gaussian mixture model to serve as variable values of the gaussian mixture model, and based on the model parameters and the variable values of the gaussian mixture model, output values of the gaussian mixture model may be calculated, where the output values may represent the fraud probability value of the number to be identified in the dimension. Finally, M fraud probability values of the number to be identified on M dimensions can be obtained.
Then, the M fraud probability values and the characteristic weight information can be weighted to obtain a weighted probability value, the weighted probability value is a probability value representing that the number to be identified belongs to the fraud number, and under the condition that the weighted probability value is larger than a preset threshold value, the number to be identified can be determined to be the fraud number, and risk prompt is carried out on the fraud number.
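The weighting and threshold comparison described above can be sketched as follows. This is an illustrative assumption rather than the embodiment's exact computation: the function names are hypothetical, the weighted value is normalized by the sum of the weights, and the default threshold of 0.5 is arbitrary.

```python
def weighted_fraud_probability(fraud_probs, weights):
    """Weighted average of the M per-dimension fraud probability values."""
    if len(fraud_probs) != len(weights):
        raise ValueError("one weight is required per dimension")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalizing by the weight sum is an assumption made for this sketch.
    return sum(p * w for p, w in zip(fraud_probs, weights)) / total


def is_fraud_number(fraud_probs, weights, threshold=0.5):
    """True when the weighted probability value exceeds the preset threshold."""
    return weighted_fraud_probability(fraud_probs, weights) > threshold


# Hypothetical values for M = 3 dimensions.
probs = [0.9, 0.2, 0.6]
feature_weights = [0.5, 0.3, 0.2]
score = weighted_fraud_probability(probs, feature_weights)
flagged = is_fraud_number(probs, feature_weights, threshold=0.5)
```

With these made-up inputs, the number would be flagged at a threshold of 0.5 but not at 0.7, which shows how the preset threshold controls the decision.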
It should be noted that, before the target model is used for prediction, pre-training is required, that is, model parameters of each gaussian mixture model in the target model are determined, so that the model parameters of the gaussian mixture model in the target model can accurately represent the distribution of call data on each component in the corresponding dimension of the gaussian mixture model.
In this embodiment, M call features of the number to be identified in M dimensions are obtained; the M call features are input into a target model to execute a first detection operation to obtain an identification result, wherein the identification result is used for representing whether the number to be identified is a fraud number; wherein the first detection operation comprises: for each dimension, carrying out fraud analysis on the M call features based on model parameters of the target model in the dimension to obtain a fraud probability value of the number to be identified in the dimension, and determining the identification result based on the M fraud probability values of the number to be identified in the M dimensions and feature weight information corresponding to the M dimensions. In this way, the target model can examine the call features of the number to be identified across the M dimensions to determine whether it is a fraud number, so that more user features are taken into account and the identification accuracy of fraud numbers can be improved.
In addition, using the target model to identify, in real time, fraud behavior that has occurred before or new fraud behavior can reduce the time spent handling the problem, protect customers from becoming fraud victims, and improve the risk control capability of the operator.
The following describes a model training method provided in an embodiment of the present invention.
Referring to fig. 2, a schematic flow chart of a model training method provided in the embodiment of the present invention is shown.
As shown in fig. 2, the method may include the steps of:
Step 201, obtaining a first training sample set, where the first training sample set includes M call feature samples of number samples in M dimensions.
Step 202, performing feature statistics on the first training sample set to determine a fraud feature set in the first training sample set.
Step 203, inputting the M call feature samples and the fraud feature set into a target model to perform a second detection operation, where the second detection operation includes: for each dimension, performing fraud analysis on the M call feature samples based on model parameters of the target model in the dimension and the fraud feature set to obtain first probability values of fraud features existing in the dimension of the M call feature samples, determining feature distribution containing fraud features in the M call feature samples based on the fraud feature set, and determining a target estimation function based on the M first probability values and the feature distribution corresponding to the M dimensions.
Step 204, updating model parameters of the target model based on the target estimation function.
This embodiment mainly describes the training process of the target model. The target model can be trained with the expectation-maximization algorithm, which performs maximum likelihood estimation or maximum a posteriori estimation for probabilistic parametric models containing hidden variables, so as to obtain a telecommunication fraud judgment model.
Specifically, first, a first training sample set may be obtained, where the first training sample set may include a plurality of number samples, and each number sample may correspond to M call feature samples in M dimensions.
In a specific implementation, public data of an operator's 5G users can be collected. The public data can include call ticket data, location information, dwell time and other data, and feature coding can be performed on each piece of public data to obtain M call feature samples of each number sample in M dimensions. A label may also be defined for the public data to identify the attribute of each user number in the public data, the attribute being either a fraudulent number or a normal number.
The first training sample set may include the number samples corresponding to all the public data and the M call feature samples of each number sample in M dimensions. Alternatively, all the public data may be divided into K parts, and the first training sample set may include only the number samples corresponding to one part of the public data and the M call feature samples, in M dimensions, of each of those number samples.
In an optional implementation, all known fraudulent features in the public data may be counted in advance, and based on all known fraudulent features, the fraudulent features existing in the first training sample set are selected from the call feature samples of the first training sample set, so as to obtain the fraudulent feature set.
In another optional embodiment, because the probability of occurrence of the fraud feature in all call features is a small probability event under normal conditions, the call features with a smaller occurrence probability can be screened out for each dimension, the call features with a smaller occurrence probability are determined as the fraud features under the dimension, and finally the fraud feature set in the first training sample set is obtained.
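The second optional embodiment can be sketched as a frequency screen per dimension. The code below is an illustrative assumption: the function names, the dictionary-based sample layout and the cut-off `max_frequency` are hypothetical, since the description does not state what counts as a "smaller occurrence probability".

```python
from collections import Counter


def rare_values(values, max_frequency):
    """Feature values whose relative frequency does not exceed max_frequency."""
    counts = Counter(values)
    total = len(values)
    return {v for v, c in counts.items() if c / total <= max_frequency}


def fraud_feature_set(samples, max_frequency=0.1):
    """Screen each dimension for rare call features.

    samples: list of dicts mapping a dimension name to an encoded feature value.
    Returns a dict mapping each dimension name to its set of rare values.
    """
    if not samples:
        return {}
    return {
        dim: rare_values([s[dim] for s in samples], max_frequency)
        for dim in samples[0]
    }


# Hypothetical encoded samples: 19 daytime callers and one midnight caller.
samples = [{"call_time": "day", "location": "A"} for _ in range(19)]
samples.append({"call_time": "midnight", "location": "A"})
candidates = fraud_feature_set(samples, max_frequency=0.1)
```

In this toy data the value `midnight` occurs in 1 of 20 samples and is therefore kept as a candidate fraud feature, while no location value is rare enough to be kept.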
Then, the M call feature samples of the number samples in M dimensions and the fraud feature set can be input to the target model to perform the second detection operation. The M call feature samples of one number sample in M dimensions can be denoted by $x_i$, where $x_i$ may be the i-th row of data in the first training sample set, and the i-th row of data may include M call feature samples. A hidden variable, denoted by $z$, may be attached to each data point in $x_i$ to characterize the fraud feature component, so as to determine the probability value of each data point under the fraud feature component. The set of hidden variables is denoted by $Z$, where $z$ is the dimension.
The second detection operation may specifically include: for each dimension, fraud analysis can be performed on the M call feature samples based on the model parameters of the target model in the dimension, namely the model parameters of the Gaussian mixture model corresponding to the dimension, and the fraud feature set, so as to obtain first probability values of fraud features existing in the dimension of the M call feature samples. The first probability value is denoted by $p(x_i, z; \theta)$, where $\theta$ is the model parameter of the Gaussian mixture model corresponding to the dimension; $p(x_i, z; \theta)$ represents the probability value of $x_i$ under the $z$ component given the model parameter $\theta$.
That is, $x_i$ and the fraud feature set are input into the Gaussian mixture model corresponding to the dimension for fraud analysis, which yields, given that the model parameter of that Gaussian mixture model is $\theta$, the first probability value that $x_i$ has a fraud feature in the $z$ dimension. Finally, M first probability values corresponding to the M dimensions may be obtained.
Meanwhile, the feature distribution containing the fraud features in the M call feature samples can be determined based on the fraud feature set. It is denoted by $Q_i(Z)$, which may be the feature distribution of the fraud feature set within the M call feature samples.
After determining the M first probability values and the feature distribution, an objective estimation function may be determined, which may be a likelihood estimation function or an a posteriori probability estimation function. Taking the likelihood estimation function as an example, the target estimation function can be determined by using the following equation (1).
Figure BDA0003265364580000091
In the above formula (1), $E(z)$ is the likelihood estimation function, $Q_i(z)$ is the value of $Q_i(Z)$ in the $z$ dimension, and the symbol [ ] represents the ln function.
The target model may be trained by maximum likelihood estimation, that is, $\theta$ is estimated by maximizing $E(z)$.
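For orientation, the standard lower bound used by the expectation-maximization algorithm can be written with the symbols defined above. Formula (1) itself is an image that is not reproduced in this text, so the expression below is an assumption about its general form rather than a transcription of it.

```latex
E(z) \;=\; \sum_{i} \sum_{z \in Z} Q_i(z)\,
      \ln \frac{p(x_i, z; \theta)}{Q_i(z)},
\qquad
\sum_{z \in Z} Q_i(z) = 1, \quad Q_i(z) \ge 0 .
```

By Jensen's inequality, an expression of this form never exceeds the log-likelihood $\sum_i \ln p(x_i; \theta)$, so maximizing it over $\theta$ with $Q_i$ held fixed cannot decrease the bound on the likelihood.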
In an optional embodiment, the updating the model parameters of the target model based on the target estimation function includes:
differentiating the target estimation function with respect to a target variable to obtain a derivative function, wherein the target variable is a variable representing a model parameter of the target model;
in a case where the value of the derivative function is equal to zero, determining and updating the model parameters of the target model based on the derivative function.
That is, the derivative function obtained by differentiating the above formula (1) is expressed by the following formula (2).
Figure BDA0003265364580000092
When the derivative function is 0, that is, delta = 0, $E(z)$ in equation (1) above attains a maximum. In this iteration, the model parameters of the target model can then be determined and updated based on the derivative function, as expressed by equation (3) below.
Figure BDA0003265364580000093
After the iteration is finished, on the basis of the updated model parameters, the other row of data in the first training sample set can be input into the target model to start the next iteration, and the model parameters of the target model are further determined and updated, so that the model parameters of the target model are more and more optimized.
In a new iteration, if the obtained likelihood estimation value is reduced relative to the previous iteration, or the iteration converges to a preset threshold value, the iteration is ended, and the training of the target model is completed.
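The iterate-until-stop loop described above can be sketched for a single dimension with a textbook expectation-maximization fit of a univariate Gaussian mixture. This is a simplified illustration under stated assumptions: it processes the whole data column in each iteration rather than one row at a time, it ignores the fraud feature set, and the function names, tolerance and data values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def log_likelihood(data, weights, means, stds):
    """Log-likelihood of the data under a univariate Gaussian mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, s)
                     for w, m, s in zip(weights, means, stds)))
        for x in data
    )


def em_fit(data, weights, means, stds, tol=1e-6, max_iter=200):
    """Batch EM for one dimension; returns updated (weights, means, stds)."""
    prev = log_likelihood(data, weights, means, stds)
    for _ in range(max_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, s)
                     for w, m, s in zip(weights, means, stds)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: closed-form updates from setting the derivative to zero.
        new_w, new_m, new_s = [], [], []
        for k in range(len(weights)):
            nk = sum(r[k] for r in resp)
            mean_k = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - mean_k) ** 2
                        for r, x in zip(resp, data)) / nk
            new_w.append(nk / len(data))
            new_m.append(mean_k)
            new_s.append(max(math.sqrt(var_k), 1e-3))  # floor avoids a zero std
        cur = log_likelihood(data, new_w, new_m, new_s)
        if cur < prev:
            break  # likelihood decreased: keep the previous parameters
        weights, means, stds = new_w, new_m, new_s
        if cur - prev < tol:
            break  # improvement fell below the preset threshold
        prev = cur
    return weights, means, stds


# Hypothetical one-dimensional feature column with two clusters.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
initial = ([0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
weights, means, stds = em_fit(data, *initial)
```

The two `break` statements correspond to the two stopping conditions named in the text: a likelihood estimate that drops relative to the previous iteration, and convergence to a preset threshold.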
In this embodiment, a first training sample set is obtained, wherein the first training sample set comprises M call feature samples of number samples in M dimensions; feature statistics is performed on the first training sample set to determine a fraud feature set in the first training sample set; the M call feature samples and the fraud feature set are input into a target model to perform a second detection operation, wherein the second detection operation comprises: for each dimension, performing fraud analysis on the M call feature samples based on model parameters of the target model in the dimension and the fraud feature set to obtain first probability values of fraud features existing in the dimension of the M call feature samples, determining feature distribution containing fraud features in the M call feature samples based on the fraud feature set, and determining a target estimation function based on the M first probability values and the feature distribution corresponding to the M dimensions; and the model parameters of the target model are updated based on the target estimation function. In this way, the target model can be trained, and the trained target model can examine the call features of a number to be identified across the M dimensions to determine whether it is a fraud number, so that more user features are taken into account and the identification accuracy of fraud numbers can be improved.
Optionally, the performing feature statistics on the first training sample set to determine a fraud feature set in the first training sample set includes:
performing feature statistics on call feature samples on a first dimension in the first training sample set based on category labels in the first training sample set to obtain a probability set corresponding to the first dimension, wherein the probability set comprises a second probability value, a third probability value and a fourth probability value, the second probability value is the prior probability of fraudulent features in the call feature samples on the first dimension, the third probability value is the prior probability of non-fraudulent features in the call feature samples on the first dimension, and the fourth probability value is the conditional probability of fraud occurring after the non-fraudulent features in the call feature samples on the first dimension;
determining a variance of the feature variable in the first dimension based on the set of probabilities;
removing the target call feature samples on the first dimension in the first training sample set to obtain target fraud features on the first dimension in the first training sample set;
the fraud feature set comprises a target fraud feature in the first dimension, the target call feature sample is a call feature sample corresponding to the feature variable when the variance is greater than a preset threshold, and the first dimension is any one of the M dimensions.
In this embodiment, an inherent property of fraud features is used, namely that the appearance of a fraud feature among all call features is a low-probability event. For each dimension, the call features with a low probability of occurrence can be screened out and determined as the fraud features in that dimension, finally yielding the fraud feature set in the first training sample set.
Specifically, based on a category label in a first training sample set (that is, whether a number sample is a fraudulent number or a normal number), a probability statistics method may be used to perform feature statistics on call feature samples in a first dimension in the first training sample set, so as to obtain a probability set corresponding to the first dimension.
The probability set comprises the prior probability of the fraud characteristics in the call characteristic sample in the first dimension, the prior probability of the non-fraud characteristics in the call characteristic sample in the first dimension, the conditional probability of the fraud after the non-fraud characteristics in the call characteristic sample in the first dimension, and the conditional probability of the non-fraud after the known fraud characteristics in the call characteristic sample in the first dimension.
Then, based on the set of probabilities, the variance of the feature variable in the first dimension may be determined using equation (4) below.
[Equation (4): image BDA0003265364580000111, not reproduced in the text]
In the above equation (4), i and j each denote a dimension. P(A), i.e., the second probability value, is the prior (marginal) probability of a fraud feature, taken without regard to any non-fraud feature. P(B), i.e., the third probability value, is the prior (marginal) probability of a non-fraud feature. P(A|B), i.e., the fourth probability value, is the conditional probability that fraud occurs given a non-fraud feature, also referred to as the posterior probability of A. P(B|A) is the conditional probability that non-fraud occurs given that a known fraud feature has occurred.
If P[X] denotes the variance of the feature variable in the first dimension, then P[X] = P(B_i | A) or P[X] = 1 - P(B_i | A), which characterizes the degree of dispersion of the features in the first dimension. The smaller the variance of the feature variable, the more discrete the feature, that is, the lower its probability of occurrence; therefore, features whose variance is less than or equal to a target threshold may be retained as the fraud feature set. The target threshold may be set according to the actual situation and is usually relatively small, for example 5%. Correspondingly, the call feature samples in the first training sample set that correspond to feature variables whose variance is greater than a preset threshold (the preset threshold equals 1 minus the target threshold) may be removed; for example, call feature samples in which a feature value of 0 or 1 accounts for more than 95% of the first training sample set are removed, so as to obtain a minimum fraud feature set.
Therefore, the minimum fraud feature set is selected through the variance of the feature variables and used to train the target model, so that the accuracy of extracting the fraud features can be improved, and the training effect of the target model can be improved.
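The screening step described above can be illustrated with a minimal Python sketch. It follows only the worked example in the text (a binary feature whose value 0 or 1 accounts for more than 95% of the samples is removed); the feature names, the sample data, and the helper functions `dominant_ratio` and `drop_near_constant` are hypothetical and do not appear in the original disclosure.

```python
from collections import Counter


def dominant_ratio(values):
    """Return the share of the most common value in one feature column."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)


def drop_near_constant(columns, preset_threshold=0.95):
    """Keep the feature columns whose dominant value share does not exceed the threshold.

    columns: dict mapping a (hypothetical) feature name to its list of 0/1 values.
    preset_threshold: assumed to be 1 minus the 5% target threshold from the text.
    """
    return {
        name: values
        for name, values in columns.items()
        if dominant_ratio(values) <= preset_threshold
    }


# Hypothetical call feature samples in three dimensions.
samples = {
    "night_call": [1, 0, 1, 0, 0, 1, 0, 1, 0, 0],      # dominant share 0.6
    "has_caller_id": [1] * 99 + [0],                   # dominant share 0.99, removed
    "short_duration": [0, 0, 1, 0, 1, 0, 0, 0, 1, 0],  # dominant share 0.7
}
kept = drop_near_constant(samples)
print(sorted(kept))
```

In this sketch a column is dropped only when one value is almost always present, which mirrors the 95% example; how the variance of equation (4) maps onto this ratio in the actual embodiment is not specified here and is left as an assumption.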
Optionally, the method further includes:
acquiring a second training sample set and a third training sample set;
training the target model based on the second training sample set;
respectively performing performance tests on the target model obtained by training the first training sample set and the target model obtained by training the second training sample set based on the third training sample set to obtain a first performance index of the target model obtained by training the first training sample set and a second performance index of the target model obtained by training the second training sample set;
and selecting a target model for information recognition from the trained target models based on the first performance index and the second performance index.
In this embodiment, all public data may be divided into K parts to obtain K sample sets, where K may be greater than 2. The first training sample set may be any one of the K sample sets, and the first training sample set may include only one number sample corresponding to the public data, together with the M call feature samples, in M dimensions, of each number sample corresponding to the public data. K-1 sample sets can be randomly selected, and the K-1 sample sets can include the first training sample set and the second training sample set. Each of these sample sets serves as a training sample set for training the target model, so that K-1 trained target models are obtained, while the remaining sample set serves as the test sample set, namely the third training sample set, and is used to evaluate which trained target model performs best at information identification.
Specifically, for each trained target model, each row of data in the third training sample set may be input into the target model to perform the first detection operation, so as to obtain a plurality of recognition results. The average over the plurality of recognition results is used as an estimate of the accuracy of the target model and serves as the performance index of the target model under k-fold cross validation.
For example, the target model trained on the first training sample set is evaluated based on the third training sample set to obtain 5 recognition results, which are respectively a fraud number, a normal number and a normal number, while the category labels corresponding to the 5 recognition results are a fraud number, a normal number, a fraud number, a normal number and a normal number. The target model thus identifies 3 numbers correctly and 2 numbers incorrectly, and a performance index, namely the first performance index, can be obtained from this result. Likewise, the target model trained on the second training sample set is evaluated based on the third training sample set to obtain 5 recognition results, which are respectively a fraud number, a normal number and a normal number. Here the target model identifies 5 numbers correctly and 0 numbers incorrectly, and a performance index, namely the second performance index, can be obtained from this result.
Therefore, from the recognition situation, the second performance index is superior to the first performance index, and a target model with the optimal performance index can be selected from the trained target models to serve as a final target model for information recognition, so that the accuracy of information recognition can be improved.
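The selection procedure above can be sketched as follows, assuming accuracy (the fraction of recognition results that match their category labels) is used as the performance index. The functions `accuracy` and `select_best`, the model names, and the individual predictions of the first model are hypothetical; the text only states that the first model gets 3 of 5 numbers right and the second gets all 5 right.

```python
def accuracy(predictions, labels):
    """Fraction of recognition results that match their category labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def select_best(candidates, test_rows, test_labels):
    """Return (name, score) of the candidate with the highest held-out accuracy.

    candidates: dict mapping a model name to a callable that maps one row to a label.
    """
    scores = {
        name: accuracy([predict(row) for row in test_rows], test_labels)
        for name, predict in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Category labels of the held-out (third) sample set, as given in the text.
labels = ["fraud", "normal", "fraud", "normal", "normal"]

# Stand-in models: lookup tables indexed by row number (hypothetical outputs).
outputs_first = ["fraud", "normal", "normal", "fraud", "normal"]  # 3 right, 2 wrong
outputs_second = list(labels)                                     # 5 right, 0 wrong
candidates = {
    "trained_on_first_set": lambda i: outputs_first[i],
    "trained_on_second_set": lambda i: outputs_second[i],
}

best_name, best_score = select_best(candidates, range(len(labels)), labels)
print(best_name, best_score)
```

With these stand-in outputs the first performance index is 0.6 and the second is 1.0, so the model trained on the second training sample set is selected.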
The following describes an information recognition apparatus according to an embodiment of the present invention.
Referring to fig. 3, a schematic structural diagram of an information recognition apparatus according to an embodiment of the present invention is shown.
As shown in fig. 3, the information recognition apparatus 300 includes:
a first obtaining module 301, configured to obtain M call features of a number to be identified in M dimensions, where M is a positive integer;
a first detection operation module 302, configured to input the M call features into a target model to perform a first detection operation, so as to obtain an identification result, where the identification result is used to characterize whether the number to be identified is a fraudulent number;
wherein the first detection operation comprises: for each dimension, carrying out fraud analysis on the M call features based on model parameters of the target model in the dimension to obtain fraud probability values of the number to be identified in the dimension, and determining the identification result based on the M fraud probability values of the number to be identified in the M dimensions and feature weight information corresponding to the M dimensions.
Optionally, the first obtaining module 301 is specifically configured to:
and carrying out feature coding on the call data of the number to be identified on M dimensions to obtain the M call features.
The information identification apparatus 300 can implement each process implemented in the above-described information identification method embodiment, and can achieve the same technical effect, and is not described here again to avoid repetition.
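As a rough illustration of the first detection operation, the sketch below computes one fraud probability per dimension and then combines the M probabilities with per-dimension feature weights. This passage does not give the concrete functional form, so the logistic function, the weighted average, the 0.5 decision threshold, and all names and numbers here are assumptions made purely for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, used here as an assumed link from score to probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fraud_probabilities(features, params):
    """Return one fraud probability per dimension.

    features: the M call features of the number to be identified.
    params: M rows of model parameters; row d weighs all M features for dimension d.
    """
    return [sigmoid(sum(w * x for w, x in zip(row, features))) for row in params]


def identify(features, params, weights, threshold=0.5):
    """Combine the M probabilities with feature weights (weights must not sum to zero)."""
    probs = fraud_probabilities(features, params)
    score = sum(w * p for w, p in zip(weights, probs)) / sum(weights)
    return score >= threshold, score


# Hypothetical values for M = 3 dimensions.
features = [1, 0, 1]
params = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
weights = [0.5, 0.2, 0.3]

is_fraud, score = identify(features, params, weights)
print(is_fraud, round(score, 3))
```

A weighted average keeps the combined score in the range 0 to 1, which makes a fixed threshold easy to interpret; the actual embodiment may combine the probabilities differently.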
The following describes a model training apparatus provided in an embodiment of the present invention.
Referring to fig. 4, a schematic structural diagram of a model training apparatus according to an embodiment of the present invention is shown.
As shown in fig. 4, the model training apparatus 400 includes:
a second obtaining module 401, configured to obtain a first training sample set, where the first training sample set includes M call feature samples of number samples in M dimensions, and M is a positive integer;
a feature statistics module 402, configured to perform feature statistics on the first training sample set to determine a fraud feature set in the first training sample set;
a second detecting operation module 403, configured to input the M call feature samples and the fraud feature set into a target model to perform a second detecting operation, where the second detecting operation includes: for each dimension, performing fraud analysis on the M call feature samples based on model parameters of the target model in the dimension and the fraud feature set to obtain first probability values of fraud features of the M call feature samples in the dimension, determining feature distribution of fraud features in the M call feature samples based on the fraud feature set, and determining a target estimation function based on the M first probability values corresponding to the M dimensions and the feature distribution;
an updating module 404, configured to update the model parameters of the target model based on the target estimation function.
Optionally, the first training sample set includes a category label of the number sample, and the feature statistics module 402 is specifically configured to:
performing feature statistics on call feature samples on a first dimension in the first training sample set based on category labels in the first training sample set to obtain a probability set corresponding to the first dimension, wherein the probability set comprises a second probability value, a third probability value and a fourth probability value, the second probability value is the prior probability of fraudulent features in the call feature samples on the first dimension, the third probability value is the prior probability of non-fraudulent features in the call feature samples on the first dimension, and the fourth probability value is the conditional probability of fraud occurring after the non-fraudulent features in the call feature samples on the first dimension;
determining a variance of the feature variable in the first dimension based on the set of probabilities;
removing the target call feature samples on the first dimension in the first training sample set to obtain target fraud features on the first dimension in the first training sample set;
the fraud feature set comprises target fraud features on the first dimension, the target call feature samples are call feature samples corresponding to the feature variables when the variance is larger than a preset threshold, and the first dimension is any one of the M dimensions.
Optionally, the apparatus further comprises:
a third obtaining module, configured to obtain a second training sample set and a third training sample set;
a training module to train the target model based on the second set of training samples;
the performance testing module is used for respectively performing performance testing on the target model obtained by training the first training sample set and the target model obtained by training the second training sample set based on the third training sample set to obtain a first performance index of the target model obtained by training the first training sample set and a second performance index of the target model obtained by training the second training sample set;
and the selection module is used for selecting the target model for information recognition from the target models obtained by training based on the first performance index and the second performance index.
Optionally, the updating module 404 is specifically configured to:
differentiating the target estimation function with respect to a target variable to obtain a derivative function, wherein the target variable is a variable representing a model parameter of the target model;
in a case where the value of the derivative function is equal to zero, determining and updating the model parameters of the target model based on the derivative function.
The model training device 400 can implement each process implemented in the above-described embodiment of the model training method, and can achieve the same technical effects, and is not described herein again to avoid repetition.
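The parameter update performed by the updating module 404 (differentiate the target estimation function and solve for the point where the derivative is zero) can be illustrated with a small example. The form of the target estimation function is not given in this passage, so the sketch assumes, purely for illustration, a Bernoulli log-likelihood over hypothetical 0/1 fraud-feature indicators with a single parameter theta.

```python
def log_likelihood_derivative(theta, observations):
    """Derivative with respect to theta of sum(x*log(theta) + (1-x)*log(1-theta)).

    theta must lie strictly between 0 and 1.
    """
    ones = sum(observations)
    zeros = len(observations) - ones
    return ones / theta - zeros / (1.0 - theta)


def solve_stationary_point(observations):
    """Closed-form theta at which the derivative above equals zero (the sample mean)."""
    return sum(observations) / len(observations)


# Hypothetical indicators: 1 means a fraud feature is present in the sample.
observations = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

theta = solve_stationary_point(observations)
residual = log_likelihood_derivative(theta, observations)
print(theta, residual)
```

Setting ones/theta - zeros/(1 - theta) to zero gives theta = ones/(ones + zeros), so the updated parameter is simply the observed share of fraud features; a real target estimation function with M dimensions would be solved per parameter in the same spirit.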
The following describes an electronic device provided in an embodiment of the present invention.
Referring to fig. 5, a schematic structural diagram of an electronic device provided by an embodiment of the present invention is shown. As shown in fig. 5, the electronic device 500 includes: a processor 501, a memory 502, a user interface 503, and a bus interface 504.
The processor 501, which is used to read the program in the memory 502, executes the following processes:
acquiring M call characteristics of a number to be identified in M dimensions, wherein M is a positive integer;
inputting the M call characteristics into a target model to execute a first detection operation to obtain an identification result, wherein the identification result is used for representing whether the number to be identified is a fraud number;
wherein the first detecting operation comprises: for each dimension, carrying out fraud analysis on the M call features based on model parameters of the target model in the dimension to obtain fraud probability values of the number to be identified in the dimension, and determining the identification result based on the M fraud probability values of the number to be identified in the M dimensions and feature weight information corresponding to the M dimensions.
In fig. 5, the bus architecture may include any number of interconnected buses and bridges, with one or more processors, represented by processor 501, and various circuits, represented by memory 502, linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. Bus interface 504 provides an interface. For different user devices, the user interface 503 may also be an interface capable of interfacing with a desired device externally, including but not limited to a keypad, display, speaker, microphone, joystick, etc.
The processor 501 is responsible for managing the bus architecture and general processing, and the memory 502 may store data used by the processor 501 in performing operations.
Optionally, the processor 501 is further configured to:
and carrying out feature coding on the call data of the number to be identified on M dimensions to obtain the M call features.
The processor 501, which is also used to read the program in the memory 502, executes the following processes:
acquiring a first training sample set, wherein the first training sample set comprises M call characteristic samples of number samples in M dimensions, and M is a positive integer;
performing feature statistics on the first training sample set to determine a fraud feature set in the first training sample set;
inputting the M call feature samples and the fraud feature set into a target model to perform a second detection operation, wherein the second detection operation comprises: for each dimension, performing fraud analysis on the M call feature samples based on model parameters of the target model in the dimension and the fraud feature set to obtain first probability values of fraud features of the M call feature samples in the dimension, determining feature distribution of fraud features in the M call feature samples based on the fraud feature set, and determining a target estimation function based on the M first probability values corresponding to the M dimensions and the feature distribution;
updating model parameters of the target model based on the target estimation function.
Optionally, the first training sample set includes a class label of the number sample, and the processor 501 is further configured to:
performing feature statistics on call feature samples on a first dimension in the first training sample set based on class labels in the first training sample set to obtain a probability set corresponding to the first dimension, wherein the probability set comprises a second probability value, a third probability value and a fourth probability value, the second probability value is the prior probability of a fraud feature in the call feature samples on the first dimension, the third probability value is the prior probability of a non-fraud feature in the call feature samples on the first dimension, and the fourth probability value is the conditional probability of fraud occurring after the non-fraud feature in the call feature samples on the first dimension;
determining a variance of the feature variable in the first dimension based on the set of probabilities;
removing the target call feature samples on the first dimension in the first training sample set to obtain target fraud features on the first dimension in the first training sample set;
the fraud feature set comprises target fraud features on the first dimension, the target call feature samples are call feature samples corresponding to the feature variables when the variance is larger than a preset threshold, and the first dimension is any one of the M dimensions.
Optionally, the processor 501 is further configured to:
acquiring a second training sample set and a third training sample set;
training the target model based on the second set of training samples;
respectively performing performance tests on the target model obtained by training the first training sample set and the target model obtained by training the second training sample set based on the third training sample set to obtain a first performance index of the target model obtained by training the first training sample set and a second performance index of the target model obtained by training the second training sample set;
and selecting a target model for information recognition from the trained target models based on the first performance index and the second performance index.
Optionally, the processor 501 is further configured to:
differentiating the target estimation function with respect to a target variable to obtain a derivative function, wherein the target variable is a variable representing a model parameter of the target model;
in a case where the value of the derivative function is equal to zero, determining and updating the model parameters of the target model based on the derivative function.
Preferably, an embodiment of the present invention further provides an electronic device, which includes a processor 501, a memory 502, and a computer program that is stored in the memory 502 and is executable on the processor 501, and when the computer program is executed by the processor 501, the computer program implements each process of the above-mentioned embodiment of the information identification method, or implements each process of the above-mentioned embodiment of the model training method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above-mentioned information identification method embodiment or implements each process of the above-mentioned model training method embodiment, and can achieve the same technical effect, and in order to avoid repetition, the computer program is not described herein again. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk, and various other media capable of storing program code.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An information identification method, characterized in that the method comprises:
acquiring M call characteristics of a number to be identified in M dimensions, wherein M is a positive integer;
inputting the M call characteristics into a target model to execute a first detection operation to obtain an identification result, wherein the identification result is used for representing whether the number to be identified is a fraud number;
wherein the first detecting operation comprises: for each dimension, carrying out fraud analysis on the M call features based on model parameters of the target model in the dimension to obtain fraud probability values of the number to be identified in the dimension, and determining the identification result based on the M fraud probability values of the number to be identified in the M dimensions and feature weight information corresponding to the M dimensions.
2. The method according to claim 1, wherein the obtaining M call features of the number to be identified in M dimensions comprises:
and carrying out feature coding on the call data of the number to be identified on M dimensions to obtain the M call features.
3. A method of model training, the method comprising:
acquiring a first training sample set, wherein the first training sample set comprises M call characteristic samples of number samples in M dimensions, and M is a positive integer;
performing feature statistics on the first training sample set to determine a fraud feature set in the first training sample set;
inputting the M call feature samples and the fraud feature set into a target model to execute a second detection operation, wherein the second detection operation comprises the following steps: for each dimension, performing fraud analysis on the M call feature samples based on model parameters of the target model in the dimension and the fraud feature set to obtain first probability values of fraud features existing in the dimension of the M call feature samples, determining feature distribution containing fraud features in the M call feature samples based on the fraud feature set, and determining a target estimation function based on the M first probability values and the feature distribution corresponding to the M dimensions;
updating model parameters of the target model based on the target estimation function.
4. The method of claim 3, wherein the first set of training samples includes class labels for the number samples, and wherein performing feature statistics on the first set of training samples to determine a set of fraud features in the first set of training samples comprises:
performing feature statistics on call feature samples on a first dimension in the first training sample set based on class labels in the first training sample set to obtain a probability set corresponding to the first dimension, wherein the probability set comprises a second probability value, a third probability value and a fourth probability value, the second probability value is the prior probability of a fraud feature in the call feature samples on the first dimension, the third probability value is the prior probability of a non-fraud feature in the call feature samples on the first dimension, and the fourth probability value is the conditional probability of fraud occurring after the non-fraud feature in the call feature samples on the first dimension;
determining a variance of the feature variable in the first dimension based on the set of probabilities;
removing the target call feature samples on the first dimension in the first training sample set to obtain target fraud features on the first dimension in the first training sample set;
the fraud feature set comprises a target fraud feature in the first dimension, the target call feature sample is a call feature sample corresponding to the feature variable when the variance is greater than a preset threshold, and the first dimension is any one of the M dimensions.
5. The method of claim 3, further comprising:
acquiring a second training sample set and a third training sample set;
training the target model based on the second set of training samples;
respectively performing performance tests on the target model obtained by training the first training sample set and the target model obtained by training the second training sample set based on the third training sample set to obtain a first performance index of the target model obtained by training the first training sample set and a second performance index of the target model obtained by training the second training sample set;
and selecting a target model for information recognition from the trained target models based on the first performance index and the second performance index.
6. The method of claim 3, wherein updating model parameters of the target model based on the target estimation function comprises:
differentiating the target estimation function with respect to a target variable to obtain a derivative function, wherein the target variable is a variable representing a model parameter of the target model;
determining and updating model parameters of the target model based on the derivative function if the value of the derivative function is equal to zero.
7. An information recognition apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring M call characteristics of the number to be identified in M dimensions, wherein M is a positive integer;
the first detection operation module is used for inputting the M call characteristics to a target model to execute a first detection operation to obtain an identification result, and the identification result is used for representing whether the number to be identified is a fraud number;
wherein the first detecting operation comprises: for each dimension, carrying out fraud analysis on the M call features based on model parameters of the target model in the dimension to obtain fraud probability values of the number to be identified in the dimension, and determining the identification result based on the M fraud probability values of the number to be identified in the M dimensions and feature weight information corresponding to the M dimensions.
8. A model training apparatus, the apparatus comprising:
a second acquisition module, configured to acquire a first training sample set, wherein the first training sample set comprises M call feature samples of number samples in M dimensions, and M is a positive integer;
a feature counting module, configured to perform feature counting on the first training sample set to determine a fraud feature set in the first training sample set;
a second detection operation module, configured to input the M call feature samples and the fraud feature set to a target model to perform a second detection operation, wherein the second detection operation comprises: for each dimension, performing fraud analysis on the M call feature samples based on model parameters of the target model in the dimension and the fraud feature set to obtain a first probability value of fraud features of the M call feature samples in the dimension, determining a feature distribution of fraud features in the M call feature samples based on the fraud feature set, and determining a target estimation function based on the M first probability values corresponding to the M dimensions and the feature distribution;
and an updating module, configured to update the model parameters of the target model based on the target estimation function.
9. An electronic device, characterized in that the electronic device comprises a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the information identification method as claimed in any one of claims 1 to 2 or the steps of the model training method as claimed in any one of claims 3 to 6.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, wherein the computer program, when executed by a processor, implements the steps of the information identification method as claimed in any one of claims 1 to 2 or the steps of the model training method as claimed in any one of claims 3 to 6.
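
For illustration only, the first detection operation recited in the apparatus claims above can be sketched in Python as follows. The claims do not specify the form of the per-dimension fraud analysis or how the M fraud probability values are combined with the feature weight information, so this sketch assumes a logistic score per dimension, a normalized weighted sum, and a fixed decision threshold. The names fraud_probability, first_detection_operation, model_params, feature_weights and threshold are hypothetical and do not come from the patent.

```python
import math
from typing import List, Sequence, Tuple


def fraud_probability(features: Sequence[float], params: Sequence[float]) -> float:
    """Assumed per-dimension fraud analysis: a logistic score over all M call features."""
    z = sum(p * x for p, x in zip(params, features))
    return 1.0 / (1.0 + math.exp(-z))


def first_detection_operation(
    features: Sequence[float],
    model_params: Sequence[Sequence[float]],
    feature_weights: Sequence[float],
    threshold: float = 0.5,  # assumed decision threshold, not stated in the claims
) -> Tuple[List[float], bool]:
    """Return the M per-dimension fraud probabilities and the identification result."""
    m = len(features)
    if len(model_params) != m or len(feature_weights) != m:
        raise ValueError("expected one parameter vector and one weight per dimension")

    # One fraud probability per dimension, each computed from all M call features
    # using that dimension's model parameters.
    probabilities = [fraud_probability(features, model_params[k]) for k in range(m)]

    # Assumed combination rule: weighted average of the M probabilities.
    score = sum(w * p for w, p in zip(feature_weights, probabilities)) / sum(feature_weights)
    return probabilities, score >= threshold
```

Under these assumptions, a call such as first_detection_operation([1.0, 2.0], [[5.0, 5.0], [5.0, 5.0]], [0.3, 0.7]) returns two probabilities close to 1 and flags the number as a fraud number. A real implementation would use the parameters and weights learned by the model training method.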
CN202111085163.0A 2021-09-16 2021-09-16 Information identification method, model training method, related device and electronic equipment Pending CN115811735A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111085163.0A CN115811735A (en) 2021-09-16 2021-09-16 Information identification method, model training method, related device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115811735A true CN115811735A (en) 2023-03-17

Family

ID=85481010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111085163.0A Pending CN115811735A (en) 2021-09-16 2021-09-16 Information identification method, model training method, related device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115811735A (en)

Similar Documents

Publication Publication Date Title
CN110009174B (en) Risk recognition model training method and device and server
CN110910901B (en) Emotion recognition method and device, electronic equipment and readable storage medium
CN107222865B (en) Communication swindle real-time detection method and system based on suspicious actions identification
CN110147823B (en) Wind control model training method, device and equipment
CN110992169A (en) Risk assessment method, device, server and storage medium
CN109819126B (en) Abnormal number identification method and device
CN109903053B (en) Anti-fraud method for behavior recognition based on sensor data
CN110728323A (en) Target type user identification method and device, electronic equipment and storage medium
CN111626754B (en) Card-keeping user identification method and device
CN113205403A (en) Method and device for calculating enterprise credit level, storage medium and terminal
CN114841705B (en) Anti-fraud monitoring method based on scene recognition
CN108076032B (en) Abnormal behavior user identification method and device
CN112269937B (en) Method, system and device for calculating user similarity
CN107871213B (en) Transaction behavior evaluation method, device, server and storage medium
CN111612366B (en) Channel quality assessment method, channel quality assessment device, electronic equipment and storage medium
CN107222319B (en) Communication operation analysis method and device
CN111428963B (en) Data processing method and device
CN112598326A (en) Model iteration method and device, electronic equipment and storage medium
CN111353015B (en) Crowd-sourced question recommendation method, device, equipment and storage medium
CN115811735A (en) Information identification method, model training method, related device and electronic equipment
CN110909753B (en) Data classification method, system and equipment
CN110570301B (en) Risk identification method, device, equipment and medium
KR102332997B1 (en) Server, method and program that determines the risk of financial fraud
CN111652713B (en) Equity wind control modeling method and device
CN114707420A (en) Credit fraud behavior identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination