CN117390533A - Information identification method, storage medium and electronic device - Google Patents

Information identification method, storage medium and electronic device

Info

Publication number
CN117390533A
Authority
CN
China
Prior art keywords
features
account
target
feature
behavior information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210770069.7A
Other languages
Chinese (zh)
Inventor
樊鹏 (Fan Peng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210770069.7A priority Critical patent/CN117390533A/en
Publication of CN117390533A publication Critical patent/CN117390533A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an information identification method, a storage medium and an electronic device, and relates to technologies such as big data, databases and account management in cloud technology scenarios. The method comprises the following steps: acquiring account behavior information corresponding to a target account to be identified; constructing basic portrait features of the target account based on the historical behavior information in the account behavior information, and constructing business vertical features of the target account based on the business behavior information in the account behavior information; performing aggregation processing in at least two time dimensions on the basic portrait features and the business vertical features to obtain at least two aggregated portrait features of different time dimensions; under the condition that target features are acquired based on the aggregated portrait features, inputting the target features into an academic recognition model; and obtaining an output result of the academic recognition model, wherein the output result comprises account academic information associated with the target account. The method and the device solve the technical problem of low information identification accuracy.

Description

Information identification method, storage medium and electronic device
Technical Field
The present application relates to the field of computers, and in particular, to an information identification method, a storage medium, and an electronic device.
Background
In order to ensure the accuracy of information identification, both the quantity and the quality of reference information need to be guaranteed. However, as information security has become more widespread and accounts attach greater importance to the privacy of their information, information islands of reference information easily form, which leads to low information identification accuracy. Therefore, there is a problem that the recognition accuracy of information is low.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides an information identification method, a storage medium and electronic equipment, which are used for at least solving the technical problem of low information identification accuracy.
According to an aspect of the embodiments of the present application, there is provided an information identification method, including: acquiring account behavior information corresponding to a target account to be identified; constructing basic portrait features of the target account based on historical behavior information in the account behavior information, and constructing business vertical features of the target account based on business behavior information in the account behavior information, wherein the historical behavior information is behavior information of the target account executed within a preset time range, and the business behavior information is behavior information of the target account executed on a target object within the preset range; performing aggregation processing in at least two time dimensions on the basic portrait features and the business vertical features to obtain aggregated portrait features of at least two different time dimensions; inputting target features into an academic recognition model under the condition that the target features are acquired based on the aggregated portrait features, wherein the academic recognition model is a neural network model which is trained by using target sample features and is used for recognizing the academic qualification of an account, and the target sample features are sample features obtained by processing initial sample features by using a metric learning algorithm; and obtaining an output result of the academic recognition model, wherein the output result comprises account academic information associated with the target account.
According to another aspect of the embodiments of the present application, there is provided another information identification method, including: acquiring account behavior information corresponding to each of a plurality of sample accounts; acquiring a plurality of initial sample features based on the account behavior information corresponding to each of the plurality of sample accounts, wherein each initial sample feature in the plurality of initial sample features corresponds one-to-one to each sample account in the plurality of sample accounts; processing the plurality of initial sample features by using a metric learning algorithm to obtain a plurality of target sample features; inputting the plurality of target sample features into an initial academic recognition model for training to obtain a trained academic recognition model; and identifying the account academic information associated with the account to be identified based on the trained academic recognition model.
According to another aspect of the embodiments of the present application, there is also provided an information identification apparatus, including: a first acquisition unit, configured to acquire account behavior information corresponding to the target account to be identified; a construction unit, configured to construct basic portrait features of the target account based on historical behavior information in the account behavior information and construct business vertical features of the target account based on business behavior information in the account behavior information, wherein the historical behavior information is behavior information of the target account executed within a preset time range, and the business behavior information is behavior information of the target account executed on a target object within the preset range; an aggregation unit, configured to perform aggregation processing in at least two time dimensions on the basic portrait features and the business vertical features to obtain at least two aggregated portrait features of different time dimensions; a first input unit, configured to input target features into an academic recognition model under the condition that the target features are acquired based on the aggregated portrait features, wherein the academic recognition model is a neural network model which is trained by using target sample features and is used for recognizing the academic qualification of an account, and the target sample features are sample features obtained by processing initial sample features by using a metric learning algorithm; and a second acquisition unit, configured to acquire an output result of the academic recognition model, wherein the output result comprises account academic information associated with the target account.
As an alternative, the above aggregation unit includes: a first acquisition module, configured to acquire basic portrait features in a first time period and business vertical features in the first time period, wherein the preset time range comprises the first time period, the basic portrait features in the first time period are constructed from behavior information of the target account executed in the first time period, and the business behavior information is behavior information of the target account executed on the target object within the preset range in the first time period; a first aggregation module, configured to aggregate the basic portrait features in the first time period and the business vertical features in the first time period to obtain aggregated portrait features in a first time dimension, wherein the at least two aggregated portrait features of different time dimensions comprise the aggregated portrait features in the first time dimension; a second acquisition module, configured to acquire basic portrait features in a second time period and business vertical features in the second time period, wherein the preset time range comprises the second time period, the basic portrait features in the second time period are constructed from behavior information of the target account executed in the second time period, and the business behavior information is behavior information of the target account executed on the target object within the preset range in the second time period; and a second aggregation module, configured to aggregate the basic portrait features in the second time period and the business vertical features in the second time period to obtain aggregated portrait features in a second time dimension, wherein the at least two aggregated portrait features of different time dimensions comprise the aggregated portrait features in the second time dimension.
As an alternative, the first aggregation module includes: an aggregation sub-module, configured to aggregate, through an aggregation function, the basic portrait features in the first time period and the business vertical features in the first time period, wherein the aggregation mode corresponding to the aggregation function comprises at least one of the following: summation, median, standard deviation.
As an alternative, the above apparatus comprises at least one of: a first processing unit, configured to perform normalization feature processing on the aggregated portrait features before the target features are input into the academic recognition model; and a second processing unit, configured to perform discretized non-numerical feature processing on the aggregated portrait features before the target features are input into the academic recognition model.
As an alternative, the second processing unit includes at least one of: a first processing module, configured to perform feature numerization on the features in the aggregated portrait features that take classification values; a second processing module, configured to replace the category of a classification feature in the aggregated portrait features with the frequency with which that classification feature occurs; a third processing module, configured to convert high-dimensional sparse categorical variables in the aggregated portrait features into low-dimensional dense continuous variables; a fourth processing module, configured to, for missing values of a continuous feature in the aggregated portrait features, select the average of the feature values of all continuous features in the aggregated portrait features to fill the missing values of the continuous feature, or select the median of the feature values of all continuous features in the aggregated portrait features to fill the missing values of the continuous feature; a fifth processing module, configured to, for missing values of a discrete feature in the aggregated portrait features, select the most frequently occurring feature value among all discrete features in the aggregated portrait features to fill the missing values of the discrete feature; and a sixth processing module, configured to merge multiple values under a categorical variable in the aggregated portrait features into the same information.
As an optional solution, the third processing module includes: a processing sub-module, configured to perform feature embedding, based on a DNN model, on the categorical variable representing the account behavior trajectory of the target account, to obtain a continuous variable representing the behavior pattern of the account behavior trajectory.
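As an illustrative sketch of this kind of feature embedding (not the reference implementation of the application; the vocabulary size, layer widths and output dimension are assumptions), a sparse categorical behavior-trajectory ID can be mapped to a dense low-dimensional continuous vector with a small DNN as follows:

```python
# Illustrative sketch only: embedding a sparse categorical behavior-trajectory ID
# into a dense low-dimensional vector with a small DNN. Vocabulary size, embedding
# dimension and layer widths are assumptions, not values taken from the application.
import torch
import torch.nn as nn

class TrajectoryEmbedder(nn.Module):
    def __init__(self, num_trajectory_ids: int = 10000, embed_dim: int = 16):
        super().__init__()
        self.embedding = nn.Embedding(num_trajectory_ids, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 8),  # 8-dimensional continuous "behavior pattern" vector
        )

    def forward(self, trajectory_ids: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.embedding(trajectory_ids))

# Usage: map a batch of categorical trajectory IDs to continuous features.
embedder = TrajectoryEmbedder()
dense_features = embedder(torch.tensor([3, 42, 7]))   # shape: (3, 8)
```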
As an alternative, the above construction unit includes: the third acquisition module is used for acquiring the media information under the specific type related to the account academic information; and the construction module is used for acquiring the behavior information of the target account number, which is executed on the media information under the specific type, from the account number behavior information, and constructing the business vertical feature based on the behavior information of the target account number, which is executed on the media information under the specific type.
According to another aspect of the embodiments of the present application, there is also provided another information identifying apparatus, including: the third acquisition unit is used for acquiring account behavior information corresponding to each of the plurality of sample accounts; a fourth obtaining unit, configured to obtain a plurality of initial sample features based on account behavior information corresponding to each of the plurality of sample accounts, where each of the plurality of initial sample features corresponds to each of the plurality of sample accounts one-to-one; the third processing unit is used for processing the plurality of initial sample characteristics by using a metric learning algorithm to obtain a plurality of target sample characteristics; the second input unit is used for inputting the characteristics of the plurality of target samples into an initial academic recognition model for training to obtain a trained academic recognition model; the first recognition unit is used for recognizing the account number academic information related to the account number to be recognized based on the trained academic recognition model.
As an alternative, the third processing unit includes: a dividing module, configured to divide the plurality of initial sample features into a training set and a test set; a first training module, configured to train on the initial sample features in the training set by using the metric learning algorithm to obtain a mapping matrix W and a kernel matrix M; a first calculation module, configured to calculate the original distance of each initial sample feature in the training set based on the mapping matrix W and the kernel matrix M; a clustering module, configured to cluster the initial sample features in the training set by using the original distances to obtain K cluster centers, wherein K is a natural number; a second calculation module, configured to calculate a first distance from each initial sample feature in the training set to the cluster centers, and sort the initial sample features in the training set based on the feature similarity corresponding to the first distance to obtain a first sorting result; and an adding module, configured to take the initial sample features in the training set as the target sample features and add the target sample features into a target feature space according to the first sorting result.
As an alternative, the second input unit includes: a second training module, configured to input the target sample features into an initial academic recognition model for training until a training convergence condition is reached: acquiring a current academic recognition model, and inputting the target feature space into the current academic recognition model; clustering each target sample feature in the target feature space by adopting a first distance measure to obtain M first clustering results, wherein M is a natural number; clustering each target sample feature in the target feature space by adopting a second distance measure to obtain M second clustering results; calculating the M first clustering results to obtain a first co-association matrix; calculating the M second clustering results to obtain a second co-association matrix; obtaining a target clustering result of each target sample feature in the target feature space according to the first co-association matrix and the second co-association matrix, wherein the target clustering result is used for indicating the probability that the target sample feature belongs to a feature corresponding to a target academic qualification; and under the condition that the target clustering result reaches the training convergence condition, determining the current academic recognition model as the trained academic recognition model.
As an optional solution, the obtaining, according to the first co-association matrix and the second co-association matrix, a target clustering result of each target sample feature in the target feature space includes: judging whether elements in a first distance matrix corresponding to the first co-association matrix need to be adjusted by using a first threshold corresponding to the first co-association matrix; judging whether elements in a second distance matrix corresponding to the second co-association matrix need to be adjusted by using a second threshold corresponding to the second co-association matrix; when the elements in the first distance matrix and/or the elements in the second distance matrix need to be adjusted and the current iteration number is less than or equal to a target threshold, continuing to judge whether the elements in the second distance matrix need to be adjusted by using the second threshold, until the elements in the first distance matrix and/or the elements in the second distance matrix no longer need to be adjusted or the current iteration number is greater than the target threshold; and under the condition that neither the elements in the first distance matrix nor the elements in the second distance matrix need to be adjusted, acquiring the target clustering result by using a hierarchical clustering algorithm.
As an alternative, the apparatus further includes: a fifth acquisition unit, configured to acquire account behavior information corresponding to each of a plurality of first sample accounts; a sixth acquisition unit, configured to acquire a plurality of first sample features based on the account behavior information corresponding to each of the plurality of first sample accounts, wherein each first sample feature in the plurality of first sample features corresponds one-to-one to each first sample account in the plurality of first sample accounts; a fourth processing unit, configured to process the plurality of first sample features by using the metric learning algorithm to obtain a plurality of second sample features; a third input unit, configured to input the plurality of second sample features into an initial gender identification model for training to obtain a trained gender identification model; and a second identification unit, configured to identify the account gender information associated with the account to be identified based on the trained gender identification model.
According to yet another aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the information identifying method as above.
According to still another aspect of the embodiments of the present application, there is further provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the above-mentioned information identification method through the computer program.
In the embodiment of the application, account behavior information corresponding to a target account to be identified is obtained; basic portrait features of the target account are constructed based on the historical behavior information in the account behavior information, and business vertical features of the target account are constructed based on the business behavior information in the account behavior information, where the historical behavior information is behavior information of the target account executed within a preset time range, and the business behavior information is behavior information of the target account executed on a target object within the preset range; aggregation processing in at least two time dimensions is performed on the basic portrait features and the business vertical features to obtain aggregated portrait features of at least two different time dimensions; under the condition that target features are acquired based on the aggregated portrait features, the target features are input into an academic recognition model, where the academic recognition model is a neural network model trained with target sample features for recognizing the academic qualification of an account, and the target sample features are sample features obtained by processing initial sample features with a metric learning algorithm; and an output result of the academic recognition model is obtained, where the output result includes account academic information associated with the target account. The account behavior information is specifically divided into two types, namely historical behavior information that is comprehensive but not vertical enough, and business behavior information that is less comprehensive but more vertical; the features of the two are aggregated along the time dimension, so that directional features belonging to the specific business scenario of academic recognition are constructed, making up for the lack of reference information in the academic recognition scenario; the initial sample features are further processed with a metric learning algorithm, and the directional features are processed with the information recognition model trained with the processed sample features, thereby achieving the technical effect of improving information identification accuracy and solving the technical problem of low information identification accuracy;
In the embodiment of the application, account behavior information corresponding to each of a plurality of sample accounts is acquired; a plurality of initial sample features are acquired based on the account behavior information corresponding to each of the plurality of sample accounts, where each initial sample feature in the plurality of initial sample features corresponds one-to-one to each sample account in the plurality of sample accounts; the plurality of initial sample features are processed with a metric learning algorithm to obtain a plurality of target sample features; the plurality of target sample features are input into an initial academic recognition model for training to obtain a trained academic recognition model; and the account academic information associated with the account to be identified is identified based on the trained academic recognition model. By processing the initial sample features with the metric learning algorithm, sample data carrying less information is expanded and extended, so that the technical aim of improving the training quality of the information recognition model on the basis of limited training resources is achieved, the technical effect of improving information identification accuracy is realized, and the technical problem of low information identification accuracy is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a schematic illustration of an application environment of an alternative information identification method according to an embodiment of the present application;
FIG. 2 is a schematic illustration of a flow of an alternative information identification method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative information identification method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another alternative information identification method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of another alternative information identification method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another alternative information identification method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of another alternative information identification method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of another alternative information identification method according to an embodiment of the present application;
FIG. 9 is a schematic illustration of a flow of another alternative information identification method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of another alternative information identification method according to an embodiment of the present application;
FIG. 11 is a schematic diagram of an alternative information identification device according to an embodiment of the present application;
FIG. 12 is a schematic diagram of another alternative information identification apparatus according to an embodiment of the present application;
Fig. 13 is a schematic structural view of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the solution of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of protection of the present application.
Cloud technology refers to a hosting technology that integrates hardware, software, network and other resources within a wide area network or a local area network to realize the computation, storage, processing and sharing of data.
Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied on the basis of the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The background services of technical network systems, such as video websites, picture websites and more portal websites, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, each article may have its own identification mark in the future, which will need to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data require strong system backing support, which can only be realized through cloud computing.
A database can be considered as an electronic filing cabinet, that is, a place for storing electronic files, in which operations such as adding, querying, updating and deleting can be performed on the data in the files. A "database" is a collection of data that is stored together in a manner that can be shared by multiple accounts, has as little redundancy as possible, and is independent of the application.
A database management system (Database Management System, DBMS for short) is a computer software system designed for managing databases, and generally has basic functions such as storage, retrieval, security and backup. Database management systems may be classified according to the database models they support, e.g., relational or XML (Extensible Markup Language); or by the type of computer supported, e.g., server cluster or mobile phone; or by the query language used, such as SQL (Structured Query Language) or XQuery; or by the performance emphasis, such as maximum scale or maximum speed; or by other classification means. Regardless of which classification means is used, some DBMSs can span categories, for example, supporting multiple query languages at the same time.
Big data refers to a data set that cannot be captured, managed and processed by conventional software tools within a certain time range; it is a massive, high-growth and diversified information asset that requires new processing modes to provide stronger decision-making, insight discovery and process optimization capabilities. With the advent of the cloud era, big data has attracted more and more attention, and it requires special techniques to effectively process large amounts of data within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the internet, and scalable storage systems.
The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present application, an information identification method is provided. Optionally, as an optional implementation, the above information identification method may be applied, but is not limited to, in the environment shown in fig. 1, which includes, but is not limited to, an account device 102 and a server 112, where the account device 102 includes, but is not limited to, a display 108, a processor 106 and a memory 104, and the server 112 includes a database 114 and a processing engine 116.
The specific process comprises the following steps:
step S102, the account device 102 obtains an account identifier of the target account 1002 to be identified;
steps S104-S106, the account identifier is sent to the server 112 through the network 110;
steps S108-S110, the server 112 obtains the account behavior information corresponding to the account identifier from the database 114, and then calculates the account academic information corresponding to the account behavior information through the processing engine 116;
in steps S112-S114, the account academic information is sent to the account device 102 through the network 110, and the account device 102 displays the account academic information on the display 108 through the processor 106 and stores it in the memory 104.
In addition to the example shown in fig. 1, the above steps may be performed with the assistance of the server, that is, the server 112 performs the steps of obtaining the account behavior information, constructing and aggregating the features, and running the academic recognition model, so as to reduce the processing pressure of the account device 102. The account device 102 includes, but is not limited to, a handheld device (e.g., a mobile phone), a notebook computer, a desktop computer, a vehicle-mounted device, etc., and the present application does not limit the particular implementation of the account device 102.
Optionally, as an optional embodiment, as shown in fig. 2, the information identifying method includes:
S202, acquiring account behavior information corresponding to a target account to be identified;
s204, constructing basic portrait features of the target account based on historical behavior information in the account behavior information and constructing business vertical features of the target account based on business behavior information in the account behavior information, wherein the historical behavior information is behavior information of the target account executed in a preset time range, and the business behavior information is behavior information of the target account executed on a target object in the preset range;
s206, carrying out aggregation processing of at least two time dimensions on the basic portrait features and the business vertical features to obtain at least two aggregation portrait features with different time dimensions;
s208, under the condition that target features are acquired based on the aggregated portrait features, inputting the target features into an academic recognition model, wherein the academic recognition model is a neural network model which is trained by using target sample features and is used for recognizing the academic qualification of an account, and the target sample features are sample features obtained by processing initial sample features by using a metric learning algorithm;
s210, obtaining an output result of the academic recognition model, wherein the output result comprises account academic information associated with the target account.
Optionally, in this embodiment, the above information identification method may be, but is not limited to, applied to a service scenario of recognizing the highest academic qualification of an account. If the highest academic qualification of an account needs to be identified for accurate product pushing, non-private basic behavior information of the account is obtained first; however, because the behavior features associated with academic qualification are complex, such non-private basic behavior information is usually not directly associated with them, so directly using it for subsequent academic recognition may lead to low recognition accuracy. In this embodiment, in order to improve the accuracy of academic recognition, diversified feature extraction and feature processing are performed on the non-private basic behavior information to obtain relevant features with a higher degree of association with the behavior features of academic qualification, and these relevant features are then used for the subsequent academic recognition, thereby improving its accuracy.
Optionally, in this embodiment, the above information identification method may also be, but is not limited to, applied to other service scenarios, for example a service scenario of account gender identification: first, non-private basic behavior information of the account is obtained; then diversified feature extraction and feature processing are performed on it to obtain relevant features with a higher degree of association with the gender identification scenario; and the relevant features are then used for the subsequent gender identification (for example, the academic recognition model is replaced with a gender identification model, and the relevant features are input into the gender identification model), thereby improving the accuracy of gender identification.
Optionally, in this embodiment, the account behavior information may be, but is not limited to, behavior information of the account collected within a preset time range or a preset spatial range, including historical behavior information and business behavior information, where the historical behavior information is behavior information executed through the target account within the preset time range (such as purchase behavior of virtual props performed through the target account), and the business behavior information is behavior information executed through the target account on the target object within the preset range (such as viewing behavior of media resources performed through the target account), and so on.
Optionally, in this embodiment, the constructing the basic portrait characteristic of the target account based on the historical behavior information in the account behavior information may be understood as, but not limited to, constructing the basic portrait characteristic of the target account based only on the historical behavior information in the account behavior information, and may also be understood as, but not limited to, constructing the basic portrait characteristic of the target account based on the historical behavior information in the account behavior information and other account information, where the other account information may include, but is not limited to, account basic information, account related information (such as related information of other accounts having a relationship with the target account), and so on;
Similarly, in this embodiment, constructing the business vertical features of the target account based on the business behavior information in the account behavior information may be understood as, but is not limited to, constructing the business vertical features of the target account based only on the business behavior information in the account behavior information, and may also be understood as, but is not limited to, constructing the business vertical features of the target account based on the other account information together with the business behavior information in the account behavior information.
Optionally, in this embodiment, for aggregating features of different time spans, aggregation processing of at least two time dimensions is performed on the basic portrait feature and the service vertical feature, so as to obtain at least two aggregated portrait features of different time dimensions.
Optionally, in this embodiment, the academic recognition model is a neural network model trained with target sample features for recognizing the academic qualification of an account, where the target sample features are sample features obtained by processing initial sample features with a metric learning algorithm; metric learning usually operates on the distances between sample feature vectors, and its purpose may be, but is not limited to, learning through training to reduce or limit the distance between samples of the same class while increasing the distance between samples of different classes, as sketched below;
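A minimal sketch of this idea, using a generic contrastive margin loss rather than the specific metric learning algorithm of the application:

```python
# Illustrative sketch of the metric-learning objective described above: pull
# samples of the same class together and push samples of different classes apart
# by at least a margin. This is a generic contrastive loss, not the specific
# algorithm used in the application.
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                     same_class: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    d = F.pairwise_distance(z1, z2)                       # distance between feature vectors
    pull = same_class * d.pow(2)                          # similar pairs: shrink distance
    push = (1 - same_class) * F.relu(margin - d).pow(2)   # dissimilar pairs: enforce margin
    return (pull + push).mean()

z1, z2 = torch.randn(4, 8), torch.randn(4, 8)
same_class = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = contrastive_loss(z1, z2, same_class)
```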
Optionally, in this embodiment, the generated seed account portrait features include account basic attributes (such as gender) and the like; abnormal accounts are filtered based on the portrait, for example, accounts whose WeChat usage time exceeds 24 hours are filtered out; then, based on a clustering metric learning framework, reordered features of the samples in the metric space are obtained so as to improve the information expression capability of the features; then, a metric learning framework method optimized with the co-association matrix is used, combined with the analytic hierarchy process, to perform pooling weighting on the values of the samples in metric dimensions of different distances, and finally the probability that a sample belongs to the positive examples is fitted;
further by way of example, optionally as shown in fig. 3, the specific steps are as follows:
step S302, raw data preparation: based on manual labeling and business experience, finding out positive and negative training samples that are strongly related to the business, normally distributed in data, and reasonable in account portrait;
Specifically, seed accounts carrying label information are obtained based on manual labeling and business logic; for example, a batch of seed accounts is roughly recalled based on rules, then filtered by manual screening, and finally verified against business logic. The basic portraits may include, but are not limited to, some non-private behavioral data of the accounts within a specific App, such as whether a target application is installed, and whether its harassment-interception function or listening-assistant function is used, etc. Abnormal-account indicators are then calculated: in real business scenarios there are fake accounts and cases where a computer controls a mobile phone, so in order to eliminate the influence of such non-real accounts on modeling analysis, abnormal-account detection indicators, such as the traffic usage of the account in the target application and the time distribution of the generated traffic, are set based on business experience. Abnormal seed accounts are then filtered based on distribution anomalies; for example, the outlier determination criterion may be, but is not limited to, the "Laida criterion" (3σ rule), as sketched below. The filtered normal seed accounts are stored offline in HDFS, i.e., the cleaned data are stored in a distributed file system (Hadoop Distributed File System, HDFS for short) to facilitate rapid access by subsequent stages;
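A minimal sketch of the 3σ filtering step, assuming the abnormal-account indicator is a single numeric column; the column names and toy data are hypothetical:

```python
# Illustrative sketch: filter seed accounts whose abnormal-account indicator
# (e.g. traffic volume in the target application) falls outside mean ± 3 * std,
# i.e. the "Laida criterion" (3-sigma rule). Column names are hypothetical.
import pandas as pd

def filter_seed_accounts(df: pd.DataFrame, indicator: str = "daily_traffic_mb") -> pd.DataFrame:
    mean, std = df[indicator].mean(), df[indicator].std()
    mask = (df[indicator] - mean).abs() <= 3 * std
    return df[mask]

seeds = pd.DataFrame({
    "account_id": ["a1", "a2", "a3", "a4"],
    "daily_traffic_mb": [120.0, 95.0, 20480.0, 110.0],  # toy indicator values
})
clean_seeds = filter_seed_accounts(seeds)
# clean_seeds could then be written to HDFS for the later stages of the pipeline.
```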
Step S304, offline feature processing: constructing portrait features of the training samples, and generating high-dimensional feature vectors based on the vertical characteristics of the features, combining the time dimension with different feature processing methods;
Specifically, for example, basic portrait features are constructed: a rich account portrait can be built from the account historical behavior data, and the account portrait comprises at least one of the following: account basic attributes, device basic attributes, network connection attributes and the like. Business vertical features are constructed based on the business characteristics, and may include, but are not limited to, the click rate, conversion rate and the like of the account on specific types of media resources. Combining the time dimension, the portrait features and business features of different time spans are aggregated, for example, aggregate portraits of the account over the past half year, past 3 months, past 1 month and past 1 week are computed, where the aggregation method can be any one or more of summation, median and standard deviation (a sketch follows). Feature processing is then performed by normalizing numerical features and discretizing non-numerical features, where the normalization method may be, but is not limited to, Gaussian normalization. The processed features are combined and stored offline in HDFS to facilitate rapid access by subsequent stages; the feature processing logic is solidified, offline computation is automatically run on a schedule, and the offline computation results are pushed to an online storage engine;
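A minimal sketch of the time-window aggregation and Gaussian normalization described above; the event-log schema, window lengths and column names are assumptions:

```python
# Illustrative sketch: aggregate per-account portrait / business features over
# several time windows (past week / month / 3 months / half year) with sum,
# median and standard deviation, then apply Gaussian (z-score) normalization.
# The event-log schema and column names are assumptions.
import pandas as pd

def aggregate_windows(events: pd.DataFrame, now: pd.Timestamp) -> pd.DataFrame:
    windows = {"1w": 7, "1m": 30, "3m": 90, "6m": 182}
    pieces = []
    for name, days in windows.items():
        recent = events[events["event_time"] >= now - pd.Timedelta(days=days)]
        agg = recent.groupby("account_id")["media_clicks"].agg(["sum", "median", "std"])
        agg.columns = [f"media_clicks_{stat}_{name}" for stat in agg.columns]
        pieces.append(agg)
    features = pd.concat(pieces, axis=1).fillna(0.0)
    return (features - features.mean()) / (features.std() + 1e-9)  # Gaussian normalization

log = pd.DataFrame({
    "account_id": ["a1", "a1", "a2"],
    "event_time": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-02-15"]),
    "media_clicks": [3, 5, 2],
})
features = aggregate_windows(log, now=pd.Timestamp("2024-03-10"))
```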
Step S306, clustering metric learning framework: based on the training samples and feature vectors, and using K-Means clustering and metric learning, obtaining the reordering features of the accounts belonging to the positive examples;
Specifically, for example, the low-order and high-order feature results are read in and concatenated by column; the features are divided into a training set probFea and a test set galFea; the training set probFea is then trained with the metric learning algorithm to compute a mapping matrix W and a kernel matrix M, where the calculation can refer to the following formula (1), formula (2) and formula (3):
Σ_E W = J(W) Σ_I W (2)
further, using the mapping matrix W and the kernel matrix M, the distance dist from each sample in the test set to the test set is calculated, where the calculation can refer to the following formula (4), formula (5) and formula (6):
furthermore, as shown in fig. 4, clustering is performed based on a K-Means algorithm to obtain K clustering centers, which specifically includes the following steps:
step S402, gathering samples to be clustered into 3 types;
step S404, selecting 3 center points;
step S406, finding the nearest center point for each sample, and completing one-time clustering;
step S408, judging whether the clustering conditions of the sample points before and after the first clustering pass are the same; if so, terminating (step S422), and if not, continuing to the next step (step S410);
Step S410, updating the center point according to the clustering result;
step S412, finding the center point closest to each sample to complete secondary clustering;
step S414, determining whether the clustering conditions of the sample points before and after the second clustering pass are the same; if so, terminating (step S422), and if not, continuing to the next step (step S416);
step S416, updating the center point according to the clustering result;
step S418, finding the nearest center point for each sample, and completing a further clustering pass;
step S420, judging whether the clustering conditions of the sample points before and after this clustering pass are the same; if so, terminating (step S422), and if not, returning to step S416 to continue iterating;
step S422, once the algorithm has terminated at a preceding judgment step (such as step S420), presenting the final clustering result;
further calculating the distance from each sample to the clustering center, and reordering based on similarity matching; the reordered result is then added to the new feature space, for example as shown in fig. 5, with the following steps:
step S502, inputting data;
step S504, extracting features;
step S506, training on the training data by metric learning to obtain W and M;
step S508, obtaining the training cluster centers by using the K-means algorithm;
Step S510, calculating a distance based on a clustering center by using W and M;
in step S512, a reorder matching matrix is calculated.
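A minimal sketch of steps S502 to S512 above, assuming the mapping matrix W and kernel matrix M have already been learned by the metric learning algorithm (placeholders are used here); the shapes and the number of clusters are assumptions:

```python
# Illustrative sketch of steps S502-S512: given a mapping matrix W and a kernel
# matrix M learned by a metric learning algorithm (their learning is not shown),
# project the samples, cluster them with K-Means, and reorder each sample's
# distances to the cluster centers as new features. Shapes and K are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def rerank_features(X: np.ndarray, W: np.ndarray, M: np.ndarray, k: int = 3) -> np.ndarray:
    Z = X @ W                                  # project samples into the metric space
    centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z).cluster_centers_
    # Mahalanobis-style distance (z - c)^T M (z - c) from every sample to every center
    diff = Z[:, None, :] - centers[None, :, :]            # (n, k, d)
    dists = np.einsum("nkd,de,nke->nk", diff, M, diff)    # (n, k)
    order = np.argsort(dists, axis=1)                      # reordered center indices per sample
    return np.take_along_axis(dists, order, axis=1)        # sorted distances as new features

X = np.random.rand(100, 8)
W = np.eye(8)[:, :4]          # placeholder mapping matrix (would come from metric learning)
M = np.eye(4)                 # placeholder kernel matrix
new_features = rerank_features(X, W, M)   # appended to the target feature space
```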
Step S308, co-association matrix framework: obtaining a co-association matrix by using K-Means clustering, and using the co-association matrix in the optimization of metric learning;
Specifically, for example, the low-order and high-order feature results (including the reordered results) are read in and concatenated by column; M base clustering results are obtained by using K-Means, and the co-association matrix A is calculated; the distance measure of the K-Means algorithm is then replaced with the Mahalanobis distance, M clustering results are obtained again, and the co-association matrix B is calculated; the Mahalanobis distance between d-dimensional samples x and y is defined as follows:
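A minimal statement of this distance in its standard form, with M denoting the learned metric matrix (the exact parameterization used by the application is an assumption here):

d_M(x, y) = \sqrt{(x - y)^{\top} M (x - y)}, \quad x, y \in \mathbb{R}^{d}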
further, using a metric learning algorithm based on the co-association matrix, the distance matrix A and the threshold e are calculated, where the calculation can refer to the following formula (7):
and a final clustering result is obtained by using an analytic hierarchy process, wherein a calculation flow is shown in fig. 6, and the specific steps are as follows:
step S602, constructing a judgment matrix;
step S604, calculating a single-layer weight subset;
step S606, checking the consistency of the single layer, executing step S608 when the single layer passes, and executing step S602 when the single layer does not pass;
Step S608, calculating a single-layer weight subset;
step S610, checking the consistency of the single layer, executing step S612 when passing, and executing step S602 when not passing;
step S612, obtaining the index weight.
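A minimal sketch of the analytic hierarchy process steps above: deriving a weight vector from a pairwise judgment matrix and checking its consistency. The example judgment matrix is made up; the random-index table is the standard AHP one:

```python
# Illustrative sketch of the analytic hierarchy process: derive a weight vector
# from a pairwise judgment matrix and check its consistency. The example judgment
# matrix is made up; RI values are the standard AHP random-index table.
import numpy as np

RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights(judgment: np.ndarray):
    eigvals, eigvecs = np.linalg.eig(judgment)
    k = np.argmax(eigvals.real)
    weights = np.abs(eigvecs[:, k].real)
    weights /= weights.sum()
    n = judgment.shape[0]
    ci = (eigvals[k].real - n) / (n - 1)         # consistency index
    cr = ci / RI[n]                              # consistency ratio; < 0.1 passes the check
    return weights, cr

J = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])
weights, cr = ahp_weights(J)   # rebuild the judgment matrix if cr >= 0.1
```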
Further, the original distance and the Mahalanobis distance of each sample are calculated, an average pooling operation is performed, the result is normalized, and the normalized result is treated approximately as the probability of belonging to the positive examples. For example, as shown in fig. 7, assume the input dataset is X = {x1, x2, ..., xn}, the number of classes in the clustering result is K, and the maximum number of iterations is I; the output is a clustering result π. The specific steps are as follows:
step S702, obtaining M base clustering results by using the k-means algorithm, and calculating the co-association matrix;
step S704, obtaining a distance matrix A and a threshold c by using a metric learning algorithm based on the co-association matrix;
step S706, for each element of the co-association matrix that is greater than or equal to c, setting the corresponding element of the S matrix to 1 and the corresponding element of the D matrix to 1;
step S708, replacing the distance measure of the k-means algorithm with the Mahalanobis distance, obtaining M base clustering results again, and calculating the co-association matrix;
step S710, obtaining a distance matrix A and a threshold e by using the metric learning algorithm based on the co-association matrix;
step S712, for each element of the co-association matrix that is greater than or equal to e, setting the corresponding element of the S matrix to 1 and the corresponding element of the D matrix to 1;
step S714, judging whether the S matrix and the D matrix have changed; if so, going to step S716, and if not, going to step S718;
step S716, judging whether the iteration number is greater than I; if so, jumping to step S718, and if not, jumping back to step S708;
step S718, obtaining the final clustering result π by using a hierarchical clustering algorithm.
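A minimal sketch of the co-association (consensus) clustering idea behind steps S702 to S718; the thresholds, the number of base clusterings and the derivation of the S and D matrices here are assumptions, and the iterative re-clustering with the Mahalanobis distance is omitted:

```python
# Illustrative sketch: run M base K-Means clusterings, build a co-association
# matrix (fraction of runs in which two samples share a cluster), derive
# similar / dissimilar constraint matrices S and D with thresholds, and obtain
# a final partition by hierarchical clustering on the co-association distances.
# Thresholds, M and K are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def co_association(X: np.ndarray, m_runs: int = 10, k: int = 3) -> np.ndarray:
    n = X.shape[0]
    A = np.zeros((n, n))
    for seed in range(m_runs):
        labels = KMeans(n_clusters=k, n_init=5, random_state=seed).fit_predict(X)
        A += (labels[:, None] == labels[None, :]).astype(float)
    return A / m_runs

X = np.random.rand(60, 4)
A = co_association(X)
S = (A >= 0.8).astype(int)          # "similar" constraints   (threshold c, assumed)
D = (A <= 0.2).astype(int)          # "dissimilar" constraints (threshold e, assumed)
dist = squareform(1.0 - A, checks=False)                   # condensed co-association distance
final_labels = fcluster(linkage(dist, method="average"), t=3, criterion="maxclust")
```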
Optionally, in this embodiment, the account academic information may refer to, but is not limited to, the current highest academic level of the target account, such as below a given level, at the level, above the level, etc.; the account academic information may also be, but is not limited to, probability information of the academic qualification of the target account, such as a 30% probability that the highest academic qualification of the target account is below the given level, a 60% probability that it is at the level, and a 10% probability that it is above the level.
It should be noted that the account behavior information is specifically divided into two types, namely historical behavior information that is comprehensive but not vertical enough, and business behavior information that is less comprehensive but more vertical; the features of the two are aggregated along the time dimension, so that directional features belonging to the specific business scenario of academic recognition are constructed, making up for the lack of reference information in the academic recognition scenario; the initial sample features are further processed with a metric learning algorithm, and the directional features are processed with the information recognition model trained with the processed sample features, thereby improving the information identification accuracy.
Further illustratively, as shown in fig. 8, optionally, account behavior information 804 corresponding to the target account 802 to be identified is obtained; a basic portrait feature 806-1 of the target account 802 is constructed based on the historical behavior information 804-1 in the account behavior information 804, and a business vertical feature 806-2 of the target account 802 is constructed based on the business behavior information 804-2 in the account behavior information 804; aggregation processing in at least two time dimensions is performed on the basic portrait feature 806-1 and the business vertical feature 806-2 to obtain at least two aggregated portrait features 808 of different time dimensions; under the condition that the target features are acquired based on the aggregated portrait features 808, the target features are input into an academic recognition model 810, where the academic recognition model 810 is a neural network model trained with target sample features for recognizing the academic qualification of an account, and the target sample features are sample features obtained by processing initial sample features with a metric learning algorithm; an output result of the academic recognition model 810 is obtained, where the output result includes account academic information 812 associated with the target account 802, such as a 30% probability that the highest academic qualification of the target account is below a given level, a 60% probability that it is at the level, a 10% probability that it is above the level, and so on.
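A minimal sketch of the inference flow in this example, feeding target features into a trained academic recognition model and reading out one probability per academic level; the model type, feature dimensions and label names are assumptions:

```python
# Illustrative sketch of the inference flow: feed the target features derived
# from the aggregated portrait features into a trained neural-network classifier
# and read out a probability per academic level. The model type, feature
# dimensions and label names are assumptions, not taken from the application.
import numpy as np
from sklearn.neural_network import MLPClassifier

levels = ["below_level", "at_level", "above_level"]
X_train = np.random.rand(200, 12)                      # stand-in training features
y_train = np.random.randint(0, 3, size=200)            # stand-in academic-level labels
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0).fit(X_train, y_train)

target_features = np.random.rand(1, 12)                # features of the target account
probabilities = dict(zip(levels, model.predict_proba(target_features)[0]))
```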
According to the method and the device, account behavior information corresponding to the target account to be identified is obtained; basic portrait features of the target account are constructed based on the historical behavior information in the account behavior information, and business vertical features of the target account are constructed based on the business behavior information in the account behavior information, where the historical behavior information is behavior information of the target account executed within a preset time range, and the business behavior information is behavior information of the target account executed on a target object within the preset range; aggregation processing in at least two time dimensions is performed on the basic portrait features and the business vertical features to obtain at least two aggregated portrait features of different time dimensions; under the condition that target features are acquired based on the aggregated portrait features, the target features are input into an academic recognition model, where the academic recognition model is a neural network model trained with target sample features for recognizing the academic qualification of an account, and the target sample features are sample features obtained by processing initial sample features with a metric learning algorithm; and an output result of the academic recognition model is obtained, where the output result includes account academic information associated with the target account. The account behavior information is specifically divided into two types, namely historical behavior information that is comprehensive but not vertical enough, and business behavior information that is less comprehensive but more vertical; the features of the two are aggregated along the time dimension, so that directional features belonging to the specific business scenario of academic recognition are constructed in this way, making up for the lack of reference information in the academic recognition scenario; the initial sample features are further processed with a metric learning algorithm, and the directional features are processed with the information recognition model trained with the processed sample features, thereby achieving the technical effect of improving the information identification accuracy.
As an alternative, the aggregation processing of at least two time dimensions is performed on the basic portrait features and the business vertical features to obtain at least two aggregated portrait features with different time dimensions, including:
s1, acquiring basic portrait features in a first time period and business vertical features in the first time period, wherein a preset time range comprises the first time period, the basic portrait features in the first time period are behavior information of a target account number executed in the first time period, and the business behavior information is behavior information of the target account number executed on a target object in the preset range in the first time period; the method comprises the steps of carrying out aggregation processing on basic portrait features in a first time period and business vertical features in the first time period to obtain aggregated portrait features in a first time dimension, wherein at least two aggregated portrait features in different time dimensions comprise the aggregated portrait features in the first time dimension;
s2, acquiring basic portrait features in a second time period and business vertical features in the second time period, wherein the preset time range comprises the second time period, the basic portrait features in the second time period are behavior information of a target account number executed in the second time period, and the business behavior information is behavior information of the target account number executed on a target object in the preset range in the second time period; and carrying out aggregation processing on the basic portrait features in the second time period and the business vertical features in the second time period to obtain aggregation portrait features in the second time dimension, wherein the aggregation portrait features in at least two different time dimensions comprise aggregation portrait features in the second time dimension.
Optionally, in the present embodiment, "first" and "second" are exemplary descriptions and may be extended to a plurality; for example, the first time period and the second time period do not limit the method to acquiring the basic portrait features and the business vertical features of only two time periods.
As an alternative, the aggregation processing is performed on the basic portrait features in the first time period and the business vertical features in the first time period to obtain aggregated portrait features in the first time dimension, which comprises:
The basic portrait features in the first time period and the business vertical features in the first time period are aggregated through an aggregation function, wherein the aggregation mode corresponding to the aggregation function includes at least one of the following: summation, median, and standard deviation.
Optionally, in this embodiment, the basic portrait features and the business vertical features of different time spans are aggregated along the time dimension; for example, aggregated portraits (aggregated portrait features) of the account over the past half year, past 3 months, past 1 month and past 1 week are calculated, and the three methods of summation, median and standard deviation are used for aggregation.
Optionally, in this embodiment, the aggregate function, also called a group function, may be, but is not limited to being, used to count and calculate data in a table, and is typically used together with grouping (group by) to count and calculate grouped data, such as count(col) representing the total number of rows in a designated column, max(col) representing the maximum value in a designated column, min(col) representing the minimum value in a designated column, sum(col) representing the sum of a designated column, avg(col) representing the average value of a designated column, and so on.
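As an illustration, the following is a minimal sketch of this time-dimension aggregation using pandas; the column names (account_id, event_time, clicks, conversions), the chosen windows and the toy data are assumptions for the example only.

```python
import pandas as pd

# Assumed toy behavior log: one row per account event with numeric features.
log = pd.DataFrame({
    "account_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2022-01-03", "2022-03-15", "2022-06-20", "2022-05-01", "2022-06-25"]),
    "clicks": [3, 5, 2, 7, 1],
    "conversions": [0, 1, 0, 2, 0],
})

now = pd.Timestamp("2022-06-30")
windows = {"6m": 182, "3m": 91, "1m": 30, "1w": 7}   # time spans in days

features = {}
for name, days in windows.items():
    recent = log[log["event_time"] >= now - pd.Timedelta(days=days)]
    # Aggregate each numeric column with sum, median and standard deviation.
    agg = recent.groupby("account_id")[["clicks", "conversions"]].agg(
        ["sum", "median", "std"])
    agg.columns = [f"{col}_{stat}_{name}" for col, stat in agg.columns]
    features[name] = agg

# One wide row of aggregated portrait features per account.
aggregated_portrait = pd.concat(features.values(), axis=1)
print(aggregated_portrait.head())
```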
As an alternative, before inputting the target feature into the academic recognition model, the method includes at least one of:
s1, carrying out normalized numerical feature processing on the aggregated portrait features;
s2, performing discretized non-numerical feature processing on the aggregated portrait features.
Optionally, in the present embodiment, the normalized numerical feature processing method may be, but is not limited to, Gaussian normalization.
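By way of illustration, a minimal sketch of Gaussian (z-score) normalization for the numeric aggregated features is shown below; the use of scikit-learn's StandardScaler and the toy values are assumptions for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed numeric aggregated portrait features, one row per account.
X = np.array([[12.0, 0.4],
              [ 3.0, 0.1],
              [ 7.5, 0.9]])

# Gaussian normalization: subtract the column mean and divide by the column std.
scaler = StandardScaler()
X_norm = scaler.fit_transform(X)
print(X_norm)          # each column now has mean ~0 and std ~1
```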
As an alternative, discretized non-numerical feature processing is performed on the aggregated portrait features, including at least one of:
performing feature digitization on the features belonging to classification values in the aggregated portrait features;
replacing the category of a classification feature in the aggregated portrait features with the frequency of occurrence of the classification feature;
converting high-dimensional sparse classification variables in the aggregated portrait features into low-dimensional dense continuous variables;
for missing values of continuous features in the aggregated portrait features, selecting the average value of the feature values of all continuous features in the aggregated portrait features and filling the missing values of the continuous features; or selecting the median of the feature values of all continuous features in the aggregated portrait features and filling the missing values of the continuous features;
for missing values of discrete features in the aggregated portrait features, selecting the most frequently occurring feature value of all discrete features in the aggregated portrait features and filling the missing values of the discrete features;
integrating a plurality of values under a variable belonging to the same category in the aggregated portrait features into the same information.
Optionally, in this embodiment, feature digitization is performed on the features belonging to classification values in the aggregated portrait features; for example, features belonging to classification values, such as account gender, are digitized by using One-Hot Encoding.
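A minimal sketch of such one-hot encoding with pandas is shown below; the gender values used are placeholders for the example.

```python
import pandas as pd

# Assumed categorical column of the aggregated portrait features.
df = pd.DataFrame({"account_gender": ["male", "female", "male", "unknown"]})

# One-hot encoding: one binary column per category value.
one_hot = pd.get_dummies(df["account_gender"], prefix="gender")
print(one_hot)
```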
Optionally, in this embodiment, the category of a classification feature in the aggregated portrait features is replaced with the frequency of occurrence of the classification feature; for example, by using Count Encoding, the category value is replaced with the number of times it occurs, so that if "Beijing" occurs 10 times in a certain classification column, "Beijing" is replaced with 10. Specifically, for the WiFi POI features of the account, Count Encoding is used to represent the interest level of the account in a POI, for example, the account appears 3 times at a POI of the type "food-Chinese-canteen".
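The following is a minimal sketch of Count Encoding with pandas; the POI category values are placeholders for the example.

```python
import pandas as pd

# Assumed WiFi POI category column of the aggregated portrait features.
df = pd.DataFrame({"wifi_poi": ["food-Chinese-canteen", "mall", "food-Chinese-canteen",
                                "food-Chinese-canteen", "mall", "school"]})

# Count Encoding: replace each category value with its number of occurrences.
counts = df["wifi_poi"].value_counts()
df["wifi_poi_count"] = df["wifi_poi"].map(counts)
print(df)
```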
Optionally, based on data analysis, many category features are found to have strong sparsity. In order to avoid model overfitting and improve model stability, in this embodiment, high-dimensional sparse classification variables in the aggregated portrait features are converted into low-dimensional dense continuous variables; for example, a Category Embedding mode is adopted, and a neural network is introduced to convert the high-dimensional sparse classification variables into low-dimensional dense embedding variables.
Optionally, for missing-value processing of the features, in this embodiment, for missing values of continuous features in the aggregated portrait features, the average value of the feature values of all continuous features in the aggregated portrait features is selected and used to fill the missing values of the continuous features; or the median of the feature values of all continuous features in the aggregated portrait features is selected and used to fill the missing values of the continuous features; for missing values of discrete features in the aggregated portrait features, the most frequently occurring feature value of all discrete features in the aggregated portrait features is selected and used to fill the missing values of the discrete features; in addition, the missing values of the features may also be processed by converting the missing values into an encoded representation.
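A minimal sketch of these missing-value filling strategies with pandas is shown below; the column names and values are placeholders for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "daily_minutes": [30.0, None, 120.0, 45.0],      # continuous feature
    "device_brand": ["A", "B", None, "A"],           # discrete feature
})

# Continuous feature: fill missing values with the mean (or the median).
df["daily_minutes"] = df["daily_minutes"].fillna(df["daily_minutes"].mean())
# df["daily_minutes"] = df["daily_minutes"].fillna(df["daily_minutes"].median())

# Discrete feature: fill missing values with the most frequent value (the mode).
df["device_brand"] = df["device_brand"].fillna(df["device_brand"].mode()[0])
print(df)
```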
Optionally, in this embodiment, a plurality of values under a variable belonging to the same category in the aggregated portrait features are summarized into the same information; for example, a Consolidation Encoding mode is adopted, in which several values under a categorical variable can be summarized into the same information. For example, three values of the target system version feature, "4.2", "4.4" and "5.0", can be generalized into "low-version target system" based on experience. Experiments show that this Consolidation Encoding processing brings greater gains than directly one-hot encoding the "target system version" feature.
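A minimal sketch of Consolidation Encoding is shown below; the version values and the grouping rule are placeholders for the example.

```python
import pandas as pd

df = pd.DataFrame({"system_version": ["4.2", "4.4", "5.0", "9.0", "10.0"]})

# Consolidation Encoding: summarize several raw values into one coarser category.
low_versions = {"4.2", "4.4", "5.0"}
df["system_group"] = df["system_version"].apply(
    lambda v: "low-version target system" if v in low_versions
    else "high-version target system")
print(df)
```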
As an alternative, converting high-dimensional sparse classification variables in the aggregated portrait features into low-dimensional dense continuous variables includes:
based on the DNN model, feature embedding is carried out on the classification variable used for representing the account behavior track of the target account, so as to obtain the continuous variable used for representing the behavior mode of the account behavior track.
Optionally, in this embodiment, the Category feature is input to the DNN model, and the Embedding feature is trained.
Optionally, in this embodiment, based on the MST-CNN deep learning network, the WiFi connection track data of the account is embedded, and the WiFi behavior pattern information of the account is captured.
Optionally, in this embodiment, based on the List-Embedding manner, embedding extraction is performed on the traffic usage behavior sequences of applications of different categories under the same system for the account, for example, traffic embedding of a target type of App, so as to obtain low-dimensional dense account behavior features.
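As an illustration of feature embedding via a DNN, the following is a minimal PyTorch sketch that maps a sparse categorical variable (for example, a behavior-track token) to a low-dimensional dense vector; the vocabulary size, embedding dimension, layer sizes and labels are assumptions for the example, not the specific MST-CNN or List-Embedding networks mentioned above.

```python
import torch
import torch.nn as nn

class CategoryEmbeddingNet(nn.Module):
    """Maps a categorical id to a dense vector and predicts a label from it."""
    def __init__(self, num_categories: int = 1000, embed_dim: int = 16, num_classes: int = 3):
        super().__init__()
        self.embedding = nn.Embedding(num_categories, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 32), nn.ReLU(),
            nn.Linear(32, num_classes))

    def forward(self, category_ids: torch.Tensor) -> torch.Tensor:
        dense = self.embedding(category_ids)      # low-dimensional dense variable
        return self.mlp(dense)

model = CategoryEmbeddingNet()
ids = torch.tensor([3, 17, 42])                   # assumed encoded category ids
logits = model(ids)
# After training, rows of model.embedding.weight can be reused as dense account features.
print(logits.shape)                               # torch.Size([3, 3])
```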
As an alternative solution, constructing the service vertical feature of the target account based on the service behavior information in the account behavior information includes:
s1, acquiring media information of a specific type associated with the account academic information;
s2, acquiring behavior information of the target account, which is executed on the media information under the specific type, from the account behavior information, and constructing a service vertical feature based on the behavior information of the target account, which is executed on the media information under the specific type.
Optionally, in this embodiment, to improve the adaptation degree to the business scenario of academic recognition, media information of a specific type associated with the academic information is first obtained, then the behavior information executed by the target account on the media information of the specific type is obtained from the account behavior information, and a business vertical feature is constructed based on the behavior information executed by the target account on the media information of the specific type.
Optionally, as an alternative embodiment, as shown in fig. 9, the information identification method includes:
s902, acquiring account behavior information corresponding to each of a plurality of sample accounts;
s904, acquiring a plurality of initial sample characteristics based on account behavior information corresponding to each of a plurality of sample accounts, wherein each initial sample characteristic in the plurality of initial sample characteristics corresponds to each sample account in the plurality of sample accounts one by one;
s906, processing the plurality of initial sample features by using a metric learning algorithm to obtain a plurality of target sample features;
s908, inputting a plurality of target sample features into an initial academic recognition model for training to obtain a trained academic recognition model;
s910, identifying account number learning information associated with the account number to be identified based on the trained learning identification model.
Optionally, in this embodiment, the above information identification method may be, but is not limited to being, applied to the business scenario of recognizing the highest education level of an account. If the highest education level of the account needs to be identified for accurate product pushing, non-private basic behavior information of the account is first obtained; however, because the behavior features associated with the education level itself are complex, the non-private basic behavior information is usually not directly associated with them, so directly using the non-private basic behavior information for subsequent academic recognition may result in low recognition accuracy. In this embodiment, in order to improve the accuracy of academic recognition, diversified feature extraction and feature processing are performed on the non-private basic behavior information to obtain relevant features with a higher degree of association with the behavior features associated with the education level itself, and the relevant features are then used for subsequent academic recognition, thereby improving the accuracy of academic recognition.
Optionally, in this embodiment, the above information identification method may be, but is not limited to being, applied to other business scenarios. For example, when the method is applied to the business scenario of account gender identification, non-private basic behavior information of the account is first obtained, diversified feature extraction and feature processing are then performed on the non-private basic behavior information to obtain relevant features with a higher degree of association with the account gender identification scenario, and the relevant features are then used for subsequent gender identification (for example, the academic recognition model is adjusted into a gender recognition model, and the relevant features are input into the gender recognition model), thereby improving the accuracy of gender identification.
Optionally, in this embodiment, the account behavior information may be, but is not limited to, acquired behavior information of the account executed in a preset time range or a preset space range, including historical behavior information and service behavior information, where the historical behavior information is behavior information of the target account executed in the preset time range (such as purchasing behavior of the virtual prop by the account through the target account), the service behavior information is behavior information of the target account executed on the target object in the preset range (such as viewing behavior of the media resource by the account through the target account), and so on.
Optionally, in this embodiment, the constructing the basic portrait characteristic of the target account based on the historical behavior information in the account behavior information may be understood as, but not limited to, constructing the basic portrait characteristic of the target account based only on the historical behavior information in the account behavior information, and may also be understood as, but not limited to, constructing the basic portrait characteristic of the target account based on the historical behavior information in the account behavior information and other account information, where the other account information may include, but is not limited to, account basic information, account related information (such as related information of other accounts having a relationship with the target account), and so on;
Similarly, in this embodiment, constructing the business vertical features of the target account based on the business behavior information in the account behavior information may be understood as, but is not limited to, constructing the business vertical features of the target account based only on the business behavior information in the account behavior information, and may also be understood as, but is not limited to, constructing the business vertical features of the target account based on the business behavior information in the account behavior information together with other account information.
Optionally, in this embodiment, for aggregating features of different time spans, aggregation processing of at least two time dimensions is performed on the basic portrait feature and the service vertical feature, so as to obtain at least two aggregated portrait features of different time dimensions.
Optionally, in this embodiment, the academic recognition model is a neural network model trained by using target sample features for recognizing the account education level, where the target sample features are sample features obtained by processing initial sample features with a metric learning algorithm; the object of metric learning is usually the distance between sample feature vectors, and the purpose of metric learning may be, but is not limited to, learning through training to reduce or limit the distance between samples of the same type while increasing the distance between samples of different types;
Optionally, in this embodiment, generating the seed account portrait features includes: account basic attributes (such as gender), etc.; abnormal accounts are filtered based on the portrait, for example, accounts whose WeChat usage time exceeds 24 hours are filtered out, and the like; then, based on a clustering metric learning framework, the reordered features of the samples in the metric space are obtained, so as to improve the information expression capability of the features; then, the metric learning framework method optimized by the co-ordination matrix is used, combined with the analytic hierarchy process, to perform pooling weighting on the values of the samples in metric dimensions of different distances, and the probability that a sample belongs to the positive example is finally fitted;
further by way of example, optionally as shown in fig. 3, the specific steps are as follows:
step S302, raw data preparation: based on manual labeling and business experience, positive and negative training samples which are strongly related to the business, normally distributed in data, and reasonable in account portrait are found out;
specifically, a seed account with Label information is obtained based on manual labeling and business logic: for example, a batch of seed accounts is roughly recalled based on rules, then filtered based on manual screening, and finally verified based on business logic; a basic portrait of the seed account is obtained, where the basic portrait may include some non-private behavior data of the account in a specific App, such as whether a target application is installed, whether the harassment-interception function or the answer-assistant function of the target application is used, and the like; abnormal-account evaluation indexes are calculated: in a real business scenario, fake accounts and situations where a computer controls a mobile phone exist, and in order to eliminate the influence of non-real accounts on modeling analysis, abnormal-account detection indexes, such as the traffic usage of the account in the target application and the time distribution of the traffic, are set based on business experience; the abnormal seed accounts are filtered based on a distribution anomaly criterion, and furthermore, the outlier determination standard may be, but is not limited to, the "Laida criterion" (the 3σ rule); the filtered normal seed accounts are stored offline in the HDFS, i.e., the filtered clean data is stored in a distributed file system (The Hadoop Distributed File System, HDFS for short) so as to facilitate fast access in subsequent flows;
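A minimal sketch of outlier filtering under the 3σ ("Laida") criterion is shown below; the index name (daily traffic in MB) and all data values are placeholders, and the three-standard-deviation threshold is the standard form of this criterion.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Assumed abnormal-account index: daily traffic (MB) of seed accounts in the target App.
traffic = np.append(rng.normal(loc=120.0, scale=15.0, size=200), 5000.0)
seeds = pd.DataFrame({"account_id": np.arange(len(traffic)), "daily_traffic_mb": traffic})

mean = seeds["daily_traffic_mb"].mean()
std = seeds["daily_traffic_mb"].std()

# Laida criterion (3-sigma rule): keep samples whose index lies within mean +/- 3*std.
mask = (seeds["daily_traffic_mb"] - mean).abs() <= 3 * std
normal_seeds = seeds[mask]
print(len(seeds), "->", len(normal_seeds), "seed accounts after filtering")
```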
Step S304, offline feature processing: constructing portrait features of the training samples, and generating high-dimensional feature vectors based on the vertical characteristics of the features, combined with the time dimension and different feature processing methods;
specifically, for example, a basic portrait feature is constructed, and a rich account portrait can be constructed based on the account historical behavior data, where the account portrait includes at least one of the following: account basic attributes, device basic attributes, network connection attributes, and the like; business vertical features are constructed based on the business characteristics, where the vertical features may include, but are not limited to, the click-through rate, conversion rate, and the like of the account on a specific type of media resource; combined with the time dimension, the portrait features and business features of different time spans are aggregated, such as computing aggregated portraits of the account over the past half year, past 3 months, past 1 month and past 1 week, where the aggregation method may be any one or more of summation, median and standard deviation; feature processing is further performed in a normalized numerical mode and a discretized non-numerical mode, where the normalization method may be, but is not limited to, Gaussian normalization; furthermore, the processed features are combined and stored offline in the HDFS to facilitate fast access in subsequent flows; the feature processing logic is solidified, offline automatic calculation is performed at scheduled times, and the offline calculation results are pushed to an online storage engine;
Step S306, clustering metric learning framework: based on the training samples and feature vectors, combined with K-Means clustering and metric learning, the reordering features of accounts belonging to the positive example are obtained;
specifically, for example, the low-order and high-order feature results are read in and spliced by columns; the features are divided into a training set probFea and a test set galFea; the training set probFea is further trained by using a metric learning algorithm, and a mapping matrix W and a kernel matrix M are calculated, where the calculation may refer to formula (1), formula (2) and formula (3);
further using the mapping matrix W and the kernel matrix M to calculate the original distance dist from each sample in the test set to the test set, wherein the calculation mode can refer to the formula (4), the formula (5) and the formula (6);
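By way of illustration only, the sketch below learns a Mahalanobis-type kernel matrix M in the KISSME style (M taken as the difference of the inverse covariances of similar-pair and dissimilar-pair difference vectors) and computes distances with it; the PCA-based stand-in for the mapping matrix W, the toy data and all parameter choices are assumptions, and this is not necessarily the exact formulation of formulas (1) to (6).

```python
import numpy as np

def fit_kissme(X, labels, dim=None):
    """Learn a projection W (PCA) and a metric matrix M from labeled features."""
    Xc = X - X.mean(axis=0)
    # Mapping matrix W: principal directions (illustrative stand-in for the formula-based W).
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    W = vt[:dim].T if dim else vt.T
    Z = Xc @ W

    # Collect outer products of difference vectors for similar and dissimilar pairs.
    sim, dis = [], []
    n = len(Z)
    for i in range(n):
        for j in range(i + 1, n):
            d = Z[i] - Z[j]
            (sim if labels[i] == labels[j] else dis).append(np.outer(d, d))
    cov_s = np.mean(sim, axis=0)
    cov_d = np.mean(dis, axis=0)

    # KISSME-style kernel matrix: difference of inverse covariances.
    # Note: in practice M is often projected onto the PSD cone before use.
    M = np.linalg.pinv(cov_s) - np.linalg.pinv(cov_d)
    return W, M

def metric_dist(a, b, W, M):
    d = (a - b) @ W
    return float(d @ M @ d)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])  # toy probFea
y = np.array([0] * 20 + [1] * 20)
W, M = fit_kissme(X, y, dim=3)
print(metric_dist(X[0], X[1], W, M), metric_dist(X[0], X[25], W, M))
```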
furthermore, as shown in fig. 4, clustering is performed based on a K-Means algorithm to obtain K clustering centers, which specifically includes the following steps:
step S402, gathering samples to be clustered into 3 types;
step S404, selecting 3 center points;
step S406, finding the nearest center point for each sample, and completing one-time clustering;
step S408, judging whether the clustering conditions of the sample points before and after the primary clustering are the same, if so, stopping (step S422), and if not, continuing the next step (step S420);
Step S410, updating the center point according to the clustering result;
step S412, finding the center point closest to each sample to complete secondary clustering;
step S414, determining whether the clustering conditions of the sample points before and after the secondary clustering are the same, if so, terminating (step S422), and if not, continuing the next step (step S420);
step S416, updating the center point according to the clustering result;
step S418, finding the nearest center point for each sample, and completing multiple clustering;
step S420, judging whether the clustering conditions of the sample points before and after the multiple clustering are the same, if so, stopping (step S422), and if not, continuing the next step (step S420);
step S422, assuming that the algorithm is terminated in the previous step (step S420), presenting the final clustering result;
further calculating the distance from each sample to the clustering center, and reordering based on similarity matching; the re-reordered result is added to the new feature space, as shown in fig. 5, with the following steps:
step S502, inputting data;
step S504, extracting features;
step S506, learning training data by using the metrics to obtain W and M;
step S508, obtaining a training aggregation class center by using a K-means algorithm;
Step S510, calculating a distance based on a clustering center by using W and M;
in step S512, a reorder matching matrix is calculated.
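The following is a minimal sketch of the clustering-and-reordering step described above (steps S402–S422 and S502–S512): samples are clustered with K-Means, the distance of each sample to every cluster center is computed under the learned metric, and a reordering (ranking) matrix is derived from those distances. The use of scikit-learn's KMeans, the choice of 3 clusters and the placeholder metric are assumptions for the example; W and M would come from the metric-learning step.

```python
import numpy as np
from sklearn.cluster import KMeans

def rerank_by_cluster_centers(Z, M, n_clusters=3, random_state=0):
    """Cluster projected features, then rank cluster centers per sample by metric distance."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(Z)
    centers = kmeans.cluster_centers_

    # Distance of every sample to every cluster center under the learned metric M.
    dist = np.zeros((len(Z), n_clusters))
    for i, z in enumerate(Z):
        for k, c in enumerate(centers):
            d = z - c
            dist[i, k] = d @ M @ d

    # Reordering matrix: cluster indices sorted by similarity (smallest distance first).
    rerank = np.argsort(dist, axis=1)
    return dist, rerank

rng = np.random.default_rng(1)
Z = rng.normal(size=(40, 3))           # projected features, e.g. Z = X @ W
M_id = np.eye(3)                       # placeholder metric; replace with the learned M
dist, rerank = rerank_by_cluster_centers(Z, M_id)
# dist (or rerank) columns can be appended to the feature space as new features.
print(dist.shape, rerank[:3])
```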
Step S308, co-ordination matrix framework: a co-ordination matrix is obtained by using K-Means clustering and is used in the optimization of metric learning;
specifically, for example, the low-order and high-order feature results are read in and spliced by columns (including the reordered results); M base clustering results are obtained by using K-Means, and a co-ordination matrix A is calculated; the distance measure of the K-Means algorithm is then replaced with the Mahalanobis distance, M clustering results are obtained again, and a co-ordination matrix B is calculated, where the Mahalanobis distance between d-dimensional samples x and y is defined as follows:
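$$d_M(x, y) = \sqrt{(x - y)^{\top} M\,(x - y)}$$

where M is a positive semi-definite matrix (this is the standard form of the Mahalanobis distance; in the classical definition M is the inverse of the sample covariance matrix, and here the learned metric matrix can be used in its place).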
further, a metric learning algorithm based on the co-ordination matrix is used to calculate a distance matrix A and a threshold e, where the calculation may refer to formula (7);
and a final clustering result is obtained by using an analytic hierarchy process, wherein a calculation flow is shown in fig. 6, and the specific steps are as follows:
step S602, constructing a judgment matrix;
step S604, calculating a single-layer weight subset;
step S606, checking the consistency of the single layer, executing step S608 when the single layer passes, and executing step S602 when the single layer does not pass;
Step S608, calculating a single-layer weight subset;
step S610, checking the consistency of the single layer, executing step S612 when passing, and executing step S602 when not passing;
step S612, obtaining the index weight.
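As a hedged illustration of the analytic hierarchy process referenced in steps S602–S612, the sketch below computes single-layer weights from a pairwise judgment matrix via its principal eigenvector and performs the standard consistency check (CR < 0.1); the judgment matrix values are placeholders.

```python
import numpy as np

# Random consistency indices for matrices of order 1..9 (standard AHP table).
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights(judgment: np.ndarray):
    """Return the weight vector and consistency ratio of a pairwise judgment matrix."""
    n = judgment.shape[0]
    eigvals, eigvecs = np.linalg.eig(judgment)
    k = np.argmax(eigvals.real)
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                                  # single-layer weight vector
    lam_max = eigvals[k].real
    ci = (lam_max - n) / (n - 1)                     # consistency index
    cr = ci / RI[n]                                  # consistency ratio
    return w, cr

# Assumed 3x3 judgment matrix comparing three distance-metric dimensions.
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])
w, cr = ahp_weights(A)
print(w, "consistent" if cr < 0.1 else "revise judgment matrix")
```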
The original distance and the Mahalanobis distance of the samples are further calculated, an average pooling operation is performed, the result is normalized, and the normalized result is regarded approximately as the probability of belonging to the positive example. For example, as shown in fig. 7, assume the input data set is X = {x1, x2, ..., xn}, the number of classes in the clustering result is K, and the maximum number of iterations is I; the output is the clustering result π. The specific steps are as follows:
step S702, obtaining M base clustering results by using a k-means algorithm, and calculating to obtain a co-ordination matrix;
step S704, a distance matrix A and a threshold c are obtained by using a metric learning algorithm based on the co-ordination matrix;
step S706, setting the corresponding element of the S matrix as 1 and the corresponding element of the D matrix as 1 when the corresponding element is greater than or equal to c in the co-ordination matrix;
step S708, replacing the k-means algorithm distance measure with the Mahalanobis distance to obtain M base clustering results again, and calculating a co-ordination matrix;
step S710, obtaining a distance matrix A and a threshold e by using a metric learning algorithm based on the co-ordination matrix;
Step S712, setting the corresponding element of the S matrix to 1 and the corresponding element of the D matrix to 1 when the corresponding element in the co-ordination matrix is greater than or equal to e;
step S714, judging whether the S matrix and the D matrix have changed; if yes, go to step S716, and if not, go to step S718;
step S716, judging whether the number of iterations is greater than i; if yes, jump to step S718, and if not, jump to step S708;
step S718, obtaining a final clustering result π by using a hierarchical clustering algorithm.
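As a hedged illustration of the co-ordination matrix (a co-association matrix in ensemble-clustering terms) built from M base clusterings, the sketch below runs K-Means M times and records, for every pair of samples, the fraction of runs in which they fall into the same cluster; the data and the choices of M and K are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def co_association_matrix(X, m_runs=10, n_clusters=3):
    """Fraction of base clusterings in which each pair of samples shares a cluster."""
    n = len(X)
    co = np.zeros((n, n))
    for run in range(m_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=5, random_state=run).fit_predict(X)
        co += (labels[:, None] == labels[None, :]).astype(float)
    return co / m_runs

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (15, 4)), rng.normal(4, 1, (15, 4))])
A = co_association_matrix(X)
# Thresholding A (e.g. A >= some threshold c) is one way to obtain S/D-style indicator
# matrices as in the steps above.
print(A.shape, A[:2, :2])
```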
Optionally, in the present embodiment, the account academic information may refer to, but is not limited to, the current highest education level of the target account, such as below a bachelor's degree, a bachelor's degree, above a bachelor's degree, etc.; the account academic information may also be, but is not limited to, probability information of the education level of the target account, such as a 30% probability that the highest education level of the target account is below a bachelor's degree, a 60% probability that it is a bachelor's degree, and a 10% probability that it is above a bachelor's degree.
It should be noted that processing the initial sample features with a metric learning algorithm expands and extends sample data that carries a small amount of information, and improves the training quality of the information recognition model on the basis of limited training resources, thereby improving the information recognition accuracy.
Further by way of example, an alternative is shown in fig. 10, with the following specific steps:
step S1002, acquiring a seed account with Label information based on manual labeling and business logic;
step S1004, obtaining a basic portrait of the seed account, where, considering that the highest education-level label of an account is stable over the long term, the features of the account in a plurality of time periods need to be calculated, and feature compression is then performed by using sum pooling;
step S1006, calculating abnormal account type evaluation indexes, such as flow use condition of the account in the target product, time distribution generated by the flow, and the like;
step S1008, filtering the abnormal seed accounts based on a distribution anomaly criterion, for example using the "Laida criterion" (the 3σ rule) as the outlier determination standard;
step S1010, judging whether the magnitude of the seed accounts meets the standard, for example, whether the minimum magnitude of positive and negative samples reaches 100,000; if not, step S1002 is executed, and if yes, step S1012 is executed;
step S1012, account number features such as basic portrait features, business vertical type features and the like are constructed;
step S1014, constructing an aggregation feature by combining the time dimension;
step S1016, adopting a normalized numerical mode and a discretized non-numerical mode to perform characteristic processing;
Step S1018, inputting the Category features into a DNN model, and training the Embedding features;
step S1020, combining the processed features and storing the combined features in an HDFS offline;
step S1022, solidifying feature processing logic, timing offline automatic calculation, and pushing offline calculation results to an online storage engine;
step S1024, reading in low-order and high-order characteristic results, and splicing according to columns;
step S1026, dividing the features into a training set probFea and a test set galFea;
step S1028, training the training set probFea by using a metric learning algorithm, and calculating a mapping matrix W and a kernel matrix M;
step S1030, calculating the original distance dist from each sample in the test set to the test set by using the mapping matrix W and the kernel matrix M;
step S1032, clustering based on a K-Means algorithm to obtain K clustering centers;
step S1034, calculating the distance from each sample to the clustering center, and reordering based on similarity matching;
step S1036, adding the reordered result to the new feature space;
step S1038, reading in low-order and high-order characteristic results, and performing column-wise splicing (including reordering results);
step S1040, obtaining M base clustering results by using K-Means, and calculating a co-ordination matrix A;
Step S1042, replacing the K-Means algorithm distance measurement with the Mahalanobis distance, obtaining M clustering results again, and calculating a co-ordination matrix B;
step S1044, calculating a distance matrix A and a threshold e by using a metric learning algorithm based on the co-ordination matrix;
step S1046, obtaining a final clustering result by using an analytic hierarchy process;
step S1048, calculating the original distance and the Mahalanobis distance of the sample, carrying out an average pooling operation and normalizing the result, and regarding the normalized result as the probability of belonging to the positive example.
According to the embodiment provided by the application, account behavior information corresponding to each of a plurality of sample accounts is obtained; a plurality of initial sample features are acquired based on the account behavior information corresponding to each of the plurality of sample accounts, wherein each of the plurality of initial sample features corresponds one-to-one to each of the plurality of sample accounts; the plurality of initial sample features are processed by using a metric learning algorithm to obtain a plurality of target sample features; the plurality of target sample features are input into an initial academic recognition model for training to obtain a trained academic recognition model; and the account academic information associated with the account to be recognized is recognized based on the trained academic recognition model. Sample data carrying a small amount of information is expanded and extended by processing the initial sample features with a metric learning algorithm, so that the technical purpose of improving the training quality of the information recognition model on the basis of limited training resources is achieved, thereby achieving the technical effect of improving the information recognition accuracy.
As an alternative, the processing the plurality of initial sample features by using a metric learning algorithm to obtain a plurality of target sample features includes:
s1, dividing a plurality of initial sample characteristics into a training set and a testing set;
s2, training the initial sample features in the training set by using a metric learning algorithm to obtain a mapping matrix W and a kernel matrix M;
s3, calculating the original distance of each initial sample feature in the training set based on the mapping matrix W and the kernel matrix M;
s4, clustering initial sample features in a training set by using an original distance to obtain K clustering centers, wherein K is a natural number;
s5, calculating a first distance from each initial sample feature in the training set to the cluster centers, and sorting the initial sample features in the training set based on the feature similarity corresponding to the first distance to obtain a first sorting result;
s6, taking the initial sample characteristics in the training set as target sample characteristics, and adding the target sample characteristics into a target characteristic space according to a first sequencing result.
It should be noted that, dividing the plurality of initial sample features into a training set and a testing set; training the initial sample characteristics in the training set by using a metric learning algorithm to obtain a mapping matrix W and a kernel matrix M; calculating the original distance of each initial sample feature in the training set based on the mapping matrix W and the kernel matrix M; clustering initial sample features in a training set by using an original distance to obtain K clustering centers, wherein K is a natural number; calculating a first distance from each initial sample feature in the training set to the clustering center, and sorting the initial sample features in the training set based on feature similarity corresponding to the first distance to obtain a first sorting result; and taking the initial sample characteristics in the training set as target sample characteristics, and adding the target sample characteristics into a target characteristic space according to a first sequencing result.
Further by way of example, optionally, the low-order and high-order feature results are read in and spliced by columns; the features are divided into a training set probFea and a test set galFea; the training set probFea is further trained by using a metric learning algorithm to calculate a mapping matrix W and a kernel matrix M, where the calculation may refer to the above formula (1), formula (2) and formula (3);
further, by using the mapping matrix W and the kernel matrix M, the distance dist from each sample in the test set to the test set is calculated, where the calculation may refer to the above formula (4), formula (5) and formula (6);
furthermore, as shown in fig. 4, clustering is performed based on a K-Means algorithm to obtain K clustering centers, which specifically includes the following steps:
step S402, gathering samples to be clustered into 3 types;
step S404, selecting 3 center points;
step S406, finding the nearest center point for each sample, and completing one-time clustering;
step S408, judging whether the clustering conditions of the sample points before and after the primary clustering are the same, if so, stopping (step S422), and if not, continuing the next step (step S420);
step S410, updating the center point according to the clustering result;
step S412, finding the center point closest to each sample to complete secondary clustering;
Step S414, determining whether the clustering conditions of the sample points before and after the secondary clustering are the same, if so, terminating (step S422), and if not, continuing the next step (step S420);
step S416, updating the center point according to the clustering result;
step S418, finding the nearest center point for each sample, and completing multiple clustering;
step S420, judging whether the clustering conditions of the sample points before and after the multiple clustering are the same, if so, stopping (step S422), and if not, continuing the next step (step S420);
step S422, assuming that the algorithm is terminated in the previous step (step S420), presenting the final clustering result;
further calculating the distance from each sample to the clustering center, and reordering based on similarity matching; the reordered result is then added to the new feature space, for example as shown in fig. 5, with the following steps:
step S502, inputting data;
step S504, extracting features;
step S506, learning training data by using the metrics to obtain W and M;
step S508, obtaining a training aggregation class center by using a K-means algorithm;
step S510, calculating a distance based on a clustering center by using W and M;
in step S512, a reorder matching matrix is calculated.
As an alternative, inputting a plurality of target sample features into an initial academic recognition model for training, to obtain a trained academic recognition model, including:
S1, inputting a plurality of target sample features into an initial academic recognition model for training until reaching a training convergence condition:
s2, acquiring a current academic recognition model, and inputting a target feature space into the current academic recognition model;
s3, clustering each target sample feature in the target feature space by adopting a first distance unit to obtain M first clustering results, wherein M is a natural number; clustering each target sample feature in the target feature space by adopting a second distance unit to obtain M second clustering results;
s4, calculating the M first clustering results to obtain a first co-ordination matrix; calculating the M second clustering results to obtain a second co-ordination matrix;
s5, obtaining a target clustering result of each target sample feature in the target feature space according to the first co-ordination matrix and the second co-ordination matrix, wherein the target clustering result is used for indicating the probability that the target sample feature belongs to the feature corresponding to the target education level;
and S6, determining the current academic recognition model as a trained academic recognition model under the condition that the target clustering result reaches the training convergence condition.
Alternatively, in the present embodiment, the first distance unit may be, but is not limited to, a distance unit that is an original distance, and the second distance unit may be, but is not limited to, a distance unit that is a mahalanobis distance;
further by way of example, optionally reading in the low-order and high-order feature results and performing column-wise splicing (including the reordered results); M base clustering results are obtained by using K-Means, a co-ordination matrix A is calculated, the K-Means algorithm distance measure is then replaced with the Mahalanobis distance, M clustering results are obtained again, and a co-ordination matrix B is calculated.
As an optional solution, obtaining the target clustering result of each target sample feature in the target feature space according to the first co-ordination matrix and the second co-ordination matrix includes:
s1, judging whether elements in a first distance matrix corresponding to the first co-ordination matrix need to be adjusted by using a first threshold corresponding to the first co-ordination matrix; and judging whether elements in a second distance matrix corresponding to the second co-ordination matrix need to be adjusted by using a second threshold corresponding to the second co-ordination matrix;
s2, judging whether the elements in the second distance matrix need to be adjusted by utilizing the second threshold value when the elements in the first distance matrix and/or the elements in the second distance matrix need to be adjusted and the current iteration number is smaller than or equal to the target threshold value until the elements in the first distance matrix and/or the elements in the second distance matrix do not need to be adjusted or the current iteration number is larger than the target threshold value;
S3, under the condition that the elements in the first distance matrix and the elements in the second distance matrix do not need to be adjusted, a hierarchical clustering algorithm is utilized to obtain a target clustering result.
It should be noted that the first threshold corresponding to the first co-ordination matrix is used to determine whether the elements in the first distance matrix corresponding to the first co-ordination matrix need to be adjusted; the second threshold corresponding to the second co-ordination matrix is used to determine whether the elements in the second distance matrix corresponding to the second co-ordination matrix need to be adjusted; when the elements in the first distance matrix and/or the elements in the second distance matrix need to be adjusted and the current number of iterations is less than or equal to the target threshold, whether the elements in the second distance matrix need to be adjusted continues to be judged by using the second threshold, until the elements in the first distance matrix and/or the elements in the second distance matrix no longer need to be adjusted or the current number of iterations is greater than the target threshold; and when the elements in the first distance matrix and the elements in the second distance matrix do not need to be adjusted, the target clustering result is obtained by using a hierarchical clustering algorithm.
Further by way of example, the original distance and the Mahalanobis distance of the samples are calculated, an average pooling operation is performed, the result is normalized, and the normalized result is regarded approximately as the probability of belonging to the positive example. For example, as shown in fig. 7, assume the input data set is X = {x1, x2, ..., xn}, the number of classes in the clustering result is K, and the maximum number of iterations is I; the output is the clustering result π. The specific steps are as follows:
Step S702, obtaining M base clustering results by using a k-means algorithm, and calculating to obtain a co-ordination matrix;
step S704, a distance matrix A and a threshold c are obtained by using a metric learning algorithm based on the co-ordination matrix;
step S706, setting the corresponding element of the S matrix as 1 and the corresponding element of the D matrix as 1 when the corresponding element is greater than or equal to c in the co-ordination matrix;
step S708, replacing the k-means algorithm distance measure with the Mahalanobis distance to obtain M base clustering results again, and calculating a co-ordination matrix;
step S710, obtaining a distance matrix A and a threshold e by using a metric learning algorithm based on the co-ordination matrix;
step S712, setting the corresponding element of the S matrix to 1 and the corresponding element of the D matrix to 1 when the corresponding element in the co-ordination matrix is greater than or equal to e;
step S714, judging whether the S matrix and the D matrix have changed; if yes, go to step S716, and if not, go to step S718;
step S716, judging whether the number of iterations is greater than i; if yes, jump to step S718, and if not, jump to step S708;
step S718, obtaining the final clustering result π by using a hierarchical clustering algorithm.
As an alternative, the method further comprises:
s1, acquiring account behavior information corresponding to each of a plurality of first sample accounts;
s2, acquiring a plurality of first sample characteristics based on account behavior information corresponding to each of a plurality of first sample accounts, wherein each first sample characteristic in the plurality of first sample characteristics corresponds to each sample account in the plurality of first sample accounts one by one;
s3, processing the plurality of first sample features by using a metric learning algorithm to obtain a plurality of second sample features;
s4, inputting a plurality of second sample features into an initial gender identification model for training to obtain a trained gender identification model;
s5, identifying the account gender information associated with the account to be identified based on the trained gender identification model.
Optionally, in this embodiment, the academic recognition model has strong reusability: first, the account type of the positive samples is changed, for example to predict an account gender label; then the server accumulates the corresponding log data; and finally the result is produced by using the same feature splicing, feature processing and model training methods, with account gender information taking the place of account academic information as the output.
According to the embodiment provided by the application, account behavior information corresponding to each of a plurality of sample accounts is obtained; acquiring a plurality of first sample characteristics based on account behavior information corresponding to each of a plurality of first sample accounts, wherein each first sample characteristic in the plurality of first sample characteristics corresponds to each sample account in the plurality of first sample accounts one by one; processing the first sample features by using a metric learning algorithm to obtain second sample features; inputting a plurality of second sample features into an initial gender identification model for training to obtain a trained gender identification model; the method and the device have the advantages that the account sex information related to the account to be identified is identified based on the trained sex identification model, so that the aim of identifying the identification information of the corresponding scene by adjusting samples input into other scenes is fulfilled, and the technical effect of improving the reusability of information identification is achieved.
It will be appreciated that in the specific embodiments of the present application, related data such as account information is referred to, and when the embodiments of the present application are applied to specific products or technologies, account permission or consent is required to be obtained, and the collection, use and processing of related data is required to comply with related laws and regulations and standards of related countries and regions.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
According to another aspect of the embodiments of the present application, there is also provided an information identifying apparatus for implementing the above information identifying method. As shown in fig. 11, the apparatus includes:
the first obtaining unit 1102 is configured to obtain account behavior information corresponding to a target account to be identified;
a construction unit 1104, configured to construct a basic portrait characteristic of the target account based on historical behavior information in the account behavior information, and construct a business vertical characteristic of the target account based on business behavior information in the account behavior information, where the historical behavior information is behavior information executed by the target account in a preset time range, and the business behavior information is behavior information executed by the target account on a target object in the preset range;
An aggregation unit 1106, configured to perform aggregation processing of at least two time dimensions on the basic portrait features and the service vertical features, so as to obtain at least two aggregated portrait features with different time dimensions;
a first input unit 1108, configured to input, when target features are acquired based on the aggregated portrait features, the target features into an academic recognition model, where the academic recognition model is a neural network model that is trained by using target sample features and is used for recognizing the account education level, and the target sample features are sample features obtained by processing initial sample features by using a metric learning algorithm;
the second obtaining unit 1110 is configured to obtain an output result of the learning identification model, where the output result includes account learning information associated with the target account.
Optionally, in this embodiment, the information identifying apparatus may be, but is not limited to being, applied to the business scenario of recognizing the highest education level of an account. If the highest education level of the account needs to be identified for accurate product pushing, non-private basic behavior information of the account is first obtained; however, because the behavior features associated with the education level itself are complex, the non-private basic behavior information is usually not directly associated with them, so directly using the non-private basic behavior information for subsequent academic recognition may result in low recognition accuracy. In this embodiment, in order to improve the accuracy of academic recognition, diversified feature extraction and feature processing are performed on the non-private basic behavior information to obtain relevant features with a higher degree of association with the behavior features associated with the education level itself, and the relevant features are then used for subsequent academic recognition, thereby improving the accuracy of academic recognition.
Optionally, in this embodiment, the information identifying apparatus may be, but is not limited to being, applied to other business scenarios. For example, when the apparatus is applied to the business scenario of account gender identification, non-private basic behavior information of the account is first obtained, diversified feature extraction and feature processing are then performed on the non-private basic behavior information to obtain relevant features with a higher degree of association with the account gender identification scenario, and the relevant features are then used for subsequent gender identification (for example, the academic recognition model is adjusted into a gender recognition model, and the relevant features are input into the gender recognition model), thereby improving the accuracy of gender identification.
Optionally, in this embodiment, the account behavior information may be, but is not limited to, acquired behavior information of the account executed in a preset time range or a preset space range, including historical behavior information and service behavior information, where the historical behavior information is behavior information of the target account executed in the preset time range (such as purchasing behavior of the virtual prop by the account through the target account), the service behavior information is behavior information of the target account executed on the target object in the preset range (such as viewing behavior of the media resource by the account through the target account), and so on.
Optionally, in this embodiment, the constructing the basic portrait characteristic of the target account based on the historical behavior information in the account behavior information may be understood as, but not limited to, constructing the basic portrait characteristic of the target account based only on the historical behavior information in the account behavior information, and may also be understood as, but not limited to, constructing the basic portrait characteristic of the target account based on the historical behavior information in the account behavior information and other account information, where the other account information may include, but is not limited to, account basic information, account related information (such as related information of other accounts having a relationship with the target account), and so on;
similarly, in this embodiment, constructing the business vertical features of the target account based on the business behavior information in the account behavior information may be understood as, but is not limited to, constructing the business vertical features of the target account based only on the business behavior information in the account behavior information, and may also be understood as, but is not limited to, constructing the business vertical features of the target account based on the business behavior information in the account behavior information together with other account information.
Optionally, in this embodiment, for aggregating features of different time spans, aggregation processing of at least two time dimensions is performed on the basic portrait feature and the service vertical feature, so as to obtain at least two aggregated portrait features of different time dimensions.
Optionally, in this embodiment, the academic recognition model is a neural network model trained by using target sample features for recognizing the account education level, where the target sample features are sample features obtained by processing initial sample features with a metric learning algorithm; the object of metric learning is usually the distance between sample feature vectors, and the purpose of metric learning may be, but is not limited to, learning through training to reduce or limit the distance between samples of the same type while increasing the distance between samples of different types;
optionally, in this embodiment, generating the seed account portrait features includes: account basic attributes (such as gender), etc.; abnormal accounts are filtered based on the portrait, for example, accounts whose WeChat usage time exceeds 24 hours are filtered out, and the like; then, based on a clustering metric learning framework, the reordered features of the samples in the metric space are obtained, so as to improve the information expression capability of the features; then, the metric learning framework method optimized by the co-ordination matrix is used, combined with the analytic hierarchy process, to perform pooling weighting on the values of the samples in metric dimensions of different distances, and the probability that a sample belongs to the positive example is finally fitted.
Alternatively, in the present embodiment, the account academic information may refer to, but is not limited to, the current highest academic level of the target account (for example, below a given level, at that level, or above it); the account academic information may also be, but is not limited to, probability information about the academic level of the target account, such as respective probabilities that the highest level of the target account falls below, at, or above a given level (for example 30%, 60%, and 10%).
It should be noted that the account behavior information is specifically divided into two types: historical behavior information, which is comprehensive but not vertical (business-specific) enough, and business behavior information, which is less comprehensive but more vertical. The features of the two are aggregated along the time dimension, so that directional features belonging to the specific business scenario of academic recognition are constructed and the gap of reference information in the academic recognition scenario is filled; the initial sample features are processed with a metric learning algorithm, and the directional features are processed with the information recognition model trained with the resulting sample features, thereby improving information recognition accuracy.
Specific embodiments may refer to the examples shown in the above information identifying apparatus, and in this example, details are not described herein.
According to the embodiments provided by the application, account behavior information corresponding to a target account to be identified is obtained; basic portrait features of the target account are constructed based on the historical behavior information in the account behavior information, and business vertical features of the target account are constructed based on the business behavior information in the account behavior information, where the historical behavior information is behavior information executed by the target account within a preset time range, and the business behavior information is behavior information executed by the target account on a target object within the preset range; aggregation processing in at least two time dimensions is performed on the basic portrait features and the business vertical features to obtain at least two aggregated portrait features in different time dimensions; when target features are acquired based on the aggregated portrait features, the target features are input into the academic recognition model, where the academic recognition model is a neural network model trained with target sample features for recognizing the account academic level, and the target sample features are sample features obtained by processing initial sample features with a metric learning algorithm; and an output result of the academic recognition model is obtained, where the output result includes the account academic information associated with the target account. In this way, the account behavior information is divided into two types, namely historical behavior information that is comprehensive but not vertical enough and business behavior information that is less comprehensive but more vertical, and the features of the two are aggregated along the time dimension; directional features belonging to the specific business scenario of academic recognition are thereby constructed, filling the gap of reference information in the academic recognition scenario; the initial sample features are further processed with a metric learning algorithm, and the directional features are processed with the recognition model trained with the resulting sample features, thereby achieving the technical effect of improving information recognition accuracy.
As an alternative, the aggregation unit 1106 includes:
a first acquisition module, configured to acquire the basic portrait features in a first time period and the business vertical features in the first time period, where the preset time range includes the first time period, the basic portrait features in the first time period are constructed based on behavior information executed by the target account in the first time period, and the business vertical features in the first time period are constructed based on behavior information executed by the target account on the target object within the preset range in the first time period; and a first aggregation module, configured to aggregate the basic portrait features in the first time period and the business vertical features in the first time period to obtain an aggregated portrait feature in a first time dimension, where the aggregated portrait features in at least two different time dimensions include the aggregated portrait feature in the first time dimension;
a second acquisition module, configured to acquire the basic portrait features in a second time period and the business vertical features in the second time period, where the preset time range includes the second time period, the basic portrait features in the second time period are constructed based on behavior information executed by the target account in the second time period, and the business vertical features in the second time period are constructed based on behavior information executed by the target account on the target object within the preset range in the second time period; and a second aggregation module, configured to aggregate the basic portrait features in the second time period and the business vertical features in the second time period to obtain an aggregated portrait feature in a second time dimension, where the aggregated portrait features in at least two different time dimensions include the aggregated portrait feature in the second time dimension.
Specific embodiments may refer to examples shown in the above information identification method, and this example is not described herein.
As an alternative, the first aggregation module includes:
an aggregation sub-module, configured to aggregate the basic portrait features in the first time period and the business vertical features in the first time period through an aggregation function, where the aggregation mode corresponding to the aggregation function includes at least one of the following: summation, median, and standard deviation.
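As a hedged illustration of the aggregation function described above (not the exact implementation of this apparatus), the following Python sketch aggregates hypothetical basic portrait and business vertical signals over two assumed time windows using sum, median, and standard deviation.

```python
# Illustrative only: aggregating hypothetical behavior signals over two assumed
# time windows with sum / median / standard deviation. Field names and window
# lengths are assumptions for demonstration.
import pandas as pd

events = pd.DataFrame({
    "account_id": [1, 1, 1, 1],
    "days_ago": [2, 5, 20, 40],
    "purchase_amount": [10.0, 4.0, 7.0, 2.0],
    "media_view_seconds": [120, 30, 300, 60],
})

def aggregate_window(df: pd.DataFrame, window_days: int) -> pd.DataFrame:
    recent = df[df["days_ago"] <= window_days]
    agg = recent.groupby("account_id")[["purchase_amount", "media_view_seconds"]].agg(
        ["sum", "median", "std"])
    agg.columns = [f"{col}_{stat}_{window_days}d" for col, stat in agg.columns]
    return agg

features_7d = aggregate_window(events, 7)     # first time dimension
features_30d = aggregate_window(events, 30)   # second time dimension
aggregated = features_7d.join(features_30d, how="outer")
```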
Specific embodiments may refer to examples shown in the above information identification method, and this example is not described herein.
As an alternative, the apparatus comprises at least one of:
a first processing unit, configured to perform normalized numeric feature processing on the aggregated portrait features before the target features are input into the academic recognition model;
and a second processing unit, configured to perform discretized non-numeric feature processing on the aggregated portrait features before the target features are input into the academic recognition model.
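The normalized numeric-feature processing performed by the first processing unit could, for example, be a z-score scaling; the sketch below assumes that choice, which this embodiment does not prescribe.

```python
# A minimal sketch of normalized numeric-feature processing, assuming z-score
# scaling (this embodiment does not fix the normalization formula).
import numpy as np

def normalize_numeric(x: np.ndarray) -> np.ndarray:
    """Z-score normalize one numeric feature column; a constant column maps to zeros."""
    std = x.std()
    return np.zeros_like(x, dtype=float) if std == 0 else (x - x.mean()) / std

ages = np.array([18.0, 22.0, 35.0, 41.0])
ages_norm = normalize_numeric(ages)           # mean 0, unit variance
```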
Specific embodiments may refer to examples shown in the above information identification method, and this example is not described herein.
As an alternative, the second processing unit includes at least one of:
a first processing module, configured to perform feature digitizing on the features belonging to categorical values in the aggregated portrait features;
a second processing module, configured to replace the category of a categorical feature in the aggregated portrait features with the occurrence frequency of that category;
a third processing module, configured to convert the high-dimensional sparse categorical variables in the aggregated portrait features into low-dimensional dense continuous variables;
a fourth processing module, configured to, for missing values of a continuous feature in the aggregated portrait features, select the mean of the values of that continuous feature in the aggregated portrait features and fill the missing values with it, or select the median of the values of that continuous feature and fill the missing values with it;
a fifth processing module, configured to, for missing values of a discrete feature in the aggregated portrait features, select the most frequently occurring value of that discrete feature in the aggregated portrait features and fill the missing values with it;
and a sixth processing module, configured to merge multiple values under variables belonging to the same category in the aggregated portrait features into the same piece of information.
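A brief sketch of several of the treatments listed above (feature digitizing, frequency encoding, and mean/mode filling of missing values) is given below; the column names and data are hypothetical and the exact rules used by the second processing unit may differ.

```python
# Hedged sketch of several non-numeric treatments: feature digitizing,
# frequency encoding, and mean/mode filling. Columns are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "city": ["a", "b", "a", None, "a"],
    "clicks": [3.0, None, 5.0, 7.0, None],
})

df["city_code"] = df["city"].astype("category").cat.codes        # feature digitizing
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)                           # frequency encoding
df["clicks_filled"] = df["clicks"].fillna(df["clicks"].mean())   # fill continuous missing values
df["city_filled"] = df["city"].fillna(df["city"].mode().iloc[0]) # fill discrete missing values
```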
Specific embodiments may refer to examples shown in the above information identification method, and this example is not described herein.
As an alternative, the third processing module includes:
and a processing sub-module, configured to perform, based on a DNN model, feature embedding on the categorical variable representing the account behavior track of the target account, so as to obtain a continuous variable representing the behavior pattern of the account behavior track.
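The following PyTorch sketch illustrates one plausible form of the feature embedding performed by the processing sub-module: a sparse behavior-track category id is mapped to a dense continuous vector. The vocabulary size, embedding width, and layer sizes are assumptions, not parameters of the DNN model in this embodiment.

```python
# Illustrative sketch: embedding a sparse behavior-track category into a dense
# continuous vector; vocabulary size, embedding width, and layer sizes are assumed.
import torch
import torch.nn as nn

class TrackEmbedder(nn.Module):
    def __init__(self, num_track_ids: int = 10000, embed_dim: int = 16):
        super().__init__()
        self.embedding = nn.Embedding(num_track_ids, embed_dim)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, 32), nn.ReLU(), nn.Linear(32, 8))

    def forward(self, track_ids: torch.Tensor) -> torch.Tensor:
        # track_ids: (batch,) integer ids of behavior-track categories
        return self.mlp(self.embedding(track_ids))

model = TrackEmbedder()
dense = model(torch.tensor([5, 42, 9871]))   # -> (3, 8) continuous variables
```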
Specific embodiments may refer to examples shown in the above information identification method, and this example is not described herein.
As an alternative, the construction unit 1104 includes:
a third acquisition module, configured to acquire media information of a specific type associated with the account academic information;
and a construction module, configured to acquire, from the account behavior information, the behavior information executed by the target account on the media information of the specific type, and to construct the business vertical features based on the behavior information executed by the target account on the media information of the specific type.
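As an assumed example of the construction module's behavior, the sketch below keeps only viewing behavior on media tagged with an education-related type and aggregates it into a vertical feature; the type tag and field names are illustrative.

```python
# Assumed example: a business vertical feature built from viewing behavior on
# media of an education-related type; type tag and field names are illustrative.
import pandas as pd

behavior = pd.DataFrame({
    "account_id": [1, 1, 1],
    "media_type": ["exam_prep", "entertainment", "exam_prep"],
    "view_seconds": [600, 120, 300],
})

relevant = behavior[behavior["media_type"] == "exam_prep"]
vertical = relevant.groupby("account_id")["view_seconds"].agg(["sum", "count"])
vertical.columns = ["exam_prep_view_seconds", "exam_prep_view_count"]
```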
Specific embodiments may refer to examples shown in the above information identification method, and this example is not described herein.
According to another aspect of the embodiments of the present application, there is also provided an information identifying apparatus for implementing the above information identifying method. As shown in fig. 12, the apparatus includes:
a third obtaining unit 1202, configured to obtain account behavior information corresponding to each of the plurality of sample accounts;
A fourth obtaining unit 1204, configured to obtain a plurality of initial sample features based on account behavior information corresponding to each of the plurality of sample accounts, where each of the plurality of initial sample features corresponds to each of the plurality of sample accounts one-to-one;
a third processing unit 1206, configured to process the plurality of initial sample features by using a metric learning algorithm to obtain a plurality of target sample features;
a second input unit 1208, configured to input the plurality of target sample features into an initial academic recognition model for training, so as to obtain a trained academic recognition model;
and a first identification unit 1210, configured to identify the account academic information associated with an account to be identified based on the trained academic recognition model.
Optionally, in this embodiment, the above information identification method may be, but is not limited to being, applied to a business scenario of identifying the highest academic level of an account. If the highest academic level of an account needs to be identified for accurate product pushing, the non-private basic behavior information of the account is first obtained; however, because the behavior features associated with academic level are complex, that non-private basic behavior information is usually not directly associated with them, so using it directly for subsequent academic recognition may result in low recognition accuracy. In this embodiment, to improve the accuracy of academic recognition, diversified feature extraction and feature processing are performed on the non-private basic behavior information to obtain relevant features that are more strongly associated with the behavior features of the academic level, and those relevant features are then used for subsequent academic recognition, thereby improving the accuracy of academic recognition.
Optionally, in this embodiment, the above information identification method may also be applied to other business scenarios, for example a scenario of account gender identification: the non-private basic behavior information of the account is first obtained, diversified feature extraction and feature processing are then performed on it to obtain relevant features that are strongly associated with the gender identification scenario, and those relevant features are then used for subsequent gender identification (for example, the academic recognition model is adjusted to a gender recognition model, and the relevant features are input into the gender recognition model), thereby improving the accuracy of gender identification.
Optionally, in this embodiment, the account behavior information may be, but is not limited to, behavior information of the account acquired within a preset time range or a preset space range, and it includes historical behavior information and business behavior information, where the historical behavior information is behavior information executed by the target account within the preset time range (such as purchases of virtual props made through the target account), and the business behavior information is behavior information executed by the target account on a target object within the preset range (such as viewing of media resources through the target account), and so on.
Optionally, in this embodiment, constructing the basic portrait features of the target account based on the historical behavior information in the account behavior information may be understood as, but is not limited to, constructing the basic portrait features based only on the historical behavior information in the account behavior information, or constructing them based on the historical behavior information together with other account information, where the other account information may include, but is not limited to, basic account information, account association information (such as information about other accounts that have a relationship with the target account), and so on;
similarly, in this embodiment, constructing the business vertical features of the target account based on the business behavior information in the account behavior information may be understood as, but is not limited to, constructing the business vertical features based only on the business behavior information in the account behavior information, or constructing them based on the business behavior information together with the other account information.
Optionally, in this embodiment, in order to aggregate features over different time spans, aggregation processing in at least two time dimensions is performed on the basic portrait features and the business vertical features, so as to obtain at least two aggregated portrait features in different time dimensions.
Optionally, in this embodiment, the academic recognition model is a neural network model trained with target sample features for recognizing the account academic level, where the target sample features are sample features obtained by processing initial sample features with a metric learning algorithm; the object of metric learning is usually the distance between sample feature vectors, and its purpose may be, but is not limited to, learning to reduce or bound the distance between samples of the same class while increasing the distance between samples of different classes;
optionally, in this embodiment, generating the seed account portrait features includes: basic account attributes (such as gender), etc.; filtering abnormal accounts based on the portrait, for example filtering out accounts whose reported WeChat usage time exceeds 24 hours a day; then, based on a clustered metric learning framework, obtaining re-ranked features of the samples in the metric space so as to improve the information expression capability of the features; and then, using the metric learning framework optimized by the co-association matrix in combination with an analytic hierarchy process, pooling and weighting the values of the samples in different distance-metric dimensions, and finally fitting the probability that a sample belongs to the positive class.
It should be noted that processing the initial sample features with a metric learning algorithm expands and extends sample data that carries less information, improving the training quality of the information recognition model given limited training resources and thereby improving information recognition accuracy.
Specific embodiments may refer to the examples shown in the above information identifying apparatus, and in this example, details are not described herein.
According to the embodiments provided by the application, account behavior information corresponding to each of a plurality of sample accounts is obtained; a plurality of initial sample features are acquired based on the account behavior information corresponding to each of the plurality of sample accounts, where the initial sample features correspond one-to-one to the sample accounts; the plurality of initial sample features are processed with a metric learning algorithm to obtain a plurality of target sample features; the plurality of target sample features are input into an initial academic recognition model for training to obtain a trained academic recognition model; and the account academic information associated with an account to be identified is identified based on the trained academic recognition model. By processing the initial sample features with a metric learning algorithm, sample data carrying less information is expanded and extended, achieving the technical aim of improving the training quality of the recognition model given limited training resources and thereby improving information recognition accuracy.
As an alternative, the third processing unit 1206 includes:
the dividing module is used for dividing the plurality of initial sample characteristics into a training set and a testing set;
a first training module, configured to train the initial sample features in the training set with a metric learning algorithm to obtain a mapping matrix W and a kernel matrix M;
a first calculation module, configured to calculate the original distance of each initial sample feature in the training set based on the mapping matrix W and the kernel matrix M;
the clustering module is used for clustering initial sample features in the training set by utilizing the original distance to obtain K clustering centers, wherein K is a natural number;
a second calculation module, configured to calculate a first distance from each initial sample feature in the training set to the cluster centers, and to sort the initial sample features in the training set based on the feature similarity corresponding to the first distance to obtain a first sorting result;
and an adding module, configured to take the initial sample features in the training set as the target sample features and add them into a target feature space according to the first sorting result.
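The sketch below is a simplified, assumption-laden rendering of the metric-learning reordering flow of the third processing unit: a random matrix stands in for the learned mapping matrix W (with M = W W^T as the induced kernel matrix), KMeans stands in for the clustering step, and samples are re-sorted by distance to their nearest cluster center. It is not the exact algorithm of this embodiment.

```python
# Simplified sketch of the metric-learning reordering step; W is a random
# stand-in for the learned mapping matrix, KMeans stands in for the clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))              # initial sample features (training set)
W = rng.normal(size=(8, 4))                # assumed learned mapping matrix W

X_mapped = X @ W                           # samples in the learned metric space
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_mapped)  # K cluster centers

# first distance: each sample's distance to its nearest cluster center
dist_to_center = np.min(
    np.linalg.norm(X_mapped[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2),
    axis=1,
)
order = np.argsort(dist_to_center)         # first sorting result (most central first)
target_feature_space = X[order]            # target sample features added in sorted order
```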
Specific embodiments may refer to examples shown in the above information identification method, and this example is not described herein.
As an alternative, the second input unit 1208 includes:
the second training module is used for inputting a plurality of target sample characteristics into the initial academic recognition model for training until reaching the training convergence condition:
acquiring a current academic recognition model, and inputting a target feature space into the current academic recognition model;
clustering each target sample feature in a target feature space by adopting a first distance unit to obtain M first clustering results, wherein M is a natural number; clustering each target sample feature in the target feature space by adopting a second distance unit to obtain M second clustering results;
calculating the M first clustering results to obtain a first co-association matrix; calculating the M second clustering results to obtain a second co-association matrix;
obtaining a target clustering result of each target sample feature in the target feature space according to the first co-association matrix and the second co-association matrix, wherein the target clustering result is used to indicate the probability that a target sample feature belongs to the feature corresponding to a target academic level;
and under the condition that the target clustering result reaches the training convergence condition, determining the current academic recognition model as a trained academic recognition model.
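For illustration, the following sketch builds co-association matrices from repeated clusterings under two different distance treatments and derives a crude probability-like score; the number of runs, the distance choices, and the seed-sample indices are all assumptions rather than settings of the second input unit.

```python
# Hedged sketch: co-association matrices from repeated clusterings under two
# distance treatments; run counts, cluster counts, and seed indices are assumed.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def co_association(X: np.ndarray, n_runs: int, n_clusters: int, seed: int) -> np.ndarray:
    """Fraction of runs in which each pair of samples lands in the same cluster."""
    n = len(X)
    co = np.zeros((n, n))
    for r in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=5,
                        random_state=seed + r).fit_predict(X)
        co += (labels[:, None] == labels[None, :]).astype(float)
    return co / n_runs

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))                      # target sample features

co_first = co_association(X, n_runs=10, n_clusters=4, seed=0)              # first distance unit
co_second = co_association(normalize(X), n_runs=10, n_clusters=4, seed=50) # cosine-like treatment

positive_idx = np.arange(10)                       # hypothetical known positive samples
prob_positive = 0.5 * (co_first[:, positive_idx].mean(axis=1)
                       + co_second[:, positive_idx].mean(axis=1))
```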
Specific embodiments may refer to examples shown in the above information identification method, and this example is not described herein.
As an optional solution, obtaining a target clustering result of each target sample feature in the target feature space according to the first co-association matrix and the second co-association matrix includes:
judging, by using a first threshold corresponding to the first co-association matrix, whether the elements in a first distance matrix corresponding to the first co-association matrix need to be adjusted; and judging, by using a second threshold corresponding to the second co-association matrix, whether the elements in a second distance matrix corresponding to the second co-association matrix need to be adjusted;
in a case where the elements in the first distance matrix and/or the elements in the second distance matrix need to be adjusted and the current iteration count is less than or equal to a target threshold, continuing to perform the above judgment, until the elements in the first distance matrix and the elements in the second distance matrix no longer need to be adjusted or the current iteration count is greater than the target threshold;
and under the condition that the elements in the first distance matrix and the elements in the second distance matrix do not need to be adjusted, acquiring a target clustering result by using a hierarchical clustering algorithm.
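A minimal sketch of the final step is given below, assuming the distance-matrix adjustment is a simple thresholded clipping and the hierarchical clustering uses average linkage; neither choice is specified by this embodiment.

```python
# Minimal sketch: clip-style adjustment of a distance matrix until stable, then
# average-linkage hierarchical clustering; thresholds and linkage are assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def maybe_adjust(dist: np.ndarray, threshold: float):
    """Clip entries above the threshold; report whether anything changed."""
    adjusted = np.minimum(dist, threshold)
    return adjusted, bool((adjusted != dist).any())

rng = np.random.default_rng(2)
co = rng.random((30, 30))
co = (co + co.T) / 2                       # symmetric co-association matrix
np.fill_diagonal(co, 1.0)
dist = 1.0 - co                            # co-association -> distance matrix

changed, iteration, max_iter = True, 0, 10
while changed and iteration <= max_iter:
    dist, changed = maybe_adjust(dist, threshold=0.9)
    iteration += 1

condensed = squareform(dist, checks=False)
labels = fcluster(linkage(condensed, method="average"), t=3, criterion="maxclust")
```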
Specific embodiments may refer to examples shown in the above information identification method, and this example is not described herein.
As an alternative, the apparatus further includes:
a fifth obtaining unit, configured to obtain account behavior information corresponding to each of a plurality of first sample accounts;
a sixth obtaining unit, configured to obtain a plurality of first sample features based on account behavior information corresponding to each of the plurality of first sample accounts, where each of the plurality of first sample features corresponds to each of the plurality of first sample accounts one-to-one;
the fourth processing unit is used for processing the plurality of first sample features by using a metric learning algorithm to obtain a plurality of second sample features;
the third input unit is used for inputting a plurality of second sample characteristics into the initial gender identification model for training to obtain a trained gender identification model;
and a second identification unit, configured to identify the account gender information associated with the account to be identified based on the trained gender recognition model.
Specific embodiments may refer to examples shown in the above information identification method, and this example is not described herein.
According to a further aspect of the embodiments of the present application, there is also provided an electronic device for implementing the above-described information identification method, as shown in fig. 13, the electronic device comprising a memory 1302 and a processor 1304, the memory 1302 having stored therein a computer program, the processor 1304 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.
Alternatively, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
S1, acquiring account behavior information corresponding to a target account to be identified;
S2, constructing basic portrait features of the target account based on historical behavior information in the account behavior information, and constructing business vertical features of the target account based on business behavior information in the account behavior information, wherein the historical behavior information is behavior information executed by the target account within a preset time range, and the business behavior information is behavior information executed by the target account on a target object within the preset range;
S3, performing aggregation processing in at least two time dimensions on the basic portrait features and the business vertical features to obtain at least two aggregated portrait features in different time dimensions;
S4, in a case where target features are acquired based on the aggregated portrait features, inputting the target features into an academic recognition model, wherein the academic recognition model is a neural network model trained with target sample features for recognizing the account academic level, and the target sample features are sample features obtained by processing initial sample features with a metric learning algorithm;
S5, obtaining an output result of the academic recognition model, wherein the output result includes the account academic information associated with the target account; or alternatively:
S1, acquiring account behavior information corresponding to each of a plurality of sample accounts;
S2, acquiring a plurality of initial sample features based on the account behavior information corresponding to each of the plurality of sample accounts, wherein each of the plurality of initial sample features corresponds one-to-one to each of the plurality of sample accounts;
S3, processing the plurality of initial sample features by using a metric learning algorithm to obtain a plurality of target sample features;
S4, inputting the plurality of target sample features into an initial academic recognition model for training to obtain a trained academic recognition model;
S5, identifying the account academic information associated with an account to be identified based on the trained academic recognition model.
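Purely as an illustrative sketch of steps S1-S5 above, the following Python code wires the stages together, with randomly generated features, a random linear map standing in for the learned metric, and an MLP classifier standing in for the academic recognition model; every dataset, shape, and hyperparameter here is an assumption.

```python
# Illustrative end-to-end sketch of steps S1-S5 (training); all data, shapes,
# and model choices are assumptions and stand-ins, not the patented method itself.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
initial_features = rng.normal(size=(500, 20))     # S2: initial sample features
labels = rng.integers(0, 3, size=500)             # hypothetical academic-level labels

W = rng.normal(size=(20, 8))                      # S3: stand-in for the learned metric mapping
target_features = initial_features @ W            #     "target sample features"

X_train, X_test, y_train, y_test = train_test_split(
    target_features, labels, test_size=0.2, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
model.fit(X_train, y_train)                       # S4: train the recognition model
probabilities = model.predict_proba(X_test)       # S5: per-level probabilities for accounts
```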
Alternatively, it will be understood by those skilled in the art that the structure shown in fig. 13 is only schematic, and the electronic device may also be a terminal device such as a smart phone (e.g. an Android phone or an iOS phone), a tablet computer, a palm computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, or the like. Fig. 13 does not limit the structure of the above electronic device. For example, the electronic device may further include more or fewer components (such as a network interface) than shown in fig. 13, or have a configuration different from that shown in fig. 13.
The memory 1302 may be used to store software programs and modules, such as program instructions/modules corresponding to the information identification methods and apparatuses in the embodiments of the present application, and the processor 1304 executes the software programs and modules stored in the memory 1302, thereby performing various functional applications and data processing, that is, implementing the information identification methods described above. The memory 1302 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1302 may further include memory located remotely from the processor 1304, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1302 may be used for storing account behavior information, aggregated portrait features, and account academic information, or information such as account behavior information, initial sample features, and account academic information, but is not limited thereto. As an example, as shown in fig. 13, the memory 1302 may include, but is not limited to, the first acquiring unit 1102, the constructing unit 1104, the aggregating unit 1106, the first input unit 1108, and the second acquiring unit 1110 (or the third acquiring unit 1202, the fourth acquiring unit 1204, the third processing unit 1206, the second input unit 1208, and the first identifying unit 1210) in the information identification apparatus. In addition, other module units in the information identification apparatus may also be included, but are not limited thereto, and are not described in detail in this example.
Optionally, the transmission device 1306 is configured to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission means 1306 comprises a network adapter (Network Interface Controller, NIC) which can be connected to other network devices and routers via network lines so as to communicate with the internet or a local area network. In one example, the transmission device 1306 is a Radio Frequency (RF) module for communicating wirelessly with the internet.
In addition, the electronic device further includes: a display 1308 for displaying the account behavior information, the aggregated portrait features, and the account academic information, or the account behavior information, the initial sample features, and the account academic information; and a connection bus 1310 for connecting the module components in the above electronic device.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting the plurality of nodes through a network communication. Among them, the nodes may form a Peer-To-Peer (P2P) network, and any type of computing device, such as a server, a terminal, etc., may become a node in the blockchain system by joining the Peer-To-Peer network.
According to one aspect of the present application, a computer program product is provided, comprising a computer program/instructions containing program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via a communication portion, and/or installed from a removable medium. When executed by a central processing unit, the computer program performs the various functions provided by the embodiments of the present application.
The foregoing embodiment numbers of the present application are merely for description and do not represent the advantages or disadvantages of the embodiments.
It should be noted that the computer system of the electronic device is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
The computer system includes a central processing unit (Central Processing Unit, CPU) which can execute various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) or a program loaded from a storage section into a random access Memory (Random Access Memory, RAM). In the random access memory, various programs and data required for the system operation are also stored. The CPU, the ROM and the RAM are connected to each other by bus. An Input/Output interface (i.e., I/O interface) is also connected to the bus.
The following components are connected to the input/output interface: an input section including a keyboard, a mouse, and the like; an output section including a cathode ray tube (CRT) or a liquid crystal display (Liquid Crystal Display, LCD), a speaker, and the like; a storage section including a hard disk or the like; and a communication section including a network interface card such as a local area network card, a modem, and the like. The communication section performs communication processing via a network such as the internet. A drive is also connected to the input/output interface as needed. Removable media such as magnetic disks, optical disks, magneto-optical disks, and semiconductor memories are mounted on the drive as needed, so that a computer program read from them is installed into the storage section as needed.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via a communication portion, and/or installed from a removable medium. The computer program, when executed by a central processing unit, performs the various functions defined in the system of the present application.
According to one aspect of the present application, there is provided a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various alternative implementations described above.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for executing the steps of:
S1, acquiring account behavior information corresponding to a target account to be identified;
S2, constructing basic portrait features of the target account based on historical behavior information in the account behavior information, and constructing business vertical features of the target account based on business behavior information in the account behavior information, wherein the historical behavior information is behavior information executed by the target account within a preset time range, and the business behavior information is behavior information executed by the target account on a target object within the preset range;
S3, performing aggregation processing in at least two time dimensions on the basic portrait features and the business vertical features to obtain at least two aggregated portrait features in different time dimensions;
S4, in a case where target features are acquired based on the aggregated portrait features, inputting the target features into an academic recognition model, wherein the academic recognition model is a neural network model trained with target sample features for recognizing the account academic level, and the target sample features are sample features obtained by processing initial sample features with a metric learning algorithm;
S5, obtaining an output result of the academic recognition model, wherein the output result includes the account academic information associated with the target account; or alternatively:
S1, acquiring account behavior information corresponding to each of a plurality of sample accounts;
S2, acquiring a plurality of initial sample features based on the account behavior information corresponding to each of the plurality of sample accounts, wherein each of the plurality of initial sample features corresponds one-to-one to each of the plurality of sample accounts;
S3, processing the plurality of initial sample features by using a metric learning algorithm to obtain a plurality of target sample features;
S4, inputting the plurality of target sample features into an initial academic recognition model for training to obtain a trained academic recognition model;
S5, identifying the account academic information associated with an account to be identified based on the trained academic recognition model.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing embodiment numbers of the present application are merely for description and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present application.
In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis; for a part not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described apparatus embodiments are merely exemplary; for example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed between the components may be implemented through some interfaces, units, or modules, and may be in electrical or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present application, and it should be noted that several modifications and adaptations may be made by those skilled in the art without departing from the principles of the present application; such modifications and adaptations are also intended to fall within the protection scope of the present application.

Claims (15)

1. An information identification method, comprising:
acquiring account behavior information corresponding to a target account to be identified;
constructing a basic portrait characteristic of the target account based on historical behavior information in the account behavior information, and constructing a business vertical characteristic of the target account based on business behavior information in the account behavior information, wherein the historical behavior information is behavior information executed by the target account in a preset time range, and the business behavior information is behavior information executed by the target account on a target object in the preset range;
Performing aggregation processing of at least two time dimensions on the basic portrait features and the business vertical features to obtain at least two aggregated portrait features in different time dimensions;
under the condition that the target feature is acquired based on the aggregated portrait features, inputting the target feature into an academic recognition model, wherein the academic recognition model is a neural network model which is trained by using target sample features and is used for recognizing the account academic level, and the target sample features are sample features obtained by processing initial sample features by using a metric learning algorithm;
and obtaining an output result of the academic recognition model, wherein the output result comprises account number academic information associated with the target account number.
2. The method according to claim 1, wherein the performing aggregation processing of at least two time dimensions on the basic portrait features and the business vertical features to obtain at least two aggregated portrait features in different time dimensions comprises:
acquiring basic portrait features in a first time period and business vertical features in the first time period, wherein the preset time range comprises the first time period, the basic portrait features in the first time period are constructed based on behavior information executed by the target account in the first time period, and the business vertical features in the first time period are constructed based on behavior information executed by the target account on the target object within the preset range in the first time period; and aggregating the basic portrait features in the first time period and the business vertical features in the first time period to obtain an aggregated portrait feature in a first time dimension, wherein the aggregated portrait features in at least two different time dimensions comprise the aggregated portrait feature in the first time dimension;
acquiring basic portrait features in a second time period and business vertical features in the second time period, wherein the preset time range comprises the second time period, the basic portrait features in the second time period are constructed based on behavior information executed by the target account in the second time period, and the business vertical features in the second time period are constructed based on behavior information executed by the target account on the target object within the preset range in the second time period; and aggregating the basic portrait features in the second time period and the business vertical features in the second time period to obtain an aggregated portrait feature in a second time dimension, wherein the aggregated portrait features in at least two different time dimensions comprise the aggregated portrait feature in the second time dimension.
3. The method of claim 2, wherein the aggregating the basic portrait features in the first time period and the business vertical features in the first time period to obtain the aggregated portrait feature in the first time dimension comprises:
carrying out aggregation processing on the basic portrait features in the first time period and the business vertical features in the first time period through an aggregation function, wherein the aggregation mode corresponding to the aggregation function comprises at least one of the following: summation, median, and standard deviation.
4. The method of claim 1, wherein prior to said inputting the target feature into an academic recognition model, the method comprises at least one of:
carrying out normalized numeric feature processing on the aggregated portrait features;
and performing discretized non-numeric feature processing on the aggregated portrait features.
5. The method of claim 4, wherein the performing discretized non-numeric feature processing on the aggregated portrait features comprises at least one of:
performing feature digitizing on the features belonging to categorical values in the aggregated portrait features;
replacing the category of a categorical feature in the aggregated portrait features with the occurrence frequency of that category;
converting the high-dimensional sparse categorical variables in the aggregated portrait features into low-dimensional dense continuous variables;
for missing values of a continuous feature in the aggregated portrait features, selecting the mean of the values of that continuous feature in the aggregated portrait features and filling the missing values of the continuous feature; or selecting the median of the values of that continuous feature in the aggregated portrait features and filling the missing values of the continuous feature;
for missing values of a discrete feature in the aggregated portrait features, selecting the most frequently occurring value of that discrete feature in the aggregated portrait features and filling the missing values of the discrete feature;
and merging multiple values under variables belonging to the same category in the aggregated portrait features into the same piece of information.
6. The method of claim 5, wherein the converting the high-dimensional sparse categorical variables in the aggregated portrait features into low-dimensional dense continuous variables comprises:
performing, based on a DNN model, feature embedding on the categorical variable representing the account behavior track of the target account to obtain a continuous variable representing the behavior pattern of the account behavior track.
7. The method according to any one of claims 1 to 6, wherein the constructing the business vertical feature of the target account based on the business behavior information in the account behavior information includes:
acquiring media information of a specific type associated with the account academic information;
and acquiring the behavior information of the target account, which is executed on the media information under the specific type, from the account behavior information, and constructing the business vertical feature based on the behavior information of the target account, which is executed on the media information under the specific type.
8. An information identification method, comprising:
acquiring account behavior information corresponding to each of a plurality of sample accounts;
acquiring a plurality of initial sample characteristics based on account behavior information corresponding to each of the plurality of sample accounts, wherein each initial sample characteristic in the plurality of initial sample characteristics corresponds to each sample account in the plurality of sample accounts one by one;
processing the plurality of initial sample features by using a metric learning algorithm to obtain a plurality of target sample features;
inputting the characteristics of the plurality of target samples into an initial academic recognition model for training to obtain a trained academic recognition model;
and identifying the account number academic information associated with the account number to be identified based on the trained academic identification model.
9. The method of claim 8, wherein processing the plurality of initial sample features using a metric learning algorithm results in a plurality of target sample features, comprising:
dividing the plurality of initial sample features into a training set and a testing set;
training the initial sample features in the training set by using the metric learning algorithm to obtain a mapping matrix W and a kernel matrix M;
Calculating an original distance of each initial sample feature in the training set based on the mapping matrix W and the kernel matrix M;
clustering initial sample features in the training set by using the original distance to obtain K clustering centers, wherein K is a natural number;
calculating a first distance from each initial sample feature in the training set to the clustering center, and sorting the initial sample features in the training set based on feature similarity corresponding to the first distance to obtain a first sorting result;
and taking the initial sample characteristics in the training set as the target sample characteristics, and adding the initial sample characteristics into a target characteristic space according to the first sequencing result.
10. The method of claim 9, wherein inputting the plurality of target sample features into an initial academic recognition model for training to obtain a trained academic recognition model, comprises:
inputting the characteristics of the target samples into an initial academic recognition model for training until reaching a training convergence condition:
acquiring a current academic recognition model, and inputting the target feature space into the current academic recognition model;
Clustering each target sample feature in the target feature space by adopting a first distance unit to obtain M first clustering results, wherein M is a natural number; clustering each target sample feature in the target feature space by adopting a second distance unit to obtain M second clustering results;
calculating the M first clustering results to obtain a first co-association matrix; calculating the M second clustering results to obtain a second co-association matrix;
obtaining a target clustering result of each target sample feature in the target feature space according to the first co-association matrix and the second co-association matrix, wherein the target clustering result is used to indicate the probability that a target sample feature belongs to a feature corresponding to a target academic level;
and under the condition that the target clustering result reaches the training convergence condition, determining the current academic recognition model as the trained academic recognition model.
11. The method of claim 10, wherein the obtaining the target clustering result of each target sample feature in the target feature space according to the first co-association matrix and the second co-association matrix comprises:
judging, by using a first threshold corresponding to the first co-association matrix, whether elements in a first distance matrix corresponding to the first co-association matrix need to be adjusted; and judging, by using a second threshold corresponding to the second co-association matrix, whether elements in a second distance matrix corresponding to the second co-association matrix need to be adjusted;
in a case where the elements in the first distance matrix and/or the elements in the second distance matrix need to be adjusted and the current iteration count is less than or equal to a target threshold, continuing to perform the above judgment, until the elements in the first distance matrix and the elements in the second distance matrix no longer need to be adjusted or the current iteration count is greater than the target threshold;
and under the condition that the elements in the first distance matrix and the elements in the second distance matrix do not need to be adjusted, acquiring the target clustering result by using a hierarchical clustering algorithm.
12. The method according to any one of claims 8 to 11, further comprising:
acquiring account behavior information corresponding to each of a plurality of first sample accounts;
Acquiring a plurality of first sample characteristics based on account behavior information corresponding to each of the plurality of first sample accounts, wherein each of the plurality of first sample characteristics corresponds to each of the plurality of first sample accounts one to one;
processing the plurality of first sample features by using the metric learning algorithm to obtain a plurality of second sample features;
inputting the plurality of second sample features into an initial gender identification model for training to obtain a trained gender identification model;
and identifying the account gender information associated with the account to be identified based on the trained gender identification model.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program is executable by a terminal device or a computer to perform the method of any one of claims 1 to 7, or 8 to 12.
14. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 7, or 8 to 12.
15. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 7, or 8 to 12 by means of the computer program.
