CN115004191A - Information identification method and device, storage medium and electronic equipment - Google Patents

Information identification method and device, storage medium and electronic equipment

Info

Publication number
CN115004191A
CN115004191A (application CN202080094848.9A)
Authority
CN
China
Prior art keywords
sample
samples
identified
distance
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080094848.9A
Other languages
Chinese (zh)
Inventor
李森林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Hefei Technology Co ltd
Original Assignee
Shenzhen Huantai Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huantai Digital Technology Co ltd filed Critical Shenzhen Huantai Digital Technology Co ltd
Publication of CN115004191A publication Critical patent/CN115004191A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition

Abstract

The embodiments of the present application disclose an information identification method and apparatus, a storage medium, and an electronic device. A sample to be identified and the behavior data corresponding to it are obtained, and attribute features and behavior features are extracted; a sample set is constructed from the sample to be identified, known positive samples, and known negative samples; hierarchical clustering is then performed on the sample set according to the sample distances to determine the target samples, thereby improving information identification accuracy.

Description

Information identification method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to an information identification method, an information identification apparatus, a storage medium, and an electronic device.
Background
With the development of electronic devices, their functions have become increasingly diverse. While this is convenient for users, it also greatly increases the difficulty of security control on electronic devices. For example, when identifying and labeling phone numbers received by an electronic device, or identifying and labeling an application to be installed on it, a large amount of data must be analyzed to identify a specific number or application. In the scenario of identifying a specific phone number, the related art generally diffuses outward from seed numbers by measuring the statistics of different distributions through mutual information, but this approach does not take into account the behavior characteristics of the information to be identified, resulting in low identification accuracy.
Disclosure of Invention
The embodiment of the application provides an information identification method, an information identification device, a storage medium and electronic equipment, which can improve the information identification accuracy.
In a first aspect, an embodiment of the present application provides an information identification method, including:
acquiring a sample to be identified and behavior data corresponding to the sample to be identified, extracting attribute characteristics of the sample to be identified, and extracting the behavior characteristics corresponding to the sample to be identified according to the behavior data;
constructing a sample set according to the sample to be identified, the known positive sample and the known negative sample;
performing hierarchical clustering processing on the sample set according to the sample distance between the sample to be identified and the known positive sample and the known negative sample to obtain a clustering tree, wherein the sample distance between the samples is calculated according to the attribute characteristics and the behavior characteristics of the samples;
determining a target sample category based on the clustering tree, wherein the target sample category is a sample category only containing the known positive sample and the sample to be identified;
and taking the sample to be identified in the target sample category as a target sample.
In a second aspect, an embodiment of the present application further provides an information identification apparatus, including:
the data acquisition unit is used for acquiring a sample to be identified and behavior data corresponding to the sample to be identified, extracting attribute characteristics of the sample to be identified and extracting the behavior characteristics corresponding to the sample to be identified according to the behavior data;
the sample construction unit is used for constructing a sample set according to the sample to be identified, the known positive sample and the known negative sample;
the cluster analysis unit is used for carrying out hierarchical clustering processing on the sample set according to the sample distance between the sample to be identified and the known positive sample and the known negative sample to obtain a cluster tree, wherein the sample distance between the samples is calculated according to the attribute characteristics and the behavior characteristics of the samples;
the class dividing unit is used for determining a target sample class based on the clustering tree, wherein the target sample class is a sample class only containing the known positive sample and the sample to be identified;
and the sample identification unit is used for taking the sample to be identified in the target sample category as a target sample.
In a third aspect, embodiments of the present application further provide a storage medium having a computer program stored thereon, wherein, when the computer program is executed on a computer, it causes the computer to execute the information identification method provided in any embodiment of the present application.
In a fourth aspect, an embodiment of the present application further provides an electronic device, including a processor and a memory, where the memory has a computer program, and the processor is configured to execute the information identification method provided in any embodiment of the present application by calling the computer program.
According to the technical solution, behavior data of the samples to be identified are collected and behavior features are extracted from them; the known positive samples and known negative samples serve as seed samples; hierarchical clustering is performed based on the seed samples and the behavior features of the samples to be identified to generate a clustering tree; the target sample category is determined from the clustering tree; and the target samples are then determined from the samples to be identified. Because the clustering takes the behavior features into account, the information identification accuracy is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a first information identification method according to an embodiment of the present application.
Fig. 2 is a first schematic diagram of a cluster tree in an information identification method according to an embodiment of the present application.
Fig. 3 is a second schematic diagram of a cluster tree in an information identification method according to an embodiment of the present application.
Fig. 4 is a schematic flowchart of a second information identification method according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of an information identification device according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of a first electronic device according to an embodiment of the present application.
Fig. 7 is a second structural schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It should be apparent that the described embodiments are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without inventive step, are intended to be within the scope of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
The embodiment of the present application provides an information identification method, where an execution subject of the information identification method may be the information identification apparatus provided in the embodiment of the present application, or an electronic device integrated with the information identification apparatus, where the information identification apparatus may be implemented in a hardware or software manner. The electronic device may be a smart phone, a tablet computer, a palm computer, a notebook computer, or a desktop computer.
Referring to fig. 1, fig. 1 is a first flowchart illustrating an information identification method according to an embodiment of the present disclosure. The specific process of the information identification method provided by the embodiment of the application can be as follows:
101. the method comprises the steps of obtaining a sample to be identified and behavior data corresponding to the sample to be identified, extracting attribute features of the sample to be identified, and extracting the behavior features corresponding to the sample to be identified according to the behavior data.
102. And constructing a sample set according to the sample to be identified, the known positive sample and the known negative sample.
The method of the embodiments of the present application is applied to scenarios in which target information with specific characteristics must be identified from a large amount of information. The semi-supervised identification scheme is not limited by computing power or development time, and can be applied even when only a small number of positive samples are known. For example, in a credit-default scenario, collection numbers need to be mined from a large number of point-to-point call records; the parties dialed by collection numbers can then serve as a strong feature of a default model. A collection number here refers to a telephone number used by a financial or other institution to press for repayment. In an application-administration scenario, it may be identified whether an application is of a particular type, such as a loan application, based on how the application is used.
The electronic device obtains one or more samples to be identified. It also obtains at least one known positive sample and at least one known negative sample, and the samples to be identified, the known positive samples, and the known negative samples together form the sample set. For example, in a scenario for identifying collection numbers, if there are 500 telephone numbers to be identified, 20 telephone numbers known to be collection numbers may be set as known positive samples and 20 telephone numbers known not to be collection numbers as known negative samples. A sample set is formed from these 540 samples.
It will be appreciated that providing a reasonable number of known positive and known negative samples can improve the accuracy of information identification; for example, a relatively large number of known positive and negative samples tends to improve accuracy. The user can therefore set the numbers of known positive and negative samples according to the total number of samples to be identified and the required accuracy of the identification result.
103. And performing hierarchical clustering processing on the sample set according to the sample distance between the sample to be identified and the known positive sample and the known negative sample to obtain a clustering tree, wherein the sample distance between the samples is calculated according to the attribute characteristics and the behavior characteristics of the samples.
After the sample set is obtained, a semi-supervised algorithm, for example a hierarchical clustering algorithm, performs hierarchical clustering on the sample set using the seed samples (the known positive and negative samples) to obtain a clustering tree. A hierarchical clustering algorithm typically first calculates the distances between samples, and each time merges the two closest samples into the same class. It then calculates the distances between classes and merges the two closest classes into a larger class, continuing until a single class remains.
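The merge loop described above can be sketched in Python. This is an illustrative, minimal implementation; the publication does not specify a linkage criterion or an implementation, so all names and the single-linkage choice are assumptions.

```python
# Minimal agglomerative (hierarchical) clustering sketch: repeatedly
# merge the two closest clusters until one remains, recording each
# merge.  Single linkage (closest pair of members) is assumed.

def hierarchical_cluster(samples, distance):
    """Return the merge history [(cluster_a, cluster_b, dist), ...]."""
    clusters = [frozenset([i]) for i in range(len(samples))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest inter-cluster
        # distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(distance(samples[i], samples[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges
```

The recorded merge history is exactly the clustering tree read from bottom to top: the first entries are the lowest joins, and the last entry joins everything into a single class.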
In this embodiment, the distance between samples is calculated based on the attribute features and behavior features of the samples. The attribute features can be extracted from the samples themselves, and the behavior features from the behavior data corresponding to the samples to be identified. Behavior data generally refers to data generated when a user uses the sample to be identified: if the sample to be identified is a telephone number, the behavior data may be call records; if the sample to be identified is an application, the behavior data may be the application's usage time points, usage durations, usage frequency, and so on.
In this embodiment, to ensure the accuracy of sample identification, the hierarchical clustering calculates only the distances between each sample to be identified and the known positive and negative samples.
Referring to fig. 2, fig. 2 is a first schematic diagram of a cluster tree in an information identification method according to an embodiment of the present disclosure. In the clustering tree, from bottom to top, the closest points are merged into the same class and connected together; merging continues upward until a single class remains, yielding the clustering tree.
104. And determining a target sample class based on the cluster tree, wherein the target sample class is a sample class only containing the known positive sample and the sample to be identified.
After the clustering tree is obtained, different categories can be obtained with different truncation heights. Referring to fig. 3, fig. 3 is a second schematic diagram of a cluster tree in an information identification method according to an embodiment of the present application. For example, truncating the clustering tree at a first truncation height yields two categories; at a second truncation height, three categories; and at a third truncation height, four categories. Thus, different truncation heights yield different numbers of identified target samples and different accuracy.
In some embodiments, the number of categories obtained after truncation may be preset. And after the clustering tree is obtained, performing truncation processing on the clustering tree according to the preset category number to obtain a classification result.
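Truncating at a preset number of categories can be sketched by stopping the agglomerative merging once that many clusters remain, which is equivalent to cutting the finished tree at the corresponding height. A minimal, self-contained Python sketch, with single linkage assumed for illustration:

```python
# Stop merging once n_classes clusters remain, which is equivalent to
# truncating the cluster tree at the height yielding n_classes classes.

def cut_into_classes(samples, distance, n_classes):
    clusters = [{i} for i in range(len(samples))]
    while len(clusters) > n_classes:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-linkage inter-cluster distance (assumption).
                d = min(distance(samples[i], samples[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters
```

For example, cutting four one-dimensional points into two classes groups the two low values together and the two high values together.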
And after the classification result is obtained, determining a sample class only containing the known positive sample and the sample to be identified from the obtained multiple classes as a target sample class.
Because the sample set contains the samples to be identified, the known positive samples, and the known negative samples, no matter how many classes are obtained through hierarchical clustering, each sample class falls into one of the following three types. Category a: contains at least one known positive sample and zero or more samples to be identified, with no known negative samples. Category b: contains at least one known negative sample and zero or more samples to be identified, with no known positive samples. Category c: contains at least one known positive sample, at least one known negative sample, and zero or more samples to be identified.
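The three-way case analysis above can be expressed as a small helper. The label names ("pos", "neg", "unk") are assumptions for illustration, not from the publication:

```python
# Classify a cluster into category a, b or c based on the labels of its
# members: "pos" = known positive, "neg" = known negative,
# "unk" = sample to be identified.

def cluster_category(labels):
    has_pos = "pos" in labels
    has_neg = "neg" in labels
    if has_pos and not has_neg:
        return "a"  # only known positives + samples to be identified
    if has_neg and not has_pos:
        return "b"  # only known negatives + samples to be identified
    if has_pos and has_neg:
        return "c"  # mixed: both kinds of seed samples
    return None     # no seed samples at all (not covered by the text)
```

Category "a" clusters are the ones from which target samples are taken.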
Category a is the target sample category to be determined by the present scheme. Depending on the truncation height, there may be one or more target sample categories.
Alternatively, in another embodiment, determining the target sample category based on the cluster tree may include: according to the top-down sequence, performing truncation processing on the clustering tree at a plurality of different positions to obtain a plurality of candidate classification results; determining a target classification result meeting a preset condition from a plurality of candidate classification results; and taking the sample class which only contains the known positive sample and the sample to be identified in the target classification result as a target sample class.
In this embodiment, the clustering tree may be truncated at a plurality of different positions to obtain a plurality of candidate classification results, which are then screened to determine an optimal classification result as the final one. For example, the preset condition may be: take as the final classification result the candidate in which the target samples account for the highest proportion of all samples to be identified. As another example, the preset condition may be: first eliminate, by certain rules, the candidate classification results that do not separate positive and negative samples well, and then choose, from the remaining candidates, the one in which the identified target samples account for the highest proportion of all samples to be identified.
105. And taking the sample to be identified in the target sample category as a target sample.
After the target sample category is determined, the sample to be identified in the target sample category is determined as the target sample.
In particular implementation, the present application is not limited by the execution sequence of the described steps, and some steps may be performed in other sequences or simultaneously without conflict.
Therefore, according to the information identification method provided by the embodiments of the present application, behavior data of the samples to be identified are collected and behavior features are extracted from them; the known positive and negative samples serve as seed samples; hierarchical clustering is performed based on the seed samples and the behavior features of the samples to be identified to generate a clustering tree; the target sample category is determined from the clustering tree; and the target samples are then determined from the samples to be identified, thereby improving information identification accuracy.
The method according to the preceding embodiment is illustrated in further detail below by way of example.
Referring to fig. 4, fig. 4 is a schematic flowchart of a second information identification method according to an embodiment of the present application. The method comprises the following steps:
201. and acquiring a sample to be identified and behavior data corresponding to the sample to be identified.
202. And extracting the attribute characteristics of the sample to be identified, and extracting the behavior characteristics corresponding to the sample to be identified according to the behavior data.
This embodiment explains the method proposed in the present application in detail, taking as an example identifying collection numbers from a large number of telephone numbers to be identified. In the following, the sample to be identified is a number to be identified, the behavior data are call records, the first behavior feature is the number of dialed calls, the second behavior feature is the number of rejected calls, and the attribute feature is the degree of digit overlap between numbers. The known positive samples are known collection numbers and the known negative samples are known non-collection numbers. The distance between a pair of numbers, calculated from these custom number behavior features, characterizes the similarity between the numbers, and a supervision signal can thus be introduced into the data source.
The electronic device collects a plurality of numbers to be identified, obtains at least one known collection number and at least one known non-collection number, and forms a number set from the numbers to be identified, the known collection numbers, and the known non-collection numbers. The call records of each number are obtained as behavior data.
The attribute feature is obtained from the number to be identified; for example, for an 11-digit number, the first 7 digits can be taken as the attribute feature. The behavior features are obtained from the call records; for example, for a given number to be identified, a time window such as one month can be set, and the numbers of dialed and rejected calls within that month are obtained from its call records as its behavior features. Because the behavior characteristics of collection numbers and non-collection numbers differ significantly, hierarchical clustering on these features can ensure the accuracy of number identification.
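The feature extraction described above can be sketched as follows. The record layout, field names, and the 30-day default window are assumptions for illustration, not details from the publication:

```python
from datetime import datetime, timedelta

# Extract the attribute feature (first 7 digits) and the two behavior
# features (dial count and reject count within a time window) for one
# number.  A call record is assumed to be a dict with "time"
# (datetime), "dialed" (bool, True if this number placed the call) and
# "rejected" (bool, True if the call was rejected).

def extract_features(number, records, window_days=30):
    cutoff = datetime.now() - timedelta(days=window_days)
    recent = [r for r in records if r["time"] >= cutoff]
    return {
        "prefix": number[:7],  # attribute feature: first 7 digits
        "dial_count": sum(1 for r in recent if r["dialed"]),
        "reject_count": sum(1 for r in recent if r["rejected"]),
    }
```

Records older than the window are ignored, so the features describe only recent behavior.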
203. And constructing a sample set according to the sample to be identified, the known positive sample and the known negative sample.
204. And calculating the sample distance between the sample to be identified and the known positive sample and the known negative sample according to the attribute characteristics and the behavior characteristics.
In this embodiment, a distance may be calculated for each of the different features, and the three distances combined to obtain the final sample distance between samples.
For example, a first distance between samples is calculated according to a first behavior feature of the samples, a second distance between samples is calculated according to a second behavior feature of the samples, and a third distance between samples is calculated according to an attribute feature of the samples; the sample distance is calculated from the first distance, the second distance, and the third distance.
For collection-number identification, a distance can be calculated for each of the number of dialed calls, the number of rejected calls, and the degree of digit overlap, and the sample distance is then obtained by combining the three distances.
For example, in one embodiment, the sample distance between a number to be identified and a known collection number (or known non-collection number) may be calculated as follows.
Let s_i be a number to be identified and s_j a known collection number. The sample distance d(s_i, s_j) between the two numbers is calculated according to the following formula:
[Equation image PCTCN2020080449-APPB-000001]
where d_dial is the first distance, d_reject is the second distance, and d_overlap is the third distance:
[Equation image PCTCN2020080449-APPB-000002]
[Equation image PCTCN2020080449-APPB-000003]
[Equation image PCTCN2020080449-APPB-000004]
where d_i denotes the number of calls dialed by s_i within a preset time window, r_i denotes the number of rejected calls of s_i within the window, r_j denotes the number of rejected calls of s_j within the window, k* denotes the number of coincident digits in the first N digits of the two numbers, and k denotes the total number of digits.
The number of rejected calls of a number refers to how many times its calls to other telephone numbers were rejected. The value of N may be set as desired; for example, with N = 7 and total digits k = 11, if s_i and s_j coincide in 4 of their first 7 digits, then k*/k = 4/11.
The above formula treats the contributions of the three features as equal when calculating the sample distance, that is, their weights are all 1/3. It is understood that in other embodiments different weights may be set for them as desired. Likewise, in other embodiments more behavior features may be extracted from the call records; the two behavior features here are merely examples, and the present disclosure is not limited thereto.
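A possible form of the sample distance can be sketched in Python. The exact component formulas appear only as equation images in the publication, so the forms below are assumptions chosen to be consistent with the surrounding text: normalized absolute differences for the two behavior features, 1 - k*/k for the digit-overlap feature (matching the 4/11 example above), and the equal 1/3 weights just stated:

```python
# Hedged sketch of the three-component sample distance.  The component
# formulas are assumptions; only the equal 1/3 weighting and the
# k*/k overlap ratio are stated in the surrounding text.
# a, b: dicts with "number" (digit string), "dials", "rejects" keys
# (assumed shape).

def sample_distance(a, b, n_prefix=7):
    d_dial = abs(a["dials"] - b["dials"]) / max(a["dials"], b["dials"], 1)
    d_reject = (abs(a["rejects"] - b["rejects"])
                / max(a["rejects"], b["rejects"], 1))
    k = len(a["number"])  # total digits, e.g. 11
    k_star = sum(x == y for x, y in zip(a["number"][:n_prefix],
                                        b["number"][:n_prefix]))
    d_overlap = 1 - k_star / k
    return (d_dial + d_reject + d_overlap) / 3  # equal 1/3 weights
```

Two numbers sharing the same 7-digit prefix and identical call behavior differ only in the overlap term, which is then (1 - 7/11) / 3.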
205. And performing hierarchical clustering processing on the sample set according to the sample distance between the sample to be identified and the known positive sample and the known negative sample to obtain a clustering tree.
Hierarchical clustering is performed on the sample set based on the distances between samples, each time merging the two closest samples into the same class. The distances between classes are then calculated and the two closest classes are merged into a larger class; merging continues until a single class remains, finally yielding the clustering tree.
206. And according to the top-down sequence, performing truncation processing on the clustering tree at a plurality of different positions to obtain a plurality of candidate classification results.
After the clustering tree is obtained, different categories can be obtained through different truncation heights. Referring to fig. 3, fig. 3 is a second schematic diagram of a cluster tree in an information identification method according to an embodiment of the present application. Where H1 through Hn are the n samples in the sample set.
For example, truncating the clustering tree at a first truncation height yields two categories; at a second truncation height, three categories; and at a third truncation height, four categories. Thus, different truncation heights yield different numbers of identified target samples and different accuracy.
207. And determining a target classification result meeting a preset condition from the candidate classification results.
In one embodiment, 207 may comprise:
for each candidate classification result, classifying a sample class containing only the known positive samples and samples to be identified as a first sample class, and a sample class containing only the known negative samples and samples to be identified as a second sample class; when the ratio of the number of known positive samples in the first sample class to the number of all known positive samples is smaller than a first preset threshold, or the ratio of the number of known negative samples in the second sample class to the number of all known negative samples is smaller than a second preset threshold, removing that candidate classification result; and taking, from the remaining candidates, the one with the largest ratio of the number of samples to be identified in the first sample class to the number of all samples to be identified as the target classification result.
The first sample class is category a in the previous embodiment, and the second sample class is category b. The ratio of the number of known positive samples in the first sample class to the number of all known positive samples is denoted Lc, the ratio of the number of known negative samples in the second sample class to the number of all known negative samples is denoted Lf, the first preset threshold is θc, and the second preset threshold is θf. The ratio of the number of samples to be identified in the first sample class to the number of all samples to be identified is denoted Lu.
The clustering results with Lc < θc or Lf < θf are excluded; in this way, any clustering result that does not separate collection numbers from non-collection numbers well is removed. From the remaining clustering results, the candidate classification result corresponding to the truncation height with the largest Lu value is selected as the target classification result. That is, the clustering results are filtered by a screening method based on business rules, and expert experience is introduced into the model through user-defined rule thresholds, thereby improving the model's identification accuracy.
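The screening rule above can be sketched as follows; the field names and default threshold values are illustrative assumptions:

```python
# Business-rule screening of candidate classification results.
# Each candidate is a dict with the ratios Lc, Lf, Lu already computed
# for one truncation height.  Candidates that do not keep enough of the
# known positives (Lc) or known negatives (Lf) cleanly separated are
# dropped; among the rest, the one covering the most samples to be
# identified (largest Lu) wins.

def pick_target_result(candidates, theta_c=0.8, theta_f=0.8):
    kept = [c for c in candidates
            if c["Lc"] >= theta_c and c["Lf"] >= theta_f]
    if not kept:
        return None
    return max(kept, key=lambda c: c["Lu"])
```

The thresholds theta_c and theta_f are the user-defined rule thresholds through which expert experience enters the model.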
208. And taking the sample class which only comprises the known positive sample and the sample to be identified in the target classification result as a target sample class.
209. The samples to be identified in the target sample category are determined to be collection numbers.
After the target classification result is obtained, the sample class containing only the known positive samples and samples to be identified is determined from the obtained classes as the target sample class, and the samples to be identified in it are determined as the target samples. By adopting this rule- and seed-number-based semi-supervised clustering, collection numbers and non-collection numbers are separated to the greatest extent, improving the accuracy of collection-number identification.
As can be seen from the above, the information identification method provided by the embodiments of the present application collects behavior data of the numbers to be identified, extracts behavior features from the behavior data, uses known collection numbers and non-collection numbers as seed samples, performs hierarchical clustering based on the seed samples and the behavior features of the numbers to be identified to generate a clustering tree, determines the target sample category from the clustering tree, and then determines the target samples from the samples to be identified, thereby improving identification accuracy.
An information identification apparatus is also provided in an embodiment. Referring to fig. 5, fig. 5 is a schematic structural diagram of an information identification device 300 according to an embodiment of the present application. The information identification apparatus 300 is applied to an electronic device, and the information identification apparatus 300 includes a data acquisition unit 301, a sample construction unit 302, a cluster analysis unit 303, a category classification unit 304, and a sample identification unit 305, as follows:
the data acquisition unit 301 is configured to acquire a sample to be identified and behavior data corresponding to the sample to be identified, extract attribute features of the sample to be identified, and extract behavior features corresponding to the sample to be identified according to the behavior data;
a sample construction unit 302, configured to construct a sample set according to the to-be-identified sample, the known positive sample, and the known negative sample;
a cluster analysis unit 303, configured to perform hierarchical clustering on the sample set according to sample distances between the to-be-identified sample and the known positive sample and the known negative sample to obtain a cluster tree, where the sample distance between samples is calculated according to the attribute feature and the behavior feature of the sample;
a category classification unit 304, configured to determine a target sample category based on the cluster tree, where the target sample category is a sample category that only includes the known positive sample and the sample to be identified;
a sample identification unit 305, configured to use the sample to be identified in the target sample category as a target sample.
In some embodiments, the category classification unit 304 is further configured to: truncating the cluster tree at a plurality of different positions in top-down order to obtain a plurality of candidate classification results;
determining a target classification result meeting a preset condition from a plurality of candidate classification results;
and taking the sample class which only comprises the known positive sample and the sample to be identified in the target classification result as a target sample class.
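Truncating the cluster tree at several heights to obtain candidate classification results can be sketched with a union-find over the merge history. The merge-history format `(i, j, height)` with sample indices and the function names are assumptions:

```python
def cut_at(merges, n, height):
    """Cut the cluster tree at a given height.

    merges: [(i, j, h), ...] sample-index pairs with the height h at which
    their clusters were merged. Returns the resulting partition of range(n):
    only merges with h <= height are applied.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j, h in merges:
        if h <= height:
            parent[find(i)] = find(j)
    groups = {}
    for k in range(n):
        groups.setdefault(find(k), set()).add(k)
    return sorted(groups.values(), key=min)

def candidate_partitions(merges, n):
    """One candidate classification result per distinct merge height, top-down."""
    heights = sorted({h for _, _, h in merges}, reverse=True)
    return [cut_at(merges, n, h) for h in heights]
```

Each returned partition is then scored against the seed samples (the Lc/Lf/Lu rules of the filtering step) to select the target classification result.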
In some embodiments, the category classification unit 304 is further configured to:
for each candidate classification result, classifying a sample class only containing the known positive sample and the sample to be identified into a first sample class, and classifying a sample class only containing the known negative sample and the sample to be identified into a second sample class;
when the ratio of the number of known positive samples in the first sample category to the number of all known positive samples is smaller than a first preset threshold, or the ratio of the number of known negative samples in the second sample category to the number of all known negative samples is smaller than a second preset threshold, removing the candidate classification result;
and taking, from the remaining candidate classification results, the candidate classification result with the largest ratio of the number of samples to be identified in the first sample category to the number of all samples to be identified as the target classification result.
In some embodiments, the behavioral characteristic comprises a first behavioral characteristic and a second behavioral characteristic; the cluster analysis unit 303 is further configured to:
calculating a first distance between the samples according to the first behavior features of the samples, calculating a second distance between the samples according to the second behavior features of the samples, and calculating a third distance between the samples according to the attribute features of the samples;
calculating the sample distance from the first distance, the second distance, and the third distance.
In some embodiments, the sample to be identified is a number to be identified, the behavior data is call information, the first behavior characteristic is the number of dialing attempts, the second behavior characteristic is the number of rejected calls, and the attribute characteristic is the degree of number overlap.
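For this telephone-number embodiment, the three-part sample distance might be sketched as below. The per-feature metrics (absolute difference for the two counts, a Jaccard-style overlap for contacted numbers) and the equal weighting are assumptions; the text only states that the three distances are combined:

```python
def sample_distance(a, b, weights=(1.0, 1.0, 1.0)):
    """Combine three per-feature distances into one sample distance.

    a, b: dicts with 'dials' (dialing-attempt count), 'rejects'
    (rejected-call count), and 'contacts' (set of contacted numbers).
    """
    d1 = abs(a["dials"] - b["dials"])      # first distance: dialing attempts
    d2 = abs(a["rejects"] - b["rejects"])  # second distance: rejected calls
    # Third distance from the degree of number overlap (Jaccard-style).
    union = a["contacts"] | b["contacts"]
    overlap = len(a["contacts"] & b["contacts"]) / len(union) if union else 1.0
    d3 = 1.0 - overlap
    w1, w2, w3 = weights
    return w1 * d1 + w2 * d2 + w3 * d3
```

In a real deployment the raw counts would typically be normalized before weighting so that no single feature dominates the combined distance; that step is omitted here for brevity.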
In some embodiments, the sample identification unit 305 determines the samples to be identified in the target sample category to be collection numbers.
In specific implementation, the above modules may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and specific implementation of the above modules may refer to the foregoing method embodiments, which are not described herein again.
It should be noted that the information identification apparatus provided in the embodiment of the present application and the information identification method in the foregoing embodiment belong to the same concept, and any method provided in the embodiment of the information identification method may be run on the information identification apparatus, and the specific implementation process thereof is described in detail in the embodiment of the information identification method, and is not described again here.
As can be seen from the above, the information identification apparatus provided in the embodiment of the present application collects behavior data of the sample to be identified, extracts behavior features from the behavior data, uses known positive and negative samples as seed samples, performs hierarchical clustering based on the seed samples and the behavior features of the samples to be identified to generate a cluster tree, determines the target sample category from the cluster tree, and then determines the target samples from the samples to be identified.
The embodiment of the application also provides the electronic equipment. The electronic device can be a smart phone, a tablet computer and the like. Referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. The electronic device 400 comprises a processor 401 and a memory 402. The processor 401 is electrically connected to the memory 402.
The processor 401 is the control center of the electronic device 400. It connects the various parts of the electronic device using various interfaces and lines, and performs the various functions of the electronic device and processes data by running or calling computer programs stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the electronic device as a whole.
The memory 402 may be used to store computer programs and data. The memory 402 stores computer programs containing instructions executable by the processor; the computer programs may constitute various functional modules. The processor 401 performs various functional applications and data processing by calling the computer programs stored in the memory 402.
In this embodiment, the processor 401 in the electronic device 400 loads instructions corresponding to processes of one or more computer programs into the memory 402 according to the following steps, and the processor 401 runs the computer programs stored in the memory 402, so as to implement various functions:
acquiring a sample to be identified and behavior data corresponding to the sample to be identified, extracting attribute characteristics of the sample to be identified, and extracting the behavior characteristics corresponding to the sample to be identified according to the behavior data;
constructing a sample set according to the sample to be identified, the known positive sample and the known negative sample;
performing hierarchical clustering processing on the sample set according to the sample distance between the sample to be identified and the known positive sample and the known negative sample to obtain a clustering tree, wherein the sample distance between the samples is calculated according to the attribute characteristics and the behavior characteristics of the samples;
determining a target sample category based on the clustering tree, wherein the target sample category is a sample category only containing the known positive sample and the sample to be identified;
and taking the sample to be identified in the target sample category as a target sample.
In some embodiments, please refer to fig. 7, wherein fig. 7 is a second structural diagram of an electronic device according to an embodiment of the present application. The electronic device 400 further comprises: radio frequency circuit 403, display 404, control circuit 405, input unit 406, audio circuit 407, sensor 408, and power supply 409. The processor 401 is electrically connected to the radio frequency circuit 403, the display 404, the control circuit 405, the input unit 406, the audio circuit 407, the sensor 408, and the power source 409.
The radio frequency circuit 403 is used for transmitting and receiving radio frequency signals, so as to communicate with network devices or other electronic devices through wireless communication.
The display screen 404 may be used to display information entered by or provided to the user as well as various graphical user interfaces of the electronic device, which may be comprised of images, text, icons, video, and any combination thereof.
The control circuit 405 is electrically connected to the display screen 404, and is configured to control the display screen 404 to display information.
The input unit 406 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. The input unit 406 may include a fingerprint recognition module.
The audio circuit 407 may provide an audio interface between the user and the electronic device through a speaker and a microphone. The audio circuit 407 includes a microphone, which is electrically connected to the processor 401 and is used for receiving voice information input by the user.
The sensor 408 is used to collect external environmental information. The sensors 408 may include one or more of ambient light sensors, acceleration sensors, gyroscopes, etc.
The power supply 409 is used to power the various components of the electronic device 400. In some embodiments, power source 409 may be logically coupled to processor 401 via a power management system, such that functions of managing charging, discharging, and power consumption are performed via the power management system.
Although not shown in the drawings, the electronic device 400 may further include a camera, a bluetooth module, and the like, which are not described in detail herein.
In this embodiment, the processor 401 in the electronic device 400 loads instructions corresponding to processes of one or more computer programs into the memory 402 according to the following steps, and the processor 401 executes the computer programs stored in the memory 402, thereby implementing various functions:
acquiring a sample to be identified and behavior data corresponding to the sample to be identified, extracting attribute characteristics of the sample to be identified, and extracting the behavior characteristics corresponding to the sample to be identified according to the behavior data;
constructing a sample set according to the sample to be identified, the known positive sample and the known negative sample;
performing hierarchical clustering processing on the sample set according to the sample distance between the sample to be identified and the known positive sample and the known negative sample to obtain a clustering tree, wherein the sample distance between the samples is calculated according to the attribute characteristics and the behavior characteristics of the samples;
determining a target sample category based on the clustering tree, wherein the target sample category is a sample category only containing the known positive sample and the sample to be identified;
and taking the sample to be identified in the target sample category as a target sample.
In some embodiments, in determining the target sample class based on the cluster tree, processor 401 performs:
according to the top-down sequence, performing truncation processing on the clustering tree at a plurality of different positions to obtain a plurality of candidate classification results;
determining a target classification result meeting a preset condition from a plurality of candidate classification results;
and taking the sample class which only contains the known positive sample and the sample to be identified in the target classification result as a target sample class.
In some embodiments, when a target classification result satisfying a preset condition is determined from the plurality of candidate classification results, the processor 401 performs:
for each candidate classification result, classifying a sample class only containing the known positive sample and the sample to be identified into a first sample class, and classifying a sample class only containing the known negative sample and the sample to be identified into a second sample class;
when the ratio of the number of known positive samples in the first sample class to the number of all known positive samples is smaller than a first preset threshold, or the ratio of the number of known negative samples in the second sample class to the number of all known negative samples is smaller than a second preset threshold, removing the candidate classification result;
and taking, from the remaining candidate classification results, the candidate classification result with the largest ratio of the number of samples to be identified in the first sample category to the number of all samples to be identified as the target classification result.
In some embodiments, in calculating the sample distance between samples from the attribute features and the behavior features of the samples, processor 401 performs:
calculating a first distance between the samples according to the first behavior features of the samples, calculating a second distance between the samples according to the second behavior features of the samples, and calculating a third distance between the samples according to the attribute features of the samples;
calculating the sample distance from the first distance, the second distance, and the third distance.
As can be seen from the above, the electronic device acquires the behavior data of the samples to be identified, extracts behavior features from the behavior data, uses known positive and negative samples as seed samples, performs hierarchical clustering based on the seed samples and the behavior features of the samples to be identified to generate a cluster tree, determines the target sample category from the cluster tree, and then determines the target samples from the samples to be identified.
An embodiment of the present application further provides a storage medium, where a computer program is stored, and when the computer program runs on a computer, the computer executes the information identification method described in any of the above embodiments.
It should be noted that, a person skilled in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by the relevant hardware instructed by a computer program, and the computer program can be stored in a computer readable storage medium, which can include but is not limited to: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Furthermore, the terms "first", "second", and "third", etc. in this application are used to distinguish different objects, and are not used to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to only those steps or modules listed, but rather, some embodiments may include other steps or modules not listed or inherent to such process, method, article, or apparatus.
The information identification method, the information identification device, the storage medium and the electronic device provided by the embodiments of the present application are described in detail above. The principle and the implementation of the present application are explained herein by applying specific examples, and the above description of the embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (20)

  1. An information identification method, comprising:
    acquiring a sample to be identified and behavior data corresponding to the sample to be identified, extracting attribute characteristics of the sample to be identified, and extracting the behavior characteristics corresponding to the sample to be identified according to the behavior data;
    constructing a sample set according to the sample to be identified, the known positive sample and the known negative sample;
    performing hierarchical clustering processing on the sample set according to the sample distance between the sample to be identified and the known positive sample and the known negative sample to obtain a clustering tree, wherein the sample distance between the samples is calculated according to the attribute characteristics and the behavior characteristics of the samples;
    determining a target sample category based on the clustering tree, wherein the target sample category is a sample category only containing the known positive sample and the sample to be identified;
    and taking the sample to be identified in the target sample category as a target sample.
  2. The information identification method of claim 1, wherein the determining a target sample class based on the cluster tree comprises:
    according to the top-down sequence, performing truncation processing on the clustering tree at a plurality of different positions to obtain a plurality of candidate classification results;
    determining a target classification result meeting a preset condition from a plurality of candidate classification results;
    and taking the sample class which only contains the known positive sample and the sample to be identified in the target classification result as a target sample class.
  3. The information identification method according to claim 2, wherein the determining a target classification result satisfying a preset condition from the plurality of candidate classification results comprises:
    for each candidate classification result, classifying a sample class only containing the known positive sample and the sample to be identified into a first sample class, and classifying a sample class only containing the known negative sample and the sample to be identified into a second sample class;
    when the ratio of the number of known positive samples in the first sample category to the number of all known positive samples is smaller than a first preset threshold, or the ratio of the number of known negative samples in the second sample category to the number of all known negative samples is smaller than a second preset threshold, removing the candidate classification result;
    and taking, from the remaining candidate classification results, the candidate classification result with the largest ratio of the number of samples to be identified in the first sample category to the number of all samples to be identified as the target classification result.
  4. An information identification method as claimed in claim 1, wherein the behavior feature comprises a first behavior feature and a second behavior feature; the calculating the sample distance between samples according to the attribute features and the behavior features of the samples comprises:
    calculating a first distance between the samples according to the first behavior features of the samples, calculating a second distance between the samples according to the second behavior features of the samples, and calculating a third distance between the samples according to the attribute features of the samples;
    calculating the sample distance from the first distance, the second distance, and the third distance.
  5. The information identification method according to claim 4, wherein the sample to be identified is a number to be identified, the behavior data is call information, the first behavior feature is a number of dialing attempts, the second behavior feature is a number of rejected calls, and the attribute feature is a degree of number overlap.
  6. The information identification method according to claim 5, wherein the taking the sample to be identified in the target sample category as the target sample comprises:
    and judging the sample to be identified in the target sample category as a collection urging number.
  7. An information recognition apparatus, comprising:
    the data acquisition unit is used for acquiring a sample to be identified and behavior data corresponding to the sample to be identified, extracting attribute characteristics of the sample to be identified and extracting the behavior characteristics corresponding to the sample to be identified according to the behavior data;
    the sample construction unit is used for constructing a sample set according to the sample to be identified, the known positive sample and the known negative sample;
    the cluster analysis unit is used for carrying out hierarchical clustering processing on the sample set according to the sample distance between the sample to be identified and the known positive sample and the known negative sample to obtain a cluster tree, wherein the sample distance between the samples is calculated according to the attribute characteristics and the behavior characteristics of the samples;
    the class dividing unit is used for determining a target sample class based on the clustering tree, wherein the target sample class is a sample class only containing the known positive sample and the sample to be identified;
    and the sample identification unit is used for taking the sample to be identified in the target sample category as a target sample.
  8. The information identifying apparatus as claimed in claim 7, wherein the category dividing unit is further configured to:
    according to the top-down sequence, performing truncation processing on the clustering tree at a plurality of different positions to obtain a plurality of candidate classification results;
    determining a target classification result meeting a preset condition from a plurality of candidate classification results;
    and taking the sample class which only contains the known positive sample and the sample to be identified in the target classification result as a target sample class.
  9. The information identifying apparatus as claimed in claim 8, wherein the category dividing unit is further configured to:
    for each candidate classification result, classifying a sample class only containing the known positive sample and the sample to be identified into a first sample class, and classifying a sample class only containing the known negative sample and the sample to be identified into a second sample class;
    when the ratio of the number of known positive samples in the first sample category to the number of all known positive samples is smaller than a first preset threshold, or the ratio of the number of known negative samples in the second sample category to the number of all known negative samples is smaller than a second preset threshold, removing the candidate classification result;
    and taking, from the remaining candidate classification results, the candidate classification result with the largest ratio of the number of samples to be identified in the first sample category to the number of all samples to be identified as the target classification result.
  10. The information recognition apparatus according to claim 7, wherein the behavior feature includes a first behavior feature and a second behavior feature; the cluster analysis unit is further configured to:
    calculating a first distance between the samples according to the first behavior features of the samples, calculating a second distance between the samples according to the second behavior features of the samples, and calculating a third distance between the samples according to the attribute features of the samples;
    calculating the sample distance from the first distance, the second distance, and the third distance.
  11. The information identification apparatus according to claim 10, wherein the sample to be identified is a number to be identified, the behavior data is call information, the first behavior feature is a number of dialing attempts, the second behavior feature is a number of rejected calls, and the attribute feature is a degree of number overlap.
  12. The information identifying apparatus as recited in claim 11, wherein the sample identifying unit is further configured to:
    and judging the sample to be identified in the target sample category as a collection urging number.
  13. A storage medium having a computer program stored thereon, wherein the computer program, when run on a computer, causes the computer to perform:
    acquiring a sample to be identified and behavior data corresponding to the sample to be identified, extracting attribute characteristics of the sample to be identified, and extracting the behavior characteristics corresponding to the sample to be identified according to the behavior data;
    constructing a sample set according to the sample to be identified, the known positive sample and the known negative sample;
    performing hierarchical clustering processing on the sample set according to the sample distance between the sample to be identified and the known positive sample and the known negative sample to obtain a clustering tree, wherein the sample distance between the samples is calculated according to the attribute characteristics and the behavior characteristics of the samples;
    determining a target sample category based on the clustering tree, wherein the target sample category is a sample category only containing the known positive sample and the sample to be identified;
    and taking the sample to be identified in the target sample category as a target sample.
  14. The storage medium of claim 13, wherein the computer program, when executed on a computer, causes the computer to further perform:
    according to the top-down sequence, the cluster tree is cut off at a plurality of different positions to obtain a plurality of candidate classification results;
    determining a target classification result meeting a preset condition from a plurality of candidate classification results;
    and taking the sample class which only contains the known positive sample and the sample to be identified in the target classification result as a target sample class.
  15. The storage medium of claim 14, wherein the computer program, when executed on a computer, causes the computer to further perform:
    for each candidate classification result, classifying a sample class only containing the known positive sample and the sample to be identified into a first sample class, and classifying a sample class only containing the known negative sample and the sample to be identified into a second sample class;
    when the ratio of the number of known positive samples in the first sample category to the number of all known positive samples is smaller than a first preset threshold, or the ratio of the number of known negative samples in the second sample category to the number of all known negative samples is smaller than a second preset threshold, removing the candidate classification result;
    and taking, from the remaining candidate classification results, the candidate classification result with the largest ratio of the number of samples to be identified in the first sample category to the number of all samples to be identified as the target classification result.
  16. The storage medium of claim 13, wherein the computer program, when executed on a computer, causes the computer to further perform:
    calculating a first distance between the samples according to the first behavior features of the samples, calculating a second distance between the samples according to the second behavior features of the samples, and calculating a third distance between the samples according to the attribute features of the samples;
    calculating the sample distance from the first distance, the second distance, and the third distance.
  17. An electronic device comprising a processor and a memory, the memory storing a computer program, wherein the processor, by invoking the computer program, is configured to perform:
    acquiring a sample to be identified and behavior data corresponding to the sample to be identified, extracting attribute characteristics of the sample to be identified, and extracting the behavior characteristics corresponding to the sample to be identified according to the behavior data;
    constructing a sample set according to the sample to be identified, the known positive sample and the known negative sample;
    performing hierarchical clustering processing on the sample set according to the sample distance between the sample to be identified and the known positive sample and the known negative sample to obtain a clustering tree, wherein the sample distance between the samples is calculated according to the attribute characteristics and the behavior characteristics of the samples;
    determining a target sample category based on the clustering tree, wherein the target sample category is a sample category only containing the known positive sample and the sample to be identified;
    and taking the sample to be identified in the target sample category as a target sample.
  18. The electronic device of claim 17, wherein the processor, by invoking the computer program, is further configured to perform:
    according to the top-down sequence, performing truncation processing on the clustering tree at a plurality of different positions to obtain a plurality of candidate classification results;
    determining a target classification result meeting a preset condition from a plurality of candidate classification results;
    and taking the sample class which only contains the known positive sample and the sample to be identified in the target classification result as a target sample class.
  19. The electronic device of claim 18, wherein the processor, by invoking the computer program, is further configured to perform:
    for each candidate classification result, assigning each sample category containing only known positive samples and samples to be identified to a first sample category, and each sample category containing only known negative samples and samples to be identified to a second sample category;
    removing the candidate classification result when the ratio of the number of known positive samples in the first sample category to the number of all known positive samples is smaller than a first preset threshold, or when the ratio of the number of known negative samples in the second sample category to the number of all known negative samples is smaller than a second preset threshold; and
    taking, among the remaining candidate classification results, the candidate classification result with the largest ratio of the number of samples to be identified in the first sample category to the number of all samples to be identified as the target classification result.
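This filtering rule can be sketched in plain Python. The partition representation (a list of clusters of sample indices), the threshold values, and the requirement that a qualifying category contain at least one known sample are assumptions not fixed by the claim:

```python
def select_target(candidates, labels, t1=0.5, t2=0.5):
    """Pick a target classification result per the claim's rule.

    candidates: list of partitions; each partition is a list of
    clusters, each cluster a list of sample indices.
    labels[i] is "pos" (known positive), "neg" (known negative),
    or "unk" (sample to be identified). t1 and t2 are the first
    and second preset thresholds (illustrative values).
    """
    n_pos = sum(l == "pos" for l in labels)
    n_neg = sum(l == "neg" for l in labels)
    n_unk = sum(l == "unk" for l in labels)
    best, best_ratio = None, -1.0
    for part in candidates:
        # First sample category: only known positives and unknowns.
        first = [i for c in part
                 if {labels[j] for j in c} <= {"pos", "unk"}
                 and any(labels[j] == "pos" for j in c)
                 for i in c]
        # Second sample category: only known negatives and unknowns.
        second = [i for c in part
                  if {labels[j] for j in c} <= {"neg", "unk"}
                  and any(labels[j] == "neg" for j in c)
                  for i in c]
        pos_ratio = sum(labels[i] == "pos" for i in first) / n_pos
        neg_ratio = sum(labels[i] == "neg" for i in second) / n_neg
        if pos_ratio < t1 or neg_ratio < t2:
            continue  # remove candidates that capture too few known samples
        # Among the rest, maximize the share of unknowns in the first category.
        unk_ratio = sum(labels[i] == "unk" for i in first) / n_unk
        if unk_ratio > best_ratio:
            best, best_ratio = part, unk_ratio
    return best
```

The sketch assumes at least one sample of each kind exists, so the three denominators are nonzero.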
  20. The electronic device of claim 17, wherein the processor, by invoking the computer program, is further configured to perform:
    calculating a first distance between samples according to the first behavior characteristics of the samples, a second distance between samples according to the second behavior characteristics of the samples, and a third distance between samples according to the attribute characteristics of the samples;
    and calculating the sample distance from the first distance, the second distance, and the third distance.
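One way to realize this combination is a weighted sum of per-feature-group Euclidean distances. The feature keys, the weights, and the Euclidean choice are all illustrative; the claim only requires that the sample distance be derived from the three component distances:

```python
import math

def sample_distance(a, b, w=(0.4, 0.4, 0.2)):
    """Weighted combination of three component distances.

    a and b are dicts with hypothetical keys "beh1" (first behavior
    characteristics), "beh2" (second behavior characteristics) and
    "attr" (attribute characteristics), each a vector of floats;
    w holds illustrative weights for the three component distances.
    """
    def euclid(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

    d1 = euclid(a["beh1"], b["beh1"])  # first distance
    d2 = euclid(a["beh2"], b["beh2"])  # second distance
    d3 = euclid(a["attr"], b["attr"])  # third distance
    return w[0] * d1 + w[1] * d2 + w[2] * d3
```

A pairwise matrix of these distances is what the hierarchical clustering of claim 17 would consume.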
CN202080094848.9A 2020-03-20 2020-03-20 Information identification method and device, storage medium and electronic equipment Pending CN115004191A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/080449 WO2021184363A1 (en) 2020-03-20 2020-03-20 Information identification method and apparatus, and storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN115004191A true CN115004191A (en) 2022-09-02

Family

ID=77771871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080094848.9A Pending CN115004191A (en) 2020-03-20 2020-03-20 Information identification method and device, storage medium and electronic equipment

Country Status (2)

Country Link
CN (1) CN115004191A (en)
WO (1) WO2021184363A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7492943B2 (en) * 2004-10-29 2009-02-17 George Mason Intellectual Properties, Inc. Open set recognition using transduction
CN104700098B * 2015-04-01 2018-02-02 China University of Mining and Technology (Beijing) Dark-environment face recognition method based on modified Fisherface recognition
CN107679557B (en) * 2017-09-19 2020-11-27 平安科技(深圳)有限公司 Driving model training method, driver identification method, device, equipment and medium
CN110414541B (en) * 2018-04-26 2022-09-09 京东方科技集团股份有限公司 Method, apparatus, and computer-readable storage medium for identifying an object
CN110532880B (en) * 2019-07-29 2022-11-22 深圳大学 Sample screening and expression recognition method, neural network, device and storage medium

Also Published As

Publication number Publication date
WO2021184363A1 (en) 2021-09-23

Similar Documents

Publication Publication Date Title
CN109815314B (en) Intent recognition method, recognition device and computer readable storage medium
CN110471858B (en) Application program testing method, device and storage medium
CN108197250B (en) Picture retrieval method, electronic equipment and storage medium
US11249645B2 (en) Application management method, storage medium, and electronic apparatus
CN111598012B (en) Picture clustering management method, system, device and medium
CN111797861A (en) Information processing method, information processing apparatus, storage medium, and electronic device
CN111798259A (en) Application recommendation method and device, storage medium and electronic equipment
CN111339737A (en) Entity linking method, device, equipment and storage medium
CN112651442A (en) Crime prediction method, crime prediction device, crime prediction equipment and computer readable storage medium
CN110288468B (en) Data feature mining method and device, electronic equipment and storage medium
CN111800445B (en) Message pushing method and device, storage medium and electronic equipment
CN114298123A (en) Clustering method and device, electronic equipment and readable storage medium
CN113505256A (en) Feature extraction network training method, image processing method and device
CN111798019B (en) Intention prediction method, intention prediction device, storage medium and electronic equipment
CN113011503B (en) Data evidence obtaining method of electronic equipment, storage medium and terminal
CN107154996B (en) Incoming call interception method and device, storage medium and terminal
CN116467153A (en) Data processing method, device, computer equipment and storage medium
CN112927719B (en) Risk information evaluation method, apparatus, device and storage medium
CN111880670B (en) Data processing method and system for intelligent wearable equipment in mobile phone
CN110728243B (en) Business management method, system, equipment and medium for right classification
CN113111689A (en) Sample mining method, device, equipment and storage medium
CN113761195A (en) Text classification method and device, computer equipment and computer readable storage medium
CN108197220A (en) Photo classification method, device, storage medium and electronic equipment
CN117333926B (en) Picture aggregation method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230614

Address after: 1301, Office Building T2, Qianhai China Resources Financial Center, No. 55 Guiwan Fourth Road, Nanshan Street, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong Province, 518052

Applicant after: Shenzhen Hefei Technology Co.,Ltd.

Address before: 2501, Office Building T2, Qianhai China Resources Financial Center, 55 Guiwan 4th Road, Nanshan Street, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong Province, 518052

Applicant before: Shenzhen Huantai Digital Technology Co.,Ltd.