CN113360617A - Abnormality recognition method, apparatus, device and storage medium

Info

Publication number: CN113360617A
Application number: CN202110633642.5A
Authority: CN
Inventors: 庞海龙; 岳江浩; 张玉东; 张文君; 张铮
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-06-07
Filing date: 2021-06-07
Publication date: 2021-09-07
Anticipated expiration: 2041-06-07
Also published as: CN113360617B

Abstract

The disclosure provides an anomaly identification method, apparatus, device and storage medium, relating to the technical field of artificial intelligence, in particular to intelligent search, machine learning and deep learning technologies. The specific implementation scheme is as follows: extracting text characteristic data from target question-answer data of a target object; determining behavior characteristic data according to behavior state data at the time the target object generates answer data of the target question-answer data; determining a target score according to the text characteristic data and the behavior characteristic data; and performing anomaly identification on the target object according to the target score. The disclosed technology improves the accuracy of the anomaly identification result for the target object.

Description

Abnormality recognition method, apparatus, device and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence, and more particularly, to intelligent search, machine learning, and deep learning techniques.

Background

A knowledge question-answering community is an interactive, open community that connects knowledge demand and knowledge supply for the public. Most such communities take the form of questions and answers among users so as to realize knowledge sharing.

However, because the community is open, some users exploit community resources for promotion and traffic diversion and post cheating content, which seriously affects the sustainable development of the community.

Disclosure of Invention

The disclosure provides an abnormality identification method, apparatus, device and storage medium.

According to an aspect of the present disclosure, there is provided an abnormality identification method including:

extracting text characteristic data in target question-answer data of a target object;

determining behavior characteristic data according to behavior state data when the target object generates answer data of the target question-answer data;

determining a target score according to the text characteristic data and the behavior characteristic data;

and performing anomaly identification on the target object according to the target score.

According to another aspect of the present disclosure, there is also provided an abnormality recognition apparatus including:

the text characteristic data extraction module is used for extracting text characteristic data in target question answering data of the target object;

the behavior characteristic data determining module is used for determining behavior characteristic data according to behavior state data when the target object generates answer data of the target question-answer data;

the target score determining module is used for determining a target score according to the text characteristic data and the behavior characteristic data;

and the abnormality identification module is used for carrying out abnormality identification on the target object according to the target score.

According to another aspect of the present disclosure, there is also provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the methods of anomaly identification provided by embodiments of the present disclosure.

According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform any one of the abnormality recognition methods provided by the embodiments of the present disclosure.

According to another aspect of the present disclosure, there is also provided a computer program product including a computer program, which when executed by a processor implements any one of the anomaly identification methods provided by the embodiments of the present disclosure.

The disclosed technology improves the accuracy of the anomaly identification result for the target object.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a flow chart of a method for anomaly identification provided by an embodiment of the present disclosure;

FIG. 2 is a flow chart of another method for identifying anomalies provided by embodiments of the present disclosure;

FIG. 3 is a flow chart of another method for identifying anomalies provided by embodiments of the present disclosure;

FIG. 4 is a flow chart of another method for identifying anomalies provided by embodiments of the present disclosure;

FIG. 5 is a block diagram of an anomaly identification method provided by an embodiment of the present disclosure;

FIG. 6 is a block diagram of an abnormality recognition apparatus according to an embodiment of the present disclosure;

FIG. 7 is a block diagram of an electronic device for implementing an anomaly identification method of an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The abnormality identification methods and the abnormality recognition apparatus provided by the disclosure are suitable for scenarios in which anomaly identification is performed on the question-answering activity of a target object in a knowledge question-answering community. The abnormality identification methods of the present disclosure may be executed by an abnormality recognition apparatus, which may be implemented in software and/or hardware and is specifically configured in an electronic device. The electronic device may be a terminal device or a server.

For ease of understanding, the present disclosure first describes the abnormality identification methods in detail.

Referring to fig. 1, an abnormality identification method includes:

S101, extracting text characteristic data from the target question-answer data of the target object.

The target object can be understood as a user account id uniquely characterizing the user. The question-answer data may include question data and/or answer data in a knowledge question-answer community. The target question-answer data includes answer data generated by the target object and/or question data corresponding to the answer data. It should be noted that the content in the target question and answer data may exist in the form of, but not limited to, pictures and texts.

The text characteristic data is useful information extracted from the image-text content of the target question-answer data. It serves as the reference basis, in the content dimension, for performing anomaly identification on the target object at the time the target question-answer data was generated.

S102, determining behavior characteristic data according to behavior state data when the target object generates answer data of the target question-answer data.

The behavior state data is used for representing behavior attributes and state attributes of the target object when answer data of the target question answering data is generated. The behavior attribute corresponds to a user behavior when answer data of the target question-answer data is generated, and is used as a reference basis for performing anomaly identification on a target object when the target question-answer data is generated from a behavior dimension. The state attribute corresponds to a generation environment when answer data of the target question and answer data is generated, and is used as a reference basis for performing anomaly identification on a target object when the target question and answer data is generated from environment dimensions.

S103, determining a target score according to the text characteristic data and the behavior characteristic data.

Optionally, the text characteristic data and the behavior characteristic data may be spliced and fused to obtain fused data; and determining a target score according to the fusion data. Or alternatively, the text characteristic data and the behavior characteristic data can be respectively used as parallel input to jointly determine the target score. Or optionally, respectively determining a text score according to the text characteristic data, and determining a behavior score according to the behavior characteristic data; and carrying out weighted average on the text score and the behavior score to obtain a target score. The weight in the weighted average can be set by a skilled person according to needs or empirical values, or determined through a large number of experiments.
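The third option above, a weighted average of a text score and a behavior score, can be sketched as follows. This is a minimal illustration only: the function name `fuse_scores`, the parameter `text_weight`, and its default of 0.5 are assumptions made for the example, not values given in the disclosure.

```python
def fuse_scores(text_score: float, behavior_score: float,
                text_weight: float = 0.5) -> float:
    """Return the weighted average of a text score and a behavior score.

    text_weight is a hypothetical hyperparameter; the disclosure only says
    the weights are set by need, by experience, or by experiment.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    # The behavior score receives the remaining weight so the weights sum to 1.
    return text_weight * text_score + (1.0 - text_weight) * behavior_score


# Equal weights: the target score is the plain mean of the two scores.
target_score = fuse_scores(0.8, 0.4)
```

Constraining the two weights to sum to 1 keeps the target score in the same range as its inputs, which makes a single score threshold easier to interpret.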

In a specific implementation, the text characteristic data and the behavior characteristic data may be concatenated and fused to obtain a fusion result, and the fusion result is input into a pre-trained first fusion evaluation model, which outputs the target score. The first fusion evaluation model may be obtained by training as follows: the concatenation of sample text characteristic data and sample behavior characteristic data extracted from sample question-answer data of a sample object is taken as a training sample, anomaly annotation data of the sample object at the time it generated the sample question-answer data is taken as the label, and a pre-constructed first machine learning model is trained.

In another specific implementation, the text characteristic data and the behavior characteristic data may be input in parallel into a pre-trained second fusion evaluation model, which outputs the target score. The second fusion evaluation model may be obtained by training as follows: sample text characteristic data and sample behavior characteristic data extracted from the sample question-answer data of the sample object are taken as a training sample pair, anomaly annotation data of the sample object at the time it generated the sample question-answer data is taken as the label, and a pre-constructed second machine learning model is trained.

In another specific implementation, the text characteristic data may be input into a pre-trained text evaluation model to obtain a text score; the behavior characteristic data may be input into a pre-trained behavior evaluation model to obtain a behavior score; and the text score and the behavior score may be weighted-averaged to obtain the target score. The text evaluation model and the behavior evaluation model may be trained independently of each other: sample text characteristic data extracted from sample question-answer data of a sample object is taken as a training sample, anomaly annotation data of the sample object at the time it generated the sample question-answer data is taken as the label, and a pre-constructed third machine learning model is trained to obtain the text evaluation model; sample behavior data of the sample object at the time the sample question-answer data was generated is taken as a training sample, anomaly annotation data of the sample object at the time it generated the sample question-answer data is taken as the label, and a pre-constructed fourth machine learning model is trained to obtain the behavior evaluation model. Alternatively, the text evaluation model and the behavior evaluation model may be trained jointly: sample text characteristic data and sample behavior characteristic data extracted from the sample question-answer data of the sample object are taken as the input data of a first sub-network and a second sub-network, respectively, of a third machine learning model; anomaly annotation data of the sample object at the time it generated the sample question-answer data is taken as the label; and the fused model is trained. The trained first sub-network is then used as the text evaluation model, and the trained second sub-network as the behavior evaluation model.

It should be noted that, the present disclosure does not limit the specific structures of the first machine learning model, the second machine learning model, and the third machine learning model, and may be implemented based on at least one model in the prior art, and only needs to ensure that the trained models have corresponding functions.

S104, performing anomaly identification on the target object according to the target score.

Illustratively, it is determined whether the target score satisfies an anomaly condition; if so, the target object is determined to be abnormal at the time the answer data of the target question-answer data was generated; otherwise, the target object is determined to be normal at that time.

In a specific implementation manner, if the target score is smaller than the abnormal score threshold, it is determined that the target object is abnormal when answer data of the target question-answer data is generated; and if the target score is not less than the abnormal score threshold value, determining that the target object is normal when answer data of the target question-answer data is generated. Wherein the anomaly score threshold can be set by a technician as desired or empirically, or determined iteratively through a number of experiments.
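The threshold rule described above can be written as a one-line comparison. The sketch below assumes, as the text states, that a lower target score indicates an anomaly; the function and parameter names are illustrative only.

```python
def is_abnormal(target_score: float, anomaly_score_threshold: float) -> bool:
    """Flag the target object as abnormal when its score is below the threshold.

    A score equal to the threshold is treated as normal, matching the
    "not less than" wording of the text.
    """
    return target_score < anomaly_score_threshold


# Hypothetical threshold of 0.5, chosen only for illustration.
flagged = is_abnormal(0.3, 0.5)
```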

In the technical solution of the present disclosure, the acquisition, storage, and application of the target object, the target question-answer data, and the behavior state data all comply with relevant laws and regulations and do not violate public order and good customs.

When anomaly identification is performed on the target object, both the text characteristic data and the behavior characteristic data of the target question-answer data are used to determine the target score, so that information of different dimensions is considered together in determining the target score. This improves the accuracy of the target score and, in turn, the accuracy of the anomaly identification result when the target object is identified according to the target score.

On the basis of the above technical solutions, the present disclosure also provides another alternative embodiment. In the embodiment, the extraction operation of the text characteristic data is optimized and improved.

Referring to fig. 2, an abnormality identification method includes:

S201, performing abnormal data identification on the target question-answer data under a preset dimension to obtain a data anomaly probability.

The abnormal data may include, but is not limited to, prohibited, junk, political, abusive and advertising image-text content. Prohibited image-text content may include content carrying pornographic or violent information; junk image-text content may include content carrying fraud-related information; and political image-text content may include content carrying information related to politics or related actions.

The preset dimension may include at least one of the prohibited, junk, political, abusive, and advertising dimensions, and the like.

In an optional embodiment, a data anomaly identification model may be trained in advance, and the target question-answer data is input into the trained data anomaly identification model to obtain the data anomaly probability. The data anomaly identification model may be obtained by training a pre-constructed machine learning model, taking sample question-answer data of a sample object as training samples and the data anomaly annotation of the sample question-answer data under the preset dimension as labels. The specific structure of the machine learning model is not limited, as long as it can realize the abnormal data identification function.

It should be noted that, the number of the data anomaly identification models may be one, and after the target question and answer data is input to the data anomaly identification models, the data anomaly probabilities under each preset dimension are output.

However, the data anomaly probability under different preset dimensions is determined through one anomaly identification model, and the accuracy of the identification result is poor, so that the accuracy of the text characteristic data is influenced. In order to improve the accuracy of the text feature data and further lay a foundation for improving the accuracy of the anomaly identification, in an optional embodiment, the corresponding anomaly identification models can be trained respectively for each preset dimension, so that the target question and answer data is input into the anomaly identification models corresponding to the preset dimensions, and the data anomaly probability of the corresponding preset dimension is obtained.
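The per-dimension arrangement can be pictured as a mapping from each preset dimension to its own scoring function. In the sketch below the trained models are replaced by a toy keyword matcher (`keyword_scorer`), which is purely a stand-in assumed for illustration; the disclosure does not specify the model structure.

```python
from typing import Callable, Dict, Iterable

# A "model" here is any callable mapping text to an anomaly probability.
Scorer = Callable[[str], float]


def keyword_scorer(keywords: Iterable[str]) -> Scorer:
    """Toy stand-in for a trained per-dimension model (assumption)."""
    lowered = [k.lower() for k in keywords]

    def score(text: str) -> float:
        # Returns 1.0 on any keyword hit, else 0.0; a real model would
        # return a calibrated probability.
        return 1.0 if any(k in text.lower() for k in lowered) else 0.0

    return score


def score_all_dimensions(text: str, models: Dict[str, Scorer]) -> Dict[str, float]:
    """Run the model of every preset dimension on the same input text."""
    return {dimension: model(text) for dimension, model in models.items()}


models = {
    "advertising": keyword_scorer(["buy now"]),
    "abusive": keyword_scorer(["idiot"]),
}
probabilities = score_all_dimensions("Buy now, limited offer", models)
```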

S202, carrying out character statistics on the target question-answer data to obtain character statistical data.

The character statistical data is used for representing the content richness of the target question answering data. For example, the character statistics may include text statistics such as text length, number of words, and word ratio; the character statistics may also include punctuation statistics such as punctuation type, punctuation quantity, and punctuation ratio.

Optionally, the text may include question text; accordingly, the text statistical data may include at least one of a length of the question text, a number of the question words, and a length ratio or a number ratio of the question words in the answer words.

Optionally, the text may include answer text; accordingly, the text statistical data may include at least one of the length of the answer text, the number of answer words, and the length ratio or the number ratio of the question words in the answer words.

Optionally, if the text includes both a question text and an answer text, the text statistical data may further include at least one of: the length ratio of the question text in the answer text, the ratio of the number of characters of the question text that also appear in the answer text, the number ratio of question punctuation in answer punctuation, and the like.
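A few of the character statistics described above can be computed with the standard library alone. The sketch below makes several simplifying assumptions that are not in the disclosure: words are split on whitespace, punctuation is limited to ASCII `string.punctuation`, and the key names are invented for the example. Chinese text would need a word segmenter and a wider punctuation set.

```python
import string


def character_statistics(question: str, answer: str) -> dict:
    """Compute simple character statistics for one question-answer pair."""
    answer_chars = [c for c in answer if not c.isspace()]
    question_charset = {c for c in question if not c.isspace()}
    punct_count = sum(1 for c in answer if c in string.punctuation)
    # Answer characters that also occur somewhere in the question.
    shared = sum(1 for c in answer_chars if c in question_charset)

    def ratio(numerator: float, denominator: float) -> float:
        # Guard against empty text instead of raising ZeroDivisionError.
        return numerator / denominator if denominator else 0.0

    return {
        "answer_length": len(answer),
        "answer_word_count": len(answer.split()),
        "answer_punct_count": punct_count,
        "answer_punct_ratio": ratio(punct_count, len(answer)),
        "question_answer_length_ratio": ratio(len(question), len(answer)),
        "shared_char_ratio": ratio(shared, len(answer_chars)),
    }


stats = character_statistics("What is AI?", "AI is cool!")
```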

S203, generating text characteristic data according to at least one of the data anomaly probability and the character statistical data.

At least one item is selected from the data anomaly probabilities corresponding to the preset dimensions and the character statistical data, and the selected items are concatenated and fused to obtain the text characteristic data.

It should be noted that the selection of the data anomaly probability and the character statistical data can be set by a skilled person according to needs or experience values, or determined repeatedly through a large number of experiments.

To prevent differences in units and scale between the character statistical data and the data anomaly probability from skewing the result, the character statistical data may be normalized before the text characteristic data is generated. Correspondingly, the text characteristic data is generated according to at least one of the data anomaly probability and the normalized character statistical data.

In order to improve the richness of the content carried by the text feature data and further improve the comprehensiveness of the text feature data, in an optional embodiment, the data anomaly probability and the character statistical data corresponding to each preset dimension may be spliced and fused according to a set sequence to obtain the text feature data. Wherein, the setting sequence can be uniformly set in advance by technicians.
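Normalizing the statistics and concatenating them with the per-dimension probabilities in a fixed order might look like the sketch below. Min-max scaling with clipping is one possible normalization, chosen here as an assumption; the dimension order, the statistic names, and the value ranges are likewise hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed fixed order of the preset dimensions (set in advance, per the text).
DIMENSION_ORDER = ("prohibited", "junk", "political", "abusive", "advertising")


def min_max_normalize(value: float, low: float, high: float) -> float:
    """Scale value into [0, 1], clipping values outside [low, high]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)
    return (clipped - low) / (high - low)


def build_text_features(
    anomaly_probs: Dict[str, float],
    stats: Dict[str, float],
    stat_ranges: Dict[str, Tuple[float, float]],
    stat_order: Sequence[str],
) -> List[float]:
    """Concatenate probabilities and normalized statistics in a set order."""
    features = [anomaly_probs[d] for d in DIMENSION_ORDER]
    for name in stat_order:
        low, high = stat_ranges[name]
        features.append(min_max_normalize(stats[name], low, high))
    return features


probs = {d: 0.1 for d in DIMENSION_ORDER}
probs["advertising"] = 0.9
vector = build_text_features(
    probs,
    {"answer_length": 50, "answer_word_count": 10},
    {"answer_length": (0, 200), "answer_word_count": (0, 40)},
    ("answer_length", "answer_word_count"),
)
```

Keeping the order fixed matters because a downstream model interprets each position of the vector as one specific feature.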

S204, determining behavior characteristic data according to the behavior state data when the target object generates answer data of the target question-answer data.

S205, determining a target score according to the text characteristic data and the behavior characteristic data.

S206, performing anomaly identification on the target object according to the target score.

This embodiment refines the generation of the text characteristic data into: performing abnormal data identification on the target question-answer data under preset dimensions to obtain data anomaly probabilities; performing character statistics on the target question-answer data to obtain character statistical data; and generating the text characteristic data according to at least one of the data anomaly probabilities and the character statistical data, thereby completing the mechanism for generating text characteristic data. Meanwhile, generating the text characteristic data from data anomaly probabilities of different preset dimensions and from character statistical data enriches the diversity and comprehensiveness of the text characteristic data, helps improve the accuracy of the target score, and lays a foundation for improving the accuracy of the anomaly identification result for the target object.

On the basis of the above technical solutions, the present disclosure also provides an alternative embodiment. In the embodiment, the determination process of the behavior characteristic data is optimized and improved.

Referring to fig. 3, an abnormality recognition method includes:

S301, extracting text characteristic data from the target question-answer data of the target object.

S302, determining abnormal behavior probability according to interactive behavior data when the target object generates answer data of the target question-answer data.

The interactive behavior data may include at least one of the following types of data: whether the answer data in the target question-answer data was answered in one go, whether the target question-answer data generation page was entered from a set entry page, the input speed of the answer data of the target question-answer data, and the like. The set entry page is the page where the aggregated entry to the target question-answer data generation page is located.

Exemplarily, probability values when various types of interaction behavior data correspond to different data values can be preset; and taking the statistical result of the probability values of various types of interactive behavior data as the abnormal behavior probability. The probability values corresponding to different data values may be determined by a skilled person according to needs or experience values, or determined repeatedly through a large number of experiments.

For example, if the answer data of the target question-answer data is answered in a single pass, the corresponding probability falls in the abnormal range, indicating that the answer data was likely copied and pasted by the target object; if it is answered in at least two passes, the corresponding probability falls in the normal range, and the more passes there are, the lower the probability value, indicating that the answer data was unlikely to have been copied and pasted. If the target question-answer data generation page is entered from the set entry page, the corresponding probability falls in the normal range, indicating that the target object likely entered the generation page normally; if the generation page is not entered from the set entry page, the corresponding probability falls in the abnormal range, indicating that the target object likely entered the generation page abnormally. If the input speed of the answer data is greater than a set speed threshold, the corresponding probability falls in the abnormal range, indicating that the answer data was likely copied and pasted; if the input speed is not greater than the set speed threshold, the corresponding probability falls in the normal range, indicating that the answer data was likely typed normally by the target object.

In a specific implementation, the corresponding probability may be set to 0.5 when the target question-answer data is answered in one pass, 0.2 when answered in two passes, and 0 when answered in at least three passes. If the target question-answer data generation page is entered from the set entry page, the corresponding probability is set to 0.05; otherwise it is set to 0.8. When the input speed of the target question-answer data is greater than 180 words/minute, the corresponding probability is set to 0.8; when it is greater than 150 and not greater than 180 words/minute, 0.6; when it is greater than 120 and not greater than 150 words/minute, 0.5; and when it is not greater than 120 words/minute, 0.05.
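The example probability values above translate directly into lookup rules. In the sketch below the per-signal values are taken from the text, while combining them with an arithmetic mean is an assumption: the disclosure only says a "statistical result" of the probability values is used. All function names are illustrative.

```python
def answer_count_probability(answer_count: int) -> float:
    """Probability keyed on how many times the answer was entered."""
    if answer_count < 1:
        raise ValueError("answer_count must be at least 1")
    if answer_count == 1:
        return 0.5
    if answer_count == 2:
        return 0.2
    return 0.0


def entry_page_probability(entered_from_set_entry_page: bool) -> float:
    """Probability keyed on whether the set entry page was used."""
    return 0.05 if entered_from_set_entry_page else 0.8


def input_speed_probability(words_per_minute: float) -> float:
    """Probability keyed on input speed, using the ranges from the text."""
    if words_per_minute > 180:
        return 0.8
    if words_per_minute > 150:
        return 0.6
    if words_per_minute > 120:
        return 0.5
    return 0.05


def abnormal_behavior_probability(answer_count: int,
                                  entered_from_set_entry_page: bool,
                                  words_per_minute: float) -> float:
    """Combine the three signals; the mean is an assumed statistic."""
    values = [
        answer_count_probability(answer_count),
        entry_page_probability(entered_from_set_entry_page),
        input_speed_probability(words_per_minute),
    ]
    return sum(values) / len(values)


suspicious = abnormal_behavior_probability(1, False, 200)
```

Other statistics, such as the maximum or a weighted sum, would fit the same structure by changing only the last function.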

In another optional embodiment, the pre-constructed behavior model may be trained by using the sample interaction behavior data of the sample object when the sample question-answer data is generated as a training sample, and using the abnormal labeling data of the sample object when the sample object generates the sample question-answer data as a label, so as to obtain a trained behavior model. The behavior model can be realized based on the existing machine learning model, and the model structure of the behavior model is not limited by the disclosure.

Correspondingly, the interactive behavior data at the time the target object generates the answer data of the target question-answer data is input into the trained behavior model to obtain the abnormal behavior probability.

S303, determining the abnormal probability of the environment according to the interactive environment information when the answer data of the target question answering data is generated.

The interactive environment is used to represent the surrounding environments with different dimensions when the target question and answer data is generated, and may include at least one of a device environment and a network environment, for example. The interactive environment information is used for representing the attribute value of the surrounding environment.

In an optional embodiment, if the interaction environment comprises a device environment, the environment anomaly probability comprises a device anomaly probability; accordingly, the device abnormality probability may be determined based on the input device information when the answer data of the target question-answer data is generated.

For example, the input device information may include at least one of usage system information when generating question and answer data, a device on/off state, whether it belongs to a simulator, a device usage posture, a device moving speed, and the like.

In a specific implementation manner, the device information of the sample object when generating the sample question-and-answer data may be used as training data in advance, and the abnormal labeling data of the device information may be used as a label to train a pre-constructed device safety discrimination model. Correspondingly, equipment information when the target object generates answer data of the target question-answer data is input into the trained equipment safety judgment model, and equipment abnormal probability is obtained. The equipment safety discrimination model can be realized based on the existing machine learning model, and the model structure of the equipment safety discrimination model is not limited by the disclosure.

In another optional embodiment, if the interactive environment comprises a network environment, the environment anomaly probability comprises a network anomaly probability; accordingly, the network anomaly probability can be determined according to the network environment information when the answer data of the target question-answer data is generated.

For example, the network information may include at least one of network type, network risk level, unsupervised cluster number of networks, frequency of occurrence of networks within a set time period, co-occurrence number of networks when question data is generated and networks when answer data is generated, and the like.

In a specific implementation manner, network information of a sample object when generating sample question-answer data may be used as training data in advance, and abnormal labeling data of the network information may be used as a label to train a pre-constructed network risk determination model. Correspondingly, the network information when the target object generates answer data of the target question-answer data is input into the trained network risk judgment model, and the network abnormal probability is obtained. The network risk judgment model can be realized based on the existing machine learning model, and the model structure of the network risk judgment model is not limited by the disclosure.

It can be understood that, in this technical solution, the environment anomaly probability is refined to include the device anomaly probability and/or the network anomaly probability. This enriches the diversity of the environment anomaly probability, completes the mechanisms for determining the different environment anomaly probabilities, lays a foundation for the diversity and comprehensiveness of the behavior characteristic data, and in turn provides data support for improving the accuracy of the target score.

Of course, the embodiment of the present disclosure may also add interaction environment information of other dimensions according to actual requirements, and the above contents only exemplarily indicate the selectable dimensions of the interaction environment information, and should not be construed as a limitation on the interaction environment.

S304, determining the interaction activity according to historical interaction behavior data of the historical period associated with the target question-answer data.

Here, the associated history period may be understood as a history period before the target question and answer data generation time, and may be, for example, an adjacent history period. The time length of the associated historical period is not limited, and can be set by a technician according to needs or empirical values, or determined repeatedly through a large number of tests.

The historical interaction behavior data characterizes how active the target object was during the associated historical period. It may include at least one of the following for that period: the number of times the target object accessed the homepage of the knowledge question-answering community, the access duration, the duration of accessing question data related to the target question-answer data, the duration of accessing answer data corresponding to the related question data, the cumulative number of questions asked by the target object, the dwell duration on the answer control, and the like. The answer control may be understood as the control that must be triggered to enter the answer editing page, or to submit edited answer data, when the target object generates the answer data of the target question-answer data.

In a specific implementation, historical interaction behavior data from the historical periods associated with sample question-answer data generated by sample objects may be used in advance as training data to train a pre-constructed activity prediction model. Correspondingly, the historical interaction behavior data in the historical period associated with the target question-answer data is input into the trained activity prediction model to obtain the interaction activity. The activity prediction model may be implemented based on an existing machine learning model, and the present disclosure does not limit its model structure.

S305, generating behavior characteristic data according to at least one of the behavior abnormal probability, the environment abnormal probability, the interaction activity and the basic attribute data of the target object.

Wherein the base attribute data of the target object may include account attributes of the target object. The account attribute may include at least one of an import account, an active account, a protected account, a history blocked account, and the like.

In an alternative embodiment, the basic attribute data may be encoded and converted into data in the [0,1] interval.

Illustratively, at least one of the behavior anomaly probability, the environment anomaly probability, the interaction activity, and the basic attribute data may be concatenated in a set order to obtain the behavior characteristic data. The set order may be fixed in advance by technicians.
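The encoding and concatenation described for S305 can be sketched as follows. This is only an illustrative sketch: the slot names in `FEATURE_ORDER`, the attribute vocabulary `ACCOUNT_ATTRIBUTES`, the multi-hot encoding, and the default of 0.0 for a missing signal are all assumptions introduced here, since the disclosure only requires values in the [0,1] interval and a preset order.

```python
from typing import Dict, List

# Hypothetical fixed ordering of feature slots; the disclosure only says the
# order is set in advance by technicians.
FEATURE_ORDER = [
    "behavior_anomaly_prob",
    "device_anomaly_prob",
    "network_anomaly_prob",
    "interaction_activity",
]

# Hypothetical vocabulary built from the account attribute examples above.
ACCOUNT_ATTRIBUTES = ["import", "active", "protected", "history_blocked"]


def encode_account_attributes(attrs: List[str]) -> List[float]:
    """Multi-hot encode account attributes so every value lies in [0, 1]."""
    present = set(attrs)
    return [1.0 if name in present else 0.0 for name in ACCOUNT_ATTRIBUTES]


def build_behavior_features(
    signals: Dict[str, float],
    account_attrs: List[str],
    missing_value: float = 0.0,  # assumed filler for signals that are not available
) -> List[float]:
    """Concatenate the available signals in FEATURE_ORDER, then the attribute code."""
    features = [float(signals.get(name, missing_value)) for name in FEATURE_ORDER]
    features.extend(encode_account_attributes(account_attrs))
    return features
```

Keeping the order in a single constant means the vector layout seen by a downstream scoring model stays the same between training and inference.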

And S306, determining a target score according to the text characteristic data and the behavior characteristic data.

And S307, performing abnormity identification on the target object according to the target score.

It should be noted that, in the technical solution of the present disclosure, the acquisition, storage, and application of the interactive behavior data, the interactive environment information, the historical interaction behavior data, and the basic attribute data of the target object all comply with relevant laws and regulations and do not violate public order and good customs.

In this embodiment, the generation of the behavior characteristic data is refined into: determining the behavior anomaly probability according to the interactive behavior data when the target object generates the answer data of the target question-answer data; determining the environment anomaly probability according to the interactive environment information when the answer data is generated; determining the interaction activity according to historical interaction behavior data in a historical period associated with the target question-answer data; and generating the behavior characteristic data according to at least one of the behavior anomaly probability, the environment anomaly probability, the interaction activity, and the basic attribute data of the target object. This completes the way in which the behavior characteristic data is generated. Moreover, because the behavior characteristic data is generated from different types of data, it is more diverse and comprehensive, which supports a more accurate target score and lays a foundation for a more accurate anomaly identification result for the target object.

On the basis of the above technical solutions, the present disclosure also provides an alternative embodiment. In the embodiment, the abnormal identification process of the target object is subjected to optimization improvement.

Referring to fig. 4, an abnormality identification method includes:

S401, extracting text characteristic data in the target question-answer data of the target object.

S402, determining behavior characteristic data according to behavior state data when the target object generates answer data of the target question-answer data.

And S403, determining a target score according to the text characteristic data and the behavior characteristic data.

And S404, updating the target score according to the historical target score of the historical question-answer data of the target object.

And S405, carrying out abnormal recognition on the target object according to the updated target score.

It should be noted that, whether the target object is abnormal when generating the answer data of the target question-answer data is related to the abnormal situation of the target object generating the historical question-answer data, that is, if the target object is abnormal when generating the historical question-answer data, the possibility of abnormality when generating the answer data of the target question-answer data is higher. In order to improve the accuracy of the anomaly detection result when the target object generates the answer data of the target question-answer data, in this embodiment, a historical target score of the historical question-answer data of the target object may be introduced, and the target score when the answer data of the target question-answer data is generated is optimally updated.

Wherein the historical question-answer data may be at least one question-answer data generated before answer data of the target question-answer data is generated. The number and specific generation time of the historical question and answer data are not limited in any way by the present disclosure.

In an alternative embodiment, updating the target score according to the historical target scores of the historical question-answer data of the target object may be: determining a weighted average of the historical target scores of the historical question-answer data of the target object and the target score of the target question-answer data; and taking the weighted average as the updated target score. The weights of the historical target scores and of the target score may be determined by a skilled person according to need or empirical values, or determined through repeated experiments.

In a specific implementation manner, historical target scores of a set number of historical question-answer data adjacent to the target question-answer data generation time can be selected; determining the weight of the historical target score of each historical question-answer data according to the generation time interval of each historical question-answer data and the target question-answer data; determining the weight of the target score according to the weight of each historical target score; and according to each weight value, carrying out weighted summation on the historical target score and the target score to obtain an updated target score. Wherein, the smaller the generation time interval with the target question-answering data, the larger the weight. Wherein the set number may be set by a skilled person according to need or empirical values, or determined iteratively through a number of experiments. Wherein the sum of the weights of the historical target scores and the target scores is 1.
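The recency-weighted update described above might be sketched as follows. The inverse-interval weighting `1 / (1 + Δt)`, the `history_share` split between history and the current score, and the default `max_history` are assumptions chosen for illustration; the disclosure only requires that a smaller time interval gives a larger weight and that all weights sum to 1.

```python
from typing import Sequence, Tuple


def recency_weighted_score(
    target_score: float,
    target_time: float,
    history: Sequence[Tuple[float, float]],  # (generation_time, historical_score)
    max_history: int = 3,        # the "set number"; the value is an assumption
    history_share: float = 0.5,  # total weight given to history; an assumption
) -> float:
    """Blend the current target score with its most recent historical scores."""
    earlier = [item for item in history if item[0] < target_time]
    recent = sorted(earlier, key=lambda item: item[0])[-max_history:]
    if not recent:
        return target_score  # nothing to blend with
    # Inverse-interval raw weights: smaller interval -> larger weight.
    raw = [1.0 / (1.0 + (target_time - t)) for t, _ in recent]
    total = sum(raw)
    hist_weights = [history_share * r / total for r in raw]
    # The target weight is derived from the historical weights so the sum is 1.
    target_weight = 1.0 - sum(hist_weights)
    blended = sum(w * s for w, (_, s) in zip(hist_weights, recent))
    return target_weight * target_score + blended
```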

Because the question-answer data generation behavior of the target object in the knowledge question-answer community is not invariable, the question-answer data generation behavior of the target object meets certain attenuation characteristics, namely, the question-answer data generation frequency of the target object in the knowledge question-answer community is gradually reduced until the question-answer data generation frequency is stable. In view of this, in another optional embodiment of the present disclosure, the target score may also be updated based on: determining a behavior attenuation factor of the target object according to the historical target score of the historical question-answer data of the target object; the target score is updated according to the behavior decay factor.

It can be understood that the target score is updated by introducing the behavior attenuation factor, so that the updated target score is more consistent with the behavior activity rule of the target object, the accuracy of the finally determined target score is improved, and a foundation is laid for improving the accuracy of the abnormal recognition result.

For example, curve fitting may be performed on the historical target scores of the historical question-answer data at different times, and each coefficient in the curve fitting result is used as a behavior attenuation factor; correspondingly, determining a reference score of the target question-answering data at the generation moment according to the fitted curve; and taking the weighted average of the reference score and the target score as the updated target score. The weights of the reference score and the target score may be determined by a skilled person according to need or empirical values, or determined repeatedly by a number of experiments.
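One way to picture the curve-fitting variant is sketched below. The disclosure does not fix the curve family; this sketch assumes an exponential curve `score = A * exp(-a * t)` fitted by ordinary least squares on the logarithm of the scores, which requires strictly positive scores. The function names and the default `ref_weight` are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_exponential(times: Sequence[float], scores: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of ln(score) = ln(A) - a * t; returns (A, a)."""
    if len(times) != len(scores) or len(times) < 2:
        raise ValueError("need at least two (time, score) pairs")
    logs = [math.log(s) for s in scores]  # requires strictly positive scores
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("times must not all be equal")
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs)) / var_t
    intercept = mean_y - slope * mean_t
    return math.exp(intercept), -slope


def update_with_reference(
    target_score: float,
    t: float,
    times: Sequence[float],
    scores: Sequence[float],
    ref_weight: float = 0.5,  # weight of the reference score; an assumption
) -> float:
    """Weighted average of the fitted reference score at time t and the target score."""
    amplitude, a = fit_exponential(times, scores)
    reference = amplitude * math.exp(-a * t)
    return ref_weight * reference + (1.0 - ref_weight) * target_score
```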

For example, the behavior attenuation factor of the target object may be determined using the following formula:

$$\mathrm{score}_n = \mathrm{score}_{n-1}\cdot\exp\bigl(-a\,(t_n - t_{n-1})\bigr);$$

wherein $a$ is the behavior attenuation factor, and $\mathrm{score}_n$ and $\mathrm{score}_{n-1}$ are the historical target scores corresponding to the historical question-answer data at the historical times $t_n$ and $t_{n-1}$, respectively, which are sequentially adjacent to the generation time of the answer data of the target question-answer data;

correspondingly, the target score is updated by adopting the following formula:

$$\mathrm{last\_uscore} = \mathrm{uscore}\cdot\exp\bigl(-a\,(t - t_n)\bigr);$$

wherein $\mathrm{uscore}$ is the target score, $\mathrm{last\_uscore}$ is the updated target score, and $t$ is the generation time of the answer data of the target question-answer data.
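The two formulas above can be worked through in a short sketch. Solving the first formula for the attenuation factor gives a = -ln(score_n / score_n-1) / (t_n - t_n-1), which is then substituted into the second. The function and parameter names are hypothetical, and the input checks (positive scores, strictly increasing times) are assumptions needed for the logarithm and the division to be defined.

```python
import math


def behavior_decay_factor(
    score_prev: float, t_prev: float, score_last: float, t_last: float
) -> float:
    """Solve score_last = score_prev * exp(-a * (t_last - t_prev)) for a."""
    if score_prev <= 0 or score_last <= 0:
        raise ValueError("scores must be positive to take a logarithm")
    if t_last <= t_prev:
        raise ValueError("t_last must be later than t_prev")
    return -math.log(score_last / score_prev) / (t_last - t_prev)


def decay_updated_score(uscore: float, t: float, a: float, t_last: float) -> float:
    """Apply last_uscore = uscore * exp(-a * (t - t_last))."""
    return uscore * math.exp(-a * (t - t_last))
```

For instance, if the historical score halves from 0.8 to 0.4 over two time units, then a = ln(2) / 2, and a target score of 0.6 generated two time units after the last historical time is updated to 0.3.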

It should be noted that, when the target object has generated no historical question-answer data before generating the answer data of the target question-answer data, or has generated too little of it, the behavior attenuation factor cannot be determined. In this case the target score is not updated; the target score determined from the text characteristic data and the behavior characteristic data is used directly to perform anomaly identification on the target object.

In the technical solution of the present disclosure, the acquisition, storage, and application of the historical target scores of the historical question-answer data of the target object all comply with relevant laws and regulations and do not violate public order and good customs.

In the embodiment of the disclosure, in the process of identifying the abnormality of the target object, the historical target score of the historical question-answer data of the target object is introduced, and the target score is updated, so that the target score determining process can refer to the historical behavior condition of the target object instead of only according to the single generation moment of the target question-answer data, thereby improving the richness and comprehensiveness of the reference data in the target score determining process, further improving the accuracy of the target score determining result, and laying a foundation for improving the accuracy of the abnormality identifying result of the target object at the target question-answer data generation moment.

On the basis of the above technical solutions, the present disclosure also provides a preferred embodiment of an abnormality identification method.

Referring to fig. 5, a flow chart of an anomaly identification method is shown, which is used for performing anomaly account identification on a target account when answer data of target question and answer data is generated for the target account. The abnormal account identification process is realized on the basis of a text evaluation model, a behavior evaluation model, a fusion evaluation model, a target score updating module and an abnormal identification module.

1) Text evaluation model

And inputting the target question-answer data of the target account into the text evaluation model to obtain text characteristic data.

Different data anomaly identification submodels are arranged in the text evaluation model aiming at different preset dimensions and used for determining the data anomaly probability of the corresponding dimensions. Wherein the preset dimensions comprise at least one of forbidden pictures, forbidden texts, junk pictures, junk texts, administrative pictures, administrative texts, abuse pictures, abuse texts, advertisement pictures, advertisement texts and the like. Each preset dimension corresponds to an independently trained submodel, each submodel is realized based on the existing machine learning model, and the model structure of each machine learning model is not limited in any way in the disclosure.

The text evaluation model also comprises a statistic strategy module which is used for counting the length proportion, the number proportion and the punctuation number proportion of the question words in the answer words in the target question-answer data and carrying out normalization processing on the proportion data.

The text evaluation model also comprises a fusion module which is used for splicing and fusing the abnormal probability of the data output by each sub-model and the normalized proportion data output by the statistical strategy module according to a set sequence to obtain text characteristic data.

The target question-answer data of the target account refers to question-answer data corresponding to answer data generated by the target account.

2) Behavior evaluation model

And inputting the behavior state data of the target account when the answer data is generated into the behavior evaluation model to obtain behavior characteristic data.

The behavior evaluation model is provided with different function submodels for generating characteristic data of different dimensions in the behavior characteristic data.

For example, a behavior model may be included in the behavior evaluation model to determine a probability of behavior anomaly of the target account. Correspondingly, the interactive behavior data in the behavior state data are input into the trained behavior model to obtain the abnormal behavior probability. The behavioral model can be realized based on the existing machine learning model, and the model structure of each machine learning model is not limited in any way by the present disclosure. The interactive behavior data may include whether answer data in the target question-answer data is answered at one time, whether a target question-answer data generation page enters from a set entry page, an input speed of the answer data of the target question-answer data, and the like.

For example, the behavior evaluation model may include a device security discrimination model for determining a device anomaly probability when the target account generates answer data. Correspondingly, inputting equipment information when the target account generates answer data into the trained equipment safety discrimination model to obtain equipment abnormal probability. The equipment safety judgment model can be realized based on the existing machine learning model, and the model structure of each machine learning model is not limited by the method. The input device information may include, among other things, usage system information when question and answer data is generated, a device power on/off state, whether it belongs to a simulator, a device usage posture, and a device movement speed.

For example, a network risk judgment model may be included in the behavior evaluation model, and is used to determine a network anomaly probability when the target account generates answer data. Correspondingly, inputting the network information generated when the target account generates answer data into the trained network risk judgment model to obtain the network abnormal probability. The network risk judgment model can be realized based on the existing machine learning model, and the model structure of each machine learning model is not limited in any way. The network information may include a network type, a network risk level, an unsupervised network cluster number, a network occurrence frequency in a set time period, a co-occurrence frequency of a network when question data is generated and a network when answer data is generated, and the like.

For example, an activity prediction model may be included in the behavior evaluation model for predicting the activity of the target account in an adjacent historical period before generating the answer data. Correspondingly, historical interaction behavior data in an adjacent historical time period before answer data is generated by the target account are input into the trained activity prediction model, and interaction activity is obtained. The liveness prediction model can be realized based on the existing machine learning model, and the model structure of each machine learning model is not limited by the disclosure. The historical interactive behavior data may include at least one of the number of times of accessing the homepage of the knowledge question-answering community by the target account, access duration, duration of accessing question data related to the target question-answering data, duration of accessing answer data corresponding to the related question data, cumulative number of questions asked by the target account, and duration of staying of the answer control in the adjacent historical period. The answer control may be understood as a control that needs to be triggered to enter an answer editing page or submit edited answer data when the answer data of the target question and answer data is generated by the target object.

Illustratively, the behavior evaluation model may include a post-policy module for encoding the account attribute of the target account. The account attributes comprise an import account, an active account, a protection account, a history banned account and the like.

Illustratively, the behavior evaluation model includes a fusion module, which is used for splicing and fusing at least one of the behavior anomaly probability, the equipment anomaly probability, the network anomaly probability, the interaction activity and the account attribute coding data according to a set sequence to obtain behavior characteristic data.

3) Fusion evaluation model

The text characteristic data and the behavior characteristic data are input into the fusion evaluation model to obtain the target score when the target account generates the answer data. The fusion evaluation model is implemented based on an existing machine learning model, and the present disclosure does not limit its model structure.

4) Target score updating module

The target score updating module is used for updating the target score generated by the preceding fusion evaluation model.

Illustratively, it is identified whether the score list of the target account contains at least two historical target scores of historical question-answer data generated before the target question-answer data; if so, the behavior attenuation factor is determined from the two historical target scores whose generation times are adjacent, using the following formula, and the target score is updated according to the behavior attenuation factor; otherwise, the target score is left unprocessed.

$$\mathrm{score}_n = \mathrm{score}_{n-1}\cdot\exp\bigl(-a\,(t_n - t_{n-1})\bigr);$$

correspondingly, the target score is updated by adopting the following formula:

$$\mathrm{last\_uscore} = \mathrm{uscore}\cdot\exp\bigl(-a\,(t - t_n)\bigr);$$

For example, the target score may also be added to the score list of the target account for subsequent use.

5) Anomaly identification module

The anomaly identification module is used for carrying out account anomaly identification when the target account generates answer data in the target question-answer data.

Illustratively, the target score output by the target score updating module is compared with a preset score threshold; if the target score output by the target score updating module is smaller than a preset score threshold, determining that the target account is abnormal at the moment of answer data generation of the target question-answer data; otherwise, the target account is determined to be normal at the moment of generating answer data of the target question-answering data.
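The control flow of modules 4) and 5) can be sketched together as follows. The in-memory `score_list` of `(time, score)` pairs, the default `threshold`, and the choice to store the score before the decay update are assumptions made for illustration; the disclosure does not say which of the two scores is written back to the list.

```python
import math
from typing import List, Tuple

ScoreEntry = Tuple[float, float]  # (generation_time, score)


def update_and_identify(
    score_list: List[ScoreEntry],
    uscore: float,
    t: float,
    threshold: float = 0.5,  # preset score threshold; the value is an assumption
) -> Tuple[float, bool]:
    """Return (final_score, is_abnormal) and append the new score to score_list."""
    final_score = uscore
    if len(score_list) >= 2:
        (t_prev, s_prev), (t_last, s_last) = score_list[-2], score_list[-1]
        # Only apply the decay update when the factor is well defined.
        if s_prev > 0 and s_last > 0 and t_last > t_prev:
            a = -math.log(s_last / s_prev) / (t_last - t_prev)
            final_score = uscore * math.exp(-a * (t - t_last))
    # Assumption: the score before the decay update is stored for later use.
    score_list.append((t, uscore))
    # Per the text above, a score below the threshold marks the account abnormal.
    return final_score, final_score < threshold
```

With fewer than two entries in the list the score passes through unchanged, which mirrors the "otherwise, the target score is not processed" branch.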

As an implementation of the above-described anomaly identification methods, the present disclosure also provides an optional embodiment of a virtual apparatus that implements the anomaly identification method.

Referring to fig. 6, an abnormality recognition apparatus 600 includes: a text feature data extraction module 601, a behavior feature data determination module 602, a target score determination module 603, and an anomaly identification module 604, wherein:

a text feature data extraction module 601, configured to extract text feature data in target question-answer data of a target object;

a behavior feature data determining module 602, configured to determine behavior feature data according to behavior state data when the target object generates answer data of the target question-answering data;

a target score determining module 603, configured to determine a target score according to the text feature data and the behavior feature data;

and an anomaly identification module 604, configured to perform anomaly identification on the target object according to the target score.
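The cooperation of modules 601 to 604 can be pictured with a small composition sketch. The class name, field names, and callable signatures are hypothetical; each callable stands in for whatever model or rule actually implements the corresponding module.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class AbnormalityRecognitionApparatus:
    extract_text_features: Callable[[dict], Sequence[float]]        # module 601
    determine_behavior_features: Callable[[dict], Sequence[float]]  # module 602
    determine_target_score: Callable[[Sequence[float], Sequence[float]], float]  # module 603
    identify_anomaly: Callable[[float], bool]                       # module 604

    def recognize(self, qa_data: dict, behavior_state: dict) -> bool:
        """Run the four modules in order and return True if the object is abnormal."""
        text = self.extract_text_features(qa_data)
        behavior = self.determine_behavior_features(behavior_state)
        score = self.determine_target_score(text, behavior)
        return self.identify_anomaly(score)
```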

In an optional embodiment, the text feature data extraction module 601 includes:

the data anomaly probability obtaining unit is used for carrying out anomaly data identification on the target question answering data under a preset dimensionality to obtain data anomaly probability;

a character statistical data obtaining unit, configured to perform character statistics on the target question-answering data to obtain character statistical data;

and the text characteristic data generating unit is used for generating the text characteristic data according to at least one of the data abnormal probability and the character statistical data.

In an optional embodiment, the behavior feature data determining module 602 includes:

the behavior anomaly probability determining unit is used for determining the behavior anomaly probability according to the interactive behavior data when the target object generates answer data of the target question-answer data;

the environment abnormal probability determining unit is used for determining the environment abnormal probability according to the interactive environment information when the answer data of the target question answering data is generated;

the interactive activity determining unit is used for determining the interactive activity according to historical interactive behavior data of the target question answering data associated historical time periods;

a behavior feature data generating unit, configured to generate the behavior feature data according to at least one of the behavior anomaly probability, the environment anomaly probability, the interaction activity, and basic attribute data of the target object.

In an optional embodiment, the environmental anomaly probability comprises a device anomaly probability; the environment anomaly probability determination unit comprises:

the equipment anomaly probability determining subunit is used for determining the equipment anomaly probability according to input equipment information when answer data of the target question-answer data are generated; and/or,

the environmental anomaly probability comprises a network anomaly probability; the environment anomaly probability determination unit comprises:

and the network anomaly probability determining subunit is used for determining the network anomaly probability according to the network environment information when the answer data of the target question-answer data is generated.

In an optional embodiment, the anomaly identification module 604 includes:

the target score updating unit is used for updating the target score according to the historical target score of the historical question-answer data of the target object;

and the abnormality identification unit is used for carrying out abnormality identification on the target object according to the updated target score.

In an optional embodiment, the target score updating unit includes:

the behavior attenuation factor determining subunit is used for determining the behavior attenuation factor of the target object according to the historical target score of the historical question-answer data of the target object;

and the target score updating subunit is used for updating the target score according to the behavior attenuation factor.

In an alternative embodiment, the behavior attenuation factor determination subunit includes:

a behavior attenuation factor determination slave unit for determining a behavior attenuation factor of the target object using the following formula:

$$\mathrm{score}_n = \mathrm{score}_{n-1}\cdot\exp\bigl(-a\,(t_n - t_{n-1})\bigr);$$

wherein $a$ is the behavior attenuation factor, and $\mathrm{score}_n$ and $\mathrm{score}_{n-1}$ are the historical target scores corresponding to the historical question-answer data at the historical times $t_n$ and $t_{n-1}$, respectively, which are sequentially adjacent to the generation time of the answer data of the target question-answer data;

the target score updating subunit comprises:

a target score updating slave unit, configured to update the target score by using the following formula:

$$\mathrm{last\_uscore} = \mathrm{uscore}\cdot\exp\bigl(-a\,(t - t_n)\bigr);$$

The abnormality recognition device can execute the abnormality recognition method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of executing the abnormality recognition method.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 701 executes the respective methods and processes described above, such as the abnormality recognition method. For example, in some embodiments, the anomaly identification method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communications unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the anomaly identification method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the anomaly identification method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.

Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.

Cloud computing refers to a technology system that accesses an elastically scalable pool of shared physical or virtual resources over a network, where the resources may include servers, operating systems, networks, software, applications, storage devices, and the like, and may be deployed and managed on demand in a self-service manner. Cloud computing technology can provide efficient and powerful data processing capability for technical applications and model training in fields such as artificial intelligence and blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in this disclosure may be performed in parallel or sequentially or in a different order, as long as the desired results of the technical solutions provided by this disclosure can be achieved, and are not limited herein.
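By way of a non-limiting illustration of this flexibility, the following minimal Python sketch obtains text feature data and behavior feature data either concurrently or sequentially and yields the same result in both cases. The function names, the `typing_seconds` field, and the placeholder feature logic are hypothetical and are not taken from the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_text_features(answer_text):
    # Hypothetical placeholder: a single feature, the answer length.
    return [float(len(answer_text))]


def determine_behavior_features(behavior_state):
    # Hypothetical placeholder: a single feature, seconds spent typing.
    return [float(behavior_state["typing_seconds"])]


def collect_features(answer_text, behavior_state, parallel=True):
    """Return (text_features, behavior_features).

    The two steps are independent, so they may run in parallel or one
    after the other without changing the result.
    """
    if parallel:
        with ThreadPoolExecutor(max_workers=2) as pool:
            text_future = pool.submit(extract_text_features, answer_text)
            behavior_future = pool.submit(
                determine_behavior_features, behavior_state
            )
            return text_future.result(), behavior_future.result()
    return (
        extract_text_features(answer_text),
        determine_behavior_features(behavior_state),
    )
```

Because neither step depends on the other's output, the choice between the two orderings can be made on deployment grounds alone.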

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. An anomaly identification method comprising:

2. The method of claim 1, wherein the extracting text feature data in the target question-answer data of the target object comprises:

performing, under a preset dimension, abnormal data identification on the target question-answer data to obtain a data anomaly probability;

performing character statistics on the target question-answer data to obtain character statistical data;

and generating the text feature data according to at least one of the data anomaly probability and the character statistical data.
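By way of a non-limiting illustration, the generation of text feature data from character statistics and an optional anomaly probability might be sketched in Python as follows. The particular statistics (length, digit ratio, URL count), the function names, and the assumption that the anomaly probability is supplied by an upstream classifier are hypothetical and are not specified by the claim.

```python
import re


def character_statistics(text):
    """Count a few simple character classes in the answer text."""
    total = len(text)
    digits = sum(ch.isdigit() for ch in text)
    urls = len(re.findall(r"https?://\S+", text))
    return {
        "length": total,
        "digit_ratio": digits / total if total else 0.0,
        "url_count": urls,
    }


def build_text_features(text, anomaly_probability=None):
    """Assemble text feature data from the available inputs.

    Character statistics are always included in this sketch; the anomaly
    probability is added only when the caller provides one.
    """
    features = dict(character_statistics(text))
    if anomaly_probability is not None:
        # Assumed to come from a classifier for one preset dimension.
        features["anomaly_probability"] = anomaly_probability
    return features
```

For example, `build_text_features("call 12345 http://x.y", 0.9)` would yield a mapping containing the length, the digit ratio, one detected URL, and the supplied probability.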

3. The method according to claim 1, wherein the determining the behavior feature data according to the behavior state data when the target object generates the answer data of the target question-answer data comprises:

determining a behavior anomaly probability according to interactive behavior data when the target object generates the answer data of the target question-answer data;

determining an environment anomaly probability according to interactive environment information when the answer data of the target question-answer data is generated;

determining an interaction activity level according to historical interactive behavior data in a historical time period associated with the target question-answer data;

and generating the behavior feature data according to at least one of the behavior anomaly probability, the environment anomaly probability, the interaction activity level, and basic attribute data of the target object.
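By way of a non-limiting illustration, the assembly of behavior feature data from whichever signals are available might be sketched in Python as follows. The field names, the use of account age as a basic attribute, and the definition of interaction activity as events per day are hypothetical and are not specified by the claim.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BehaviorSignals:
    behavior_anomaly_probability: Optional[float] = None
    environment_anomaly_probability: Optional[float] = None
    interaction_activity: Optional[float] = None
    account_age_days: Optional[float] = None  # hypothetical basic attribute


def compute_interaction_activity(event_timestamps, window_start, window_end):
    """Events per day inside the historical window (hypothetical definition).

    Timestamps and window bounds are in seconds.
    """
    if window_end <= window_start:
        raise ValueError("window_end must be later than window_start")
    in_window = [t for t in event_timestamps if window_start <= t < window_end]
    days = (window_end - window_start) / 86400.0
    return len(in_window) / days


def build_behavior_features(signals):
    """Keep only the signals that are present, as a name -> value mapping."""
    return {
        name: value for name, value in vars(signals).items() if value is not None
    }
```

Leaving absent signals out of the mapping reflects the "at least one of" wording: a downstream scorer receives only the features that were actually determined.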

4. The method of claim 3, wherein the environment anomaly probability comprises a device anomaly probability, and the determining the environment anomaly probability according to the interactive environment information when the answer data of the target question-answer data is generated comprises:

determining the device anomaly probability according to input device information when the answer data of the target question-answer data is generated; and/or,

the environment anomaly probability comprises a network anomaly probability, and the determining the environment anomaly probability according to the interactive environment information when the answer data of the target question-answer data is generated comprises:

determining the network anomaly probability according to network environment information when the answer data of the target question-answer data is generated.

5. The method of any of claims 1-4, wherein the performing anomaly recognition on the target object according to the target score comprises:

updating the target score according to the historical target score of the historical question-answer data of the target object;

and performing anomaly recognition on the target object according to the updated target score.

6. The method of claim 5, wherein the updating the target score according to the historical target score of the historical question-answer data of the target object comprises:

determining a behavior attenuation factor of the target object according to the historical target score of the historical question-answer data of the target object;

and updating the target score according to the behavior attenuation factor.

7. The method of claim 6, wherein the determining a behavior attenuation factor of the target object according to the historical target score of the historical question-answer data of the target object comprises:

determining a behavior attenuation factor of the target object by adopting the following formula:

$\mathrm{score}_n = \mathrm{score}_{n-1} \cdot \exp\bigl(-a\,(t_n - t_{n-1})\bigr)$;

wherein $a$ is the behavior attenuation factor, and $\mathrm{score}_n$ and $\mathrm{score}_{n-1}$ are the historical target scores of the historical question-answer data corresponding to historical times $t_n$ and $t_{n-1}$, respectively, which are sequentially adjacent to the generation time of the answer data of the target question-answer data;

the updating the target score according to the behavior attenuation factor includes:

updating the target score using the following formula:

$\mathrm{last\_uscore} = \mathrm{uscore} \cdot \exp\bigl(-a\,(t - t_n)\bigr)$;
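By way of a non-limiting illustration, the two formulas of this claim might be applied in Python as follows. The sketch assumes that the attenuation factor a is obtained by solving the first formula for a given two adjacent historical scores, that uscore denotes the target score before updating, and that t denotes the generation time of the current answer data; these readings, and the function names, are assumptions rather than statements of the claimed method.

```python
import math


def behavior_attenuation_factor(score_n, score_prev, t_n, t_prev):
    """Solve score_n = score_prev * exp(-a * (t_n - t_prev)) for a."""
    if score_n <= 0 or score_prev <= 0:
        raise ValueError("scores must be positive to take a logarithm")
    if t_n <= t_prev:
        raise ValueError("t_n must be later than t_prev")
    return -math.log(score_n / score_prev) / (t_n - t_prev)


def update_target_score(uscore, a, t, t_n):
    """Apply last_uscore = uscore * exp(-a * (t - t_n))."""
    return uscore * math.exp(-a * (t - t_n))
```

For example, if the historical score halves from 1.0 to 0.5 over 10 time units, a equals ln(2)/10, and a target score of 0.8 evaluated a further 10 time units after t_n is updated to 0.4.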

8. An abnormality recognition apparatus comprising:

9. The apparatus of claim 8, wherein the text feature data extraction module comprises:

10. The apparatus of claim 8, wherein the behavior feature data determination module comprises:

an interaction activity determination unit, configured to determine an interaction activity level according to historical interactive behavior data in a historical time period associated with the target question-answer data;

11. The apparatus of claim 10, wherein the environment anomaly probability comprises a device anomaly probability; the environment anomaly probability determination unit comprises:

12. The apparatus of any of claims 8-11, wherein the anomaly identification module comprises:

13. The apparatus of claim 12, wherein the target score updating unit comprises:

14. The apparatus of claim 13, wherein the behavior decay factor determination subunit comprises:

$\mathrm{score}_n = \mathrm{score}_{n-1} \cdot \exp\bigl(-a\,(t_n - t_{n-1})\bigr)$;

the target score updating subunit comprises:

$\mathrm{last\_uscore} = \mathrm{uscore} \cdot \exp\bigl(-a\,(t - t_n)\bigr)$;

15. An electronic device, comprising:

at least one processor; and

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform an anomaly identification method as claimed in any one of claims 1 to 7.

16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to execute an anomaly identification method according to any one of claims 1-7.

17. A computer program product comprising a computer program which, when executed by a processor, implements an anomaly identification method according to any one of claims 1-7.