CN111881293B - Risk content identification method and device, server and storage medium - Google Patents
- Publication number
- CN111881293B (application number CN202010721587.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
Embodiments of the application disclose a risk content identification method and device, a server, and a storage medium. The method comprises: intercepting a plurality of words corresponding to the content to be identified according to a preset window length to obtain a plurality of word combinations; calculating a confusion value (perplexity) of the content to be identified according to the occurrence probability of each word combination in its corresponding preset phrase set; if the confusion value is greater than a preset confusion threshold, obtaining user behavior data of at least one behavior type of the user corresponding to the content to be identified; obtaining at least one behavior characteristic value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type; inputting the at least one behavior characteristic value, the confusion value, and a sentence length value of the content to be identified into a first risk content identification model to obtain a first classification probability; and if the first classification probability is greater than a first preset probability threshold, determining that the content to be identified is risk content. The method and device can improve the accuracy of risk content identification.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a risk content identification method and apparatus, a server, and a storage medium.
Background
At present, risk content is mainly identified either by traditional text classification or by matching specific keywords. Traditional text classification mainly uses algorithms such as fastText, TextCNN, and LSTM: a text classification model is trained on samples in which risk content is labeled as negative and normal content as positive, and the model is then used to identify risk content. However, this approach can only catch content that resembles the existing data and cannot catch previously unseen content. Keyword matching, in turn, can be evaded by users who send risk content using altered wording such as homophones. Both schemes therefore identify risk content with low accuracy.
Summary
The embodiment of the application provides a risk content identification method and device, a server and a storage medium, so as to improve the identification accuracy of risk content.
In one aspect, an embodiment of the present application provides a risk content identification method, including:
splitting the content to be identified into a plurality of words, and intercepting the words according to a preset window length to obtain a plurality of word combinations;
according to the occurrence probability of each word combination in the word combinations in the corresponding preset phrase sets, calculating to obtain the sentence probability of the content to be identified, and obtaining the confusion value of the content to be identified based on the sentence probability, wherein the preset phrase sets are determined based on a corpus;
if the confusion degree value is larger than a preset confusion degree threshold value, user behavior data of at least one behavior type of the user corresponding to the content to be identified are obtained;
obtaining at least one behavior characteristic value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type;
inputting the at least one behavior characteristic value, the confusion value and the sentence length value of the content to be identified into a first risk content identification model to obtain a first classification probability, wherein the first risk content identification model is trained on a first sample set, using the confusion value and the sentence length value of each sample content in the first sample set, the at least one behavior characteristic value of the user corresponding to each sample content, and the content label of each sample content;
and if the first classification probability is greater than a first preset probability threshold, outputting the content to be identified as risk content.
Optionally, the method further comprises:
if the confusion degree value is smaller than or equal to the preset confusion degree threshold value, determining a text content vector of the content to be identified according to whether the content to be identified contains preset type words, wherein the preset type words comprise digital type words, unit type words and behavior purpose type words;
inputting the text content vector into a second risk content recognition model, so that the second risk content recognition model calculates a second predicted value of the content to be recognized according to the text content vector and the weight vector, and outputs a second classification probability according to the second predicted value; the second risk content identification model is obtained by training based on a second sample set and content labels corresponding to text content vectors of each sample content in the second sample set, and comprises weight vectors;
and if the second classification probability is larger than a second preset probability threshold, outputting the content to be identified as risk content.
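The two branches described above (a first model when the confusion value exceeds the threshold, a second model otherwise) can be sketched as a small routing function. This is a minimal illustration only: the function name, the callables standing in for the two models, and the default thresholds of 0.5 are assumptions, not values given in the application.

```python
def identify(confusion_value, confusion_threshold,
             first_model_probability, second_model_probability,
             first_probability_threshold=0.5,
             second_probability_threshold=0.5):
    """Route the content to one of two models based on its confusion value.

    first_model_probability / second_model_probability are hypothetical
    zero-argument callables that return a classification probability.
    Returns (branch_name, is_risk_content).
    """
    if confusion_value > confusion_threshold:
        # High confusion value: use behavior features with the first model.
        probability = first_model_probability()
        return "first", probability > first_probability_threshold
    # Confusion value at or below the threshold: use the text content vector
    # with the second model.
    probability = second_model_probability()
    return "second", probability > second_probability_threshold


high = identify(12.0, 10.0, lambda: 0.9, lambda: 0.1)
equal = identify(10.0, 10.0, lambda: 0.9, lambda: 0.1)
```

Note that a confusion value exactly equal to the threshold takes the second branch, matching the "smaller than or equal to" wording.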
Optionally, the method further comprises:
determining a text content vector of each sample content according to whether each sample content in the second sample set contains the preset type words;
training an initial logistic regression model according to the text content vector of each sample content and the content label of each sample content to obtain a first logistic regression model and a predicted content label of each sample content;
adjusting the first logistic regression model according to the content label of each piece of sample content and the predicted content label of each piece of sample content;
and when the adjusted first logistic regression model meets the convergence condition, determining the adjusted first logistic regression model as the second risk content identification model.
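The training steps above can be sketched as follows, assuming a three-element binary text content vector (one flag per preset word type) and plain gradient descent as the adjustment rule. The word lists, the learning rate, the gradient-based convergence test, and all function names are illustrative assumptions; the application does not specify them in this section.

```python
import math

# Hypothetical word lists for two of the preset word types (assumptions).
UNIT_WORDS = {"yuan", "usd", "coins"}
PURPOSE_WORDS = {"buy", "sell", "recharge"}


def text_content_vector(words):
    """One flag per preset type: digital (numeric), unit, behavior purpose."""
    has_number = any(any(ch.isdigit() for ch in w) for w in words)
    has_unit = any(w in UNIT_WORDS for w in words)
    has_purpose = any(w in PURPOSE_WORDS for w in words)
    return [float(has_number), float(has_unit), float(has_purpose)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(vector, weights, bias):
    """Predicted value = weights . vector + bias, converted to a probability."""
    return sigmoid(sum(w * x for w, x in zip(weights, vector)) + bias)


def train_logistic_regression(vectors, labels, rate=0.5,
                              max_epochs=5000, tol=1e-6):
    """Adjust weights from the gap between predicted and true labels."""
    weights, bias, n = [0.0] * len(vectors[0]), 0.0, len(vectors)
    for _ in range(max_epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(vectors, labels):
            error = predict_probability(x, weights, bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi / n
            grad_b += error / n
        weights = [w - rate * g for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b
        # Assumed convergence condition: all gradient components are tiny.
        if max(abs(g) for g in grad_w + [grad_b]) < tol:
            break
    return weights, bias


samples = [("sell 100 coins", 1), ("buy 50 yuan", 1),
           ("nice game today", 0), ("see you at 8", 0)]
vectors = [text_content_vector(text.split()) for text, _ in samples]
labels = [label for _, label in samples]
weights, bias = train_logistic_regression(vectors, labels)
```

In this toy data, a number alone ("see you at 8") is labeled normal, so the trained weights must rely on the unit and purpose flags to separate the classes.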
Optionally, the calculating to obtain the sentence probability of the content to be identified according to the occurrence probability of each word combination in the plurality of word combinations in the respective corresponding preset phrase set, and obtaining the confusion value of the content to be identified based on the sentence probability includes:
multiplying the occurrence probability of each word combination in the word combinations in the corresponding preset phrase set to obtain the sentence probability of the content to be identified, and calculating the reciprocal of the sentence probability to obtain the confusion value of the content to be identified.
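As a minimal sketch of this calculation, the function below multiplies the occurrence probabilities of the word combinations and takes the reciprocal of the product. For simplicity it assumes a single dictionary `phrase_probs` in place of the per-combination preset phrase sets, and a hypothetical `floor` probability for combinations missing from the dictionary; neither detail comes from the application. The product is accumulated in log space only to avoid floating-point underflow.

```python
import math


def confusion_value(combinations, phrase_probs, floor=1e-6):
    """Reciprocal of the sentence probability (product of combination probabilities)."""
    log_sentence_prob = 0.0
    for combo in combinations:
        # Unseen or zero-probability combinations fall back to the floor (assumption).
        p = max(phrase_probs.get(combo, floor), floor)
        log_sentence_prob += math.log(p)
    try:
        # 1 / exp(log_p) == exp(-log_p)
        return math.exp(-log_sentence_prob)
    except OverflowError:
        # Extremely improbable sentences exceed the float range.
        return float("inf")


phrase_probs = {("add", "me"): 0.5, ("me", "for"): 0.25}
value = confusion_value([("add", "me"), ("me", "for")], phrase_probs)
unseen = confusion_value([("xq", "zz")], phrase_probs)
```

Here the sentence probability is 0.5 × 0.25 = 0.125, so the confusion value is 8; a combination absent from the dictionary yields a much larger value.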
Optionally, the user behavior data includes a behavior type and a behavior time;
the obtaining at least one behavior characteristic value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type includes:
for any behavior type, dividing the user behavior data of that behavior type into a plurality of groups of user behavior data, and calculating the time interval between every two adjacent user behavior data in each group of the plurality of groups of user behavior data to obtain a plurality of groups of time interval data;
obtaining the attenuation variance of each group of time interval data according to each group of time interval data and the attenuation coefficient of each time interval data in each group of time interval data;
and carrying out weighted calculation on the attenuation variance of each group of time interval data and the weight coefficient of each group of time interval data to obtain the behavior characteristic value of any behavior type.
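The three steps above can be sketched as follows. The section does not give the attenuation function or the exact definition of the attenuation variance, so this sketch assumes an exponential decay coefficient `exp(-decay * interval)` and takes the variance of the attenuated intervals; the grouping is assumed to be supplied by the caller as lists of timestamps in seconds. All names and constants are hypothetical.

```python
import math


def time_intervals(timestamps):
    """Intervals between adjacent behavior times within one group."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def attenuation_coefficient(interval, decay=0.01):
    """Assumed decay: longer intervals receive smaller coefficients."""
    return math.exp(-decay * interval)


def attenuation_variance(intervals, decay=0.01):
    """Variance of the intervals after scaling each by its coefficient."""
    if not intervals:
        return 0.0
    attenuated = [attenuation_coefficient(t, decay) * t for t in intervals]
    mean = sum(attenuated) / len(attenuated)
    return sum((v - mean) ** 2 for v in attenuated) / len(attenuated)


def behavior_feature_value(groups, group_weights, decay=0.01):
    """Weighted sum of the attenuation variances of all groups."""
    if len(groups) != len(group_weights):
        raise ValueError("one weight is required per group")
    return sum(weight * attenuation_variance(time_intervals(group), decay)
               for group, weight in zip(groups, group_weights))


regular = behavior_feature_value([[0, 10, 20, 30]], [1.0])
irregular = behavior_feature_value([[0, 1, 50, 52], [0, 5, 6]], [0.7, 0.3])
```

Under these assumptions, perfectly regular behavior (equal intervals) gives a feature value near zero, while irregular timing gives a larger value.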
Optionally, the first risk content identification model includes k+1 regression trees and learning coefficients of each of the k+1 regression trees, the k+1 regression trees include a first regression tree and k non-first regression trees, and each of the k non-first regression trees includes a root node, an internal node, and a leaf node, where k is an integer greater than or equal to 1; the k+1 regression trees are trained based on the first sample set;
the inputting the at least one behavior characteristic value, the confusion value and the sentence length value into a first risk content identification model to obtain a first classification probability includes:
determining the leaf node position of the content to be identified in each non-first regression tree according to the at least one behavior characteristic value, the confusion degree value, the statement length value, the root node and the internal node of each non-first regression tree;
obtaining a first predicted value of the content to be identified according to the predicted value of the first regression tree, the output value of the leaf node position of each non-first regression tree and the learning coefficient of each regression tree in the k+1 regression trees;
and converting the first predicted value into probability to obtain the first classification probability.
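A minimal sketch of this prediction step follows. It assumes each non-first regression tree is stored as nested dictionaries, that the first regression tree contributes a single constant predicted value, and that the predicted value is converted to a probability with the logistic (sigmoid) function; the section itself only says the value is "converted into probability", so the sigmoid and the data layout are assumptions.

```python
import math


def leaf_output(node, features):
    """Walk from the root through internal nodes to a leaf and return its value."""
    while "value" not in node:
        go_left = features[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["value"]


def first_classification_probability(features, first_tree_value,
                                     trees, learning_coefficients):
    """Combine the k+1 trees, then convert the predicted value to a probability.

    features: [behavior characteristic value, confusion value, sentence length]
    learning_coefficients: k+1 values, the first one for the first tree.
    """
    predicted = learning_coefficients[0] * first_tree_value
    for tree, coefficient in zip(trees, learning_coefficients[1:]):
        predicted += coefficient * leaf_output(tree, features)
    return 1.0 / (1.0 + math.exp(-predicted))


# One hypothetical non-first tree that splits on the confusion value (index 1).
tree = {"feature": 1, "threshold": 50.0,
        "left": {"value": -1.0}, "right": {"value": 2.0}}
probability = first_classification_probability(
    [0.2, 80.0, 12], 0.0, [tree], [1.0, 0.5])
```

With these inputs the content falls into the right leaf, so the predicted value is 0.5 × 2.0 = 1.0 before conversion.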
Optionally, the k non-first regression trees include a j-th regression tree, where j is an integer greater than 1 and less than or equal to k+1;
the method further comprises the construction process of the first risk content identification model:
constructing the first regression tree according to the number of risk contents and the number of normal contents in the first sample set, wherein the first regression tree comprises a 1st predicted value of each sample content in the first sample set;
obtaining the 1st classification probability of each sample content according to the 1st predicted value of each sample content, and obtaining the 1st residual of each sample content based on the 1st classification probability of each sample content and the content label of each sample content;
constructing the j-th regression tree according to the at least one behavior characteristic value, the confusion value, the sentence length value and the (j-1)-th residual of each sample content, so as to obtain the k non-first regression trees, wherein the (j-1)-th residual of each sample content is determined based on the (j-1)-th classification probability of each sample content and the content label of each sample content, the (j-1)-th classification probability of each sample content is obtained according to the (j-1)-th predicted value of each sample content, and the (j-1)-th predicted value of each sample content is determined according to the first regression tree through the (j-1)-th regression tree;
obtaining the (k+1)-th predicted value of each sample content according to the first regression tree and the k non-first regression trees;
obtaining the (k+1)-th classification probability of each sample content according to the (k+1)-th predicted value of each sample content, and obtaining the (k+1)-th residual of each sample content based on the (k+1)-th classification probability of each sample content and the content label of each sample content;
and when the (k+1)-th residual of each sample content meets a convergence condition, obtaining the first risk content identification model according to the k+1 regression trees and the learning coefficient of each regression tree in the k+1 regression trees.
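The construction process above resembles gradient boosting for binary classification, and can be sketched under several assumptions that the section does not state: the first tree's predicted value is taken as the log-odds ln(risk count / normal count), each residual is the label minus the classification probability, every non-first tree is a depth-1 tree (stump), leaf outputs use the common Newton-style ratio for log loss, one shared learning coefficient is used, and convergence means every residual is below `tol`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(rows, residuals, probs):
    """Fit a depth-1 regression tree to the residuals (least-squares split)."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows})[:-1]:
            left = [i for i, r in enumerate(rows) if r[f] <= threshold]
            right = [i for i, r in enumerate(rows) if r[f] > threshold]
            loss = 0.0
            for side in (left, right):
                mean = sum(residuals[i] for i in side) / len(side)
                loss += sum((residuals[i] - mean) ** 2 for i in side)
            if best is None or loss < best[0]:
                best = (loss, f, threshold, left, right)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    _, f, threshold, left, right = best

    def leaf(side):
        # Newton-style leaf output for log loss (assumption).
        denom = sum(probs[i] * (1.0 - probs[i]) for i in side)
        return sum(residuals[i] for i in side) / max(denom, 1e-12)

    return {"feature": f, "threshold": threshold,
            "left": leaf(left), "right": leaf(right)}


def stump_output(stump, row):
    return stump["left"] if row[stump["feature"]] <= stump["threshold"] else stump["right"]


def train(rows, labels, k=20, learning_coefficient=0.3, tol=0.05):
    """Build the first tree from class counts, then up to k residual-fitting stumps."""
    n_risk = sum(labels)
    n_normal = len(labels) - n_risk  # both counts must be non-zero
    first_value = math.log(n_risk / n_normal)
    scores = [first_value] * len(rows)
    stumps = []
    for _ in range(k):
        probs = [sigmoid(s) for s in scores]
        residuals = [y - p for y, p in zip(labels, probs)]
        if max(abs(r) for r in residuals) < tol:  # assumed convergence test
            break
        stump = fit_stump(rows, residuals, probs)
        stumps.append(stump)
        scores = [s + learning_coefficient * stump_output(stump, r)
                  for s, r in zip(scores, rows)]
    return first_value, stumps, scores


# Each row: [behavior characteristic value, confusion value, sentence length].
rows = [[0.1, 90.0, 8], [0.2, 85.0, 10], [3.0, 60.0, 25], [2.5, 55.0, 30]]
labels = [1, 1, 0, 0]
first_value, stumps, scores = train(rows, labels)
```

On this toy sample set the classes are balanced, so the first predicted value is 0, and a handful of stumps is enough to drive every residual under the tolerance.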
In one aspect, an embodiment of the present application provides a risk content identifying apparatus, including:
the splitting and intercepting module is used for splitting the content to be identified into a plurality of words, and intercepting the words according to the preset window length to obtain a plurality of word combinations;
the confusion value calculating module is used for calculating statement probability of the content to be identified according to occurrence probability of each word combination in the plurality of word combinations in a corresponding preset phrase set, and obtaining a confusion value of the content to be identified based on the statement probability, wherein the preset phrase set is determined based on a corpus;
the acquisition module is used for acquiring user behavior data of at least one behavior type of the user corresponding to the content to be identified if the confusion degree value is larger than a preset confusion degree threshold value;
the behavior characteristic value calculating module is used for obtaining at least one behavior characteristic value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type; the at least one behavior type comprises a first behavior type, and the user behavior data comprises a behavior type and a behavior time;
the first classification probability determining module is used for inputting the at least one behavior characteristic value, the confusion value and the sentence length value of the content to be identified into a first risk content identification model to obtain a first classification probability, wherein the first risk content identification model is trained on a first sample set, using the confusion value and the sentence length value of each sample content in the first sample set, the at least one behavior characteristic value of the user corresponding to each sample content, and the content label of each sample content;
and the first judging and outputting module is used for outputting the content to be identified as risk content if the first classification probability is larger than a first preset probability threshold value.
Optionally, the apparatus further includes:
the judging and determining module is used for determining a text content vector of the content to be identified according to whether the content to be identified contains preset type words if the confusion value is smaller than or equal to the preset confusion threshold, wherein the preset type words comprise digital type words, unit type words and behavior purpose type words;
the input module is used for inputting the text content vector into a second risk content recognition model so that the second risk content recognition model calculates a second predicted value of the content to be recognized according to the text content vector and the weight vector and outputs a second classification probability according to the second predicted value; the second risk content identification model is obtained by training based on a second sample set and content labels corresponding to text content vectors of each sample content in the second sample set, and comprises weight vectors;
and the second judging and outputting module is used for outputting the content to be identified as risk content if the second classification probability is greater than a second preset probability threshold.
Optionally, the apparatus further includes: a second model determination module.
The second model determining module is specifically configured to:
determining a text content vector of each piece of sample content according to whether the content of each piece of sample in the second sample set contains the preset type word;
training an initial logistic regression model according to the text content vector of each sample content and the content label of each sample content to obtain a first logistic regression model and a predicted content label of each sample content;
adjusting the first logistic regression model according to the content label of each piece of sample content and the predicted content label of each piece of sample content;
and when the adjusted first logistic regression model meets the convergence condition, determining the adjusted first logistic regression model as the second risk content identification model.
Optionally, the calculating confusion degree value module is specifically configured to:
multiplying the occurrence probability of each word combination in the word combinations in the corresponding preset phrase set to obtain the sentence probability of the content to be identified, and calculating the reciprocal of the sentence probability to obtain the confusion value of the content to be identified.
Optionally, the calculating behavior characteristic value module includes:
the grouping calculation unit is used for, for any behavior type, dividing the user behavior data of that behavior type into a plurality of groups of user behavior data, and calculating the time interval between every two adjacent user behavior data in each group of the plurality of groups of user behavior data to obtain a plurality of groups of time interval data;
the attenuation variance unit is used for obtaining the attenuation variance of each group of time interval data according to each group of time interval data and the attenuation coefficient of each time interval data in each group of time interval data;
and the behavior characteristic value calculating unit is used for carrying out weighted calculation on the attenuation variance of each group of time interval data and the weight coefficient of each group of time interval data to obtain the behavior characteristic value of any behavior type.
Optionally, the first risk content identification model includes k+1 regression trees and learning coefficients of each of the k+1 regression trees, the k+1 regression trees include a first regression tree and k non-first regression trees, and each of the k non-first regression trees includes a root node, an internal node, and a leaf node, where k is an integer greater than or equal to 1; the k+1 regression trees are trained based on the first sample set;
the first classification probability determining module includes:
the leaf node position determining unit is used for determining the leaf node position of the content to be identified in each non-first regression tree according to the at least one behavior characteristic value, the confusion degree value, the statement length value, the root node and the internal node of each non-first regression tree;
the predicted value calculation unit is used for obtaining a first predicted value of the content to be identified according to the predicted value of the first regression tree, the output value of the leaf node position of each non-first regression tree and the learning coefficient of each regression tree in the k+1 regression trees;
and the classification probability calculation unit is used for converting the first predicted value into probability to obtain the first classification probability.
Optionally, the k non-first regression trees include a j-th regression tree, where j is an integer greater than 1 and less than or equal to k+1;
the first classification probability determining module further includes:
a first regression tree construction unit, configured to construct the first regression tree according to the number of risk contents and the number of normal contents in the first sample set, where the first regression tree includes a 1st predicted value of each sample content in the first sample set;
a 1st residual calculation unit, configured to obtain the 1st classification probability of each sample content according to the 1st predicted value of each sample content, and obtain the 1st residual of each sample content based on the 1st classification probability of each sample content and the content label of each sample content;
a non-first regression tree construction unit, configured to construct the j-th regression tree according to the at least one behavior characteristic value, the confusion value, the sentence length value and the (j-1)-th residual of each sample content, so as to obtain the k non-first regression trees, where the (j-1)-th residual of each sample content is determined based on the (j-1)-th classification probability of each sample content and the content label of each sample content, the (j-1)-th classification probability of each sample content is obtained according to the (j-1)-th predicted value of each sample content, and the (j-1)-th predicted value of each sample content is determined according to the first regression tree through the (j-1)-th regression tree;
a (k+1)-th predicted value calculation unit, configured to obtain the (k+1)-th predicted value of each sample content according to the first regression tree and the k non-first regression trees;
a (k+1)-th residual calculation unit, configured to obtain the (k+1)-th classification probability of each sample content according to the (k+1)-th predicted value of each sample content, and obtain the (k+1)-th residual of each sample content based on the (k+1)-th classification probability of each sample content and the content label of each sample content;
and the first model determining unit is used for obtaining the first risk content identification model according to the k+1 regression trees and the learning coefficient of each regression tree in the k+1 regression trees when the k+1 residual error of each sample content meets the convergence condition.
In one aspect, an embodiment of the present application provides a risk content identifying apparatus, including a processor, a memory, and a transceiver, where the processor, the memory, and the transceiver are connected to each other, where the memory is configured to store a computer program supporting the risk content identifying apparatus to execute the risk content identifying method described above, and the computer program includes program instructions; the processor is configured to invoke the program instructions to perform the risk content identification method as described in one aspect of the embodiments of the present application described above.
In one aspect, an embodiment of the present application provides a storage medium, where a computer program is stored, where the computer program includes program instructions; the program instructions, when executed by a processor, cause the processor to perform a risk content identification method as described above in one aspect of an embodiment of the application.
In the embodiment of the application, the risk content identification platform splits the content to be identified into a plurality of words, and intercepts the words according to a preset window length to obtain a plurality of word combinations; calculates the sentence probability of the content to be identified according to the occurrence probability of each word combination in its corresponding preset phrase set, and obtains the confusion value of the content to be identified based on the sentence probability; if the confusion value is greater than a preset confusion threshold, obtains user behavior data of at least one behavior type of the user corresponding to the content to be identified; obtains at least one behavior characteristic value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type; inputs the at least one behavior characteristic value, the confusion value and the sentence length value of the content to be identified into a first risk content identification model to obtain a first classification probability; and if the first classification probability is greater than a first preset probability threshold, determines that the content to be identified is risk content. The method and device can improve the accuracy of risk content identification.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 2 is a flow chart of a risk content identification method according to an embodiment of the present application;
FIG. 3 is a graph of the attenuation coefficient function of time interval data provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of determining the leaf node position of sample content in the (k+1)-th regression tree according to the confusion value, the sentence length value and the behavior characteristic value of the user according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a first risk content identification model according to an embodiment of the present application;
FIG. 6 is a flow chart of a risk content identification method according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a risk content identifying apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present application. Risk content is currently identified mainly by traditional text classification or by matching specific keywords. Traditional text classification mainly uses algorithms such as fastText, TextCNN, and LSTM: a text classification model is trained on samples in which risk content is labeled as negative and normal content as positive, and the model is then used to identify risk content. However, this approach can only catch content that resembles the existing data and cannot catch previously unseen content. Keyword matching can be evaded by users who send risk content using altered wording such as homophones. Both schemes therefore identify risk content with low accuracy. For this reason, the embodiment of the application provides a risk content identification method that improves the accuracy of risk content identification. As shown in FIG. 1, the system architecture includes a risk content identification platform and a user terminal cluster. The user terminal cluster may include a plurality of user terminals and, as shown in FIG. 1, may specifically include a user terminal 100a, a user terminal 100b, a user terminal 100c, ..., and a user terminal 100n.
The risk content identification platform and each user terminal in the user terminal cluster may be a computer device, including a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a smart speaker, a mobile internet device (MID), a point-of-sale (POS) machine, a wearable device (e.g., a smart watch or a smart bracelet), and so on.
Further, as shown in Fig. 1, in the process of implementing the risk content identification method, a user A publishes text content through the user terminal 100a, and the text content is sent to the risk content identification platform. After receiving the text content, the risk content identification platform calculates the confusion value of the text content. If the confusion value of the text content is greater than a preset confusion threshold, the platform obtains user behavior data of at least one behavior type of the user corresponding to the text content, and calculates at least one behavior feature value corresponding to the at least one behavior type according to that user behavior data. The at least one behavior feature value, the confusion value of the text content and the sentence length value are then input into a first risk content identification model to obtain a first classification probability. If the first classification probability is greater than a first preset probability threshold, the risk content identification platform outputs the text content as risk content and displays the output result on a screen. After manual review, action is taken against the risk content and the user corresponding to the risk content.
Further, please refer to fig. 2, which is a flowchart illustrating a risk content identification method according to an embodiment of the present application. As shown in fig. 2, this method embodiment includes the steps of:
S101, splitting the content to be identified into a plurality of words, and intercepting the plurality of words according to a preset window length to obtain a plurality of word combinations.
The preset window length is the number of words contained in each word combination, and is generally 2.
Specifically, the risk content identification platform may split the content to be identified into a plurality of words using a word segmentation tool, such as Simple Chinese Word Segmentation (SCWS). It then intercepts the plurality of words, taking the preset window length as the number of words in each word combination and taking each of the plurality of words as the last word of a word combination, to obtain the plurality of word combinations. It should be noted that, when intercepting the words, the risk content identification platform adds a start identifier "START" before the first word and an end identifier "END" after the last word of the words obtained by splitting the content to be identified. The start identifier "START" indicates that the word immediately after it is the first word of the content to be identified, and the end identifier "END" indicates that the word immediately before it is the last word of the content to be identified. The number of start identifiers added before the first word and the number of end identifiers added after the last word both depend on the preset window length: each equals the preset window length minus 1. In this way, every word combination obtained by interception contains exactly as many words as the preset window length.
For example, the risk content identification platform splits the content to be identified "I love China" through SCWS to obtain the three words "I", "love" and "China". When intercepting these three words with a preset window length of 2, it adds one start identifier "START" before the word "I" and one end identifier "END" after the word "China", obtaining "START", "I", "love", "China" and "END". It then intercepts the words with the identifiers added according to the preset window length of 2, obtaining the word combinations "START I", "I love", "love China" and "China END".
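The interception in step S101 can be sketched as a sliding window over the padded word list. This is a minimal illustration only: the function name `word_combinations` is hypothetical, and the words are assumed to have already been produced by a word segmentation tool.

```python
def word_combinations(words, window=2):
    """Pad the word list with START/END identifiers and slide a window over it."""
    pad = window - 1  # number of START and of END identifiers (window length minus 1)
    padded = ["START"] * pad + list(words) + ["END"] * pad
    # Every window of `window` consecutive items is one word combination.
    return [tuple(padded[i:i + window]) for i in range(len(padded) - window + 1)]


combos = word_combinations(["I", "love", "China"], window=2)
print(combos)
```

With a window length of 2 this yields the four combinations of the example above, each containing exactly two items.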
S102, calculating the sentence probability of the content to be identified according to the occurrence probability of each of the plurality of word combinations in its corresponding preset phrase set, and obtaining the confusion value of the content to be identified based on the sentence probability.
The preset phrase set is determined based on a corpus.
Specifically, the preset phrase set corresponding to the i-th word combination among the plurality of word combinations consists of those word combinations in the corpus whose length is the preset window length and whose last word is the same as the last word of the i-th word combination. For example, if the corpus includes "I love China", "I am", "go to China", "leave China" and "like China", then for the 3rd word combination "love China" in step S101, phrase combinations are taken from each corpus entry according to the preset window length such that their last word is the same as the last word "China" of "love China". The phrase combinations obtained are "love China", "START China", "go to China", "leave China" and "like China". Each phrase combination obtained is a preset word combination corresponding to the word combination "love China", and the set formed by these preset word combinations is the preset phrase set.
The risk content identification platform counts the number of occurrences in the corpus of each preset word combination corresponding to the i-th word combination of the content to be identified, and calculates the sum of these numbers of occurrences. It also counts the number of occurrences of the i-th word combination itself in the corpus. The ratio of the number of occurrences of the i-th word combination to the sum of the numbers of occurrences of all its corresponding preset word combinations is the occurrence probability of the i-th word combination in its corresponding preset phrase set. For example, the preset phrase set corresponding to the 3rd word combination "love China" is "love China", "START China", "go to China", "leave China" and "like China". If these occur in the corpus 1, 2, 1, 1 and 1 times respectively, the occurrence probability of the 3rd word combination "love China" of the content to be identified in its corresponding preset phrase set is 1/6.
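As an illustration of this ratio, the sketch below computes the occurrence probability from a table of corpus counts. The function name `occurrence_probability` and the count table are hypothetical; the counts are chosen only so that they total 6, matching the 1/6 result of the example, and the handling of an empty phrase set (returning 0.0) is an assumption not stated in the text.

```python
from collections import Counter


def occurrence_probability(combo, corpus_counts):
    """Count of `combo` divided by the total count of all corpus combinations
    that end with the same last word (its preset phrase set)."""
    last_word = combo[-1]
    total = sum(n for other, n in corpus_counts.items() if other[-1] == last_word)
    if total == 0:
        return 0.0  # assumption: no matching combination in the corpus
    return corpus_counts.get(combo, 0) / total


# Hypothetical corpus counts for combinations ending in "China".
counts = Counter({
    ("love", "China"): 1,
    ("START", "China"): 2,
    ("go to", "China"): 1,
    ("leave", "China"): 1,
    ("like", "China"): 1,
})
p = occurrence_probability(("love", "China"), counts)
print(p)
```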
Optionally, the calculating to obtain the sentence probability of the content to be identified according to the occurrence probability of each word combination in the plurality of word combinations in the preset phrase set corresponding to the respective word combination, and obtaining the confusion value of the content to be identified based on the sentence probability includes:
Multiplying the occurrence probability of each word combination in the word combinations in the preset phrase set corresponding to the respective word combination to obtain the sentence probability of the content to be identified, and calculating the reciprocal of the sentence probability to obtain the confusion value of the content to be identified.
For example, if the word combinations "START I", "I love" and "love China" obtained in step S101 have occurrence probabilities $P_1$, $P_2$ and $P_3$ in their respective preset phrase sets, the sentence probability of the content to be identified "I love China" is $P_1 \cdot P_2 \cdot P_3$, and the confusion value of the content to be identified is calculated as $1/(P_1 \cdot P_2 \cdot P_3)$.
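The product and reciprocal described above can be sketched as follows. The function name `confusion_value` is hypothetical, and returning infinity when the sentence probability is zero is an assumption made for the sketch; the text does not say how that case is handled.

```python
import math


def confusion_value(probabilities):
    """Reciprocal of the product of the word-combination occurrence probabilities."""
    sentence_probability = math.prod(probabilities)
    if sentence_probability == 0:
        return math.inf  # assumption: an unseen combination gives maximal confusion
    return 1.0 / sentence_probability


# Illustrative probabilities P_1, P_2, P_3 (not taken from the text).
value = confusion_value([0.5, 0.25, 0.5])
print(value)
```

A less likely sentence has a smaller sentence probability and therefore a larger confusion value, which is why step S103 compares the value against a threshold.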
S103, if the confusion value is greater than a preset confusion threshold, acquiring user behavior data of at least one behavior type of the user corresponding to the content to be identified.
Specifically, if the confusion value is greater than the preset confusion threshold, the risk content identification platform acquires the user behavior data of at least one behavior type of the user corresponding to the content to be identified according to the user identifier carried by the content to be identified.
S104, obtaining at least one behavior characteristic value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type.
The behavior types include the user sending private messages, the user posting comments, the user changing their signature, and the user changing their nickname.
Optionally, the at least one behavior type includes a first behavior type (the first behavior type is any one of the at least one behavior type), and the user behavior data includes a behavior type and a behavior time;
the obtaining at least one behavior characteristic value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type includes:
dividing the user behavior data of the first behavior type into a plurality of groups of user behavior data according to time period, and calculating the time interval between every two adjacent user behavior data in each of the plurality of groups of user behavior data to obtain a plurality of groups of time interval data;
obtaining the attenuation variance of each group of time interval data in the plurality of groups of time interval data according to the plurality of groups of time interval data and the attenuation coefficient of each group of time interval data in the plurality of groups of time interval data;
and carrying out weighted calculation on the attenuation variance of each group of time interval data and the weight coefficients of the plurality of groups of time interval data to obtain the behavior characteristic value of the first behavior type. The behavior characteristic value corresponding to each of the at least one behavior type is obtained in the same manner.
The following describes in detail, in the order in which the risk content identification platform performs the calculation, how the plurality of groups of time interval data, the attenuation variance of each group of time interval data, and the at least one behavior characteristic value are obtained.
It should be noted that, after calculating the time interval between every two adjacent user behavior data in each of the plurality of groups of user behavior data, the risk content identification platform removes, according to the time sequence corresponding to the time interval data, a certain number of time interval data having the largest time interval values.
In addition, the number of time interval data in the same time period does not exceed 20. When the number of time interval data in the same time period is 15 and the number of groups of time interval data is 3, the risk content identification platform can obtain the identification result more quickly. Of course, other values obtained from experimental results may be used instead.
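A rough sketch of the grouping and interval calculation is given below. Several points are assumptions rather than statements from the text: time periods are taken to be fixed-length windows counted back from the reference time, timestamps are plain numbers, each group is ordered from most recent to oldest, and the cap of 15 intervals is applied by simple truncation. The function name `interval_groups` and its parameters are hypothetical, and the removal of the largest intervals mentioned above is omitted because its exact rule is not specified.

```python
def interval_groups(timestamps, reference_time, period, n_groups=3, max_intervals=15):
    """Group behavior times by period and return the adjacent time intervals per group."""
    groups = [[] for _ in range(n_groups)]
    for t in sorted(timestamps, reverse=True):  # most recent behavior first
        index = int((reference_time - t) // period)  # 0 = period closest to reference time
        if 0 <= index < n_groups:
            groups[index].append(t)
    result = []
    for group in groups:
        # The group is in descending order, so each difference is non-negative.
        intervals = [later - earlier for later, earlier in zip(group, group[1:])]
        result.append(intervals[:max_intervals])
    return result


groups = interval_groups([100, 97, 95, 88, 85, 70], reference_time=100, period=10)
print(groups)
```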
Then, the risk content identification platform calculates the attenuation variance of each group of time interval data according to each group of time interval data and the attenuation coefficient of each time interval data in that group.
Optionally, the plurality of sets of time interval data includes a first set of time interval data;
the obtaining the attenuation variance of each set of time interval data in the sets of time interval data according to the sets of time interval data and the attenuation coefficient of each set of time interval data in the sets of time interval data comprises the following steps:
obtaining a fluctuation value of each time interval data in the first group of time interval data according to the average value of the first group of time interval data and each time interval data in the first group of time interval data;
obtaining the attenuation fluctuation value of each time interval data in the first group of time interval data according to the fluctuation value of each time interval data in the first group of time interval data and the attenuation coefficient of each time interval data in the first group of time interval data, wherein the attenuation coefficient of each time interval data in the first group of time interval data is determined based on the data sequence number of each time interval data in the first group of time interval data;
According to the attenuation fluctuation value of each time interval data in the first group of time interval data and the data number of the first group of time interval data, the attenuation variance of the first group of time interval data of the first behavior type is obtained, and then the attenuation variance of each group of time interval data in the plurality of groups of time interval data of the first behavior type is obtained.
Wherein the number of groups of time interval data of the first behavior type may be 3.
Specifically, the risk content identification platform calculates the fluctuation value $(x_i-\mu)^2$ of the i-th time interval data $x_i$ in the first group from the average value $\mu$ of the first group of time interval data and $x_i$. It then multiplies this fluctuation value by the attenuation coefficient $1/(10+i)$ of $x_i$ to obtain the attenuation fluctuation value $\frac{1}{10+i}(x_i-\mu)^2$ of $x_i$, and in this way obtains the attenuation fluctuation value of each time interval data in the first group. Finally, it calculates the ratio of the sum of the attenuation fluctuation values of all time interval data in the first group to the number $n$ of data in the first group, obtaining the attenuation variance $s_1$ of the first group of time interval data, i.e. $s_1=\frac{1}{n}\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{10+i}$. The attenuation variance of each group of time interval data of the first behavior type is calculated in the same manner.
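The attenuation variance described above can be written as a short function. This is a sketch only: the name `attenuation_variance` is hypothetical, the data sequence number i is assumed to start at 1, and returning 0.0 for an empty group is an assumption.

```python
def attenuation_variance(intervals):
    """Mean of (x_i - mu)^2 / (10 + i) over one group, with i starting at 1."""
    n = len(intervals)
    if n == 0:
        return 0.0  # assumption: an empty group contributes no variance
    mu = sum(intervals) / n
    return sum((x - mu) ** 2 / (10 + i) for i, x in enumerate(intervals, start=1)) / n


s1 = attenuation_variance([2, 4])
print(s1)
```

For the group [2, 4] the mean is 3, so each fluctuation value is 1 and the result is (1/11 + 1/12) / 2.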
Further, please refer to Fig. 3, which is a schematic diagram of the function image of the attenuation coefficient of the time interval data according to an embodiment of the present application. As shown in Fig. 3, the attenuation coefficient 1/(10+x) gradually decreases as x increases, and it decreases faster in the interval 1 ≤ x ≤ 20 than in other intervals, where x is the data sequence number of each time interval data in its group. A smaller data sequence number indicates that the time period corresponding to the time interval data is closer to the reference time, and a larger one indicates that it is farther from the reference time. It will be appreciated that the attenuation coefficient amounts to weighting each time interval data by a strategy in which the coefficient decays faster for data closer in time and more slowly for data farther in time. This avoids the coefficient of distant time interval data decaying so much that the share of that data becomes too small.
The risk content identification platform then performs a weighted calculation on the attenuation variance of each group of time interval data and the weight coefficients of the plurality of groups of time interval data to obtain the behavior characteristic value of the first behavior type.
Optionally, the calculating the attenuation variance of each set of time interval data and the weight coefficient of the plurality of sets of time interval data to obtain the behavior characteristic value of the first behavior type includes:
determining the weight coefficient corresponding to each group of time interval data from the weight coefficients of the plurality of groups of time interval data according to the distance between the time period corresponding to each group of time interval data and the reference time;
and carrying out weighted calculation on the attenuation variance of each group of time interval data and the weight coefficient corresponding to each group of time interval data to obtain the behavior characteristic value of the first behavior type.
Specifically, the risk content identification platform determines the weight coefficient of each group of time interval data according to the distance between the time period corresponding to that group and the reference time. The reference time may be, for example, the time at which the risk content identification platform obtains the user behavior data of the first behavior type for identification (which may also be referred to as the current time). The largest of the weight coefficients of the plurality of groups of time interval data is assigned to the group closest to the reference time, and the smallest is assigned to the group farthest from the reference time. The weight coefficient corresponding to each group of time interval data is obtained from the weight coefficients in this manner. The attenuation variance of each group of time interval data is then multiplied by the weight coefficient corresponding to that group, and the products are summed to obtain the behavior characteristic value of the first behavior type. The at least one behavior characteristic value of the at least one behavior type may be calculated in the same manner.
For example, suppose there are three groups of time interval data of the first behavior type, and the weight coefficients of the three groups include 0.5, 0.3 and 0.2. The weight coefficient of the first group, whose time period is closest to the current time, is determined to be 0.5; the weight coefficient of the third group, whose time period is farthest from the current time, is determined to be 0.2; and the weight coefficient of the second group is determined to be 0.3. From the weight coefficient and the attenuation variance of each group, the behavior characteristic value of the first behavior type is calculated as $0.5 s_1 + 0.3 s_2 + 0.2 s_3$, where $s_1$, $s_2$ and $s_3$ are the attenuation variances of the first, second and third groups of time interval data, respectively.
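The weighted sum in this example can be sketched as below. The function name `behavior_feature_value` is hypothetical, and the attenuation variances are assumed to be passed in order from the group closest to the reference time to the group farthest from it, so that sorting the weights in descending order pairs the largest weight with the closest group.

```python
def behavior_feature_value(variances, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of per-group attenuation variances, closest group first."""
    ordered_weights = sorted(weights, reverse=True)  # largest weight for the closest group
    return sum(w * s for w, s in zip(ordered_weights, variances))


# Illustrative attenuation variances s_1, s_2, s_3 (not taken from the text).
feature = behavior_feature_value([2.0, 1.0, 4.0])
print(feature)
```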
S105, inputting the at least one behavior characteristic value, the confusion value and the sentence length value of the content to be identified into a first risk content identification model to obtain a first classification probability.
The first risk content identification model is obtained by training based on a first sample set, namely on the confusion value and the sentence length value of each sample content in the first sample set, at least one behavior characteristic value of the user corresponding to each sample content, and the content label of each sample content, where the content label is either normal content or risk content.
Before executing step S105, the risk content identification platform trains on the confusion value and the sentence length value of each sample content in the first sample set, the at least one behavior characteristic value of the user corresponding to each sample content, and the content label of each sample content, to obtain the first risk content identification model.
In an alternative embodiment, the first risk content identification model is a gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) model.
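Optionally, the first risk content identification model includes a first regression tree and k non-first regression trees, and the k non-first regression trees include a j-th regression tree, where j is an integer greater than 1 and less than or equal to k+1;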
the method further comprises the construction process of the first risk content identification model:
constructing the first regression tree according to the number of risk contents and the number of normal contents in the first sample set, wherein the first regression tree comprises a 1st predicted value of each sample content in the first sample set;
obtaining the 1 st classification probability of each piece of sample content according to the 1 st predicted value of each piece of sample content, and obtaining the 1 st residual error of each piece of sample content based on the 1 st classification probability of each piece of sample content and the content label of each piece of sample content;
constructing the j-th regression tree according to the at least one behavior characteristic value, the confusion value, the sentence length value and the (j-1)-th residual of each sample content, so as to obtain the k non-first regression trees, wherein the (j-1)-th residual of each sample content is determined based on the (j-1)-th classification probability of each sample content and the content label of each sample content, the (j-1)-th classification probability of each sample content is obtained according to the (j-1)-th predicted value of each sample content, and the (j-1)-th predicted value of each sample content is determined according to the first regression tree to the (j-1)-th regression tree;
obtaining the (k+1)-th predicted value of each sample content according to the first regression tree and the k non-first regression trees;
obtaining the k+1th classification probability of each piece of sample content according to the k+1th predicted value of each piece of sample content, and obtaining the k+1th residual error of each piece of sample content based on the k+1th classification probability of each piece of sample content and the content label of each piece of sample content;
and when the k+1th residual error of each sample content meets a convergence condition, obtaining the first risk content identification model according to the k+1 regression trees and the learning coefficient of each regression tree in the k+1 regression trees.
Specifically, the risk content identification platform calculates the 1st predicted value of each sample content in the first sample set, namely $\log(m/n)$, according to the number $m$ of risk contents and the number $n$ of normal contents in the first sample set, and thereby obtains the first regression tree, i.e. the 1st regression tree. The 1st regression tree contains a single leaf node, which holds the 1st predicted value $\log(m/n)$ of each sample content. The 1st predicted value of each sample content is substituted into the formula $e^z/(1+e^z)$, where $z$ is the predicted value of each sample content, to obtain the 1st classification probability of each sample content. The actual classification probability is set to 1 for sample contents in the first sample set whose content label is risk content and to 0 for those whose content label is normal content, and the difference between the actual classification probability and the 1st classification probability of each sample content is calculated to obtain the 1st residual of each sample content.
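The initial prediction, classification probability and residual can be sketched as follows. The function names are hypothetical, labels are encoded as 1 for risk content and 0 for normal content as described above, and "log" is assumed to be the natural logarithm so that it is consistent with the formula $e^z/(1+e^z)$.

```python
import math


def initial_prediction(m, n):
    """1st predicted value log(m / n) from m risk samples and n normal samples."""
    return math.log(m / n)  # assumption: natural logarithm


def classification_probability(z):
    """Map a predicted value z to a probability via e^z / (1 + e^z)."""
    return math.exp(z) / (1 + math.exp(z))


def residuals(labels, probabilities):
    """Actual classification probability (1 = risk, 0 = normal) minus predicted probability."""
    return [y - p for y, p in zip(labels, probabilities)]


labels = [1, 0, 0, 0]                 # one risk content, three normal contents
z1 = initial_prediction(m=1, n=3)     # same 1st predicted value for every sample
p1 = classification_probability(z1)   # equals m / (m + n)
r1 = residuals(labels, [p1] * len(labels))
print(p1, r1)
```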
Constructing a 2 nd regression tree according to the confusion value, the sentence length value and at least one behavior characteristic value of the corresponding user of each sample content and the 1 st residual error of each sample content, wherein the specific implementation process for constructing the 2 nd regression tree can be referred to as follows:
The sample contents in the first sample set are sorted in ascending order of the characteristic value of the first behavior type to obtain a sorted first sample set. The average of the characteristic values of the first behavior type of the users corresponding to every two adjacent sample contents in the sorted first sample set is calculated to obtain a plurality of split values of the first behavior type, and the Gini index of each split value of the first behavior type is calculated. Here, the m-th split value M of the first behavior type is taken as an example to describe in detail how the Gini index of each split value is calculated.
According to the split value M of the first behavior type, the first sample set is divided into a first sub-sample set (i.e. the left subtree), in which the characteristic value of the first behavior type is smaller than M, and a second sub-sample set (i.e. the right subtree), in which the characteristic value of the first behavior type is greater than or equal to M. Counting then gives the numbers $a_1$ and $a_2$ of sample contents in the left subtree whose content labels are risk content and normal content respectively, and the numbers $b_1$ and $b_2$ of sample contents in the right subtree whose content labels are risk content and normal content respectively.
According to the formula $1-P_1^2-P_2^2$, the weight $\alpha$ of the left subtree and the weight $\beta$ of the right subtree are calculated respectively. When calculating the weight $\alpha$ of the left subtree, $P_1$ is the probability that sample content in the left subtree is risk content, i.e. $a_1/(a_1+a_2)$, and $P_2$ is the probability that sample content in the left subtree is normal content, i.e. $a_2/(a_1+a_2)$. When calculating the weight $\beta$ of the right subtree, $P_1$ is the probability that sample content in the right subtree is risk content, i.e. $b_1/(b_1+b_2)$, and $P_2$ is the probability that sample content in the right subtree is normal content, i.e. $b_2/(b_1+b_2)$. Further, the Gini index of the m-th split value M of the first behavior type is calculated according to the formula $\frac{a_1+a_2}{a_1+a_2+b_1+b_2}\alpha+\frac{b_1+b_2}{a_1+a_2+b_1+b_2}\beta$.
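The two formulas above can be sketched directly from the four counts. The function names `subtree_weight` and `split_gini` are hypothetical, and treating an empty subtree as having weight 0.0 is an assumption.

```python
def subtree_weight(risk, normal):
    """1 - P_1^2 - P_2^2 for one subtree, from its risk and normal sample counts."""
    total = risk + normal
    if total == 0:
        return 0.0  # assumption: an empty subtree contributes nothing
    p1, p2 = risk / total, normal / total
    return 1 - p1 ** 2 - p2 ** 2


def split_gini(a1, a2, b1, b2):
    """Size-weighted combination of the left (a1, a2) and right (b1, b2) subtree weights."""
    left, right = a1 + a2, b1 + b2
    total = left + right
    return left / total * subtree_weight(a1, a2) + right / total * subtree_weight(b1, b2)


print(split_gini(2, 0, 0, 2))  # a split that separates the labels perfectly
print(split_gini(1, 1, 1, 1))  # a split that leaves both subtrees evenly mixed
```

A lower value means the split separates risk content from normal content more cleanly, which is why the split with the smallest value is preferred.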
The Gini index of each split value of the first behavior type is obtained in the above manner, and the Gini index of each split value of each of the at least one behavior type, the confusion value and the sentence length value is then calculated in the same way. The process of constructing the regression tree according to the Gini index is described below.
Exemplarily, the split value M with the smallest Gini index among all split values of each of the at least one behavior type, the confusion value and the sentence length value, together with the behavior type $x_1$ (the first behavior type) corresponding to that split value, is determined as the root node of the 2nd regression tree, i.e. $x_1 < M$. Further, the Gini index of each split value of each of the at least one behavior type, the confusion value and the sentence length value in the first sub-sample set is calculated in the above manner, and the minimum of these Gini indexes in the first sub-sample set is compared with the Gini index corresponding to the root node. If this minimum is smaller than the Gini index corresponding to the root node, the split value L corresponding to this minimum and the behavior type $x_3$ (the third behavior type) corresponding to that split value are determined as the internal node connected to the root node on the left in the 2nd regression tree, denoted $x_3 < L$. According to the internal node $x_3 < L$, the first sub-sample set is divided into a third sub-sample set and a fourth sub-sample set, and the Gini index of each split value of each of the at least one behavior type, the confusion value and the sentence length value is calculated in the third sub-sample set and in the fourth sub-sample set respectively.
If the minimum Gini index in the third sub-sample set and the minimum Gini index in the fourth sub-sample set are both larger than the Gini index corresponding to the internal node connected to the root node on the left, the 1st residual of each sample content in the first sub-sample set is taken as the leaf node $Y_1$ in the left subtree of the 2nd regression tree, and the first sub-sample set is not divided further. In the same manner, the internal node connected to the root node on the right in the 2nd regression tree (denoted $x_4 < J$), the internal node $x_2 < K$ and the leaf node $Y_2$ connected to the internal node $x_4 < J$, and the two leaf nodes $Y_3$ and $Y_4$ connected to the internal node $x_2 < K$ are obtained respectively. Each of the 3rd to (k+1)-th regression trees is obtained in the same manner as the 2nd regression tree.
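Choosing a node by the smallest Gini index can be sketched as a search over every feature and every candidate split value. This is an illustration under stated assumptions, not the patented procedure itself: the function names, the dictionary-based sample layout and the feature keys `x1` and `x5` are hypothetical, duplicate feature values are collapsed before taking midpoints, and ties keep the first split found.

```python
def gini(labels):
    """1 - P_1^2 - P_2^2 for a list of labels (1 = risk content, 0 = normal content)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1 - p ** 2 - (1 - p) ** 2


def best_split(samples, labels):
    """Return (gini_index, feature, split_value) with the smallest Gini index."""
    best = None
    for feature in samples[0]:
        ordered = sorted({s[feature] for s in samples})
        for low, high in zip(ordered, ordered[1:]):
            split_value = (low + high) / 2  # average of two adjacent feature values
            left = [y for s, y in zip(samples, labels) if s[feature] < split_value]
            right = [y for s, y in zip(samples, labels) if s[feature] >= split_value]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, split_value)
    return best


samples = [
    {"x1": 1.0, "x5": 10.0},
    {"x1": 2.0, "x5": 3.0},
    {"x1": 3.0, "x5": 12.0},
    {"x1": 4.0, "x5": 5.0},
]
labels = [0, 0, 1, 1]
root = best_split(samples, labels)
print(root)
```

Applying the same search recursively to each resulting sub-sample set, and stopping when no split improves on the parent node, gives the tree-growing procedure described above.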
Further, the (k+1) th predicted value of each sample content is obtained according to the 1 st to (k+1) th regression trees.
Optionally, the j-th regression tree includes a root node, internal nodes and leaf nodes, where the root node and the internal nodes are configured to classify the sample content according to each of the at least one behavior characteristic value, the confusion value and the sentence length value of the sample content, and each of the leaf nodes includes the (j-1)-th residual of at least one sample content in the first sample set;
the k+1st predicted value of each sample content is obtained according to the first regression tree and the k non-first regression trees, and the method comprises the following steps:
Obtaining an output value of each leaf node in the j-th regression tree according to the j-1 th residual error of each sample content in each leaf node and the j-1 th classification probability of each sample content in each leaf node, so as to obtain the output value of each leaf node in each non-first regression tree;
determining the leaf node position of each sample content in the first sample set in the j-th regression tree according to the confusion value, the sentence length value, at least one behavior characteristic value of the corresponding user of each sample content, the root node and the internal node of each sample content in the first sample set, and further determining the leaf node position of each sample content in each non-first regression tree;
and obtaining the (k+1)-th predicted value of each sample content in the first sample set according to the 1st predicted value, the learning coefficient of each non-first regression tree, and the output value of the leaf node at the leaf node position of each sample content in each non-first regression tree.
The output value of each leaf node in the j-th regression tree is $\frac{\sum_{i=1}^{I} a_i}{\sum_{i=1}^{I} P_i(1-P_i)}$, where $I$ is the total number of sample contents contained in the leaf node, $a_i$ is the (j-1)-th residual of the i-th sample content in the leaf node, and $P_i$ is the (j-1)-th classification probability of the i-th sample content.
Specifically, the output value of each leaf node in the j-th regression tree is calculated according to the above formula, and the output value of each leaf node in each of the k non-first regression trees is obtained in the same way, where the k non-first regression trees include the (k+1)-th regression tree.
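A sketch of the leaf output calculation follows. It assumes that the formula referred to above is the usual gradient boosting leaf value for a logistic loss, namely the sum of the residuals $a_i$ in the leaf divided by the sum of $P_i(1-P_i)$ over the same samples. That choice is an assumption based on the variables defined in the text, and the function name `leaf_output` is hypothetical.

```python
def leaf_output(residuals, probabilities):
    """Sum of residuals a_i divided by the sum of P_i * (1 - P_i) for one leaf node."""
    denominator = sum(p * (1 - p) for p in probabilities)
    return sum(residuals) / denominator


# Two samples in one leaf: (j-1)-th residuals and (j-1)-th classification probabilities.
output = leaf_output([0.75, -0.25], [0.25, 0.25])
print(output)
```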
For example, please refer to fig. 4, which is a schematic diagram of determining the leaf node position in the (k+1)-th regression tree according to the confusion value, the sentence length value, and the behavior characteristic values of the user, provided by an embodiment of the present application. As shown in fig. 4, $x_1$ to $x_4$ are the behavior characteristic values of the first to fourth behavior types of the user corresponding to the sample content, and $x_5$ and $x_6$ are the confusion value and the sentence length value of the sample content, respectively. The solid-line boxes, dashed-line boxes, and solid-line ovals in fig. 4 represent the root node, the internal nodes, and the leaf nodes of the (k+1)-th regression tree, respectively, and $Y_{k+1,i}$ in a leaf node denotes the i-th leaf node of the (k+1)-th regression tree. Starting from the root node shown in fig. 4, it is judged whether the behavior characteristic value d2 of the second behavior type of the user corresponding to the i-th sample content is less than c1. Since d2 is greater than c1, the internal node $x_3 < c3$ is reached, where it is judged whether the behavior characteristic value d3 of the third behavior type of the user corresponding to the i-th sample content is less than c3; if so, the leaf node $Y_{k+1,5}$ is reached, i.e., the leaf node position of the i-th sample content in the (k+1)-th regression tree is $Y_{k+1,5}$. The leaf node position of each sample content in the first sample set in the (k+1)-th regression tree is obtained in this way, and the leaf node position of each sample content in each of the 2nd to (k+1)-th regression trees is obtained in the same manner.
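The walk from the root node to a leaf node described for fig. 4 can be sketched as follows. The nested-dictionary tree layout, the threshold values, and the leaf names other than the one reached in the example are hypothetical; only the shape of the walk (root test on $x_2$, then an internal test on $x_3$) follows the description.

```python
def find_leaf(node, features):
    """Walk from the root to a leaf; each split tests features[index] < threshold."""
    while "leaf" not in node:
        branch = "yes" if features[node["feature"]] < node["threshold"] else "no"
        node = node[branch]
    return node["leaf"]


# Hypothetical tree shaped like the fig. 4 walk-through; thresholds are invented.
tree = {
    "feature": 1, "threshold": 0.5,          # root node: x2 < c1
    "yes": {"leaf": "Y1"},
    "no": {
        "feature": 2, "threshold": 0.8,      # internal node: x3 < c3
        "yes": {"leaf": "Y5"},
        "no": {"leaf": "Y6"},
    },
}

# x1..x4 are behavior characteristic values, x5 the confusion value, x6 the sentence length.
sample = [0.1, 0.9, 0.3, 0.0, 40.0, 12.0]
leaf = find_leaf(tree, sample)
```

Here d2 = 0.9 is not less than c1 = 0.5, so the walk moves to the internal node, and d3 = 0.3 is less than c3 = 0.8, so the sample lands in the leaf named `Y5`.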
Then, the product between the learning coefficient of the 2nd regression tree and the output value of the leaf node at the leaf node position of the i-th sample content in the 2nd regression tree is calculated, and so on, up to the product between the learning coefficient of the (k+1)-th regression tree and the output value of the leaf node at the leaf node position of the i-th sample content in the (k+1)-th regression tree. The sum of all the products and the 1st predicted value of the i-th sample content is then calculated to obtain the (k+1)-th predicted value of the i-th sample content, and the (k+1)-th predicted value of each sample content in the first sample set is obtained in the same way.
Thereafter, the risk content recognition platform substitutes the (k+1)-th predicted value of each sample content into $e^z/(1+e^z)$, where z is the predicted value of the sample content, to obtain the (k+1)-th classification probability of each sample content. It then calculates the difference between the actual classification probability and the (k+1)-th classification probability of each sample content to obtain the (k+1)-th residual of each sample content, and judges whether the (k+1)-th residual of each sample content meets a convergence condition, namely that the (k+1)-th residual of each sample content is smaller than a residual threshold. If the convergence condition is met, the first risk content identification model is obtained according to the 1st to (k+1)-th regression trees and the learning coefficient of each of the 1st to (k+1)-th regression trees. Please refer to fig. 5, which is a schematic diagram of the first risk content identification model provided by an embodiment of the present application. In fig. 5, $L_1$, $L_2$, and $L_{k+1}$ denote the learning coefficients of the 1st, 2nd, and (k+1)-th regression trees, respectively, with $L_1 = 1$; the solid-line boxes, dashed-line boxes, and solid-line ovals in fig. 5 represent the root node, the internal nodes, and the leaf nodes of each regression tree, respectively; and the leaf node $Y_{1,1}$ in the 1st regression tree is the 1st predicted value of each sample content in the first sample set.
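A minimal sketch of the accumulation and convergence check just described is given below. The function names and numbers are hypothetical, and comparing the absolute value of each residual with the threshold is an assumption, since the text only states that the residual is smaller than a residual threshold.

```python
import math


def predicted_value(first_prediction, learning_coefficients, leaf_outputs):
    """1st predicted value plus the coefficient-weighted leaf outputs of trees 2..k+1."""
    return first_prediction + sum(
        c * y for c, y in zip(learning_coefficients, leaf_outputs)
    )


def to_probability(z):
    """Convert a predicted value z into a classification probability e^z / (1 + e^z)."""
    return math.exp(z) / (1.0 + math.exp(z))


def residual(actual_probability, predicted_probability):
    """Difference between the actual and the predicted classification probability."""
    return actual_probability - predicted_probability


def converged(residuals, threshold):
    """Assumed convergence test: every residual is below the threshold in magnitude."""
    return all(abs(r) < threshold for r in residuals)


# Hypothetical sample content with two non-first trees.
z = predicted_value(0.0, [0.1, 0.1], [2.0, -1.0])
p = to_probability(z)
r = residual(1.0, p)
```

Because $L_1 = 1$, the 1st predicted value enters the sum unweighted, while every later tree contributes its leaf output scaled by its own learning coefficient.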
Further, the at least one behavior characteristic value of the user corresponding to the content to be identified, the confusion value of the content to be identified, and the sentence length value of the content to be identified are input into the first risk content identification model to obtain the first classification probability.
Optionally, the first risk content identification model includes k+1 regression trees and learning coefficients of each of the k+1 regression trees, the k+1 regression trees include a first regression tree and k non-first regression trees, and each of the k non-first regression trees includes a root node, an internal node, and a leaf node, where k is an integer greater than or equal to 1; the k+1 regression trees are trained based on the first sample set;
the step of inputting the at least one behavior characteristic value, the confusion degree value and the sentence length value into a first risk content identification model to obtain a first classification probability, comprising:
determining the leaf node position of the content to be identified in each non-first regression tree according to the at least one behavior characteristic value, the confusion degree value, the statement length value, the root node and the internal node of each non-first regression tree;
Obtaining a first predicted value of the content to be identified according to the predicted value of the first regression tree, the output value of the leaf node position of each non-first regression tree and the learning coefficient of each regression tree in the k+1 regression trees;
and converting the first predicted value into probability to obtain the first classification probability.
Specifically, for the implementation of obtaining the leaf node position of the content to be identified in the (k+1)-th regression tree according to the at least one behavior characteristic value of the user corresponding to the content to be identified, the confusion value of the content to be identified, and the sentence length value, reference may be made to the above description of how the risk content identification platform finds the leaf node position of the i-th sample content in the (k+1)-th regression tree according to the at least one behavior characteristic value, the confusion value, and the sentence length value of the i-th sample content, which is not repeated here.
Further, the leaf node position of the content to be identified in each of the 2nd to (k+1)-th regression trees is determined. The product between the learning coefficient of the 2nd regression tree and the output value of the leaf node at the leaf node position of the content to be identified in the 2nd regression tree is calculated, and so on, up to the product between the learning coefficient of the (k+1)-th regression tree and the output value of the leaf node at the leaf node position of the content to be identified in the (k+1)-th regression tree. The sum of all the products and the predicted value of the first regression tree (the 1st predicted value) is then calculated to obtain the predicted value of the content to be identified, which is substituted into $e^z/(1+e^z)$, where z is the predicted value of the content to be identified, to obtain the first classification probability of the content to be identified.
S106, outputting the content to be identified as risk content if the first classification probability is larger than a first preset probability threshold.
Wherein the first preset probability threshold may be any number greater than or equal to 0.5 and less than 1.
Specifically, if the first classification probability is greater than the first preset probability threshold, the risk content identification platform outputs the content to be identified as risk content and displays the output result on a screen; after manual review and confirmation, action is taken against the risk content and the user corresponding to the risk content.
In the embodiment of the application, the risk content recognition platform calculates the confusion value of the content to be identified and uses it to measure how likely the sentence of the content to be identified is unreasonable, that is, how likely the content to be identified is risk content. If the confusion value of the content to be identified is greater than the preset confusion threshold, the content is more likely to be risk content; in that case, user behavior data of at least one behavior type of the user corresponding to the content to be identified is obtained, and at least one behavior characteristic value corresponding to the at least one behavior type is calculated from the user behavior data. The at least one behavior characteristic value, the confusion value of the content to be identified, and the sentence length value are input into the first risk content identification model to obtain the first classification probability of the content to be identified, from which the identification result of the content to be identified is obtained.
Fig. 6 is a schematic flow chart of a risk content identification method according to an embodiment of the present application. As shown in fig. 6, this method embodiment includes the steps of:
S201, splitting the content to be identified into a plurality of words, and intercepting the words according to the preset window length to obtain a plurality of word combinations.
S202, calculating the sentence probability of the content to be identified according to the occurrence probability of each word combination in the corresponding preset phrase set, and obtaining the confusion value of the content to be identified based on the sentence probability.
S203, if the confusion value is larger than a preset confusion threshold, acquiring user behavior data of at least one behavior type of the user corresponding to the content to be identified.
S204, obtaining at least one behavior characteristic value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type.
S205, inputting at least one behavior characteristic value, a confusion degree value and a statement length value of the content to be identified into a first risk content identification model to obtain a first classification probability.
S206, outputting the content to be identified as risk content if the first classification probability is larger than a first preset probability threshold.
Here, the specific implementation manner of steps S201 to S206 may refer to descriptions of steps S101 to S106 in the corresponding embodiment of fig. 2, which are not repeated herein.
S207, if the confusion degree value is smaller than or equal to a preset confusion degree threshold value, determining a text content vector of the content to be identified according to whether the content to be identified contains a preset type word.
Optionally, the preset type words include a numeric type word, a unit type word and a behavior purpose type word. Wherein the digital type words include numbers and simplified and complex converted forms of numbers, e.g., 1, one, million, etc.; unit type words, e.g., meta, coin, K coin, etc.; action purpose type words, e.g., fill, flush, powder, etc.
Specifically, if the content to be identified contains a digital type word, the risk content identification platform sets the element of the text content vector at the digital type word position to 1, and otherwise sets it to 0; the elements at the unit type word position and the behavior purpose type word position are determined in the same way, thereby obtaining the text content vector. It will be appreciated that the text content vector is a three-dimensional column vector in which each row corresponds to one of the three types of words, namely the digital type word, the unit type word, and the behavior purpose type word; which row corresponds to which type is not limited in this embodiment.
For example, if the risk content recognition platform determines that the content to be identified "permanent powder 2 corner 80000" includes the behavior purpose type word "powder", the digital type words "2" and "80000", and the unit type word "corner", it determines that the text content vector of the content to be identified is $(1, 1, 1)^T$.
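The construction of the text content vector can be sketched as below. The three word lists are tiny hypothetical lexicons built from the translated examples in this description, the row order (digital, unit, behavior purpose) is an assumption, and the input is assumed to be already split into words.

```python
# Hypothetical lexicons; a real deployment would use its own word lists.
NUMBER_WORDS = {"one", "million"}
UNIT_WORDS = {"coin", "corner"}
PURPOSE_WORDS = {"fill", "flush", "powder"}


def text_content_vector(words):
    """Return [digital, unit, purpose] flags, each 1 if such a word is present."""
    has_number = any(w.isdigit() or w in NUMBER_WORDS for w in words)
    has_unit = any(w in UNIT_WORDS for w in words)
    has_purpose = any(w in PURPOSE_WORDS for w in words)
    return [int(has_number), int(has_unit), int(has_purpose)]


vector = text_content_vector(["permanent", "powder", "2", "corner", "80000"])
```

For the example content the three flags are all set, matching the vector $(1, 1, 1)^T$; content containing none of the preset type words yields the zero vector.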
S208, inputting the text content vector into a second risk content identification model.
The second risk content recognition model is obtained through training based on a second sample set and content labels corresponding to text content vectors of each sample content in the second sample set, and the second risk content recognition model comprises weight vectors.
Before executing step S208, the risk content recognition platform trains on the second sample set and the content labels corresponding to the text content vectors of each sample content in the second sample set to obtain the second risk content recognition model.
Optionally, the method further comprises:
determining a text content vector of each piece of sample content according to whether the content of each piece of sample in the second sample set contains the preset type word;
training an initial logistic regression model according to the text content vector of each sample content and the content label of each sample content to obtain a first logistic regression model and a predicted content label of each sample content;
Adjusting the first logistic regression model according to the content label of each piece of sample content and the predicted content label of each piece of sample content;
and when the adjusted first logistic regression model meets the convergence condition, determining the adjusted first logistic regression model as the second risk content identification model.
Specifically, if the sample content contains a digital type word, the risk content identification platform sets the element of the text content vector of the sample content at the digital type word position to 1, and otherwise sets it to 0; the elements at the unit type word position and the behavior purpose type word position are determined in the same way, thereby obtaining the text content vector of each sample content in the second sample set. Let the initial weight coefficients of the digital type word $x_1$, the unit type word $x_2$, and the behavior purpose type word $x_3$ be $w_{10}$, $w_{20}$, and $w_{30}$, respectively. The initial prediction function is then $f_0(x) = w_{10} x_1 + w_{20} x_2 + w_{30} x_3$. It can be appreciated that the initial prediction function is the scalar product obtained by multiplying the transposed initial weight vector by the text content vector of the sample content, where the initial weight vector is $(w_{10}, w_{20}, w_{30})^T$ and the text content vector of the sample content is $(x_1, x_2, x_3)^T$. The initial logistic regression model $g(z) = 1/(1 + e^{-z})$ is thereby obtained, where $z = f_0(x)$.
Training an initial logistic regression model according to the text content vector of each sample content and the content label of each sample content in the second sample set to obtain a first logistic regression model and the prediction classification probability of each sample content, if the prediction classification probability of the sample content is greater than a second preset probability threshold, determining the prediction content label of the sample content as risk content, otherwise, determining the prediction content label as normal content, obtaining the prediction content label of each sample content in this way, and adjusting the first logistic regression model according to the content label of each sample content and the prediction content label of the sample content. It may be understood that the risk content recognition platform adjusts the first logistic regression model by adjusting the first weight vector in the first logistic regression model until the value of the loss function of the adjusted first logistic regression model is smaller than the preset threshold, which indicates that the adjusted first logistic regression model reaches the convergence condition, determines the adjusted first logistic regression model as the second risk content recognition model, and the second risk content recognition model includes the weight vector.
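The adjustment of the weight vector until the loss falls below a preset threshold can be sketched with plain gradient descent. The description does not name the optimizer or the loss function, so gradient descent on the average logarithmic loss, the learning rate, the loss threshold, and the four training samples below are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(weights, vector):
    return sum(w * x for w, x in zip(weights, vector))


def log_loss(weights, vectors, labels):
    """Average logarithmic loss of the model g(w . x) over the sample set."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(vectors, labels):
        p = sigmoid(dot(weights, x))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(vectors)


def train(vectors, labels, learning_rate=0.5, loss_threshold=0.3, max_rounds=5000):
    """Adjust the weight vector until the loss is below the threshold (assumed rule)."""
    weights = [0.0, 0.0, 0.0]  # initial weight vector (w10, w20, w30)
    for _ in range(max_rounds):
        if log_loss(weights, vectors, labels) < loss_threshold:
            break
        gradient = [0.0, 0.0, 0.0]
        for x, y in zip(vectors, labels):
            error = sigmoid(dot(weights, x)) - y
            for j in range(3):
                gradient[j] += error * x[j]
        weights = [
            w - learning_rate * g / len(vectors) for w, g in zip(weights, gradient)
        ]
    return weights


# Hypothetical second sample set: label 1 is risk content, 0 is normal content.
vectors = [[1, 1, 1], [1, 0, 0], [0, 1, 0], [1, 1, 1]]
labels = [1, 0, 0, 1]
weights = train(vectors, labels)
```

Because the prediction function has no constant term, a sample whose vector is all zeros always receives probability 0.5, which is one reason the toy data above avoids such samples.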
S209, calculating a second predicted value of the content to be identified according to the text content vector and the weight vector, and converting the second predicted value into probability to obtain second classification probability.
Specifically, the risk content recognition platform multiplies the transposed weight vector by the text content vector to obtain a scalar, which is the second predicted value of the content to be identified, and then calculates the second classification probability as $1/(1 + e^{-z})$, where z is the second predicted value.
S210, if the second classification probability is greater than a second preset probability threshold, outputting the content to be identified as risk content.
Wherein the second preset probability threshold may be any number greater than or equal to 0.5 and less than 1.
Specifically, if the second classification probability is greater than the second preset probability threshold, the risk content identification platform outputs the content to be identified as risk content and displays the output result on a screen; after manual review and confirmation, action is taken against the risk content and the user corresponding to the risk content.
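Steps S209 and S210 amount to a scalar product, a logistic conversion, and a threshold comparison. The sketch below uses a hypothetical trained weight vector and a threshold of 0.5, the smallest value the description allows.

```python
import math


def second_classification_probability(weight_vector, text_vector):
    """Second predicted value z = w . x, converted to a probability 1 / (1 + e^-z)."""
    z = sum(w * x for w, x in zip(weight_vector, text_vector))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weight vector and the text content vector (1, 1, 1)^T.
probability = second_classification_probability([0.5, 1.0, 1.5], [1, 1, 1])
is_risk = probability > 0.5  # second preset probability threshold (assumed 0.5)
```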
In the embodiment of the application, when the confusion value of the content to be identified is larger than the preset confusion threshold, the risk content identification platform inputs the at least one behavior characteristic value, the confusion value of the content to be identified, and the sentence length value into the first risk content identification model. This takes into account not only the confusion value and the sentence length value of the content to be identified but also the dynamic behavior characteristic values of the user corresponding to the content to be identified, thereby improving the accuracy of identifying risk content.
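The branching between the two models in the fig. 6 flow can be summarized in a few lines. The function and parameter names are hypothetical; the two probability arguments are passed as zero-argument callables so that only the model selected by the confusion value is actually evaluated.

```python
def is_risk_content(confusion_value, confusion_threshold,
                    first_probability, second_probability,
                    first_threshold=0.5, second_threshold=0.5):
    """Route by confusion value, then compare the chosen model's probability."""
    if confusion_value > confusion_threshold:
        # Steps S203 to S206: behavior features, first model, first threshold.
        return first_probability() > first_threshold
    # Steps S207 to S210: text content vector, second model, second threshold.
    return second_probability() > second_threshold


# Hypothetical model outputs for two pieces of content.
high_confusion = is_risk_content(120.0, 100.0, lambda: 0.9, lambda: 0.1)
low_confusion = is_risk_content(50.0, 100.0, lambda: 0.9, lambda: 0.1)
```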
Based on the description of the method embodiments, the embodiments of the present application further provide a risk content recognition device, which may be a computer program (including program code) running in a risk content recognition platform, and the risk content recognition device may be a risk content recognition platform; fig. 7 is a schematic structural diagram of a risk content identifying apparatus according to an embodiment of the present application. As shown in fig. 7, the risk content recognition apparatus 7 may include: a split intercept module 71, a calculation confusion degree value module 72, a user behavior data acquisition module 73, a calculation behavior characteristic value module 74, a first classification probability determination module 75 and a first judgment output module 76.
The splitting and intercepting module 71 is configured to split the content to be identified into a plurality of words, and intercept the plurality of words according to a preset window length to obtain a plurality of word combinations;
a confusion-value calculating module 72, configured to calculate, according to occurrence probability of each word combination in the plurality of word combinations in a corresponding preset phrase set, a sentence probability of the content to be identified, and obtain a confusion value of the content to be identified based on the sentence probability, where the preset phrase set is determined based on a corpus;
A user behavior data obtaining module 73, configured to obtain user behavior data of at least one behavior type of the user corresponding to the content to be identified if the confusion value is greater than a preset confusion threshold;
a behavior feature value calculating module 74, configured to obtain at least one behavior feature value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type; the at least one behavior type comprises a first behavior type, and the user behavior data comprises a behavior type and a behavior time;
a first classification probability determining module 75, configured to input the at least one behavior feature value, the confusion value, and the statement length value of the content to be identified into a first risk content identification model to obtain a first classification probability, where the first risk content identification model is obtained by training based on a first sample set and the confusion value, the statement length value, and the content label corresponding to the at least one behavior feature value of the user corresponding to each sample content in the first sample set;
the first judgment output module 76 is configured to output the content to be identified as risk content if the first classification probability is greater than a first preset probability threshold.
Optionally, the apparatus further includes:
a determining module 77, configured to determine a text content vector of the content to be identified according to whether the content to be identified includes a preset type word if the confusion value is less than or equal to the preset confusion threshold, where the preset type word includes a digital type word, a unit type word, and a behavior destination type word;
the input module 78 is configured to input the text content vector into a second risk content recognition model, so that the second risk content recognition model calculates a second predicted value of the content to be recognized according to the text content vector and the weight vector, and outputs a second classification probability according to the second predicted value; the second risk content identification model is obtained by training based on a second sample set and content labels corresponding to text content vectors of each sample content in the second sample set, and comprises weight vectors;
and a second judging and outputting module 79, configured to output the content to be identified as risk content if the second classification probability is greater than a second preset probability threshold.
Optionally, the apparatus further includes: the second model determination module 710.
The second model determining module 710 is specifically configured to:
determining a text content vector of each piece of sample content according to whether the content of each piece of sample in the second sample set contains the preset type word;
training an initial logistic regression model according to the text content vector of each sample content and the content label of each sample content to obtain a first logistic regression model and a predicted content label of each sample content;
adjusting the first logistic regression model according to the content label of each piece of sample content and the predicted content label of each piece of sample content;
and when the adjusted first logistic regression model meets the convergence condition, determining the adjusted first logistic regression model as the second risk content identification model.
Optionally, the calculating confusion value module 72 is specifically configured to:
multiplying the occurrence probability of each word combination in the word combinations in the corresponding preset phrase set to obtain the sentence probability of the content to be identified, and calculating the reciprocal of the sentence probability to obtain the confusion value of the content to be identified.
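The confusion value calculation performed by module 72 can be sketched as follows. Using a sliding window with a step of one word for the word combinations is an assumption, and the probabilities are invented; the confusion value is taken as the plain reciprocal of the sentence probability, as stated here, without any length normalization.

```python
import math


def word_combinations(words, window_length):
    """Assumed interception rule: a window sliding one word at a time."""
    return [
        tuple(words[i:i + window_length])
        for i in range(len(words) - window_length + 1)
    ]


def confusion_value(combination_probabilities):
    """Reciprocal of the product of the word-combination probabilities."""
    sentence_probability = math.prod(combination_probabilities)
    if sentence_probability == 0.0:
        return math.inf  # a combination never seen in the phrase set
    return 1.0 / sentence_probability


combinations = word_combinations(["a", "b", "c"], 2)
value = confusion_value([0.5, 0.25])
```

A sentence built from rare word combinations has a small sentence probability and therefore a large confusion value, which is what routes it to the first risk content identification model.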
Optionally, the calculate behavior feature value module 74 includes:
A grouping calculation unit 741, configured to divide user behavior data of any behavior type into a plurality of groups of user behavior data for any behavior type, and calculate a time interval between every two adjacent groups of user behavior data in the plurality of groups of user behavior data, so as to obtain a plurality of groups of time interval data;
a variance calculating unit 742, configured to obtain a variance of attenuation of each set of the time interval data according to each set of the time interval data and the attenuation coefficient of each time interval data in each set of the time interval data;
and the behavior characteristic value calculating unit 743 is used for carrying out weighted calculation on the attenuation variance of each group of time interval data and the weight coefficient of each group of time interval data to obtain the behavior characteristic value of any behavior type.
Optionally, the first risk content identification model includes k+1 regression trees and learning coefficients of each of the k+1 regression trees, the k+1 regression trees include a first regression tree and k non-first regression trees, and each of the k non-first regression trees includes a root node, an internal node, and a leaf node, where k is an integer greater than or equal to 1; the k+1 regression trees are trained based on the first sample set;
The first classification probability determination module 75 includes:
a leaf node position determining unit 751, configured to determine a leaf node position of the content to be identified in each non-first regression tree according to the at least one behavior feature value, the confusion degree value, the sentence length value, the root node and the internal node of each non-first regression tree;
a predicted value calculating unit 752, configured to obtain a first predicted value of the content to be identified according to the predicted value of the first regression tree, the output value at the leaf node position of each non-first regression tree, and the learning coefficient of each k+1 regression tree;
and a classification probability calculating unit 753, configured to convert the first predicted value into a probability, and obtain the first classification probability.
Optionally, the k non-first regression trees include a j-th regression tree, where j is an integer greater than 1 and less than or equal to the k+1;
the first classification probability determination module 75 further includes:
a first regression tree construction unit 754, configured to construct the first regression tree according to the number of risk contents and the number of normal contents in the first sample set, where the first regression tree includes a 1 st predicted value of each sample content in the first sample feature set;
A 1 st residual computing unit 755, configured to obtain a 1 st classification probability of each piece of sample content according to the 1 st predicted value of each piece of sample content, and obtain a 1 st residual of each piece of sample content based on the 1 st classification probability of each piece of sample content and the content label of each piece of sample content;
a kth+1 regression tree construction unit 756, configured to construct the jth regression tree according to the at least one behavior feature value, the confusion value, the sentence length value, and the jth-1 residual of each sample content to obtain the k non-first regression tree, where the jth-1 residual of each sample content is determined based on the jth-1 classification probability of each sample content obtained according to the jth-1 predicted value of each sample content and the content label of each sample content, and the jth-1 predicted value of each sample content is determined according to the first regression tree to the jth-1 regression tree;
a k+1th predicted value calculating unit 757, configured to obtain a k+1th predicted value of each sample content according to the first regression tree and the k non-first regression trees;
A k+1th residual error unit 758, configured to obtain a k+1th classification probability of each piece of sample content according to the k+1th predicted value of each piece of sample content, and obtain a k+1th residual error of each piece of sample content based on the k+1th classification probability of each piece of sample content and the content label of each piece of sample content;
and the first model determining unit 759 is configured to obtain the first risk content identification model according to the k+1 regression trees and the learning coefficients of each of the k+1 regression trees when the k+1 residual error of each sample content satisfies the convergence condition.
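The work of units 754 and 755 can be illustrated with a short sketch. The description only says that the first regression tree is built from the numbers of risk and normal contents; taking the 1st predicted value as the logarithm of their ratio is the usual gradient boosting initialization and is an assumption here, as are the function names and the four labels.

```python
import math


def first_predicted_value(risk_count, normal_count):
    """Assumed 1st predicted value: log of (risk contents / normal contents)."""
    return math.log(risk_count / normal_count)


def classification_probability(z):
    return math.exp(z) / (1.0 + math.exp(z))


def first_residuals(labels, z):
    """1st residual of each sample: content label minus 1st classification probability."""
    p = classification_probability(z)
    return [y - p for y in labels]


# Hypothetical first sample set: 1 is risk content, 0 is normal content.
labels = [1, 0, 0, 0]
z1 = first_predicted_value(labels.count(1), labels.count(0))
r1 = first_residuals(labels, z1)
```

With this choice the 1st classification probability equals the share of risk content in the sample set, so every sample starts from the same baseline before the non-first regression trees are fitted to the residuals.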
It will be appreciated that the risk content identification means 7 is arranged to implement the steps performed by the risk content identification platform in the embodiments of fig. 2 and 6. Regarding the specific implementation and corresponding advantageous effects of the functional blocks included in the risk content recognition apparatus 7 of fig. 7, reference may be made to the foregoing specific description of the embodiments of fig. 2 and 6, which are not repeated here.
The risk content recognition apparatus 7 in the embodiment shown in fig. 7 described above may be implemented as the server 800 shown in fig. 8. Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 8, the server 800 may include: one or more processors 801, a memory 802, and a transceiver 803. The processor 801, the memory 802, and the transceiver 803 are connected through a bus 804. The transceiver 803 is configured to receive or transmit data, and the memory 802 is configured to store a computer program, the computer program including program instructions; the processor 801 is configured to execute the program instructions stored in the memory 802 and perform the following operations:
Splitting the content to be identified into a plurality of words, and intercepting the words according to the preset window length to obtain a plurality of word combinations;
according to the occurrence probability of each word combination in the word combinations in the corresponding preset phrase sets, calculating to obtain the sentence probability of the content to be identified, and obtaining the confusion value of the content to be identified based on the sentence probability, wherein the preset phrase sets are determined based on a corpus;
if the confusion degree value is larger than a preset confusion degree threshold value, user behavior data of at least one behavior type of the user corresponding to the content to be identified are obtained;
obtaining at least one behavior characteristic value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type;
inputting the at least one behavior characteristic value, the confusion value and the statement length value of the content to be identified into a first risk content identification model to obtain a first classification probability, wherein the first risk content identification model is obtained by training based on a first sample set and the confusion value, the statement length value of each sample content in the first sample set and a content label corresponding to the at least one behavior characteristic value of a user corresponding to each sample content;
And if the first classification probability is larger than a first preset probability threshold, outputting the content to be identified as risk content.
Optionally, the processor 801 further performs the following operations:
if the confusion degree value is smaller than or equal to the preset confusion degree threshold value, determining a text content vector of the content to be identified according to whether the content to be identified contains preset type words, wherein the preset type words comprise numeric type words, unit type words and behavior purpose type words;
inputting the text content vector into a second risk content identification model, so that the second risk content identification model calculates a second predicted value of the content to be identified according to the text content vector and a weight vector, and outputs a second classification probability according to the second predicted value; the second risk content identification model is trained on a second sample set, using the text content vector and the corresponding content label of each sample content in the second sample set, and comprises the weight vector;
and if the second classification probability is larger than a second preset probability threshold, outputting the content to be identified as risk content.
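The low-confusion branch above can be sketched as follows, purely as an illustration. The sketch assumes one indicator per preset word type (number, unit, behavior purpose) and a logistic (sigmoid) mapping from the weighted sum to a probability; the word lists, the bias term, and the function names are hypothetical and do not come from the embodiments.

```python
import math
from typing import List, Mapping, Sequence, Set

# Hypothetical word lists; the text names three word types but not their members.
PRESET_TYPE_WORDS: Mapping[str, Set[str]] = {
    "number": {"1", "2", "100"},
    "unit": {"yuan", "kg"},
    "purpose": {"buy", "sell"},
}


def text_content_vector(
    words: Sequence[str],
    preset: Mapping[str, Set[str]] = PRESET_TYPE_WORDS,
) -> List[float]:
    """One indicator per word type: 1.0 if any word of that type is present."""
    return [1.0 if any(w in vocab for w in words) else 0.0
            for vocab in preset.values()]


def second_classification_probability(
    vector: Sequence[float], weights: Sequence[float], bias: float = 0.0
) -> float:
    """Weighted sum (the second predicted value) passed through a sigmoid."""
    predicted = bias + sum(w * x for w, x in zip(weights, vector))
    return 1.0 / (1.0 + math.exp(-predicted))
```

With these hypothetical lists, `text_content_vector(["buy", "100", "apples"])` returns `[1.0, 0.0, 1.0]`; the resulting probability is then compared with the second preset probability threshold.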
Optionally, the processor 801 further performs the following operations:
determining a text content vector of each piece of sample content according to whether each piece of sample content in the second sample set contains the preset type words;
training an initial logistic regression model according to the text content vector of each sample content and the content label of each sample content to obtain a first logistic regression model and a predicted content label of each sample content;
adjusting the first logistic regression model according to the content label of each piece of sample content and the predicted content label of each piece of sample content;
and when the adjusted first logistic regression model meets the convergence condition, determining the adjusted first logistic regression model as the second risk content identification model.
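One possible reading of the training steps above is ordinary gradient descent on the log loss, stopping when the loss stops improving. The sketch below follows that reading; the learning rate, the tolerance used as the convergence condition, and all names are assumptions for illustration, since the embodiments do not fix them.

```python
import math
from typing import List, Sequence, Tuple


def predict_proba(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Predicted probability that a text content vector is risk content."""
    return 1.0 / (1.0 + math.exp(-(b + sum(wi * xi for wi, xi in zip(w, x)))))


def train_logistic_regression(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.5,        # assumed step size
    tol: float = 1e-6,      # assumed convergence tolerance on the loss
    max_iter: int = 10000,
) -> Tuple[List[float], float]:
    """Adjust weights from the gap between content labels and predictions."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    prev_loss = float("inf")
    eps = 1e-12
    for _ in range(max_iter):
        preds = [predict_proba(x, w, b) for x in X]
        loss = -sum(
            yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
            for yi, p in zip(y, preds)
        ) / n
        if abs(prev_loss - loss) < tol:  # convergence condition (assumed form)
            break
        prev_loss = loss
        errors = [p - yi for p, yi in zip(preds, y)]
        for j in range(d):
            w[j] -= lr * sum(e * x[j] for e, x in zip(errors, X)) / n
        b -= lr * sum(errors) / n
    return w, b
```

The returned weight vector plays the role of the weight vector held by the second risk content identification model.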
Optionally, when calculating the sentence probability of the content to be identified according to the occurrence probability of each word combination in the plurality of word combinations in the respective corresponding preset phrase set, and obtaining the confusion value of the content to be identified based on the sentence probability, the processor 801 specifically performs the following operations:
multiplying the occurrence probability of each word combination in the word combinations in the corresponding preset phrase set to obtain the sentence probability of the content to be identified, and calculating the reciprocal of the sentence probability to obtain the confusion value of the content to be identified.
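The calculation just described can be sketched as below. The sketch follows the text literally (product of the occurrence probabilities, then the reciprocal) and does not apply the length normalization used in some perplexity definitions. Modeling the preset phrase sets as a single lookup table and assigning unseen combinations a small floor probability are assumptions made for illustration.

```python
from typing import Iterable, Mapping, Tuple

Combo = Tuple[str, ...]


def sentence_probability(
    combos: Iterable[Combo],
    phrase_probs: Mapping[Combo, float],
    floor: float = 1e-6,  # assumed probability for combinations not in the table
) -> float:
    """Multiply the occurrence probabilities of all word combinations."""
    p = 1.0
    for combo in combos:
        p *= phrase_probs.get(combo, floor)
    return p


def confusion_value(
    combos: Iterable[Combo],
    phrase_probs: Mapping[Combo, float],
    floor: float = 1e-6,
) -> float:
    """Reciprocal of the sentence probability."""
    return 1.0 / sentence_probability(combos, phrase_probs, floor)
```

For a table `{("a", "b"): 0.5, ("b", "c"): 0.25}` and the combinations `[("a", "b"), ("b", "c")]`, the sentence probability is 0.125 and the confusion value is 8.0. For long content, summing logarithms instead of multiplying would avoid floating-point underflow.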
Optionally, the user behavior data includes a behavior type and a behavior time;
when obtaining the at least one behavior characteristic value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type, the processor 801 specifically performs the following operations:
for any behavior type, dividing the user behavior data of that behavior type into a plurality of groups of user behavior data, and calculating the time interval between every two adjacent pieces of user behavior data in each group to obtain a plurality of groups of time interval data;
obtaining the attenuation variance of each group of time interval data according to each group of time interval data and the attenuation coefficient of each time interval data in each group of time interval data;
and carrying out weighted calculation on the attenuation variance of each group of time interval data and the weight coefficient of each group of time interval data to obtain the behavior characteristic value of any behavior type.
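The exact formulas for the attenuation coefficient and the attenuation variance are not given in this passage, so the following sketch is only one plausible instantiation: it assumes an exponential attenuation coefficient that favors recent intervals and a coefficient-weighted variance. The `decay` parameter, the grouping supplied by the caller, and all function names are hypothetical.

```python
from typing import List, Sequence


def time_intervals(timestamps: Sequence[float]) -> List[float]:
    """Intervals between adjacent behavior times within one group."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]


def attenuated_variance(intervals: Sequence[float], decay: float = 0.9) -> float:
    """Weighted variance; older intervals get smaller coefficients (assumed form)."""
    n = len(intervals)
    if n == 0:
        return 0.0
    coeffs = [decay ** (n - 1 - i) for i in range(n)]  # most recent gets 1.0
    total = sum(coeffs)
    mean = sum(c * x for c, x in zip(coeffs, intervals)) / total
    return sum(c * (x - mean) ** 2 for c, x in zip(coeffs, intervals)) / total


def behavior_feature_value(
    groups: Sequence[Sequence[float]],
    group_weights: Sequence[float],
    decay: float = 0.9,
) -> float:
    """Weighted sum of the per-group attenuation variances for one behavior type."""
    return sum(
        w * attenuated_variance(time_intervals(g), decay)
        for g, w in zip(groups, group_weights)
    )
```

Under these assumptions, perfectly regular behavior times (for example, one action every 10 seconds) give a feature value near zero, while irregular timing gives a larger value.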
Optionally, the first risk content identification model includes k+1 regression trees and learning coefficients of each of the k+1 regression trees, the k+1 regression trees include a first regression tree and k non-first regression trees, and each of the k non-first regression trees includes a root node, an internal node, and a leaf node, where k is an integer greater than or equal to 1; the k+1 regression trees are trained based on the first sample set;
when inputting the at least one behavior characteristic value, the confusion value and the statement length value into the first risk content identification model to obtain the first classification probability, the processor 801 specifically performs the following operations:
determining the leaf node position of the content to be identified in each non-first regression tree according to the at least one behavior characteristic value, the confusion degree value, the statement length value, the root node and the internal node of each non-first regression tree;
obtaining a first predicted value of the content to be identified according to the predicted value of the first regression tree, the output value of the leaf node position of each non-first regression tree and the learning coefficient of each regression tree in the k+1 regression trees;
and converting the first predicted value into probability to obtain the first classification probability.
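The inference steps above resemble prediction with gradient-boosted regression trees, and can be sketched as follows. The sketch assumes that the first predicted value is a coefficient-weighted sum of the tree outputs and that the conversion to a probability is a sigmoid; the `Node` structure, the `<=` split rule, and the feature ordering are illustrative assumptions rather than details taken from the embodiments.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    feature: Optional[int] = None   # None marks a leaf node
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0              # output value, used only at leaf nodes


def leaf_output(root: Node, x: Sequence[float]) -> float:
    """Walk from the root through internal nodes to the leaf for sample x."""
    node = root
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


def first_classification_probability(
    x: Sequence[float],             # [behavior feature(s)..., confusion, length]
    first_tree_value: float,        # predicted value of the first regression tree
    trees: Sequence[Node],          # the k non-first regression trees
    coeffs: Sequence[float],        # k+1 learning coefficients, first tree first
) -> float:
    score = coeffs[0] * first_tree_value
    for coeff, tree in zip(coeffs[1:], trees):
        score += coeff * leaf_output(tree, x)
    return 1.0 / (1.0 + math.exp(-score))   # assumed sigmoid conversion
```

For instance, with a single non-first tree that splits on the confusion value at 5.0 (leaf outputs -1.0 and 2.0), a first-tree value of 0.0, coefficients `[1.0, 0.5]`, and a sample whose confusion value is 8.0, the first predicted value is 1.0 and the probability is about 0.73.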
Optionally, the k non-first regression trees include a j-th regression tree, where j is an integer greater than 1 and less than or equal to the k+1;
the above-described processor 801 also performs the following operations:
constructing the first regression tree according to the number of risk contents and the number of normal contents in the first sample set, wherein the first regression tree comprises the 1st predicted value of each sample content in the first sample set;
obtaining the 1st classification probability of each piece of sample content according to the 1st predicted value of each piece of sample content, and obtaining the 1st residual error of each piece of sample content based on the 1st classification probability of each piece of sample content and the content label of each piece of sample content;
constructing the j-th regression tree according to the at least one behavior characteristic value, the confusion value, the statement length value and the (j-1)-th residual error of each sample content, so as to obtain the k non-first regression trees, wherein the (j-1)-th residual error of each sample content is determined based on the (j-1)-th classification probability of each sample content and the content label of each sample content, the (j-1)-th classification probability of each sample content is obtained according to the (j-1)-th predicted value of each sample content, and the (j-1)-th predicted value of each sample content is determined according to the first regression tree through the (j-1)-th regression tree;
obtaining the (k+1)-th predicted value of each sample content according to the first regression tree and the k non-first regression trees;
obtaining the (k+1)-th classification probability of each piece of sample content according to the (k+1)-th predicted value of each piece of sample content, and obtaining the (k+1)-th residual error of each piece of sample content based on the (k+1)-th classification probability of each piece of sample content and the content label of each piece of sample content;
And when the k+1th residual error of each sample content meets a convergence condition, obtaining the first risk content identification model according to the k+1 regression trees and the learning coefficient of each regression tree in the k+1 regression trees.
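The construction process above can be sketched, under several simplifying assumptions, as gradient boosting on residuals. In this sketch the first regression tree is a constant equal to the log-odds of risk versus normal samples, each non-first tree is a depth-1 tree (a stump), a single shared learning coefficient is used, and convergence means every residual falls below a tolerance. None of these specifics, nor the function names, are stated in the embodiments.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, float, float]  # feature, threshold, left value, right value


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals, probs) -> Stump:
    """Pick the split minimizing squared error on the residuals."""
    def leaf(idx):
        den = sum(probs[i] * (1 - probs[i]) for i in idx)
        return sum(residuals[i] for i in idx) / den if den > 1e-12 else 0.0

    best, best_err = None, math.inf
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X})[:-1]:  # keeps the right side non-empty
            sides = ([i for i, x in enumerate(X) if x[f] <= t],
                     [i for i, x in enumerate(X) if x[f] > t])
            err = 0.0
            for idx in sides:
                m = sum(residuals[i] for i in idx) / len(idx)
                err += sum((residuals[i] - m) ** 2 for i in idx)
            if err < best_err:
                best, best_err = (f, t, sides[0], sides[1]), err
    if best is None:  # every feature is constant: fall back to a single leaf
        return (0, math.inf, leaf(range(len(X))), 0.0)
    f, t, left, right = best
    return (f, t, leaf(left), leaf(right))


def train_boosted_trees(X, y: Sequence[int], k: int = 10,
                        lr: float = 0.3, tol: float = 1e-3):
    n_risk = sum(y)
    n_normal = len(y) - n_risk
    if n_risk == 0 or n_normal == 0:
        raise ValueError("both risk and normal samples are required")
    f0 = math.log(n_risk / n_normal)       # first tree: constant log-odds
    scores = [f0] * len(y)
    stumps: List[Stump] = []
    for _ in range(k):
        probs = [_sigmoid(s) for s in scores]
        residuals = [yi - p for yi, p in zip(y, probs)]
        if max(abs(r) for r in residuals) < tol:   # assumed convergence test
            break
        f, t, lv, rv = stump = fit_stump(X, residuals, probs)
        stumps.append(stump)
        scores = [s + lr * (lv if x[f] <= t else rv) for s, x in zip(scores, X)]
    return f0, stumps, lr


def predict_risk_probability(x, f0: float, stumps, lr: float) -> float:
    score = f0 + lr * sum(lv if x[f] <= t else rv for f, t, lv, rv in stumps)
    return _sigmoid(score)
```

Each row of `X` is assumed to hold the behavior characteristic value(s), the confusion value and the statement length value of one sample content, and `y` holds the content labels (1 for risk content, 0 for normal content).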
It should further be noted that the embodiments of the present application also provide a computer-readable storage medium storing the computer program executed by the aforementioned risk content identification apparatus 7. The computer program includes program instructions which, when executed by a processor, carry out the risk content identification method described in the embodiment corresponding to fig. 2 or fig. 6; that description is therefore not repeated here, nor is the description of the beneficial effects of the same method. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, refer to the description of the method embodiments of the present application. As an example, the program instructions may be deployed for execution on one computing device, on multiple computing devices at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network; the multiple computing devices distributed across multiple sites and interconnected by a communication network may constitute a blockchain system.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program, which may be stored on a computer-readable storage medium and which, when executed, may carry out the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The method and related apparatus provided in the embodiments of the present application are described with reference to the flowcharts and/or schematic structural diagrams provided in the embodiments of the present application. Each flow and/or block of the flowcharts and/or schematic structural diagrams, and combinations of flows and/or blocks therein, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or structural diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in the flowchart flow or flows and/or structural diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or structural diagram block or blocks.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.
Claims (10)
1. A risk content identification method, comprising:
splitting the content to be identified into a plurality of words, and intercepting the words according to the preset window length to obtain a plurality of word combinations;
according to the occurrence probability of each word combination in the word combinations in the corresponding preset phrase sets, calculating to obtain the sentence probability of the content to be identified, and obtaining the confusion value of the content to be identified based on the sentence probability, wherein the preset phrase sets are determined based on a corpus;
if the confusion degree value is larger than a preset confusion degree threshold value, user behavior data of at least one behavior type of the user corresponding to the content to be identified are obtained;
obtaining at least one behavior characteristic value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type;
inputting the at least one behavior characteristic value, the confusion value and the statement length value of the content to be identified into a first risk content identification model to obtain a first classification probability, wherein the first risk content identification model is trained on a first sample set, using, for each sample content in the first sample set, its confusion value, its statement length value, the at least one behavior characteristic value of the user corresponding to that sample content, and the corresponding content label;
And if the first classification probability is larger than a first preset probability threshold, outputting the content to be identified as risk content.
2. The method according to claim 1, wherein the method further comprises:
if the confusion degree value is smaller than or equal to the preset confusion degree threshold value, determining a text content vector of the content to be identified according to whether the content to be identified contains preset type words, wherein the preset type words comprise numeric type words, unit type words and behavior purpose type words;
inputting the text content vector into a second risk content identification model, so that the second risk content identification model calculates a second predicted value of the content to be identified according to the text content vector and a weight vector, and outputs a second classification probability according to the second predicted value; the second risk content identification model is trained on a second sample set, using the text content vector and the corresponding content label of each sample content in the second sample set, and comprises the weight vector;
and if the second classification probability is larger than a second preset probability threshold, outputting the content to be identified as risk content.
3. The method according to claim 2, wherein the method further comprises:
determining a text content vector of each piece of sample content according to whether each piece of sample content in the second sample set contains the preset type words;
training an initial logistic regression model according to the text content vector of each sample content and the content label of each sample content to obtain a first logistic regression model and a predicted content label of each sample content;
adjusting the first logistic regression model according to the content label of each piece of sample content and the predicted content label of each piece of sample content;
and when the adjusted first logistic regression model meets the convergence condition, determining the adjusted first logistic regression model as the second risk content identification model.
4. The method of claim 1, wherein the calculating to obtain the sentence probability of the content to be identified according to the occurrence probability of each word combination in the plurality of word combinations in the respective corresponding preset phrase set, and obtaining the confusion value of the content to be identified based on the sentence probability, includes:
Multiplying the occurrence probability of each word combination in the word combinations in the corresponding preset phrase set to obtain the sentence probability of the content to be identified, and calculating the reciprocal of the sentence probability to obtain the confusion value of the content to be identified.
5. The method of claim 1, wherein the user behavior data comprises a behavior type and a behavior time;
the obtaining at least one behavior characteristic value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type includes:
for any behavior type, dividing the user behavior data of that behavior type into a plurality of groups of user behavior data, and calculating the time interval between every two adjacent pieces of user behavior data in each group to obtain a plurality of groups of time interval data;
obtaining the attenuation variance of each group of time interval data according to each group of time interval data and the attenuation coefficient of each time interval data in each group of time interval data;
and carrying out weighted calculation on the attenuation variance of each group of time interval data and the weight coefficient of each group of time interval data to obtain the behavior characteristic value of any behavior type.
6. The method of claim 1, wherein the first risk content identification model comprises k+1 regression trees and learning coefficients for each of the k+1 regression trees, the k+1 regression trees comprising a first regression tree and k non-first regression trees, each of the k non-first regression trees comprising a root node, an interior node, and a leaf node, wherein the k is an integer greater than or equal to 1; the k+1 regression trees are trained based on the first sample set;
the step of inputting the at least one behavior characteristic value, the confusion degree value and the sentence length value into a first risk content identification model to obtain a first classification probability, comprising:
determining the leaf node position of the content to be identified in each non-first regression tree according to the at least one behavior characteristic value, the confusion degree value, the statement length value, the root node and the internal node of each non-first regression tree;
obtaining a first predicted value of the content to be identified according to the predicted value of the first regression tree, the output value of the leaf node position of each non-first regression tree and the learning coefficient of each regression tree in the k+1 regression trees;
And converting the first predicted value into probability to obtain the first classification probability.
7. The method of claim 6, wherein the k non-first regression trees comprise a j-th regression tree, the j being an integer greater than 1 and less than or equal to the k+1;
the method further comprises the construction process of the first risk content identification model:
constructing the first regression tree according to the number of risk contents and the number of normal contents in the first sample set, wherein the first regression tree comprises the 1st predicted value of each sample content in the first sample set;
obtaining the 1st classification probability of each piece of sample content according to the 1st predicted value of each piece of sample content, and obtaining the 1st residual error of each piece of sample content based on the 1st classification probability of each piece of sample content and the content label of each piece of sample content;
constructing the j-th regression tree according to the at least one behavior characteristic value, the confusion value, the statement length value and the (j-1)-th residual error of each sample content, so as to obtain the k non-first regression trees, wherein the (j-1)-th residual error of each sample content is determined based on the (j-1)-th classification probability of each sample content and the content label of each sample content, the (j-1)-th classification probability of each sample content is obtained according to the (j-1)-th predicted value of each sample content, and the (j-1)-th predicted value of each sample content is determined according to the first regression tree through the (j-1)-th regression tree;
obtaining the (k+1)-th predicted value of each sample content according to the first regression tree and the k non-first regression trees;
obtaining the (k+1)-th classification probability of each piece of sample content according to the (k+1)-th predicted value of each piece of sample content, and obtaining the (k+1)-th residual error of each piece of sample content based on the (k+1)-th classification probability of each piece of sample content and the content label of each piece of sample content;
and when the (k+1)-th residual error of each sample content meets a convergence condition, obtaining the first risk content identification model according to the k+1 regression trees and the learning coefficient of each regression tree in the k+1 regression trees.
8. A risk content recognition apparatus, comprising:
the splitting and intercepting module is used for splitting the content to be identified into a plurality of words, and intercepting the words according to the preset window length to obtain a plurality of word combinations;
the confusion value calculating module is used for calculating statement probability of the content to be identified according to occurrence probability of each word combination in the plurality of word combinations in a corresponding preset phrase set, and obtaining a confusion value of the content to be identified based on the statement probability, wherein the preset phrase set is determined based on a corpus;
The acquisition module is used for acquiring user behavior data of at least one behavior type of the user corresponding to the content to be identified if the confusion degree value is larger than a preset confusion degree threshold value;
the behavior characteristic value calculating module is used for obtaining at least one behavior characteristic value corresponding to the at least one behavior type according to the user behavior data of the at least one behavior type;
the first classification probability determining module is used for inputting the at least one behavior characteristic value, the confusion degree value and the statement length value of the content to be identified into a first risk content identification model to obtain a first classification probability, wherein the first risk content identification model is trained on a first sample set, using, for each sample content in the first sample set, its confusion degree value, its statement length value, the at least one behavior characteristic value of the user corresponding to that sample content, and the corresponding content label;
and the first judging and outputting module is used for outputting the content to be identified as risk content if the first classification probability is larger than a first preset probability threshold value.
9. A server comprising a processor, a memory and a transceiver, the processor, the memory and the transceiver being interconnected, wherein the transceiver is configured to receive or transmit data, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the risk content identification method of any of claims 1-7.
10. A storage medium storing a computer program, the computer program comprising program instructions; the program instructions, when executed by a processor, cause the processor to perform the risk content identification method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010721587.0A CN111881293B (en) | 2020-07-24 | 2020-07-24 | Risk content identification method and device, server and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010721587.0A CN111881293B (en) | 2020-07-24 | 2020-07-24 | Risk content identification method and device, server and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111881293A (en) | 2020-11-03 |
CN111881293B (en) | 2023-11-07 |
Family
ID=73200201
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010721587.0A Active CN111881293B (en) | 2020-07-24 | 2020-07-24 | Risk content identification method and device, server and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111881293B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112579771B (en) * | 2020-12-08 | 2024-05-07 | 腾讯科技(深圳)有限公司 | Content title detection method and device |
CN112613501A (en) * | 2020-12-21 | 2021-04-06 | 深圳壹账通智能科技有限公司 | Information auditing classification model construction method and information auditing method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105930836A (en) * | 2016-04-19 | 2016-09-07 | 北京奇艺世纪科技有限公司 | Identification method and device of video text |
CN111144100A (en) * | 2019-12-24 | 2020-05-12 | 五八有限公司 | Question text recognition method and device, electronic equipment and storage medium |
WO2020113918A1 (en) * | 2018-12-06 | 2020-06-11 | 平安科技(深圳)有限公司 | Statement rationality determination method and apparatus based on semantic parsing, and computer device |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105930836A (en) * | 2016-04-19 | 2016-09-07 | 北京奇艺世纪科技有限公司 | Identification method and device of video text |
WO2020113918A1 (en) * | 2018-12-06 | 2020-06-11 | 平安科技(深圳)有限公司 | Statement rationality determination method and apparatus based on semantic parsing, and computer device |
CN111144100A (en) * | 2019-12-24 | 2020-05-12 | 五八有限公司 | Question text recognition method and device, electronic equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
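A domain-adaptive smoothing algorithm for Chinese N-gram language models; Jiang Minghu, Zhu Xiaoyan, Yuan Baozong; Journal of Tsinghua University (Science and Technology) (No. 09); full text * |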
Also Published As
Publication number | Publication date |
---|---|
CN111881293A (en) | 2020-11-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109783817B (en) | Text semantic similarity calculation model based on deep reinforcement learning | |
CN108829822B (en) | Media content recommendation method and device, storage medium and electronic device | |
CN112528637B (en) | Text processing model training method, device, computer equipment and storage medium | |
CN109816438B (en) | Information pushing method and device | |
CN109829162A (en) | A kind of text segmenting method and device | |
CN111881972B (en) | Black-out user identification method and device, server and storage medium | |
CN111881293B (en) | Risk content identification method and device, server and storage medium | |
CN111694940A (en) | User report generation method and terminal equipment | |
CN112434188B (en) | Data integration method, device and storage medium of heterogeneous database | |
CN107220384A (en) | A kind of search word treatment method, device and computing device based on correlation | |
CN113435208A (en) | Student model training method and device and electronic equipment | |
CN109992676B (en) | Cross-media resource retrieval method and retrieval system | |
CN115455171B (en) | Text video mutual inspection rope and model training method, device, equipment and medium | |
CN110598109A (en) | Information recommendation method, device, equipment and storage medium | |
CN114818729A (en) | Method, device and medium for training semantic recognition model and searching sentence | |
CN111930941A (en) | Method and device for identifying abuse content and server | |
CN108021544B (en) | Method and device for classifying semantic relation of entity words and electronic equipment | |
CN112214592A (en) | Reply dialogue scoring model training method, dialogue reply method and device | |
CN104572820B (en) | The generation method and device of model, importance acquisition methods and device | |
CN109002498B (en) | Man-machine conversation method, device, equipment and storage medium | |
CN111241843A (en) | Semantic relation inference system and method based on composite neural network | |
CN110717022A (en) | Robot dialogue generation method and device, readable storage medium and robot | |
CN110245230A (en) | A kind of books stage division, system, storage medium and server | |
WO2022063202A1 (en) | Text classification method, apparatus, device, and storage medium | |
CN113342974B (en) | Method, device and equipment for identifying overlapping relationship of network security entities |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||