CN112860849A - Abnormal text recognition method and device, computer equipment and storage medium

Info

Publication number
CN112860849A
Authority
CN
China
Prior art keywords
text
texts
abnormal
question
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110076225.5A
Other languages
Chinese (zh)
Other versions
CN112860849B (en)
Inventor
朱运
乔建秀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110076225.5A
Publication of CN112860849A
Application granted
Publication of CN112860849B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and provides an abnormal text recognition method and device, computer equipment and a storage medium. The abnormal text recognition method comprises: clustering a plurality of texts to be tested, and identifying a plurality of first problem texts among the texts to be tested according to the centroids obtained by clustering; calling an abnormal text recognition model to identify a plurality of second problem texts among the first problem texts; extracting a bag-of-words vector of each second problem text, and generating a problem text image based on the bag-of-words vectors; performing target detection on the problem text image by using a target detection algorithm to obtain a plurality of target detection boxes; and identifying abnormal texts among the second problem texts according to the target detection boxes. The invention can identify abnormal texts in batches, with good recognition results and high recognition efficiency.

Description

Abnormal text recognition method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an abnormal text recognition method and device, computer equipment and a storage medium.
Background
With the continuous development of the big data era, the number of texts on content platforms keeps growing. Rapidly detecting abnormal texts among tens of thousands of texts is therefore very important for a content platform: if a non-compliant text is exposed to users, the reputation of the content platform is seriously damaged.
The inventor has found that most existing content platforms build an abnormal word list and match texts against it to determine whether a text is abnormal. This approach requires abnormal words to be added manually and continuously, so the recognition effect is poor; in addition, each text must be matched against the abnormal words in the list one by one, so the recognition efficiency is low.
Disclosure of Invention
In view of the above, there is a need for an abnormal text recognition method and device, computer equipment and a storage medium that can recognize abnormal texts in batches with good recognition results and high recognition efficiency.
A first aspect of the present invention provides an abnormal text recognition method, including:
clustering a plurality of texts to be tested, and identifying a plurality of first problem texts among the texts to be tested according to the centroids obtained by clustering;
calling an abnormal text recognition model to identify a plurality of second problem texts among the first problem texts;
extracting a bag-of-words vector of each second problem text, and generating a problem text image based on the bag-of-words vectors;
performing target detection on the problem text image by using a target detection algorithm to obtain a plurality of target detection boxes;
and identifying abnormal texts among the second problem texts according to the target detection boxes.
In an optional embodiment, the identifying abnormal texts among the second problem texts according to the target detection boxes includes:
determining the bag-of-words vectors in each target detection box;
determining whether, among the bag-of-words vectors in each target detection box, there is a target bag-of-words vector identical to any one of the bag-of-words vectors;
and when at least one target bag-of-words vector exists, determining that the second problem text corresponding to the target bag-of-words vector is an abnormal text.
In an optional embodiment, the calling an abnormal text recognition model to identify a plurality of second problem texts among the first problem texts includes:
predicting an anomaly probability for each of the first problem texts by using the abnormal text recognition model;
obtaining a plurality of first candidate problem texts having a first target anomaly probability, a plurality of second candidate problem texts having a second target anomaly probability and a plurality of third candidate problem texts having a third target anomaly probability, wherein the first target anomaly probability < the second target anomaly probability < the third target anomaly probability;
calculating a first text similarity between each first candidate problem text and each second candidate problem text, and calculating a second text similarity between each third candidate problem text and each second candidate problem text;
and identifying a plurality of second problem texts among the second candidate problem texts according to the first text similarities and the second text similarities corresponding to each second candidate problem text.
In an optional embodiment, the identifying a plurality of second problem texts among the second candidate problem texts according to the first text similarities and the second text similarities corresponding to each second candidate problem text includes:
for any one second candidate problem text, obtaining, among the first text similarities, the first target text similarities that are greater than a preset similarity threshold, and counting a first number of the first target text similarities;
obtaining, among the second text similarities, the second target text similarities that are greater than the preset similarity threshold, and counting a second number of the second target text similarities;
counting a third number of the first text similarities and a fourth number of the second text similarities;
and when the ratio of the first number to the third number is smaller than a preset ratio threshold and the ratio of the second number to the fourth number is smaller than the preset ratio threshold, determining that the second candidate problem text is a second problem text.
In an optional embodiment, the clustering a plurality of texts to be tested and identifying a plurality of first problem texts among the texts to be tested according to the centroids obtained by clustering includes:
extracting a text vector of each text to be tested, and clustering the texts to be tested according to the text vectors to obtain a plurality of text clusters to be tested;
calculating an average centroid from the centroids of the text clusters to be tested;
calculating the distance between the centroid of each text cluster to be tested and the average centroid;
determining each text cluster to be tested whose distance is greater than a preset distance threshold as a problem text cluster;
and determining the texts to be tested in the problem text clusters as the first problem texts.
In an optional embodiment, the extracting a bag-of-words vector of each second problem text and generating a problem text image based on the bag-of-words vectors includes:
performing word segmentation on each second problem text to obtain a plurality of words;
calculating the TF-IDF value of each word;
calculating the bag-of-words vector of each second problem text according to the TF-IDF value of each word in that second problem text;
performing dimensionality reduction on each bag-of-words vector to obtain a standard bag-of-words vector;
and generating the problem text image according to the standard bag-of-words vectors.
In an optional embodiment, the method further comprises:
extracting a plurality of abnormal words in the abnormal text;
calculating the abnormality degree of the abnormal text according to the abnormal words;
when the abnormality degree is greater than a preset abnormality degree threshold, obtaining the user account that published the abnormal text;
and banning the user account.
A second aspect of the present invention provides an abnormal text recognition apparatus, including:
a clustering module, used for clustering a plurality of texts to be tested and identifying a plurality of first problem texts among the texts to be tested according to the centroids obtained by clustering;
a calling module, used for calling an abnormal text recognition model to identify a plurality of second problem texts among the first problem texts;
a generating module, used for extracting a bag-of-words vector of each second problem text and generating a problem text image based on the bag-of-words vectors;
a detection module, used for performing target detection on the problem text image by using a target detection algorithm to obtain a plurality of target detection boxes;
and an identification module, used for identifying abnormal texts among the second problem texts according to the target detection boxes.
A third aspect of the invention provides a computer device comprising a processor for implementing the abnormal text recognition method when executing a computer program stored in a memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the abnormal text recognition method.
In summary, the abnormal text recognition method and device, computer equipment and storage medium of the present invention work in three steps. First, a plurality of texts to be tested are clustered, and a plurality of first problem texts that may be abnormal texts are initially identified among them according to the centroids obtained by clustering. Second, an abnormal text recognition model is called to further identify, among the first problem texts, a plurality of second problem texts that are more likely to be abnormal texts. Third, borrowing an idea from image processing, a bag-of-words vector of each second problem text is extracted, a problem text image is generated based on the bag-of-words vectors, and target detection is performed on the problem text image by using a target detection algorithm to obtain a plurality of target detection boxes, so that the truly abnormal texts are identified among the second problem texts in one pass according to the target detection boxes. The three-step approach progressively narrows the scope in which abnormal texts are sought, which makes recognition efficient; the image processing idea allows abnormal texts to be recognized in batches, which further improves efficiency; and the target detection method improves recognition accuracy.
Drawings
Fig. 1 is a flowchart of an abnormal text recognition method according to an embodiment of the present invention.
Fig. 2 is a structural diagram of an abnormal text recognition apparatus according to a second embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The abnormal text recognition method provided by the embodiment of the invention is executed by computer equipment, and correspondingly, the abnormal text recognition device runs in the computer equipment.
Fig. 1 is a flowchart of an abnormal text recognition method according to an embodiment of the present invention. The method can be used to quickly build, from scratch, an abnormal text recognition platform applicable to any technical field, and to recognize abnormal texts in batches from a large number of texts with good recognition results and high recognition efficiency. The method comprises the following steps; depending on requirements, the order of the steps in the flowchart may be changed and some steps may be omitted.
S11, clustering a plurality of texts to be tested, and identifying a plurality of first problem texts among the texts to be tested according to the centroids obtained by clustering.
The computer equipment is provided with a content platform, which can receive texts published by a first user and, on receiving a text request from a second user, respond to the request and display the corresponding text.
The computer equipment can use a text clustering algorithm to cluster the texts to be tested on the content platform, and make a first pass over them according to the centroids obtained by clustering to identify the texts that may be abnormal. The texts identified in this first pass as possibly abnormal are the first problem texts.
The text clustering algorithm may include, but is not limited to: a partition-based text clustering algorithm (K-means, K-medoids), a hierarchy-based text clustering algorithm (divisive, agglomerative), a density-based text clustering algorithm (OPTICS), a grid-based text clustering algorithm (STING), and a model-based text clustering algorithm.
In an optional embodiment, the clustering a plurality of texts to be tested and identifying a plurality of first problem texts among the texts to be tested according to the centroids obtained by clustering includes:
extracting a text vector of each text to be tested, and clustering the texts to be tested according to the text vectors to obtain a plurality of text clusters to be tested;
calculating an average centroid from the centroids of the text clusters to be tested;
calculating the distance between the centroid of each text cluster to be tested and the average centroid;
determining each text cluster to be tested whose distance is greater than a preset distance threshold as a problem text cluster;
and determining the texts to be tested in the problem text clusters as the first problem texts.
The text vector of each text to be tested can be extracted with the pre-trained BERT model. Because BERT takes the semantics of the text into account, the extracted text vector better expresses the overall semantic information of the text.
The text vectors are mapped into a high-dimensional space and then clustered to obtain a plurality of text clusters to be tested, each containing a plurality of texts to be tested. Texts to be tested with similar content are clustered into the same cluster, and texts with different content are clustered into different clusters. Each text cluster to be tested has a centroid, which is the geometric center of the cluster.
An average position is calculated from the positions of the centroids; this average position is the average centroid.
The larger the distance between the centroid of a text cluster to be tested and the average centroid, the more the content of the texts in that cluster differs from the central content of the content platform, the less likely those texts are to belong on the content platform, and the more likely they are to be abnormal texts. Conversely, the smaller the distance, the less the content of the texts in the cluster differs from the central content of the content platform, the more likely those texts are to belong on the content platform, and the less likely they are to be abnormal texts.
In this optional embodiment, clustering the texts to be tested and calculating the average centroid from the cluster centroids yields the central content of the content platform. First problem texts that may be abnormal are then identified according to the distance between the centroid of each text cluster to be tested and the average centroid. This casts a wide net: first problem texts are identified efficiently, all texts that may be abnormal are captured, and the resulting set of first problem texts is large.
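The centroid-distance filter described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the clustering has already been done and that each cluster is given as a list of numeric text vectors. The function names, the toy two-dimensional vectors and the threshold value are all hypothetical, and Euclidean distance is an assumption, since the text does not name a distance measure.

```python
import math

def centroid(vectors):
    """Return the component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def flag_problem_clusters(clusters, distance_threshold):
    """Return the labels of clusters whose centroid lies farther than
    distance_threshold from the average centroid.

    clusters: dict mapping a cluster label to the list of text vectors in it.
    Euclidean distance is an assumption; the source does not specify one.
    """
    centroids = {label: centroid(vecs) for label, vecs in clusters.items()}
    average_centroid = centroid(list(centroids.values()))
    return [
        label
        for label, c in centroids.items()
        if math.dist(c, average_centroid) > distance_threshold
    ]

# Toy 2-D vectors standing in for BERT text vectors (hypothetical data).
clusters = {
    "a": [[0.0, 0.0], [0.2, 0.0]],
    "b": [[0.0, 0.2], [0.2, 0.2]],
    "c": [[9.0, 9.0], [9.2, 9.0]],
}
problem_labels = flag_problem_clusters(clusters, distance_threshold=5.0)
# Every text in a flagged cluster becomes a first problem text.
first_problem_texts = [vec for label in problem_labels for vec in clusters[label]]
```

In this toy data only cluster "c" lies far from the average centroid, so both of its texts are passed on to the next step, including any that are in fact harmless.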
S12, calling an abnormal text recognition model to identify a plurality of second problem texts among the first problem texts.
Because the text clusters to be tested are obtained by clustering, once the distance between the centroid of a text cluster to be tested and the average centroid is greater than the preset distance threshold, all texts to be tested in that cluster are identified as first problem texts. In practice, some of these texts may not be abnormal and are identified by mistake. The computer equipment therefore uses the abnormal text recognition model to identify, in a second pass, a plurality of second problem texts among the first problem texts, which narrows the scope of abnormal text recognition and improves recognition efficiency.
The probability that a second problem text is an abnormal text is greater than the probability that a first problem text is an abnormal text.
In an optional embodiment, the calling an abnormal text recognition model to identify a plurality of second problem texts among the first problem texts includes:
predicting an anomaly probability for each of the first problem texts by using the abnormal text recognition model;
obtaining a plurality of first candidate problem texts having a first target anomaly probability, a plurality of second candidate problem texts having a second target anomaly probability and a plurality of third candidate problem texts having a third target anomaly probability, wherein the first target anomaly probability < the second target anomaly probability < the third target anomaly probability;
calculating a first text similarity between each first candidate problem text and each second candidate problem text, and calculating a second text similarity between each third candidate problem text and each second candidate problem text;
and identifying a plurality of second problem texts among the second candidate problem texts according to the first text similarities and the second text similarities corresponding to each second candidate problem text.
The abnormal text recognition model is trained offline in advance by the computer equipment and is used for predicting the anomaly probability of each first problem text. The larger the anomaly probability, the more likely the corresponding text is an abnormal text; the smaller the anomaly probability, the less likely the corresponding text is an abnormal text.
The content platform publishes a large number of texts at every moment, whereas the abnormal text recognition model is iterated slowly. The generalization ability of the model is therefore limited and its recognition accuracy is not high, so further confirmation is needed for the texts whose predicted anomaly probability corresponds to the second target anomaly probability.
For example, the first target anomaly probability may be 0.4, the second target anomaly probability may be 0.5, and the third target anomaly probability may be 0.6.
If the anomaly probability predicted by the abnormal text recognition model for a text is smaller than the second target anomaly probability, the text is determined not to be an abnormal text; if the predicted anomaly probability is greater than the second target anomaly probability, the text is determined to be an abnormal text. A text whose anomaly probability is the second target anomaly probability may or may not be an abnormal text. The first candidate problem texts corresponding to the first target anomaly probability and the third candidate problem texts corresponding to the third target anomaly probability are therefore used to assist in identifying whether each second candidate problem text corresponding to the second target anomaly probability is an abnormal text.
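The grouping into first, second and third candidate problem texts can be sketched as follows. This is a minimal illustration under stated assumptions: the model's predictions are taken as given (text, probability) pairs, the target values 0.4, 0.5 and 0.6 are the example values from the text, and matching a probability to a target with a small tolerance is an assumption made here because floating-point predictions rarely equal a target exactly. The function name and sample data are hypothetical.

```python
import math

def bucket_by_target_probability(scored_texts, targets=(0.4, 0.5, 0.6), tol=1e-9):
    """Group (text, probability) pairs by the target anomaly probability they match.

    Returns one list per target, in the order of `targets`: the first, second
    and third candidate problem texts. Texts matching no target are left out.
    The tolerance `tol` is an assumption, not part of the source.
    """
    buckets = [[] for _ in targets]
    for text, prob in scored_texts:
        for bucket, target in zip(buckets, targets):
            if math.isclose(prob, target, abs_tol=tol):
                bucket.append(text)
                break
    return buckets

# Hypothetical predictions from the abnormal text recognition model.
scored = [("t1", 0.4), ("t2", 0.5), ("t3", 0.6), ("t4", 0.9), ("t5", 0.5)]
first_candidates, second_candidates, third_candidates = bucket_by_target_probability(scored)
```

Only the second candidates are undecided at this point; the first and third candidates serve as reference sets for the similarity comparison that follows.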
In an optional embodiment, the identifying a plurality of second problem texts among the second candidate problem texts according to the first text similarities and the second text similarities corresponding to each second candidate problem text includes:
for any one second candidate problem text, obtaining, among the first text similarities, the first target text similarities that are greater than a preset similarity threshold, and counting a first number of the first target text similarities;
obtaining, among the second text similarities, the second target text similarities that are greater than the preset similarity threshold, and counting a second number of the second target text similarities;
counting a third number of the first text similarities and a fourth number of the second text similarities;
and when the ratio of the first number to the third number is smaller than a preset ratio threshold and the ratio of the second number to the fourth number is smaller than the preset ratio threshold, determining that the second candidate problem text is a second problem text.
The preset ratio threshold may be 0.5.
For example, assume that the first number of first target text similarities is 10 and the third number of first text similarities is 15, so that the ratio of the first number to the third number is greater than the ratio threshold. Because the first target anomaly probability corresponding to the first candidate problem texts is smaller than the second target anomaly probability, the first candidate problem texts are not abnormal texts. The second candidate problem text is similar to most of them, so it is very likely not an abnormal text, and it is not a second problem text.
Assume instead that the first number of first target text similarities is 5 and the third number of first text similarities is 15, so that the ratio of the first number to the third number is smaller than the ratio threshold. Because the first target anomaly probability corresponding to the first candidate problem texts is smaller than the second target anomaly probability, the first candidate problem texts are not abnormal texts. The second candidate problem text is similar to only a minority of them, so the possibility that it is not an abnormal text is low, and it is a second problem text.
Assume that the second number of second target text similarities is 20 and the fourth number of second text similarities is 25, so that the ratio of the second number to the fourth number is greater than the ratio threshold. Because the third target anomaly probability corresponding to the third candidate problem texts is greater than the second target anomaly probability, the third candidate problem texts are abnormal texts, which indicates that the second candidate problem text is very likely an abnormal text, and it is a second problem text.
Assume that the second number of second target text similarities is 10 and the fourth number of second text similarities is 25, so that the ratio of the second number to the fourth number is smaller than the ratio threshold. Because the third target anomaly probability corresponding to the third candidate problem texts is greater than the second target anomaly probability, the third candidate problem texts are abnormal texts, and the possibility that the second candidate problem text is an abnormal text is low; the second candidate problem text is a second problem text.
In this optional embodiment, the second candidate problem texts, whose anomaly probability predicted by the abnormal text recognition model corresponds to the second target anomaly probability, are examined further: the second problem texts among them are identified according to the first text similarities and the second text similarities corresponding to each second candidate problem text, which gives a more refined identification of the second problem texts.
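The ratio rule stated in the steps of this embodiment (both ratios below the preset ratio threshold) can be sketched as follows. This is a minimal illustration: the similarity values are taken as already computed, the similarity threshold of 0.8 is a hypothetical value, and the function name is an assumption. The counts in the sample data reuse the numbers from the examples above.

```python
def is_second_problem_text(first_sims, second_sims, sim_threshold, ratio_threshold=0.5):
    """Apply the ratio rule to one second candidate problem text.

    first_sims: its similarities to every first candidate problem text.
    second_sims: its similarities to every third candidate problem text.
    Returns True when both ratios are below ratio_threshold.
    """
    if not first_sims or not second_sims:
        raise ValueError("both similarity lists must be non-empty")
    first_number = sum(s > sim_threshold for s in first_sims)    # first target similarities
    second_number = sum(s > sim_threshold for s in second_sims)  # second target similarities
    third_number = len(first_sims)
    fourth_number = len(second_sims)
    return (first_number / third_number < ratio_threshold
            and second_number / fourth_number < ratio_threshold)

# 5 of 15 and 10 of 25 similarities exceed the (hypothetical) threshold 0.8.
accepted = is_second_problem_text([0.9] * 5 + [0.1] * 10,
                                  [0.9] * 10 + [0.1] * 15,
                                  sim_threshold=0.8)
# 10 of 15 exceed the threshold, so the first ratio is above 0.5.
rejected = is_second_problem_text([0.9] * 10 + [0.1] * 5,
                                  [0.9] * 10 + [0.1] * 15,
                                  sim_threshold=0.8)
```

The sketch follows the two-ratio condition exactly as stated in the final step of the embodiment; it does not attempt to model the individual worked examples separately.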
S13, extracting a bag-of-words vector of each second problem text, and generating a problem text image based on the bag-of-words vectors.
After the second problem texts, which are highly likely to be abnormal texts, have been identified, their number is still large. To identify the truly abnormal texts among the second problem texts in one pass, the second problem texts are processed using an idea borrowed from image processing, which improves the efficiency of abnormal text recognition.
In an alternative embodiment, said extracting a bag-of-words vector for each of said second question texts and generating a question text image based on a plurality of said bag-of-words vectors comprises:
performing word segmentation processing on each second question text to obtain a plurality of segmented words;
calculating the TF-IDF value of each segmented word;
calculating a bag-of-words vector of the second question text according to the TF-IDF value of each segmented word in each second question text;
performing dimensionality reduction on each bag-of-words vector to obtain a standard bag-of-words vector;
and generating a question text image according to the plurality of standard bag-of-words vectors.
The computer device can perform word segmentation on each second question text using the Jieba word segmentation tool, splitting each second question text into a plurality of segmented words, calculate the TF-IDF value of each segmented word in the corresponding second question text, and assign the TF-IDF value to the corresponding segmented word, thereby obtaining the bag-of-words vector of the second question text.
The computer device can perform singular value decomposition on the bag-of-words vector of each second question text using a singular value decomposition algorithm. The smaller a singular value is, the less practical significance the corresponding segmented word has for the corresponding second question text; the larger a singular value is, the greater the significance of the corresponding segmented word for the corresponding second question text.
The computer device determines the target singular values that are smaller than a preset singular value threshold, deletes the TF-IDF values corresponding to the target singular values from the bag-of-words vector to obtain a standard bag-of-words vector, splices the plurality of standard bag-of-words vectors to obtain a standard bag-of-words matrix, and generates the question text image according to the standard bag-of-words matrix. Each row of the standard bag-of-words matrix is one standard bag-of-words vector, and the TF-IDF values in the standard bag-of-words matrix represent the pixel values of the question text image.
In this optional embodiment, performing dimensionality reduction on the bag-of-words vectors reduces the dimension of the bag-of-words vector of each second question text, and therefore also reduces the dimension of the question text image generated from the plurality of low-dimensional standard bag-of-words vectors. This improves the efficiency of recognizing the question text image and, in turn, the efficiency of abnormal text recognition.
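For illustration only, the flow from segmented words to a question text image can be sketched in Python as follows. The sketch makes several assumptions that are not part of the embodiment: the texts are already segmented into word lists, the IDF is the unsmoothed form log(N / df), and the singular-value-based reduction is replaced by a simpler stand-in that drops words whose largest TF-IDF value falls below a hypothetical `min_weight`. The function names are likewise hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(token_lists):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF of word j in text i."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    n_docs = len(token_lists)
    # Document frequency: number of texts that contain each word.
    df = Counter(tok for toks in token_lists for tok in set(toks))
    rows = []
    for toks in token_lists:
        counts = Counter(toks)
        total = len(toks) or 1
        rows.append([
            (counts[w] / total) * math.log(n_docs / df[w]) if counts[w] else 0.0
            for w in vocab
        ])
    return vocab, rows


def prune_columns(vocab, rows, min_weight):
    """Simplified stand-in for the reduction step (not SVD): drop every word
    whose largest TF-IDF value over all texts is below min_weight."""
    keep = [j for j in range(len(vocab)) if max(r[j] for r in rows) >= min_weight]
    return [vocab[j] for j in keep], [[r[j] for j in keep] for r in rows]


def to_image(rows):
    """Scale the matrix to 0-255 grayscale values; each row is one text."""
    peak = max((v for r in rows for v in r), default=0.0)
    if peak == 0.0:
        return [[0 for _ in r] for r in rows]
    return [[round(255 * v / peak) for v in r] for r in rows]


# Hypothetical, already segmented second question texts.
docs = [
    ["how", "reset", "password"],
    ["buy", "cheap", "pills", "cheap"],
    ["how", "change", "password"],
]
vocab, rows = tfidf_matrix(docs)
kept, pruned = prune_columns(vocab, rows, min_weight=0.2)
image = to_image(pruned)
```

In this sketch each image row corresponds to one second question text and each column to one retained word, which matches the row-per-vector layout of the standard bag-of-words matrix.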
S14, performing target detection on the question text image by using a target detection algorithm to obtain a plurality of target detection boxes.
The computer device may perform target detection on the question text image using a multi-target detection algorithm, such as the YOLO target detection algorithm, and enclose each detected target in a target detection box.
S15, identifying abnormal texts in the second question texts according to the target detection boxes.
A region enclosed by a target detection box contains an abnormal object, and a region not enclosed by any target detection box contains no abnormal object.
Following the image-processing approach, the question text image is generated from the bag-of-words vectors of all second question texts that may be abnormal texts, and the abnormal objects in the question text image are determined by performing abnormality recognition on the image. The abnormal texts among the second question texts are thereby determined, so that all abnormal texts among the second question texts can be recognized in a single pass, with high recognition efficiency and high accuracy.
In an optional embodiment, the identifying abnormal texts in the second question texts according to the target detection boxes includes:
determining the bag-of-words vectors in each target detection box;
judging whether each bag-of-words vector contains a target bag-of-words vector identical to any bag-of-words vector in the target detection boxes;
and when a bag-of-words vector contains at least one target bag-of-words vector, determining that the second question text corresponding to that bag-of-words vector is an abnormal text.
A target detection box contains one or more bag-of-words vectors, and the bag-of-words vectors in a target detection box are abnormal objects.
If a target detection box encloses part or all of a bag-of-words vector, then that bag-of-words vector contains a target bag-of-words vector, in whole or in part, identical to a bag-of-words vector in that target detection box. This indicates that the bag-of-words vector contains an abnormality, and the second question text corresponding to it is an abnormal text. If a bag-of-words vector is not enclosed by any target detection box, it contains no target bag-of-words vector identical to a bag-of-words vector in any target detection box. This indicates that the bag-of-words vector contains no abnormality, and the second question text corresponding to it is not an abnormal text.
In this optional embodiment, whether a second question text is an abnormal text is identified through the bag-of-words vectors in the target detection boxes. This not only quickly identifies all abnormal texts among the second question texts, but also locates the portions of the bag-of-words vectors that contain abnormal objects, that is, it boxes out the abnormal content within each abnormal text.
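For illustration only, mapping target detection boxes back to second question texts can be sketched in Python as follows. The sketch assumes that row i of the question text image corresponds to the i-th second question text and that each box is given as inclusive pixel coordinates `(x_min, y_min, x_max, y_max)`. This box format and the function name are assumptions; the output format of an actual detector such as YOLO may differ.

```python
def rows_hit_by_boxes(boxes, n_rows):
    """Map each image row touched by a box to the column spans that were boxed."""
    hits = {}
    for x_min, y_min, x_max, y_max in boxes:
        # Clamp the box to the image height and visit every row it overlaps.
        first = max(0, int(y_min))
        last = min(n_rows - 1, int(y_max))
        for row in range(first, last + 1):
            hits.setdefault(row, []).append((int(x_min), int(x_max)))
    return hits


# Hypothetical example: six second question texts and two detection boxes.
texts = ["t0", "t1", "t2", "t3", "t4", "t5"]
boxes = [(2, 1, 4, 2), (0, 4, 1, 4)]
hits = rows_hit_by_boxes(boxes, n_rows=len(texts))
abnormal_texts = [texts[i] for i in sorted(hits)]
```

A text whose row is touched by at least one box is treated as abnormal, and the recorded column spans indicate which part of its bag-of-words vector was boxed.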
In an optional embodiment, the method further comprises:
extracting a plurality of abnormal words in the abnormal text;
calculating the abnormality degree of the abnormal text according to the abnormal words;
when the abnormality degree is larger than a preset abnormality degree threshold value, acquiring a user account issuing the abnormal text;
and banning the user account.
An abnormal word list is pre-stored in the computer device. The abnormal word list stores a plurality of abnormal words, an abnormal word being a word that is irrelevant to the text content of the content platform.
The Euclidean distance or the cosine of the included angle between each segmented word in the abnormal text and each abnormal word in the abnormal word list is calculated to obtain the similarity between each segmented word in the abnormal text and each abnormal word in the abnormal word list. The larger the similarity is, the more likely the corresponding segmented word is an abnormal word; the smaller the similarity is, the less likely the corresponding segmented word is an abnormal word. The target segmented words whose target similarity is greater than a preset similarity threshold are determined to be abnormal words.
After the plurality of abnormal words in the abnormal text are identified, the ratio of the number of abnormal words to the number of segmented words in the abnormal text is calculated to obtain the abnormality degree of the abnormal text. The larger the ratio is, the larger the abnormality degree is, and the less relevant the abnormal text is to the text content of the content platform; the smaller the ratio is, the smaller the abnormality degree is, and the more relevant the abnormal text is to the text content of the content platform.
The target abnormal texts whose abnormality degree is greater than a preset abnormality degree threshold are selected from all the abnormal texts, the user account that published each target abnormal text is determined, and the user account is banned so that it can no longer publish texts to the content platform.
In this optional embodiment, the abnormality degree of each abnormal text is determined so that the user accounts that published target abnormal texts with a large abnormality degree can be banned. This prevents those user accounts from continuing to publish texts irrelevant to the content platform, avoids adverse effects on users who search for texts on the content platform, keeps the texts on the content platform clean, and improves the user experience.
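For illustration only, the abnormality degree and the account selection can be sketched in Python as follows. The sketch assumes that every segmented word has a word vector, that similarity is measured as the cosine of the included angle, and that a segmented word counts as abnormal when its best similarity to any word in the abnormal word list exceeds the similarity threshold. The vectors, thresholds, account identifiers and function names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def abnormal_words(tokens, vectors, abnormal_vectors, sim_threshold):
    """Segmented words whose best similarity to a listed abnormal word exceeds the threshold."""
    found = []
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue  # no vector available, so the word cannot be compared
        best = max((cosine(vec, a) for a in abnormal_vectors), default=0.0)
        if best > sim_threshold:
            found.append(tok)
    return found


def abnormality_degree(tokens, vectors, abnormal_vectors, sim_threshold):
    """Ratio of abnormal words to all segmented words in one abnormal text."""
    if not tokens:
        return 0.0
    hits = abnormal_words(tokens, vectors, abnormal_vectors, sim_threshold)
    return len(hits) / len(tokens)


def accounts_to_ban(posts, vectors, abnormal_vectors, sim_threshold, degree_threshold):
    """posts is an iterable of (account_id, segmented_words) for abnormal texts."""
    return sorted({
        account
        for account, tokens in posts
        if abnormality_degree(tokens, vectors, abnormal_vectors, sim_threshold)
        > degree_threshold
    })


# Hypothetical two-dimensional word vectors and a one-entry abnormal word list.
vectors = {
    "cheap": [1.0, 0.0], "pills": [0.9, 0.1],
    "password": [0.0, 1.0], "reset": [0.1, 0.9],
}
abnormal_vectors = [[1.0, 0.0]]
posts = [("u1", ["cheap", "pills", "password"]), ("u2", ["reset", "password"])]
banned = accounts_to_ban(posts, vectors, abnormal_vectors, 0.8, 0.5)
```

With these hypothetical values, two of the three segmented words published by account `u1` are abnormal, giving an abnormality degree of 2/3, which exceeds the threshold of 0.5.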
It is emphasized that the above-mentioned abnormal text recognition model may be stored in a node of the blockchain in order to further ensure the privacy and security of the above-mentioned abnormal text recognition model.
The abnormal text recognition method proceeds in three steps. First, a plurality of texts to be tested are clustered, and a plurality of first question texts that may be abnormal texts are identified from the texts to be tested according to the plurality of cluster centroids. Second, an abnormal text recognition model is invoked to further identify, from the plurality of first question texts, a plurality of second question texts that are more likely to be abnormal texts. Third, following an image-processing approach, a bag-of-words vector of each second question text is extracted, a question text image is generated based on the plurality of bag-of-words vectors, and target detection is performed on the question text image using a target detection algorithm to obtain a plurality of target detection boxes, so that the genuinely abnormal texts are identified from the plurality of second question texts in a single pass according to the target detection boxes. The three-step method progressively narrows the range in which abnormal texts are sought, which makes abnormal text recognition efficient; the image-processing approach allows abnormal texts to be recognized in batches, which further improves efficiency; and the use of target detection improves the accuracy of abnormal text recognition.
Fig. 2 is a structural diagram of an abnormal text recognition apparatus according to a second embodiment of the present invention.
In some embodiments, the abnormal text recognition apparatus 20 may include a plurality of functional modules composed of computer program segments. The computer program of each program segment in the abnormal text recognition apparatus 20 may be stored in a memory of a computer device and executed by at least one processor to perform the function of abnormal text recognition (described in detail in fig. 1).
In this embodiment, the abnormal text recognition apparatus 20 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a clustering module 201, an invoking module 202, a generating module 203, a detection module 204, an identifying module 205 and an account banning module 206. A module as referred to herein is a series of computer program segments that is stored in a memory, can be executed by at least one processor, and performs a fixed function. The functions of the modules are described in detail below.
The clustering module 201 is configured to cluster a plurality of texts to be tested, and identify a plurality of first question texts among the plurality of texts to be tested according to the plurality of cluster centroids.
A content platform is installed on the computer device. The content platform can receive texts published by a first user and, upon receiving a text request from a second user, respond to the request and display the text corresponding to the request.
The computer device can cluster the plurality of texts to be tested in the content platform using a text clustering algorithm, so that texts that may be abnormal are identified from the texts to be tested, as a first pass, according to the cluster centroids. The texts identified in this first pass as possibly abnormal are the first question texts.
The text clustering algorithm may include, but is not limited to: a partition-based text clustering algorithm (K-means, K-medoids), a hierarchy-based text clustering algorithm (divisive, agglomerative), a density-based text clustering algorithm (OPTICS), a grid-based text clustering algorithm (STING), and a model-based text clustering algorithm.
In an optional embodiment, the clustering module 201 clusters a plurality of texts to be tested, and identifying a plurality of first question texts in the plurality of texts to be tested according to a plurality of clustered centroids includes:
extracting a text vector of each text to be tested, and clustering the plurality of texts to be tested according to the text vectors to obtain a plurality of text clusters to be tested;
calculating an average centroid according to the centroids of the plurality of text clusters to be tested;
calculating the distance between the centroid of each text cluster to be tested and the average centroid;
determining a text cluster to be tested whose distance is greater than a preset distance threshold as a question text cluster;
and determining a plurality of texts to be tested in the question text cluster as a plurality of first question texts.
The text vector of each text to be tested can be extracted using the pre-trained BERT model. Because BERT takes the semantics of the text into account, the extracted text vector better expresses the overall semantic information of the text.
The plurality of text vectors are mapped into a high-dimensional space and then clustered to obtain a plurality of text clusters to be tested, each containing a plurality of texts to be tested. Texts to be tested with similar content are clustered into the same text cluster, and texts to be tested with different content are clustered into different text clusters. Each text cluster to be tested has a centroid, which is the geometric center of the text cluster.
An average position is calculated from the positions of the plurality of centroids; this average position is the average centroid.
The larger the distance between the centroid of a text cluster to be tested and the average centroid, the greater the difference between the content of the texts in that cluster and the central content of the content platform, the less likely those texts belong on the content platform, and hence the more likely they are abnormal texts. Conversely, the smaller the distance between the centroid of a text cluster to be tested and the average centroid, the smaller the difference between the content of the texts in that cluster and the central content of the content platform, the more likely those texts belong on the content platform, and hence the less likely they are abnormal texts.
In this optional embodiment, the central content of the content platform is obtained by clustering the plurality of texts to be tested and calculating the average centroid from the cluster centroids, and the first question texts that may be abnormal are identified according to the distance between the centroid of each text cluster to be tested and the average centroid. This casts a wide net for first question texts: the identification is efficient and captures all texts that may be abnormal, so the set of first question texts is large.
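For illustration only, the centroid-distance screening can be sketched in Python as follows. The sketch assumes that the text vectors and the cluster labels have already been produced elsewhere (for example by a BERT encoder and a K-means run, neither of which is shown), and that the distance is Euclidean. The function names, vectors and threshold are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def first_question_texts(texts, vectors, labels, distance_threshold):
    """Return the texts of every cluster whose centroid lies farther from
    the average centroid than distance_threshold."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    centroids = {
        label: centroid([vectors[i] for i in idxs])
        for label, idxs in clusters.items()
    }
    # The average centroid is the mean of the cluster centroids.
    average = centroid(list(centroids.values()))
    flagged = []
    for label, c in centroids.items():
        if math.dist(c, average) > distance_threshold:
            flagged.extend(texts[i] for i in clusters[label])
    return flagged


# Hypothetical two-dimensional text vectors with precomputed cluster labels.
texts = ["a", "b", "c", "d", "e"]
vectors = [[0, 0], [0, 2], [2, 0], [2, 2], [10, 10]]
labels = [0, 0, 1, 1, 2]
flagged = first_question_texts(texts, vectors, labels, distance_threshold=6)
```

Here the cluster centroids are (0, 1), (2, 1) and (10, 10), the average centroid is (4, 4), and only the third cluster lies farther than 6 from it, so only text `e` is flagged.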
The invoking module 202 is configured to invoke an abnormal text recognition model to recognize a plurality of second question texts in the plurality of first question texts.
Because the text clusters to be tested are obtained by clustering, once the distance between the centroid of a text cluster to be tested and the average centroid is greater than the preset distance threshold, all texts to be tested in that cluster are identified as first question texts. In practice, however, some of these texts may not be abnormal texts and may have been identified by mistake. The computer device therefore uses an abnormal text recognition model to identify, as a second pass, a plurality of second question texts from the plurality of first question texts, which narrows the range in which abnormal texts are sought and improves the efficiency of abnormal text recognition.
The probability that the second question text belongs to the abnormal text is greater than the probability that the first question text belongs to the abnormal text.
In an alternative embodiment, the invoking module 202 invokes the abnormal text recognition model to recognize the second question texts from the first question texts comprises:
performing abnormality probability prediction on each first question text in the plurality of first question texts by using the abnormal text recognition model;
obtaining a plurality of first candidate question texts with a first target abnormality probability, a plurality of second candidate question texts with a second target abnormality probability and a plurality of third candidate question texts with a third target abnormality probability, wherein the first target abnormality probability < the second target abnormality probability < the third target abnormality probability;
calculating a first text similarity between each of the first candidate question texts and each of the second candidate question texts, and calculating a second text similarity between each of the third candidate question texts and each of the second candidate question texts;
and identifying a plurality of second question texts in the plurality of second candidate question texts according to the plurality of first text similarities and the plurality of second text similarities corresponding to each second candidate question text.
The abnormal text recognition model is trained offline in advance by the computer device and is used to predict the abnormality probability of each first question text. The larger the abnormality probability is, the more likely the corresponding first question text is an abnormal text; the smaller the abnormality probability is, the less likely the corresponding first question text is an abnormal text.
Considering that a large number of texts are published on the content platform at every moment while the abnormal text recognition model iterates slowly, the generalization capability of the model is limited and its recognition accuracy is therefore not high. Further abnormality confirmation is thus needed for the question texts whose predicted abnormality probability corresponds to a target abnormality probability.
For example, the first target abnormality probability may be 0.4, the second target abnormality probability may be 0.5, and the third target abnormality probability may be 0.6.
If the abnormality probability predicted by the abnormal text recognition model for a first question text is smaller than the second target abnormality probability, the first question text is determined not to be an abnormal text; if the abnormality probability predicted for a first question text is greater than the second target abnormality probability, the first question text is determined to be an abnormal text. A question text whose abnormality probability equals the second target abnormality probability may or may not be an abnormal text, so the plurality of first candidate question texts corresponding to the first target abnormality probability and the plurality of third candidate question texts corresponding to the third target abnormality probability are used to help identify whether each second candidate question text corresponding to the second target abnormality probability is an abnormal text.
In an optional embodiment, the identifying, according to the plurality of first text similarities and the plurality of second text similarities corresponding to each of the second candidate question texts, a plurality of second question texts in the plurality of second candidate question texts includes:
for any one second candidate question text, obtaining a plurality of first target text similarities which are greater than a preset similarity threshold value in the plurality of first text similarities, and calculating a first number of the plurality of first target text similarities;
obtaining a plurality of second target text similarities which are greater than the preset similarity threshold value in the plurality of second text similarities, and calculating a second number of the plurality of second target text similarities;
calculating a third number of the first text similarities and a fourth number of the second text similarities;
and when the ratio of the first number to the third number is smaller than a preset ratio threshold, and the ratio of the second number to the fourth number is greater than the preset ratio threshold, determining that the second candidate question text is a second question text.
The preset ratio threshold may be 0.5.
For example, assuming that the first number of first target text similarities is 10 and the third number of first text similarities is 15, the ratio of the first number to the third number (about 0.67) is greater than the ratio threshold. Because the first target abnormality probability corresponding to the first candidate question texts is smaller than the second target abnormality probability, the first candidate question texts are not abnormal texts. This indicates that the second candidate question text is highly likely not to be an abnormal text, so the second candidate question text is not a second question text.
Assuming that the first number of first target text similarities is 5 and the third number of first text similarities is 15, the ratio of the first number to the third number (about 0.33) is smaller than the ratio threshold. Because the first target abnormality probability corresponding to the first candidate question texts is smaller than the second target abnormality probability, the first candidate question texts are not abnormal texts. This indicates that the second candidate question text is unlikely to be a normal text, so the second candidate question text is a second question text.
Assuming that the second number of second target text similarities is 20 and the fourth number of second text similarities is 25, the ratio of the second number to the fourth number (0.8) is greater than the ratio threshold. Because the third target abnormality probability corresponding to the third candidate question texts is greater than the second target abnormality probability, the third candidate question texts are abnormal texts. This indicates that the second candidate question text is highly likely to be an abnormal text, so the second candidate question text is a second question text.
Assuming that the second number of second target text similarities is 10 and the fourth number of second text similarities is 25, the ratio of the second number to the fourth number (0.4) is smaller than the ratio threshold. Because the third target abnormality probability corresponding to the third candidate question texts is greater than the second target abnormality probability, the third candidate question texts are abnormal texts. This indicates that the second candidate question text is unlikely to be an abnormal text, so the second candidate question text is not a second question text.
In this optional embodiment, the second question texts among the plurality of second candidate question texts are identified according to the plurality of first text similarities and the plurality of second text similarities corresponding to each second candidate question text. The second candidate question texts, whose abnormality probability predicted by the abnormal text recognition model corresponds to the second target abnormality probability, are thereby examined further, so that the second question texts among them are identified in a refined manner.
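For illustration only, the ratio test applied to one second candidate question text can be sketched in Python as follows. The sketch follows the reading suggested by the worked examples, which is an assumption: a large share of highly similar third candidate question texts counts toward abnormality, and a large share of highly similar first candidate question texts counts against it. The similarity values are taken as given (how text similarity is computed is not shown), and the function names and the similarity threshold of 0.8 are hypothetical; the ratio threshold of 0.5 is the example value from the description.

```python
def share_above(similarities, sim_threshold):
    """Fraction of similarities that exceed the similarity threshold."""
    if not similarities:
        return 0.0
    above = sum(1 for s in similarities if s > sim_threshold)
    return above / len(similarities)


def is_second_question_text(first_sims, second_sims,
                            sim_threshold=0.8, ratio_threshold=0.5):
    """first_sims: similarities to the first candidate question texts.
    second_sims: similarities to the third candidate question texts."""
    first_ratio = share_above(first_sims, sim_threshold)    # first number / third number
    second_ratio = share_above(second_sims, sim_threshold)  # second number / fourth number
    return first_ratio < ratio_threshold and second_ratio > ratio_threshold


# Counts taken from the worked examples: 5 of 15 and 20 of 25 above the threshold.
first_sims = [0.9] * 5 + [0.1] * 10
second_sims = [0.9] * 20 + [0.1] * 5
keep = is_second_question_text(first_sims, second_sims)
```

With 5 of 15 first text similarities and 20 of 25 second text similarities above the similarity threshold, the two ratios are about 0.33 and 0.8, so the candidate is kept as a second question text.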
The generating module 203 is configured to extract a bag-of-words vector of each second question text, and generate a question text image based on a plurality of the bag-of-words vectors.
After the second question texts that are highly likely to be abnormal texts have been recognized, their number is still large. To recognize the genuinely abnormal texts among the second question texts in a single pass, the second question texts are processed using an image-processing approach, which improves the efficiency of abnormal text recognition.
In an alternative embodiment, the generating module 203 extracts a bag-of-words vector for each of the second question texts, and generating the question text image based on a plurality of the bag-of-words vectors includes:
performing word segmentation processing on each second question text to obtain a plurality of segmented words;
calculating the TF-IDF value of each segmented word;
calculating a bag-of-words vector of the second question text according to the TF-IDF value of each segmented word in each second question text;
performing dimensionality reduction on each bag-of-words vector to obtain a standard bag-of-words vector;
and generating a question text image according to the plurality of standard bag-of-words vectors.
The computer device can perform word segmentation on each second question text using the Jieba word segmentation tool, splitting each second question text into a plurality of segmented words, calculate the TF-IDF value of each segmented word in the corresponding second question text, and assign the TF-IDF value to the corresponding segmented word, thereby obtaining the bag-of-words vector of the second question text.
The computer device can perform singular value decomposition on the bag-of-words vector of each second question text using a singular value decomposition algorithm. The smaller a singular value is, the less practical significance the corresponding segmented word has for the corresponding second question text; the larger a singular value is, the greater the significance of the corresponding segmented word for the corresponding second question text.
The computer device determines the target singular values that are smaller than a preset singular value threshold, deletes the TF-IDF values corresponding to the target singular values from the bag-of-words vector to obtain a standard bag-of-words vector, splices the plurality of standard bag-of-words vectors to obtain a standard bag-of-words matrix, and generates the question text image according to the standard bag-of-words matrix. Each row of the standard bag-of-words matrix is one standard bag-of-words vector, and the TF-IDF values in the standard bag-of-words matrix represent the pixel values of the question text image.
In this optional embodiment, performing dimensionality reduction on the bag-of-words vectors reduces the dimension of the bag-of-words vector of each second question text, and therefore also reduces the dimension of the question text image generated from the plurality of low-dimensional standard bag-of-words vectors. This improves the efficiency of recognizing the question text image and, in turn, the efficiency of abnormal text recognition.
The detection module 204 is configured to perform target detection on the question text image by using a target detection algorithm to obtain a plurality of target detection boxes.
The computer device may perform target detection on the question text image using a multi-target detection algorithm, such as the YOLO target detection algorithm, and enclose each detected target in a target detection box.
The identifying module 205 is configured to identify an abnormal text in the second question texts according to the target detection boxes.
A region enclosed by a target detection box contains an abnormal object, and a region not enclosed by any target detection box contains no abnormal object.
Following the image-processing approach, the question text image is generated from the bag-of-words vectors of all second question texts that may be abnormal texts, and the abnormal objects in the question text image are determined by performing abnormality recognition on the image. The abnormal texts among the second question texts are thereby determined, so that all abnormal texts among the second question texts can be recognized in a single pass, with high recognition efficiency and high accuracy.
In an optional embodiment, the identifying module 205 identifies the abnormal text in the second question texts according to the target detection boxes includes:
determining a word bag vector in each target detection box;
judging whether a target word bag vector identical to any word bag vector exists in each word bag vector;
and when at least one target word bag vector exists in the target word bag vector, determining that the second problem text corresponding to the target word bag vector is an abnormal text.
The target detection frame comprises one or more word bag vectors, and the word bag vectors in the target detection frame are abnormal objects.
If a certain target detection box selects part or all of the word bag vectors in a certain word bag vector, all or part of the target word bag vectors which are the same as any word bag vector in any one target detection box exist in the word bag vectors, and the fact that an abnormal condition exists in the word bag vectors indicates that the second problem text corresponding to the word bag vectors is an abnormal text. If a certain bag-of-word vector is not selected by any target detection box, the bag-of-word vector does not have the same target bag-of-word vector as any bag-of-word vector in any target detection box, which indicates that no abnormal condition exists in the bag-of-word vector, and the second problem text corresponding to the bag-of-word vector is not an abnormal text.
In this optional embodiment, whether the second problem text is an abnormal text is identified through the word bag vector in the target detection box, so that not only can all abnormal texts in the second problem text be quickly identified, but also the word bag vector with an abnormal object in the abnormal text can be selected, that is, an abnormal content box in the abnormal text is selected.
The account banning module 206 is configured to: extract a plurality of abnormal words from the abnormal text; calculate the abnormality degree of the abnormal text according to the abnormal words; when the abnormality degree is greater than a preset abnormality degree threshold, acquire the user account that published the abnormal text; and ban the user account.
An abnormal word list is pre-stored in the computer device. The list stores a plurality of abnormal words, where an abnormal word is a word that is unrelated to the text content of the content platform.
The Euclidean distance or the cosine of the included angle between each segmented word in the abnormal text and each abnormal word in the abnormal word list is calculated to obtain the similarity between them. The greater the similarity, the more likely the segmented word is an abnormal word; the smaller the similarity, the less likely it is an abnormal word. A target segmented word whose similarity is greater than the preset similarity threshold is determined to be an abnormal word.
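A minimal sketch of the similarity comparison is given below. It assumes that each segmented word and each abnormal word is already represented as a numeric vector, that cosine similarity is the chosen measure, and that a word is flagged when its best similarity exceeds the threshold, in line with the statement that a larger similarity indicates an abnormal word. The function names and the dictionary layout are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_abnormal_words(
    word_vectors: Dict[str, Vector],
    abnormal_vectors: Sequence[Vector],
    threshold: float,
) -> List[str]:
    """Flag words whose best similarity to any abnormal word exceeds threshold."""
    flagged: List[str] = []
    for word, vec in word_vectors.items():
        best = max(
            (cosine_similarity(vec, ab) for ab in abnormal_vectors),
            default=0.0,
        )
        if best > threshold:
            flagged.append(word)
    return flagged
```

If Euclidean distance were used instead, the comparison would need to be inverted or converted, since a smaller distance corresponds to a greater similarity.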
After the abnormal words in the abnormal text are identified, the ratio of the number of abnormal words to the number of segmented words in the abnormal text is calculated to obtain the abnormality degree of the abnormal text. The larger the ratio, the greater the abnormality degree and the less relevant the abnormal text is to the text content of the content platform; the smaller the ratio, the smaller the abnormality degree and the more relevant the abnormal text is to the text content of the content platform.
Target abnormal texts whose abnormality degree is greater than the preset abnormality degree threshold are selected from all abnormal texts, the user account that published each target abnormal text is determined, and the user account is banned so that it can no longer publish texts to the content platform.
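The ratio and the threshold check can be sketched as follows. The sketch assumes that the ratio counts every occurrence of an abnormal word among the segmented words, and that each abnormal text is available as a record with hypothetical keys "account" and "tokens"; neither detail is specified in the embodiment.

```python
from typing import Dict, List, Sequence, Set


def abnormality_degree(tokens: Sequence[str], abnormal_words: Set[str]) -> float:
    """Ratio of abnormal-word occurrences to all segmented words in one text."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in abnormal_words)
    return hits / len(tokens)


def accounts_to_ban(
    posts: Sequence[Dict],
    abnormal_words: Set[str],
    threshold: float,
) -> List[str]:
    """Return each account once if it published a text above the threshold."""
    banned: List[str] = []
    for post in posts:
        degree = abnormality_degree(post["tokens"], abnormal_words)
        if degree > threshold and post["account"] not in banned:
            banned.append(post["account"])
    return banned
```

The actual banning operation (for example, revoking publishing rights on the content platform) is outside the scope of this sketch.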
In this optional embodiment, the abnormality degree of each abnormal text is determined, and the user account that published a target abnormal text with a high abnormality degree is banned. This prevents the user account from continuing to publish texts unrelated to the content platform, avoids adverse effects on users who perform text retrieval on the content platform, keeps the texts on the content platform clean, and improves the user experience.
It is emphasized that the above-mentioned abnormal text recognition model may be stored in a node of the blockchain in order to further ensure the privacy and security of the above-mentioned abnormal text recognition model.
The abnormal text recognition device first clusters a plurality of texts to be tested and, according to the plurality of centroids obtained by clustering, preliminarily identifies a plurality of first problem texts that may be abnormal texts. Second, it calls an abnormal text recognition model to further identify, among the plurality of first problem texts, a plurality of second problem texts that are more likely to be abnormal texts. Third, borrowing an idea from image processing, it extracts a bag-of-words vector of each second problem text, generates a problem text image based on the plurality of bag-of-words vectors, and performs target detection on the problem text image by using a target detection algorithm to obtain a plurality of target detection boxes, so that the actual abnormal texts are identified among the plurality of second problem texts in a single pass according to the plurality of target detection boxes. This three-step approach progressively narrows the range in which abnormal texts are searched, which makes identification efficient; the image-processing idea allows abnormal texts to be identified in batches, which further improves efficiency; and the use of target detection improves identification accuracy.
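The first of the three steps (selecting first problem texts from cluster centroids, as set out in claim 5) can be sketched as below. The sketch assumes that the texts have already been vectorized and assigned to clusters by some clustering algorithm, and that Euclidean distance is used between centroids; the clustering algorithm itself, the function names and the input layout are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def first_problem_texts(
    clusters: Dict[int, List[Tuple[str, Vector]]],
    distance_threshold: float,
) -> List[str]:
    """Return texts from clusters whose centroid is far from the average centroid."""
    # Centroid of each cluster = mean of its member text vectors.
    centroids = {
        cid: mean_vector([vec for _text, vec in members])
        for cid, members in clusters.items()
    }
    # Average centroid = mean of all cluster centroids.
    average_centroid = mean_vector(list(centroids.values()))
    suspects: List[str] = []
    for cid, members in clusters.items():
        if euclidean(centroids[cid], average_centroid) > distance_threshold:
            suspects.extend(text for text, _vec in members)
    return suspects
```

Only the texts returned here would be passed on to the abnormal text recognition model in the second step.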
Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the computer device 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the computer device shown in fig. 3 does not constitute a limitation of the embodiments of the present invention, and may be a bus-type configuration or a star-type configuration, and that the computer device 3 may include more or less hardware or software than those shown, or a different arrangement of components.
In some embodiments, the computer device 3 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 3 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.
It should be noted that the computer device 3 is only an example. Other existing or future electronic products that can be adapted to the present invention should also fall within the scope of protection of the present invention and are incorporated herein by reference.
In some embodiments, the memory 31 stores a computer program which, when executed by the at least one processor 32, implements all or part of the steps of the abnormal text recognition method described above. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
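The idea of data blocks linked by a cryptographic method can be illustrated with a minimal hash-chain sketch. It assumes SHA-256 as the hash function and a simple dictionary as the block layout; it omits consensus, networking and transaction handling, and it is not the storage scheme of the embodiment. All names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_PREV_HASH = "0" * 64  # assumed placeholder for the first block


def block_hash(block: Dict) -> str:
    """SHA-256 over every field of the block except its own stored hash."""
    content = {k: v for k, v in block.items() if k != "hash"}
    payload = json.dumps(content, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: List[Dict], data: str) -> Dict:
    """Create a block that records the previous block's hash and append it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV_HASH
    block = {"index": len(chain), "data": data, "prev_hash": prev_hash}
    block["hash"] = block_hash(block)
    chain.append(block)
    return block


def is_valid(chain: List[Dict]) -> bool:
    """Check that every block's hash and link to its predecessor are intact."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i > 0 else GENESIS_PREV_HASH
        if block["prev_hash"] != expected_prev:
            return False
        if block["hash"] != block_hash(block):
            return False
    return True
```

Because each block stores the hash of its predecessor, altering the data of an earlier block changes its hash and makes the chain fail validation, which is the anti-counterfeiting property referred to above.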
In some embodiments, the at least one processor 32 is a Control Unit (Control Unit) of the computer device 3, connects various components of the entire computer device 3 by using various interfaces and lines, and executes various functions and processes data of the computer device 3 by running or executing programs or modules stored in the memory 31 and calling data stored in the memory 31. For example, the at least one processor 32, when executing the computer program stored in the memory, implements all or part of the steps of the abnormal text recognition method described in the embodiments of the present invention; or implement all or part of the functions of the abnormal text recognition apparatus. The at least one processor 32 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips.
In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.
Although not shown, the computer device 3 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the modules is only a logical functional division, and other divisions may be used in actual implementation.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or means recited in the present invention can also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. An abnormal text recognition method, characterized in that the method comprises:
clustering a plurality of texts to be tested, and identifying a plurality of first problem texts among the plurality of texts to be tested according to a plurality of centroids obtained by the clustering;
calling an abnormal text recognition model to identify a plurality of second problem texts among the plurality of first problem texts;
extracting a bag-of-words vector of each second problem text, and generating a problem text image based on the plurality of bag-of-words vectors;
performing target detection on the problem text image by using a target detection algorithm to obtain a plurality of target detection boxes;
and identifying abnormal texts among the plurality of second problem texts according to the plurality of target detection boxes.
2. The abnormal text recognition method of claim 1, wherein the identifying abnormal texts among the plurality of second problem texts according to the plurality of target detection boxes comprises:
determining the bag-of-words vectors in each target detection box;
judging whether each bag-of-words vector of the second problem texts contains a target bag-of-words vector identical to any bag-of-words vector in any target detection box;
and when at least one target bag-of-words vector exists in a bag-of-words vector, determining that the second problem text corresponding to that bag-of-words vector is an abnormal text.
3. The abnormal text recognition method of claim 1, wherein the calling an abnormal text recognition model to identify a plurality of second problem texts among the plurality of first problem texts comprises:
performing anomaly probability prediction on each first problem text of the plurality of first problem texts by using the abnormal text recognition model;
obtaining a plurality of first candidate problem texts with a first target anomaly probability, a plurality of second candidate problem texts with a second target anomaly probability, and a plurality of third candidate problem texts with a third target anomaly probability, wherein the first target anomaly probability is less than the second target anomaly probability, and the second target anomaly probability is less than the third target anomaly probability;
calculating a first text similarity between each first candidate problem text and each second candidate problem text, and calculating a second text similarity between each third candidate problem text and each second candidate problem text;
and identifying a plurality of second problem texts among the plurality of second candidate problem texts according to the plurality of first text similarities and the plurality of second text similarities corresponding to each second candidate problem text.
4. The abnormal text recognition method of claim 3, wherein the identifying a plurality of second problem texts among the plurality of second candidate problem texts according to the plurality of first text similarities and the plurality of second text similarities comprises:
for any one second candidate problem text, obtaining, from the plurality of first text similarities, a plurality of first target text similarities that are greater than a preset similarity threshold, and calculating a first number of the plurality of first target text similarities;
obtaining, from the plurality of second text similarities, a plurality of second target text similarities that are greater than the preset similarity threshold, and calculating a second number of the plurality of second target text similarities;
calculating a third number of the first text similarities and a fourth number of the second text similarities;
and when the ratio of the first number to the third number is smaller than a preset ratio threshold and the ratio of the second number to the fourth number is smaller than the preset ratio threshold, determining that the second candidate problem text is a second problem text.
5. The abnormal text recognition method according to any one of claims 1 to 4, wherein the clustering a plurality of texts to be tested and identifying a plurality of first problem texts among the plurality of texts to be tested according to a plurality of centroids obtained by the clustering comprises:
extracting a text vector of each text to be tested, and clustering the plurality of texts to be tested according to the text vectors to obtain a plurality of clusters of texts to be tested;
calculating an average centroid from the centroids of the plurality of clusters of texts to be tested;
calculating the distance between the centroid of each cluster of texts to be tested and the average centroid;
determining a cluster of texts to be tested whose distance is greater than a preset distance threshold to be a problem text cluster;
and determining the texts to be tested in the problem text cluster to be the plurality of first problem texts.
6. The abnormal text recognition method according to claim 5, wherein the extracting a bag-of-words vector of each second problem text and generating a problem text image based on the plurality of bag-of-words vectors comprises:
performing word segmentation on each second problem text to obtain a plurality of segmented words;
calculating the TF-IDF value of each segmented word;
calculating the bag-of-words vector of each second problem text according to the TF-IDF value of each segmented word in that second problem text;
performing dimensionality reduction on each bag-of-words vector to obtain a standard bag-of-words vector;
and generating the problem text image according to the plurality of standard bag-of-words vectors.
7. The abnormal text recognition method of claim 5, wherein the method further comprises:
extracting a plurality of abnormal words from the abnormal text;
calculating the abnormality degree of the abnormal text according to the abnormal words;
when the abnormality degree is greater than a preset abnormality degree threshold, acquiring the user account that published the abnormal text;
and banning the user account.
8. An abnormal text recognition apparatus, characterized in that the apparatus comprises:
a clustering module, configured to cluster a plurality of texts to be tested and identify a plurality of first problem texts among the plurality of texts to be tested according to a plurality of centroids obtained by the clustering;
a calling module, configured to call an abnormal text recognition model to identify a plurality of second problem texts among the plurality of first problem texts;
a generating module, configured to extract a bag-of-words vector of each second problem text and generate a problem text image based on the plurality of bag-of-words vectors;
a detection module, configured to perform target detection on the problem text image by using a target detection algorithm to obtain a plurality of target detection boxes;
and an identification module, configured to identify abnormal texts among the plurality of second problem texts according to the plurality of target detection boxes.
9. A computer device, characterized in that the computer device comprises a processor configured to implement the abnormal text recognition method according to any one of claims 1 to 7 when executing a computer program stored in a memory.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the abnormal text recognition method according to any one of claims 1 to 7.
CN202110076225.5A 2021-01-20 2021-01-20 Abnormal text recognition method and device, computer equipment and storage medium Active CN112860849B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110076225.5A CN112860849B (en) 2021-01-20 2021-01-20 Abnormal text recognition method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112860849A true CN112860849A (en) 2021-05-28
CN112860849B CN112860849B (en) 2021-11-30

Family

ID=76007749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110076225.5A Active CN112860849B (en) 2021-01-20 2021-01-20 Abnormal text recognition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112860849B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104182539A (en) * 2014-09-02 2014-12-03 五八同城信息技术有限公司 Abnormal information batch processing method and system
US9053431B1 (en) * 2010-10-26 2015-06-09 Michael Lamport Commons Intelligent control with hierarchical stacked neural networks
CN110175221A (en) * 2019-05-17 2019-08-27 国家计算机网络与信息安全管理中心 Utilize the refuse messages recognition methods of term vector combination machine learning
CN111125362A (en) * 2019-12-23 2020-05-08 百度国际科技(深圳)有限公司 Abnormal text determination method and device, electronic equipment and medium
KR20200072724A (en) * 2018-12-13 2020-06-23 줌인터넷 주식회사 An apparatus for detecting spam news with spam phrases, a method thereof and computer recordable medium storing program to perform the method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
汪金涛 等: "图像型垃圾邮件监控系统研究与设计", 《辽宁工业大学学报(自然科学版)》 *

Also Published As

Publication number Publication date
CN112860849B (en) 2021-11-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant