CN111125362B - Abnormal text determination method and device, electronic equipment and medium - Google Patents

Abnormal text determination method and device, electronic equipment and medium

Info

Publication number
CN111125362B
Authority
CN
China
Prior art keywords
text
abnormal
cluster
current
abnormal text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911341128.3A
Other languages
Chinese (zh)
Other versions
CN111125362A (en
Inventor
刘庚
白敬亭
张伟军
彭云鹏
杨经纬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu International Technology Shenzhen Co ltd
Original Assignee
Baidu International Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu International Technology Shenzhen Co ltd
Priority to CN201911341128.3A
Publication of CN111125362A
Application granted
Publication of CN111125362B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses an abnormal text determining method, an abnormal text determining device, electronic equipment and a medium, and relates to the technical field of artificial intelligence. The specific implementation scheme is as follows: determining the central characteristics of the current cluster according to other texts except the last abnormal text in the text cluster; selecting the abnormal text from all texts according to the distance between all texts in the text cluster and the central characteristic of the current cluster; and determining the target abnormal text in the text cluster according to the current abnormal text. Through the technical scheme of the embodiment of the application, the abnormal text in the text cluster can be automatically filtered, so that the classification is more accurate, and the user experience is further improved.

Description

Abnormal text determination method and device, electronic equipment and medium
Technical Field
The present application relates to the field of computer technology, in particular to artificial intelligence technology, and more specifically to a method, an apparatus, an electronic device, and a medium for determining abnormal text.
Background
With the continued development of the big data era, the volume of public opinion information being collected keeps growing, so accurately classifying tens of thousands of items of information has become critical. Through document classification, a user can quickly find the information that the user needs to know. For example, in public opinion monitoring, the user can customize the topics of interest, and only the documents the user needs are then pushed to the user. At present, a large amount of junk information such as advertisements appears during data collection, so many junk documents are mistakenly assigned to the various categories, which directly degrades the user experience.
Disclosure of Invention
The embodiment of the application provides an abnormal text determining method, an abnormal text determining device, electronic equipment and a medium, which can automatically filter abnormal texts in text clusters, so that classification is more accurate, and user experience is further improved.
In a first aspect, an embodiment of the present application discloses a method for determining abnormal text, the method including:
determining the central characteristics of the current cluster according to other texts except the last abnormal text in the text cluster;
selecting the abnormal text from all texts according to the distance between all texts in the text cluster and the central characteristic of the current cluster;
and determining the target abnormal text in the text cluster according to the current abnormal text.
One embodiment of the above application has the following advantages or benefits: using loop iteration, the current cluster center feature is determined from the texts in the text cluster other than the last abnormal text; the distance between each text in the text cluster and the current cluster center feature is then calculated, and the current abnormal text is selected from all the texts according to those distances; finally, the target abnormal text in the text cluster, namely the irrelevant text, is determined according to the current abnormal text. This solves the problem that abnormal texts in the text clusters currently recommended to users lead to a poor user experience, realizes automatic filtering of abnormal texts in a text cluster, makes classification more accurate, and thereby improves the user experience.
Optionally, determining, according to the current abnormal text, the target abnormal text in the text cluster may include:
if an iteration stop event is detected according to the current abnormal text and the other texts in the text cluster except the current abnormal text, taking the current abnormal text as the target abnormal text in the text cluster; or
comparing the current abnormal text with the last abnormal text, and determining a target abnormal text in the text cluster according to a comparison result.
The above alternatives have the following advantages or benefits: by introducing an iteration stop event that depends on the current abnormal text and the other texts in the text cluster, processing is sped up while the target abnormal text in the text cluster can still be determined accurately; in addition, the target abnormal text can be determined by comparing the current abnormal text with the last abnormal text, which increases the flexibility of selecting the target abnormal text, namely the irrelevant text.
Optionally, the method further comprises:
taking the current abnormal text closest to the current cluster center feature as the first current abnormal text;
taking the other current abnormal text closest to the first current abnormal text as the second current abnormal text;
and if the distances between the first current abnormal text and all the other texts in the text cluster except the current abnormal text are greater than the distance between the first current abnormal text and the second current abnormal text, generating the iteration stop event.
The above alternatives have the following advantages or benefits: the iteration stop event generation mode related to the abnormal text and other texts except the abnormal text in the text cluster is provided, so that the flexibility of the scheme is further increased.
Optionally, after taking other abnormal texts closest to the first abnormal text as the second abnormal text, the method further includes:
if the distance between any other text in the text cluster except the current abnormal text and the first current abnormal text is smaller than or equal to the distance between the first current abnormal text and the second current abnormal text, adding that other text to the current abnormal text and triggering the next abnormal text determination operation.
The above alternatives have the following advantages or benefits: the trigger condition for determining the next abnormal text determination operation, namely the next iteration operation, is provided, and the flexibility of the scheme is further increased.
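By way of a non-limiting illustration, the stop-event check and the merge rule stated above might be sketched in Python as follows. The sketch assumes that distance is computed as one minus cosine similarity, which the application does not specify; the function names `cosine_distance` and `check_stop_event` and the sample vectors are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Assumed distance: 1 - cosine similarity (not specified by the source)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def check_stop_event(vectors, abnormal, first, second):
    """Return (stop, abnormal_indices).

    stop is True when every non-abnormal text is farther from the first
    abnormal text than the second abnormal text is. Otherwise the texts
    that are at least as close are merged into the abnormal set, and the
    caller is expected to run the next determination operation.
    """
    threshold = cosine_distance(vectors[first], vectors[second])
    absorbed = [
        idx for idx in range(len(vectors))
        if idx not in abnormal
        and cosine_distance(vectors[idx], vectors[first]) <= threshold
    ]
    return not absorbed, sorted(set(abnormal) | set(absorbed))

# Text 1 lies close to abnormal text 2, so it is merged and iteration continues.
vectors_a = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0]]
stop_a, abnormal_a = check_stop_event(vectors_a, [2, 3], first=2, second=3)

# No remaining text is close enough, so the stop event is generated.
vectors_b = [[1.0, 0.0], [0.99, 0.1], [0.1, 1.0], [0.0, 1.0]]
stop_b, abnormal_b = check_stop_event(vectors_b, [2, 3], first=2, second=3)
```

In this sketch the first call merges text 1 into the abnormal set, while the second call leaves the set unchanged and reports the stop event.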
Optionally, the method further comprises:
and clustering the crawled text set according to the set clustering category and the keywords of the clustering category to obtain a text cluster.
Optionally, before clustering the crawled text set according to the set clustering category and the keyword of the clustering category, the method further includes:
filtering the text set according to at least one of a set title length, a regular expression, and a specific identifier.
The above alternatives have the following advantages or benefits: through filtering the text set, classification is more accurate, and a foundation is laid for quickly determining target abnormal texts in each text cluster, namely irrelevant texts.
In a second aspect, an embodiment of the present application discloses an abnormal text determining apparatus, including:
the current center feature determining module is used for determining the current cluster center feature according to other texts except the last abnormal text in the text cluster;
the abnormal text determining module is used for selecting the current abnormal text from all the texts according to the distances between the texts in the text cluster and the current cluster center feature;
and the target abnormal text determining module is used for determining the target abnormal text in the text cluster according to the current abnormal text.
In a third aspect, an embodiment of the present application further discloses an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the abnormal text determination method according to any one of the embodiments of the present application.
In a fourth aspect, embodiments of the present application also disclose a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the abnormal text determination method according to any of the embodiments of the present application.
One embodiment of the above application has the following advantages or benefits: using loop iteration, the current cluster center feature is determined from the texts in the text cluster other than the last abnormal text; the distance between each text in the text cluster and the current cluster center feature is then calculated, and the current abnormal text is selected from all the texts according to those distances; finally, the target abnormal text in the text cluster, namely the irrelevant text, is determined according to the current abnormal text. This solves the problem that abnormal texts in the text clusters currently recommended to users lead to a poor user experience, realizes automatic filtering of abnormal texts in a text cluster, makes classification more accurate, and thereby improves the user experience.
Other effects of the above alternative will be described below in connection with specific embodiments.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
Fig. 1 is a flowchart of an abnormal text determination method provided according to a first embodiment of the present application;
Fig. 2 is a flowchart of an abnormal text determination method provided according to a second embodiment of the present application;
Fig. 3 is a flowchart of an abnormal text determination method provided according to a third embodiment of the present application;
Fig. 4 is a schematic structural view of an abnormal text determination apparatus provided according to a fourth embodiment of the present application;
Fig. 5 is a block diagram of an electronic device for implementing the abnormal text determination method of the embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
First embodiment
Fig. 1 is a flowchart of a method for determining abnormal text according to a first embodiment of the present application. This embodiment is applicable to accurately identifying abnormal text, i.e., irrelevant text such as advertisements, in a text cluster. The method may be performed by an abnormal text determination device, which may be implemented in software and/or hardware and may be integrated on a computing device carrying the abnormal text determination function. As shown in Fig. 1, the abnormal text determining method provided in the present embodiment may include:
S110, determining the central characteristic of the current cluster according to other texts except the last abnormal text in the text cluster.
In this embodiment, the text cluster is a collection of a plurality of texts belonging to the same category, and may be obtained by classifying a text set crawled from the internet. For example, an unsupervised clustering algorithm such as K-Means may be used to cluster a text set crawled from the internet to obtain a plurality of text clusters; alternatively, the crawled text set may be clustered according to the set clustering categories and the keywords of each clustering category to obtain the text clusters. The set clustering categories specify in advance which clusters, i.e., which types, the text set is to be divided into; for example, they may include a new release cluster, an investment cluster, and the like. The keywords of a clustering category are the conditions or rules for assigning texts to that category, and the keywords of each clustering category may differ. For example, every text in the text set whose title contains at least one of the keywords online, new, release, push, debut, first, new function, reform, and the like may be assigned to the new release cluster. It should be noted that, because an unsupervised clustering algorithm such as K-Means cannot accurately define each category and the categories still have to be determined manually, this embodiment preferably clusters the text set by means of the set clustering categories and their keywords.
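As a non-limiting illustration, the keyword-based clustering described above might be sketched in Python as follows. The category names, the English keywords, the sample titles, and the function name `cluster_by_keywords` are all hypothetical; plain substring matching is an assumption, and it can over-match (for example, "new" also matches "news").

```python
from collections import defaultdict

# Hypothetical category-to-keyword table, for illustration only.
CATEGORY_KEYWORDS = {
    "new_release": ["launch", "new", "release", "debut", "new feature"],
    "investment": ["invest", "funding", "acquire"],
}

def cluster_by_keywords(titles, category_keywords):
    """Assign each title to every category with at least one matching keyword."""
    clusters = defaultdict(list)
    for title in titles:
        lowered = title.lower()
        for category, keywords in category_keywords.items():
            if any(keyword in lowered for keyword in keywords):
                clusters[category].append(title)
    return dict(clusters)

titles = [
    "Acme launches a new phone",
    "Beta Corp secures funding round",
    "Weather stays mild this week",
]
clusters = cluster_by_keywords(titles, CATEGORY_KEYWORDS)
```

Titles that match no keyword, such as the third one here, are simply left out of every cluster in this sketch.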
Alternatively, after clustering the crawled text set to obtain a plurality of text clusters, the process of S110 to S130 may be used for each text cluster to determine the target abnormal text of the text cluster.
The abnormal text, namely the junk text in the text cluster, may include advertisements, pornography-related texts, and the like; the last abnormal text refers to the abnormal text selected from the text cluster by the last iteration, i.e., the last abnormal text determination operation. The current cluster center feature represents the common feature of the texts in the text cluster other than the last abnormal text, and can be represented by a vector. Optionally, the current cluster center feature may be determined according to the texts in the text cluster other than the last abnormal text; for example, the average of the text vectors of those texts may be used as the current cluster center feature. Further, if the current abnormal text determination operation is the first one, the last abnormal text is empty, and the current cluster center feature represents the common feature of all the texts in the text cluster; that is, the average of the text vectors of all the texts in the text cluster may be used as the current cluster center feature.
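For illustration only, the averaging described above might be sketched in Python as follows. The function name `cluster_center`, the use of list indices to identify texts, and the two-dimensional sample vectors are assumptions made for the example.

```python
def cluster_center(vectors, excluded=frozenset()):
    """Mean of the text vectors whose indices are not in `excluded`."""
    kept = [vec for idx, vec in enumerate(vectors) if idx not in excluded]
    if not kept:
        raise ValueError("no texts left to average")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 8.0]]

# First determination operation: the last abnormal text is empty,
# so all texts contribute to the center.
first_center = cluster_center(vectors)

# Later operation: text 2 was the last abnormal text and is left out.
next_center = cluster_center(vectors, excluded={2})
```

Leaving out the previously selected texts pulls the center back toward the remaining texts, which is the effect the embodiment relies on.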
In this embodiment, the text vector of each text in the text cluster may be obtained by vectorizing the title of the text using TF-IDF (term frequency-inverse document frequency), pre-trained word vectors, or the knowledge-enhanced semantic representation model ERNIE (Enhanced Representation through Knowledge Integration). Because the ERNIE model both avoids the problem of an overly sparse matrix after vectorization and handles word order and polysemous words, this embodiment preferably vectorizes the title of the text with the ERNIE model to obtain the text vector of the text.
Optionally, when the ERNIE model has a plurality of output layers, the title of each text may first be converted into a multi-dimensional matrix, and one column may then be selected from the matrix as the text vector of the text. For example, for a 768×12 matrix, the data of each column may be visualized, and a column is then selected as the text vector of the text according to the visualization result. Further, to speed up ERNIE vectorization, the title of the text may be preprocessed before the ERNIE model converts it into a text vector, for example by deleting punctuation and stop words from text whose length exceeds 64.
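A minimal sketch of the title preprocessing mentioned above is given below, under several assumptions: length is measured in characters, only ASCII punctuation is removed, and the stop-word list is a short hypothetical English list. The function name `shorten_title` is likewise hypothetical, and a production system handling Chinese titles would need a different punctuation set and stop-word list.

```python
import string

STOP_WORDS = {"the", "a", "an", "of", "and", "to"}  # hypothetical stop-word list
MAX_TITLE_LENGTH = 64  # threshold taken from the example in the text

def shorten_title(title, max_length=MAX_TITLE_LENGTH, stop_words=STOP_WORDS):
    """Strip punctuation and stop words, but only from over-long titles."""
    if len(title) <= max_length:
        return title
    no_punct = title.translate(str.maketrans("", "", string.punctuation))
    words = [word for word in no_punct.split() if word.lower() not in stop_words]
    return " ".join(words)

short = shorten_title("Short title.")
long_title = "The launch of a new phone, and the release of an update to the store!!"
cleaned = shorten_title(long_title)
```

Titles at or below the threshold are returned unchanged, so the step only costs time on the titles that would otherwise slow vectorization.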
S120, selecting the abnormal text from all the texts according to the distance between all the texts in the text cluster and the central feature of the current cluster.
In this embodiment, the distance between each text and the current cluster center feature may be represented by the cosine similarity between the text vector of the text and the current cluster center feature. Optionally, the distance is inversely related to the cosine similarity: the greater the cosine similarity, the closer the distance; the smaller the cosine similarity, the farther the distance.
Specifically, after the current cluster center feature is determined, the cosine similarity between the text vector of each text in the text cluster and the current cluster center feature can be calculated, and the current abnormal text can then be selected from all the texts in the text cluster according to the cosine similarity. For example, the cosine similarities may be sorted in ascending order, and the texts corresponding to the first fixed number of values, such as the first 10, may be taken as the current abnormal text. Optionally, the number of abnormal texts selected from the text cluster may be the same or different in each abnormal text determination operation.
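The selection step described above might be sketched in Python as follows. The function names `cosine_similarity` and `select_abnormal`, the treatment of a zero vector as having similarity 0, and the sample vectors are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)

def select_abnormal(vectors, center, count):
    """Indices of the `count` texts least similar to the cluster center."""
    sims = [cosine_similarity(vec, center) for vec in vectors]
    order = sorted(range(len(vectors)), key=lambda idx: sims[idx])  # ascending
    return order[:count]

vectors = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0], [1.0, 0.0]]
center = [1.0, 0.1]
abnormal = select_abnormal(vectors, center, count=1)
```

Sorting in ascending order of similarity puts the texts farthest from the center first, so taking a prefix of the ordering yields the current abnormal text.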
S130, determining a target abnormal text in the text cluster according to the abnormal text.
In this embodiment, the target abnormal text refers to the abnormal text of the finally determined text cluster.
Specifically, after the current abnormal text is obtained, the target abnormal text in the text cluster can be determined according to the current abnormal text in combination with a preset iteration stop condition. The iteration stop condition is the trigger mechanism for generating an iteration stop event, and may include at least one of the following: 1) the current abnormal text is completely identical to the last abnormal text; 2) the number of texts common to the current abnormal text and the last abnormal text is more than half of the number of current abnormal texts or of the number of last abnormal texts; 3) after the current abnormal text determination operation is executed, the number of abnormal text determinations, i.e., the number of iterations, reaches a set threshold, such as 5; 4) none of the texts in the text cluster other than the current abnormal text meets the condition for being incorporated into the current abnormal text.
Optionally, determining the target abnormal text in the text cluster according to the current abnormal text may be: comparing the current abnormal text with the last abnormal text, and determining the target abnormal text in the text cluster according to the comparison result. For example, a consistency comparison can be performed between the current abnormal text and the last abnormal text, and whether an iteration stop event is detected is determined according to the comparison result. If the current abnormal text is completely consistent with the last abnormal text, or the number of texts they have in common is greater than half of the number of current abnormal texts or of the number of last abnormal texts (for example, if the current abnormal text contains 10 texts, half of that number is 5, so 6 texts in common would exceed it), the iteration stop event is detected. In that case, the current abnormal text can be taken as the target abnormal text in the text cluster, or the current abnormal text and the last abnormal text can both be taken as target abnormal texts in the text cluster, where a text that appears in both the current abnormal text and the last abnormal text is counted only once.
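The consistency comparison described above might be sketched in Python as follows. The function names `stop_by_comparison` and `merge_targets`, and the use of integer identifiers for texts, are assumptions made for illustration.

```python
def stop_by_comparison(current, previous):
    """True when the two selections are identical, or when the number of
    shared texts exceeds half the size of either selection."""
    current_set, previous_set = set(current), set(previous)
    if current_set == previous_set:
        return True
    shared = len(current_set & previous_set)
    return shared > len(current_set) / 2 or shared > len(previous_set) / 2

def merge_targets(current, previous):
    """Union of both selections; a text present in both is counted once."""
    return sorted(set(current) | set(previous))

identical = stop_by_comparison([1, 2], [2, 1])
mostly_shared = stop_by_comparison([1, 2, 3, 4], [1, 2, 3, 9])
disjoint = stop_by_comparison([1, 2, 3, 4], [5, 6, 7, 8])
targets = merge_targets([1, 2, 3, 4], [1, 2, 3, 9])
```

Using sets makes the comparison independent of ordering and ensures that a text shared by both selections appears only once in the merged result.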
Optionally, determining the target abnormal text in the text cluster according to the current abnormal text may also be: if an iteration stop event is detected according to the current abnormal text and the other texts in the text cluster except the current abnormal text, taking the current abnormal text as the target abnormal text in the text cluster. Specifically, after the current abnormal text is obtained, if statistical analysis shows that none of the texts in the text cluster other than the current abnormal text meets the condition for being incorporated into the current abnormal text, it is determined that an iteration stop event is detected, and the current abnormal text can then be used directly as the target abnormal text in the text cluster. By introducing an iteration stop event that depends on the current abnormal text and the other texts in the text cluster, this embodiment speeds up processing while still determining the target abnormal text in the text cluster accurately.
Alternatively, after determining the target abnormal text in the text cluster, the target abnormal text may be deleted from the text cluster, and the text cluster that does not include the target abnormal text may be recommended to the user for the user to browse.
It should be noted that, in this embodiment, the calculation of the central feature of each cluster does not include the abnormal text iterated last time, that is, the last abnormal text, so that the calculation of the central feature is more accurate; meanwhile, the distances between all texts in the text cluster and the central characteristics of the cluster are calculated every iteration, so that the probability of determining the non-abnormal text as the abnormal text is reduced, and the classification is more accurate.
According to the technical scheme provided by this embodiment of the application, loop iteration is used: the current cluster center feature is determined from the texts in the text cluster other than the last abnormal text, the distance between each text in the text cluster and the current cluster center feature is calculated, and the current abnormal text is selected from all the texts according to those distances. The target abnormal text in the text cluster, namely the irrelevant text, is then determined according to the current abnormal text. This solves the problem that abnormal texts in the text clusters currently recommended to users lead to a poor user experience, realizes automatic filtering of abnormal texts in a text cluster, makes classification more accurate, and improves the user experience.
Second embodiment
Fig. 2 is a flowchart of a method for determining an abnormal text according to a second embodiment of the present application, where the method further explains determining a target abnormal text in a text cluster according to the abnormal text based on the above embodiment. As shown in fig. 2, the abnormal text determining method provided in the present embodiment may include:
S210, determining the central characteristic of the current cluster according to other texts except the last abnormal text in the text cluster.
S220, selecting the current abnormal text from all the texts according to the distance between all the texts in the text cluster and the central feature of the current cluster.
S230, comparing the current abnormal text with the last abnormal text to judge whether they are consistent; if so, executing S240; if not, executing S250.
S240, taking the current abnormal text as the target abnormal text in the text cluster.
S250, judging whether the number of times of abnormal text determination reaches a set number of times threshold; if so, executing S260; if not, the process returns to S210.
S260, taking the last abnormal text and the current abnormal text as target abnormal texts in the text cluster.
To enable rapid determination of the target abnormal text in each text cluster, after a text set is crawled from the internet, the crawled text set may, for example, be filtered according to the actual scenario, e.g., according to at least one of a set title length, a regular expression, and a specific identifier. For example, texts in the text set that contain advertisements may be filtered out using regular expressions, and texts whose titles contain one or more specific identifiers, such as spaces, may also be filtered out. In addition, to speed up subsequent processing, texts whose title length exceeds the set title length may be filtered out of the text set.
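The three filters described above might be combined as in the following Python sketch. The advertisement pattern, the identifier characters, the length limit of 64, and the function name `keep_title` are all hypothetical; the text itself gives spaces as an example identifier, which suits titles written without spaces but not the English sample titles used here.

```python
import re

# All three settings below are hypothetical and would be tuned per scenario.
AD_PATTERN = re.compile(r"(discount|promo code|click here)", re.IGNORECASE)
SPECIFIC_IDENTIFIERS = ("|", "#")
MAX_TITLE_LENGTH = 64

def keep_title(title):
    """Return True when a title passes the length, regex, and identifier filters."""
    if len(title) > MAX_TITLE_LENGTH:
        return False
    if AD_PATTERN.search(title):
        return False
    if any(marker in title for marker in SPECIFIC_IDENTIFIERS):
        return False
    return True

titles = [
    "Acme releases quarterly results",
    "Huge discount, click here now",
    "Top stories | Daily digest",
    "x" * 80,
]
kept = [title for title in titles if keep_title(title)]
```

Checking the cheap length test first avoids running the regular expression on titles that would be discarded anyway.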
Specifically, after the text set is crawled from the internet, the text set crawled from the internet can be filtered according to at least one of a set title length, a regular expression and a specific identifier, and the title of each text after filtering can be vectorized by adopting an ERNIE model so as to obtain a text vector of each text; and then, clustering the text sets according to the mode of setting the clustering category and the keywords of the clustering category to obtain a plurality of text clusters. For each text cluster, the process of S210 to S260 may be employed to determine the target abnormal text of the text cluster.
For example, assume that each abnormal text determination operation selects the same number of abnormal texts, that a certain text cluster contains 100 texts, that the last abnormal text determination operation was the first one, and that it selected 10 abnormal texts. The average of the text vectors of the texts in the text cluster other than the last abnormal text, i.e., of the remaining texts among the 100 texts, can then be used as the current cluster center feature. Next, the cosine similarity between the text vector of each of the 100 texts and the current cluster center feature is calculated, the cosine similarities are sorted in ascending order, and the first 10 texts are taken as the current abnormal text. The current abnormal text is then compared with the last abnormal text for consistency. If they are consistent, the current abnormal text or the last abnormal text is taken as the target abnormal text in the text cluster. If they are inconsistent and the number of abnormal text determinations, i.e., the number of iterations, has reached the set threshold, the last abnormal text and the current abnormal text can both be taken as target abnormal texts in the text cluster. Optionally, if they are inconsistent and the number of iterations has not reached the set threshold, the process returns to S210 until the target abnormal text in the text cluster is determined.
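The flow of S210 to S260 might be sketched end to end in Python as follows. This is a simplified illustration rather than a definitive implementation: the function names, the default threshold of 5 iterations, the strict set-equality test for consistency, and the small two-dimensional sample vectors are all assumptions.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_target_abnormal(vectors, count, max_iterations=5):
    """Iterate S210-S260 and return the indices of the target abnormal texts."""
    if not 0 < count < len(vectors):
        raise ValueError("count must leave at least one text for the center")
    previous = set()  # the last abnormal text is empty on the first pass
    for iteration in range(1, max_iterations + 1):
        # S210: center from every text except the last abnormal texts.
        kept = [vec for idx, vec in enumerate(vectors) if idx not in previous]
        center = mean_vector(kept)
        # S220: rank ALL texts by similarity to the center, lowest first.
        sims = [cosine_similarity(vec, center) for vec in vectors]
        order = sorted(range(len(vectors)), key=lambda idx: sims[idx])
        current = set(order[:count])
        # S230/S240: stop when the selection no longer changes.
        if current == previous:
            return current
        # S250/S260: stop at the iteration threshold and merge both selections.
        if iteration == max_iterations:
            return previous | current
        previous = current
    return previous  # only reached when max_iterations < 1

vectors = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.05], [0.0, 1.0]]
targets = find_target_abnormal(vectors, count=1)
```

With these sample vectors the same text is selected in the first and second passes, so the loop stops at S240 on the second pass.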
According to the technical scheme provided by this embodiment of the application, the target abnormal text in the text cluster is determined by comparing the current abnormal text with the last abnormal text, which increases the flexibility of selecting the target abnormal text, namely the irrelevant text.
Third embodiment
Fig. 3 is a flowchart of a method for determining an abnormal text according to a third embodiment of the present application, where the method further explains determining a target abnormal text in a text cluster according to the abnormal text based on the above embodiment. As shown in fig. 3, the abnormal text determining method provided in the present embodiment may include:
S310, determining the current cluster center feature according to the texts in the text cluster other than the last abnormal text.
S320, selecting the current abnormal text from all the texts according to the distances between the texts in the text cluster and the current cluster center feature.
S330, taking the current abnormal text closest to the current cluster center feature as the first current abnormal text.
For example, after the current cluster center feature is determined, the cosine similarity between the text vector of each text in the text cluster and the current cluster center feature can be calculated, the cosine similarities arranged in ascending order, and a fixed number of leading texts, for example the first 10, taken as the current abnormal text. Then, according to the cosine similarity between the text vector of each current abnormal text and the current cluster center feature, the current abnormal text with the largest cosine similarity to the current cluster center feature, namely the current abnormal text closest to the current cluster center feature, is taken as the first current abnormal text.
S340, taking the other current abnormal text closest to the first current abnormal text as the second current abnormal text.
Specifically, the cosine similarity between each of the other current abnormal texts and the first current abnormal text can be calculated, and the other current abnormal text with the largest cosine similarity to the first current abnormal text, namely the one closest to the first current abnormal text, can be taken as the second current abnormal text. Optionally, there may be one or more second current abnormal texts. For example, if several other current abnormal texts are tied as closest to the first current abnormal text, all of them may be taken as second current abnormal texts.
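By way of illustration only, the selection of the first and second current abnormal texts in S330 and S340 may be sketched as follows. The sketch assumes that text vectors are lists of floats and that the current abnormal text is given as a list of at least two indices; the function name `pick_first_and_second` is hypothetical, and ties are detected with `math.isclose` as one possible choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_first_and_second(vectors, abnormal, center):
    """Return (first, seconds): one index and a list of one or more tied indices."""
    # S330: the current abnormal text most similar to the current cluster center feature.
    first = max(abnormal, key=lambda i: cosine_similarity(vectors[i], center))
    # S340: among the remaining current abnormal texts, those most similar to the first one.
    others = [i for i in abnormal if i != first]
    sims = {i: cosine_similarity(vectors[i], vectors[first]) for i in others}
    best = max(sims.values())
    seconds = [i for i in others if math.isclose(sims[i], best)]
    return first, seconds


vectors = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0]]
first, seconds = pick_first_and_second(vectors, abnormal=[1, 2, 3], center=[1.0, 0.0])
print(first, seconds)
```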
S350, if the distances between the first current abnormal text and all the other texts in the text cluster except the current abnormal text are larger than the distance between the first current abnormal text and the second current abnormal text, generating an iteration stop event.
Specifically, the cosine similarity between the first current abnormal text and the second current abnormal text is calculated; meanwhile, the cosine similarity between the first current abnormal text and each of the other texts in the text cluster except the current abnormal text is calculated. Then, the cosine similarity between each of those other texts and the first current abnormal text is compared with the cosine similarity between the first current abnormal text and the second current abnormal text. If the cosine similarity between every other text in the text cluster except the current abnormal text and the first current abnormal text is larger than the cosine similarity between the first current abnormal text and the second current abnormal text, an iteration stop event is generated, and S360 is then executed to determine the target abnormal text in the text cluster.
In addition, for each of the other texts in the text cluster except the current abnormal text, if the cosine similarity between that text and the first current abnormal text is smaller than or equal to the cosine similarity between the first current abnormal text and the second current abnormal text, that text is added to the current abnormal text, and execution returns to the next abnormal text determining operation. If the distance between any other text in the text cluster except the current abnormal text and the first current abnormal text is smaller than or equal to the distance between the first current abnormal text and the second current abnormal text, that text is added to the current abnormal text, and the next abnormal text determining operation is triggered.
In order to further increase the processing speed, after the other texts are added to the current abnormal text, the current abnormal text can be compared with the last abnormal text for consistency, and whether the next abnormal text determining operation is triggered is decided according to the comparison result. For example, if the last abnormal text is consistent with the current abnormal text, the current abnormal text can be used directly as the target abnormal text in the text cluster without executing the next abnormal text determining operation.
S360, if an iteration stop event is detected according to the current abnormal text and the other texts in the text cluster except the current abnormal text, taking the current abnormal text as the target abnormal text in the text cluster.
According to the technical scheme provided by this embodiment of the application, by introducing an iteration stop event that depends on the current abnormal text and the other texts in the text cluster, the processing speed is increased while the target abnormal text in the text cluster can still be determined accurately.
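By way of illustration only, the distance comparison of S350 may be sketched as follows. The sketch follows the distance-based wording of S350 and assumes, purely for the example, that the distance between two text vectors is the cosine distance, taken as one minus the cosine similarity; the function names `cosine_distance` and `check_stop_event` are hypothetical. The function reports whether an iteration stop event would be generated and, if not, which other texts would be added to the current abnormal text.

```python
import math


def cosine_distance(a, b):
    """Assumed distance measure: 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)


def check_stop_event(vectors, abnormal, first, second):
    """Return (stop, to_add) for the current abnormal text given as a set of indices."""
    # Reference distance between the first and the second current abnormal text.
    threshold = cosine_distance(vectors[first], vectors[second])
    # Other texts that are no farther from the first abnormal text than the second one is.
    to_add = [i for i in range(len(vectors))
              if i not in abnormal
              and cosine_distance(vectors[i], vectors[first]) <= threshold]
    # The stop event is generated only when every other text is strictly farther away.
    return len(to_add) == 0, to_add


vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(check_stop_event(vectors, abnormal={2, 3}, first=3, second=2))

vectors_with_neighbor = vectors + [[0.05, 1.0]]
print(check_stop_event(vectors_with_neighbor, abnormal={2, 3}, first=3, second=2))
```

In the second call the added vector lies between the two abnormal texts, so it would be added to the current abnormal text and the next abnormal text determining operation would be triggered.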
Fourth embodiment
Fig. 4 is a schematic structural diagram of an abnormal text determining apparatus according to a fourth embodiment of the present application. The apparatus may be configured in a computing device that carries an abnormal text determining function, may perform the abnormal text determination method according to any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the method performed. As shown in Fig. 4, the apparatus may include:
the current center feature determining module 410 is configured to determine a current center feature of the text cluster according to other text except the last abnormal text in the text cluster;
the current abnormal text determining module 420 is configured to select a current abnormal text from all the texts according to the distances between all the texts in the text cluster and the central feature of the current cluster;
the target abnormal text determining module 430 is configured to determine a target abnormal text in the text cluster according to the current abnormal text.
According to the technical scheme provided by this embodiment of the application, a loop iteration mode is adopted: the current cluster center feature is determined according to the texts in the text cluster other than the last abnormal text; the distance between each text in the text cluster and the current cluster center feature is then calculated, and the current abnormal text is selected from all the texts according to these distances; the target abnormal text in the text cluster, namely the irrelevant text, can then be determined according to the current abnormal text. This solves the problem that text clusters currently recommended to users contain abnormal texts, which results in a poor user experience. Abnormal texts in a text cluster are filtered automatically, so that classification is more accurate and user experience is improved.
Illustratively, the target anomaly text determination module 430 may be specifically configured to:
if an iteration stop event is detected according to the current abnormal text and the other texts in the text cluster except the current abnormal text, take the current abnormal text as the target abnormal text in the text cluster; or
compare the current abnormal text with the last abnormal text, and determine the target abnormal text in the text cluster according to the comparison result.
Illustratively, the apparatus may further include: an iteration stop event generation module, which may be specifically configured to:
take the current abnormal text closest to the current cluster center feature as the first current abnormal text;
take the other current abnormal text closest to the first current abnormal text as the second current abnormal text;
if the distances between the first current abnormal text and all the other texts in the text cluster except the current abnormal text are larger than the distance between the first current abnormal text and the second current abnormal text, generate an iteration stop event.
Illustratively, the apparatus may further include:
a text adding triggering module, configured to, after the other current abnormal text closest to the first current abnormal text is taken as the second current abnormal text, add any other text in the text cluster except the current abnormal text to the current abnormal text if the distance between that text and the first current abnormal text is smaller than or equal to the distance between the first current abnormal text and the second current abnormal text, and trigger the next abnormal text determining operation.
Illustratively, the target abnormal text determination module 430, when configured to determine target abnormal text in a text cluster according to the comparison result, may specifically be configured to:
if the last abnormal text is consistent with the current abnormal text, the current abnormal text is used as a target abnormal text in a text cluster;
if the last abnormal text is inconsistent with the current abnormal text and the number of times of abnormal text determination reaches a set number of times threshold, the last abnormal text and the current abnormal text are used as target abnormal texts in the text cluster.
Illustratively, the apparatus may further include:
a text cluster determining module, configured to cluster the crawled text set according to the set clustering category and the keywords of the clustering category to obtain a text cluster.
Illustratively, the apparatus may further include:
a filtering module, configured to filter the text set according to at least one of a set title length, a regular expression and a specific identifier before the crawled text set is clustered according to the set clustering category and the keywords of the clustering category.
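By way of illustration only, the filtering and keyword-based clustering performed by these two modules may be sketched as follows. The application states only that filtering uses at least one of a set title length, a regular expression and a specific identifier; the concrete length bounds, the pattern, the marker string `[AD]`, the category names and the function names `filter_texts` and `cluster_by_keywords` are all assumptions made for the example, and the sketch applies all three filters at once.

```python
import re


def filter_texts(titles, min_len=4, max_len=40, pattern=r"^\d+$", marker="[AD]"):
    """Keep titles within the length bounds that match neither the pattern nor the marker."""
    unwanted = re.compile(pattern)
    return [t for t in titles
            if min_len <= len(t) <= max_len
            and not unwanted.search(t)
            and marker not in t]


def cluster_by_keywords(titles, category_keywords):
    """Assign each title to the first category one of whose keywords it contains."""
    clusters = {category: [] for category in category_keywords}
    for title in titles:
        lowered = title.lower()
        for category, keywords in category_keywords.items():
            if any(keyword.lower() in lowered for keyword in keywords):
                clusters[category].append(title)
                break  # a title joins at most one text cluster in this sketch
    return clusters


crawled = [
    "Team wins football final",
    "12345",
    "[AD] Buy cheap phones now",
    "New phone chip released",
    "Hi",
]
kept = filter_texts(crawled)
clusters = cluster_by_keywords(kept, {"sports": ["football"], "tech": ["phone", "chip"]})
print(clusters)
```

Titles that match no keyword are simply left out of every cluster here; another design could collect them in a separate cluster.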
According to embodiments of the present application, an electronic device and a readable storage medium are also provided.
Fig. 5 is a block diagram of an electronic device for implementing the abnormal text determination method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in Fig. 5, the electronic device includes: one or more processors 501, memory 502, and interfaces for connecting components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a graphical user interface (GUI) on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations, e.g., as a server array, a set of blade servers, or a multiprocessor system. One processor 501 is taken as an example in Fig. 5.
Memory 502 is a non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the abnormal text determination method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the abnormal text determination method provided by the present application.
As a non-transitory computer-readable storage medium, the memory 502 may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the abnormal text determination method in the embodiments of the present application, for example, the current center feature determining module 410, the current abnormal text determining module 420, and the target abnormal text determining module 430 shown in Fig. 4. The processor 501 executes various functional applications and data processing of the server by running the non-transitory software programs, instructions, and modules stored in the memory 502, that is, implements the abnormal text determination method in the above method embodiments.
Memory 502 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created through use of the electronic device for implementing the abnormal text determination method, and the like. In addition, memory 502 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 502 may optionally include memory located remotely from processor 501, which may be connected via a network to the electronic device used to implement the abnormal text determination method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for implementing the abnormal text determination method may further include: an input device 503 and an output device 504. The processor 501, memory 502, input device 503 and output device 504 may be connected by a bus or otherwise; connection by a bus is taken as an example in Fig. 5.
The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic device used to implement the abnormal text determination method. Examples of input devices include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, and a joystick. The output device 504 may include a display device, auxiliary lighting devices such as light-emitting diodes (LEDs), tactile feedback devices such as vibration motors, and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), an LED display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device for displaying information to a user, for example, a Cathode Ray Tube (CRT) or an LCD monitor; and a keyboard and pointing device, such as a mouse or trackball, by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback such as visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component, e.g., a data server; or in a computing system that includes a middleware component, e.g., an application server; or in a computing system that includes a front-end component, e.g., a user computer having a graphical user interface or web browser through which a user can interact with an implementation of the systems and techniques described here; or in a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include: a local area network (LAN), a wide area network (WAN), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiments of the application, a loop iteration mode is adopted: the current cluster center feature is determined according to the texts in the text cluster other than the last abnormal text; the distance between each text in the text cluster and the current cluster center feature is then calculated, and the current abnormal text is selected from all the texts according to these distances; the target abnormal text in the text cluster, namely the irrelevant text, can then be determined according to the current abnormal text. This solves the problem that text clusters currently recommended to users contain abnormal texts, which results in a poor user experience. Abnormal texts in a text cluster are filtered automatically, so that classification is more accurate and user experience is improved.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (8)

1. An abnormal text determination method, characterized by comprising:
determining the central characteristics of the current cluster according to other texts except the last abnormal text in the text cluster;
selecting the current abnormal text from all texts according to the distances between the texts in the text cluster and the current cluster center feature;
determining a target abnormal text in the text cluster according to the current abnormal text, comprising:
if an iteration stop event is detected according to the current abnormal text and the other texts in the text cluster except the current abnormal text, taking the current abnormal text as the target abnormal text in the text cluster; or
comparing the current abnormal text with the last abnormal text, and determining the target abnormal text in the text cluster according to a comparison result;
the method further comprises the steps of:
taking the current abnormal text closest to the current cluster center feature as a first current abnormal text;
taking the other current abnormal text closest to the first current abnormal text as a second current abnormal text;
and if the distances between the first current abnormal text and all the other texts in the text cluster except the current abnormal text are larger than the distance between the first current abnormal text and the second current abnormal text, generating the iteration stop event.
2. The method according to claim 1, wherein after using other abnormal text closest to the first abnormal text as the second abnormal text, the method further comprises:
if the distance between each other text except the current abnormal text in the text cluster and the first current abnormal text is smaller than or equal to the distance between the first current abnormal text and the second current abnormal text, adding the other text into the current abnormal text, and triggering the next abnormal text determining operation.
3. The method of claim 1, wherein determining the target anomaly text in the text cluster based on the comparison result comprises:
if the last abnormal text is consistent with the current abnormal text, the current abnormal text is used as a target abnormal text in the text cluster;
and if the last abnormal text is inconsistent with the current abnormal text and the number of abnormal text determinations reaches a set number threshold, taking the last abnormal text and the current abnormal text as target abnormal texts in the text cluster.
4. The method as recited in claim 1, further comprising:
clustering the crawled text set according to the set clustering category and the keywords of the clustering category to obtain a text cluster.
5. The method of claim 4, wherein before clustering the crawled text set according to the set clustering category and the keywords of the clustering category, the method further comprises:
filtering the text set according to at least one of a set title length, a regular expression, and a specific identifier.
6. An abnormal text determination apparatus, comprising:
the current center feature determining module is used for determining the current cluster center feature according to other texts except the last abnormal text in the text cluster;
the current abnormal text determining module is used for selecting the current abnormal text from all texts according to the distances between the texts in the text cluster and the current cluster center feature;
the target abnormal text determining module is used for determining target abnormal texts in the text cluster according to the current abnormal text;
the target abnormal text determining module is specifically configured to:
if an iteration stop event is detected according to the current abnormal text and the other texts in the text cluster except the current abnormal text, take the current abnormal text as the target abnormal text in the text cluster; or
compare the current abnormal text with the last abnormal text, and determine the target abnormal text in the text cluster according to a comparison result;
an iteration stop event generation module, used for:
taking the current abnormal text closest to the current cluster center feature as a first current abnormal text;
taking the other current abnormal text closest to the first current abnormal text as a second current abnormal text;
and if the distances between the first current abnormal text and all the other texts in the text cluster except the current abnormal text are larger than the distance between the first current abnormal text and the second current abnormal text, generating the iteration stop event.
7. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the abnormal text determination method of any one of claims 1-5.
8. A non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the abnormal text determination method of any one of claims 1-5.
CN201911341128.3A 2019-12-23 2019-12-23 Abnormal text determination method and device, electronic equipment and medium Active CN111125362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911341128.3A CN111125362B (en) 2019-12-23 2019-12-23 Abnormal text determination method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911341128.3A CN111125362B (en) 2019-12-23 2019-12-23 Abnormal text determination method and device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN111125362A CN111125362A (en) 2020-05-08
CN111125362B true CN111125362B (en) 2023-06-16

Family

ID=70501319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911341128.3A Active CN111125362B (en) 2019-12-23 2019-12-23 Abnormal text determination method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN111125362B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860849B (en) * 2021-01-20 2021-11-30 平安科技(深圳)有限公司 Abnormal text recognition method and device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786898A (en) * 2014-12-24 2016-07-20 中国移动通信集团公司 Domain ontology construction method and apparatus
EP3293661A1 (en) * 2016-09-08 2018-03-14 AO Kaspersky Lab System and method for detecting anomalous elements of web pages
CN108090193A (en) * 2017-12-21 2018-05-29 阿里巴巴集团控股有限公司 The recognition methods of abnormal text and device
CN108875049A (en) * 2018-06-27 2018-11-23 中国建设银行股份有限公司 text clustering method and device
CN110083475A (en) * 2019-04-23 2019-08-02 新华三信息安全技术有限公司 A kind of detection method and device of abnormal data

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9589045B2 (en) * 2014-04-08 2017-03-07 International Business Machines Corporation Distributed clustering with outlier detection
US11238083B2 (en) * 2017-05-12 2022-02-01 Evolv Technology Solutions, Inc. Intelligently driven visual interface on mobile devices and tablets based on implicit and explicit user actions
US20180336437A1 (en) * 2017-05-19 2018-11-22 Nec Laboratories America, Inc. Streaming graph display system with anomaly detection

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Honglong Xu; Rui Mao; Hao Liao; Minhua Lu; He Zhang. Closest neighbors excluded outlier detection. 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS). 2016, pp. 105-110. *
Zhu Huan; Zhang Pengzhou; Gao Zeyang. K-means Text Dynamic Clustering Algorithm Based on KL Divergence. 2018 IEEE/ACIS 17th International Conference on Computer and Information Science (ICIS). 2018, pp. 659-663. *
Fan Jiajian (范佳健). Cluster Analysis of Microblog Comment Information. China Master's Theses Full-text Database, Information Science and Technology. 2017, pp. I138-598. *

Also Published As

Publication number Publication date
CN111125362A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN111460083B (en) Method and device for constructing document title tree, electronic equipment and storage medium
CN111967256B (en) Event relation generation method and device, electronic equipment and storage medium
CN111563385B (en) Semantic processing method, semantic processing device, electronic equipment and medium
CN111950254B (en) Word feature extraction method, device and equipment for searching samples and storage medium
CN111753914B (en) Model optimization method and device, electronic equipment and storage medium
CN111488740B (en) Causal relationship judging method and device, electronic equipment and storage medium
KR20210038467A (en) Method and apparatus for generating an event theme, device and storage medium
KR20210132578A (en) Method, apparatus, device and storage medium for constructing knowledge graph
CN111460289B (en) News information pushing method and device
CN112115313B (en) Regular expression generation and data extraction methods, devices, equipment and media
CN111400456B (en) Information recommendation method and device
CN111756832B (en) Method and device for pushing information, electronic equipment and computer readable storage medium
CN111737966B (en) Document repetition detection method, device, equipment and readable storage medium
CN111563198B (en) Material recall method, device, equipment and storage medium
CN111797216B (en) Search term rewriting method, apparatus, device and storage medium
CN111460296B (en) Method and apparatus for updating event sets
CN111460791B (en) Text classification method, device, equipment and storage medium
CN111310058B (en) Information theme recommendation method, device, terminal and storage medium
CN111125362B (en) Abnormal text determination method and device, electronic equipment and medium
CN113111216B (en) Advertisement recommendation method, device, equipment and storage medium
CN113127669B (en) Advertisement mapping method, device, equipment and storage medium
CN111414487B (en) Method, device, equipment and medium for associated expansion of event theme
CN111125445B (en) Community theme generation method and device, electronic equipment and storage medium
CN113722593B (en) Event data processing method, device, electronic equipment and medium
CN111340222B (en) Neural network model searching method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant