WO2020052547A1 - Method and device for identifying new spam words in short messages, and electronic device - Google Patents

Info

Publication number
WO2020052547A1
WO2020052547A1 PCT/CN2019/105123 CN2019105123W WO2020052547A1 WO 2020052547 A1 WO2020052547 A1 WO 2020052547A1 CN 2019105123 W CN2019105123 W CN 2019105123W WO 2020052547 A1 WO2020052547 A1 WO 2020052547A1
Authority
WO
WIPO (PCT)
Prior art keywords
spam
word
new
candidate
short message
Prior art date
Application number
PCT/CN2019/105123
Other languages
English (en)
French (fr)
Inventor
高喆
康杨杨
周笑添
孙常龙
刘晓钟
司罗
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020052547A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the present application relates to the field of text mining technology, and in particular, to a method and a device for identifying new words of short message spam, and an electronic device.
  • a typical scenario for sending text messages is that a merchant sends text messages to consumers through a network platform to facilitate timely delivery of information such as product promotions to consumers, thereby ensuring the effective implementation of merchant sales plans and improving the user experience.
  • At the same time, a large number of spam messages have also appeared.
  • The proliferation of spam messages has seriously affected the normal lives of consumers, the image of online platforms, and even social stability.
  • An SMS content security system analyzes the content of business-to-customer (B2C) SMS messages and performs intelligent SMS interception and channel optimization.
  • The discovery of new spam words is an important function of the SMS content security system. Effectively mining new spam words can improve the accuracy of the spam message recognition model, allows a timely response to spam message variants appearing online, and can provide SMS reviewers with the new spam variants that appear online every day.
  • Methods for identifying new spam words fall mainly into two categories: methods based on supervised new word discovery and methods based on unsupervised new word discovery.
  • Supervised new word discovery relies on sequence labeling results.
  • Unsupervised new word discovery uses certain indicators to calculate the word formation probability of candidate phrases that are not included in the spam message dictionary, and judges from that probability whether a candidate phrase is a new spam word. Because this approach has low cost and high efficiency, it has become the most commonly used scheme for identifying new spam words.
  • The inventor found that the technical solution for identifying new spam words based on unsupervised new word discovery has at least the following problems. On the one hand, because it relies only on simple frequency features (such as the frequency of occurrence of words) and does not consider the category distribution, it tends to recall many normal words, which raises subsequent review costs. On the other hand, because the word formation probability is calculated from simple frequency features, and variant new words in spam messages occur with low frequency, the word formation probability of such words is low; these new spam words therefore cannot be identified, so low-frequency new spam words cannot be recalled.
  • The existing technology therefore has the problem that the recall rate and the precision of new spam word identification are both low.
  • This application provides a method for identifying new spam words in short messages, so as to solve the problem in the prior art that the recall rate and the precision of new spam word identification are both low.
  • This application additionally provides a short message spam recognition device, and an electronic device.
  • This application provides a method for identifying new spam words in short messages, including:
  • the short message set includes multiple spam messages and multiple normal messages;
  • a new spam word is determined from the candidate word set.
  • determining the candidate word set corresponding to the multiple spam messages includes:
  • if a combined word formed by at least two adjacent short message words meets the candidate word rule, the combined word is used as the candidate word.
  • the candidate word rule includes that the number of words of the candidate word is less than a preset number of words.
  • the short message category tendency related index includes at least one of the following indicators: cross entropy, odds ratio, and mutual information;
  • the document rarity related index includes: inverse document frequency (IDF).
  • determining a new spam word score of the candidate word according to the short message category tendency related index and the document rarity related index includes:
  • a weighted average of the short message category tendency related index and the document rarity related index is used as the new spam word score.
  • determining the new spam word from the candidate word set according to the new spam word score includes:
  • the candidate word whose new spam word score is greater than a score threshold is taken as the new spam word.
  • determining the new spam word from the candidate word set according to the new spam word score includes:
  • the target new word is taken as the new spam word.
  • determining the new spam word from the candidate word set according to the new spam word score includes:
  • determining the new spam word from the candidate new spam words according to the word formation probability includes:
  • the candidate new spam word whose word formation probability is greater than a word formation probability threshold is taken as the new spam word.
  • determining the new spam word from the candidate new spam words according to the word formation probability includes:
  • the target new word is taken as the new spam word.
  • determining the word formation probability of the candidate new spam word includes:
  • the word formation probability of the candidate new spam word is obtained.
  • obtaining the word formation probability of the candidate new spam word according to the internal cohesion and the external degree of freedom includes:
  • the average of the internal cohesion and the external degree of freedom is used as the word formation probability.
  • determining the new spam word from the candidate word set according to the new spam word score includes:
  • the new spam word is determined from the candidate new spam words.
  • determining the new spam word from the candidate new spam words according to the spam vocabulary similarity includes:
  • the candidate new spam word whose spam vocabulary similarity is greater than a first similarity threshold or less than a second similarity threshold is taken as the new spam word.
  • determining the new spam word from the candidate new spam words according to the spam vocabulary similarity includes:
  • the target candidate new word is taken as the new spam word.
  • determining the spam vocabulary similarity of the candidate new spam word according to the semantic similarity includes:
  • determining the semantic similarity between at least one preset spam word and the candidate new spam word includes:
  • This application also provides a method for identifying new spam words in short messages, including:
  • the short message set includes multiple spam messages and multiple normal messages;
  • a new spam word is determined from the candidate word set according to the short message category tendency score.
  • determining a new spam word from the candidate word set according to the short message category tendency score includes:
  • the new spam word is determined from the candidate new spam words.
  • This application also provides a method for identifying new spam words in short messages, including:
  • the short message set includes multiple spam messages and multiple normal messages;
  • a new spam word is determined from the candidate word set according to the spam vocabulary similarity.
  • This application also provides a device for identifying new spam words in short messages, including:
  • a short message set acquiring unit configured to obtain a short message set, the short message set including multiple spam messages and multiple normal messages;
  • a candidate word set determining unit configured to determine a candidate word set corresponding to the multiple spam messages;
  • an index determining unit configured to determine a short message category tendency related index of the candidate word according to the short message category information of the short messages, and to obtain a document rarity related index of the candidate word;
  • a score determining unit configured to determine a new spam word score of the candidate word according to the short message category tendency related index and the document rarity related index;
  • a new word determining unit configured to determine a new spam word from the candidate word set according to the new spam word score.
  • the candidate word set determining unit includes:
  • if a combined word formed by at least two adjacent short message words meets the candidate word rule, the combined word is used as the candidate word.
  • the score determining unit is specifically configured to use the weighted average of the short message category tendency related index and the document rarity related index as the new spam word score.
  • the new word determining unit includes:
  • a candidate word selection subunit configured to obtain the candidate word whose new spam word score is greater than a score threshold as a candidate new spam word;
  • a word formation probability determining subunit configured to determine a word formation probability of the candidate new spam word;
  • a first new word determining subunit configured to determine the new spam word from the candidate new spam words according to the word formation probability.
  • the new word determining unit includes:
  • a candidate word selection subunit configured to obtain the candidate word whose new spam word score is greater than a score threshold as a candidate new spam word;
  • a first similarity determining subunit configured to determine a semantic similarity between at least one preset spam word and the candidate new spam word;
  • a second similarity determining subunit configured to determine, according to the semantic similarity, a spam vocabulary similarity of the candidate new spam word;
  • a second new word determining subunit configured to determine the new spam word from the candidate new spam words according to the spam vocabulary similarity.
  • This application also provides an electronic device, including:
  • the memory is configured to store a program implementing the method for identifying new spam words in short messages. After the device is powered on and runs the program through the processor, the following steps are performed: obtaining a short message set, the short message set including multiple spam messages and multiple normal messages; determining a candidate word set corresponding to the multiple spam messages; determining a short message category tendency related index of the candidate word according to the short message category information of the short messages, and obtaining a document rarity related index of the candidate word; determining a new spam word score of the candidate word according to the short message category tendency related index and the document rarity related index; and determining a new spam word from the candidate word set according to the new spam word score.
  • This application also provides a device for identifying new spam words in short messages, including:
  • a short message set acquiring unit configured to obtain a short message set, the short message set including multiple spam messages and multiple normal messages;
  • a candidate word set determining unit configured to determine a candidate word set corresponding to the multiple spam messages;
  • a score determining unit configured to determine a short message category tendency score of the candidate word according to the short message category information of the short messages;
  • a new word determining unit configured to determine a new spam word from the candidate word set according to the short message category tendency score.
  • the new word determining unit includes:
  • a candidate word selection subunit configured to obtain the candidate word whose short message category tendency score is greater than a score threshold as a candidate new spam word;
  • a similarity determining subunit configured to determine a spam vocabulary similarity of the candidate new spam word;
  • a new word determining subunit configured to determine the new spam word from the candidate new spam words according to the spam vocabulary similarity.
  • This application also provides an electronic device, including:
  • the memory is configured to store a program implementing the method for identifying new spam words in short messages. After the device is powered on and runs the program through the processor, the following steps are performed: obtaining a short message set, the short message set including multiple spam messages and multiple normal messages; determining a candidate word set corresponding to the multiple spam messages; determining a short message category tendency score of the candidate word according to the short message category information of the short messages; and determining a new spam word from the candidate word set according to the short message category tendency score.
  • This application also provides a device for identifying new spam words in short messages, including:
  • a short message set acquiring unit configured to obtain a short message set, the short message set including multiple spam messages and multiple normal messages;
  • a candidate word set determining unit configured to determine a candidate word set corresponding to the multiple spam messages;
  • a first semantic similarity determining unit configured to determine a semantic similarity between at least one preset spam word and each candidate word;
  • a second semantic similarity determining unit configured to determine a spam vocabulary similarity of the candidate word according to the semantic similarity;
  • a new word determining unit configured to determine a new spam word from the candidate word set according to the spam vocabulary similarity.
  • This application also provides an electronic device, including:
  • the memory is configured to store a program implementing the method for identifying new spam words in short messages. After the device is powered on and runs the program through the processor, the following steps are performed: obtaining a short message set, the short message set including multiple spam messages and multiple normal messages; determining a candidate word set corresponding to the multiple spam messages; determining a semantic similarity between at least one preset spam word and each candidate word; determining a spam vocabulary similarity of the candidate word according to the semantic similarity; and determining a new spam word from the candidate word set according to the spam vocabulary similarity.
  • The present application also provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the methods described above.
  • The present application also provides a computer program product including instructions that, when run on a computer, cause the computer to execute the methods described above.
  • An embodiment of the present application provides a method for identifying new spam words in short messages, which obtains a short message set, determines a candidate word set corresponding to the multiple spam messages, and determines a short message category tendency related index of the candidate word according to the short message category information of the short messages.
  • This processing makes it possible to recall low-frequency new spam words based on the document rarity related index; therefore, the recall rate can be effectively improved.
  • This processing also makes it possible to screen out new words that have a certain degree of rarity in spam messages and lean toward the spam category; therefore, both the recall rate and the precision of new word discovery can be effectively improved.
  • In another method for identifying new spam words in short messages provided by an embodiment of the present application, a short message set is obtained; a candidate word set corresponding to the multiple spam messages is determined; a short message category tendency score of the candidate word is determined according to the short message category information of the short messages, where the short message category is either spam or normal; and new spam words are determined from the candidate word set according to the short message category tendency score. With this processing, most of the new words recalled according to the category tendency related index are representative of spam messages, and recalling many normal words is avoided; therefore, the precision can be effectively improved.
  • In yet another method for identifying new spam words in short messages provided by an embodiment of the present application, a short message set is obtained; a candidate word set corresponding to the multiple spam messages is determined; a semantic similarity between at least one preset spam word and each candidate word is determined; a spam vocabulary similarity of the candidate word is determined according to the semantic similarity; and new spam words are determined from the candidate word set according to the spam vocabulary similarity. With this processing, candidate words are semantically expanded through word embeddings and new spam words are determined according to the semantic similarity between candidate words and spam words, so rare new words are not overlooked; therefore, the recall rate of new spam words can be effectively improved.
  • FIG. 1 is a flowchart of an embodiment of a short message spam word recognition method provided by the present application
  • FIG. 2 is a specific flowchart of an embodiment of a method for identifying new words in a short message spam provided by the present application
  • FIG. 3 is a specific flowchart of an embodiment of a method for identifying new words of short message spam provided by the present application
  • FIG. 4 is a specific flowchart of an embodiment of a method for identifying new words in a short message spam provided by the present application
  • FIG. 5 is a schematic diagram of an embodiment of a short message spam new word recognition device provided by the present application.
  • FIG. 6 is a schematic diagram of an embodiment of an electronic device provided by the present application.
  • FIG. 7 is a flowchart of an embodiment of a short message spam word recognition method provided by the present application.
  • FIG. 8 is a schematic diagram of an embodiment of a short message spam new word recognition device provided by the present application.
  • FIG. 9 is a schematic diagram of an embodiment of an electronic device provided by the present application.
  • FIG. 10 is a flowchart of an embodiment of a short message spam word recognition method provided by the present application.
  • FIG. 11 is a schematic diagram of an embodiment of a short message spam new word recognition device provided by the present application.
  • FIG. 12 is a schematic diagram of an embodiment of an electronic device provided by the present application.
  • FIG. 1 is a flowchart of an embodiment of the method for identifying new spam words in short messages provided by the present application.
  • The method is executed by a device for identifying new spam words in short messages.
  • The method for identifying new spam words in short messages provided in this application includes:
  • Step S101: Obtain a short message set.
  • A short message, also called a text message or SMS, includes but is not limited to mobile phone text messages, and may also be another form of short message such as an instant message.
  • The short message set includes multiple spam messages and multiple normal messages. The short message category of a spam message is labeled as spam, and that of a normal message is labeled as normal.
  • Step S103: Determine a candidate word set corresponding to the multiple spam messages.
  • Candidate words, also referred to as candidate new words, are words that appear in the multiple spam messages but are not in the spam message dictionary. Because the words produced by segmenting a spam message are not new words, they are not included.
  • The spam message dictionary includes a plurality of spam words that have already been determined.
  • Step S103 may include the following sub-steps: 1) using a word segmentation algorithm to obtain the words included in the spam message as short message words; 2) if a combined word formed by at least two adjacent short message words meets the candidate word rule, using the combined word as the candidate word.
  • An existing word segmentation algorithm can be used to segment the spam messages.
  • Existing word segmentation algorithms can be divided into three categories: word segmentation methods based on string matching, word segmentation methods based on understanding, and word segmentation methods based on statistics. According to whether it is combined with the part-of-speech tagging process, it can be divided into a simple word segmentation method and an integrated method combining word segmentation and labeling.
  • the word segmentation algorithm is a relatively mature existing technology, which is not repeated here, and any existing word segmentation algorithm can be selected according to actual needs.
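  • As a non-limiting illustration of a word segmentation method based on string matching, the following Python sketch shows forward maximum matching. The function name `forward_max_match`, the toy dictionary, and the `max_len` parameter are assumptions made for this example and are not taken from the present application; any existing segmentation algorithm may be substituted.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary entry that matches, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real system would load a full lexicon.
toy_dictionary = {"new", "rural", "insurance"}
print(forward_max_match("newruralinsurance", toy_dictionary, max_len=9))
```

  • A space-free Latin string is used here only to mimic unsegmented text; characters that match no dictionary entry are emitted one at a time.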
  • For example, if the content of the spam message includes "New Rural Cooperative Insurance", the segmentation results include the following short message words: "new", "agricultural", "cooperative", and "insurance".
  • If a combined word formed by at least two adjacent short message words meets the candidate word rule, the combined word is used as the candidate word.
  • The candidate word rule includes, but is not limited to, the number of words of the candidate word being less than a preset number of words.
  • The preset number of words can be set according to business requirements, for example, to 4.
  • step S103 may also adopt other specific implementations, as long as the candidate word sets corresponding to the multiple spam messages can be determined.
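  • As a non-limiting illustration of the two sub-steps of step S103, the following Python sketch combines adjacent short message words into candidate words. It assumes that "number of words" counts the adjacent segments being combined; the names `candidate_words`, `preset_count`, and `joiner` are hypothetical and chosen only for this example.

```python
def candidate_words(segmented_messages, spam_dictionary, preset_count=4, joiner=""):
    """Combine at least two adjacent words (fewer than preset_count) into
    candidates, skipping anything already in the spam dictionary."""
    candidates = set()
    for words in segmented_messages:
        for start in range(len(words)):
            for n in range(2, preset_count):
                if start + n > len(words):
                    break  # the window would run past the end of the message
                combined = joiner.join(words[start:start + n])
                if combined not in spam_dictionary:
                    candidates.add(combined)
    return candidates


# Segmented spam message taken from the example in the description.
messages = [["new", "agricultural", "cooperative", "insurance"]]
found = candidate_words(messages, spam_dictionary=set(), joiner=" ")
print(sorted(found))
```

  • With a preset number of 4, only combinations of two or three adjacent words are kept, so the full four-word phrase is not itself a candidate.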
  • Step S105: According to the short message category information of the short messages, determine a short message category tendency related index of the candidate word, and obtain a document rarity related index of the candidate word.
  • The method provided in the embodiment of the present application introduces short message category information, and obtains a short message category tendency related index of the candidate word according to the distribution of the candidate word in spam messages and normal short messages.
  • The category of the short message may be a spam message or a normal message.
  • The short message category tendency related index refers to an index that can reflect the short message category tendency of the candidate word; that is, the short message category tendency of the candidate word can be determined according to this index.
  • The short message category tendency related index includes, but is not limited to, at least one of the following indicators: cross entropy, odds ratio, and mutual information.
  • Cross entropy is an important concept in Shannon's information theory. In the method provided in this application, it is mainly used to measure the difference information between the two probability distributions (spam message probability and normal message probability of the candidate word).
  • The odds ratio applies only to binary classification; its characteristic is that it considers only the score of a text feature for the target class.
  • Here, pos denotes the target category (such as spam messages), and neg denotes the non-target category (such as normal messages).
  • Mutual information is a useful measure in information theory. It can be regarded as the amount of information that one random variable contains about another random variable, or as the reduction in uncertainty of one random variable due to knowledge of another.
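  • As a non-limiting illustration, the following Python sketch computes two category tendency indicators for a candidate word: the log odds ratio between the target category (pos, spam) and the non-target category (neg, normal), and the pointwise mutual information between the word and the spam category. These are textbook definitions and not necessarily the exact formulas of the present application; the add-one smoothing, the substring matching, and the name `category_tendency` are assumptions made for this example, and both message lists are assumed to be non-empty.

```python
import math


def category_tendency(word, spam_msgs, normal_msgs):
    """Return (log odds ratio, pointwise mutual information with spam)."""
    spam_hits = sum(word in m for m in spam_msgs)
    normal_hits = sum(word in m for m in normal_msgs)
    # Add-one smoothing (an assumption) keeps both probabilities in (0, 1).
    p_pos = (spam_hits + 1) / (len(spam_msgs) + 2)      # P(word | pos)
    p_neg = (normal_hits + 1) / (len(normal_msgs) + 2)  # P(word | neg)
    log_odds = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    # P(word) as a mixture of the two class-conditional probabilities.
    p_spam = len(spam_msgs) / (len(spam_msgs) + len(normal_msgs))
    p_word = p_spam * p_pos + (1 - p_spam) * p_neg
    pmi = math.log(p_pos / p_word)
    return log_odds, pmi


spam = ["win free cash now", "free cash prize"]
normal = ["see you at lunch", "meeting at noon"]
print(category_tendency("free cash", spam, normal))
print(category_tendency("at noon", spam, normal))
```

  • Both indicators are positive for a word that leans toward spam messages and negative for a word that leans toward normal messages.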
  • The method provided in the embodiment of the present application further introduces a document rarity related index of the candidate word, so as to discover, for example, variant new words that occur with low frequency in spam messages.
  • The document rarity related index refers to an index that reflects the document rarity of the candidate word, including, but not limited to, the inverse document frequency (IDF).
  • IDF is the inverse of document frequency and is mainly used in TF-IDF (term frequency-inverse document frequency).
  • the document rarity related indicator includes a document rarity related indicator of the candidate word in the short message set.
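  • As a non-limiting illustration, the following Python sketch computes an IDF value for a candidate word over the short message set. It uses one common variant, log(N / (1 + df)); the variant, the substring matching, and the name `inverse_document_frequency` are assumptions made for this example.

```python
import math


def inverse_document_frequency(word, messages):
    """IDF over the message set: rarer words receive larger values."""
    containing = sum(word in m for m in messages)
    # The +1 in the denominator avoids division by zero for unseen words.
    return math.log(len(messages) / (1 + containing))


# Hypothetical toy message set.
messages = ["free cash now", "free gift today", "free cash prize", "c4sh free offer"]
print(inverse_document_frequency("free", messages))
print(inverse_document_frequency("c4sh", messages))
```

  • A word that appears in every message can receive a slightly negative value under this variant, which is harmless when the index is only used for ranking.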
  • Step S107: Determine the new spam word score of the candidate word according to the short message category tendency related index and the document rarity related index.
  • The two types of indicators can be combined to determine the new spam word score of the candidate word.
  • Step S107 may be implemented as follows: the weighted average of the short message category tendency related index and the document rarity related index is used as the new spam word score.
  • The weight of each indicator can be determined according to business needs.
  • If the document rarity related index is given a higher weight, the scores of low-frequency candidate words are raised, which helps to screen out more low-frequency candidate words but may include more normal words;
  • if the short message category tendency related index is given a higher weight, the scores of candidate words with a strong tendency toward spam messages are raised, which helps to screen out high-frequency candidate words that are representative of spam messages, but some low-frequency candidate words may be overlooked.
  • For example, a high-frequency word A is more easily selected, but a low-frequency word B may be a critical spam word. To make words like B easier to collect, a document rarity index is added to increase the weight of low-frequency words.
  • For instance, the IDF of the high-frequency word A is 2, and the IDF of the low-frequency word B is 6.
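  • As a non-limiting illustration of the weighted average in step S107, the following Python sketch scores the high-frequency word A (IDF 2) and the low-frequency word B (IDF 6). The category tendency values 4.0 and 3.0, the weights, and the name `spam_new_word_score` are hypothetical and chosen only for this example; in practice the two indicators would likely be normalized to comparable ranges first.

```python
def spam_new_word_score(tendency, rarity, w_tendency=0.5, w_rarity=0.5):
    """Weighted average of the category tendency and document rarity indexes."""
    return (w_tendency * tendency + w_rarity * rarity) / (w_tendency + w_rarity)


# IDF values from the description; tendency values are hypothetical.
word_a = {"tendency": 4.0, "rarity": 2.0}  # high-frequency word A
word_b = {"tendency": 3.0, "rarity": 6.0}  # low-frequency word B

# Tendency only: A outranks B.
print(spam_new_word_score(**word_a, w_tendency=1.0, w_rarity=0.0),
      spam_new_word_score(**word_b, w_tendency=1.0, w_rarity=0.0))
# Equal weights: the rarity index lifts B above A.
print(spam_new_word_score(**word_a), spam_new_word_score(**word_b))
```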
  • Step S109: Determine a new spam word from the candidate word set according to the new spam word score.
  • New spam words are determined from the candidate word set according to the new spam word score of each of the candidate words.
  • The candidate word whose new spam word score is greater than a score threshold is taken as the new spam word.
  • Step S109 may include the following sub-steps: 1) acquiring and displaying the candidate words whose new spam word score is greater than the score threshold; 2) receiving a determination instruction, input by the user, for a target new word; 3) taking the target new word as the new spam word.
  • The score threshold may be determined according to business requirements. The higher the score threshold, the less noise among the new spam words, but the lower their recall rate, so some new spam words may be lost; the lower the score threshold, the higher the recall rate, but the more noise among the new spam words and the greater the amount of manual review.
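  • As a non-limiting illustration of the threshold trade-off, the following Python sketch selects candidate words whose score exceeds a score threshold. The scores, the word names, and the function name `select_new_words` are hypothetical.

```python
def select_new_words(scores, threshold):
    """Keep candidates whose score exceeds the threshold, highest score first."""
    kept = [word for word, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda word: scores[word], reverse=True)


# Hypothetical new spam word scores for three candidate words.
scores = {"word_a": 3.0, "word_b": 4.5, "word_c": 1.2}
strict = select_new_words(scores, threshold=4.0)   # fewer, cleaner results
lenient = select_new_words(scores, threshold=1.0)  # more results to review
print(strict, lenient)
```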
  • step S109 may include the following sub-steps:
  • Step S201: Obtain the candidate words whose new spam word score is greater than the score threshold as candidate new spam words.
  • Step S203: Determine the word formation probability of the candidate new spam word.
  • The step of determining the word formation probability of the candidate new spam word may include the following sub-steps: 1) determining the internal cohesion of the candidate new spam word and the external (boundary) degree of freedom of the candidate new spam word; 2) obtaining the word formation probability of the candidate new spam word according to the internal cohesion and the external degree of freedom.
  • The step of obtaining the word formation probability of the candidate new spam word according to the internal cohesion and the external degree of freedom may be implemented as follows: the average of the internal cohesion and the external degree of freedom is used as the word formation probability.
  • Step S205: Determine the new spam word from the candidate new spam words according to the word formation probability.
  • The candidate new spam word whose word formation probability is greater than the word formation probability threshold may be taken as the new spam word; alternatively, the following sub-steps may be used: 1) acquiring and displaying the candidate new spam words whose word formation probability is greater than the word formation probability threshold; 2) receiving a determination instruction, input by the user, for a target new word; 3) taking the target new word as the new spam word.
  • the word formation probability threshold may be determined according to business requirements. In this embodiment, in order to avoid filtering out low-frequency candidate words filtered according to the spam short message new word score, the word formation probability threshold may be set larger.
  • The method provided in this embodiment obtains the candidate words whose spam new word score is greater than the score threshold as candidate spam new words, determines the word formation probability of each candidate spam new word, and determines the spam new words from the candidates according to that probability. This processing ensures that the determined spam new words are words with practical meaning, such as "post-80s", and avoids selecting strings that have no practical meaning, such as "Xinhe"; it can therefore effectively improve the validity of the spam new words.
  • step S109 may include the following sub-steps:
  • Step S301: Obtain the candidate words whose spam new word score is greater than the score threshold, and use them as candidate spam new words.
  • Step S303: Determine the semantic similarity between each of at least one preset spam vocabulary word and the candidate spam new word.
  • The at least one preset spam vocabulary word includes, but is not limited to, the words in a spam message dictionary.
  • Step S303 may include the following sub-steps: 1) determining the word vector of the candidate spam new word; 2) determining the semantic similarity between the preset spam vocabulary word and the candidate spam new word according to the word vector of the preset spam vocabulary word and the word vector of the candidate spam new word.
  • The word vector of the candidate spam new word can be determined as follows: word embeddings (word vectors) are computed, offline or online, over all messages in the message set with a word-based language model such as N-Gram or Skip-Gram, or with CBOW, GloVe, and the like.
  • the accuracy of the word vector can be effectively improved.
  • For example, SMS A is "Sale and Purchase Invoice, Add Me WeChat", where "Invoice" is a common word.
  • SMS B is the same message with "Invoice" replaced by a variant spelling, rendered here as "Fa Bun"; "Fa Bun" is a new word.
  • "Fa Bun" itself has a low frequency, but an embedding characterizes the contexts in which a word usually appears; "Fa Bun" and "Invoice" therefore have similar embeddings.
  • Once the word vectors are available, the semantic similarity between a preset spam vocabulary word and the candidate spam new word can be determined by, for example, calculating the cosine distance between the two word vectors.
  • Step S305: Determine the spam vocabulary similarity of the candidate spam new word according to the semantic similarities.
  • After the semantic similarity between the candidate spam new word and each preset spam vocabulary word has been obtained, the spam vocabulary similarity of the candidate spam new word can be determined from these semantic similarities.
  • The spam vocabulary similarity expresses how semantically similar a word is to the existing spam vocabulary.
  • In this embodiment, step S305 is implemented as follows: among the semantic similarities between each preset spam vocabulary word and the candidate spam new word, the largest one is used as the spam vocabulary similarity.
  • In other words, the spam vocabulary similarity of a candidate spam new word can be set to the semantic similarity between that word and the spam vocabulary word closest to it in meaning.
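  • A minimal sketch of this maximum-similarity rule, assuming the word vectors are already available as plain lists of floats. The function names and the toy two-dimensional vectors in the usage note are hypothetical; real embeddings would come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def spam_vocabulary_similarity(candidate_vector, spam_vectors):
    """Return (closest spam word, largest cosine similarity).

    `spam_vectors` maps each preset spam vocabulary word to its vector.
    Returns (None, -1.0) when the spam vocabulary is empty.
    """
    best_word, best_similarity = None, -1.0
    for word, vector in spam_vectors.items():
        similarity = cosine_similarity(candidate_vector, vector)
        if similarity > best_similarity:
            best_word, best_similarity = word, similarity
    return best_word, best_similarity
```

  • For instance, with `spam_vectors = {"invoice": [1.0, 0.0], "loan": [0.0, 1.0]}` and a candidate vector `[0.9, 0.1]`, the closest spam word is "invoice".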
  • Step S307: Determine the spam new words from the candidate spam new words based on the spam vocabulary similarity.
  • The candidate spam new words whose spam vocabulary similarity is greater than a first similarity threshold or less than a second similarity threshold may be used as spam new words.
  • The first similarity threshold and the second similarity threshold may be set according to service requirements.
  • Step S307 may also take the following sub-steps: 1) obtaining and displaying the candidate spam new words whose spam vocabulary similarity is greater than the first similarity threshold or less than the second similarity threshold; 2) receiving a confirmation instruction, input by a user, for a target new word; 3) using the target new word as a spam new word.
  • Alternatively, a sampling probability of the candidate spam new word may be obtained from its spam vocabulary similarity through a piecewise probability density function, and the spam new words are determined according to the sampling probability.
  • In this embodiment, a piecewise probability density function p(x) is used (the formula is not reproduced in this text), where x represents the spam vocabulary similarity and p(x) represents the sampling probability.
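  • Because the actual function is not available in this text, the following is only a hypothetical stand-in that shows the general shape of a piecewise mapping from similarity x to sampling probability p(x). It assumes that candidates very close to, or very far from, the existing spam vocabulary should be sampled with high probability; the breakpoints (`high`, `low`) and the probabilities (`p_tail`, `p_mid`) are invented for illustration.

```python
import random


def sampling_probability(x, high=0.8, low=0.2, p_tail=0.9, p_mid=0.1):
    """Hypothetical piecewise p(x): favor very similar or very dissimilar words."""
    if x >= high or x <= low:
        return p_tail
    return p_mid


def select_for_review(similarities, seed=0, **kwargs):
    """Sample candidate words for manual review according to p(x).

    `similarities` maps each candidate word to its spam vocabulary similarity.
    A seeded generator keeps the draw reproducible.
    """
    rng = random.Random(seed)
    return [
        word
        for word, x in similarities.items()
        if rng.random() < sampling_probability(x, **kwargs)
    ]
```

  • Setting `p_tail=1.0` and `p_mid=0.0` reduces the sampling to the two-threshold rule, which may be a convenient way to compare the two variants on the same data.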
  • The method provided in this embodiment determines the word vector of a candidate word by word embedding, determines the semantic similarity between each existing spam word and the candidate word from their word vectors, then determines the spam vocabulary similarity of the candidate word, obtains the sampling probability of the candidate word through the piecewise probability density function, and shows the candidate words with the highest probability to the reviewer. This processing semantically expands the candidate words, so that new word recognition no longer relies on frequency alone. Once the semantics of the candidate words are considered, rare new words are not ignored: candidate words that are semantically close to existing spam words are highlighted, and candidate words that are completely different in semantics from the existing spam vocabulary can also be surfaced, so that new kinds of spam words can be found.
  • The contexts of such a word may never have appeared before, but since the word has passed the screening by SMS category tendency and word formation probability, it may be a good complement to the existing spam vocabulary; the accuracy of spam new word recognition can therefore be effectively improved.
  • FIG. 4 is a schematic diagram of an embodiment of a method for identifying new words of spam messages in this application.
  • A candidate word set including a plurality of candidate words is first determined in step S103.
  • Indicators related to SMS category tendency, such as cross entropy, odds ratio, and mutual information, and indicators related to document rarity, such as IDF, are then determined for the candidate words; the spam new word score of each candidate word is determined in step S107, and the candidate words in the candidate word set are screened according to the score.
  • The first layer of filtering removes candidate words whose score is lower than or equal to the score threshold. The second layer of filtering is applied to the remaining candidate words: the word formation probability of each candidate word is obtained from its internal cohesion and external degree of freedom, and candidate words whose word formation probability is lower than or equal to the word formation probability threshold are removed. The third layer of filtering is applied to the candidate words that remain: the cosine similarity between each candidate word and the existing spam vocabulary is calculated from word embeddings, the sampling probability of the candidate word is obtained through the piecewise probability density function, and the spam new words shown for manual review are finally determined according to the sampling probability.
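  • The three layers can be chained as in the sketch below. It assumes the three per-word quantities have already been computed, and it simplifies the third layer to the two-threshold rule rather than sampling. The `Candidate` class, field names, and threshold parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str
    score: float            # spam new word score
    formation_prob: float   # word formation probability
    spam_similarity: float  # spam vocabulary similarity


def three_layer_screen(candidates, score_th, prob_th, sim_high, sim_low):
    """Apply the three filters in order and return the surviving words."""
    # Layer 1: drop candidates whose score is at or below the score threshold.
    layer1 = [c for c in candidates if c.score > score_th]
    # Layer 2: drop candidates whose word formation probability is too low.
    layer2 = [c for c in layer1 if c.formation_prob > prob_th]
    # Layer 3 (simplified): keep very similar or very dissimilar candidates.
    layer3 = [
        c for c in layer2
        if c.spam_similarity > sim_high or c.spam_similarity < sim_low
    ]
    return [c.word for c in layer3]
```

  • Running the cheap score filter first keeps the more expensive embedding comparison confined to a small set of survivors.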
  • Mining spam vocabulary from text messages with this unsupervised new word discovery technique based on three-layer screening can effectively improve the precision and recall of spam new word recognition, thereby addressing the low efficiency of fully manual review of spam vocabulary.
  • This solution can provide reviewers with new online spam vocabulary in time for risk management and control.
  • The mined spam vocabulary can be used to build a blacklist and improve the recognition rate of spam messages.
  • The short message spam new word recognition method provided in this embodiment obtains a short message set; determines a candidate word set corresponding to the multiple spam messages; determines the SMS category tendency related indicator of each candidate word according to the short message category information of the short messages, and obtains the document rarity related indicator of the candidate word; determines the spam new word score of the candidate word according to the two indicators; and determines the spam new words from the candidate word set according to the spam new word score.
  • With this processing, most new words recalled according to the SMS category tendency related indicator are representative of spam messages, and recalling many normal words is avoided; the precision can therefore be effectively improved.
  • At the same time, low-frequency spam new words can be recalled according to the document rarity related indicator; the recall can therefore be effectively improved.
  • In addition, new words that are both rare to a certain degree in spam messages and more inclined toward the spam category can be screened out; precision and recall can therefore both be improved, improving the accuracy of new word discovery.
  • this application also provides a device for identifying new words in short messages.
  • This device corresponds to an embodiment of the method described above.
  • FIG. 5 is a schematic diagram of an embodiment of a short message spam word recognition device of the present application. Since the device embodiment is basically similar to the method embodiment, it is described relatively simply. For the relevant part, refer to the description of the method embodiment.
  • the embodiments of the short message spam new word recognition device described below are merely exemplary.
  • the present application further provides a short message spam word recognition device, including:
  • the short message set obtaining unit 501 is configured to obtain a short message set, where the short message set includes multiple spam messages and multiple normal messages;
  • a candidate word set determining unit 503, configured to determine a candidate word set corresponding to the multiple spam messages;
  • An index determining unit 505 is configured to determine an index related to the short message category tendency of the candidate word according to the short message category information of the short message; and obtain a document rareness related index of the candidate word;
  • a score determining unit 507 configured to determine a new spam word score of the candidate word according to the short message category tendency related index and the document rarity related index;
  • a new word determining unit 509 is configured to determine a new word of spam message from the candidate word set according to the score of the new word of spam message.
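  • Optionally, the candidate word set determining unit 503 is specifically configured to: obtain, through a word segmentation algorithm, the words included in the spam messages as short message words; and
  • if a combined word formed by at least two adjacent short message words meets the candidate word rule, use the combined word as the candidate word.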
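  • Candidate generation from adjacent short message words can be sketched as follows. The sketch assumes the message has already been segmented into tokens by some word segmentation algorithm, and it takes the candidate word rule to be a limit on the number of characters. The parameter names `max_chars` and `max_parts` are hypothetical.

```python
def candidate_words(tokens, max_chars=6, max_parts=3):
    """Combine 2..max_parts adjacent tokens into candidate words.

    A combination is kept only if its character count is strictly less than
    `max_chars`, mirroring a "fewer than a preset number of characters" rule.
    """
    candidates = set()
    for n in range(2, max_parts + 1):
        for i in range(len(tokens) - n + 1):
            combined = "".join(tokens[i:i + n])
            if len(combined) < max_chars:
                candidates.add(combined)
    return candidates
```

  • Spam variants tend to be split into short fragments by a segmenter that does not know them, so joining adjacent fragments is one way to recover the variant as a single candidate.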
  • the score determination unit 507 is specifically configured to use the weighted average of the short message category tendency related index and the document rarity related index as the spam short message new word score.
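  • A minimal sketch of such a weighted average, using IDF as the document rarity indicator. The smoothing inside `idf` and the default equal weights are assumptions made for illustration; the text only says that a weighted average of the indicators is used.

```python
import math


def idf(word, documents):
    """Smoothed inverse document frequency over a list of token lists."""
    document_frequency = sum(1 for doc in documents if word in doc)
    return math.log((len(documents) + 1) / (document_frequency + 1))


def spam_new_word_score(tendency, rarity, w_tendency=0.5, w_rarity=0.5):
    """Weighted average of a category tendency indicator and a rarity indicator."""
    total_weight = w_tendency + w_rarity
    return (w_tendency * tendency + w_rarity * rarity) / total_weight
```

  • In practice the two indicators would probably be normalized to a common scale before averaging, since IDF and measures such as odds ratio have different ranges.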
  • the new word determining unit 509 includes:
  • a candidate word selection subunit, configured to obtain the candidate words whose spam new word score is greater than the score threshold, as candidate spam new words;
  • a word formation probability determining subunit, configured to determine the word formation probability of the candidate spam new words;
  • a first new word determination subunit, configured to determine the spam new words from the candidate spam new words according to the word formation probability.
  • the new word determining unit 509 includes:
  • a candidate word selection subunit, configured to obtain the candidate words whose spam new word score is greater than the score threshold, as candidate spam new words;
  • a first similarity determination subunit, configured to determine the semantic similarity between each of at least one preset spam vocabulary word and the candidate spam new word;
  • a second similarity determination subunit, configured to determine the spam vocabulary similarity of the candidate spam new word based on the semantic similarity;
  • a second new word determination subunit, configured to determine the spam new words from the candidate spam new words based on the spam vocabulary similarity.
  • FIG. 6 is a schematic diagram of an embodiment of an electronic device of the present application. Since the device embodiment is basically similar to the method embodiment, it is described relatively simply. For the relevant part, refer to the description of the method embodiment. The device embodiments described below are only schematic.
  • An electronic device in this embodiment includes a processor 601 and a memory 602. The memory is configured to store a program implementing the short message spam new word recognition method; after the device is powered on and runs the program through the processor, the following steps are performed: obtaining a short message set, where the short message set includes multiple spam messages and multiple normal messages; determining a candidate word set corresponding to the multiple spam messages; determining the SMS category tendency related indicator of each candidate word according to the short message category information of the short messages, and obtaining the document rarity related indicator of the candidate word; determining the spam new word score of the candidate word according to the SMS category tendency related indicator and the document rarity related indicator; and determining the spam new words from the candidate word set according to the spam new word score.
  • this application also provides a method for identifying new words in short messages. This method has the same technical idea as the above method.
  • FIG. 7 is a flowchart of an embodiment of a short message spam word recognition method of the present application. Since this method embodiment corresponds to the above method embodiment, it is described relatively simply, and for related parts, reference may be made to part of the description of the above method embodiment.
  • the present application further provides a method for identifying new words in SMS spam, including:
  • Step S701 Obtain a short message collection.
  • Step S703 Determine candidate word sets corresponding to multiple spam messages.
  • Step S705 Determine the short message category propensity score of the candidate word according to the short message category information of the short message.
  • the short message category propensity score is calculated in the following manner: a weighted average of the short message category propensity-related indicators is used as the short message category propensity score.
  • Step S707: Determine the spam new words from the candidate word set according to the short message category propensity score.
  • The candidate words whose short message category propensity score is greater than a score threshold are taken as spam new words.
  • the score threshold can be set according to business requirements. In the case where the SMS category propensity score is a spam SMS propensity score, the smaller the score threshold is set, the more candidate words are passed, but all the words that are prone to spam may be selected, among which there will be many non-spam SMS new words; the larger the scoring threshold is set, the fewer candidate words are passed, which is helpful for filtering high-frequency candidate words that are representative of spam text, but some low-frequency candidate words may be ignored.
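  • The thresholding step can be illustrated with a single tendency indicator. The sketch below uses a smoothed log odds ratio as the propensity score, whereas the text describes a weighted average of several such indicators; the additive smoothing constant and the function names are assumptions.

```python
import math


def log_odds_ratio(word, spam_messages, normal_messages, smoothing=0.5):
    """Smoothed log odds ratio of a word appearing in spam vs. normal messages.

    Messages are token lists. Positive values lean toward the spam category.
    """
    spam_hits = sum(1 for m in spam_messages if word in m)
    normal_hits = sum(1 for m in normal_messages if word in m)
    p_spam = (spam_hits + smoothing) / (len(spam_messages) + 2 * smoothing)
    p_normal = (normal_hits + smoothing) / (len(normal_messages) + 2 * smoothing)
    return math.log(p_spam * (1 - p_normal) / ((1 - p_spam) * p_normal))


def select_by_propensity(candidates, spam_messages, normal_messages, threshold):
    """Keep the candidates whose propensity score exceeds the threshold."""
    return [
        word for word in candidates
        if log_odds_ratio(word, spam_messages, normal_messages) > threshold
    ]
```

  • Lowering `threshold` lets words through that also appear in normal messages; raising it keeps only words that are strongly tied to the spam category, matching the trade-off described above.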
  • In specific implementation, the following subsequent processing may also be performed on the candidate words whose short message category propensity score is greater than the score threshold: 1) performing a second-level screening on the candidate words according to their word formation probability; 2) obtaining, through a piecewise probability density function, a sampling probability for each candidate word from its spam vocabulary similarity, and determining the spam new words according to the sampling probability.
  • With this processing, the determined spam new words are words with practical meaning, and both the candidate words that are semantically close to the existing spam vocabulary and the candidate words that are completely different in semantics from it are highlighted.
  • The short message spam new word recognition method provided in this embodiment obtains a short message set; determines a candidate word set corresponding to the multiple spam messages; determines the short message category propensity score of each candidate word according to the short message category information of the short messages, where the short message category is either the spam message category or the normal message category; and determines the spam new words from the candidate word set according to the short message category propensity score.
  • With this processing, most of the new words recalled according to the SMS category tendency related indicators are representative of spam messages, and recalling many normal words is avoided; the precision can therefore be effectively improved.
  • this application also provides a device for identifying new words in short messages.
  • This device corresponds to an embodiment of the method described above.
  • FIG. 8 is a schematic diagram of an embodiment of a short message spam new word recognition device of the present application. Since the device embodiment is basically similar to the method embodiment, it is described relatively simply. For the relevant part, refer to the description of the method embodiment.
  • the embodiments of the short message spam new word recognition device described below are merely exemplary.
  • the present application further provides a short message spam word recognition device, including:
  • the short message set obtaining unit 801 is configured to obtain a short message set, where the short message set includes multiple spam messages and multiple normal messages;
  • a candidate word set determining unit, configured to determine a candidate word set corresponding to the multiple spam messages;
  • a score determination unit 805, configured to determine a short message category propensity score of the candidate word according to the short message category information of the short message;
  • a new word determining unit 807, configured to determine the spam new words from the candidate word set according to the short message category propensity score.
  • Optionally, the new word determining unit 807 includes:
  • a candidate word selection subunit, configured to obtain the candidate words whose short message category propensity score is greater than a score threshold, as candidate spam new words;
  • a similarity determination subunit, configured to determine the spam vocabulary similarity of the candidate spam new words;
  • a new word determination subunit, configured to determine the spam new words from the candidate spam new words based on the spam vocabulary similarity.
  • FIG. 9 is a schematic diagram of an embodiment of an electronic device of the present application. Since the device embodiment is basically similar to the method embodiment, it is described relatively simply. For relevant parts, refer to the description of the method embodiment. The device embodiments described below are only schematic.
  • An electronic device in this embodiment includes a processor 901 and a memory 902. The memory is configured to store a program implementing the short message spam new word recognition method; after the device is powered on and runs the program through the processor, the following steps are performed: obtaining a short message set, where the short message set includes multiple spam messages and multiple normal messages; determining a candidate word set corresponding to the multiple spam messages; determining the short message category propensity score of each candidate word according to the short message category information of the short messages; and determining the spam new words from the candidate word set according to the short message category propensity score.
  • this application also provides a method for identifying new words in short messages. This method has the same technical idea as the above method.
  • FIG. 10 is a flowchart of an embodiment of a short message spam word recognition method of the present application. Since this method embodiment corresponds to the above method embodiment, it is described relatively simply, and for related parts, reference may be made to part of the description of the above method embodiment.
  • the present application further provides a method for identifying new words in SMS spam, including:
  • Step S1001 Acquire a short message set.
  • Step S1003 Determine candidate word sets corresponding to multiple spam messages.
  • Step S1005: Determine the semantic similarity between each of at least one preset spam vocabulary word and the candidate word.
  • Step S1007: Determine the spam vocabulary similarity of the candidate word according to the semantic similarity.
  • Step S1009: Determine the spam new words from the candidate word set according to the spam vocabulary similarity.
  • Step S1009 may be implemented as follows: a sampling probability of the candidate word is obtained from its spam vocabulary similarity through a piecewise probability density function, and the spam new words are determined according to the sampling probability.
  • The short message spam new word recognition method provided in this embodiment obtains a short message set; determines a candidate word set corresponding to the multiple spam messages; determines the semantic similarity between each of at least one preset spam vocabulary word and the candidate word; determines the spam vocabulary similarity of the candidate word according to the semantic similarity; and determines the spam new words from the candidate word set according to the spam vocabulary similarity.
  • With this processing, the candidate words are semantically expanded through word embedding, and the spam new words are determined according to the semantic similarity between the candidate words and the spam vocabulary, so that rare new words are not ignored; the recall of spam new words can therefore be effectively improved.
  • this application also provides a device for identifying new words in short messages.
  • This device corresponds to an embodiment of the method described above.
  • FIG. 11 is a schematic diagram of an embodiment of a short message spam new word recognition device of the present application. Since the device embodiment is basically similar to the method embodiment, it is described relatively simply. For the relevant part, refer to the description of the method embodiment.
  • the embodiments of the short message spam new word recognition device described below are merely exemplary.
  • the present application further provides a short message spam word recognition device, including:
  • the short message set acquiring unit 1101 is configured to obtain a short message set, where the short message set includes multiple spam messages and multiple normal messages;
  • a candidate word set determining unit, configured to determine a candidate word set corresponding to the multiple spam messages;
  • a first semantic similarity determining unit 1105 configured to determine a semantic similarity between at least one preset spam vocabulary word and the candidate word respectively;
  • a second semantic similarity determining unit 1107, configured to determine the spam vocabulary similarity of the candidate word according to the semantic similarity;
  • a new word determining unit 1109, configured to determine the spam new words from the candidate word set according to the spam vocabulary similarity.
  • FIG. 12 is a schematic diagram of an embodiment of an electronic device of the present application. Since the device embodiment is basically similar to the method embodiment, it is described relatively simply. For the relevant part, refer to the description of the method embodiment. The device embodiments described below are only schematic.
  • An electronic device in this embodiment includes a processor 1201 and a memory 1202. The memory is configured to store a program implementing the short message spam new word recognition method; after the device is powered on and runs the program through the processor, the following steps are performed: obtaining a short message set, where the short message set includes multiple spam messages and multiple normal messages; determining a candidate word set corresponding to the multiple spam messages; determining the semantic similarity between each of at least one preset spam vocabulary word and the candidate word; determining the spam vocabulary similarity of the candidate word according to the semantic similarity; and determining the spam new words from the candidate word set according to the spam vocabulary similarity.
  • a computing device includes one or more processors (CPUs), input / output interfaces, network interfaces, and memory.
  • Memory may include volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
  • this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

Abstract

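This application discloses a method and a device for identifying spam new words in short messages, and an electronic device. The method includes: obtaining a short message set; determining a candidate word set corresponding to multiple spam messages; determining, according to short message category information, SMS category tendency related indicators of the candidate words; obtaining document rarity related indicators of the candidate words; determining a spam new word score for each candidate word according to the SMS category tendency related indicators and the document rarity related indicators; and determining spam new words from the candidate word set according to the spam new word score. With this processing, most new words recalled according to the SMS category tendency related indicators are representative of spam messages, and recalling many normal words is avoided; the precision can therefore be effectively improved. At the same time, low-frequency spam new words can be recalled according to the document rarity related indicators; the recall can therefore be effectively improved.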

Description

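Method and device for identifying spam new words in short messages, and electronic device
This application claims priority to Chinese Patent Application No. 201811076259.9, filed on September 14, 2018 and entitled "Method and device for identifying spam new words in short messages, and electronic device", the entire contents of which are incorporated herein by reference.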
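Technical Field
This application relates to the field of text mining technology, and in particular to a method and a device for identifying spam new words in short messages, and an electronic device.
Background
A typical short message sending scenario is that merchants send short messages to consumers through a network platform so that information such as product promotions reaches consumers in time, which ensures the effective implementation of the merchants' sales plans and improves user experience. Along with these benefits, however, a large number of spam messages have also appeared. The proliferation of spam messages has seriously affected consumers' daily lives, the image of network platforms, and even social stability.
With the continuous development of Internet technology, more and more network platforms use short message content security systems to analyze the content of business-to-customer (B2C) short messages and to perform intelligent message interception and channel optimization. Spam new word discovery is an important function of such a system: effectively mining spam new words can improve the accuracy of spam message recognition models, allows a timely response to spam message variants appearing online, and provides message reviewers with the spam variant words that newly appear online each day. At present, spam new word recognition methods fall into two categories: methods based on supervised new word discovery and methods based on unsupervised new word discovery. Supervised new word discovery relies on sequence labeling results; it is costly, and corpora are hard to obtain. Unsupervised new word discovery uses certain indicators to calculate the word formation probability of candidate phrases that are not included in the spam message dictionary, and judges from that probability whether a candidate phrase is a spam new word; it is low in cost and high in efficiency. Because of these advantages, unsupervised new word discovery has become the most commonly used approach to spam new word recognition.
However, in the course of implementing the present invention, the inventors found that the spam new word recognition scheme based on unsupervised new word discovery has at least the following problems. On the one hand, because recognition relies only on simple frequency features (such as the frequency of occurrence of a word) and does not consider the distribution across categories, many normal words are easily recalled, which makes subsequent review costly. On the other hand, because the word formation probability is calculated from simple frequency features, and mutated new words in spam messages occur with low frequency, the word formation probability of such words is low; such spam new words therefore cannot be recognized, and low-frequency spam new words cannot be recalled.
In summary, the prior art suffers from both low recall and low precision for spam new words.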
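Summary of the Invention
This application provides a method for identifying spam new words in short messages, to solve the problem in the prior art that both the recall and the precision for spam new words are low. This application additionally provides a device for identifying spam new words in short messages, and an electronic device.
This application provides a method for identifying spam new words in short messages, including:
obtaining a short message set, where the short message set includes multiple spam messages and multiple normal messages;
determining a candidate word set corresponding to the multiple spam messages;
determining SMS category tendency related indicators of the candidate words according to the short message category information of the short messages, and obtaining document rarity related indicators of the candidate words;
determining a spam new word score of each candidate word according to the SMS category tendency related indicators and the document rarity related indicators; and
determining spam new words from the candidate word set according to the spam new word score.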
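Optionally, the determining a candidate word set corresponding to the multiple spam messages includes:
obtaining, through a word segmentation algorithm, the words included in the spam messages as short message words; and
if a combined word formed by at least two adjacent short message words meets a candidate word rule, using the combined word as the candidate word.
Optionally, the candidate word rule includes that the number of characters of the candidate word is less than a preset number of characters.
Optionally, the SMS category tendency related indicators include at least one of the following: cross entropy, odds ratio, and mutual information;
the document rarity related indicators include inverse document frequency (IDF).
Optionally, the determining a spam new word score of the candidate word according to the SMS category tendency related indicators and the document rarity related indicators includes:
using a weighted average of the SMS category tendency related indicators and the document rarity related indicators as the spam new word score.
Optionally, the determining spam new words from the candidate word set according to the spam new word score includes:
using the candidate words whose spam new word score is greater than a score threshold as spam new words.
Optionally, the determining spam new words from the candidate word set according to the spam new word score includes:
obtaining and displaying the new words whose spam new word score is greater than the score threshold;
receiving a confirmation instruction for a target new word; and
using the target new word as the spam new word.
Optionally, the determining spam new words from the candidate word set according to the spam new word score includes:
obtaining the new words whose spam new word score is greater than the score threshold as candidate spam new words;
determining the word formation probability of the candidate spam new words; and
determining the spam new words from the candidate spam new words according to the word formation probability.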
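Optionally, the determining the spam new words from the candidate spam new words according to the word formation probability includes:
using the candidate spam new words whose word formation probability is greater than a word formation probability threshold as the spam new words.
Optionally, the determining the spam new words from the candidate spam new words according to the word formation probability includes:
obtaining and displaying the candidate spam new words whose word formation probability is greater than the word formation probability threshold;
receiving a confirmation instruction, input by a user, for a target new word; and
using the target new word as the spam new word.
Optionally, the determining the word formation probability of the candidate spam new words includes:
determining the internal cohesion of the candidate spam new word, and determining the external (boundary) degree of freedom of the candidate spam new word; and
obtaining the word formation probability of the candidate spam new word according to the internal cohesion and the external degree of freedom.
Optionally, the obtaining the word formation probability of the candidate spam new word according to the internal cohesion and the external degree of freedom includes:
using the average of the internal cohesion and the external degree of freedom as the word formation probability.
Optionally, the determining spam new words from the candidate word set according to the spam new word score includes:
obtaining the candidate words whose spam new word score is greater than the score threshold as candidate spam new words;
determining the semantic similarity between each of at least one preset spam vocabulary word and the candidate spam new word;
determining the spam vocabulary similarity of the candidate spam new word according to the semantic similarity; and
determining the spam new words from the candidate spam new words according to the spam vocabulary similarity.
Optionally, the determining the spam new words from the candidate spam new words according to the spam vocabulary similarity includes:
using the candidate spam new words whose spam vocabulary similarity is greater than a first similarity threshold or less than a second similarity threshold as spam new words.
Optionally, the determining the spam new words from the candidate spam new words according to the spam vocabulary similarity includes:
obtaining and displaying the candidate spam new words whose spam vocabulary similarity is greater than the first similarity threshold or less than the second similarity threshold;
receiving a confirmation instruction for a target candidate new word; and
using the target candidate new word as the spam new word.
Optionally, the determining the spam vocabulary similarity of the candidate spam new word according to the semantic similarity includes:
using the maximum of the semantic similarities as the spam vocabulary similarity.
Optionally, the determining the semantic similarity between each of at least one preset spam vocabulary word and the candidate spam new word includes:
determining the word vector of the candidate new word; and
determining the semantic similarity between the preset spam vocabulary word and the candidate new word according to the word vector of the preset spam vocabulary word and the word vector of the candidate new word.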
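This application further provides a method for identifying spam new words in short messages, including:
obtaining a short message set, where the short message set includes multiple spam messages and multiple normal messages;
determining a candidate word set corresponding to the multiple spam messages;
determining a short message category propensity score of each candidate word according to the short message category information of the short messages; and
determining spam new words from the candidate word set according to the short message category propensity score.
Optionally, the determining spam new words from the candidate word set according to the short message category propensity score includes:
obtaining the candidate words whose short message category propensity score is greater than a score threshold as candidate spam new words;
determining the spam vocabulary similarity of the candidate spam new words; and
determining the spam new words from the candidate spam new words according to the spam vocabulary similarity.
This application further provides a method for identifying spam new words in short messages, including:
obtaining a short message set, where the short message set includes multiple spam messages and multiple normal messages;
determining a candidate word set corresponding to the multiple spam messages;
determining the semantic similarity between each of at least one preset spam vocabulary word and the candidate word;
determining the spam vocabulary similarity of the candidate word according to the semantic similarity; and
determining spam new words from the candidate word set according to the spam vocabulary similarity.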
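The present application further provides an apparatus for identifying new spam words, including:
a message set acquisition unit, configured to acquire a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages;
a candidate word set determination unit, configured to determine a candidate word set corresponding to the plurality of spam messages;
an indicator determination unit, configured to determine, according to message-category information of the short messages, a message-category propensity indicator of each candidate word, and to acquire a document rarity indicator of the candidate word;
a score determination unit, configured to determine a new-spam-word score of the candidate word according to the message-category propensity indicator and the document rarity indicator;
a new word determination unit, configured to determine new spam words from the candidate word set according to the new-spam-word score.
Optionally, the candidate word set determination unit is specifically configured to:
obtain, by a word segmentation algorithm, the words contained in the spam messages as message words;
if a combined word formed by at least two adjacent message words satisfies a candidate word rule, take the combined word as a candidate word.
Optionally, the score determination unit is specifically configured to take a weighted average of the message-category propensity indicator and the document rarity indicator as the new-spam-word score.
Optionally, the new word determination unit includes:
a candidate word selection subunit, configured to acquire the new words whose new-spam-word score is greater than a score threshold as candidate new spam words;
a word-formation probability determination subunit, configured to determine a word-formation probability of each candidate new spam word;
a first new word determination subunit, configured to determine the new spam words from the candidate new spam words according to the word-formation probability.
Optionally, the new word determination unit includes:
a candidate word selection subunit, configured to acquire the candidate words whose new-spam-word score is greater than a score threshold as candidate new spam words;
a first similarity determination subunit, configured to determine a semantic similarity between each of at least one preset spam word and the candidate new spam word;
a second similarity determination subunit, configured to determine a spam-vocabulary similarity of the candidate new spam word according to the semantic similarities;
a second new word determination subunit, configured to determine the new spam words from the candidate new spam words according to the spam-vocabulary similarity.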
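The present application further provides an electronic device, including:
a processor; and
a memory, configured to store a program implementing the method for identifying new spam words in short messages, wherein after the device is powered on and the program is run by the processor, the following steps are performed: acquiring a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages; determining a candidate word set corresponding to the plurality of spam messages; determining, according to message-category information of the short messages, a message-category propensity indicator of each candidate word, and acquiring a document rarity indicator of the candidate word; determining a new-spam-word score of the candidate word according to the message-category propensity indicator and the document rarity indicator; and determining new spam words from the candidate word set according to the new-spam-word score.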
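The present application further provides an apparatus for identifying new spam words, including:
a message set acquisition unit, configured to acquire a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages;
a candidate word set determination unit, configured to determine a candidate word set corresponding to the plurality of spam messages;
a score determination unit, configured to determine a message-category propensity score of each candidate word according to message-category information of the short messages;
a new word determination unit, configured to determine new spam words from the candidate word set according to the message-category propensity score.
Optionally, the new word determination unit includes:
a candidate word selection subunit, configured to acquire the candidate words whose message-category propensity score is greater than a score threshold as candidate new spam words;
a similarity determination subunit, configured to determine a spam-vocabulary similarity of each candidate new spam word;
a new word determination subunit, configured to determine the new spam words from the candidate new spam words according to the spam-vocabulary similarity.
The present application further provides an electronic device, including:
a processor; and
a memory, configured to store a program implementing the method for identifying new spam words in short messages, wherein after the device is powered on and the program is run by the processor, the following steps are performed: acquiring a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages; determining a candidate word set corresponding to the plurality of spam messages; determining a message-category propensity score of each candidate word according to message-category information of the short messages; and determining new spam words from the candidate word set according to the message-category propensity score.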
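The present application further provides an apparatus for identifying new spam words, including:
a message set acquisition unit, configured to acquire a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages;
a candidate word set determination unit, configured to determine a candidate word set corresponding to the plurality of spam messages;
a first semantic similarity determination unit, configured to determine a semantic similarity between each of at least one preset spam word and each candidate word;
a second semantic similarity determination unit, configured to determine a spam-vocabulary similarity of the candidate word according to the semantic similarities;
a new word determination unit, configured to determine new spam words from the candidate word set according to the spam-vocabulary similarity.
The present application further provides an electronic device, including:
a processor; and
a memory, configured to store a program implementing the method for identifying new spam words in short messages, wherein after the device is powered on and the program is run by the processor, the following steps are performed: acquiring a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages; determining a candidate word set corresponding to the plurality of spam messages; determining a semantic similarity between each of at least one preset spam word and each candidate word; determining a spam-vocabulary similarity of the candidate word according to the semantic similarities; and determining new spam words from the candidate word set according to the spam-vocabulary similarity.
The present application further provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods described above.
The present application further provides a computer program product including instructions that, when run on a computer, cause the computer to perform the methods described above.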
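Compared with the prior art, the present application has the following advantages:
In one method for identifying new spam words provided by the embodiments of the present application, a short-message set is acquired; a candidate word set corresponding to the plurality of spam messages is determined; a message-category propensity indicator of each candidate word is determined according to the message-category information of the short messages, and a document rarity indicator of the candidate word is acquired; a new-spam-word score of the candidate word is determined according to the two indicators; and new spam words are determined from the candidate word set according to the score. With this processing, most of the new words recalled according to the message-category propensity indicator are representative of spam messages, and recalling many normal words is avoided, so precision is effectively improved. At the same time, low-frequency new spam words can be recalled according to the document rarity indicator, so recall is effectively improved. In summary, this processing selects new words that are both relatively rare in spam messages and inclined toward the spam category, effectively improving precision and recall and hence the accuracy of new-word discovery.
In another method for identifying new spam words provided by the embodiments of the present application, a short-message set is acquired; a candidate word set corresponding to the plurality of spam messages is determined; a message-category propensity score of each candidate word is determined according to the message-category information of the short messages, the message category being the spam class or the normal class; and new spam words are determined from the candidate word set according to the message-category propensity score. With this processing, most of the new words recalled according to the message-category propensity indicator are representative of spam messages, and recalling many normal words is avoided, so precision is effectively improved.
In yet another method for identifying new spam words provided by the embodiments of the present application, a short-message set is acquired; a candidate word set corresponding to the plurality of spam messages is determined; a semantic similarity between each of at least one preset spam word and each candidate word is determined; a spam-vocabulary similarity of the candidate word is determined according to the semantic similarities; and new spam words are determined from the candidate word set according to the spam-vocabulary similarity. With this processing, the candidate words are semantically extended through word embedding, and the new spam words are determined according to the semantic similarity between candidate words and spam words, so rare new words are not overlooked and the recall of new spam words is effectively improved.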
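Brief Description of the Drawings
FIG. 1 is a flowchart of an embodiment of a method for identifying new spam words in short messages provided by the present application;
FIG. 2 is a detailed flowchart of an embodiment of a method for identifying new spam words in short messages provided by the present application;
FIG. 3 is a detailed flowchart of an embodiment of a method for identifying new spam words in short messages provided by the present application;
FIG. 4 is a detailed flowchart of an embodiment of a method for identifying new spam words in short messages provided by the present application;
FIG. 5 is a schematic diagram of an embodiment of an apparatus for identifying new spam words in short messages provided by the present application;
FIG. 6 is a schematic diagram of an embodiment of an electronic device provided by the present application;
FIG. 7 is a flowchart of an embodiment of a method for identifying new spam words in short messages provided by the present application;
FIG. 8 is a schematic diagram of an embodiment of an apparatus for identifying new spam words in short messages provided by the present application;
FIG. 9 is a schematic diagram of an embodiment of an electronic device provided by the present application;
FIG. 10 is a flowchart of an embodiment of a method for identifying new spam words in short messages provided by the present application;
FIG. 11 is a schematic diagram of an embodiment of an apparatus for identifying new spam words in short messages provided by the present application;
FIG. 12 is a schematic diagram of an embodiment of an electronic device provided by the present application.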
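Detailed Description
Many specific details are set forth in the following description to provide a thorough understanding of the present application. The present application can, however, be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from its substance; the present application is therefore not limited by the specific implementations disclosed below.
The present application provides methods and apparatuses for identifying new spam words in short messages, and electronic devices. Each solution is described in detail in the embodiments below.
First Embodiment
Refer to FIG. 1, a flowchart of an embodiment of a method for identifying new spam words provided by the present application. The method is executed by an apparatus for identifying new spam words. The method includes:
Step S101: acquire a short-message set.
A short message includes, but is not limited to, a mobile-phone SMS; it may also be a short message in another form, such as an instant message.
The short-message set includes a plurality of spam messages and a plurality of normal messages. The message category of each spam message is labeled as spam, and that of each normal message is labeled as normal.
Step S103: determine a candidate word set corresponding to the plurality of spam messages.
A candidate word, also called a candidate new word, is a word that appears in the plurality of spam messages and is not in the spam dictionary. Because the individual words produced by segmenting the spam messages are not new words, the segmentation results themselves are not candidate words. The spam dictionary contains a plurality of already-confirmed spam words.
In one example, step S103 may include the following sub-steps: 1) obtain, by a word segmentation algorithm, the words contained in the spam messages as message words; 2) if a combined word formed by at least two adjacent message words satisfies a candidate word rule, take the combined word as a candidate word.
1) Obtain, by a word segmentation algorithm, the words contained in the spam messages as message words.
In a specific implementation, an existing word segmentation algorithm may be used to segment the spam messages. Existing algorithms fall into three broad classes: string-matching-based, understanding-based, and statistics-based. Depending on whether segmentation is combined with part-of-speech tagging, they can also be divided into pure segmentation methods and integrated segmentation-and-tagging methods. Word segmentation is mature prior art and is not described further here; any existing algorithm may be chosen according to actual needs.
For example, if a spam message contains “新农合保险” (New Rural Cooperative insurance), the segmentation result includes the following message words: “新”, “农”, “合”, and “保险”.
2) If a combined word formed by at least two adjacent message words satisfies a candidate word rule, take the combined word as a candidate word.
The candidate word rule includes, but is not limited to: the number of characters in the candidate word is less than a preset number. The preset number can be set according to business needs, for example 4.
For example, for the spam content “新农合保险” above, with the segmentation result “新”, “农”, “合”, and “保险”, the combined words formed by at least two adjacent message words include “新农”, “农合”, and “合保险”.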
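The combination step can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the function name `candidate_words` and the parameter `max_chars` are hypothetical, and the tokens are assumed to come from whatever segmenter is in use. Because any run of at least two adjacent words qualifies, the sketch also yields “新农合” in addition to the three combinations listed above.

```python
def candidate_words(tokens, max_chars=4):
    """Join runs of at least two adjacent tokens and keep the
    combinations whose character count is below max_chars."""
    candidates = []
    for start in range(len(tokens)):
        combined = tokens[start]
        for end in range(start + 1, len(tokens)):
            combined += tokens[end]
            if len(combined) >= max_chars:
                break  # extending the run can only make it longer
            if combined not in candidates:
                candidates.append(combined)
    return candidates


# Segmentation result of "新农合保险" from the example above.
tokens = ["新", "农", "合", "保险"]
print(candidate_words(tokens))
```

In a full pipeline, combinations already present in the spam dictionary would be dropped afterwards, since only out-of-dictionary words count as candidates.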
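Note that step S103 may also be implemented in other ways, as long as the candidate word set corresponding to the plurality of spam messages can be determined.
Step S105: determine, according to the message-category information of the short messages, a message-category propensity indicator of each candidate word; and acquire a document rarity indicator of the candidate word.
The method provided by this embodiment introduces message-category information and obtains the message-category propensity indicator of a candidate word from how the word is distributed across spam and normal messages. The message category may be the spam class or the normal class.
The message-category propensity may be a propensity toward normal messages or toward spam messages. For example, if candidate word A appears 10 times in normal messages and 2 times in spam messages, its normal-message propensity score may be 10/2 = 5 and its spam propensity score may be 2/10 = 0.2.
A message-category propensity indicator is an indicator that reflects the message-category propensity of a candidate word; that is, the propensity of the candidate word can be determined from the indicator.
The message-category propensity indicator includes, but is not limited to, at least one of: cross entropy, odds ratio, mutual information.
Cross entropy is an important concept in Shannon's information theory. In the method provided by the present application it is mainly used to measure the difference between two probability distributions (the candidate word's spam probability and its normal-message probability).
The odds ratio applies only to binary classification; its distinguishing feature is that it considers only a text feature's score for the target class. Here pos denotes the target class (e.g., spam) and neg denotes the non-target class (e.g., normal).
Mutual information is a useful measure in information theory. It can be viewed as the amount of information one random variable contains about another, or equivalently as the reduction in uncertainty of one random variable given knowledge of another.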
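The text names these indicators without giving formulas. The sketch below uses common textbook forms of the log odds ratio and of pointwise mutual information between a word and the spam class, computed from document counts; the formulas, the smoothing constant `eps`, and all function and parameter names are assumptions made for illustration and are not taken from the application.

```python
import math


def odds_ratio(df_pos, n_pos, df_neg, n_neg, eps=0.5):
    """Log odds ratio of a word for the target (spam) class.

    df_pos / df_neg: number of spam / normal messages containing the word.
    n_pos / n_neg:   total number of spam / normal messages.
    eps is an assumed smoothing constant that avoids division by zero.
    """
    p_pos = (df_pos + eps) / (n_pos + 2 * eps)
    p_neg = (df_neg + eps) / (n_neg + 2 * eps)
    return math.log(p_pos * (1 - p_neg) / (p_neg * (1 - p_pos)))


def pointwise_mi(df_pos, n_pos, df_neg, n_neg):
    """Pointwise mutual information between the word and the spam class."""
    n = n_pos + n_neg
    p_joint = df_pos / n             # P(word, spam)
    p_word = (df_pos + df_neg) / n   # P(word)
    p_spam = n_pos / n               # P(spam)
    if p_joint == 0:
        return float("-inf")
    return math.log(p_joint / (p_word * p_spam))


# A word seen in 10 of 100 spam messages and 1 of 100 normal messages
# leans toward spam under both indicators (both values are positive).
print(odds_ratio(10, 100, 1, 100), pointwise_mi(10, 100, 1, 100))
```

Both functions return positive values for words that lean toward spam and values at or below zero for words that do not, which makes them usable as inputs to a combined score.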
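The method provided by this embodiment also introduces a document rarity indicator for the candidate word, so that low-frequency words, such as mutated new words in spam messages, can be mined.
A document rarity indicator reflects how rare a candidate word is across documents, and includes but is not limited to inverse document frequency (IDF). IDF is based on the reciprocal of the document frequency (it is commonly computed as the logarithm of the total number of documents divided by the number of documents containing the word) and is mainly used in TF-IDF (term frequency–inverse document frequency).
In this embodiment, the document rarity indicator is the document rarity indicator of the candidate word within the short-message set.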
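As a small illustration of the rarity indicator, the sketch below computes IDF over a message set using the common logarithmic form log(N / df). The exact formula is not specified in the text, so this form, the function name `idf`, and the sample messages are assumptions.

```python
import math


def idf(word, messages):
    """Inverse document frequency of a word within a message set,
    using the common form log(N / df); df is floored at 1."""
    df = sum(1 for message in messages if word in message)
    return math.log(len(messages) / max(df, 1))


# Hypothetical message set: the variant spelling is rarer than the
# standard word, so it receives the higher IDF.
messages = ["买卖发票", "买卖发瞟", "代开发票", "发票优惠"]
print(idf("发票", messages), idf("发瞟", messages))
```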
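Step S107: determine a new-spam-word score of the candidate word according to the message-category propensity indicator and the document rarity indicator.
After the message-category propensity indicator and the document rarity indicator of each candidate word have been obtained in the steps above, the two kinds of indicators can be combined to determine the candidate word's new-spam-word score.
In one example, step S107 may be implemented as follows: take a weighted average of the message-category propensity indicator and the document rarity indicator as the new-spam-word score.
The weight of each indicator can be determined according to business needs. When the document rarity indicator has a larger weight, the scores of low-frequency candidate words rise, which helps surface more low-frequency candidates but may also include more normal words. When the message-category propensity indicator has a larger weight, the scores of candidates with a strong spam propensity rise, which helps surface high-frequency candidates that are representative of spam text but may miss some low-frequency candidates.
For example, candidate word A appears 10 times in spam messages and once in normal messages, giving a simplified spam propensity score of 10/1 = 10; candidate word B appears 2 times in spam messages and once in normal messages, giving a simplified spam propensity score of 2/1 = 2. Judged by spam propensity alone, the high-frequency word A is more likely to be selected, yet the low-frequency word B may be a key spam word. To make words like B just as easy to collect, a document rarity indicator is added to increase the weight of low-frequency words. Suppose the IDF of A is 2 and the IDF of B is 6, and the propensity indicators (cross entropy, odds ratio, and mutual information) together carry a weight of 0.5 while IDF carries a weight of 0.5. The final scores are then A = 0.5*10 + 0.5*2 = 6 and B = 0.5*2 + 0.5*6 = 4, so both the high-frequency word A and the low-frequency word B can be selected.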
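The worked example above can be reproduced directly. In this sketch the function name `new_word_score` and the score threshold of 3 are hypothetical; the propensity values, IDF values, and the 0.5/0.5 weights come from the example.

```python
def new_word_score(propensity, idf, w_propensity=0.5, w_idf=0.5):
    """Weighted combination of the spam propensity and the IDF."""
    return w_propensity * propensity + w_idf * idf


# (simplified spam propensity, IDF) for words A and B from the example.
indicators = {"A": (10, 2), "B": (2, 6)}
scores = {w: new_word_score(p, r) for w, (p, r) in indicators.items()}
print(scores)  # {'A': 6.0, 'B': 4.0}

# With an assumed score threshold of 3, both words are kept; on
# propensity alone (10 versus 2) B would trail A by a factor of five.
threshold = 3
selected = [w for w, s in scores.items() if s > threshold]
print(selected)  # ['A', 'B']
```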
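Step S109: determine new spam words from the candidate word set according to the new-spam-word score.
This step determines new spam words from the candidate word set according to the new-spam-word score of each candidate word.
In one example, the candidate words whose new-spam-word score is greater than a score threshold are taken as new spam words.
In another example, step S109 includes the following sub-steps: 1) acquire and display the new words whose new-spam-word score is greater than the score threshold; 2) receive a confirmation instruction, input by a user, for a target new word; 3) take the target new word as a new spam word.
The score threshold can be determined according to business needs. The higher the threshold, the less noise among the new spam words but the lower the recall, so some new spam words may be lost; the lower the threshold, the higher the recall but the more noise and the greater the manual review workload.
Refer to FIG. 2, a detailed flowchart of an embodiment of the method for identifying new spam words of the present application. In another example, step S109 may include the following sub-steps:
Step S201: acquire the new words whose new-spam-word score is greater than the score threshold as candidate new spam words.
Step S203: determine the word-formation probability of each candidate new spam word.
In one example, this step may include the following sub-steps: 1) determine the internal cohesion of the candidate new spam word, and determine its external freedom; 2) obtain the word-formation probability of the candidate new spam word according to the internal cohesion and the external freedom.
In a specific implementation, the word-formation probability may be obtained as follows: take the average of the internal cohesion and the external freedom as the word-formation probability.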
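The text does not define how internal cohesion and external freedom are computed. A common choice in new-word discovery, assumed here purely for illustration, is to measure cohesion as the minimum pointwise mutual information over all two-way splits of the word and freedom as the smaller of the left and right neighbor-character entropies. All function names are hypothetical, and the resulting average is an unnormalized score rather than a probability in [0, 1].

```python
import math
from collections import Counter


def _entropy(chars):
    """Shannon entropy of the character distribution in chars."""
    counts = Counter(chars)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())


def external_freedom(word, corpus):
    """Smaller of the left and right neighbor-character entropies."""
    left, right = [], []
    for text in corpus:
        i = text.find(word)
        while i != -1:
            if i > 0:
                left.append(text[i - 1])
            if i + len(word) < len(text):
                right.append(text[i + len(word)])
            i = text.find(word, i + 1)
    if not left or not right:
        return 0.0
    return min(_entropy(left), _entropy(right))


def internal_cohesion(word, corpus):
    """Minimum pointwise mutual information over two-way splits."""
    total = sum(len(text) for text in corpus)

    def prob(s):
        return sum(text.count(s) for text in corpus) / total

    if len(word) < 2 or prob(word) == 0:
        return 0.0
    return min(math.log(prob(word) / (prob(word[:k]) * prob(word[k:])))
               for k in range(1, len(word)))


def word_formation_score(word, corpus):
    """Average of cohesion and freedom, as described in the text."""
    return (internal_cohesion(word, corpus) + external_freedom(word, corpus)) / 2


corpus = ["买发票吗", "开发票的", "卖发票啦"]
print(word_formation_score("发票", corpus), word_formation_score("开发", corpus))
```

On this toy corpus “发票” has varied neighbors on both sides and scores higher than “开发”, which only ever appears at the start of a message.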
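Step S205: determine the new spam words from the candidate new spam words according to the word-formation probability.
In a specific implementation, the candidate new spam words whose word-formation probability is greater than a word-formation probability threshold may be taken as the new spam words. Alternatively, the following sub-steps may be used: 1) acquire and display the candidate new spam words whose word-formation probability is greater than the word-formation probability threshold; 2) receive a confirmation instruction, input by a user, for a target new word; 3) take the target new word as a new spam word.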
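The word-formation probability threshold can be determined according to business needs. In this embodiment, to avoid filtering out the low-frequency candidate words selected by the new-spam-word score, whose frequency-based word-formation probability tends to be low, the word-formation probability threshold may be set relatively low.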
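In the method provided by this embodiment, the new words whose new-spam-word score is greater than the score threshold are acquired as candidate new spam words; the word-formation probability of each candidate is determined; and the new spam words are determined from the candidates according to the word-formation probability. With this processing, the identified new spam words are more likely to be meaningful words, such as “80后” (post-80s), and meaningless strings such as “新合” are not selected; the validity of the new spam words is thereby effectively improved.
Refer to FIG. 3, a detailed flowchart of an embodiment of the method for identifying new spam words of the present application. In another example, step S109 may include the following sub-steps:
Step S301: acquire the candidate words whose new-spam-word score is greater than the score threshold as candidate new spam words.
Step S303: determine the semantic similarity between each of at least one preset spam word and the candidate new spam word.
The at least one preset spam word includes, but is not limited to, the words in the spam dictionary.
In one example, step S303 may include the following sub-steps: 1) determine the word vector of the candidate new spam word; 2) determine the semantic similarity between the preset spam word and the candidate new spam word according to the word vector of the preset spam word and the word vector of the candidate new spam word.
1) Determine the word vector of the candidate new spam word.
The word vector of a candidate new spam word may be determined as follows: compute, offline or online, character-based language-model embeddings (word embeddings, i.e., word vectors) over all messages in the short-message set, for example with an N-gram or skip-gram language model, or with CBOW, GloVe, or similar methods, and derive the candidate's word vector from them. This processing effectively improves the accuracy of the word vectors. For example, message A is “买卖发票,加我微信” ("invoices bought and sold, add me on WeChat"), where “发票” (invoice) is a common word; message B is “买卖发瞟,加我微信”, where “发瞟” is a new word. The frequency of “发瞟” itself is low, but an embedding captures the contexts in which a word typically appears, so “发瞟” and “发票” are close in the embedding space.
2) Determine the semantic similarity between the preset spam word and the candidate new spam word according to the word vector of the preset spam word and the word vector of the candidate new spam word.
Once the word vectors of the preset spam word and the candidate new spam word have been obtained, the semantic similarity between the two words can be determined, for example by computing the cosine distance between the two vectors.
Step S305: determine the spam-vocabulary similarity of the candidate new spam word according to the semantic similarities.
After the semantic similarity between each preset spam word and the candidate new spam word has been obtained, the candidate's spam-vocabulary similarity can be determined from these similarities.
The spam-vocabulary similarity is the semantic similarity between a word and the existing spam vocabulary.
In one example, step S305 is implemented as follows: among the semantic similarities between each preset spam word and the candidate new spam word, take the largest as the spam-vocabulary similarity. In this way, the spam-vocabulary similarity of the candidate is set to its semantic similarity with the semantically closest spam word.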
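Steps S303 and S305 can be sketched together: compute the cosine similarity between the candidate's vector and each preset spam word's vector, then keep the maximum. The three-dimensional vectors below are made-up toy values standing in for trained embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def spam_vocabulary_similarity(candidate_vec, spam_vectors):
    """Largest similarity between the candidate and any preset spam word."""
    return max(cosine_similarity(candidate_vec, vec)
               for vec in spam_vectors.values())


# Toy vectors (assumed values, not trained embeddings).
spam_vectors = {"发票": [0.9, 0.1, 0.0], "贷款": [0.0, 0.2, 0.9]}
candidate_vec = [0.8, 0.2, 0.1]  # stands in for the vector of "发瞟"
print(spam_vocabulary_similarity(candidate_vec, spam_vectors))
```

Taking the maximum means a candidate only needs to be close to one known spam word to receive a high spam-vocabulary similarity.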
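Step S307: determine the new spam words from the candidate new spam words according to the spam-vocabulary similarity.
In one example, the candidate new spam words whose spam-vocabulary similarity is greater than a first similarity threshold or less than a second similarity threshold may be taken as new spam words. The first and second similarity thresholds can be set according to business needs.
In another example, step S307 may use the following sub-steps: 1) acquire and display the candidate new spam words whose spam-vocabulary similarity is greater than the first similarity threshold or less than the second similarity threshold; 2) receive a confirmation instruction, input by a user, for a target new word; 3) take the target new word as a new spam word.
In a specific implementation, a piecewise probability density function may be used to obtain a sampling probability for each candidate new spam word from its spam-vocabulary similarity, and the new spam words are determined according to the sampling probability. This embodiment uses the following piecewise probability density function: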
[Formula image PCTCN2019105123-appb-000001: piecewise probability density function p(x)]
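Here x denotes the spam-vocabulary similarity and p(x) denotes the sampling probability. As the function shows, the first similarity threshold may be set to 0.7 and the second similarity threshold to 0.3.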
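The piecewise function itself is not reproduced here, so the sketch below illustrates only the simpler threshold rule described for step S307, with the values 0.7 and 0.3 given above. The function name `select_for_review` and the similarity values are hypothetical.

```python
def select_for_review(similarities, first_threshold=0.7, second_threshold=0.3):
    """Keep candidates that are very close to known spam words or very
    far from all of them; the middle band is dropped."""
    return [word for word, sim in similarities.items()
            if sim > first_threshold or sim < second_threshold]


# Assumed spam-vocabulary similarities for three candidates.
similarities = {"发瞟": 0.92, "词甲": 0.50, "词乙": 0.10}
print(select_for_review(similarities))  # ['发瞟', '词乙']
```

Keeping both tails reflects the two goals stated in the text: surfacing variants of known spam words and surfacing words that may indicate an unseen spam pattern.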
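In the method provided by this embodiment, word embedding is applied to the candidate words to determine their word vectors; the semantic similarity between existing spam words and a candidate word is determined from their word vectors; the candidate's spam-vocabulary similarity is then determined; a sampling probability is obtained for the candidate through the piecewise probability density function; and candidates with a high sampling probability are shown to reviewers. This processing extends the candidate words semantically, so that new-word identification no longer relies on frequency alone. Once the semantics of candidate words are taken into account, rare new words are not overlooked, and candidates that are semantically close to existing spam words are highlighted. In addition, candidates that are semantically far from all existing spam words can be selected, which may reveal new spam patterns: the contexts of such messages may not have been seen before, but since the word has passed both the message-category propensity check and the word-formation probability check, it may be a good supplement to the existing spam vocabulary. The accuracy of new-spam-word identification is thereby effectively improved.
For example, in “买卖发票,加我微信”, “发票” is a common word. Another message reads “买卖发瞟,加我微信”, where “发瞟” is a new word. The frequency of “发瞟” itself is low, but the word vector (embedding) captures the contexts in which the word typically appears, so “发瞟” and “发票” have similar word vectors, and “发瞟” is selected as a candidate that is semantically close to existing spam words and submitted to reviewers.
Refer to FIG. 4, a schematic diagram of an embodiment of the method for identifying new spam words of the present application. As FIG. 4 shows, this embodiment first determines the candidate word set, which includes a plurality of candidate words, through step S103. Then, in step S105, it determines message-category propensity indicators such as cross entropy, odds ratio, and mutual information according to the labels (i.e., the message-category information), and determines document rarity indicators such as IDF; in step S107 it determines each candidate word's new-spam-word score and performs a first layer of filtering on that score, removing from the set the candidates whose score is less than or equal to the score threshold. Next, a second layer of filtering is applied to the remaining candidates: the word-formation probability of each candidate is obtained from its internal cohesion and external freedom, and candidates whose word-formation probability is less than or equal to the word-formation probability threshold are removed. Then a third layer of filtering is applied to the remaining candidates: the cosine similarity between each candidate and the existing spam words is computed through word embedding, a sampling probability is obtained through the piecewise probability density function, and the new spam words to be shown for manual review are finally determined according to the sampling probability. With this processing, spam words in short messages are mined by an unsupervised new-word discovery technique based on three layers of filtering, which effectively improves both the accuracy and the recall of new-spam-word identification and relieves the pain point of inefficient full manual review of spam words. The solution can promptly supply reviewers with spam words newly appearing online for risk control; in addition, the mined spam words can be used to build a blacklist library and to improve the recognition rate for spam text.
As the above embodiment shows, in the method for identifying new spam words provided by the embodiments of the present application, a short-message set is acquired; a candidate word set corresponding to the plurality of spam messages is determined; a message-category propensity indicator of each candidate word is determined according to the message-category information of the short messages, and a document rarity indicator of the candidate word is acquired; a new-spam-word score of the candidate word is determined according to the two indicators; and new spam words are determined from the candidate word set according to the score. With this processing, most of the new words recalled according to the message-category propensity indicator are representative of spam messages, and recalling many normal words is avoided, so precision is effectively improved. At the same time, low-frequency new spam words can be recalled according to the document rarity indicator, so recall is effectively improved. In summary, this processing selects new words that are both relatively rare in spam messages and inclined toward the spam category, effectively improving precision and recall and hence the accuracy of new-word discovery.
The above embodiment provides a method for identifying new spam words in short messages; correspondingly, the present application also provides an apparatus for identifying new spam words in short messages. The apparatus corresponds to the method embodiment above.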
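Second Embodiment
Refer to FIG. 5, a schematic diagram of an embodiment of the apparatus for identifying new spam words in short messages of the present application. Because the apparatus embodiment is substantially similar to the method embodiment, it is described briefly; for related details, see the corresponding description of the method embodiment. The apparatus embodiment described below is merely illustrative.
The present application further provides an apparatus for identifying new spam words in short messages, including:
a message set acquisition unit 501, configured to acquire a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages;
a candidate word set determination unit 503, configured to determine a candidate word set corresponding to the plurality of spam messages;
an indicator determination unit 505, configured to determine, according to message-category information of the short messages, a message-category propensity indicator of each candidate word, and to acquire a document rarity indicator of the candidate word;
a score determination unit 507, configured to determine a new-spam-word score of the candidate word according to the message-category propensity indicator and the document rarity indicator;
a new word determination unit 509, configured to determine new spam words from the candidate word set according to the new-spam-word score.
Optionally, the candidate word set determination unit 503 is specifically configured to:
obtain, by a word segmentation algorithm, the words contained in the spam messages as message words;
if a combined word formed by at least two adjacent message words satisfies a candidate word rule, take the combined word as a candidate word.
Optionally, the score determination unit 507 is specifically configured to take a weighted average of the message-category propensity indicator and the document rarity indicator as the new-spam-word score.
Optionally, the new word determination unit 509 includes:
a candidate word selection subunit, configured to acquire the new words whose new-spam-word score is greater than a score threshold as candidate new spam words;
a word-formation probability determination subunit, configured to determine a word-formation probability of each candidate new spam word;
a first new word determination subunit, configured to determine the new spam words from the candidate new spam words according to the word-formation probability.
Optionally, the new word determination unit 509 includes:
a candidate word selection subunit, configured to acquire the candidate words whose new-spam-word score is greater than a score threshold as candidate new spam words;
a first similarity determination subunit, configured to determine a semantic similarity between each of at least one preset spam word and the candidate new spam word;
a second similarity determination subunit, configured to determine a spam-vocabulary similarity of the candidate new spam word according to the semantic similarities;
a second new word determination subunit, configured to determine the new spam words from the candidate new spam words according to the spam-vocabulary similarity.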
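Third Embodiment
Refer to FIG. 6, a schematic diagram of an embodiment of the electronic device of the present application. Because the device embodiment is substantially similar to the method embodiment, it is described briefly; for related details, see the corresponding description of the method embodiment. The device embodiment described below is merely illustrative.
The electronic device of this embodiment includes a processor 601 and a memory 602. The memory is configured to store a program implementing the method for identifying new spam words in short messages; after the device is powered on and the program is run by the processor, the following steps are performed: acquiring a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages; determining a candidate word set corresponding to the plurality of spam messages; determining, according to message-category information of the short messages, a message-category propensity indicator of each candidate word, and acquiring a document rarity indicator of the candidate word; determining a new-spam-word score of the candidate word according to the message-category propensity indicator and the document rarity indicator; and determining new spam words from the candidate word set according to the new-spam-word score.
Fourth Embodiment
The above embodiments provide a method for identifying new spam words in short messages; the present application also provides a further method for identifying new spam words in short messages, which shares the same technical concept as the method above.
Refer to FIG. 7, a flowchart of an embodiment of this method. Because this method embodiment corresponds to the method embodiment above, it is described briefly; for related details, see the corresponding description of the method embodiment above.
The present application further provides a method for identifying new spam words in short messages, including:
Step S701: acquire a short-message set.
Step S703: determine a candidate word set corresponding to a plurality of spam messages.
Step S705: determine a message-category propensity score of each candidate word according to message-category information of the short messages.
In one example, the message-category propensity score is computed as follows: take a weighted average of the message-category propensity indicators as the message-category propensity score.
Step S707: determine new spam words from the candidate word set according to the message-category propensity score.
In one example, the candidate words whose message-category propensity score is greater than a score threshold are taken as new spam words. The score threshold can be set according to business needs. When the message-category propensity score is a spam propensity score, the smaller the threshold, the more candidate words pass, but every word that leans toward spam may then be selected, including many that are not new spam words; the larger the threshold, the fewer candidate words pass, which helps surface high-frequency candidates that are representative of spam text but may miss some low-frequency candidates.
In yet another example, the candidate words whose message-category propensity score is greater than the score threshold may be processed further as follows: 1) apply a second layer of filtering according to the word-formation probability of each candidate word; 2) obtain a sampling probability for each candidate word from its spam-vocabulary similarity through a piecewise probability density function, and determine the new spam words according to the sampling probability. With this processing, the identified new spam words are more likely to be meaningful words, and both the candidates that are semantically close to existing spam words and the candidates that are semantically far from all existing spam words are highlighted.
As the above embodiment shows, in the method for identifying new spam words provided by this embodiment, a short-message set is acquired; a candidate word set corresponding to the plurality of spam messages is determined; a message-category propensity score of each candidate word is determined according to the message-category information of the short messages, the message category being the spam class or the normal class; and new spam words are determined from the candidate word set according to the message-category propensity score. With this processing, most of the new words recalled according to the message-category propensity indicator are representative of spam messages, and recalling many normal words is avoided, so precision is effectively improved.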
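Fifth Embodiment
The above embodiment provides a method for identifying new spam words in short messages; correspondingly, the present application also provides an apparatus for identifying new spam words in short messages. The apparatus corresponds to the method embodiment above.
Refer to FIG. 8, a schematic diagram of an embodiment of the apparatus for identifying new spam words in short messages of the present application. Because the apparatus embodiment is substantially similar to the method embodiment, it is described briefly; for related details, see the corresponding description of the method embodiment. The apparatus embodiment described below is merely illustrative.
The present application further provides an apparatus for identifying new spam words in short messages, including:
a message set acquisition unit 801, configured to acquire a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages;
a candidate word set determination unit 803, configured to determine a candidate word set corresponding to the plurality of spam messages;
a score determination unit 805, configured to determine a message-category propensity score of each candidate word according to message-category information of the short messages;
a new word determination unit 807, configured to determine new spam words from the candidate word set according to the message-category propensity score.
Optionally, the new word determination unit 807 includes:
a candidate word selection subunit, configured to acquire the candidate words whose message-category propensity score is greater than a score threshold as candidate new spam words;
a similarity determination subunit, configured to determine a spam-vocabulary similarity of each candidate new spam word;
a new word determination subunit, configured to determine the new spam words from the candidate new spam words according to the spam-vocabulary similarity.
Sixth Embodiment
Refer to FIG. 9, a schematic diagram of an embodiment of the electronic device of the present application. Because the device embodiment is substantially similar to the method embodiment, it is described briefly; for related details, see the corresponding description of the method embodiment. The device embodiment described below is merely illustrative.
The electronic device of this embodiment includes a processor 901 and a memory 902. The memory is configured to store a program implementing the method for identifying new spam words in short messages; after the device is powered on and the program is run by the processor, the following steps are performed: acquiring a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages; determining a candidate word set corresponding to the plurality of spam messages; determining a message-category propensity score of each candidate word according to message-category information of the short messages; and determining new spam words from the candidate word set according to the message-category propensity score.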
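Seventh Embodiment
The above embodiments provide a method for identifying new spam words in short messages; the present application also provides a further method for identifying new spam words in short messages, which shares the same technical concept as the methods above.
Refer to FIG. 10, a flowchart of an embodiment of this method. Because this method embodiment corresponds to the method embodiments above, it is described briefly; for related details, see the corresponding description of the method embodiments above.
The present application further provides a method for identifying new spam words in short messages, including:
Step S1001: acquire a short-message set.
Step S1003: determine a candidate word set corresponding to a plurality of spam messages.
Step S1005: determine a semantic similarity between each of at least one preset spam word and each candidate word.
Step S1007: determine a spam-vocabulary similarity of the candidate word according to the semantic similarities.
Step S1009: determine new spam words from the candidate word set according to the spam-vocabulary similarity.
In one example, step S1009 may be implemented as follows: obtain a sampling probability for each candidate word from its spam-vocabulary similarity through a piecewise probability density function, and determine the new spam words according to the sampling probability. With this processing, both the candidates that are semantically close to existing spam words and the candidates that are semantically far from all existing spam words are highlighted.
As the above embodiment shows, in the method for identifying new spam words provided by this embodiment, a short-message set is acquired; a candidate word set corresponding to the plurality of spam messages is determined; a semantic similarity between each of at least one preset spam word and each candidate word is determined; a spam-vocabulary similarity of the candidate word is determined according to the semantic similarities; and new spam words are determined from the candidate word set according to the spam-vocabulary similarity. With this processing, the candidate words are semantically extended through word embedding, and the new spam words are determined according to the semantic similarity between candidate words and spam words, so rare new words are not overlooked and the recall of new spam words is effectively improved.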
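Eighth Embodiment
The above embodiment provides a method for identifying new spam words in short messages; correspondingly, the present application also provides an apparatus for identifying new spam words in short messages. The apparatus corresponds to the method embodiment above.
Refer to FIG. 11, a schematic diagram of an embodiment of the apparatus for identifying new spam words in short messages of the present application. Because the apparatus embodiment is substantially similar to the method embodiment, it is described briefly; for related details, see the corresponding description of the method embodiment. The apparatus embodiment described below is merely illustrative.
The present application further provides an apparatus for identifying new spam words in short messages, including:
a message set acquisition unit 1101, configured to acquire a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages;
a candidate word set determination unit 1103, configured to determine a candidate word set corresponding to the plurality of spam messages;
a first semantic similarity determination unit 1105, configured to determine a semantic similarity between each of at least one preset spam word and each candidate word;
a second semantic similarity determination unit 1107, configured to determine a spam-vocabulary similarity of the candidate word according to the semantic similarities;
a new word determination unit 1109, configured to determine new spam words from the candidate word set according to the spam-vocabulary similarity.
Ninth Embodiment
Refer to FIG. 12, a schematic diagram of an embodiment of the electronic device of the present application. Because the device embodiment is substantially similar to the method embodiment, it is described briefly; for related details, see the corresponding description of the method embodiment. The device embodiment described below is merely illustrative.
The electronic device of this embodiment includes a processor 1201 and a memory 1202. The memory is configured to store a program implementing the method for identifying new spam words in short messages; after the device is powered on and the program is run by the processor, the following steps are performed: acquiring a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages; determining a candidate word set corresponding to the plurality of spam messages; determining a semantic similarity between each of at least one preset spam word and each candidate word; determining a spam-vocabulary similarity of the candidate word according to the semantic similarities; and determining new spam words from the candidate word set according to the spam-vocabulary similarity.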
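Although the present application is disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art can make possible changes and modifications without departing from the spirit and scope of the present application, so the scope of protection of the present application shall be as defined by its claims.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
Memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
2. Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. The present application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.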

Claims (31)

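  1. A method for identifying new spam words, characterized by including:
    acquiring a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages;
    determining a candidate word set corresponding to the plurality of spam messages;
    determining, according to message-category information of the short messages, a message-category propensity indicator of each candidate word; and acquiring a document rarity indicator of the candidate word;
    determining a new-spam-word score of the candidate word according to the message-category propensity indicator and the document rarity indicator;
    determining new spam words from the candidate word set according to the new-spam-word score.
  2. The method according to claim 1, characterized in that determining the candidate word set corresponding to the plurality of spam messages includes:
    obtaining, by a word segmentation algorithm, the words contained in the spam messages as message words;
    if a combined word formed by at least two adjacent message words satisfies a candidate word rule, taking the combined word as a candidate word.
  3. The method according to claim 2, characterized in that the candidate word rule includes: the number of characters in the candidate word is less than a preset number of characters.
  4. The method according to claim 1, characterized in that
    the message-category propensity indicator includes at least one of the following: cross entropy, odds ratio, mutual information;
    the document rarity indicator includes: inverse document frequency (IDF).
  5. The method according to claim 1, characterized in that determining the new-spam-word score of the candidate word according to the message-category propensity indicator and the document rarity indicator includes:
    taking a weighted average of the message-category propensity indicator and the document rarity indicator as the new-spam-word score.
  6. The method according to claim 1, characterized in that determining new spam words from the candidate word set according to the new-spam-word score includes:
    taking the candidate words whose new-spam-word score is greater than a score threshold as new spam words.
  7. The method according to claim 1, characterized in that determining new spam words from the candidate word set according to the new-spam-word score includes:
    acquiring and displaying the new words whose new-spam-word score is greater than a score threshold;
    receiving a confirmation instruction for a target new word;
    taking the target new word as a new spam word.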
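  8. The method according to claim 1, characterized in that determining new spam words from the candidate word set according to the new-spam-word score includes:
    acquiring the new words whose new-spam-word score is greater than a score threshold as candidate new spam words;
    determining a word-formation probability of each candidate new spam word;
    determining the new spam words from the candidate new spam words according to the word-formation probability.
  9. The method according to claim 8, characterized in that determining the new spam words from the candidate new spam words according to the word-formation probability includes:
    taking the candidate new spam words whose word-formation probability is greater than a word-formation probability threshold as the new spam words.
  10. The method according to claim 8, characterized in that determining the new spam words from the candidate new spam words according to the word-formation probability includes:
    acquiring and displaying the candidate new spam words whose word-formation probability is greater than a word-formation probability threshold;
    receiving a confirmation instruction, input by a user, for a target new word;
    taking the target new word as a new spam word.
  11. The method according to claim 8, characterized in that determining the word-formation probability of the candidate new spam word includes:
    determining an internal cohesion of the candidate new spam word; and determining an external freedom of the candidate new spam word;
    obtaining the word-formation probability of the candidate new spam word according to the internal cohesion and the external freedom.
  12. The method according to claim 11, characterized in that obtaining the word-formation probability of the candidate new spam word according to the internal cohesion and the external freedom includes:
    taking the average of the internal cohesion and the external freedom as the word-formation probability.
  13. The method according to claim 1, characterized in that determining new spam words from the candidate word set according to the new-spam-word score includes:
    acquiring the candidate words whose new-spam-word score is greater than a score threshold as candidate new spam words;
    determining a semantic similarity between each of at least one preset spam word and the candidate new spam word;
    determining a spam-vocabulary similarity of the candidate new spam word according to the semantic similarities;
    determining the new spam words from the candidate new spam words according to the spam-vocabulary similarity.
  14. The method according to claim 13, characterized in that determining the new spam words from the candidate new spam words according to the spam-vocabulary similarity includes:
    taking the candidate new spam words whose spam-vocabulary similarity is greater than a first similarity threshold or less than a second similarity threshold as new spam words.
  15. The method according to claim 13, characterized in that determining the new spam words from the candidate new spam words according to the spam-vocabulary similarity includes:
    acquiring and displaying the candidate new spam words whose spam-vocabulary similarity is greater than a first similarity threshold or less than a second similarity threshold;
    receiving a confirmation instruction for a target candidate new word;
    taking the target candidate new word as a new spam word.
  16. The method according to claim 13, characterized in that determining the spam-vocabulary similarity of the candidate new spam word according to the semantic similarities includes:
    taking the maximum of the semantic similarities as the spam-vocabulary similarity.
  17. The method according to claim 13, characterized in that determining the semantic similarity between each of at least one preset spam word and the candidate new spam word includes:
    determining a word vector of the candidate new word;
    determining the semantic similarity between the preset spam word and the candidate new word according to the word vector of the preset spam word and the word vector of the candidate new word.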
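  18. A method for identifying new spam words, characterized by including:
    acquiring a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages;
    determining a candidate word set corresponding to the plurality of spam messages;
    determining a message-category propensity score of each candidate word according to message-category information of the short messages;
    determining new spam words from the candidate word set according to the message-category propensity score.
  19. The method according to claim 18, characterized in that determining new spam words from the candidate word set according to the message-category propensity score includes:
    acquiring the candidate words whose message-category propensity score is greater than a score threshold as candidate new spam words;
    determining a spam-vocabulary similarity of each candidate new spam word;
    determining the new spam words from the candidate new spam words according to the spam-vocabulary similarity.
  20. A method for identifying new spam words, characterized by including:
    acquiring a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages;
    determining a candidate word set corresponding to the plurality of spam messages;
    determining a semantic similarity between each of at least one preset spam word and each candidate word;
    determining a spam-vocabulary similarity of the candidate word according to the semantic similarities;
    determining new spam words from the candidate word set according to the spam-vocabulary similarity.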
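  21. An apparatus for identifying new spam words in short messages, characterized by including:
    a message set acquisition unit, configured to acquire a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages;
    a candidate word set determination unit, configured to determine a candidate word set corresponding to the plurality of spam messages;
    an indicator determination unit, configured to determine, according to message-category information of the short messages, a message-category propensity indicator of each candidate word, and to acquire a document rarity indicator of the candidate word;
    a score determination unit, configured to determine a new-spam-word score of the candidate word according to the message-category propensity indicator and the document rarity indicator;
    a new word determination unit, configured to determine new spam words from the candidate word set according to the new-spam-word score.
  22. The apparatus according to claim 21, characterized in that the candidate word set determination unit is specifically configured to:
    obtain, by a word segmentation algorithm, the words contained in the spam messages as message words;
    if a combined word formed by at least two adjacent message words satisfies a candidate word rule, take the combined word as a candidate word.
  23. The apparatus according to claim 21, characterized in that
    the score determination unit is specifically configured to take a weighted average of the message-category propensity indicator and the document rarity indicator as the new-spam-word score.
  24. The apparatus according to claim 21, characterized in that the new word determination unit includes:
    a candidate word selection subunit, configured to acquire the new words whose new-spam-word score is greater than a score threshold as candidate new spam words;
    a word-formation probability determination subunit, configured to determine a word-formation probability of each candidate new spam word;
    a first new word determination subunit, configured to determine the new spam words from the candidate new spam words according to the word-formation probability.
  25. The apparatus according to claim 21, characterized in that the new word determination unit includes:
    a candidate word selection subunit, configured to acquire the candidate words whose new-spam-word score is greater than a score threshold as candidate new spam words;
    a first similarity determination subunit, configured to determine a semantic similarity between each of at least one preset spam word and the candidate new spam word;
    a second similarity determination subunit, configured to determine a spam-vocabulary similarity of the candidate new spam word according to the semantic similarities;
    a second new word determination subunit, configured to determine the new spam words from the candidate new spam words according to the spam-vocabulary similarity.
  26. An electronic device, characterized by including:
    a processor; and
    a memory, configured to store a program implementing the method for identifying new spam words in short messages, wherein after the device is powered on and the program is run by the processor, the following steps are performed: acquiring a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages; determining a candidate word set corresponding to the plurality of spam messages; determining, according to message-category information of the short messages, a message-category propensity indicator of each candidate word, and acquiring a document rarity indicator of the candidate word; determining a new-spam-word score of the candidate word according to the message-category propensity indicator and the document rarity indicator; and determining new spam words from the candidate word set according to the new-spam-word score.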
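  27. An apparatus for identifying new spam words in short messages, characterized by including:
    a message set acquisition unit, configured to acquire a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages;
    a candidate word set determination unit, configured to determine a candidate word set corresponding to the plurality of spam messages;
    a score determination unit, configured to determine a message-category propensity score of each candidate word according to message-category information of the short messages;
    a new word determination unit, configured to determine new spam words from the candidate word set according to the message-category propensity score.
  28. The apparatus according to claim 27, characterized in that the new word determination unit includes:
    a candidate word selection subunit, configured to acquire the candidate words whose message-category propensity score is greater than a score threshold as candidate new spam words;
    a similarity determination subunit, configured to determine a spam-vocabulary similarity of each candidate new spam word;
    a new word determination subunit, configured to determine the new spam words from the candidate new spam words according to the spam-vocabulary similarity.
  29. An electronic device, characterized by including:
    a processor; and
    a memory, configured to store a program implementing the method for identifying new spam words in short messages, wherein after the device is powered on and the program is run by the processor, the following steps are performed: acquiring a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages; determining a candidate word set corresponding to the plurality of spam messages; determining a message-category propensity score of each candidate word according to message-category information of the short messages; and determining new spam words from the candidate word set according to the message-category propensity score.
  30. An apparatus for identifying new spam words in short messages, characterized by including:
    a message set acquisition unit, configured to acquire a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages;
    a candidate word set determination unit, configured to determine a candidate word set corresponding to the plurality of spam messages;
    a first semantic similarity determination unit, configured to determine a semantic similarity between each of at least one preset spam word and each candidate word;
    a second semantic similarity determination unit, configured to determine a spam-vocabulary similarity of the candidate word according to the semantic similarities;
    a new word determination unit, configured to determine new spam words from the candidate word set according to the spam-vocabulary similarity.
  31. An electronic device, characterized by including:
    a processor; and
    a memory, configured to store a program implementing the method for identifying new spam words in short messages, wherein after the device is powered on and the program is run by the processor, the following steps are performed: acquiring a short-message set, the short-message set including a plurality of spam messages and a plurality of normal messages; determining a candidate word set corresponding to the plurality of spam messages; determining a semantic similarity between each of at least one preset spam word and each candidate word; determining a spam-vocabulary similarity of the candidate word according to the semantic similarities; and determining new spam words from the candidate word set according to the spam-vocabulary similarity.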
PCT/CN2019/105123 2018-09-14 2019-09-10 Method, apparatus and electronic device for identifying new spam words in short messages WO2020052547A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811076259.9A CN110909540B (zh) 2018-09-14 2018-09-14 Method, apparatus and electronic device for identifying new spam words in short messages
CN201811076259.9 2018-09-14

Publications (1)

Publication Number Publication Date
WO2020052547A1 true WO2020052547A1 (zh) 2020-03-19

Family

ID=69777335

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/105123 WO2020052547A1 (zh) 2018-09-14 2019-09-10 Method, apparatus and electronic device for identifying new spam words in short messages

Country Status (2)

Country Link
CN (1) CN110909540B (zh)
WO (1) WO2020052547A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112000794A (zh) * 2020-07-30 2020-11-27 北京百度网讯科技有限公司 文本语料筛选方法、装置、电子设备及存储介质
CN112926319A (zh) * 2021-02-26 2021-06-08 北京百度网讯科技有限公司 一种领域词汇的确定方法、装置、设备以及存储介质

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434512A (zh) * 2020-09-17 2021-03-02 上海二三四五网络科技有限公司 一种结合上下文语境的新词确定方法及装置
CN115858771A (zh) * 2022-01-11 2023-03-28 北京中关村科金技术有限公司 词语的查找方法、装置和计算机可读存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105516499A (zh) * 2015-12-14 2016-04-20 北京奇虎科技有限公司 一种对短信进行分类的方法、装置、通信终端及服务器
WO2017084267A1 (zh) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 一种关键词提取方法和装置
CN106878347A (zh) * 2017-04-28 2017-06-20 北京奇虎科技有限公司 信息处理方法、系统、移动终端和服务器
CN107193804A (zh) * 2017-06-02 2017-09-22 河海大学 一种面向词和组合词的垃圾短信文本特征选择方法

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107153658A (zh) * 2016-03-03 2017-09-12 常州普适信息科技有限公司 一种基于关键字加权算法的舆情热词发现方法
CN107402945B (zh) * 2017-03-15 2020-07-10 阿里巴巴集团控股有限公司 词库生成方法及装置、短文本检测方法及装置
CN108509474B (zh) * 2017-09-15 2022-01-07 腾讯科技(深圳)有限公司 搜索信息的同义词扩展方法及装置
CN108021558A (zh) * 2017-12-27 2018-05-11 北京金山安全软件有限公司 关键词的识别方法、装置、电子设备和存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017084267A1 (zh) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 一种关键词提取方法和装置
CN105516499A (zh) * 2015-12-14 2016-04-20 北京奇虎科技有限公司 一种对短信进行分类的方法、装置、通信终端及服务器
CN106878347A (zh) * 2017-04-28 2017-06-20 北京奇虎科技有限公司 信息处理方法、系统、移动终端和服务器
CN107193804A (zh) * 2017-06-02 2017-09-22 河海大学 一种面向词和组合词的垃圾短信文本特征选择方法

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112000794A (zh) * 2020-07-30 2020-11-27 北京百度网讯科技有限公司 文本语料筛选方法、装置、电子设备及存储介质
CN112000794B (zh) * 2020-07-30 2023-08-22 北京百度网讯科技有限公司 文本语料筛选方法、装置、电子设备及存储介质
CN112926319A (zh) * 2021-02-26 2021-06-08 北京百度网讯科技有限公司 一种领域词汇的确定方法、装置、设备以及存储介质
CN112926319B (zh) * 2021-02-26 2024-01-12 北京百度网讯科技有限公司 一种领域词汇的确定方法、装置、设备以及存储介质

Also Published As

Publication number Publication date
CN110909540A (zh) 2020-03-24
CN110909540B (zh) 2022-05-24

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19860851

Country of ref document: EP

Kind code of ref document: A1