WO2018028065A1 - Method and device for classifying short message and computer storage medium


Info

Publication number: WO2018028065A1
Authority: WO, WIPO (PCT)
Prior art keywords: short message, type, short, vector, word
Application number: PCT/CN2016/105378
Other languages: French (fr), Chinese (zh)
Inventor: 陈军
Original Assignee: 中兴通讯股份有限公司
Application filed by: 中兴通讯股份有限公司
Publication of WO2018028065A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/725 Cordless telephones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72436 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for text messaging, e.g. short messaging services [SMS] or e-mails
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification


Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • General Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a method and a device for classifying a short message, and a computer storage medium. The method comprises: recognizing a preset feature word in a received short message; replacing the preset feature word in the short message with a feature symbol corresponding to the preset feature word; determining a first classification model; reading, from a high-frequency word vector library of the first classification model, a symbol vector of the feature symbol and word vectors of the remaining words in the short message other than the preset feature word; performing, according to the first classification model, a weighted operation on the read symbol vector and word vectors to obtain a first operation result; and determining the type of the short message according to the first operation result. By means of a preset classification model, the solution can accurately determine the type to which a short message belongs, enabling intelligent management of short messages and making it easier for the user to query and organize them.

Description

Short message classification method, device and computer storage medium
Technical field
The present invention relates to the technical field of text classification and statistics, and in particular to a short message classification method, a device, and a computer storage medium.
Background
At present, short messages in a terminal (including text messages in the notification center) are largely unclassified, or are merely grouped by sender number and ordered by time of receipt.
As a result, when a terminal stores a large number of short messages, querying and organizing them is highly inconvenient for the user. For example, a user looking for a credit card repayment message sent by China Merchants Bank a few days earlier must search manually through the many messages sent by that bank, which is time-consuming and laborious. Even a user who regularly tidies short messages by hand is prone to deleting messages by mistake or to missing messages that should have been deleted.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a short message classification method and device, so as to solve the problem that existing ways of classifying short messages make it highly inconvenient for users to query and organize them.
To achieve the above purpose, an embodiment of the present invention provides a short message classification method, including:
identifying preset feature words in a received short message;
replacing the preset feature words in the short message with feature symbols corresponding to the preset feature words;
determining a first classification model, wherein the short message types corresponding to the first classification model include at least one first short message type and a non-first short message type;
reading, from a high-frequency word vector library of the first classification model, the symbol vectors of the feature symbols and the word vectors of the remaining words in the short message other than the preset feature words;
performing, according to the first classification model, a weighted operation on the read symbol vectors and word vectors to obtain a first operation result;
determining, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type.
Preferably, the method further includes:
if the type of the short message is the non-first short message type, determining a second classification model, wherein the short message types corresponding to the second classification model include at least one second short message type and a non-second short message type;
reading, from a high-frequency word vector library of the second classification model, the symbol vectors of the feature symbols and the word vectors of the remaining words in the short message other than the preset feature words;
performing, according to the second classification model, a weighted operation on the read symbol vectors and word vectors to obtain a second operation result;
determining, according to the second operation result, that the type of the short message is the second short message type or the non-second short message type.
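The fallback from the first classification model to the second can be pictured as a simple cascade. The following is a minimal sketch only: the `Model` interface (a callable returning a type label, or `None` for the "non-Nth" outcome), the keyword-based toy models, and the label names are all assumptions made for illustration and are not taken from the patent.

```python
from typing import Callable, List, Optional

# Hypothetical interface: each model returns a type label, or None when the
# message falls into that model's "non-Nth short message type".
Model = Callable[[str], Optional[str]]


def classify_cascade(message: str, models: List[Model], fallback: str = "OTHER") -> str:
    """Try each classification model in order; stop at the first positive type."""
    for model in models:
        label = model(message)
        if label is not None:
            return label
    return fallback


# Toy stand-ins for trained models (keyword rules, for illustration only).
def first_model(message: str) -> Optional[str]:
    return "REPAYMENT_REMINDER" if "repayment" in message else None


def second_model(message: str) -> Optional[str]:
    return "VERIFICATION_CODE" if "verification code" in message else None


if __name__ == "__main__":
    models = [first_model, second_model]
    print(classify_cascade("Your verification code is 1234", models))
```

Only messages that the first model rejects reach the second model, so later models can be trained on narrower sets of types.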
Preferably, the step of performing, according to the first classification model, a weighted operation on the read symbol vectors and word vectors to obtain a first operation result includes:
processing the read symbol vectors and word vectors according to the first classification model to obtain an information vector corresponding to the short message;
determining, for each first short message type and for the non-first short message type, a weight coefficient vector corresponding to the information vector, wherein the information values in the information vector correspond one-to-one to the weight coefficients in the weight coefficient vector;
performing a weighted operation on the information vector with the determined weight coefficient vector of each short message type to obtain at least two predicted quantized values.
Preferably, the step of determining, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type includes:
comparing the at least two predicted quantized values to obtain the largest of them;
determining that the type of the short message is the short message type corresponding to the largest predicted quantized value.
Preferably, before the step of identifying the preset feature words in the received short message, the method further includes:
normalizing the received short message;
and the step of identifying the preset feature words in the received short message includes:
identifying the preset feature words in the normalized short message.
Preferably, the step of reading the word vectors of the remaining words in the short message other than the preset feature words includes:
obtaining, by a text word-segmentation technique, the multi-character words among the remaining text of the short message other than the preset feature words;
reading the vectors of the obtained multi-character words, and the word vectors of the remaining single characters of the short message other than the preset feature words and the obtained multi-character words.
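The text does not say which word-segmentation technique is used. As one possible illustration, the sketch below substitutes forward maximum matching against a small lexicon, then reads a vector per multi-character word and falls back to a per-character vector for everything else. The function names, the lexicon, and the vector values are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def segment(text: str, lexicon: Sequence[str]) -> List[str]:
    """Forward maximum matching: take the longest lexicon word, else one character."""
    words = set(lexicon)
    max_len = max((len(w) for w in words), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in words:
                tokens.append(piece)
                i += size
                break
    return tokens


def read_vectors(tokens: Sequence[str],
                 word_vectors: Dict[str, Vector],
                 char_vectors: Dict[str, Vector],
                 shared_vector: Vector) -> List[Vector]:
    """Use the multi-character word vector when present, else the character vector."""
    result = []
    for token in tokens:
        if token in word_vectors:
            result.append(word_vectors[token])
        else:
            result.append(char_vectors.get(token, shared_vector))
    return result


if __name__ == "__main__":
    word_vectors = {"信用卡": [0.1, 0.2, 0.3, 0.4]}   # made-up values
    char_vectors = {"账": [0.5, 0.5, 0.0, 0.0]}       # made-up values
    tokens = segment("信用卡账单", list(word_vectors))
    print(tokens)
    print(read_vectors(tokens, word_vectors, char_vectors, [0.0] * 4))
```

Characters that have neither a word vector nor a character vector receive the shared fallback vector.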
Preferably, after the step of determining, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type, the method further includes:
saving the short message under the short message type to which it belongs.
Preferably, after the step of determining, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type, the method further includes:
outputting at least one of the preset feature words.
The present invention also provides a short message classification device, including:
an identification module, configured to identify preset feature words in a received short message;
a replacement module, configured to replace the preset feature words in the short message with feature symbols corresponding to the preset feature words;
a first determining module, configured to determine a first classification model, wherein the short message types corresponding to the first classification model include at least one first short message type and a non-first short message type;
a first reading module, configured to read, from a high-frequency word vector library of the first classification model, the symbol vectors of the feature symbols and the word vectors of the remaining words in the short message other than the preset feature words;
a first operation module, configured to perform, according to the first classification model, a weighted operation on the read symbol vectors and word vectors to obtain a first operation result;
a first decision module, configured to determine, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type.
Preferably, the device further includes:
a second determining module, configured to determine a second classification model when the type of the short message is the non-first short message type, wherein the short message types corresponding to the second classification model include at least one second short message type and a non-second short message type;
a second reading module, configured to read, from a high-frequency word vector library of the second classification model, the symbol vectors of the feature symbols and the word vectors of the remaining words in the short message other than the preset feature words;
a second operation module, configured to perform, according to the second classification model, a weighted operation on the read symbol vectors and word vectors to obtain a second operation result;
a second decision module, configured to determine, according to the second operation result, that the type of the short message is the second short message type or the non-second short message type.
Preferably, the first operation module includes:
a processing unit, configured to process the read symbol vectors and word vectors according to the first classification model to obtain an information vector corresponding to the short message;
a determining unit, configured to determine, for each first short message type and for the non-first short message type, a weight coefficient vector corresponding to the information vector, wherein the information values in the information vector correspond one-to-one to the weight coefficients in the weight coefficient vector;
an operation unit, configured to perform a weighted operation on the information vector with the determined weight coefficient vector of each short message type to obtain at least two predicted quantized values.
Preferably, the first decision module includes:
a comparing unit, configured to compare the at least two predicted quantized values to obtain the largest of them;
a decision unit, configured to determine that the type of the short message is the short message type corresponding to the largest predicted quantized value.
Preferably, the device further includes:
a normalization module, configured to normalize the received short message;
the identification module being specifically configured to:
identify the preset feature words in the normalized short message.
Preferably, the reading module includes:
an obtaining unit, configured to obtain, by a text word-segmentation technique, the multi-character words among the remaining text of the short message other than the preset feature words;
a reading unit, configured to read the vectors of the obtained multi-character words, and the word vectors of the remaining single characters of the short message other than the preset feature words and the obtained multi-character words.
Preferably, the device further includes:
a classified saving module, configured to save the short message under the short message type to which it belongs.
Preferably, the device further includes:
an output module, configured to output at least one of the preset feature words.
An embodiment of the present invention also provides a computer storage medium that includes a set of instructions which, when executed, cause at least one processor to perform operations including:
identifying preset feature words in a received short message; replacing the preset feature words in the short message with feature symbols corresponding to the preset feature words;
determining a first classification model, wherein the short message types corresponding to the first classification model include at least one first short message type and a non-first short message type;
reading, from a high-frequency word vector library of the first classification model, the symbol vectors of the feature symbols and the word vectors of the remaining words in the short message other than the preset feature words;
performing, according to the first classification model, a weighted operation on the read symbol vectors and word vectors to obtain a first operation result; and determining, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type.
The above technical solutions of the embodiments of the present invention have the following beneficial effects:
With the short message classification method of the embodiments of the present invention, a preset classification model makes it possible to accurately determine the short message type to which a short message belongs, enabling intelligent management of short messages and making it easier for the user to query and organize them.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of the short message classification method according to an embodiment of the present invention.
FIG. 2 is a schematic structural diagram of the short message classification device according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to FIG. 1, an embodiment of the present invention provides a short message classification method, which includes the following steps:
Step 101: identify preset feature words in a received short message;
Step 102: replace the preset feature words in the short message with feature symbols corresponding to the preset feature words;
Step 103: determine a first classification model, wherein the short message types corresponding to the first classification model include at least one first short message type and a non-first short message type;
Step 104: read, from a high-frequency word vector library of the first classification model, the symbol vectors of the feature symbols and the word vectors of the remaining words in the short message other than the preset feature words;
Step 105: perform, according to the first classification model, a weighted operation on the read symbol vectors and word vectors to obtain a first operation result;
Step 106: determine, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type.
With the short message classification method of the embodiments of the present invention, a preset classification model makes it possible to accurately determine the short message type to which a short message belongs, enabling intelligent management of short messages and making it easier for the user to query and organize them.
The preset feature words may be email addresses, web addresses, dates, times, percentages, quantifiers, currency amounts, phone numbers, numbers, foreign words and so on, or may be user-defined vocabulary, including terms from specialized application fields, idioms, foods, locations, works, devices, personal names, place names, institution names and so on; the present invention does not limit them.
The feature symbols corresponding to the preset feature words are set in advance. For example, the feature symbol corresponding to a time may be DATE, the one corresponding to a currency amount may be CURRENCY, the one corresponding to a bank may be BANK, and so on.
It should be noted that feature symbols are preset and substituted for the feature words mainly because, during short message classification, the terminal only needs to know semantically which kinds of feature words are present in the short message; it does not care what the specific feature words are.
For example, the terminal receives the short message "Your personal credit card November bill RMB 4818.93, repayment due date November 23. [China Merchants Bank]". Identification yields the preset feature words "November", "RMB 4818.93", "November 23" and "China Merchants Bank". After replacement with the corresponding feature symbols, the short message becomes "Your personal credit card DATE bill CURRENCY, repayment due date DATE. [BANK]", which shows more clearly which feature words the short message contains. In other words, when analyzing this short message the terminal does not care about the specific amount, date or bank; it only needs to know that an amount of money, a date and a bank are present.
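The replacement in this example can be sketched with regular expressions applied to the original Chinese wording of the message. The patterns below are assumptions chosen only to cover this one message; the patent does not specify how feature words are recognized, and the names `FEATURE_PATTERNS` and `replace_feature_words` are hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical recognizers, one per feature symbol. Real recognizers would
# need to cover far more date, currency and institution formats.
FEATURE_PATTERNS = [
    ("CURRENCY", re.compile(r"人民币\d+(?:\.\d+)?")),
    ("DATE", re.compile(r"\d{1,2}月(?:\d{1,2}日)?")),
    ("BANK", re.compile(r"招商银行")),
]


def replace_feature_words(message: str) -> Tuple[str, List[str]]:
    """Return the message with feature words replaced, plus the words found."""
    found: List[str] = []
    for symbol, pattern in FEATURE_PATTERNS:
        found.extend(pattern.findall(message))
        message = pattern.sub(symbol, message)
    return message, found


if __name__ == "__main__":
    text = "您个人信用卡11月账单人民币4818.93,到期还款日11月23日。[招商银行]"
    replaced, words = replace_feature_words(text)
    print(replaced)  # 您个人信用卡DATE账单CURRENCY,到期还款日DATE。[BANK]
    print(words)
```

The list of found words is kept because the method may later output at least one of the preset feature words, for instance the amount or the due date.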
In the embodiment of the present invention, the first classification model is trained in advance, and the short message types corresponding to it include at least one first short message type and a non-first short message type. That is, according to the first classification model, the type of a short message received by the terminal can be determined to be a first short message type (namely one of the at least one first short message type) or the non-first short message type.
For example, the first classification model may be a single-class classifier whose corresponding short message types are a repayment reminder type and a non-repayment-reminder type. Alternatively, it may be a multi-class classifier whose corresponding types are a repayment reminder type, a consumption bill type, an account credit bill type, and an "other" type (that is, neither a repayment reminder nor a consumption bill nor an account credit bill).
In daily life there are roughly 3,500 commonly used Chinese characters and symbols, but far fewer of them (the high-frequency words) actually appear in short messages of any one type. A resource-constrained terminal therefore does not need every character and symbol to decide the short message type; it only needs to attend to the high-frequency words under the specific classification model. That is, when the classification model is trained on samples, only the word vectors of the high-frequency words are retained, and all low-frequency words are replaced by one uniform special symbol, so that the low-frequency words share a single word vector. This forms the high-frequency word vector library corresponding to the classification model.
Here, a word vector is a finite-dimensional vector of floating-point numbers that quantifies the semantics of a word. The dimension may be 4, 8, 12 and so on, depending on the sample size and the training model, and is usually a multiple of 4.
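A high-frequency word vector library of this kind can be modelled as a lookup table with one shared fallback entry. The sketch below assumes 4-dimensional vectors with made-up values; the class name, the `tokenize` helper, and the rule of treating each feature symbol as a single token are illustrative assumptions.

```python
import re
from typing import Dict, List, Sequence

Vector = List[float]


def tokenize(message: str, symbols: Sequence[str]) -> List[str]:
    """Split into single characters while keeping feature symbols whole."""
    alternatives = [re.escape(s) for s in sorted(symbols, key=len, reverse=True)]
    pattern = re.compile("|".join(alternatives + ["."]), re.DOTALL)
    return pattern.findall(message)


class HighFrequencyVectorLibrary:
    """Vectors for high-frequency tokens; every other token shares one vector."""

    def __init__(self, vectors: Dict[str, Vector], low_frequency_vector: Vector):
        self.vectors = vectors
        self.low_frequency_vector = low_frequency_vector

    def lookup(self, token: str) -> Vector:
        return self.vectors.get(token, self.low_frequency_vector)

    def read(self, tokens: Sequence[str]) -> List[Vector]:
        return [self.lookup(token) for token in tokens]


if __name__ == "__main__":
    library = HighFrequencyVectorLibrary(
        {"DATE": [1.0, 0.0, 0.0, 0.0], "账": [0.0, 1.0, 0.0, 0.0]},  # made-up values
        low_frequency_vector=[0.5, 0.5, 0.5, 0.5],
    )
    tokens = tokenize("DATE账单", ["DATE", "CURRENCY", "BANK"])
    print(tokens)
    print(library.read(tokens))
```

Because all rare tokens map to one entry, the table size is bounded by the number of high-frequency tokens, which suits a resource-constrained terminal.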
在对短信息分析过程中,要从第一分类模型的高频字字向量库中,读取所述特征符号的符号向量和所述短信息中除所述预设特征词之外的其余字的字向量,并依据读取的符号向量和字向量对短信息进行分析。In the process of analyzing the short message, the symbol vector of the feature symbol and the remaining words of the short message except the preset feature word are read from the high frequency word vector library of the first classification model. The word vector, and the short information is analyzed based on the read symbol vector and word vector.
Specifically, the first classification model is, for example, a model trained with a convolutional neural network that uses dynamic k-max pooling. The step of performing a weighting operation on the read word vectors according to the first classification model to obtain a first operation result specifically includes:
processing the read symbol vectors and word vectors according to the first classification model to obtain an information vector corresponding to the short message; that is, a convolution operation is performed on the symbol vectors and word vectors of the short message, and a vector that represents the semantics of the sentence is extracted;
determining, for each first short message type and for the non-first short message type, a weight coefficient vector corresponding to the information vector, where the information values in the information vector correspond one-to-one to the weight coefficients in the weight coefficient vector;
performing a weighting operation on the information vector with the weight coefficient vector determined for each short message type to obtain at least two predicted quantized values.
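The first of the three steps above, reducing the symbol vectors and word vectors to one information vector, might look as follows for a single convolution layer followed by k-max pooling. This is a simplified sketch and not the trained model of the embodiment: the kernel values, the fixed k, and all function names are hypothetical, and a dynamic variant would typically choose k as a function of the message length rather than fixing it.

```python
from typing import List


def convolve_1d(values: List[float], kernel: List[float]) -> List[float]:
    """Slide the kernel over the sequence (valid positions only).

    Written in the cross-correlation form that neural networks
    commonly call convolution. Returns [] if the sequence is
    shorter than the kernel.
    """
    width = len(kernel)
    return [
        sum(values[i + j] * kernel[j] for j in range(width))
        for i in range(len(values) - width + 1)
    ]


def k_max_pool(values: List[float], k: int) -> List[float]:
    """Keep the k largest values while preserving their original order."""
    if k >= len(values):
        return list(values)
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in sorted(top)]


def information_vector(vectors: List[List[float]], kernel: List[float], k: int) -> List[float]:
    """Convolve and pool each vector dimension, then concatenate the results."""
    info: List[float] = []
    for d in range(len(vectors[0])):
        column = [vector[d] for vector in vectors]
        info.extend(k_max_pool(convolve_1d(column, kernel), k))
    return info
```

Because pooling keeps a fixed number of values per dimension, messages of different lengths yield information vectors of the same length, which is what allows a fixed-length weight coefficient vector to be applied afterwards.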
It should be noted that a predicted quantized value may be a predicted probability or a score and is used to determine the type of the short message. In practice, to determine the type more accurately, a bias coefficient may be added to the sum produced by the weighting operation when the predicted quantized value is computed.
Further, the step of determining, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type specifically includes:
comparing the at least two predicted quantized values to obtain the largest of the at least two predicted quantized values;
determining that the type of the short message is the short message type corresponding to the largest predicted quantized value.
That is, when the weighting operation is performed on the information vector with the weight coefficient vector determined for each short message type, a predicted quantized value is computed for each short message type, and the short message type corresponding to the largest predicted quantized value is determined to be the type of the short message.
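The weighting operation and the comparison described above can be sketched as a weighted sum plus a bias coefficient per short message type, followed by selection of the largest value. The type labels, weights, and bias values below are invented for illustration; a real model would supply trained values.

```python
from typing import Dict, List, Tuple


def predicted_value(info: List[float], weights: List[float], bias: float = 0.0) -> float:
    """Weighted sum of the information vector plus a bias coefficient."""
    if len(info) != len(weights):
        raise ValueError("information values and weight coefficients must correspond one-to-one")
    return sum(x * w for x, w in zip(info, weights)) + bias


def classify(info: List[float], models: Dict[str, Tuple[List[float], float]]) -> Tuple[str, Dict[str, float]]:
    """Score every short message type and return the type with the largest value."""
    scores = {label: predicted_value(info, w, b) for label, (w, b) in models.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical weight coefficient vectors and bias coefficients.
MODELS = {
    "bank_bill": ([1.0, 2.0, 0.0], 0.25),
    "non_bank_bill": ([0.0, -1.0, 1.0], 0.0),
}

label, scores = classify([0.5, 1.0, -0.5], MODELS)
```

With these numbers the bank bill type scores 2.75 and the non-bank-bill type scores -1.5, so the message is assigned the bank bill type.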
In this embodiment of the present invention, after step 106, the method further includes:
if the type of the short message is the non-first short message type, determining a second classification model, where the short message types corresponding to the second classification model include at least one second short message type and a non-second short message type;
reading, from the high-frequency word vector library of the second classification model, the symbol vectors of the feature symbols and the word vectors of the remaining characters of the short message other than the preset feature words;
performing a weighting operation on the read symbol vectors and word vectors according to the second classification model to obtain a second operation result;
determining, according to the second operation result, that the type of the short message is the second short message type or the non-second short message type.
In other embodiments, if the first short message type is to be subdivided further, a short message determined to be of the first short message type may be input into a third classification model for further classification. For example, the first classification model may identify only whether a short message is of the bank bill type or the non-bank-bill type. A short message identified as a bank bill can then be subdivided by the third classification model, which can identify the consumption type, the deposit type, the repayment type, and other bank bill types.
That is, on a resource-constrained terminal, short messages may be classified step by step in a cascade: the first classification model, the second classification model, the third classification model, the fourth classification model, and so on are applied in turn to achieve a finer classification.
In the cascade, each classification model involved may be a single-purpose model, for example a bank bill classification model, a departure schedule reminder classification model for flights, trains, and the like, an advertisement message classification model, or a fraud message classification model, so as to meet different user needs.
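The cascade described above can be sketched as a list of single-purpose stages that are tried in order until one of them recognizes the message. The stage functions here are keyword checks that merely stand in for trained classification models; their names, the labels, and the `other` default are assumptions for this sketch.

```python
from typing import Callable, List, Optional

# A stage returns a type label, or None when the message falls into its "non-X" type.
Stage = Callable[[str], Optional[str]]


def bank_bill_stage(message: str) -> Optional[str]:
    # Stand-in for a trained bank bill classification model.
    return "bank_bill" if "[BANK]" in message else None


def advertisement_stage(message: str) -> Optional[str]:
    # Stand-in for a trained advertisement message classification model.
    return "advertisement" if "URL" in message else None


def classify_in_cascade(message: str, stages: List[Stage], default: str = "other") -> str:
    """Apply each model in turn; stop at the first one that recognizes the message."""
    for stage in stages:
        label = stage(message)
        if label is not None:
            return label
    return default


STAGES: List[Stage] = [bank_bill_stage, advertisement_stage]
```

Only one small model needs to be evaluated at a time, and later stages run only for messages that earlier stages rejected, which is the property that makes the cascade attractive on a resource-constrained terminal.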
In this embodiment of the present invention, before step 101, the method further includes:
performing normalization processing on the received short message;
and step 101 is specifically: identifying the preset feature words in the normalized short message.
A short message that has been normalized in this way is easier to analyze semantically in later steps.
Specifically, the normalization processing may include unifying the character encoding, converting Traditional Chinese to Simplified Chinese, converting between full-width and half-width characters, replacing non-standard expressions, removing redundant whitespace from the text, removing modal particles and special punctuation marks that do not help semantic analysis, and so on. The present invention does not limit this.
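Two of the normalization steps listed above, full-width to half-width conversion and removal of redundant whitespace, can be sketched with the Python standard library. Using Unicode NFKC normalization for the width conversion is an assumption of this sketch: NFKC also folds other compatibility characters, which may be broader than an implementation wants. Traditional-to-Simplified conversion and replacement of non-standard expressions need mapping tables and are left out.

```python
import re
import unicodedata


def normalize_message(text: str) -> str:
    """Fold full-width forms to half-width and collapse runs of whitespace."""
    # NFKC maps full-width letters, digits, and the ideographic space
    # to their ordinary half-width counterparts.
    text = unicodedata.normalize("NFKC", text)
    # Collapse any run of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_message("ＡＢＣ　１２３")` yields `"ABC 123"`.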
In this embodiment of the present invention, before the word vectors are read, the text of the short message may also be segmented using an existing text segmentation technique, that is, common words are separated out, which yields units with stronger semantic features. A single Chinese character often cannot express a meaning accurately, whereas a word composed of several characters expresses a specific meaning more precisely. For example, the meanings of the two characters "公" and "司" are completely different from that of the word "公司" (company). After segmentation, it is therefore enough to read the word vector of the whole word "公司", without reading the two separate word vectors of the characters "公" and "司". The processing and operations performed after the word vector of a segmented word is read are the same as those for the word vector of a single character.
Specifically, in this embodiment of the present invention, the step of reading the word vectors of the remaining characters of the short message other than the preset feature words includes:
obtaining, according to a text segmentation technique, the words contained in the remaining characters of the short message other than the preset feature words;
reading the word vectors of the obtained words and the word vectors of the remaining characters of the short message other than the preset feature words and the obtained words.
This improves the accuracy of the information vector subsequently derived for the short message.
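The segmentation step can be illustrated with forward maximum matching against a vocabulary of known words. The embodiment only refers to an existing text segmentation technique, so the choice of forward maximum matching, the vocabulary, and the `max_len` limit are assumptions of this sketch. It operates on a run of text that contains no feature symbols.

```python
from typing import List, Set


def segment(text: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching.

    At each position, take the longest vocabulary word that starts
    there; if none matches, emit the single character instead.
    """
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

With the vocabulary `{"公司"}`, the text "公司的账单" is split into the word "公司" followed by the single characters "的", "账", and "单"; a vector would then be read for each of these four tokens.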
In this embodiment of the present invention, after step 106, the method further includes:
saving the short message under the short message type to which it belongs.
Saving received short messages by category in this way makes it easier for the user to search and organize them.
In this embodiment of the present invention, after step 106, the method further includes:
outputting at least one of the preset feature words.
It should be noted that the output here may be shown on the terminal screen to prompt the user to check it, which guards against misclassifications and missed classifications, or it may be passed to other applications for their use.
For example, take the short message above after feature symbol replacement, "Your personal credit card DATE bill CURRENCY, payment due date DATE. [BANK]". When the message is identified as a credit card repayment reminder, the original text corresponding to DATE and CURRENCY, namely "November", "RMB 4818.93", and "November 23", may be output to the terminal screen to prompt the user to check it. The output information may also be stored in the terminal's calendar to create a reminder.
As another example, the terminal receives the short message "Your CCB card points have reached 10,000 and can be exchanged for 5% cash. Please log in to www.xxxx.com to redeem them; points not redeemed by the deadline will be cleared [xx Branch]". After feature symbol replacement, the short message becomes "Your CCB card points have reached CURRENCY and can be exchanged for PERCENT cash. Please log in to URL to redeem them; points not redeemed by the deadline will be cleared [BANK]". When the message is identified as spam, the original text corresponding to URL, "www.xxxx.com", may be output to prompt the user to confirm and check it, which guards against misclassifications and missed classifications.
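To make the two examples concrete, the following sketch replaces feature words with feature symbols while keeping the original text so that it can be output later. The regular expressions are hypothetical and cover only the URL and PERCENT symbols; the embodiment does not state how preset feature words are recognized.

```python
import re
from typing import List, Tuple

# Hypothetical recognizers. URL comes first so that digits inside a URL
# are not picked up by later patterns.
PATTERNS = [
    ("URL", re.compile(r"www\.[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?%")),
]


def replace_feature_words(text: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the text with feature symbols substituted, plus (symbol, original) pairs."""
    found: List[Tuple[str, str]] = []
    for symbol, pattern in PATTERNS:
        def keep(match: "re.Match[str]", symbol: str = symbol) -> str:
            found.append((symbol, match.group(0)))
            return symbol
        text = pattern.sub(keep, text)
    return text, found


replaced, originals = replace_feature_words(
    "Your points can be exchanged for 5% cash, please log in to www.xxxx.com to redeem them"
)
# If the message is later classified as spam, output the original URL text for review.
urls_to_show = [original for symbol, original in originals if symbol == "URL"]
```

Keeping the `(symbol, original)` pairs alongside the replaced text lets the classifier work on the generalized symbols while the terminal can still show the user the exact date, amount, or link that the message contained.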
Referring to FIG. 2, an embodiment of the present invention further provides a short message classification device corresponding to the short message classification method shown in FIG. 1. The device includes:
an identification module 21, configured to identify preset feature words in a received short message;
a replacement module 22, configured to replace the preset feature words in the short message with feature symbols corresponding to the preset feature words;
a first determining module 23, configured to determine a first classification model, where the short message types corresponding to the first classification model include at least one first short message type and a non-first short message type;
a first reading module 24, configured to read, from the high-frequency word vector library of the first classification model, the symbol vectors of the feature symbols and the word vectors of the remaining characters of the short message other than the preset feature words;
a first operation module 25, configured to perform a weighting operation on the read symbol vectors and word vectors according to the first classification model to obtain a first operation result;
a first judging module 26, configured to determine, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type.
The short message classification device of this embodiment uses preset classification models to determine accurately the short message type to which a short message belongs, which enables intelligent management of short messages and makes it easier for the user to search and organize them.
Specifically, the device further includes:
a second determining module, configured to determine a second classification model when the type of the short message is the non-first short message type, where the short message types corresponding to the second classification model include at least one second short message type and a non-second short message type;
a second reading module, configured to read, from the high-frequency word vector library of the second classification model, the symbol vectors of the feature symbols and the word vectors of the remaining characters of the short message other than the preset feature words;
a second operation module, configured to perform a weighting operation on the read symbol vectors and word vectors according to the second classification model to obtain a second operation result;
a second judging module, configured to determine, according to the second operation result, that the type of the short message is the second short message type or the non-second short message type.
In this embodiment of the present invention, the first operation module includes:
a processing unit, configured to process the read symbol vectors and word vectors according to the first classification model to obtain an information vector corresponding to the short message;
a determining unit, configured to determine, for each first short message type and for the non-first short message type, a weight coefficient vector corresponding to the information vector, where the information values in the information vector correspond one-to-one to the weight coefficients in the weight coefficient vector;
an operation unit, configured to perform a weighting operation on the information vector with the weight coefficient vector determined for each short message type to obtain at least two predicted quantized values.
Further, the first judging module includes:
a comparing unit, configured to compare the at least two predicted quantized values to obtain the largest of the at least two predicted quantized values;
a judging unit, configured to determine that the type of the short message is the short message type corresponding to the largest predicted quantized value.
In this embodiment of the present invention, the device further includes:
a normalization processing module, configured to perform normalization processing on the received short message;
the identification module is specifically configured to identify the preset feature words in the normalized short message.
In this embodiment of the present invention, the reading module includes:
an obtaining unit, configured to obtain, according to a text segmentation technique, the words contained in the remaining characters of the short message other than the preset feature words;
a reading unit, configured to read the word vectors of the obtained words and the word vectors of the remaining characters of the short message other than the preset feature words and the obtained words.
In this embodiment of the present invention, the device further includes:
a classified saving module, configured to save the short message under the short message type to which it belongs.
In this embodiment of the present invention, the device further includes:
an output module, configured to output at least one of the preset feature words.
The above describes only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.
An embodiment of the present invention further provides a computer storage medium. The storage medium includes a set of instructions that, when executed, cause at least one processor to perform operations including:
identifying preset feature words in a received short message; replacing the preset feature words in the short message with feature symbols corresponding to the preset feature words;
determining a first classification model, where the short message types corresponding to the first classification model include at least one first short message type and a non-first short message type;
reading, from the high-frequency word vector library of the first classification model, the symbol vectors of the feature symbols and the word vectors of the remaining characters of the short message other than the preset feature words;
performing a weighting operation on the read symbol vectors and word vectors according to the first classification model to obtain a first operation result; and determining, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are merely preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention.
Industrial Applicability
The embodiments of the present invention disclose a short message classification method, a short message classification device, and a computer storage medium. Preset feature words in a received short message are identified and replaced with feature symbols corresponding to the preset feature words; a first classification model is determined; a first operation result is obtained according to the first classification model; and the type of the short message is determined according to the first operation result. With preset classification models, the solution of the present invention can determine accurately the short message type to which a short message belongs, which enables intelligent management of short messages and makes it easier for the user to search and organize them.

Claims (17)

  1. A short message classification method, comprising:
    identifying preset feature words in a received short message;
    replacing the preset feature words in the short message with feature symbols corresponding to the preset feature words;
    determining a first classification model, wherein the short message types corresponding to the first classification model comprise at least one first short message type and a non-first short message type;
    reading, from a high-frequency word vector library of the first classification model, symbol vectors of the feature symbols and word vectors of the remaining characters of the short message other than the preset feature words;
    performing a weighting operation on the read symbol vectors and word vectors according to the first classification model to obtain a first operation result;
    determining, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type.
  2. The method according to claim 1, wherein the method further comprises:
    if the type of the short message is the non-first short message type, determining a second classification model, wherein the short message types corresponding to the second classification model comprise at least one second short message type and a non-second short message type;
    reading, from a high-frequency word vector library of the second classification model, the symbol vectors of the feature symbols and the word vectors of the remaining characters of the short message other than the preset feature words;
    performing a weighting operation on the read symbol vectors and word vectors according to the second classification model to obtain a second operation result;
    determining, according to the second operation result, that the type of the short message is the second short message type or the non-second short message type.
  3. The method according to claim 1, wherein the step of performing a weighting operation on the read symbol vectors and word vectors according to the first classification model to obtain a first operation result comprises:
    processing the read symbol vectors and word vectors according to the first classification model to obtain an information vector corresponding to the short message;
    determining, for each first short message type and for the non-first short message type, a weight coefficient vector corresponding to the information vector, wherein the information values in the information vector correspond one-to-one to the weight coefficients in the weight coefficient vector;
    performing a weighting operation on the information vector with the weight coefficient vector determined for each short message type to obtain at least two predicted quantized values.
  4. The method according to claim 3, wherein the step of determining, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type comprises:
    comparing the at least two predicted quantized values to obtain the largest of the at least two predicted quantized values;
    determining that the type of the short message is the short message type corresponding to the largest predicted quantized value.
  5. The method according to claim 1, wherein before the step of identifying preset feature words in a received short message, the method further comprises:
    performing normalization processing on the received short message;
    and the step of identifying preset feature words in a received short message comprises:
    identifying the preset feature words in the normalized short message.
  6. The method according to claim 1, wherein the step of reading the word vectors of the remaining characters of the short message other than the preset feature words comprises:
    obtaining, according to a text segmentation technique, the words contained in the remaining characters of the short message other than the preset feature words;
    reading the word vectors of the obtained words and the word vectors of the remaining characters of the short message other than the preset feature words and the obtained words.
  7. The method according to claim 1, wherein after the step of determining, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type, the method further comprises:
    saving the short message under the short message type to which it belongs.
  8. The method according to claim 1, wherein after the step of determining, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type, the method further comprises:
    outputting at least one of the preset feature words.
  9. A short message classification device, comprising:
    an identification module, configured to identify preset feature words in a received short message;
    a replacement module, configured to replace the preset feature words in the short message with feature symbols corresponding to the preset feature words;
    a first determining module, configured to determine a first classification model, wherein the short message types corresponding to the first classification model comprise at least one first short message type and a non-first short message type;
    a first reading module, configured to read, from a high-frequency word vector library of the first classification model, symbol vectors of the feature symbols and word vectors of the remaining characters of the short message other than the preset feature words;
    a first operation module, configured to perform a weighting operation on the read symbol vectors and word vectors according to the first classification model to obtain a first operation result;
    a first judging module, configured to determine, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type.
  10. The device according to claim 9, wherein the device further comprises:
    a second determining module, configured to determine a second classification model when the type of the short message is the non-first short message type, wherein the short message types corresponding to the second classification model comprise at least one second short message type and a non-second short message type;
    a second reading module, configured to read, from a high-frequency word vector library of the second classification model, the symbol vectors of the feature symbols and the word vectors of the remaining characters of the short message other than the preset feature words;
    a second operation module, configured to perform a weighting operation on the read symbol vectors and word vectors according to the second classification model to obtain a second operation result;
    a second judging module, configured to determine, according to the second operation result, that the type of the short message is the second short message type or the non-second short message type.
  11. The device according to claim 9, wherein the first operation module comprises:
    a processing unit, configured to process the read symbol vectors and word vectors according to the first classification model to obtain an information vector corresponding to the short message;
    a determining unit, configured to determine, for each first short message type and for the non-first short message type, a weight coefficient vector corresponding to the information vector, wherein the information values in the information vector correspond one-to-one to the weight coefficients in the weight coefficient vector;
    an operation unit, configured to perform a weighting operation on the information vector with the weight coefficient vector determined for each short message type to obtain at least two predicted quantized values.
  12. The device according to claim 11, wherein the first judging module comprises:
    a comparing unit, configured to compare the at least two predicted quantized values to obtain the largest of the at least two predicted quantized values;
    a judging unit, configured to determine that the type of the short message is the short message type corresponding to the largest predicted quantized value.
  13. The apparatus according to claim 9, wherein the apparatus further comprises:
    a normalization module, configured to normalize the received short message;
    the identification module being configured to identify the preset feature words in the normalized short message.
  14. The apparatus according to claim 9, wherein the reading module comprises:
    an acquiring unit, configured to obtain, by a text word segmentation technique, the words among the remaining characters of the short message other than the preset feature words;
    a reading unit, configured to read the word vectors of the obtained words and the character vectors of the remaining characters of the short message other than the preset feature words and the obtained words.
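The split between words and leftover characters in claim 14 can be sketched as follows. The claim does not name a particular segmentation technique, so the greedy forward longest-match segmenter used here is an assumption, as are the function names, the vocabulary, and the dictionary-based vector tables.

```python
from typing import Dict, List, Set, Tuple


def split_words_and_chars(text: str, word_vocab: Set[str],
                          max_len: int = 4) -> Tuple[List[str], List[str]]:
    """Greedy forward longest-match segmentation (an assumed technique).

    Returns the multi-character words found in word_vocab and the
    characters that were not covered by any word.
    """
    words, chars = [], []
    i = 0
    while i < len(text):
        # Try the longest candidate first; only candidates of 2+ characters count as words.
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in word_vocab:
                words.append(text[i:i + n])
                i += n
                break
        else:
            chars.append(text[i])
            i += 1
    return words, chars


def read_vectors(words: List[str], chars: List[str],
                 word_vectors: Dict[str, List[float]],
                 char_vectors: Dict[str, List[float]]):
    """Look up word vectors for the words and character vectors for the rest.

    Entries missing from the (hypothetical) vector tables are skipped.
    """
    found_words = [word_vectors[w] for w in words if w in word_vectors]
    found_chars = [char_vectors[c] for c in chars if c in char_vectors]
    return found_words, found_chars
```

For the text `"请查收您的账单"` and the hypothetical vocabulary `{"查收", "账单"}`, the segmenter yields the words `["查收", "账单"]` and the leftover characters `["请", "您", "的"]`. The text passed in is assumed to be what remains after the preset feature words have been handled separately.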
  15. The apparatus according to claim 9, wherein the apparatus further comprises:
    a classified saving module, configured to save the short message under the short message type to which it belongs.
  16. The apparatus according to claim 9, wherein the apparatus further comprises:
    an output module, configured to output at least one of the preset feature words.
  17. A computer storage medium comprising a set of instructions that, when executed, cause at least one processor to perform operations comprising:
    identifying preset feature words in a received short message;
    replacing the preset feature words in the short message with feature symbols corresponding to the preset feature words;
    determining a first classification model, wherein the short message types corresponding to the first classification model include at least one first short message type and a non-first short message type;
    reading, from a high-frequency character vector library of the first classification model, the symbol vectors of the feature symbols and the character vectors of the remaining characters of the short message other than the preset feature words;
    performing a weighted operation on the read symbol vectors and character vectors according to the first classification model, to obtain a first operation result;
    determining, according to the first operation result, that the type of the short message is the first short message type or the non-first short message type.
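The first two operations of claim 17, identifying preset feature words and replacing them with feature symbols, can be sketched with regular expressions. The patterns and the symbols `<URL>` and `<NUM>` are hypothetical examples chosen for illustration. The claim does not say which feature words are preset or how they are matched.

```python
import re
from typing import List, Tuple

# Hypothetical preset feature words and their feature symbols.
# URLs are handled before digit runs so that digits inside a URL are not split off.
FEATURE_PATTERNS = [
    (re.compile(r"https?://[A-Za-z0-9./_-]+"), "<URL>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def replace_feature_words(message: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Replace each matched feature word with its feature symbol.

    Returns the rewritten message and the (feature word, symbol) pairs
    that were replaced, in the order the patterns were applied.
    """
    found: List[Tuple[str, str]] = []
    for pattern, symbol in FEATURE_PATTERNS:
        def _sub(match, symbol=symbol):
            found.append((match.group(0), symbol))
            return symbol
        message = pattern.sub(_sub, message)
    return message, found
```

Each feature symbol would then be looked up in the vector library like any other entry, so that all URLs, for example, share a single symbol vector regardless of their concrete value. The replaced feature words are kept in `found` so that they remain available for later use.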
PCT/CN2016/105378 2016-08-11 2016-11-10 Method and device for classifying short message and computer storage medium WO2018028065A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610659527.4 2016-08-11
CN201610659527.4A CN107734131B (en) 2016-08-11 2016-08-11 Short message classification method and device

Publications (1)

Publication Number Publication Date
WO2018028065A1 true WO2018028065A1 (en) 2018-02-15

Family

ID=61161749

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/105378 WO2018028065A1 (en) 2016-08-11 2016-11-10 Method and device for classifying short message and computer storage medium

Country Status (2)

Country Link
CN (1) CN107734131B (en)
WO (1) WO2018028065A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929025B (en) * 2018-09-17 2023-04-25 阿里巴巴集团控股有限公司 Junk text recognition method and device, computing equipment and readable storage medium
CN110913354A (en) * 2018-09-17 2020-03-24 阿里巴巴集团控股有限公司 Short message classification method and device and electronic equipment
CN111209751B (en) * 2020-02-14 2023-07-28 全球能源互联网研究院有限公司 Chinese word segmentation method, device and storage medium
CN116468037A (en) * 2023-03-17 2023-07-21 北京深维智讯科技有限公司 NLP-based data processing method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013061757A (en) * 2011-09-13 2013-04-04 Hitachi Solutions Ltd Document sorting method
JP2013120534A (en) * 2011-12-08 2013-06-17 Mitsubishi Electric Corp Related word classification device, computer program, and method for classifying related word
CN103778226A (en) * 2014-01-23 2014-05-07 北京奇虎科技有限公司 Method for establishing language information recognition model and language information recognition device
CN104834747A (en) * 2015-05-25 2015-08-12 中国科学院自动化研究所 Short text classification method based on convolution neutral network
CN104978354A (en) * 2014-04-10 2015-10-14 中电长城网际系统应用有限公司 Text classification method and text classification device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103024746B (en) * 2012-12-30 2015-06-17 清华大学 System and method for processing spam short messages for telecommunication operator
CN105447750B (en) * 2015-11-17 2022-06-03 小米科技有限责任公司 Information identification method and device, terminal and server
CN105488025B (en) * 2015-11-24 2019-02-12 小米科技有限责任公司 Template construction method and device, information identifying method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241269A (en) * 2018-11-09 2020-06-05 中移(杭州)信息技术有限公司 Short message text classification method and device, electronic equipment and storage medium
CN111241269B (en) * 2018-11-09 2024-02-23 中移(杭州)信息技术有限公司 Short message text classification method and device, electronic equipment and storage medium
CN113657106A (en) * 2021-07-05 2021-11-16 西安理工大学 Feature selection method based on normalized word frequency weight

Also Published As

Publication number Publication date
CN107734131A (en) 2018-02-23
CN107734131B (en) 2021-02-12

Similar Documents

Publication Publication Date Title
WO2018028065A1 (en) Method and device for classifying short message and computer storage medium
US20230222366A1 (en) Systems and methods for semantic analysis based on knowledge graph
US10447635B2 (en) Filtering electronic messages
CN103064987A (en) Bogus transaction information identification method
WO2017173093A1 (en) Method and device for identifying spam mail
CN110765101A (en) Label generation method and device, computer readable storage medium and server
WO2018028164A1 (en) Text information extracting method, device and mobile terminal
US8620918B1 (en) Contextual text interpretation
CN113392920B (en) Method, apparatus, device, medium, and program product for generating cheating prediction model
CN112818692B (en) Named entity recognition and processing method, named entity recognition and processing device, named entity recognition and processing equipment and readable storage medium
CN110972086A (en) Short message processing method and device, electronic equipment and computer readable storage medium
CN111259207A (en) Short message identification method, device and equipment
CN109101487A (en) Conversational character differentiating method, device, terminal device and storage medium
CN114741501A (en) Public opinion early warning method and device, readable storage medium and electronic equipment
US11681966B2 (en) Systems and methods for enhanced risk identification based on textual analysis
CN112699949B (en) Potential user identification method and device based on social platform data
CN115687754A (en) Active network information mining method based on intelligent conversation
CN113051396B (en) Classification recognition method and device for documents and electronic equipment
CN113472686A (en) Information identification method, device, equipment and storage medium
CN110610213A (en) Mail classification method, device, equipment and computer readable storage medium
KR102451168B1 (en) Method and program for providing fraud information
CN116886817A (en) Business operation reminding method, device, equipment, medium and product
Minhas et al. Linguistic correlates of deception in financial text a corpus linguistics based approach
CN113901817A (en) Document classification method and device, computer equipment and storage medium
CN117633226A (en) Classification method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16912519

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16912519

Country of ref document: EP

Kind code of ref document: A1