WO2022121164A1 - Suspension-causing sensitive word prediction method and apparatus, and computer device and storage medium

Publication number
WO2022121164A1
Authority
WO
WIPO (PCT)
Prior art keywords
sensitive word
sensitive
preset
prediction
short message
Application number
PCT/CN2021/083489
Other languages
French (fr)
Chinese (zh)
Inventor
程华东
侯翠琴
李剑锋
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/274: Converting codes to words; Guess-ahead of partial word inputs

Abstract

The present application discloses a suspension-causing sensitive word prediction method and apparatus, and a computer device and a storage medium, and mainly aims to improve the screening efficiency and accuracy of suspension-causing sensitive words and reduce the workload of service personnel. The method comprises: obtaining public sensitive words to be predicted; respectively inputting said public sensitive words into different types of preset sensitive word prediction models to perform suspension-causing sensitive word prediction to obtain prediction results output by the different types of preset sensitive word prediction models; and determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether said public sensitive words are suspension-causing sensitive words. The present application is mainly suitable for prediction of the suspension-causing sensitive words.

Description

Suspension-causing sensitive word prediction method and apparatus, computer device, and storage medium
This application claims priority to Chinese patent application No. 202011434908.5, filed with the Chinese Patent Office on December 10, 2020 and entitled "Suspension-causing sensitive word prediction method and apparatus, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a suspension-causing sensitive word prediction method and apparatus, a computer device, and a storage medium.
Background
Carriers usually maintain their own lists of suspension-causing sensitive words. When a short message sent by a user contains such a word, the sending number is suspended, and the user has to visit a carrier service outlet to have the suspension lifted before the number can be used again, which is very inconvenient. Companies that send short messages therefore need to maintain their own sensitive word lists, keep them as close as possible to the carrier's list of suspension-causing sensitive words, and use them to raise an early warning on short messages sent from within the company, so that the company's numbers are not suspended.
The inventors realized that, at present, when a company maintains its own sensitive word list, business personnel usually screen suspension-causing sensitive words out of a public sensitive word lexicon by hand, based on short message data from past suspensions. This manual screening is strongly affected by subjective judgment and is likely to miss suspension-causing sensitive words or to select wrong ones. As a result, screening efficiency and accuracy are low, and the workload of business personnel increases considerably.
Summary of the Invention
The present application provides a suspension-causing sensitive word prediction method and apparatus, a computer device, and a storage medium, whose main purpose is to improve the efficiency and accuracy of screening suspension-causing sensitive words and to reduce the workload of business personnel.
According to a first aspect of the present application, a suspension-causing sensitive word prediction method is provided, comprising:
obtaining a public sensitive word to be predicted;
inputting the public sensitive word into different types of preset sensitive word prediction models respectively to perform suspension-causing sensitive word prediction, and obtaining the prediction results output by the different types of preset sensitive word prediction models; and
determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive word is a suspension-causing sensitive word.
According to a second aspect of the present application, a suspension-causing sensitive word prediction apparatus is provided, comprising:
an obtaining unit, configured to obtain a public sensitive word to be predicted;
a prediction unit, configured to input the public sensitive word into different types of preset sensitive word prediction models respectively to perform suspension-causing sensitive word prediction, and to obtain the prediction results output by the different types of preset sensitive word prediction models; and
a determination unit, configured to determine, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive word is a suspension-causing sensitive word.
According to a third aspect of the present application, a computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the following steps:
obtaining a public sensitive word to be predicted;
inputting the public sensitive word into different types of preset sensitive word prediction models respectively to perform suspension-causing sensitive word prediction, and obtaining the prediction results output by the different types of preset sensitive word prediction models; and
determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive word is a suspension-causing sensitive word.
According to a fourth aspect of the present application, a computer device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the following steps:
obtaining a public sensitive word to be predicted;
inputting the public sensitive word into different types of preset sensitive word prediction models respectively to perform suspension-causing sensitive word prediction, and obtaining the prediction results output by the different types of preset sensitive word prediction models; and
determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive word is a suspension-causing sensitive word.
The present application screens suspension-causing sensitive words out of the public sensitive word lexicon automatically, which improves screening efficiency while keeping the screening results accurate. In addition, building different types of preset sensitive word prediction models further improves the accuracy of the prediction results and the reliability of the screening results, while reducing the workload of business personnel and lowering labor costs.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of a suspension-causing sensitive word prediction method provided by an embodiment of the present application;
FIG. 2 is a flowchart of another suspension-causing sensitive word prediction method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a suspension-causing sensitive word prediction apparatus provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of another suspension-causing sensitive word prediction apparatus provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of the physical structure of a computer device provided by an embodiment of the present application.
Detailed Description
The present application is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that the embodiments of the present application and the features of the embodiments may be combined with one another provided they do not conflict.
The technical solution of the present application relates to the field of artificial intelligence, and may specifically involve machine learning, in order to predict suspension-causing sensitive words.
At present, when a company maintains its own sensitive word list, business personnel usually screen suspension-causing sensitive words out of a public sensitive word lexicon by hand, based on short message data from past suspensions. This manual screening is strongly affected by subjective judgment and is likely to miss suspension-causing sensitive words or to select wrong ones. As a result, screening efficiency and accuracy are low, and the workload of business personnel increases considerably.
To solve the above problem, an embodiment of the present application provides a suspension-causing sensitive word prediction method. As shown in FIG. 1, the method includes:
101. Obtain a public sensitive word to be predicted.
The public sensitive word to be predicted is a sensitive word in a public sensitive word lexicon, such as "loan", "bank", "system", "selling kidneys", or "selling blood". The public sensitive word lexicon records a large number of public sensitive words, and its vocabulary can reach several hundred thousand entries. If a company used this lexicon directly for short message early warning, a large number of short messages would be intercepted and could not be sent. The suspension-causing sensitive words therefore have to be screened out of the public sensitive word lexicon, so as to obtain a lexicon that is the same as or close to the carrier's lexicon of suspension-causing sensitive words. To overcome the drawbacks of manual selection in the prior art, the embodiment of the present application builds preset sensitive word prediction models and uses them to predict each sensitive word in the public sensitive word lexicon, so that suspension-causing sensitive words are selected from the lexicon automatically. The method of this embodiment is executed by an apparatus or device capable of predicting public sensitive words, which may be deployed on the client side or on the server side.
102. Input the public sensitive word into different types of preset sensitive word prediction models respectively to perform suspension-causing sensitive word prediction, and obtain the prediction results output by the different types of preset sensitive word prediction models.
The different types of preset sensitive word prediction models include a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model, and a preset nearest-neighbor classification sensitive word prediction model. It should be noted that the different types of preset sensitive word prediction models in the embodiment of the present application are not limited to these. The preset support vector machine sensitive word prediction model has the following form:
$y = g_{w,b}(w^{T}x + b)$
Here, $(x, y)$ is a training sample. White sensitive word samples and black sensitive word samples are used as training samples to train an initial support vector machine model and build the preset support vector machine sensitive word prediction model. Specifically, the optimization goal of the preset support vector machine sensitive word prediction model is to maximize the minimum geometric margin between all white and black sensitive word samples and the separating hyperplane. The objective function is then:
[Formula image PCTCN2021083489-appb-000001]
The parameters $w$ and $b$ of the initial support vector machine model are optimized iteratively against this objective function, and the preset support vector machine sensitive word prediction model is finally obtained by training.
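The objective function itself survives only as a formula image in this text. As a hedged reconstruction, one standard way to write "maximize the minimum geometric margin to the separating hyperplane" is shown below. It assumes labels $y_i \in \{-1, +1\}$ (for example, white = $-1$, black = $+1$) and $N$ training samples $(x_i, y_i)$; these conventions are assumptions, and the formula in the original figure may use a different but equivalent form.

```latex
\max_{w,\,b} \; \min_{i = 1, \dots, N} \; \frac{y_i \left( w^{T} x_i + b \right)}{\lVert w \rVert}
```

Fixing the scale so that the closest samples satisfy $y_i(w^{T}x_i + b) = 1$ turns this into the familiar equivalent problem of minimizing $\tfrac{1}{2}\lVert w \rVert^{2}$ subject to $y_i(w^{T}x_i + b) \ge 1$ for all $i$.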
Further, the preset gradient boosting tree sensitive word prediction model has the following form:
[Formula image PCTCN2021083489-appb-000002]
Here, $T$ denotes a decision tree, $M$ is the number of decision trees, $\theta$ denotes the parameters of a decision tree, and $x$ is a white or black sensitive word sample. The boosting tree uses the forward stagewise algorithm: first set $f_0(x) = 0$; the model at step $m$ is then:
$f_m(x) = f_{m-1}(x) + T(x, \theta_m)$
The parameters $\theta$ of each decision tree are determined by empirical risk minimization, which gives the following objective function:
[Formula image PCTCN2021083489-appb-000003]
With the white sensitive word samples and the black sensitive word samples as the training set, the parameters of the initial gradient boosting tree model are optimized iteratively against this objective function, and the preset gradient boosting tree sensitive word prediction model is finally obtained.
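The forward stagewise procedure can be illustrated with a small, self-contained sketch. It is not the applicant's implementation. It assumes that each sensitive word has already been converted to a numeric feature vector (the text here does not say how), that black samples are labelled 1 and white samples 0, that the loss is squared error (so each new tree is fitted to the current residuals), and that each "tree" is a one-split decision stump. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# A stump is (feature index, threshold, value if x[j] <= threshold, value otherwise).
Stump = Tuple[int, float, float, float]


def fit_stump(xs: Sequence[Sequence[float]], residuals: Sequence[float]) -> Stump:
    """Pick the single split that minimises squared error on the residuals."""
    best: Stump = (0, 0.0, 0.0, 0.0)
    best_loss = float("inf")
    for j in range(len(xs[0])):
        for threshold in sorted({x[j] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[j] <= threshold]
            right = [r for x, r in zip(xs, residuals) if x[j] > threshold]
            left_value = sum(left) / len(left)  # never empty: threshold is a data value
            right_value = sum(right) / len(right) if right else 0.0
            loss = sum((r - left_value) ** 2 for r in left)
            loss += sum((r - right_value) ** 2 for r in right)
            if loss < best_loss:
                best_loss = loss
                best = (j, threshold, left_value, right_value)
    return best


def stump_predict(stump: Stump, x: Sequence[float]) -> float:
    j, threshold, left_value, right_value = stump
    return left_value if x[j] <= threshold else right_value


def fit_boosted_stumps(xs: Sequence[Sequence[float]], ys: Sequence[float],
                       n_trees: int) -> List[Stump]:
    """Forward stagewise fitting: f_0(x) = 0, f_m(x) = f_{m-1}(x) + T(x, theta_m)."""
    stumps: List[Stump] = []
    current = [0.0] * len(xs)  # f_0(x) = 0 on every training sample
    for _ in range(n_trees):
        residuals = [y - f for y, f in zip(ys, current)]
        stump = fit_stump(xs, residuals)  # theta_m from empirical risk minimisation
        stumps.append(stump)
        current = [f + stump_predict(stump, x) for f, x in zip(current, xs)]
    return stumps


def boosted_predict(stumps: Sequence[Stump], x: Sequence[float]) -> float:
    """f_M(x): the sum of all fitted stumps."""
    return sum(stump_predict(s, x) for s in stumps)


if __name__ == "__main__":
    # Toy one-dimensional features; labels: 0 = white sample, 1 = black sample.
    xs = [(0.0,), (1.0,), (2.0,), (3.0,)]
    ys = [0.0, 0.0, 1.0, 1.0]
    model = fit_boosted_stumps(xs, ys, n_trees=3)
    print(boosted_predict(model, (2.5,)) > 0.5)  # scores above 0.5 are read as "black"
```

Thresholding the summed score at 0.5 is one simple way to turn the additive model into a yes/no prediction; a production system would more likely use a library implementation with a classification loss.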
Further, for the preset nearest-neighbor classification sensitive word prediction algorithm: the sensitive words in the white sensitive word samples are not suspension-causing sensitive words, whereas the sensitive words in the black sensitive word samples are. The Euclidean distance between the public sensitive word to be predicted and the white sensitive word samples, and the Euclidean distance between it and the black sensitive word samples, can therefore be calculated separately. If the distance to the white sensitive word samples is smaller than the distance to the black sensitive word samples, the public sensitive word is considered to belong to the same class as the white sensitive word samples, that is, it is not a suspension-causing sensitive word. If the distance to the white sensitive word samples is greater than the distance to the black sensitive word samples, the public sensitive word is considered to belong to the same class as the black sensitive word samples, that is, it is a suspension-causing sensitive word. The Euclidean distance between the public sensitive word and a white or black sensitive word sample is calculated as follows:
[Formula image PCTCN2021083489-appb-000004]
Here, $(X_1, X_2, \ldots, X_n)$ is the public sensitive word to be predicted, $(x_1, x_2, \ldots, x_n)$ is a white or black sensitive word sample, and $d$ is the Euclidean distance between the public sensitive word and any one white or black sensitive word sample. The distances between the public sensitive word and each white sensitive word sample are summed, and the distances between the public sensitive word and each black sensitive word sample are summed. The two sums are compared to decide whether the public sensitive word belongs with the white or with the black sensitive word samples, and this decision determines whether the public sensitive word is a suspension-causing sensitive word.
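The comparison of summed distances can be sketched as follows, using the standard Euclidean distance $d = \sqrt{\sum_{i=1}^{n} (X_i - x_i)^2}$. The sketch assumes that words are already represented as equal-length numeric vectors, and it treats a tie as "not suspension-causing"; the text does not cover the tie case, so that choice is an assumption. The function names are hypothetical.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Standard Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def is_closer_to_black(word: Sequence[float],
                       white_samples: Sequence[Sequence[float]],
                       black_samples: Sequence[Sequence[float]]) -> bool:
    """Return True when the summed distance to the black samples is smaller.

    True is read as "suspension-causing sensitive word". A tie returns False,
    which is an assumption rather than something stated in the text.
    """
    d_white = sum(euclidean(word, s) for s in white_samples)
    d_black = sum(euclidean(word, s) for s in black_samples)
    return d_black < d_white


if __name__ == "__main__":
    word = (0.0, 0.0)
    white = [(3.0, 4.0)]  # summed distance 5.0
    black = [(1.0, 0.0)]  # summed distance 1.0
    print(is_closer_to_black(word, white, black))
```

Because the distances are summed rather than averaged, a class with many more samples tends to produce a larger total. Balancing the two sample sets, or comparing mean distances, would be one way to avoid that bias.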
In the embodiment of the present application, in order to screen the suspension-causing sensitive words in the public sensitive word lexicon automatically while keeping the screening results reliable, each sensitive word to be predicted in the public sensitive word lexicon is input into the different types of preset sensitive word prediction models to obtain the prediction result output by each model. Specifically, the sensitive word to be predicted is input into the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model, and the preset nearest-neighbor classification sensitive word prediction model, and the prediction result of each of the three models is obtained. These results are then used to determine whether the public sensitive word to be predicted is a suspension-causing sensitive word, so that suspension-causing sensitive words are screened out of the public sensitive word lexicon automatically.
103. Determine, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive word is a suspension-causing sensitive word.
If a suspension-causing sensitive word such as "selling blood" or "selling kidneys" appears in a short message sent by a user, the number that sent the message is suspended. Each prediction result is either that the public sensitive word is a suspension-causing sensitive word or that it is not. To keep the screening results accurate, the embodiment of the present application considers the prediction results of all the different types of preset sensitive word prediction models together when making the final determination. Specifically, if every type of preset sensitive word prediction model outputs that the public sensitive word is a suspension-causing sensitive word, the public sensitive word is finally determined to be a suspension-causing sensitive word. If any one type of preset sensitive word prediction model outputs that the public sensitive word is not a suspension-causing sensitive word, the public sensitive word is finally determined not to be one.
For example, if the preset support vector machine, preset gradient boosting tree, and preset nearest-neighbor classification sensitive word prediction models all output that the public sensitive word is a suspension-causing sensitive word, it is finally determined to be one. If the support vector machine model outputs that the public sensitive word is not a suspension-causing sensitive word while the gradient boosting tree and nearest-neighbor classification models both output that it is, the public sensitive word is still finally considered not to be a suspension-causing sensitive word. Considering the prediction results of the different types of models together in this way further improves the prediction precision and keeps the screening results for the public sensitive word lexicon accurate.
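The decision rule is a unanimous vote: a word is accepted only if every model flags it. A minimal sketch, with hypothetical model names as dictionary keys:

```python
from typing import Mapping


def is_suspension_causing(predictions: Mapping[str, bool]) -> bool:
    """Unanimous vote: True only if every model predicts 'suspension-causing'."""
    if not predictions:
        # all() of an empty collection is True, so reject empty input explicitly.
        raise ValueError("at least one model prediction is required")
    return all(predictions.values())


if __name__ == "__main__":
    # The keys are illustrative labels, not identifiers from the application.
    print(is_suspension_causing({"svm": True, "gbdt": True, "nearest": True}))
    print(is_suspension_causing({"svm": False, "gbdt": True, "nearest": True}))
```

Requiring unanimity trades recall for precision: fewer harmless words are flagged, at the cost of missing words on which the models disagree.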
Compared with the current practice of screening suspension-causing sensitive words by hand, the suspension-causing sensitive word prediction method provided by the embodiment of the present application obtains a public sensitive word to be predicted; inputs it into different types of preset sensitive word prediction models respectively to obtain the prediction results output by those models; and determines from those prediction results whether the public sensitive word is a suspension-causing sensitive word. By building preset sensitive word prediction models and using them to predict public sensitive words, the method screens suspension-causing sensitive words out of the public sensitive word lexicon automatically, which improves screening efficiency while keeping the screening results accurate. In addition, building different types of preset sensitive word prediction models further improves the accuracy of the prediction results and the reliability of the screening results, while reducing the workload of business personnel and lowering labor costs.
Further, to better explain the above prediction process, and as a refinement and extension of the above embodiment, an embodiment of the present application provides another suspension-causing sensitive word prediction method. As shown in FIG. 2, the method includes:
201. Determine the black short message samples and the white short message samples in the historical short message data, and use a preset public sensitive word lexicon to screen the black sensitive word samples and the white sensitive word samples out of the black and white short message samples respectively.
[Rectified under Rule 91, 19.08.2021]
The historical short message data is the short message data sent by the company's business personnel. To build the preset sensitive word prediction models, the historical short message data is used as sample data. A black short message sample is a short message in the historical data sent by a number that was suspended by the carrier. A white short message sample is a short message in the historical data sent by a number that was not suspended by the carrier, that is, any short message in the historical data other than the black short message samples. A black sensitive word sample is a sensitive word extracted from the black short message samples, and a white sensitive word sample is a sensitive word extracted from the white short message samples. To obtain these samples, step 201 specifically includes: obtaining historical suspension information; determining, according to the time information and number information in the historical suspension information, that the short messages in the historical data sent by the listed number at the listed time are black short message samples and that the remaining short messages are white short message samples; performing word segmentation on the black and white short message samples to obtain the segmented words corresponding to each; and using the public sensitive words in the preset public sensitive word lexicon to screen the black sensitive word samples and the white sensitive word samples out of those segmented words.
The historical suspension information is the carrier's record of suspensions applied to the company, and the company can request it from the carrier. It mainly includes the time and the number of each suspension, for example that mobile number 185××××××49 was suspended on June 29, 2020. The sensitive words in the preset public sensitive word lexicon are publicly available; after deduplication, 98,955 sensitive words are obtained. Some of these 98,955 words are suspension-causing sensitive words and some are not, so they need to be screened.
[Rectified under Rule 91, 19.08.2021]
Specifically, the company first requests the historical suspension information from the carrier. This information only records that a given mobile number was suspended on a given day. It is accurate to the day, not to the hour, minute, or second, and it does not identify the specific short message. All short messages sent by a suspended number on the day of its suspension are therefore treated as black short message samples. For example, if mobile number 185××××××49 was suspended on June 29, 2020, all short messages in the historical data sent by 185××××××49 on June 29, 2020 are black short message samples, and the remaining short messages in the historical data are white short message samples. After the black and white short message samples are determined, word segmentation is performed on each of them. Specifically, a preset conditional random field (CRF) word segmentation model can be used to obtain the segmented words of each black short message sample and of each white short message sample. The sensitive words in the preset public sensitive word lexicon are then used to screen the black sensitive word samples and the white sensitive word samples out of those segmented words. Because the embodiment of the present application screens against the public sensitive word lexicon, the white short message samples will also contain sensitive words.
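The day-level labelling rule can be sketched as follows. The record layout (`Sms` with `number`, `sent_on`, and `text` fields) and the placeholder numbers are assumptions made for illustration; the application does not specify a data format.

```python
from datetime import date
from typing import List, NamedTuple, Sequence, Set, Tuple


class Sms(NamedTuple):
    number: str    # sending number
    sent_on: date  # day the message was sent (suspension records are day-level)
    text: str


def split_black_white(history: Sequence[Sms],
                      suspensions: Set[Tuple[str, date]]) -> Tuple[List[Sms], List[Sms]]:
    """Label every message sent by a suspended number on its suspension day as black."""
    black: List[Sms] = []
    white: List[Sms] = []
    for sms in history:
        if (sms.number, sms.sent_on) in suspensions:
            black.append(sms)
        else:
            white.append(sms)
    return black, white


if __name__ == "__main__":
    # "number-A" and "number-B" are placeholders, not real phone numbers.
    history = [
        Sms("number-A", date(2020, 6, 29), "message one"),
        Sms("number-A", date(2020, 6, 28), "message two"),
        Sms("number-B", date(2020, 6, 29), "message three"),
    ]
    suspensions = {("number-A", date(2020, 6, 29))}
    black, white = split_black_white(history, suspensions)
    print(len(black), len(white))
```

Only the message sent by the suspended number on the suspension day is labelled black; the same number's messages on other days, and other numbers' messages on that day, stay white.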
For example, a white short message sample reads: "Hello Mu Qiongxian, your loan is currently overdue beyond the limit. The system indicates that at 10:00 tomorrow your installment repayment eligibility will be closed, and you will need to settle the full defaulted amount in one payment. At the same time, the amount will be reported to your bank credit record, and your credit record will subsequently show a special-mention classification, or even a substandard classification. Your cooperation with financial institutions, especially banks, will then be restricted. Please be advised." After screening with the public sensitive word database, the sensitive words found in the white short message sample are {77: 'bank', 116: 'bank', 13: 'loan', 22: 'system', 119: 'cooperation'}. Each key is the starting position of a sensitive word in the white short message sample (the positions refer to the original Chinese text), and each value is the sensitive word in the public sensitive word database that was hit. Clearly, however, "bank", "loan", and "system" are not blocking sensitive words of the operator.
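The sample determination and screening described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function names, the tuple layout of the messages, and the placeholder numbers are hypothetical, and plain substring matching stands in for the preset conditional random field word segmentation model followed by screening of the word segments.

```python
from datetime import date


def split_samples(messages, blocked):
    """Split historical short messages into black and white samples.

    messages: list of (number, sent_date, text) tuples (assumed layout).
    blocked:  set of (number, blocked_date) pairs from the blocking information.
    A message is black if its sender was blocked on the day it was sent.
    """
    black, white = [], []
    for number, sent_date, text in messages:
        if (number, sent_date) in blocked:
            black.append(text)
        else:
            white.append(text)
    return black, white


def match_sensitive_words(text, lexicon):
    """Return {start_position: word} for every lexicon word found in text.

    Assumption: substring search replaces CRF segmentation plus screening.
    """
    hits = {}
    for word in lexicon:
        start = text.find(word)
        while start != -1:
            hits[start] = word
            start = text.find(word, start + 1)
    return hits


# Hypothetical data: "number-A" was blocked on 2020-06-29.
messages = [
    ("number-A", date(2020, 6, 29), "loan offer"),
    ("number-A", date(2020, 6, 28), "bank notice"),
    ("number-B", date(2020, 6, 29), "bank and loan, bank"),
]
blocked = {("number-A", date(2020, 6, 29))}
black, white = split_samples(messages, blocked)
hits = match_sensitive_words(white[1], ["bank", "loan"])
```

The returned dictionary has the same shape as the example above: keys are starting positions and values are the sensitive words that were hit.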
[Correction 19.08.2021 in accordance with Rule 91]
Further, after all the sensitive words in the black short message samples and the white short message samples are obtained, the sensitive words in the white short message samples are certainly not blocking sensitive words in the operator's blocking word database. For the black short message samples, however, all short messages sent by the blocked number on the day in question were treated as black short message samples when the samples were determined. Some sensitive words in the black short message samples are therefore probably not blocking sensitive words, that is, not black-sensitive words, and should not be included in the black-sensitive word samples. To improve the training accuracy of the model and determine the black-sensitive word samples accurately, the method further includes: determining the sensitive word samples in the black-sensitive word samples that coincide with the white-sensitive word samples; and excluding the coincident sensitive word samples from the black-sensitive word samples to obtain the remaining samples in the black-sensitive word samples.
For example, let the determined white-sensitive word sample set be A and the black-sensitive word sample set be B, and let C be the intersection of A and B. The remaining samples in the black-sensitive word samples are then B − C, that is, the black-sensitive word sample set with the white-sensitive word samples removed.
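The set operation in this example can be sketched with built-in sets. The function name and the sample words are hypothetical and serve only as an illustration.

```python
def remaining_black_words(white_words, black_words):
    """Return B - C, where A is the white set, B the black set, C = A & B."""
    a = set(white_words)
    b = set(black_words)
    c = a & b          # sensitive words that appear in both sample sets
    return b - c       # black-sensitive words left after the exclusion


# Hypothetical sample sets.
A = {"bank", "loan", "system"}
B = {"loan", "bank", "kidney selling"}
remaining = remaining_black_words(A, B)
```

Because C is a subset of B, B − C equals B − A, so the intersection does not strictly need to be computed; it is kept here to mirror the wording of the example.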
202. Use the black-sensitive word samples and the white-sensitive word samples as a training set, and construct different types of preset sensitive word prediction models according to the training set.
For the embodiment of the present application, since the sensitive words in the white-sensitive word samples are certainly not blocking sensitive words, the part of the black-sensitive word samples that overlaps with the white-sensitive word samples is excluded. Based on this, step 202 specifically includes: using the remaining samples and the white-sensitive word samples as the training set, and constructing different types of preset sensitive word prediction models according to the training set.
Specifically, the sensitive words in the remaining samples are labeled 1 and the sensitive words in the white-sensitive word samples are labeled 0. The labeled remaining samples and white-sensitive word samples serve as the sample training set, from which a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model, and a preset proximity classification sensitive word prediction model are constructed respectively. Whether a sensitive word to be predicted in the public sensitive word database is a blocking sensitive word can then be determined from the outputs of the different types of sensitive word prediction models, which further improves the screening accuracy of blocking sensitive words.
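The labeling step can be sketched as below. The function name and sample words are hypothetical; turning the words into feature vectors and fitting the three models are outside this sketch, since this passage does not specify them.

```python
def build_training_set(remaining_black_words, white_words):
    """Label remaining black-sensitive words 1 and white-sensitive words 0.

    Returns a list of (word, label) pairs; sorting only makes the
    order deterministic and is not required by the method.
    """
    samples = [(word, 1) for word in sorted(remaining_black_words)]
    samples += [(word, 0) for word in sorted(white_words)]
    return samples


# Hypothetical inputs: the remaining samples and the white-sensitive words.
training_set = build_training_set({"kidney selling"}, {"bank", "loan"})
```

The same labeled set would be passed to each of the three model types so that their outputs are comparable.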
203. Obtain the public sensitive words to be predicted.
The public sensitive words to be predicted are sensitive words in the public sensitive word database, such as loan, bank, system, kidney selling, and blood selling. To bring the business company's own sensitive word database closer to the operator's blocking sensitive word database, blocking sensitive words need to be selected from the sensitive words in the public sensitive word database; that is, blocking sensitive word prediction is performed on each sensitive word in the public sensitive word database.
204. Input the public sensitive words into different types of preset sensitive word prediction models respectively to perform blocking sensitive word prediction, and obtain the prediction results output by the different types of preset sensitive word prediction models.
To improve the screening accuracy of blocking sensitive words, the public sensitive words can be input into different types of preset sensitive word prediction models for prediction, yielding the prediction results output by the different types of preset sensitive word prediction models. The specific process of performing blocking sensitive word prediction with a preset sensitive word prediction model is exactly the same as in step 102 and is not repeated here.
205. According to the prediction results output by the different types of preset sensitive word prediction models, determine whether the public sensitive words are blocking sensitive words.
For the embodiment of the present application, to determine whether a public sensitive word to be predicted is a blocking sensitive word, step 205 specifically includes: if the prediction results output by the different types of preset sensitive word prediction models all indicate that the public sensitive word is a blocking sensitive word, finally determining that the public sensitive word is a blocking sensitive word; and if any one of the different types of preset sensitive word prediction models outputs a prediction result indicating that the public sensitive word is not a blocking sensitive word, finally determining that the public sensitive word is not a blocking sensitive word. The different types of preset sensitive word prediction models may be, but are not limited to, a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model, and a preset proximity classification sensitive word prediction model.
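The unanimous determination rule of step 205 can be sketched as follows. The function name is hypothetical, and each model is assumed to output the label 1 for "blocking sensitive word" and 0 otherwise, matching the labels used for the training set.

```python
def is_blocking_word_unanimous(predictions):
    """Return True only if every model predicts label 1.

    predictions: iterable of 0/1 labels, one per prediction model.
    A single 0 from any model yields the final result "not blocking".
    """
    predictions = list(predictions)
    if not predictions:
        raise ValueError("at least one model prediction is required")
    return all(label == 1 for label in predictions)


# Hypothetical outputs of the three model types for two words.
all_agree = is_blocking_word_unanimous([1, 1, 1])
one_dissents = is_blocking_word_unanimous([1, 0, 1])
```

This rule is conservative: it keeps false positives low at the cost of selecting fewer blocking sensitive words.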
Further, to select more blocking sensitive words from the public sensitive word database, the outputs of the different types of sensitive word prediction models can be considered together. Based on this, step 205 further specifically includes:
determining the prediction weights corresponding to the different types of preset sensitive word prediction models; and determining, according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights, whether the public sensitive word is a blocking sensitive word. The different types of preset sensitive word prediction models include: a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model, and a preset proximity classification sensitive word prediction model. Determining the prediction weights corresponding to the different types of preset sensitive word prediction models includes: setting the prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model, and the preset proximity classification sensitive word prediction model respectively. Determining whether the public sensitive word is a blocking sensitive word according to the prediction results and their corresponding prediction weights includes: determining whether the public sensitive word is a blocking sensitive word according to the prediction results output by the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model, and the preset proximity classification sensitive word prediction model and their corresponding prediction weights.
For example, the prediction weights of the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model, and the preset proximity classification sensitive word prediction model are set to 0.5, 0.25, and 0.25 respectively, and the three models output probabilities of 0.8, 0.7, and 0.5 respectively that the public sensitive word is a blocking sensitive word. The weighted sum of the model outputs and their corresponding weights is 0.5 × 0.8 + 0.25 × 0.7 + 0.25 × 0.5 = 0.7, so the public sensitive word to be predicted is determined to be a blocking sensitive word.
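The weighted determination in this example can be sketched as below. The function names are hypothetical, and the decision threshold of 0.5 is an assumption: the text states only that a weighted sum of 0.7 leads to a positive determination and does not give a threshold.

```python
def weighted_score(probabilities, weights):
    """Weighted sum of per-model probabilities that the word is blocking."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight is required per model")
    return sum(w * p for w, p in zip(weights, probabilities))


def is_blocking_word_weighted(probabilities, weights, threshold=0.5):
    """Compare the weighted score with a threshold (0.5 is an assumed value)."""
    return weighted_score(probabilities, weights) >= threshold


# Values from the example: SVM, gradient boosting tree, proximity classification.
weights = [0.5, 0.25, 0.25]
probabilities = [0.8, 0.7, 0.5]
score = weighted_score(probabilities, weights)   # approximately 0.7
decision = is_blocking_word_weighted(probabilities, weights)
```

Unlike the unanimous rule, a word can still be selected here when one model is undecided, which is how this variant selects more blocking sensitive words.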
Compared with the current practice of manually screening blocking sensitive words, the other blocking sensitive word prediction method provided by the embodiment of the present application obtains the public sensitive words to be predicted; inputs the public sensitive words into different types of preset sensitive word prediction models respectively to perform blocking sensitive word prediction, obtaining the prediction results output by the different types of preset sensitive word prediction models; and determines, according to those prediction results, whether the public sensitive words are blocking sensitive words. By constructing preset sensitive word prediction models and using them to perform blocking sensitive word prediction on public sensitive words, the method screens the blocking sensitive words in the public sensitive word database automatically, which improves the screening efficiency while ensuring the accuracy of the screening results. In addition, constructing different types of preset sensitive word prediction models further improves the accuracy of the prediction results and ensures the reliability of the screening results, while reducing the workload of business personnel and lowering labor costs.
Further, as a specific implementation of FIG. 1, an embodiment of the present application provides a blocking sensitive word prediction apparatus. As shown in FIG. 3, the apparatus includes: an acquisition unit 31, a prediction unit 32, and a determination unit 33.
The acquisition unit 31 may be configured to obtain the public sensitive words to be predicted. The acquisition unit 31 is the main functional module of the apparatus for obtaining the public sensitive words to be predicted.
The prediction unit 32 may be configured to input the public sensitive words into different types of preset sensitive word prediction models respectively to perform blocking sensitive word prediction, and obtain the prediction results output by the different types of preset sensitive word prediction models. The prediction unit 32 is the main functional module of the apparatus for this prediction, and is also a core module.
The determination unit 33 may be configured to determine, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive words are blocking sensitive words. The determination unit 33 is the main functional module of the apparatus for this determination, and is also a core module.
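The division of the apparatus into three units can be sketched as a small class. This is only an illustration under stated assumptions: the class and method names are hypothetical, each model is assumed to be a callable mapping a word to the label 0 or 1, and the determination uses the unanimous rule as one possible choice.

```python
from typing import Callable, Dict, List

Model = Callable[[str], int]  # assumed interface: word -> 0 or 1


class BlockingWordPredictor:
    """Hypothetical counterpart of units 31, 32, and 33."""

    def __init__(self, lexicon: List[str], models: Dict[str, Model]):
        self.lexicon = lexicon
        self.models = models

    def acquire(self) -> List[str]:
        # Acquisition unit 31: obtain the public sensitive words to predict.
        return list(self.lexicon)

    def predict(self, word: str) -> Dict[str, int]:
        # Prediction unit 32: run every model type on the word.
        return {name: model(word) for name, model in self.models.items()}

    def determine(self, results: Dict[str, int]) -> bool:
        # Determination unit 33: unanimous rule (one possible choice).
        return all(label == 1 for label in results.values())

    def run(self) -> Dict[str, bool]:
        return {w: self.determine(self.predict(w)) for w in self.acquire()}


# Toy stand-in models, not trained classifiers.
toy_models = {
    "m1": lambda w: 1 if "selling" in w else 0,
    "m2": lambda w: 0 if w == "bank" else 1,
}
result = BlockingWordPredictor(["bank", "kidney selling", "loan"], toy_models).run()
```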
Further, to determine whether the public sensitive word is a blocking sensitive word, the determination unit 33 may be specifically configured to finally determine that the public sensitive word is a blocking sensitive word if the prediction results output by the different types of preset sensitive word prediction models all indicate that the public sensitive word is a blocking sensitive word;
and to finally determine that the public sensitive word is not a blocking sensitive word if any one of the different types of preset sensitive word prediction models outputs a prediction result indicating that the public sensitive word is not a blocking sensitive word.
Further, to determine whether the public sensitive word is a blocking sensitive word, as shown in FIG. 4, the determination unit 33 includes: a determining module 331 and a judging module 332.
The determining module 331 may be configured to determine the prediction weights corresponding to the different types of preset sensitive word prediction models.
The judging module 332 may be configured to determine, according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights, whether the public sensitive word is a blocking sensitive word.
In a specific application scenario, the different types of preset sensitive word prediction models include: a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model, and a preset proximity classification sensitive word prediction model. The determining module 331 may be specifically configured to set the prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model, and the preset proximity classification sensitive word prediction model respectively.
The judging module 332 may be specifically configured to determine whether the public sensitive word is a blocking sensitive word according to the prediction results output by the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model, and the preset proximity classification sensitive word prediction model and their corresponding prediction weights.
Further, to construct the different types of preset sensitive word prediction models, the apparatus further includes: a determining unit 34, a screening unit 35, and a construction unit 36.
The determining unit 34 may be configured to determine the black short message samples and the white short message samples in the historical short message data.
The screening unit 35 may be configured to use a preset public sensitive word database to screen the black-sensitive word samples and the white-sensitive word samples from the black short message samples and the white short message samples respectively.
The construction unit 36 may be configured to use the black-sensitive word samples and the white-sensitive word samples as a training set, and construct different types of preset sensitive word prediction models according to the training set.
Further, to determine the black short message samples and the white short message samples in the historical short message data, the determining unit 34 includes: an acquisition module 341 and a determining module 342.
The acquisition module 341 may be configured to obtain the historical blocking information.
The determining module 342 may be configured to determine, according to the time information and the number information in the historical blocking information, that the short message data in the historical short message data sent by the number indicated by the number information at the time indicated by the time information are black short message samples, and that the remaining short message data are white short message samples.
Further, to determine the black-sensitive word samples and the white-sensitive word samples, the screening unit 35 includes: a word segmentation module 351 and a screening module 352.
The word segmentation module 351 may be configured to perform word segmentation on the black short message samples and the white short message samples to obtain the word segments corresponding to the black short message samples and the white short message samples respectively.
The screening module 352 may be configured to use the public sensitive words in the preset public sensitive word database to screen the black-sensitive word samples and the white-sensitive word samples from the word segments corresponding to the black short message samples and the white short message samples respectively.
Further, to exclude the coincident sensitive word samples from the black-sensitive word samples, the apparatus further includes an exclusion unit 37.
The determining unit 34 may be further configured to determine the sensitive word samples in the black-sensitive word samples that coincide with the white-sensitive word samples.
The exclusion unit 37 may be configured to exclude the coincident sensitive word samples from the black-sensitive word samples to obtain the remaining samples in the black-sensitive word samples.
The construction unit 36 may be specifically configured to use the remaining samples and the white-sensitive word samples as the training set, and construct different types of preset sensitive word prediction models according to the training set.
It should be noted that, for other descriptions of the functional modules of the blocking sensitive word prediction apparatus provided in the embodiment of the present application, reference may be made to the corresponding descriptions of the method shown in FIG. 1, which are not repeated here.
Based on the method shown in FIG. 1, an embodiment of the present application correspondingly provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the following steps: obtaining the public sensitive words to be predicted; inputting the public sensitive words into different types of preset sensitive word prediction models respectively to perform blocking sensitive word prediction, and obtaining the prediction results output by the different types of preset sensitive word prediction models; and determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive words are blocking sensitive words.
Optionally, when executed by the processor, the program may also implement the other steps of the method in the foregoing embodiments, which are not repeated here. Further optionally, the storage medium involved in the present application, such as the computer-readable storage medium, may be non-volatile or volatile.
Based on the embodiments of the method shown in FIG. 1 and the apparatus shown in FIG. 3, an embodiment of the present application further provides a physical structure diagram of a computer device. As shown in FIG. 5, the computer device includes: a processor 41, a memory 42, and a computer program stored on the memory 42 and executable on the processor, where the memory 42 and the processor 41 are both arranged on a bus 43. When the processor 41 executes the program, the following steps are implemented: obtaining the public sensitive words to be predicted; inputting the public sensitive words into different types of preset sensitive word prediction models respectively to perform blocking sensitive word prediction, and obtaining the prediction results output by the different types of preset sensitive word prediction models; and determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive words are blocking sensitive words.
With the technical solution of the present application, the public sensitive words to be predicted can be obtained; the public sensitive words are input into different types of preset sensitive word prediction models respectively to perform blocking sensitive word prediction, obtaining the prediction results output by the different types of preset sensitive word prediction models; and whether the public sensitive words are blocking sensitive words is determined according to those prediction results. By constructing preset sensitive word prediction models and using them to perform blocking sensitive word prediction on public sensitive words, the blocking sensitive words in the public sensitive word database are screened automatically, which improves the screening efficiency while ensuring the accuracy of the screening results. In addition, constructing different types of preset sensitive word prediction models further improves the accuracy of the prediction results and ensures the reliability of the screening results, while reducing the workload of business personnel and lowering labor costs.
Obviously, those skilled in the art should understand that the modules or steps of the present application described above can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described can be performed in an order different from the one given here, or the modules or steps can be fabricated as individual integrated circuit modules, or several of them can be fabricated as a single integrated circuit module. As such, the present application is not limited to any particular combination of hardware and software.
The above descriptions are only preferred embodiments of the present application and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall fall within the protection scope of the present application.

Claims (20)

  1. 一种封停敏感词预测方法,包括:A block-sensitive word prediction method, comprising:
    获取待预测的公共敏感词;Obtain the public sensitive words to be predicted;
    将所述公共敏感词分别输入至不同类型的预设敏感词预测模型进行封停敏感词预测,得到所述不同类型的预设敏感词预测模型输出的预测结果;Inputting the public sensitive words into different types of preset sensitive word prediction models respectively to predict the blocked sensitive words, to obtain prediction results output by the different types of preset sensitive word prediction models;
    根据所述不同类型的预设敏感词预测模型输出的预测结果,判定所述公共敏感词是否为封停敏感词。According to the prediction results output by the different types of preset sensitive word prediction models, it is determined whether the public sensitive words are blocked sensitive words.
  2. 根据权利要求1所述的方法,其中,所述根据所述不同类型的预设敏感词预测模型输出的预测结果,判定所述公共敏感词是否为封停敏感词,包括:The method according to claim 1, wherein the determining whether the public sensitive word is a blocking sensitive word according to the prediction results output by the different types of preset sensitive word prediction models comprises:
    若所述不同类型的预设敏感词预测模型输出的预测结果均为所述公共敏感词为封停敏感词,则最终确定所述公共敏感词为封停敏感词;If the prediction results output by the different types of preset sensitive word prediction models are all the public sensitive words are blocked sensitive words, then finally determine that the public sensitive words are blocked sensitive words;
    若所述不同类型的预设敏感词预测模型中存在任一类型的预设敏感词预测模型输出的预测结果为所述公共敏感词不是封停敏感词,则最终确定所述公共敏感词不是封停敏感词。If the prediction result output by any type of preset sensitive word prediction model in the different types of preset sensitive word prediction models is that the public sensitive word is not a blocking sensitive word, then it is finally determined that the public sensitive word is not a blocking sensitive word Stop sensitive words.
  3. 根据权利要求1所述的方法,其中,所述根据所述不同类型的预设敏感词预测模型输出的预测结果,判定所述公共敏感词是否为封停敏感词,包括:The method according to claim 1, wherein the determining whether the public sensitive word is a blocking sensitive word according to the prediction results output by the different types of preset sensitive word prediction models comprises:
    确定所述不同类型的预设敏感词预测模型对应的预测权重;determining the prediction weights corresponding to the different types of preset sensitive word prediction models;
    根据所述不同类型的预设敏感词预测模型输出的预测结果及其对应的预测权重,判定所述公共敏感词是否为封停敏感词。According to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights, it is determined whether the public sensitive words are blocked sensitive words.
  4. 根据权利要求3所述的方法,其中,所述不同类型的预设敏感词预测模型包括:预设支持向量机敏感词预测模型、预设梯度提升树敏感词预测模型和预设邻近分类敏感词预测模型,所述确定所述不同类型的预设敏感词预测模型对应的预测权重,包括:The method according to claim 3, wherein the different types of preset sensitive word prediction models include: a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model, and a preset proximity classification sensitive word A prediction model, the determining the prediction weights corresponding to the different types of preset sensitive word prediction models, including:
    分别设定所述预设支持向量机敏感词预测模型、所述预设梯度提升树敏感词预测模型和所述预设邻近分类敏感词预测模型对应的预测权重;respectively setting the prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word prediction model;
    所述根据所述不同类型的预设敏感词预测模型输出的预测结果及其对应的预测权重,判定所述公共敏感词是否为封停敏感词,包括:Determining whether the public sensitive word is a blocking sensitive word according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights includes:
    根据所述预设支持向量机敏感词预测模型、所述预设梯度提升树敏感词预测模型和所述预设邻近分类敏感词预测模型输出的预测结果及其对应的预测权重,判定所述公共敏感词是否为封停敏感词。According to the prediction results output by the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word prediction model and their corresponding prediction weights, determine the public Whether the sensitive word is a blocking sensitive word.
  5. The method according to claim 1, wherein before the respectively inputting the public sensitive words into the different types of preset sensitive word prediction models to perform suspension-causing sensitive word prediction to obtain the prediction results output by the different types of preset sensitive word prediction models, the method further comprises:
    determining black short message samples and white short message samples in historical short message data;
    screening, by using a preset public sensitive word lexicon, black sensitive word samples from the black short message samples and white sensitive word samples from the white short message samples;
    using the black sensitive word samples and the white sensitive word samples as a training set, and constructing the different types of preset sensitive word prediction models according to the training set.
  6. The method according to claim 5, wherein the determining black short message samples and white short message samples in historical short message data comprises:
    obtaining historical suspension information;
    determining, according to time information and number information in the historical suspension information, that the short message data in the historical short message data sent by the number indicated by the number information at the time indicated by the time information are black short message samples, and that the remaining short message data are white short message samples;
    the screening, by using a preset public sensitive word lexicon, black sensitive word samples from the black short message samples and white sensitive word samples from the white short message samples comprises:
    performing word segmentation on the black short message samples and the white short message samples to obtain the segmented words corresponding to the black short message samples and to the white short message samples respectively;
    screening, by using the public sensitive words in the preset public sensitive word lexicon, the black sensitive word samples and the white sensitive word samples from the segmented words corresponding to the black short message samples and to the white short message samples respectively.
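The labeling and screening steps of this claim can be illustrated with a small Python sketch. It assumes that each message is a (number, date, text) tuple, that suspension records are matched at day granularity, and that whitespace splitting stands in for a real Chinese word segmenter. The function names and sample data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Message = Tuple[str, str, str]   # (sender number, send date, text), assumed layout
Suspension = Tuple[str, str]     # (suspended number, suspension date), assumed layout


def split_black_white(messages: Iterable[Message],
                      suspensions: Set[Suspension]) -> Tuple[List[str], List[str]]:
    """Label a message black if its sender and date match a suspension record."""
    black, white = [], []
    for number, date, text in messages:
        (black if (number, date) in suspensions else white).append(text)
    return black, white


def screen_sensitive_words(texts: Iterable[str], lexicon: Set[str]) -> Set[str]:
    """Segment each text and keep only tokens found in the public lexicon."""
    found: Set[str] = set()
    for text in texts:
        tokens = text.split()  # placeholder for a real word segmenter
        found.update(token for token in tokens if token in lexicon)
    return found


messages = [
    ("13800000001", "2020-11-02", "win prize click link"),
    ("13800000001", "2020-11-05", "meeting at noon"),
    ("13800000002", "2020-11-02", "prize draw tomorrow"),
]
suspensions = {("13800000001", "2020-11-02")}
lexicon = {"prize", "link"}

black_texts, white_texts = split_black_white(messages, suspensions)
black_words = screen_sensitive_words(black_texts, lexicon)
white_words = screen_sensitive_words(white_texts, lexicon)
```

In this toy data, only the message sent by the suspended number on the suspension date is labeled black, so "prize" appears among both the black and the white sensitive word samples.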
  7. The method according to claim 5, wherein after the screening, by using a preset public sensitive word lexicon, black sensitive word samples from the black short message samples and white sensitive word samples from the white short message samples, the method further comprises:
    determining the sensitive word samples in the black sensitive word samples that coincide with the white sensitive word samples;
    excluding the coinciding sensitive word samples from the black sensitive word samples to obtain the remaining samples of the black sensitive word samples;
    the using the black sensitive word samples and the white sensitive word samples as a training set, and constructing the different types of preset sensitive word prediction models according to the training set comprises:
    using the remaining samples and the white sensitive word samples as the training set, and constructing the different types of preset sensitive word prediction models according to the training set.
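The exclusion step in this claim amounts to a set difference: any word that also occurs in the white samples is dropped from the black samples before training. A minimal Python sketch follows. The function name `build_training_set` and the 1/0 label encoding are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def build_training_set(black_words: Iterable[str],
                       white_words: Iterable[str]) -> List[Tuple[str, int]]:
    """Drop black words that also occur in white samples, then label the rest.

    Label 1 marks a suspension-causing (black) sample, label 0 a white sample.
    """
    white = set(white_words)
    remaining_black = set(black_words) - white  # exclude coinciding words
    return ([(word, 1) for word in sorted(remaining_black)]
            + [(word, 0) for word in sorted(white)])


training_set = build_training_set({"prize", "link"}, {"prize"})
```

Here "prize" occurs in both groups, so it is kept only as a white sample, and "link" is the single remaining black sample.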
  8. A suspension-causing sensitive word prediction apparatus, comprising:
    an obtaining unit, configured to obtain public sensitive words to be predicted;
    a prediction unit, configured to respectively input the public sensitive words into different types of preset sensitive word prediction models to perform suspension-causing sensitive word prediction, to obtain prediction results output by the different types of preset sensitive word prediction models;
    a determination unit, configured to determine, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive words are suspension-causing sensitive words.
  9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following method:
    obtaining public sensitive words to be predicted;
    respectively inputting the public sensitive words into different types of preset sensitive word prediction models to perform suspension-causing sensitive word prediction, to obtain prediction results output by the different types of preset sensitive word prediction models;
    determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive words are suspension-causing sensitive words.
  10. The computer-readable storage medium according to claim 9, wherein performing the determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive words are suspension-causing sensitive words comprises:
    if the prediction results output by the different types of preset sensitive word prediction models all indicate that the public sensitive words are suspension-causing sensitive words, finally determining that the public sensitive words are suspension-causing sensitive words;
    if the prediction result output by any one of the different types of preset sensitive word prediction models indicates that the public sensitive words are not suspension-causing sensitive words, finally determining that the public sensitive words are not suspension-causing sensitive words.
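The two branches of this claim describe a unanimity rule: a word is accepted only when every model agrees, and a single dissenting model rejects it. A minimal Python sketch of that rule is shown below. The function name `unanimous_vote` and the guard against an empty input are illustrative assumptions.

```python
from typing import Iterable


def unanimous_vote(predictions: Iterable[bool]) -> bool:
    """Return True only if every model predicts the word is suspension-causing."""
    results = list(predictions)
    if not results:
        # all([]) would be True, so reject the empty case explicitly.
        raise ValueError("at least one model prediction is required")
    return all(results)


accepted = unanimous_vote([True, True, True])
rejected = unanimous_vote([True, False, True])
```

This rule trades recall for precision: it flags fewer words than a majority or weighted vote would, but each flagged word has the support of every model.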
  11. The computer-readable storage medium according to claim 9, wherein performing the determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive words are suspension-causing sensitive words comprises:
    determining prediction weights corresponding to the different types of preset sensitive word prediction models;
    determining, according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights, whether the public sensitive words are suspension-causing sensitive words.
  12. The computer-readable storage medium according to claim 11, wherein the different types of preset sensitive word prediction models comprise a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model, and a preset proximity classification sensitive word prediction model, and performing the determining prediction weights corresponding to the different types of preset sensitive word prediction models comprises:
    respectively setting the prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model, and the preset proximity classification sensitive word prediction model;
    performing the determining, according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights, whether the public sensitive words are suspension-causing sensitive words comprises:
    determining, according to the prediction results output by the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model, and the preset proximity classification sensitive word prediction model and their corresponding prediction weights, whether the public sensitive words are suspension-causing sensitive words.
  13. The computer-readable storage medium according to claim 9, wherein before the respectively inputting the public sensitive words into the different types of preset sensitive word prediction models to perform suspension-causing sensitive word prediction to obtain the prediction results output by the different types of preset sensitive word prediction models, the computer program, when executed by the processor, further implements:
    determining black short message samples and white short message samples in historical short message data;
    screening, by using a preset public sensitive word lexicon, black sensitive word samples from the black short message samples and white sensitive word samples from the white short message samples;
    using the black sensitive word samples and the white sensitive word samples as a training set, and constructing the different types of preset sensitive word prediction models according to the training set.
  14. The computer-readable storage medium according to claim 13, wherein performing the determining black short message samples and white short message samples in historical short message data comprises:
    obtaining historical suspension information;
    determining, according to time information and number information in the historical suspension information, that the short message data in the historical short message data sent by the number indicated by the number information at the time indicated by the time information are black short message samples, and that the remaining short message data are white short message samples;
    performing the screening, by using a preset public sensitive word lexicon, black sensitive word samples from the black short message samples and white sensitive word samples from the white short message samples comprises:
    performing word segmentation on the black short message samples and the white short message samples to obtain the segmented words corresponding to the black short message samples and to the white short message samples respectively;
    screening, by using the public sensitive words in the preset public sensitive word lexicon, the black sensitive word samples and the white sensitive word samples from the segmented words corresponding to the black short message samples and to the white short message samples respectively.
  15. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the following method:
    obtaining public sensitive words to be predicted;
    respectively inputting the public sensitive words into different types of preset sensitive word prediction models to perform suspension-causing sensitive word prediction, to obtain prediction results output by the different types of preset sensitive word prediction models;
    determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive words are suspension-causing sensitive words.
  16. The computer device according to claim 15, wherein performing the determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive words are suspension-causing sensitive words comprises:
    if the prediction results output by the different types of preset sensitive word prediction models all indicate that the public sensitive words are suspension-causing sensitive words, finally determining that the public sensitive words are suspension-causing sensitive words;
    if the prediction result output by any one of the different types of preset sensitive word prediction models indicates that the public sensitive words are not suspension-causing sensitive words, finally determining that the public sensitive words are not suspension-causing sensitive words.
  17. The computer device according to claim 15, wherein performing the determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive words are suspension-causing sensitive words comprises:
    determining prediction weights corresponding to the different types of preset sensitive word prediction models;
    determining, according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights, whether the public sensitive words are suspension-causing sensitive words.
  18. The computer device according to claim 17, wherein the different types of preset sensitive word prediction models comprise a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model, and a preset proximity classification sensitive word prediction model, and performing the determining prediction weights corresponding to the different types of preset sensitive word prediction models comprises:
    respectively setting the prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model, and the preset proximity classification sensitive word prediction model;
    performing the determining, according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights, whether the public sensitive words are suspension-causing sensitive words comprises:
    determining, according to the prediction results output by the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model, and the preset proximity classification sensitive word prediction model and their corresponding prediction weights, whether the public sensitive words are suspension-causing sensitive words.
  19. The computer device according to claim 15, wherein before the respectively inputting the public sensitive words into the different types of preset sensitive word prediction models to perform suspension-causing sensitive word prediction to obtain the prediction results output by the different types of preset sensitive word prediction models, the computer program, when executed by the processor, further implements:
    determining black short message samples and white short message samples in historical short message data;
    screening, by using a preset public sensitive word lexicon, black sensitive word samples from the black short message samples and white sensitive word samples from the white short message samples;
    using the black sensitive word samples and the white sensitive word samples as a training set, and constructing the different types of preset sensitive word prediction models according to the training set.
  20. The computer device according to claim 19, wherein performing the determining black short message samples and white short message samples in historical short message data comprises:
    obtaining historical suspension information;
    determining, according to time information and number information in the historical suspension information, that the short message data in the historical short message data sent by the number indicated by the number information at the time indicated by the time information are black short message samples, and that the remaining short message data are white short message samples;
    performing the screening, by using a preset public sensitive word lexicon, black sensitive word samples from the black short message samples and white sensitive word samples from the white short message samples comprises:
    performing word segmentation on the black short message samples and the white short message samples to obtain the segmented words corresponding to the black short message samples and to the white short message samples respectively;
    screening, by using the public sensitive words in the preset public sensitive word lexicon, the black sensitive word samples and the white sensitive word samples from the segmented words corresponding to the black short message samples and to the white short message samples respectively.
PCT/CN2021/083489 2020-12-10 2021-03-29 Suspension-causing sensitive word prediction method and apparatus, and computer device and storage medium WO2022121164A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011434908.5A CN112528636A (en) 2020-12-10 2020-12-10 Method and device for predicting stop sensitive words, computer equipment and storage medium
CN202011434908.5 2020-12-10

Publications (1)

Publication Number Publication Date
WO2022121164A1 true WO2022121164A1 (en) 2022-06-16

Family

ID=74998986

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083489 WO2022121164A1 (en) 2020-12-10 2021-03-29 Suspension-causing sensitive word prediction method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112528636A (en)
WO (1) WO2022121164A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089910A (en) * 2023-02-16 2023-05-09 北京计算机技术及应用研究所 Method for detecting security level of electronic document supporting multiple formats

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528636A (en) * 2020-12-10 2021-03-19 平安科技(深圳)有限公司 Method and device for predicting stop sensitive words, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150007336A1 (en) * 2013-06-27 2015-01-01 Huawei Technologies Co., Ltd. Information processing method, apparatus, and system
CN105404670A (en) * 2015-11-16 2016-03-16 北京奇虎科技有限公司 Harassing text message determining method and apparatus
CN108717408A (en) * 2018-05-11 2018-10-30 杭州排列科技有限公司 A kind of sensitive word method for real-time monitoring, electronic equipment, storage medium and system
CN110991171A (en) * 2019-09-30 2020-04-10 奇安信科技集团股份有限公司 Sensitive word detection method and device
CN111859093A (en) * 2020-07-30 2020-10-30 中国联合网络通信集团有限公司 Sensitive word processing method and device and readable storage medium
CN112528636A (en) * 2020-12-10 2021-03-19 平安科技(深圳)有限公司 Method and device for predicting stop sensitive words, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089910A (en) * 2023-02-16 2023-05-09 北京计算机技术及应用研究所 Method for detecting security level of electronic document supporting multiple formats
CN116089910B (en) * 2023-02-16 2023-10-20 北京计算机技术及应用研究所 Method for detecting security level of electronic document supporting multiple formats

Also Published As

Publication number Publication date
CN112528636A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
US10943186B2 (en) Machine learning model training method and device, and electronic device
CN110674880B (en) Network training method, device, medium and electronic equipment for knowledge distillation
CN110163234B (en) Model training method and device and storage medium
US10484532B1 (en) System and method detecting fraud using machine-learning and recorded voice clips
CN110472675B (en) Image classification method, image classification device, storage medium and electronic equipment
US11068656B2 (en) Displaying text classification anomalies predicted by a text classification model
WO2022121164A1 (en) Suspension-causing sensitive word prediction method and apparatus, and computer device and storage medium
CN110796542A (en) Financial risk control method, financial risk control device and electronic equipment
US11769038B2 (en) Contextually optimizing routings for interactions
US20200394448A1 (en) Methods for more effectively moderating one or more images and devices thereof
CN112508580A (en) Model construction method and device based on rejection inference method and electronic equipment
CN112966865B (en) Number-carrying network-switching prediction method, device and equipment
CN112561685B (en) Customer classification method and device
CN110866832A (en) Risk control method, system, storage medium and computing device
US20230237583A1 (en) System and method for implementing a trust discretionary distribution tool
CN112884569A (en) Credit assessment model training method, device and equipment
CN117114514A (en) Talent information analysis management method, system and device based on big data
CN109101574B (en) Task approval method and system of data leakage prevention system
CN113806501B (en) Training method of intention recognition model, intention recognition method and equipment
GB2600817A (en) Systems and methods for generating dynamic interface options using machine learning models
CN113298121A (en) Message sending method and device based on multi-data source modeling and electronic equipment
CN115795345A (en) Information processing method, device, equipment and storage medium
US20220358505A1 (en) Artificial intelligence (ai)-based detection of fraudulent fund transfers
US20230377004A1 (en) Systems and methods for request validation
US11971900B2 (en) Rule-based data transformation using edge computing architecture

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901880

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901880

Country of ref document: EP

Kind code of ref document: A1