WO2022121164A1

WO2022121164A1 - Suspension-causing sensitive word prediction method and apparatus, and computer device and storage medium

Info

Publication number: WO2022121164A1
Application number: PCT/CN2021/083489
Authority: WO
Inventors: 程华东; 侯翠琴; 李剑锋
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-12-10
Filing date: 2021-03-29
Publication date: 2022-06-16
Also published as: CN112528636A

Abstract

The present application discloses a suspension-causing sensitive word prediction method and apparatus, and a computer device and a storage medium, and mainly aims to improve the screening efficiency and accuracy of suspension-causing sensitive words and reduce the workload of service personnel. The method comprises: obtaining public sensitive words to be predicted; respectively inputting said public sensitive words into different types of preset sensitive word prediction models to perform suspension-causing sensitive word prediction to obtain prediction results output by the different types of preset sensitive word prediction models; and determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether said public sensitive words are suspension-causing sensitive words. The present application is mainly suitable for prediction of the suspension-causing sensitive words.

Description

Block sensitive word prediction method, device, computer equipment and storage medium

This application claims priority to Chinese patent application No. 202011434908.5, filed on December 10, 2020 and entitled "Block sensitive word prediction method, device, computer equipment and storage medium", the entire content of which is incorporated herein by reference.

Technical field

The present application relates to the technical field of artificial intelligence, and in particular, to a method, device, computer equipment and storage medium for predicting blocked sensitive words.

Background

Operators usually maintain their own list of blocking sensitive words, that is, words that cause a number to be blocked. When a text message sent by a user contains a blocking sensitive word, the sending number is blocked, and the user must go to a business hall to have it unblocked before the number can be used again, which is very inconvenient. Business companies therefore need to maintain their own sensitive word list, keep it as close as possible to the operator's blocking sensitive word list, and use it to give early warning on text messages sent by the company, so that the company's internal numbers are not blocked.

The inventor realizes that, at present, when a business company maintains its own sensitive word list, business personnel usually screen blocking sensitive words out of a public sensitive word database by hand, based on historical data on blocked short messages. However, this manual screening is strongly affected by subjective human factors and is likely to miss blocking sensitive words or select the wrong ones, which lowers the screening efficiency and accuracy and increases the workload of business personnel.

SUMMARY OF THE INVENTION

The present application provides a method, device, computer equipment and storage medium for predicting blocked sensitive words, which can improve the screening efficiency and accuracy of blocked sensitive words and reduce the workload of business personnel.

According to a first aspect of the present application, a method for predicting blocking sensitive words is provided, including:

Obtain the public sensitive words to be predicted;

Inputting the public sensitive words into different types of preset sensitive word prediction models respectively to predict the blocked sensitive words, to obtain prediction results output by the different types of preset sensitive word prediction models;

According to the prediction results output by the different types of preset sensitive word prediction models, it is determined whether the public sensitive words are blocked sensitive words.

According to a second aspect of the present application, a blocking sensitive word prediction device is provided, comprising:

The acquisition unit is used to acquire the public sensitive words to be predicted;

A prediction unit, configured to respectively input the public sensitive words into different types of preset sensitive word prediction models to perform block sensitive word prediction, and obtain prediction results output by the different types of preset sensitive word prediction models;

A determination unit, configured to determine whether the public sensitive word is a blocking sensitive word according to the prediction results output by the different types of preset sensitive word prediction models.

According to a third aspect of the present application, a computer-readable storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the following steps are implemented:

Obtain the public sensitive words to be predicted;

According to a fourth aspect of the present application, a computer device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the program:

Obtain the public sensitive words to be predicted;

The present application screens blocking sensitive words out of the public sensitive word database automatically, which improves the screening efficiency while ensuring the accuracy of the screening results. In addition, constructing different types of preset sensitive word prediction models further improves the accuracy of the prediction results, ensures the reliability of the screening results, and reduces the workload of business personnel and labor costs.

Description of drawings

The drawings described herein are used to provide further understanding of the present application and constitute a part of the present application. The schematic embodiments and descriptions of the present application are used to explain the present application and do not constitute an improper limitation of the present application. In the drawings:

FIG. 1 shows a flowchart of a method for predicting a blocked sensitive word provided by an embodiment of the present application;

FIG. 2 shows a flowchart of another method for predicting blocked sensitive words provided by an embodiment of the present application;

FIG. 3 shows a schematic structural diagram of a blocking sensitive word prediction device provided by an embodiment of the present application;

FIG. 4 shows a schematic structural diagram of another block-sensitive word prediction device provided by an embodiment of the present application;

FIG. 5 shows a schematic diagram of an entity structure of a computer device provided by an embodiment of the present application.

Detailed description

Hereinafter, the present application will be described in detail with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that the embodiments in the present application and the features of the embodiments may be combined with each other in the case of no conflict.

The technical solution of the present application relates to the field of artificial intelligence technology, such as machine learning technology, so as to realize the prediction of blocking sensitive words.

At present, when a business company maintains its own sensitive word list, business personnel usually screen blocking sensitive words out of a public sensitive word database by hand, based on historical data on blocked short messages. However, this manual screening is strongly affected by subjective human factors and is likely to miss blocking sensitive words or select the wrong ones, which lowers the screening efficiency and accuracy and increases the workload of business personnel.

In order to solve the above problem, an embodiment of the present application provides a method for predicting blocking sensitive words. As shown in FIG. 1, the method includes:

101. Obtain public sensitive words to be predicted.

The public sensitive words to be predicted are sensitive words in a public sensitive word database, such as "loan", "bank", "system", "selling kidney" and "selling blood". The public sensitive word database records a large number of public sensitive words, and its vocabulary can reach dozens of […]. If a business company used the public sensitive word database directly for short message early warning, a large number of short messages would be intercepted and could not be sent. The company therefore needs a sensitive word database that is the same as, or close to, the operator's blocking sensitive word database. To overcome the drawbacks of manually screening blocking sensitive words in the prior art, the embodiment of the present application constructs preset sensitive word prediction models and uses them to predict each sensitive word in the public sensitive word database separately, so that the blocking sensitive words in the public sensitive word database are screened out automatically. The executing entity of the method may be set on the client side or on the server side.

102. Input the public sensitive words into different types of preset sensitive word prediction models respectively to predict the blocked sensitive words, and obtain prediction results output by the different types of preset sensitive word prediction models.

The different types of preset sensitive word prediction models include a preset support vector machine (SVM) sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model and a preset proximity classification sensitive word prediction model. It should be noted that the different types of preset sensitive word prediction models in the embodiments of the present application are not limited to these. The specific architecture of the preset support vector machine sensitive word prediction model is as follows:

$y = g_{w,b}(w^{T}x + b)$

Here (x, y) is a training sample. White sensitive word samples and black sensitive word samples are used as training samples to train an initial support vector machine model, from which the preset support vector machine sensitive word prediction model is constructed. Specifically, the optimization goal of the preset support vector machine sensitive word prediction model is to maximize the minimum geometric distance from all white sensitive word samples and black sensitive word samples to the separating hyperplane, and the objective function is defined accordingly.

The parameters w and b of the initial support vector machine model are therefore continuously optimized through the objective function, and the preset support vector machine sensitive word prediction model is finally obtained by training.
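The objective function referred to above is not reproduced in this text. As a hedged reconstruction only, a standard max-margin form that matches the verbal description (maximizing the minimum geometric distance to the separating hyperplane) is shown below. The index $i$ over training samples and the label convention $y_i \in \{+1, -1\}$ are assumptions, not taken from the source.

```latex
\max_{w,\,b}\ \min_{i}\ \frac{y_i\,\bigl(w^{T}x_i + b\bigr)}{\lVert w \rVert},
\qquad y_i \in \{+1,-1\}
```

Under this assumed convention, black sensitive word samples would carry one label sign and white sensitive word samples the other.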

Further, the preset gradient boosting tree sensitive word prediction model is built from M decision trees T(x, θ).

Here T is a decision tree, M is the number of decision trees, θ denotes the parameters of a decision tree, and x is a white sensitive word sample or a black sensitive word sample. The boosting tree uses the forward stagewise algorithm: first set $f_0(x) = 0$; the model at the m-th step is then:

$f_m(x) = f_{m-1}(x) + T(x, \theta_m)$

The parameters θ of each decision tree are determined by empirical risk minimization, which gives the objective function.

Therefore, with the white sensitive word samples and black sensitive word samples as the training set, the parameters of the initial gradient boosting tree model are continuously optimized through the constructed objective function, and the preset gradient boosting tree sensitive word prediction model is finally obtained.
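The additive model and the empirical-risk objective themselves are not reproduced in this text. A standard textbook form that is consistent with the step formula $f_m(x) = f_{m-1}(x) + T(x, \theta_m)$ is sketched below; the loss function $L$, the sample count $N$ and the labels $y_i$ are assumptions introduced for illustration.

```latex
f_M(x) = \sum_{m=1}^{M} T(x, \theta_m),
\qquad
\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{N}
  L\bigl(y_i,\ f_{m-1}(x_i) + T(x_i, \theta_m)\bigr)
```

Read this way, each step fits one new tree while the trees from earlier steps stay fixed.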

Further, for the preset proximity classification sensitive word prediction model: since the sensitive words in the white sensitive word samples are not blocking sensitive words and the sensitive words in the black sensitive word samples are, the Euclidean distance between the public sensitive word to be predicted and the white sensitive word samples and the Euclidean distance between it and the black sensitive word samples can be calculated separately. If the distance to the white sensitive word samples is less than the distance to the black sensitive word samples, the public sensitive word is considered to belong to the same category as the white sensitive word samples, that is, it is not a blocking sensitive word. If the distance to the white sensitive word samples is greater than the distance to the black sensitive word samples, the public sensitive word is considered to belong to the same category as the black sensitive word samples, that is, it is a blocking sensitive word. The Euclidean distance between the public sensitive word and a white or black sensitive word sample is calculated as follows:

$d = \sqrt{\sum_{i=1}^{n} (X_i - x_i)^2}$

Here $(X_1, X_2, \ldots, X_n)$ is the public sensitive word to be predicted, $(x_1, x_2, \ldots, x_n)$ is a white sensitive word sample or a black sensitive word sample, and d is the Euclidean distance between the public sensitive word and any one white or black sensitive word sample. The distances between the public sensitive word and each white sensitive word sample are added up, the distances between the public sensitive word and each black sensitive word sample are added up, and the two sums are compared to determine whether the public sensitive word belongs to the same category as the white or the black sensitive word samples. Whether the public sensitive word is a blocking sensitive word is then determined from that result.
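The summed-distance comparison described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the source does not say how a word is turned into a feature vector, so the 2-D vectors here are hypothetical, and treating a tie as "not blocking" is also an assumption.

```python
import math


def summed_distance(word_vec, sample_vecs):
    """Sum of Euclidean distances from word_vec to every sample vector."""
    return sum(math.dist(word_vec, sample) for sample in sample_vecs)


def is_blocking_by_proximity(word_vec, white_vecs, black_vecs):
    """True when the summed distance to the black samples is the smaller one.

    A tie is treated as 'not blocking' (an assumption; the source only
    covers the strictly-less and strictly-greater cases).
    """
    d_white = summed_distance(word_vec, white_vecs)
    d_black = summed_distance(word_vec, black_vecs)
    return d_black < d_white


# Hypothetical feature vectors, for illustration only.
white = [(0.0, 0.0), (0.0, 1.0)]
black = [(5.0, 5.0), (6.0, 5.0)]

near_black = is_blocking_by_proximity((5.0, 4.0), white, black)
near_white = is_blocking_by_proximity((0.0, 0.5), white, black)
```

`math.dist` requires Python 3.8 or later.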

In the embodiment of the present application, in order to screen blocking sensitive words out of the public sensitive word database automatically while keeping the screening results reliable, the sensitive words to be predicted in the public sensitive word database are input into the different types of preset sensitive word prediction models respectively for blocking sensitive word prediction, and the prediction results output by the different types of models are obtained. For example, a public sensitive word is input into the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word prediction model respectively, and the prediction result of each model is obtained. Whether the public sensitive word to be predicted is a blocking sensitive word is then determined from these prediction results, so that blocking sensitive words are screened out of the public sensitive word database automatically.

103. According to the prediction results output by the different types of preset sensitive word prediction models, determine whether the public sensitive word is a blocked sensitive word.

If a blocking sensitive word, such as "selling blood" or "selling kidney", appears in a short message sent by a user, the number that sent the message is blocked. A prediction result is either that the public sensitive word is a blocking sensitive word or that it is not. To ensure the accuracy of the screening results, the embodiment of the present application considers the prediction results of the different types of preset sensitive word prediction models together when finally determining whether the public sensitive word to be predicted is a blocking sensitive word. Specifically, if all of the different types of preset sensitive word prediction models output that the public sensitive word is a blocking sensitive word, the public sensitive word is finally determined to be a blocking sensitive word; if any one of the models outputs that it is not, the public sensitive word is finally determined not to be a blocking sensitive word.

For example, if the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word prediction model all predict that the public sensitive word is a blocking sensitive word, the public sensitive word is finally determined to be a blocking sensitive word. If the preset support vector machine sensitive word prediction model predicts that the public sensitive word is not a blocking sensitive word, while the other two models predict that it is, the public sensitive word is finally determined not to be a blocking sensitive word. Considering the prediction results of the different types of models together in this way further improves the prediction accuracy and ensures the accuracy of the screening results for the public sensitive word database.

Compared with the current practice of manually screening blocking sensitive words, the method provided by this embodiment obtains the public sensitive words to be predicted, inputs them into different types of preset sensitive word prediction models respectively to obtain the prediction results output by those models, and determines from those prediction results whether each public sensitive word is a blocking sensitive word. By constructing preset sensitive word prediction models and using them to predict the public sensitive words, the blocking sensitive words in the public sensitive word database are screened out automatically, which improves the screening efficiency while ensuring the accuracy of the screening results. In addition, constructing different types of preset sensitive word prediction models further improves the accuracy of the prediction results, ensures the reliability of the screening results, and reduces the workload of business personnel and labor costs.
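The unanimous rule in this example can be written as a short sketch. The function name and the dictionary keys are hypothetical; only the rule itself (blocking only if every model says so) comes from the text.

```python
def is_blocking_unanimous(predictions):
    """predictions maps a model name to True ('blocking') or False.

    The word is finally treated as blocking only if every model agrees.
    """
    if not predictions:
        raise ValueError("at least one model prediction is required")
    return all(predictions.values())


# All three models agree that the word is a blocking sensitive word.
all_agree = is_blocking_unanimous(
    {"svm": True, "gradient_boosting_tree": True, "proximity": True}
)

# The SVM model dissents, so the final decision is 'not blocking'.
one_dissents = is_blocking_unanimous(
    {"svm": False, "gradient_boosting_tree": True, "proximity": True}
)
```

This rule favors precision: a single dissenting model is enough to keep a word out of the blocking list.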

Further, in order to better illustrate the above prediction process for blocking sensitive words, as a refinement and expansion of the above embodiment, the embodiment of the present application provides another method for predicting blocking sensitive words. As shown in FIG. 2, the method includes:

201. Determine black short message samples and white short message samples in the historical short message data, and use a preset public sensitive word database to screen black sensitive word samples and white sensitive word samples out of the black short message samples and white short message samples respectively.

[Correction 19.08.2021 in accordance with Rule 91]
The historical short message data is the short message data sent by the company's business personnel. To build the preset sensitive word prediction models, the historical short message data is used as sample short message data. A black short message sample is a short message in the historical short message data sent by a number that was blocked by the operator. A white short message sample is a short message sent by a number that was not blocked by the operator, that is, any short message in the historical short message data other than the black short message samples. Black sensitive word samples are the sensitive words extracted from the black short message samples, and white sensitive word samples are the sensitive words extracted from the white short message samples. To obtain these samples, step 201 specifically includes: acquiring historical blocking information; determining, according to the time information and number information in the historical blocking information, that the short messages in the historical short message data sent by that number at that time are black short message samples and that the remaining short messages are white short message samples; performing word segmentation on the black short message samples and the white short message samples to obtain the word segments corresponding to each; and using the public sensitive words in the preset public sensitive word database to screen the black sensitive word samples and the white sensitive word samples out of the word segments corresponding to the black short message samples and the white short message samples respectively.
The historical blocking information is the operator's record of blocking the business company's numbers, and the business company can obtain it from the operator. It mainly includes the time and the number that were blocked, for example that mobile phone number 185××××××49 was blocked on June 29, 2020. The sensitive words in the preset public sensitive word database can be obtained publicly; after deduplication, 98,955 sensitive words are obtained. Some of these 98,955 sensitive words are blocking sensitive words and some are not, so they need to be screened.

[Correction 19.08.2021 in accordance with Rule 91]
Specifically, the business company first asks the operator for the historical blocking information. The historical blocking information only records that a given mobile phone number was blocked on a given day: it is accurate to the day, not to the hour, minute or second, and it does not identify specific text messages. All text messages in the historical short message data sent by the blocked number on that day can therefore be treated as black short message samples. For example, if mobile phone number 185××××××49 was blocked on June 29, 2020, all short messages in the historical short message data sent by 185××××××49 on June 29, 2020 are black short message samples, and the remaining short messages in the historical short message data are white short message samples. After the black and white short message samples are determined, word segmentation is performed on each of them. Specifically, a preset conditional random field word segmentation model can be used to segment the black short message samples and the white short message samples, giving the word segments corresponding to each black short message sample and to each white short message sample. The sensitive words in the preset public sensitive word database are then used to screen the black sensitive word samples and the white sensitive word samples out of those word segments. Because the embodiment of the present application screens with the public sensitive word database, the white short message samples will also contain sensitive words.
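The day-level labelling described above can be sketched as follows. The record layout (dictionaries with `number`, `sent_on` and `text` keys), the function name and the second phone number are hypothetical; only the rule comes from the text: a message is black when its sending number was blocked on the day it was sent, and white otherwise.

```python
from datetime import date


def split_black_white(messages, block_records):
    """Split messages into (black, white) using day-level blocking records.

    messages: list of dicts with 'number', 'sent_on' (a date) and 'text'.
    block_records: iterable of (number, blocked_on) pairs.
    """
    blocked = set(block_records)
    black, white = [], []
    for msg in messages:
        key = (msg["number"], msg["sent_on"])
        (black if key in blocked else white).append(msg)
    return black, white


# Blocking record taken from the example in the text.
records = [("185××××××49", date(2020, 6, 29))]

# Hypothetical historical messages.
history = [
    {"number": "185××××××49", "sent_on": date(2020, 6, 29), "text": "msg 1"},
    {"number": "185××××××49", "sent_on": date(2020, 6, 29), "text": "msg 2"},
    {"number": "185××××××49", "sent_on": date(2020, 6, 28), "text": "msg 3"},
    {"number": "number-B", "sent_on": date(2020, 6, 29), "text": "msg 4"},
]

black, white = split_black_white(history, records)
```

Because the records are only accurate to the day, every message from the blocked number on that day lands in the black set, including messages that did not trigger the block.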

For example, suppose a white short message sample reads: "Hello Mu Qiongxian, your loan is currently overdue. The system prompts that at 10:00 tomorrow your instalment repayment qualification will be closed, and you will need to settle your default payment in one go. At the same time, the amount will be uploaded to your bank credit report, and your subsequent credit report will show overdue concern or even sub-class. At that time, your cooperation with financial institutions, especially banks, will be limited. Please know". After screening with the public sensitive word database, the sensitive words in this white short message sample are determined to be {77: 'bank', 116: 'bank', 13: 'loan', 22: 'system', 119: 'cooperation'}. The number in front is the key and indicates the starting position of the sensitive word in the white short message sample, and the value after it indicates which sensitive word in the public sensitive word database was hit. Clearly, however, "bank", "loan" and "system" are not operators' blocking sensitive words.
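A position-to-word mapping like the one in this example can be produced with a short sketch. Note the simplification: the text describes segmenting the message with a conditional random field model and then matching segments against the lexicon, whereas this sketch uses plain substring search as a stand-in. The function name, sample text and lexicon are hypothetical.

```python
def find_sensitive_words(text, lexicon):
    """Return {start_position: word} for every lexicon hit in text.

    Plain substring search is used here as a stand-in for the
    segmentation-then-match procedure. If two words start at the same
    position, the one processed later overwrites the earlier one.
    """
    hits = {}
    for word in lexicon:
        if not word:
            continue  # skip empty entries, which would match everywhere
        start = text.find(word)
        while start != -1:
            hits[start] = word
            start = text.find(word, start + 1)
    return hits


# Hypothetical message and lexicon, for illustration only.
sample_text = "your loan is overdue, the bank system will report to the bank"
sample_lexicon = ["bank", "loan", "system", "selling blood"]

hits = find_sensitive_words(sample_text, sample_lexicon)
```

As in the example in the text, a word that occurs twice (here "bank") appears under two different keys.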

[Correction 19.08.2021 in accordance with Rule 91]
Further, once all the sensitive words in the black short message samples and the white short message samples have been obtained, the sensitive words in the white short message samples are certainly not blocking sensitive words in the operator's blocking word database. For the black short message samples, however, all text messages sent by the blocked number on the day in question were treated as black short message samples, so some of the sensitive words in them are probably not blocking sensitive words and should not be included in the black sensitive word samples. To improve the training accuracy of the models and determine the black sensitive word samples accurately, the method further includes: determining the sensitive word samples in the black sensitive word samples that coincide with the white sensitive word samples; and removing the coincident sensitive word samples from the black sensitive word samples to obtain the remaining samples of the black sensitive word samples.

For example, if the determined white sensitive word sample set is A, the black sensitive word sample set is B, and the intersection of A and B is C, then the remaining samples of the black sensitive word samples are B−C; that is, the words that also appear in the white sensitive word samples are removed from the black sensitive word sample set.
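The set operation in this example maps directly onto built-in set arithmetic. The word lists below are illustrative only; they reuse words mentioned elsewhere in the text and are not actual training data.

```python
# A: white sensitive word samples, B: black sensitive word samples.
white_words = {"bank", "loan", "system"}
black_words = {"bank", "loan", "selling blood", "selling kidney"}

# C: words that appear in both sets.
overlap = white_words & black_words

# B - C: black samples that remain after removing the overlap.
remaining_black = black_words - overlap
```

Since C is a subset of B, `black_words - white_words` gives the same result as `black_words - overlap`.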

202. Use the black-sensitive word samples and the white-sensitive word samples as training sets, and construct different types of preset sensitive word prediction models according to the training sets.

In the embodiment of the present application, since the sensitive words in the white sensitive word samples are certainly not blocking sensitive words, the part that overlaps with the white sensitive word samples is removed from the black sensitive word samples. On this basis, step 202 specifically includes: using the remaining samples and the white sensitive word samples as the training set, and constructing different types of preset sensitive word prediction models according to the training set.

Specifically, the sensitive words in the remaining samples are labelled 1 and the sensitive words in the white sensitive word samples are labelled 0. The labelled remaining samples and white sensitive word samples are used as the training set to construct the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word prediction model respectively. Whether a sensitive word to be predicted in the public sensitive word database is a blocking sensitive word is then determined from the outputs of the different types of models, which further improves the screening accuracy.
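The labelling step can be sketched as below. The function name and sample words are hypothetical, and the sketch stops at (word, label) pairs, because the text does not say how words are converted into model features.

```python
def build_training_set(remaining_black, white_words):
    """Label remaining black samples 1 and white samples 0.

    Returns a list of (word, label) pairs; words are sorted only to make
    the output order deterministic.
    """
    labelled = [(word, 1) for word in sorted(remaining_black)]
    labelled += [(word, 0) for word in sorted(white_words)]
    return labelled


# Illustrative inputs only.
training_set = build_training_set(
    remaining_black={"selling kidney", "selling blood"},
    white_words={"loan", "bank"},
)
```

The same labelled set would then be passed to each of the three model types for training.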

203. Obtain public sensitive words to be predicted.

The public sensitive words to be predicted are the sensitive words in the public sensitive word database, such as "loan", "bank", "system", "selling kidney" and "selling blood". To bring the business company's own sensitive word database closer to the operator's blocking sensitive word database, the blocking sensitive words need to be screened out of the sensitive words in the public sensitive word database; that is, blocking sensitive word prediction is performed for each sensitive word in the public sensitive word database.

204. Input the public sensitive words into different types of preset sensitive word prediction models respectively to predict the blocked sensitive words, and obtain prediction results output by the different types of preset sensitive word prediction models.

To improve the screening accuracy of blocking sensitive words, the public sensitive words can be input into the different types of preset sensitive word prediction models for prediction, and the prediction results output by the different types of models are then obtained. The specific process by which the preset sensitive word prediction models predict blocking sensitive words is the same as in step 102 and is not repeated here.

205. According to the prediction results output by the different types of preset sensitive word prediction models, determine whether the public sensitive words are blocked sensitive words.

In the embodiment of the present application, in order to determine whether the public sensitive word to be predicted is a blocking sensitive word, step 205 specifically includes: if the prediction results output by the different types of preset sensitive word prediction models are all that the public sensitive word is a blocking sensitive word, finally determining that the public sensitive word is a blocking sensitive word; if the prediction result output by any one of the different types of preset sensitive word prediction models is that the public sensitive word is not a blocking sensitive word, finally determining that the public sensitive word is not a blocking sensitive word. The different types of preset sensitive word prediction models may be, but are not limited to, the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word prediction model.

Further, in order to select more blocked sensitive words from the public sensitive word database, the output results of different types of sensitive word prediction models can be comprehensively considered. Based on this, step 205 also specifically includes:

Determine the prediction weights corresponding to the different types of preset sensitive word prediction models; determine whether the public sensitive words are blocked according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights sensitive words. Wherein, the different types of preset sensitive word prediction models include: a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model, and a preset proximity classification sensitive word prediction model. The prediction weight corresponding to the preset sensitive word prediction model of the type, including: respectively setting the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word The prediction weight corresponding to the prediction model; the determining whether the public sensitive word is a blocking sensitive word according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights includes: according to the The prediction results and their corresponding prediction weights output by the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word prediction model, and determine whether the public sensitive words are To block sensitive words.

For example, the prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model, and the preset proximity classification sensitive word prediction model are set to 0.5, 0.25, and 0.25, respectively. The prediction results output by the three models are weighted by their corresponding weights and summed, and the result is 0.7. Therefore, it is determined that the public sensitive word to be predicted is a blocked sensitive word.
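The weighted decision can be sketched as follows. The weights 0.5, 0.25, and 0.25 and the resulting sum of 0.7 come from the example above; the per-model scores (0.8, 0.6, 0.6), the interpretation of each output as a score in [0, 1], and the decision threshold of 0.5 are assumptions chosen only so that the arithmetic reproduces 0.7, since the text does not state them.

```python
from typing import Mapping


def weighted_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of per-model outputs; both mappings must share the same keys."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same models")
    return sum(weights[name] * scores[name] for name in weights)


def is_blocked_weighted(
    scores: Mapping[str, float],
    weights: Mapping[str, float],
    threshold: float = 0.5,  # assumed threshold, not given in the text
) -> bool:
    """Decide 'blocked' when the weighted sum reaches the threshold."""
    return weighted_score(scores, weights) >= threshold


# Weights from the example: SVM 0.5, gradient boosting tree 0.25, proximity 0.25.
WEIGHTS = {"svm": 0.5, "gbdt": 0.25, "proximity": 0.25}
# Hypothetical model outputs: 0.5*0.8 + 0.25*0.6 + 0.25*0.6 is approximately 0.7.
example_scores = {"svm": 0.8, "gbdt": 0.6, "proximity": 0.6}

print(round(weighted_score(example_scores, WEIGHTS), 2))
print(is_blocked_weighted(example_scores, WEIGHTS))
```

Compared with the unanimous rule, a weighted sum lets a word be selected even when one lower-weight model disagrees, which is consistent with the stated goal of selecting more blocked sensitive words.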

Compared with the current practice of manually screening blocked sensitive words, the other blocked sensitive word prediction method provided by this embodiment of the present application obtains the public sensitive words to be predicted; inputs the public sensitive words into different types of preset sensitive word prediction models respectively to perform blocked sensitive word prediction, and obtains the prediction results output by the different types of preset sensitive word prediction models; and determines, according to those prediction results, whether the public sensitive words are blocked sensitive words. By constructing preset sensitive word prediction models and using them to perform blocked sensitive word prediction on the public sensitive words, the blocked sensitive words in the public sensitive word database can be screened automatically, which improves the screening efficiency of blocked sensitive words while ensuring the accuracy of the screening results. In addition, constructing different types of preset sensitive word prediction models can further improve the accuracy of the prediction results and ensure the reliability of the screening results, while reducing the workload of business personnel and labor costs.

Further, as a specific implementation of FIG. 1, an embodiment of the present application provides a blocked sensitive word prediction device. As shown in FIG. 3, the device includes: an acquisition unit 31, a prediction unit 32, and a determining unit 33.

The acquisition unit 31 may be used to obtain the public sensitive words to be predicted. The acquisition unit 31 is the main functional module in the device for obtaining the public sensitive words to be predicted.

The prediction unit 32 may be used to input the public sensitive words into different types of preset sensitive word prediction models respectively to perform blocked sensitive word prediction, and to obtain the prediction results output by the different types of preset sensitive word prediction models. The prediction unit 32 is the main functional module in the device for performing this prediction and obtaining the prediction results, and is also a core module.

The determining unit 33 may be configured to determine whether the public sensitive word is a blocked sensitive word according to the prediction results output by the different types of preset sensitive word prediction models. The determining unit 33 is the main functional module in the device for determining whether the public sensitive words are blocked sensitive words according to the prediction results output by the different types of preset sensitive word prediction models, and is also a core module.
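The division of work among the three units can be sketched as a small class. This is only an illustrative outline: the class name, the method names, and the toy stand-in models are hypothetical, and the determining step shown uses the unanimous rule as one possible configuration.

```python
from typing import Callable, Dict, Iterable, List

# A model is any callable mapping a word to True ("blocked") or False.
Model = Callable[[str], bool]


class BlockedWordPredictionDevice:
    """Hypothetical composition of units 31, 32, and 33."""

    def __init__(self, models: Dict[str, Model]) -> None:
        if not models:
            raise ValueError("at least one prediction model is required")
        self.models = models

    def acquire(self, lexicon: Iterable[str]) -> List[str]:
        """Acquisition unit 31: obtain the public sensitive words to be predicted."""
        return list(lexicon)

    def predict(self, word: str) -> Dict[str, bool]:
        """Prediction unit 32: run the word through every model type."""
        return {name: model(word) for name, model in self.models.items()}

    def determine(self, results: Dict[str, bool]) -> bool:
        """Determining unit 33: blocked only if all models agree."""
        return all(results.values())

    def run(self, lexicon: Iterable[str]) -> Dict[str, bool]:
        """Chain the three units over a whole lexicon."""
        return {w: self.determine(self.predict(w)) for w in self.acquire(lexicon)}


# Toy stand-ins for trained models (hypothetical rules, for illustration only).
toy_models: Dict[str, Model] = {
    "svm": lambda w: w in {"loan", "prize"},
    "gbdt": lambda w: w in {"loan", "prize", "offer"},
    "proximity": lambda w: w in {"loan"},
}
device = BlockedWordPredictionDevice(toy_models)
decisions = device.run(["loan", "prize", "offer"])
print(decisions)
```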

Further, in order to determine whether the public sensitive words are blocked sensitive words, the determining unit 33 may be specifically configured to: if the prediction results output by the different types of preset sensitive word prediction models all indicate that the public sensitive word is a blocked sensitive word, finally determine that the public sensitive word is a blocked sensitive word;

and if the prediction result output by any one of the different types of preset sensitive word prediction models indicates that the public sensitive word is not a blocked sensitive word, finally determine that the public sensitive word is not a blocked sensitive word.

Further, in order to determine whether the public sensitive word is a blocked sensitive word, as shown in FIG. 4, the determining unit 33 includes: a determining module 331 and a determining module 332.

The determining module 331 may be configured to determine prediction weights corresponding to the different types of preset sensitive word prediction models.

The determining module 332 may be configured to determine whether the public sensitive word is a blocking sensitive word according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights.

In a specific application scenario, the different types of preset sensitive word prediction models include: a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model, and a preset proximity classification sensitive word prediction model. The determining module 331 may be specifically configured to respectively set the prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model, and the preset proximity classification sensitive word prediction model.

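The determining module 332 may be specifically configured to determine whether the public sensitive word is a blocked sensitive word according to the prediction results output by the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model, and the preset proximity classification sensitive word prediction model, and their corresponding prediction weights.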

Further, in order to construct different types of preset sensitive word prediction models, the apparatus further includes: a determining unit 34, a screening unit 35, and a construction unit 36.

The determining unit 34 may be configured to determine black short message samples and white short message samples in the historical short message data.

The screening unit 35 may be configured to screen out black sensitive word samples from the black short message samples and white sensitive word samples from the white short message samples by using a preset public sensitive word database.

The construction unit 36 may be configured to use the black sensitive word samples and the white sensitive word samples as a training set, and to build different types of preset sensitive word prediction models according to the training set.

Further, in order to determine the black short message samples and the white short message samples in the historical short message data, the determining unit 34 includes: an acquisition module 341 and a determining module 342.

The acquisition module 341 may be used to obtain historical blocking information.

The determining module 342 may be configured to determine, according to the time information and number information in the historical blocking information, that the short message data in the historical short message data sent by the number indicated by the number information at the time indicated by the time information are black short message samples, and that the remaining short message data are white short message samples.
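The labeling performed by the determining module 342 can be sketched as below. The record layout (`Sms`), the function name, and the choice to match time information at day granularity are assumptions for illustration; the application does not specify the granularity of the time information.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Sms:
    number: str    # sending number
    sent_on: date  # sending date (day granularity is an assumption)
    text: str


def split_black_white(
    history: List[Sms], blocked: Set[Tuple[str, date]]
) -> Tuple[List[Sms], List[Sms]]:
    """Messages whose (number, date) appears in the historical blocking
    information are black samples; all remaining messages are white samples."""
    black = [m for m in history if (m.number, m.sent_on) in blocked]
    white = [m for m in history if (m.number, m.sent_on) not in blocked]
    return black, white


# Hypothetical historical data and blocking records.
history = [
    Sms("13800000000", date(2020, 11, 1), "claim your prize now"),
    Sms("13800000000", date(2020, 11, 2), "your invoice is attached"),
    Sms("13900000000", date(2020, 11, 1), "loan repayment reminder"),
]
blocked_records = {("13800000000", date(2020, 11, 1))}

black_sms, white_sms = split_black_white(history, blocked_records)
print(len(black_sms), len(white_sms))
```

Note that the same number can contribute both black and white samples, because a message is black only when both the number and the time match a blocking record.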

Further, in order to determine the black sensitive word samples and the white sensitive word samples, the screening unit 35 includes: a word segmentation module 351 and a screening module 352.

The word segmentation module 351 may be configured to perform word segmentation processing on the black short message samples and the white short message samples, and to obtain the word segments corresponding to the black short message samples and the white short message samples respectively.

The screening module 352 may be configured to use each public sensitive word in the preset public sensitive word database to screen out the black sensitive word samples and the white sensitive word samples from the word segments corresponding to the black short message samples and the white short message samples respectively.
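The segmentation and screening steps can be sketched as follows. The whitespace-based `segment` function is only a stand-in: real short messages, particularly Chinese text, would need a proper word segmentation tool, which the application does not name. The lexicon and messages are hypothetical.

```python
from typing import Iterable, List, Set


def segment(text: str) -> List[str]:
    """Stand-in word segmentation: lowercase and split on whitespace."""
    return text.lower().split()


def screen_sensitive_words(messages: Iterable[str], lexicon: Set[str]) -> Set[str]:
    """Keep only the word segments that appear in the public sensitive word database."""
    return {tok for msg in messages for tok in segment(msg) if tok in lexicon}


# Hypothetical public sensitive word database and message samples.
LEXICON = {"loan", "prize", "invoice"}
black_messages = ["Claim your prize now", "Instant loan approval"]
white_messages = ["Your invoice is attached", "Loan repayment reminder"]

black_words = screen_sensitive_words(black_messages, LEXICON)
white_words = screen_sensitive_words(white_messages, LEXICON)
print(sorted(black_words), sorted(white_words))
```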

Further, in order to exclude the coincident sensitive word samples from the black sensitive word samples, the apparatus further includes an exclusion unit 37.

The determining unit 34 may also be configured to determine the sensitive word samples in the black sensitive word samples that coincide with the white sensitive word samples.

The exclusion unit 37 may be configured to exclude the coincident sensitive word samples from the black sensitive word samples to obtain the remaining samples in the black sensitive word samples.

The construction unit 36 may be specifically configured to use the remaining samples and the white sensitive word samples as a training set, and to build different types of preset sensitive word prediction models according to the training set.
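The exclusion step amounts to a set difference, sketched below with hypothetical sample sets. The function name and the label encoding (1 for the remaining black samples, 0 for white samples) are assumptions for illustration.

```python
from typing import Dict, Set, Tuple


def exclude_coincident(
    black_words: Set[str], white_words: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (remaining black samples, coincident samples)."""
    coincident = black_words & white_words
    return black_words - coincident, coincident


def build_training_set(remaining: Set[str], white_words: Set[str]) -> Dict[str, int]:
    """Label remaining black samples 1 and white samples 0 (assumed encoding)."""
    labels = {w: 0 for w in white_words}
    labels.update({w: 1 for w in remaining})
    return labels


# Hypothetical sample sets: "loan" appears in both black and white messages.
black_samples = {"prize", "loan"}
white_samples = {"invoice", "loan"}

remaining, coincident = exclude_coincident(black_samples, white_samples)
training_set = build_training_set(remaining, white_samples)
print(sorted(remaining), sorted(coincident))
```

Because the coincident words are removed from the black samples before labeling, no word ends up in the training set with two conflicting labels.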

It should be noted that, for other corresponding descriptions of the functional modules involved in the blocked sensitive word prediction device provided in the embodiments of the present application, reference may be made to the corresponding descriptions of the method shown in FIG. 1, and details are not repeated here.

Based on the above method as shown in FIG. 1, correspondingly, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the following steps are implemented: obtaining the public sensitive words to be predicted; inputting the public sensitive words into different types of preset sensitive word prediction models respectively to perform blocked sensitive word prediction, and obtaining the prediction results output by the different types of preset sensitive word prediction models; and determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive words are blocked sensitive words.

Optionally, when the program is executed by the processor, other steps of the method in the foregoing embodiment may also be implemented, which will not be repeated here. Further optionally, the storage medium involved in the present application, such as a computer-readable storage medium, may be non-volatile or volatile.

Based on the foregoing embodiments of the method shown in FIG. 1 and the apparatus shown in FIG. 3, an embodiment of the present application further provides a physical structure diagram of a computer device. As shown in FIG. 5, the computer device includes: a processor 41, a memory 42, and a computer program stored in the memory 42 and executable on the processor 41, wherein both the memory 42 and the processor 41 are arranged on a bus 43, and the processor 41 implements the following steps when executing the program: obtaining the public sensitive words to be predicted; inputting the public sensitive words into different types of preset sensitive word prediction models respectively to perform blocked sensitive word prediction, and obtaining the prediction results output by the different types of preset sensitive word prediction models; and determining, according to the prediction results output by the different types of preset sensitive word prediction models, whether the public sensitive words are blocked sensitive words.

Through the technical solution of the present application, the public sensitive words to be predicted can be obtained; the public sensitive words are respectively input into different types of preset sensitive word prediction models to perform blocked sensitive word prediction, and the prediction results output by the different types of preset sensitive word prediction models are obtained; and whether the public sensitive words are blocked sensitive words is determined according to those prediction results. By constructing preset sensitive word prediction models and using them to perform blocked sensitive word prediction on the public sensitive words, the blocked sensitive words in the public sensitive word database are screened automatically, which improves the screening efficiency of blocked sensitive words while ensuring the accuracy of the screening results. In addition, constructing different types of preset sensitive word prediction models can further improve the accuracy of the prediction results and ensure the reliability of the screening results, while reducing the workload of business personnel and labor costs.

Obviously, those skilled in the art should understand that the above modules or steps of the present application can be implemented by a general-purpose computing device; they can be centralized on a single computing device or distributed over a network composed of multiple computing devices. Alternatively, they may be implemented in program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that given here, or they may be fabricated separately into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. As such, the present application is not limited to any particular combination of hardware and software.

The above descriptions are only preferred embodiments of the present application, and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included within the protection scope of this application.

Claims

A block-sensitive word prediction method, comprising:

Obtain the public sensitive words to be predicted;

Inputting the public sensitive words into different types of preset sensitive word prediction models respectively to predict the blocked sensitive words, to obtain prediction results output by the different types of preset sensitive word prediction models;

According to the prediction results output by the different types of preset sensitive word prediction models, it is determined whether the public sensitive words are blocked sensitive words.
The method according to claim 1, wherein the determining whether the public sensitive word is a blocking sensitive word according to the prediction results output by the different types of preset sensitive word prediction models comprises:

If the prediction results output by the different types of preset sensitive word prediction models all indicate that the public sensitive words are blocked sensitive words, finally determining that the public sensitive words are blocked sensitive words;

If the prediction result output by any one of the different types of preset sensitive word prediction models indicates that the public sensitive word is not a blocked sensitive word, finally determining that the public sensitive word is not a blocked sensitive word.
The method according to claim 1, wherein the determining whether the public sensitive word is a blocking sensitive word according to the prediction results output by the different types of preset sensitive word prediction models comprises:

determining the prediction weights corresponding to the different types of preset sensitive word prediction models;

According to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights, it is determined whether the public sensitive words are blocked sensitive words.
The method according to claim 3, wherein the different types of preset sensitive word prediction models include: a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model, and a preset proximity classification sensitive word prediction model, and the determining the prediction weights corresponding to the different types of preset sensitive word prediction models comprises:

respectively setting the prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word prediction model;

Determining whether the public sensitive word is a blocking sensitive word according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights includes:

According to the prediction results output by the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word prediction model and their corresponding prediction weights, determining whether the public sensitive word is a blocked sensitive word.
The method according to claim 1, wherein, before the public sensitive words are respectively input into different types of preset sensitive word prediction models to perform blocked sensitive word prediction and the prediction results output by the different types of preset sensitive word prediction models are obtained, the method further comprises:

Determine the black SMS samples and white SMS samples in the historical SMS data;

Screen the black-sensitive word samples and white-sensitive word samples in the black short message samples and white short message samples respectively by using a preset public sensitive word database;

The black-sensitive word samples and the white-sensitive word samples are used as training sets, and different types of preset sensitive word prediction models are constructed according to the training sets.
The method according to claim 5, wherein the determining of the black short message samples and the white short message samples in the historical short message data comprises:

Obtain historical blocking information;

According to the time information and number information in the historical blocking information, it is determined that the short message data sent by the number information in the historical short message data under the time information is a black short message sample, and the remaining short message data is a white short message sample;

The screening of the black sensitive word samples and white sensitive word samples in the black short message samples and white short message samples respectively by using a preset public sensitive word database comprises:

Perform word segmentation processing on the black short message sample and the white short message sample, and obtain each word segmentation corresponding to the black short message sample and the white short message sample respectively;

Using each public sensitive word in the preset public sensitive word database, the black sensitive word sample and the white sensitive word sample are screened from each word segment corresponding to the black short message sample and the white short message sample respectively.
The method according to claim 5, wherein, after screening the black sensitive word samples and the white sensitive word samples in the black short message samples and the white short message samples respectively by using a preset public sensitive word database, the method further comprises:

Determine the sensitive word samples that coincide with the white sensitive word samples in the black sensitive word samples;

Excluding the coincident sensitive word samples from the black sensitive word samples to obtain the remaining samples in the black sensitive word samples;

The using of the black sensitive word samples and the white sensitive word samples as training sets, and the constructing of different types of preset sensitive word prediction models according to the training sets, comprises:

The remaining samples and the white-sensitive word samples are used as training sets, and different types of preset sensitive word prediction models are constructed according to the training sets.
A blocking-sensitive word prediction device, comprising:

The acquisition unit is used to acquire the public sensitive words to be predicted;

A prediction unit, configured to respectively input the public sensitive words into different types of preset sensitive word prediction models to perform block sensitive word prediction, and obtain prediction results output by the different types of preset sensitive word prediction models;

A determination unit, configured to determine whether the public sensitive word is a blocked sensitive word according to the prediction results output by the different types of preset sensitive word prediction models.
A computer-readable storage medium on which a computer program is stored, wherein, when the computer program is executed by a processor, the following steps are implemented:

Obtain the public sensitive words to be predicted;

The public sensitive words are respectively input into different types of preset sensitive word prediction models to predict the blocked sensitive words, and the prediction results output by the different types of preset sensitive word prediction models are obtained;

According to the prediction results output by the different types of preset sensitive word prediction models, it is determined whether the public sensitive words are blocked sensitive words.
The computer-readable storage medium according to claim 9, wherein the determining whether the public sensitive words are blocked sensitive words according to the prediction results output by the different types of preset sensitive word prediction models comprises:

If the prediction results output by the different types of preset sensitive word prediction models all indicate that the public sensitive words are blocked sensitive words, finally determining that the public sensitive words are blocked sensitive words;

If the prediction result output by any one of the different types of preset sensitive word prediction models indicates that the public sensitive word is not a blocked sensitive word, finally determining that the public sensitive word is not a blocked sensitive word.
The computer-readable storage medium according to claim 9, wherein the determining whether the public sensitive words are blocked sensitive words according to the prediction results output by the different types of preset sensitive word prediction models comprises:

determining the prediction weights corresponding to the different types of preset sensitive word prediction models;

According to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights, it is determined whether the public sensitive words are blocked sensitive words.
The computer-readable storage medium according to claim 11, wherein the different types of preset sensitive word prediction models include: a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model, and a preset proximity classification sensitive word prediction model, and the determining the prediction weights corresponding to the different types of preset sensitive word prediction models comprises:

respectively setting the prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word prediction model;

The determining whether the public sensitive words are blocked sensitive words according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights comprises:

According to the prediction results output by the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word prediction model and their corresponding prediction weights, determining whether the public sensitive word is a blocked sensitive word.
The computer-readable storage medium according to claim 9, wherein, before the public sensitive words are respectively input into different types of preset sensitive word prediction models to perform blocked sensitive word prediction and the prediction results output by the different types of preset sensitive word prediction models are obtained, the computer program, when executed by the processor, further implements:

Determine the black SMS samples and white SMS samples in the historical SMS data;

Screen the black-sensitive word samples and white-sensitive word samples in the black short message samples and white short message samples respectively by using a preset public sensitive word database;

The black-sensitive word samples and the white-sensitive word samples are used as training sets, and different types of preset sensitive word prediction models are constructed according to the training sets.
The computer-readable storage medium according to claim 13, wherein the determining of the black short message samples and the white short message samples in the historical short message data comprises:

Obtain historical blocking information;

According to the time information and number information in the historical blocking information, it is determined that the short message data sent by the number information in the historical short message data under the time information is a black short message sample, and the remaining short message data is a white short message sample;

The screening of the black sensitive word samples and white sensitive word samples in the black short message samples and white short message samples respectively by using a preset public sensitive word database comprises:

Perform word segmentation processing on the black short message sample and the white short message sample to obtain word segmentations corresponding to the black short message sample and the white short message sample respectively;

Using each public sensitive word in the preset public sensitive word database, the black sensitive word sample and the white sensitive word sample are screened from each word segment corresponding to the black short message sample and the white short message sample respectively.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the following steps:

Obtain the public sensitive words to be predicted;

Inputting the public sensitive words into different types of preset sensitive word prediction models respectively to predict the blocked sensitive words, to obtain prediction results output by the different types of preset sensitive word prediction models;

According to the prediction results output by the different types of preset sensitive word prediction models, it is determined whether the public sensitive words are blocked sensitive words.
The computer device according to claim 15, wherein the determining whether the public sensitive words are blocked sensitive words according to the prediction results output by the different types of preset sensitive word prediction models comprises:

If the prediction results output by the different types of preset sensitive word prediction models all indicate that the public sensitive words are blocked sensitive words, finally determining that the public sensitive words are blocked sensitive words;

If the prediction result output by any one of the different types of preset sensitive word prediction models indicates that the public sensitive word is not a blocked sensitive word, finally determining that the public sensitive word is not a blocked sensitive word.
The computer device according to claim 15, wherein the determining whether the public sensitive words are blocked sensitive words according to the prediction results output by the different types of preset sensitive word prediction models comprises:

determining the prediction weights corresponding to the different types of preset sensitive word prediction models;

According to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights, it is determined whether the public sensitive words are blocked sensitive words.
The computer device according to claim 17, wherein the different types of preset sensitive word prediction models include: a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model, and a preset proximity classification sensitive word prediction model, and the determining the prediction weights corresponding to the different types of preset sensitive word prediction models comprises:

respectively setting the prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word prediction model;

The determining whether the public sensitive words are blocked sensitive words according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights comprises:

According to the prediction results output by the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset proximity classification sensitive word prediction model and their corresponding prediction weights, determining whether the public sensitive word is a blocked sensitive word.
The computer device according to claim 15, wherein, before the public sensitive words are respectively input into different types of preset sensitive word prediction models to perform blocked sensitive word prediction and the prediction results output by the different types of preset sensitive word prediction models are obtained, the computer program, when executed by the processor, further implements:

Determine the black SMS samples and white SMS samples in the historical SMS data;

Screen the black-sensitive word samples and white-sensitive word samples in the black short message samples and white short message samples respectively by using a preset public sensitive word database;

The black-sensitive word samples and the white-sensitive word samples are used as training sets, and different types of preset sensitive word prediction models are constructed according to the training sets.
The computer device according to claim 19, wherein the determining of the black short message samples and the white short message samples in the historical short message data comprises:

Obtain historical blocking information;

According to the time information and number information in the historical blocking information, it is determined that the short message data sent by the number information in the historical short message data under the time information is a black short message sample, and the remaining short message data is a white short message sample;

The screening of the black sensitive word samples and white sensitive word samples in the black short message samples and white short message samples respectively by using a preset public sensitive word database comprises:

Perform word segmentation processing on the black short message sample and the white short message sample to obtain word segmentations corresponding to the black short message sample and the white short message sample respectively;

Using each public sensitive word in the preset public sensitive word database, the black sensitive word sample and the white sensitive word sample are screened from each word segment corresponding to the black short message sample and the white short message sample respectively.