CN112528636A

CN112528636A - Method and device for predicting stop sensitive words, computer equipment and storage medium

Info

Publication number: CN112528636A
Application number: CN202011434908.5A
Authority: CN
Inventors: 程华东; 侯翠琴; 李剑锋
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-12-10
Filing date: 2020-12-10
Publication date: 2021-03-19
Also published as: WO2022121164A1

Abstract

The invention discloses a method and a device for predicting stop sensitive words, computer equipment and a storage medium, which mainly aim at improving the screening efficiency and accuracy of stop sensitive words and reducing the workload of business personnel. The method comprises the following steps: obtaining public sensitive words to be predicted; respectively inputting the public sensitive words into different types of preset sensitive word prediction models to perform stop sensitive word prediction, and obtaining prediction results output by the different types of preset sensitive word prediction models; and judging whether the public sensitive words are stop sensitive words or not according to the prediction results output by the different types of preset sensitive word prediction models. The method is mainly suitable for predicting the stop sensitive words.

Description

Method and device for predicting stop sensitive words, computer equipment and storage medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method and a device for predicting stop sensitive words, computer equipment and a storage medium.

Background

Operators usually maintain their own lists of stop sensitive words. When a short message sent by a user contains one of these words, the user's number is blocked, and the user must visit a business hall to have it unblocked before the number can be used again, which is very inconvenient. Related service companies therefore need to maintain their own sensitive word lists, keep them as close as possible to the operators' stop sensitive word lists, and use them to give early warning on short messages sent within the company, so that the company's internal numbers are not blocked.

At present, when maintaining a service company's sensitive word list, business personnel usually screen stop sensitive words from a public sensitive word bank according to historical short message data of blocked numbers. However, manual screening is strongly affected by subjective factors, and stop sensitive words are likely to be missed or wrongly selected, so the screening efficiency and accuracy are low and the workload of business personnel is greatly increased.

Disclosure of Invention

The invention provides a method and a device for predicting stop sensitive words, computer equipment and a storage medium, which mainly aim at improving the screening efficiency and accuracy of stop sensitive words and reducing the workload of business personnel.

According to a first aspect of the present invention, there is provided a method for predicting stop-sensitive words, comprising:

obtaining public sensitive words to be predicted;

respectively inputting the public sensitive words into different types of preset sensitive word prediction models to perform stop sensitive word prediction, and obtaining prediction results output by the different types of preset sensitive word prediction models;

and judging whether the public sensitive words are stop sensitive words or not according to the prediction results output by the different types of preset sensitive word prediction models.

According to a second aspect of the present invention, there is provided a stop-sensitive word prediction apparatus comprising:

the acquisition unit is used for acquiring public sensitive words to be predicted;

the prediction unit is used for respectively inputting the public sensitive words into different types of preset sensitive word prediction models to carry out stop sensitive word prediction, and obtaining prediction results output by the different types of preset sensitive word prediction models;

and the judging unit is used for judging whether the public sensitive words are stop sensitive words or not according to the prediction results output by the different types of preset sensitive word prediction models.

According to a third aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:

obtaining public sensitive words to be predicted;

According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the program:

obtaining public sensitive words to be predicted;

Compared with the current manual screening of stop sensitive words, the method and device for predicting stop sensitive words, the computer equipment and the storage medium provided by the invention obtain the public sensitive words to be predicted; respectively input the public sensitive words into different types of preset sensitive word prediction models to perform stop sensitive word prediction, and obtain the prediction results output by the different types of preset sensitive word prediction models; and judge whether the public sensitive words are stop sensitive words according to those prediction results. By constructing the preset sensitive word prediction models and using them to perform stop sensitive word prediction on the public sensitive words, the stop sensitive words in the public sensitive word bank are screened automatically, which improves the screening efficiency while ensuring the accuracy of the screening results.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flowchart illustrating a method for predicting stop-sensitive words according to an embodiment of the present invention;

FIG. 2 is a flow chart of another method for predicting stop-sensitive words according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a stop sensitive word prediction apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of another stop sensitive word prediction apparatus according to an embodiment of the present invention;

FIG. 5 is a physical structure diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

In order to solve the above problem, an embodiment of the present invention provides a method for predicting stop sensitive words, as shown in FIG. 1, where the method includes:

101. Acquiring the public sensitive words to be predicted.

The public sensitive words to be predicted are sensitive words in a public sensitive word bank, such as loan, bank, system, kidney selling and blood selling. The public sensitive word bank records a large number of public sensitive words, and its vocabulary can reach hundreds of thousands. If a service company used the public sensitive word bank directly for short message early warning, a large number of short messages would be intercepted and could not be sent. The stop sensitive words therefore need to be screened from the public sensitive word bank to obtain a sensitive word bank that is the same as or similar to the operator's stop sensitive word bank. To overcome the drawbacks of manually selecting stop sensitive words in the prior art, the embodiment of the invention constructs preset sensitive word prediction models and uses them to predict each sensitive word in the public sensitive word bank, thereby automatically selecting the stop sensitive words in the public sensitive word bank. The execution subject of the embodiment of the invention is a device or equipment capable of predicting the public sensitive words, which may be arranged on the client side or the server side.

102. And respectively inputting the public sensitive words into different types of preset sensitive word prediction models to carry out stop sensitive word prediction, and obtaining prediction results output by the different types of preset sensitive word prediction models.

The different types of preset sensitive word prediction models comprise a preset support vector machine sensitive word prediction model, a preset gradient lifting tree (gradient boosting tree) sensitive word prediction model and a preset adjacent classification (nearest-neighbor classification) sensitive word prediction model. It should be noted that the different types of preset sensitive word prediction models in the embodiment of the present invention are not limited to the above types. The specific architecture of the preset support vector machine sensitive word prediction model is as follows:

$$y = g_{w,b}(w^{T}x + b)$$

wherein (x, y) is a training sample. The white sensitive word samples and the black sensitive word samples are used as training samples to train the initial support vector machine model and construct the preset support vector machine sensitive word prediction model. Specifically, the model is optimized to maximize the minimum geometric distance from all the white sensitive word samples and black sensitive word samples to the separating hyperplane, so the objective function is:

$$\max_{w,b}\ \min_{i}\ \frac{\lvert w^{T}x_i + b \rvert}{\lVert w \rVert}$$

wherein x_i is the i-th training sample. The parameters w and b in the initial support vector machine model are continuously optimized through the objective function, and the preset support vector machine sensitive word prediction model is finally obtained through training.
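As a minimal sketch of the quantity being maximized, the following Python snippet computes the minimum geometric distance from a set of samples to a hyperplane and compares two candidate hyperplanes. The two-dimensional feature vectors and the candidate values of w and b are hypothetical; the text does not state how sensitive words are converted into feature vectors.

```python
import math


def decision_value(w, b, x):
    """Return w^T x + b for one sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def min_geometric_distance(w, b, samples):
    """Smallest distance |w^T x + b| / ||w|| over all samples."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(decision_value(w, b, x)) / norm for x in samples)


# Hypothetical feature vectors: two white samples followed by two black samples.
samples = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (4.0, 4.0)]

# Two candidate hyperplanes; training would prefer the one with the larger value.
margin_a = min_geometric_distance((1.0, 1.0), -4.0, samples)
margin_b = min_geometric_distance((1.0, 0.0), -2.0, samples)
```

Under these assumed values, the first hyperplane leaves a larger minimum distance than the second, so it is the better candidate according to the objective described above.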

Further, the specific architecture of the preset gradient lifting tree sensitive word prediction model is as follows:

$$f_M(x) = \sum_{m=1}^{M} T(x; \theta_m)$$

wherein T represents a decision tree, M represents the number of decision trees, θ_m represents the parameters of the m-th decision tree, and x represents the white sensitive word samples and black sensitive word samples. The lifting tree adopts a forward stagewise algorithm: first, f_0(x) = 0 is determined, and the model at the m-th step is:

$$f_m(x) = f_{m-1}(x) + T(x; \theta_m)$$

The parameter θ_m of the decision tree is determined by empirical risk minimization, which gives the following objective function:

$$\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{N} L\big(y_i,\ f_{m-1}(x_i) + T(x_i; \theta_m)\big)$$

wherein L is the loss function, N is the number of training samples, and y_i is the label of the sample x_i.

Therefore, with the white sensitive word samples and black sensitive word samples used as the training set, the parameters in the initial gradient lifting tree model can be continuously optimized through the constructed objective function, and the preset gradient lifting tree sensitive word prediction model is finally obtained.
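The forward stagewise procedure can be sketched as follows. This is an illustration under stated assumptions rather than the model used in the embodiment: each sample is reduced to a single hypothetical scalar feature, each decision tree is a one-split stump, and the loss is squared error, so minimizing the empirical risk at each step amounts to fitting the stump to the current residuals.

```python
def fit_stump(xs, residuals):
    """Choose the split and two leaf values that minimise squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)  # left always contains the threshold point
        right_value = sum(right) / len(right) if right else 0.0
        loss = sum((r - left_value) ** 2 for r in left)
        loss += sum((r - right_value) ** 2 for r in right)
        if best is None or loss < best[0]:
            best = (loss, threshold, left_value, right_value)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted_stumps(xs, ys, n_trees):
    """Forward stagewise fitting: f_0 = 0, f_m = f_{m-1} + T(x; theta_m)."""
    stumps = []
    predictions = [0.0] * len(xs)  # f_0(x) = 0
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + stump_predict(stump, x) for p, x in zip(predictions, xs)]
    return stumps


def boosted_predict(stumps, x):
    """f_M(x): the sum of all fitted trees."""
    return sum(stump_predict(s, x) for s in stumps)


# Hypothetical one-dimensional features; label 1 = black sample, 0 = white sample.
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 1, 1]
stumps = fit_boosted_stumps(xs, ys, n_trees=3)
```

A score above an assumed cut-off of 0.5 from `boosted_predict` would then be read as "stop sensitive word"; the text itself does not specify the loss function or the cut-off.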

Further, for the preset adjacent classification sensitive word prediction model, the sensitive words in the white sensitive word samples are not stop sensitive words, and the sensitive words in the black sensitive word samples are stop sensitive words. The Euclidean distance between the public sensitive word to be predicted and the white sensitive word samples and the Euclidean distance between the public sensitive word to be predicted and the black sensitive word samples can therefore be calculated respectively. If the Euclidean distance between the public sensitive word and the white sensitive word samples is smaller than that between the public sensitive word and the black sensitive word samples, the public sensitive word is considered to belong to the same class as the white sensitive word samples, that is, the public sensitive word is not a stop sensitive word. If the Euclidean distance between the public sensitive word and the white sensitive word samples is greater than that between the public sensitive word and the black sensitive word samples, the public sensitive word is considered to belong to the same class as the black sensitive word samples, that is, the public sensitive word is a stop sensitive word. The Euclidean distance between the public sensitive word and a white sensitive word sample or a black sensitive word sample is calculated according to the following formula:

$$d = \sqrt{\sum_{i=1}^{n} (X_i - x_i)^2}$$

wherein (X_1, X_2, …, X_n) is the public sensitive word to be predicted, (x_1, x_2, …, x_n) is a white sensitive word sample or a black sensitive word sample, and d is the Euclidean distance between the public sensitive word and any white sensitive word sample or any black sensitive word sample. The Euclidean distances between the public sensitive word and each white sensitive word sample are added, and the Euclidean distances between the public sensitive word and each black sensitive word sample are likewise added. The two sums are compared to judge whether the public sensitive word is closer to the white sensitive word samples or to the black sensitive word samples, and whether the public sensitive word is a stop sensitive word is determined according to the judgment result.
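The distance comparison described above can be sketched in a few lines of Python. The feature vectors are hypothetical, since the text does not specify how a word is represented as (X_1, …, X_n), and the handling of an exact tie between the two sums is an assumption.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def is_stop_word_by_distance(candidate, white_samples, black_samples):
    """Compare the summed distances to the white and to the black samples."""
    white_total = sum(euclidean(candidate, s) for s in white_samples)
    black_total = sum(euclidean(candidate, s) for s in black_samples)
    # Assumption: a tie is treated as "not a stop sensitive word".
    return black_total < white_total


# Hypothetical two-dimensional feature vectors.
white_samples = [(0.0, 0.0), (0.0, 1.0)]
black_samples = [(3.0, 4.0), (4.0, 4.0)]

near_black = is_stop_word_by_distance((3.0, 3.0), white_samples, black_samples)
near_white = is_stop_word_by_distance((0.5, 0.5), white_samples, black_samples)
```

Because the sums are taken over all samples of each class, the comparison is sensitive to the two classes having different sizes; the sketch follows the description as written and does not normalize for this.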

For the embodiment of the invention, in order to automatically screen the stop sensitive words in the public sensitive word bank and ensure the reliability of the screening result, the sensitive words to be predicted in the public sensitive word bank are respectively input into the different types of preset sensitive word prediction models to perform stop sensitive word prediction, and the prediction results output by the different types of preset sensitive word prediction models are obtained. Specifically, the sensitive words to be predicted are respectively input into the preset support vector machine sensitive word prediction model, the preset gradient lifting tree sensitive word prediction model and the preset adjacent classification sensitive word prediction model to obtain the prediction result corresponding to each model, so that whether the public sensitive words to be predicted are stop sensitive words can be judged according to these prediction results. In this way, the stop sensitive words can be screened automatically from the public sensitive word bank.

103. And judging whether the public sensitive words are stop sensitive words or not according to the prediction results output by the different types of preset sensitive word prediction models.

If a stop sensitive word, such as blood selling or kidney selling, appears in a short message sent by a user, the number that sent the short message is blocked. The prediction result is either that the public sensitive word is a stop sensitive word or that the public sensitive word is not a stop sensitive word. If the prediction results output by all of the different types of preset sensitive word prediction models are that the public sensitive word is a stop sensitive word, the public sensitive word is finally determined to be a stop sensitive word; if the output result of any type of preset sensitive word prediction model is that the public sensitive word is not a stop sensitive word, the public sensitive word is finally determined not to be a stop sensitive word.

For example, if the prediction results output by the preset support vector machine sensitive word prediction model, the preset gradient lifting tree sensitive word prediction model and the preset adjacent classification sensitive word prediction model are all that the public sensitive word is a stop sensitive word, the public sensitive word is finally determined to be a stop sensitive word. If the prediction result output by the preset support vector machine sensitive word prediction model is that the public sensitive word is not a stop sensitive word, while the prediction results output by the preset gradient lifting tree sensitive word prediction model and the preset adjacent classification sensitive word prediction model are that the public sensitive word is a stop sensitive word, the public sensitive word is finally considered not to be a stop sensitive word. The prediction results of the different types of preset sensitive word prediction models are thus considered together, which further improves the prediction precision and ensures the accuracy of the screening result for stop sensitive words in the public sensitive word bank.
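The unanimous decision rule in the two cases above can be written as a short Python sketch. The dictionary keys naming the three models are hypothetical labels chosen for illustration.

```python
def is_stop_word_unanimous(predictions):
    """Return True only if every model predicts 'stop sensitive word'."""
    if not predictions:
        raise ValueError("at least one model prediction is required")
    return all(predictions.values())


# First case: all three models agree that the word is a stop sensitive word.
all_agree = is_stop_word_unanimous({"svm": True, "boosting_tree": True, "neighbor": True})

# Second case: the support vector machine model disagrees, so the word is rejected.
one_dissent = is_stop_word_unanimous({"svm": False, "boosting_tree": True, "neighbor": True})
```

This rule favors precision: a single dissenting model is enough to keep a word out of the stop sensitive word list.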

Compared with the conventional manual screening of stop sensitive words, the stop sensitive word prediction method provided by the embodiment of the invention obtains the public sensitive words to be predicted; respectively inputs the public sensitive words into different types of preset sensitive word prediction models to perform stop sensitive word prediction, and obtains the prediction results output by the different types of preset sensitive word prediction models; and judges whether the public sensitive words are stop sensitive words according to those prediction results. By constructing the preset sensitive word prediction models and using them to perform stop sensitive word prediction on the public sensitive words, the stop sensitive words in the public sensitive word bank are screened automatically, which improves the screening efficiency while ensuring the accuracy of the screening results.

Further, in order to better explain the prediction process of stop sensitive words, as a refinement and extension of the above embodiment, an embodiment of the present invention provides another stop sensitive word prediction method, as shown in FIG. 2, where the method includes:

201. Determining black short message samples and white short message samples in historical short message data, and using a preset public sensitive word bank to screen black sensitive word samples and white sensitive word samples from the black short message samples and the white short message samples respectively.

The historical short message data is short message data sent by the company's service personnel, and it is used as sample short message data for constructing the preset sensitive word prediction models. A black short message sample is a short message in the historical short message data sent by a number blocked by the operator. A white short message sample is a short message sent by a number not blocked by the operator, that is, the short message samples remaining in the historical short message data after the black short message samples are removed. A black sensitive word sample is a sensitive word extracted from the black short message samples, and a white sensitive word sample is a sensitive word extracted from the white short message samples. Step 201 specifically includes: acquiring historical stop information; according to the time information and number information in the historical stop information, determining the short message data sent by that number at that time in the historical short message data as black short message samples, and determining the remaining short message data as white short message samples; performing word segmentation on the black short message samples and the white short message samples to obtain the segmented words corresponding to each; and using the public sensitive words in the preset public sensitive word bank to screen black sensitive word samples and white sensitive word samples from the segmented words corresponding to the black short message samples and the white short message samples respectively.
The historical stop information is information on the operator's blocking of the service company's numbers, which the service company can request from the operator. It mainly comprises the time information and number information of the blocked numbers; for example, the mobile phone number 18589014949 was blocked on 2020.6.29. The sensitive words in the preset public sensitive word bank can be obtained from public sources, and 98955 sensitive words are obtained after deduplication. Some of these 98955 sensitive words are stop sensitive words and some are not, so the stop sensitive words need to be selected.

Specifically, the service company first requests the historical stop information from the operator. The historical stop information only records that a given mobile phone number was blocked on a certain day; it is accurate to the day only, not to the hour, minute or second, and not to a specific short message. All short messages sent on that day by the blocked number in the historical short message data are therefore considered black short message samples. For example, if the mobile phone number 18589014949 was blocked on 2020.6.29, all short messages sent by 18589014949 on 2020.6.29 in the historical short message data are confirmed as black short message samples, and the remaining short message data in the historical short message data is determined to be white short message samples. After the black short message samples and the white short message samples are determined, word segmentation is performed on each of them; specifically, a preset conditional random field word segmentation model can be used to obtain the segmented words corresponding to each black short message sample and each white short message sample. The sensitive words in the preset public sensitive word bank are then used to screen the black sensitive word samples and the white sensitive word samples from the segmented words corresponding to the black short message samples and the white short message samples respectively.
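The day-level labeling rule can be sketched as follows. The record layout (dictionaries with `number`, `date` and `text` keys), the date format and the message texts are hypothetical; only the number 18589014949 and the date 2020.6.29 come from the example above.

```python
def split_black_white(messages, blocking_records):
    """Label each message black if its (number, date) pair was blocked, else white."""
    blocked = set(blocking_records)
    black, white = [], []
    for message in messages:
        key = (message["number"], message["date"])
        (black if key in blocked else white).append(message)
    return black, white


# Historical stop information: one (number, day) pair, accurate to the day only.
blocking_records = [("18589014949", "2020-06-29")]

# Hypothetical historical short message data.
messages = [
    {"number": "18589014949", "date": "2020-06-29", "text": "message A"},
    {"number": "18589014949", "date": "2020-06-29", "text": "message B"},
    {"number": "18589014949", "date": "2020-06-28", "text": "message C"},
    {"number": "13000000000", "date": "2020-06-29", "text": "message D"},
]

black_samples, white_samples = split_black_white(messages, blocking_records)
```

Both messages sent by the blocked number on the blocking day become black samples, even though only one of them may have triggered the block, which is why the later overlap-removal step is needed.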

For example, suppose a white short message sample reads: 'Hello, Mujon. Your current loan is overdue. The system indicates that at 10 o'clock tomorrow your installment repayment qualification will be closed and you will need to pay the full amount in default in one lump sum. The amount will also be reported to your bank credit record, which will subsequently show an overdue status of the concern class or even the substandard class. Your cooperation with financial institutions, especially banks, will then be restricted. Please be aware.' After screening with the public sensitive word bank, the sensitive words in the white short message sample are determined to be {77: 'bank', 116: 'bank', 13: 'loan', 22: 'system', 119: 'cooperation'}. Each key is the starting position of the sensitive word in the white short message sample, and each value indicates which sensitive word in the public sensitive word bank was hit. Obviously, however, words such as 'loan' and 'system' are not the operator's stop sensitive words.
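A minimal sketch of how such a position-to-word mapping could be produced from segmented words is given below. It assumes the word segmentation has already been done by some other component and that positions are counted over the concatenated tokens; the token list and the small lexicon are hypothetical and do not reproduce the positions in the example above.

```python
def match_sensitive_words(tokens, lexicon):
    """Map the start position of each token found in the lexicon to that token."""
    hits = {}
    position = 0
    for token in tokens:
        if token in lexicon:
            hits[position] = token
        position += len(token)  # positions are offsets in the concatenated tokens
    return hits


# Hypothetical segmentation result and public sensitive word bank.
tokens = ["the", "loan", "is", "overdue", "system", "notice"]
lexicon = {"loan", "system", "bank"}

hits = match_sensitive_words(tokens, lexicon)
```

The same function would be applied to every black and white short message sample, and the values of the resulting dictionaries collected into the black and white sensitive word samples.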

Further, after all the sensitive words in the black short message samples and the white short message samples are obtained, two points must be considered. The sensitive words in the white short message samples are definitely not stop sensitive words in the operator's stop word bank. For the black short message samples, however, all the short messages sent by a blocked number on the blocking day were treated as black short message samples, so some of the sensitive words in the black short message samples are probably not stop sensitive words, that is, not black sensitive words, and should not be included in the black sensitive word samples. In order to improve the training precision of the model by determining the black sensitive word samples accurately, the method further includes: determining the sensitive word samples in the black sensitive word samples that coincide with the white sensitive word samples; and eliminating the coincident sensitive word samples from the black sensitive word samples to obtain the residual samples in the black sensitive word samples.

For example, if the determined white sensitive word sample set is A, the determined black sensitive word sample set is B, and the intersection of A and B is C, the residual samples in the black sensitive word samples are determined to be B-C, that is, the sensitive words that also appear in the white sensitive word sample set are removed from the black sensitive word sample set.
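The set operation B-C maps directly onto Python sets. The contents of the two sets below are hypothetical and only reuse words mentioned elsewhere in the description.

```python
def remaining_black_words(black_words, white_words):
    """Remove from the black set every word that also occurs in the white set."""
    overlap = black_words & white_words  # C = A ∩ B
    return black_words - overlap  # B - C


# Hypothetical sample sets.
white_set = {"bank", "loan", "system", "cooperation"}  # A
black_set = {"loan", "bank", "kidney selling", "blood selling"}  # B

residual = remaining_black_words(black_set, white_set)
```

Here 'loan' and 'bank' are dropped from the black set because they also occur in messages that did not lead to blocking.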

202. And taking the black sensitive word samples and the white sensitive word samples as training sets, and constructing different types of preset sensitive word prediction models according to the training sets.

For the embodiment of the present invention, since the sensitive word in the white sensitive word sample is definitely not the stop sensitive word, the overlapping portion with the white sensitive word sample is excluded from the black sensitive word sample, and based on this, step 202 specifically includes: and taking the residual samples and the white sensitive word samples as training sets, and constructing different types of preset sensitive word prediction models according to the training sets.

Specifically, the sensitive words in the residual samples are labeled as 1, meanwhile, the sensitive words in the white sensitive word samples are labeled as 0, the labeled residual samples and the white sensitive word samples are used as sample training sets, and a preset support vector machine sensitive word prediction model, a preset gradient lifting tree sensitive word prediction model and a preset adjacent classification sensitive word prediction model are respectively constructed, so that whether the sensitive words to be predicted in the public sensitive word bank are the stop sensitive words or not is judged according to the output results of the sensitive word prediction models of different types, and the screening precision of the stop sensitive words can be further improved.
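The labeling step can be sketched as below, assuming the residual black samples and the white samples are available as sets of words (hypothetical contents). Converting each word into a feature vector for the three models is outside this sketch, because the text does not describe that step.

```python
def build_training_set(residual_black_words, white_words):
    """Return (word, label) pairs: 1 for residual black words, 0 for white words."""
    labelled = [(word, 1) for word in sorted(residual_black_words)]
    labelled += [(word, 0) for word in sorted(white_words)]
    return labelled


# Hypothetical samples.
residual_black = {"kidney selling", "blood selling"}
white = {"bank", "loan"}

training_set = build_training_set(residual_black, white)
```

The same labelled set would then be passed to each of the three model types so that their outputs are comparable.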

203. And acquiring the public sensitive words to be predicted.

The public sensitive words to be predicted are sensitive words in a public sensitive word bank, such as loan, bank, system, kidney selling and blood selling. In order to bring the service company's sensitive word bank closer to the operator's stop sensitive word bank, the stop sensitive words need to be selected from the sensitive words in the public sensitive word bank, that is, stop sensitive word prediction is performed on the sensitive words in the public sensitive word bank.

204. And respectively inputting the public sensitive words into different types of preset sensitive word prediction models to carry out stop sensitive word prediction, and obtaining prediction results output by the different types of preset sensitive word prediction models.

In order to improve the screening precision of the stop sensitive words, the public sensitive words can be input into the different types of preset sensitive word prediction models for prediction, so that the prediction results output by the different types of preset sensitive word prediction models are obtained. The specific process of using the preset sensitive word prediction models for stop sensitive word prediction is the same as that in step 102 and is not repeated here.

205. And judging whether the public sensitive words are stop sensitive words or not according to the prediction results output by the different types of preset sensitive word prediction models.

For the embodiment of the present invention, in order to determine whether the public sensitive word to be predicted is a stop sensitive word, step 205 specifically includes: if the prediction results output by all of the different types of preset sensitive word prediction models are that the public sensitive word is a stop sensitive word, finally determining that the public sensitive word is a stop sensitive word; and if the prediction result output by any one of the different types of preset sensitive word prediction models is that the public sensitive word is not a stop sensitive word, finally determining that the public sensitive word is not a stop sensitive word. The different types of preset sensitive word prediction models can be, but are not limited to, a preset support vector machine sensitive word prediction model, a preset gradient lifting tree sensitive word prediction model and a preset adjacent classification sensitive word prediction model.

Further, in order to screen more stop sensitive words out of the public sensitive word bank, the output results of the different types of sensitive word prediction models may be weighed together. Based on this, step 205 further includes: determining the prediction weights corresponding to the different types of preset sensitive word prediction models; and judging whether the public sensitive word is a stop sensitive word according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights. The different types of preset sensitive word prediction models comprise a preset support vector machine sensitive word prediction model, a preset gradient lifting tree sensitive word prediction model and a preset adjacent classification sensitive word prediction model. Determining the prediction weights corresponding to the different types of preset sensitive word prediction models includes: respectively setting the prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient lifting tree sensitive word prediction model and the preset adjacent classification sensitive word prediction model. Judging whether the public sensitive word is a stop sensitive word according to the prediction results and the corresponding prediction weights includes: judging whether the public sensitive word is a stop sensitive word according to the prediction results output by the preset support vector machine sensitive word prediction model, the preset gradient lifting tree sensitive word prediction model and the preset adjacent classification sensitive word prediction model and their corresponding prediction weights.

For example, the prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset nearest-neighbor classification sensitive word prediction model are set to 0.5, 0.25 and 0.25 respectively, and the probabilities that the public sensitive word is a stop sensitive word in the prediction results output by these three models are 0.8, 0.7 and 0.5 respectively. Weighting the outputs of the models by their corresponding weights and summing gives 0.5 × 0.8 + 0.25 × 0.7 + 0.25 × 0.5 = 0.7, and the public sensitive word to be predicted is therefore determined to be a stop sensitive word.
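The weighted judgment in the example above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the model keys "svm", "gbdt" and "knn", the function names, and in particular the decision threshold of 0.5 are assumptions, since the description does not state the cut-off value.

```python
def weighted_stop_score(probabilities, weights):
    """Weighted sum of the per-model probabilities that a word is a stop sensitive word."""
    if probabilities.keys() != weights.keys():
        raise ValueError("each model needs exactly one probability and one weight")
    return sum(weights[name] * probabilities[name] for name in weights)


def is_stop_sensitive(probabilities, weights, threshold=0.5):
    # The threshold is an assumption; the description gives no cut-off value.
    return weighted_stop_score(probabilities, weights) >= threshold


# Values taken from the example in the description; the keys are hypothetical labels.
weights = {"svm": 0.5, "gbdt": 0.25, "knn": 0.25}
probabilities = {"svm": 0.8, "gbdt": 0.7, "knn": 0.5}

score = weighted_stop_score(probabilities, weights)
decision = is_stop_sensitive(probabilities, weights)
print(round(score, 4), decision)
```

Giving the support vector machine model the largest weight, as in the example, lets one model dominate the result without fully overriding the other two.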

Compared with the conventional approach of screening stop sensitive words manually, this further stop sensitive word prediction method provided by the embodiment of the invention obtains the public sensitive words to be predicted; inputs the public sensitive words into different types of preset sensitive word prediction models respectively to perform stop sensitive word prediction, obtaining the prediction results output by the different types of preset sensitive word prediction models; and judges whether the public sensitive words are stop sensitive words according to those prediction results. By constructing the preset sensitive word prediction models and using them to perform stop sensitive word prediction on the public sensitive words, the stop sensitive words in the public sensitive word library are screened automatically, which improves the screening efficiency of stop sensitive words while the accuracy of the screening results can also be ensured.

Further, as a specific implementation of fig. 1, an embodiment of the present invention provides a device for predicting stop sensitive words. As shown in fig. 3, the device includes: an obtaining unit 31, a prediction unit 32, and a determining unit 33.

The obtaining unit 31 may be configured to obtain a public sensitive word to be predicted. The obtaining unit 31 is the main functional module of the device for obtaining the public sensitive word to be predicted.

The prediction unit 32 may be configured to input the public sensitive word into different types of preset sensitive word prediction models respectively to perform stop sensitive word prediction, so as to obtain the prediction results output by the different types of preset sensitive word prediction models. The prediction unit 32 is the main functional module of the device for performing this prediction, and is also a core module.

The determining unit 33 may be configured to determine whether the public sensitive word is a stop sensitive word according to the prediction results output by the different types of preset sensitive word prediction models. The determining unit 33 is the main functional module of the device for making this determination, and is also a core module.

Further, in order to determine whether the public sensitive word is a stop sensitive word, the determining unit 33 may be specifically configured to: finally determine that the public sensitive word is a stop sensitive word if the prediction results output by all of the different types of preset sensitive word prediction models indicate that the public sensitive word is a stop sensitive word; and finally determine that the public sensitive word is not a stop sensitive word if the prediction result output by any one of the different types of preset sensitive word prediction models indicates that the public sensitive word is not a stop sensitive word.
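This unanimous rule can be sketched in a few lines. The function name and the representation of each model's output as a boolean are assumptions made for illustration only.

```python
def unanimous_decision(model_votes):
    """Return True only if every model predicts that the word is a stop sensitive word."""
    votes = list(model_votes)
    if not votes:
        raise ValueError("at least one model prediction is required")
    # A single negative prediction from any model makes the final result negative.
    return all(votes)


all_agree = unanimous_decision([True, True, True])
one_dissents = unanimous_decision([True, False, True])
print(all_agree, one_dissents)
```

Compared with the weighted scheme, this rule is stricter: it favors precision, since a word is kept only when no model objects.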

Further, in order to determine whether the public sensitive word is a stop sensitive word, as shown in fig. 4, the determining unit 33 includes: a determining module 331 and a decision module 332.

The determining module 331 may be configured to determine the prediction weights corresponding to the different types of preset sensitive word prediction models.

The decision module 332 may be configured to determine whether the public sensitive word is a stop sensitive word according to the prediction results output by the different types of preset sensitive word prediction models and their corresponding prediction weights.

In a specific application scenario, the different types of preset sensitive word prediction models include: a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model and a preset nearest-neighbor classification sensitive word prediction model. The determining module 331 may be specifically configured to set the prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset nearest-neighbor classification sensitive word prediction model respectively.

The decision module 332 may be specifically configured to determine whether the public sensitive word is a stop sensitive word according to the prediction results output by the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset nearest-neighbor classification sensitive word prediction model, and their corresponding prediction weights.

Further, in order to construct the different types of preset sensitive word prediction models, the device further includes: a determining unit 34, a screening unit 35 and a constructing unit 36.

The determining unit 34 may be configured to determine black short message samples and white short message samples in the historical short message data.

The screening unit 35 may be configured to use a preset public sensitive word library to screen black sensitive word samples from the black short message samples and white sensitive word samples from the white short message samples.

The constructing unit 36 may be configured to use the black sensitive word samples and the white sensitive word samples as training sets, and construct different types of preset sensitive word prediction models according to the training sets.

Further, in order to determine the black short message samples and the white short message samples in the historical short message data, the determining unit 34 includes: an obtaining module 341 and a determining module 342.

The obtaining module 341 may be configured to obtain historical stop information.

The determining module 342 may be configured to determine, according to the time information and the number information in the historical stop information, that the short message data in the historical short message data sent by the number indicated by the number information at the time indicated by the time information are black short message samples, and that the remaining short message data are white short message samples.
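The labeling step can be sketched as below. The record layout (the dictionary keys "number", "sent_on" and "text"), the function name, and matching on the calendar date are all assumptions; the description does not specify the granularity of the time information.

```python
from datetime import date


def split_black_white(messages, stop_records):
    """Label messages as black or white using historical stop records.

    messages: iterable of dicts with 'number', 'sent_on' (a date) and 'text'.
    stop_records: iterable of (number, stop_date) pairs.
    Matching on the calendar date is an assumption made for this sketch.
    """
    stopped = set(stop_records)
    black, white = [], []
    for msg in messages:
        if (msg["number"], msg["sent_on"]) in stopped:
            black.append(msg)  # sent by a stopped number at the stop time
        else:
            white.append(msg)  # every remaining message
    return black, white


# Hypothetical data for illustration only.
stop_records = [("13800000001", date(2020, 6, 1))]
messages = [
    {"number": "13800000001", "sent_on": date(2020, 6, 1), "text": "message A"},
    {"number": "13800000001", "sent_on": date(2020, 6, 2), "text": "message B"},
    {"number": "13800000002", "sent_on": date(2020, 6, 1), "text": "message C"},
]
black, white = split_black_white(messages, stop_records)
print(len(black), len(white))
```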

Further, in order to determine the black sensitive word samples and the white sensitive word samples, the screening unit 35 includes: a word segmentation module 351 and a screening module 352.

The word segmentation module 351 may be configured to perform word segmentation on the black short message samples and the white short message samples to obtain the word segments corresponding to the black short message samples and to the white short message samples respectively.

The screening module 352 may be configured to use each public sensitive word in the preset public sensitive word library to screen the black sensitive word samples from the word segments corresponding to the black short message samples, and the white sensitive word samples from the word segments corresponding to the white short message samples.
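A minimal sketch of segmentation followed by screening is given below. The whitespace-based `segment` function is only a stand-in: Chinese short messages would require a dedicated word segmenter, which the description does not name. The lexicon contents and the sample messages are hypothetical.

```python
def segment(text):
    # Stand-in segmenter: lowercases and splits on whitespace.
    # A real deployment on Chinese text would need a proper word segmenter.
    return text.lower().split()


def screen_sensitive_words(texts, lexicon):
    """Collect the word segments of the given messages that appear in the public lexicon."""
    found = set()
    for text in texts:
        found.update(word for word in segment(text) if word in lexicon)
    return found


# Hypothetical public sensitive word library and message samples.
lexicon = {"loan", "lottery", "invoice"}
black_texts = ["win lottery now", "cheap loan offer"]
white_texts = ["your invoice is ready", "loan repayment reminder"]

black_words = screen_sensitive_words(black_texts, lexicon)
white_words = screen_sensitive_words(white_texts, lexicon)
print(sorted(black_words), sorted(white_words))
```

In this toy data the word "loan" appears in both sets, which is the situation the excluding unit is meant to handle.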

Further, in order to exclude the coincident sensitive word samples from the black sensitive word samples, the apparatus further includes an excluding unit 37.

The determining unit 34 may be further configured to determine the sensitive word samples among the black sensitive word samples that coincide with the white sensitive word samples.

The excluding unit 37 may be configured to exclude the coincident sensitive word samples from the black sensitive word samples, so as to obtain the remaining samples in the black sensitive word samples.

The constructing unit 36 may be specifically configured to use the remaining samples and the white sensitive word samples as training sets, and construct different types of preset sensitive word prediction models according to the training sets.
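The exclusion of coincident samples and the assembly of the training set can be sketched as follows. The label convention (1 for black, 0 for white), the function name and the sample words are assumptions for illustration; how the words are turned into features for the three model types is outside this sketch.

```python
def build_training_set(black_words, white_words):
    """Drop black samples that also occur as white samples, then label the rest."""
    coincident = black_words & white_words
    remaining_black = black_words - coincident
    # Label convention assumed here: 1 = stop sensitive (black), 0 = not (white).
    samples = [(word, 1) for word in sorted(remaining_black)]
    samples += [(word, 0) for word in sorted(white_words)]
    return samples, coincident


# Hypothetical sensitive word samples.
black_words = {"lottery", "loan"}
white_words = {"invoice", "loan"}

training_set, coincident = build_training_set(black_words, white_words)
print(training_set, coincident)
```

Removing the coincident words avoids giving the same word contradictory labels, which would otherwise add label noise to every model trained on the set.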

It should be noted that other corresponding descriptions of the functional modules related to the device for predicting stop-sensitive words provided in the embodiment of the present invention may refer to the corresponding description of the method shown in fig. 1, and are not described herein again.

Based on the method shown in fig. 1, correspondingly, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps: obtaining public sensitive words to be predicted; respectively inputting the public sensitive words into different types of preset sensitive word prediction models to perform stop sensitive word prediction, and obtaining prediction results output by the different types of preset sensitive word prediction models; and judging whether the public sensitive words are stop sensitive words or not according to the prediction results output by the different types of preset sensitive word prediction models.

Based on the above embodiments of the method shown in fig. 1 and the device shown in fig. 3, an embodiment of the present invention further provides a physical structure diagram of a computer device. As shown in fig. 5, the computer device includes: a processor 41, a memory 42, and a computer program stored on the memory 42 and executable on the processor 41, wherein the memory 42 and the processor 41 are both arranged on a bus 43, such that when the processor 41 executes the program, the following steps are performed: obtaining public sensitive words to be predicted; respectively inputting the public sensitive words into different types of preset sensitive word prediction models to perform stop sensitive word prediction, and obtaining prediction results output by the different types of preset sensitive word prediction models; and judging whether the public sensitive words are stop sensitive words or not according to the prediction results output by the different types of preset sensitive word prediction models.

With the above technical scheme, the public sensitive words to be predicted can be obtained; the public sensitive words are respectively input into different types of preset sensitive word prediction models to perform stop sensitive word prediction, obtaining the prediction results output by the different types of preset sensitive word prediction models; and whether the public sensitive words are stop sensitive words is judged according to those prediction results. By constructing the preset sensitive word prediction models and using them to perform stop sensitive word prediction on the public sensitive words, the stop sensitive words in the public sensitive word library are screened automatically, which improves the screening efficiency of stop sensitive words while the accuracy of the screening results can also be ensured.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for predicting stop-sensitive words, comprising:

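obtaining public sensitive words to be predicted;

respectively inputting the public sensitive words into different types of preset sensitive word prediction models to perform stop sensitive word prediction, and obtaining prediction results output by the different types of preset sensitive word prediction models;

and judging whether the public sensitive words are stop sensitive words or not according to the prediction results output by the different types of preset sensitive word prediction models.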

2. The method according to claim 1, wherein the judging whether the public sensitive words are stop sensitive words according to the prediction results output by the different types of preset sensitive word prediction models comprises:

if the prediction results output by all of the different types of preset sensitive word prediction models are that the public sensitive word is a stop sensitive word, finally determining that the public sensitive word is a stop sensitive word;

and if the prediction result output by any one of the different types of preset sensitive word prediction models is that the public sensitive word is not a stop sensitive word, finally determining that the public sensitive word is not a stop sensitive word.

3. The method according to claim 1, wherein the judging whether the public sensitive words are stop sensitive words according to the prediction results output by the different types of preset sensitive word prediction models comprises:

determining the prediction weight corresponding to the different types of preset sensitive word prediction models;

and judging whether the public sensitive words are stop sensitive words or not according to the prediction results output by the different types of preset sensitive word prediction models and the prediction weights corresponding to the prediction results.

4. The method according to claim 3, wherein the different types of preset sensitive word prediction models comprise: a preset support vector machine sensitive word prediction model, a preset gradient boosting tree sensitive word prediction model and a preset nearest-neighbor classification sensitive word prediction model, and the determining the prediction weights corresponding to the different types of preset sensitive word prediction models comprises:

respectively setting prediction weights corresponding to the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset nearest-neighbor classification sensitive word prediction model;

the judging whether the public sensitive words are stop sensitive words according to the prediction results output by the different types of preset sensitive word prediction models and the prediction weights corresponding to the prediction results comprises:

judging whether the public sensitive words are stop sensitive words according to the prediction results output by the preset support vector machine sensitive word prediction model, the preset gradient boosting tree sensitive word prediction model and the preset nearest-neighbor classification sensitive word prediction model, and their corresponding prediction weights.

5. The method according to claim 1, wherein before the public sensitive words are respectively input into the different types of preset sensitive word prediction models to perform stop sensitive word prediction and the prediction results output by the different types of preset sensitive word prediction models are obtained, the method further comprises:

determining black short message samples and white short message samples in historical short message data;

respectively screening black sensitive word samples and white sensitive word samples in the black short message samples and the white short message samples by using a preset public sensitive word bank;

and taking the black sensitive word samples and the white sensitive word samples as training sets, and constructing different types of preset sensitive word prediction models according to the training sets.

6. The method of claim 5, wherein the determining black short message samples and white short message samples in the historical short message data comprises:

acquiring historical stop information;

according to the time information and the number information in the historical stop information, determining that short message data sent by the number information under the time information in the historical short message data is a black short message sample, and determining that the rest short message data is a white short message sample;

the respectively screening black sensitive word samples and white sensitive word samples in the black short message samples and the white short message samples by using the preset public sensitive word bank comprises:

performing word segmentation processing on the black short message samples and the white short message samples to obtain the word segments corresponding to the black short message samples and the white short message samples respectively;

and screening black sensitive word samples and white sensitive word samples from the word segments corresponding to the black short message samples and the white short message samples respectively by using each public sensitive word in the preset public sensitive word bank.

7. The method of claim 5, wherein after the black sensitive word samples and the white sensitive word samples in the black short message samples and the white short message samples are respectively screened by using the preset public sensitive word bank, the method further comprises:

determining a sensitive word sample coincident with the white sensitive word sample in the black sensitive word sample;

removing the coincident sensitive word samples from the black sensitive word samples to obtain residual samples in the black sensitive word samples;

the taking the black sensitive word samples and the white sensitive word samples as training sets and constructing different types of preset sensitive word prediction models according to the training sets comprises:

and taking the residual samples and the white sensitive word samples as training sets, and constructing different types of preset sensitive word prediction models according to the training sets.

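8. A stop-sensitive word prediction apparatus, comprising:

an obtaining unit, configured to obtain public sensitive words to be predicted;

a prediction unit, configured to respectively input the public sensitive words into different types of preset sensitive word prediction models to perform stop sensitive word prediction, and obtain prediction results output by the different types of preset sensitive word prediction models;

and a determining unit, configured to judge whether the public sensitive words are stop sensitive words or not according to the prediction results output by the different types of preset sensitive word prediction models.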

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program implements the steps of the method of any one of claims 1 to 7 when executed by the processor.