CN113948065A - Method and system for screening error blocking words based on n-gram model - Google Patents
- Publication number
- CN113948065A (application number CN202111020788.9A)
- Authority
- CN
- China
- Prior art keywords
- words
- error
- text data
- gram model
- interception
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G06F40/211 — Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/216 — Parsing using statistical methods
- G10L15/063 — Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/26 — Speech to text systems
Abstract
The invention discloses a method and a system for screening erroneous interception words based on an n-gram model, and relates to the technical field of network security. The method comprises the following steps: acquiring text data transcribed from audio that was intercepted based on the interception words under a specific label; processing the text data through an n-gram model, and screening out of the text data, as backoff information, data not stored under the specific label; and determining, according to the backoff information, the sentences containing erroneous interception words. The method is suitable for intercepting forbidden and sensitive words, in particular in text transcribed from audio. Erroneous sentences and erroneously intercepted words can be found quickly, and the forbidden-word lexicon can subsequently be refined and optimized according to the erroneous interception words obtained, thereby improving both the interception accuracy of the corresponding interception words and the overall interception accuracy.
Description
Technical Field
The invention relates to the technical field of network security, and in particular to a method and a system for screening erroneous interception words based on an n-gram model.
Background
Content on the Internet keeps growing and often contains illegal or non-compliant information, so it must be audited and filtered to maintain a safe Internet environment and to meet business requirements.
At present, auditing is usually performed by configuring a forbidden-word lexicon and user-defined black/white word lists to intercept forbidden and sensitive words. However, this approach matches isolated words and can hardly exploit contextual semantics, so interception accuracy is low; for speech-to-text data in particular, homophones, near-homophones, and dialectal pronunciations reduce the accuracy further.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a method and a system for screening erroneous interception words based on an n-gram model; by screening out erroneous interception words, the interception accuracy of the corresponding interception words and the overall interception accuracy can be improved.
The technical scheme for solving the technical problems is as follows:
A method for screening erroneous interception words based on an n-gram model comprises the following steps:
acquiring text data transcribed from audio that was intercepted based on the interception words under a specific label;
processing the text data through an n-gram model, and screening out of the text data, as backoff information, data not stored under the specific label;
and determining, according to the backoff information, the sentences containing erroneous interception words.
Another technical solution of the present invention for solving the above technical problems is as follows:
A system for screening erroneous interception words based on an n-gram model comprises:
an acquisition unit for acquiring text data transcribed from audio that was intercepted based on the interception words under a specific label;
a processing unit for processing the text data through an n-gram model and screening out of the text data, as backoff information, data not stored under the specific label;
and a screening unit for determining, according to the backoff information, the sentences containing erroneous interception words.
The invention has the following beneficial effects: the method and the system for screening erroneous interception words are suitable for intercepting forbidden and sensitive words, in particular in text transcribed from audio. The backoff information is determined using the n-gram model, and the sentences containing erroneous interception words are determined from that information, so erroneous sentences and erroneously intercepted words can be found quickly; the forbidden-word lexicon can then be refined and optimized according to the erroneous interception words obtained, thereby improving both the interception accuracy of the corresponding interception words and the overall interception accuracy.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the method for screening erroneous interception words according to the present invention;
FIG. 2 is a diagram illustrating a ppl scoring result in an embodiment of the method for screening erroneous interception words according to the present invention;
FIG. 3 is a schematic structural framework diagram of an embodiment of the system for screening erroneous interception words according to the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the drawings; the embodiments are set forth to illustrate the invention and are not to be construed as limiting its scope.
As shown in fig. 1, an embodiment of the method for screening erroneous interception words according to the present invention is implemented based on an n-gram model and comprises:
S1, acquiring text data transcribed from audio that was intercepted based on the interception words under a specific label;
It should be noted that the specific label types may be set according to actual service requirements. For example, the labels may simply be divided into three categories: an A-field sensitive label, a B-field sensitive label, and a normal label. The interception words for each category of label may likewise be set according to actual requirements; for instance, the interception words of the A-field sensitive label may be A1, A2, and A3, where A1, A2, and A3 are each words to be intercepted in field A.
Interception errors may occur in text transcribed from audio. For example, a harmless word may be a near-homophone of the word "gambling"; if the harmless word is what occurs in the audio but it is mistranscribed as "gambling", and "gambling" is an interception word under a certain label, the transcribed text data is wrongly intercepted, which affects interception accuracy.
Specifically, those skilled in the art may transcribe audio into text data through an acoustic model; the specific acoustic model may be selected according to implementation requirements and is not described herein again.
S2, processing the text data through an n-gram model, and screening out of the text data, as backoff information, data not stored under the specific label;
It should be noted that the n-gram model is a probabilistic language model built on the assumption that the current word depends only on the preceding n-1 words. Its basic idea is to slide a window of size n over the text and collect the resulting sequence of length-n fragments.
Each fragment is called a gram; the occurrence frequency of all grams is counted and filtered against a preset threshold to form a key-gram list, i.e. the vector feature space of the text, in which each gram is one feature dimension.
The model assumes that the occurrence of the nth word is related only to the preceding n-1 words and to no other word, so the probability of a complete sentence is the product of the conditional probabilities of its words; these probabilities can be obtained by counting, directly in the corpus, how often the n words occur together.
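As a rough sketch of the counting just described (the toy corpus and function names are our own illustration, not the patent's implementation), a maximum-likelihood bigram model can be built by sliding a window over segmented sentences and taking the sentence probability as a product of conditional probabilities:

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Slide a window of size n over the token sequence; each window is one 'gram'."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy corpus of pre-segmented sentences under one label.
corpus = [
    ["we", "discuss", "the", "contract", "terms"],
    ["the", "contract", "terms", "are", "clear"],
]

N = 2  # bigram for brevity; the patent's example uses a 4-gram model
ngram_counts = Counter(g for sent in corpus for g in extract_ngrams(sent, N))
context_counts = Counter(g for sent in corpus for g in extract_ngrams(sent, N - 1))

def cond_prob(word, context):
    """Maximum-likelihood estimate of P(word | context) from corpus counts."""
    c = context_counts[tuple(context)]
    return ngram_counts[tuple(context) + (word,)] / c if c else 0.0

def sentence_prob(tokens):
    """Probability of the whole sentence as the product of per-word probabilities."""
    p = 1.0
    for i in range(N - 1, len(tokens)):
        p *= cond_prob(tokens[i], tokens[i - N + 1:i])
    return p
```

In this toy corpus "contract" is always followed by "terms", so `cond_prob("terms", ["contract"])` is 1.0, while a word never seen after "contract" gets probability 0 — a real model would smooth and back off instead of returning zero.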
For example, suppose that for a label concerning laws and regulations, text can be intercepted based on the label's specified interception words. At step S1 the intercepted audio is transcribed by ASR into an example sentence containing the word "criminals"; the sentence is then preprocessed (word-segmented), and the processed sentence is scored by perplexity (ppl) using a 4-gram language model, with the result shown in fig. 2.
In fig. 2, each row gives the probability computed for one word. Taking p(i | …) as an example, the probability computed for the word "i" is 0.0452354; since this is a 4-gram model, the probability depends only on the preceding three words.
The bracketed column [xgram] after the equals sign indicates which n-gram order was actually used when scoring the word. If it reads 1gram, there is no corresponding sentence or phrase in the model's corpus, and the appearance of the word is a pure matter of unigram probability. Thus, when a sentence scored against the n-gram language model trained on a specific label's data backs off to 1-gram at an interception word, the reliability of that interception is in doubt; against this phenomenon, the invention provides the present data-screening scheme to optimize the accuracy of the label.
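The [xgram] diagnostic can be reproduced with a minimal backoff lookup (a sketch under our own naming assumptions, not the patent's code): for each word, try the highest order first and fall back until a seen n-gram is found.

```python
from collections import Counter

def train_counts(corpus, max_n):
    """Count all n-grams of order 1..max_n over the corpus, with sentence padding."""
    counts = Counter()
    for sent in corpus:
        padded = ["<s>"] * (max_n - 1) + sent + ["</s>"]
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += 1
    return counts

def used_order(counts, history, word, max_n):
    """Highest n-gram order under which (history, word) was seen in training.
    A result of 1 mirrors the [1gram] marker in fig. 2: the word carries no
    contextual support in the label's corpus; 0 means out-of-vocabulary."""
    for n in range(max_n, 0, -1):
        ctx = tuple(history[-(n - 1):]) if n > 1 else ()
        if counts[ctx + (word,)] > 0:
            return n
    return 0
```

For instance, with a corpus containing "i study law", scoring "study" after "i" uses the full bigram, while "study" after "law" falls back to the unigram — the situation the invention flags.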
S3, determining the sentences containing erroneous interception words according to the backoff information.
For example, the intercepted text can be screened using the backoff information to obtain the sentences that contain erroneous interception words.
The method and the system for screening erroneous interception words are suitable for intercepting forbidden and sensitive words, in particular in text transcribed from audio. The backoff information is determined using the n-gram model, and the sentences containing erroneous interception words are determined from that information, so erroneous sentences and erroneously intercepted words can be found quickly; the forbidden-word lexicon can then be refined and optimized according to the erroneous interception words obtained, thereby improving both the interception accuracy of the corresponding interception words and the overall interception accuracy.
Optionally, in some possible embodiments, processing the text data through the n-gram model and screening out of the text data, as backoff information, data not stored under the specific label specifically comprises:
preprocessing the text data;
performing ppl scoring on the preprocessed text data through the n-gram model;
and, according to the ppl scoring result, taking the data that backed off to a 1-gram as the backoff information;
wherein the preprocessing is identical to the processing applied to the training data when the n-gram model was trained.
It should be understood that, as shown in fig. 2, if the result is 1gram, the word "criminals" has no corresponding sentence or phrase in the corpus of the legal label, and its appearance is a pure matter of probability; therefore, sentences whose interception word backed off to 1-gram can be screened out using the interception word corresponding to each sentence, thereby optimizing the interception accuracy under the legal label.
It should be noted that, to enable the n-gram model to recognize text data accurately, input data is usually preprocessed before being fed to the model; for example, a sentence must first be split into words (word segmentation). Therefore, when processing data, incoming text must undergo exactly the same preprocessing as was applied during training.
Preprocessing the text data in this way improves the processing efficiency and accuracy of the n-gram model.
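A minimal sketch of that constraint (the tokenizer below is a hypothetical stand-in; the real system would reuse the exact word-segmentation pipeline used to build the n-gram model):

```python
import re

def preprocess(text):
    """Toy stand-in for the training-time segmenter: lowercase, strip
    punctuation, split on whitespace. For Chinese text the real system
    would instead apply the same word-segmentation tool used at training."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

# The identical function must run over both the training corpus and incoming
# text; mismatched token boundaries would make the model back off spuriously.
train_tokens = preprocess("The contract terms are clear.")
query_tokens = preprocess("the contract TERMS are clear")
```

Because both sides go through the same function, the two token sequences above are identical, so a sentence seen at training time will not back off merely due to casing or punctuation differences.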
Optionally, in some possible embodiments, determining the sentences containing erroneous interception words according to the backoff information specifically comprises:
screening out, using the interception word corresponding to each sentence in the text data, the sentences whose interception word backed off to 1-gram.
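The screening step can be sketched as a filter over per-word backoff orders (the data layout is our illustrative assumption; in practice the orders would come from the ppl report of fig. 2):

```python
def screen_suspects(scored_sentences):
    """scored_sentences: list of (tokens, intercept_word, orders) tuples, where
    orders[i] is the n-gram order actually used to score tokens[i] (the
    [xgram] column). A sentence is suspect when its interception word was
    scored with a bare 1-gram, i.e. it lacks contextual support under the
    label and is likely an ASR transcription error."""
    return [
        (tokens, word)
        for tokens, word, orders in scored_sentences
        if any(t == word and o == 1 for t, o in zip(tokens, orders))
    ]
```

A sentence whose interception word was scored with full context passes through; only the 1-gram cases are flagged for human review.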
Optionally, in some possible embodiments, the method further comprises:
labeling the screened-out sentences containing erroneous interception words and adding them to acoustic training.
By labeling the screened-out sentences containing erroneous interception words and retraining acoustically, subsequent models can transcribe such sentences more accurately.
Optionally, in some possible embodiments, labeling the screened-out sentences containing erroneous interception words and adding them to acoustic training specifically comprises:
modifying the screened-out sentences containing erroneous interception words so that their content matches the audio as actually spoken;
and training an acoustic model with the labeled sentences containing erroneous interception words.
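Assembling the relabeled training set might look like the following sketch (the identifiers and the correction mapping are hypothetical; the corrections themselves would be made by human annotators):

```python
def build_training_pairs(suspects, corrections):
    """suspects: (audio_id, wrong_transcript) pairs flagged by the screening.
    corrections: human-made mapping audio_id -> transcript matching what was
    actually said. Returns (audio_id, corrected_text) pairs to feed back into
    acoustic-model training so similar audio is transcribed correctly later."""
    return [
        (audio_id, corrections[audio_id])
        for audio_id, _wrong in suspects
        if audio_id in corrections
    ]
```

Only sentences that have actually been corrected are paired with their audio; uncorrected suspects are simply held back rather than trained on.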
It is to be understood that some or all of the various embodiments described above may be combined in certain embodiments.
As shown in fig. 3, an embodiment of the system for screening erroneous interception words according to the present invention is implemented based on an n-gram model and comprises:
an obtaining unit 10, configured to acquire text data transcribed from audio that was intercepted based on the interception words under a specific label;
a processing unit 20, configured to process the text data through an n-gram model and screen out of the text data, as backoff information, data not stored under the specific label;
and a screening unit 30, configured to determine, according to the backoff information, the sentences containing erroneous interception words.
The system for screening erroneous interception words achieves the same beneficial effects as the method described above: erroneous sentences and erroneously intercepted words can be found quickly, and the forbidden-word lexicon can be refined accordingly, improving both the interception accuracy of the corresponding interception words and the overall interception accuracy.
Optionally, in some possible embodiments, the processing unit 20 is specifically configured to: preprocess the text data;
perform ppl scoring on the preprocessed text data through the n-gram model;
and, according to the ppl scoring result, take the data that backed off to a 1-gram as the backoff information;
wherein the preprocessing is identical to the processing applied to the training data when the n-gram model was trained.
Optionally, in some possible embodiments, the screening unit 30 is specifically configured to screen out, using the interception word corresponding to each sentence in the text data, the sentences whose interception word backed off to 1-gram.
Optionally, in some possible embodiments, the system further comprises:
a training unit configured to label the screened-out sentences containing erroneous interception words and add them to acoustic training.
Optionally, in some possible embodiments, the training unit is specifically configured to modify the screened-out sentences containing erroneous interception words so that their content matches the audio as actually spoken,
and to train an acoustic model with the labeled sentences containing erroneous interception words.
It is to be understood that some or all of the various embodiments described above may be combined in certain embodiments.
It should be noted that the above embodiments are product embodiments corresponding to previous method embodiments, and for the description of the product embodiments, reference may be made to corresponding descriptions in the above method embodiments, and details are not repeated here.
The reader should understand that in the description of this specification, reference to the description of the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the method embodiments described above are merely illustrative; the division into steps is only a logical functional division, and another division may be used in an actual implementation: multiple steps may be combined or integrated into another step, or some features may be omitted or not performed.
The above method, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method for screening erroneous interception words based on an n-gram model, characterized by comprising the following steps:
acquiring text data transcribed from audio that was intercepted based on the interception words under a specific label;
processing the text data through an n-gram model, and screening out of the text data, as backoff information, data not stored under the specific label;
and determining, according to the backoff information, the sentences containing erroneous interception words.
2. The method for screening erroneous interception words based on an n-gram model according to claim 1, characterized in that processing the text data through the n-gram model and screening out of the text data, as backoff information, data not stored under the specific label specifically comprises:
preprocessing the text data;
performing ppl scoring on the preprocessed text data through the n-gram model;
and, according to the ppl scoring result, taking the data that backed off to a 1-gram as the backoff information;
wherein the preprocessing is identical to the processing applied to the training data when the n-gram model was trained.
3. The method for screening erroneous interception words based on an n-gram model according to claim 2, characterized in that determining the sentences containing erroneous interception words according to the backoff information specifically comprises:
screening out, using the interception word corresponding to each sentence in the text data, the sentences whose interception word backed off to 1-gram.
4. The method for screening erroneous interception words based on an n-gram model according to any one of claims 1 to 3, characterized by further comprising:
labeling the screened-out sentences containing erroneous interception words and adding them to acoustic training.
5. The method for screening erroneous interception words based on an n-gram model according to claim 4, characterized in that labeling the screened-out sentences containing erroneous interception words and adding them to acoustic training specifically comprises:
modifying the screened-out sentences containing erroneous interception words so that their content matches the audio as actually spoken;
and training an acoustic model with the labeled sentences containing erroneous interception words.
6. A system for screening erroneous interception words based on an n-gram model, characterized by comprising:
an acquisition unit for acquiring text data transcribed from audio that was intercepted based on the interception words under a specific label;
a processing unit for processing the text data through an n-gram model and screening out of the text data, as backoff information, data not stored under the specific label;
and a screening unit for determining, according to the backoff information, the sentences containing erroneous interception words.
7. The system for screening erroneous interception words based on an n-gram model according to claim 6, characterized in that the processing unit is specifically configured to: preprocess the text data;
perform ppl scoring on the preprocessed text data through the n-gram model;
and, according to the ppl scoring result, take the data that backed off to a 1-gram as the backoff information;
wherein the preprocessing is identical to the processing applied to the training data when the n-gram model was trained.
8. The system for screening erroneous interception words based on an n-gram model according to claim 7, characterized in that the screening unit is specifically configured to screen out, using the interception word corresponding to each sentence in the text data, the sentences whose interception word backed off to 1-gram.
9. The system for screening erroneous interception words based on an n-gram model according to any one of claims 6 to 8, characterized by further comprising:
a training unit for labeling the screened-out sentences containing erroneous interception words and adding them to acoustic training.
10. The system for screening erroneous interception words based on an n-gram model according to claim 9, characterized in that the training unit is specifically configured to modify the screened-out sentences containing erroneous interception words so that their content matches the audio as actually spoken,
and to train an acoustic model with the labeled sentences containing erroneous interception words.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111020788.9A CN113948065B (en) | 2021-09-01 | 2021-09-01 | Method and system for screening error blocking words based on n-gram model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113948065A true CN113948065A (en) | 2022-01-18 |
CN113948065B CN113948065B (en) | 2022-07-08 |
Family
ID=79327642
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111020788.9A Active CN113948065B (en) | 2021-09-01 | 2021-09-01 | Method and system for screening error blocking words based on n-gram model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113948065B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101655837A (en) * | 2009-09-08 | 2010-02-24 | 北京邮电大学 | Method for detecting and correcting error on text after voice recognition |
CN107204184A (en) * | 2017-05-10 | 2017-09-26 | 平安科技(深圳)有限公司 | Audio recognition method and system |
CN107705787A (en) * | 2017-09-25 | 2018-02-16 | 北京捷通华声科技股份有限公司 | A kind of audio recognition method and device |
CN109151218A (en) * | 2018-08-21 | 2019-01-04 | 平安科技(深圳)有限公司 | Call voice quality detecting method, device, computer equipment and storage medium |
CN110134952A (en) * | 2019-04-29 | 2019-08-16 | 华南师范大学 | A kind of Error Text rejection method for identifying, device and storage medium |
CN110162767A (en) * | 2018-02-12 | 2019-08-23 | 北京京东尚科信息技术有限公司 | The method and apparatus of text error correction |
CN110442870A (en) * | 2019-08-02 | 2019-11-12 | 深圳市珍爱捷云信息技术有限公司 | Text error correction method, device, computer equipment and storage medium |
CN110600011A (en) * | 2018-06-12 | 2019-12-20 | 中国移动通信有限公司研究院 | Voice recognition method and device and computer readable storage medium |
CN111312209A (en) * | 2020-02-21 | 2020-06-19 | 北京声智科技有限公司 | Text-to-speech conversion processing method and device and electronic equipment |
CN111326144A (en) * | 2020-02-28 | 2020-06-23 | 网易(杭州)网络有限公司 | Voice data processing method, device, medium and computing equipment |
CN111369996A (en) * | 2020-02-24 | 2020-07-03 | 网经科技(苏州)有限公司 | Method for correcting text error in speech recognition in specific field |
CN112447172A (en) * | 2019-08-12 | 2021-03-05 | 云号(北京)科技有限公司 | Method and device for improving quality of voice recognition text |
CN112489655A (en) * | 2020-11-18 | 2021-03-12 | 元梦人文智能国际有限公司 | Method, system and storage medium for correcting error of speech recognition text in specific field |
CN112989806A (en) * | 2021-04-07 | 2021-06-18 | 广州伟宏智能科技有限公司 | Intelligent text error correction model training method |
Non-Patent Citations (2)
Title |
---|
Wu Fan: "Research on Chinese Word Segmentation in Information Retrieval", Journal of Intelligence (《情报杂志》) *
Zhang Junqi: "Research on Domain-Oriented Text Error Correction after Speech-to-Text Conversion", China Masters' Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) *
Also Published As
Publication number | Publication date |
---|---|
CN113948065B (en) | 2022-07-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107247707B (en) | Enterprise association relation information extraction method and device based on completion strategy | |
US8538743B2 (en) | Disambiguating text that is to be converted to speech using configurable lexeme based rules | |
US7574349B2 (en) | Statistical language-model based system for detection of missing attachments | |
US8650187B2 (en) | Systems and methods for linked event detection | |
CN112287684B (en) | Short text auditing method and device for fusion variant word recognition | |
US11386269B2 (en) | Fault-tolerant information extraction | |
US20130158983A1 (en) | System and Method for Identifying Phrases in Text | |
EP1627325B1 (en) | Automatic segmentation of texts comprising chunks without separators | |
US8326809B2 (en) | Systems and methods for defining and processing text segmentation rules | |
US8880391B2 (en) | Natural language processing apparatus, natural language processing method, natural language processing program, and computer-readable recording medium storing natural language processing program | |
US10120843B2 (en) | Generation of parsable data for deep parsing | |
WO2022256144A1 (en) | Application-specific optical character recognition customization | |
CN111062208A (en) | File auditing method, device, equipment and storage medium | |
CN113948065B (en) | Method and system for screening error blocking words based on n-gram model | |
CN112699671A (en) | Language marking method and device, computer equipment and storage medium | |
WO2008131509A1 (en) | Systems and methods for improving translation systems | |
US12008305B2 (en) | Learning device, extraction device, and learning method for tagging description portions in a document | |
Olinsky et al. | Non-standard word and homograph resolution for asian language text analysis. | |
Yasin et al. | Transformer-Based Neural Machine Translation for Post-OCR Error Correction in Cursive Text | |
JP5795302B2 (en) | Morphological analyzer, method, and program | |
CN111950289A (en) | Data processing method and device based on automobile maintenance record | |
WO2007041328A1 (en) | Detecting segmentation errors in an annotated corpus | |
CN110232189B (en) | Semantic analysis method, device, equipment and storage medium | |
Glocker et al. | Hierarchical Multi-task Learning with Articulatory Attributes for Cross-Lingual Phoneme Recognition | |
Coats | Noisy Data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||