CN114969349A - Text processing method and device, electronic equipment and medium - Google Patents


Info

Publication number
CN114969349A
CN114969349A (application CN202210910508.XA; granted as CN114969349B)
Authority
CN
China
Prior art keywords
text
category
texts
similarity
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210910508.XA
Other languages
Chinese (zh)
Other versions
CN114969349B (en)
Inventor
叶欣
赵强
琚诚诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210910508.XA priority Critical patent/CN114969349B/en
Publication of CN114969349A publication Critical patent/CN114969349A/en
Application granted granted Critical
Publication of CN114969349B publication Critical patent/CN114969349B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a text processing method and apparatus, an electronic device, and a medium. The text processing method comprises the following steps: acquiring a text set comprising a plurality of texts, the plurality of texts comprising first texts and second texts, where a first text is a text containing a determined sensitive word and a second text is a text to be processed; dividing texts whose text similarity satisfies a first preset requirement into the same category to obtain text clusters of a plurality of categories; when the number and/or proportion of first texts in the text cluster of any category satisfies a second preset requirement, determining that category as a first category; and determining a new sensitive word based on the second texts in the first-category text cluster. The method can mine new sensitive words from text accurately and efficiently.

Description

Text processing method and device, electronic equipment and medium
Technical Field
The present disclosure relates generally to the field of computer technology, and more particularly, to a text processing method, apparatus, electronic device, and medium.
Background
Current approaches mine new sensitive words from text mainly by applying word segmentation techniques and new-word-discovery models to a corpus, classifying candidate words and discovering new sensitive words on that basis. However, existing techniques make poor use of the sensitive words already confirmed in a historical sensitive-word lexicon, so they fail to achieve an ideal mining effect whether that lexicon is small and unlabeled or large and labeled.
Disclosure of Invention
The present disclosure provides a text processing method, apparatus, electronic device, and medium to solve at least the problems in the related art described above; it need not solve every one of those problems.
According to a first aspect of embodiments of the present disclosure, a text processing method is provided, comprising: acquiring a text set comprising a plurality of texts, the plurality of texts comprising first texts and second texts, where a first text is a text containing a determined sensitive word and a second text is a text to be processed; dividing texts whose text similarity satisfies a first preset requirement into the same category to obtain text clusters of a plurality of categories; when the number and/or proportion of first texts in the text cluster of any category satisfies a second preset requirement, determining that category as a first category; and determining a new sensitive word based on the second texts in the first-category text cluster.
Optionally, the dividing comprises: clustering the first texts among the plurality of texts so that first texts whose text similarity satisfies the first preset requirement are divided into the same category; and clustering the second texts according to the clustering result of the first texts, so that a second text whose similarity to a first text satisfies the first preset requirement is divided into the category of that first text.
Optionally, the dividing comprises: when at least one category of text cluster exists, calculating, for any text in the text set, the average text similarity of that text with respect to the current text cluster of each category; when the maximum of these average text similarities is greater than or equal to a first threshold, dividing the text into the category corresponding to that maximum; and when the maximum is smaller than the first threshold, creating a new category and dividing the text into the newly created category.
Optionally, calculating the average text similarity comprises: for the current text cluster of any category, calculating the text similarity between the text and each seed text in that cluster, where the seed texts comprise the first texts and/or first texts already divided into the category; and taking the average of these similarities as the average text similarity of the text with respect to that cluster.
Optionally, calculating the text similarities comprises: when the number of first texts is less than or equal to a second threshold, calculating the glyph similarity between the text and each seed text in the cluster and using it as the text similarity.
Optionally, after determining a new sensitive word based on the second texts in the first-category text cluster, the method further comprises: based on the determination result, updating each second text containing the new sensitive word to a first text and updating the number of first texts in the text set.
Optionally, when the updated number of first texts is greater than the second threshold, each text in the text set is vectorized to obtain a corresponding word-vector sequence, and the dividing further comprises: calculating, based on each text's word-vector sequence, the probability that the text belongs to a preset category; and dividing the plurality of texts into text clusters of the preset categories based on those probabilities.
Optionally, calculating the text similarities further comprises: for each seed text in the current cluster of any category, calculating the semantic similarity between the text and the seed text based on their respective word-vector sequences and using it as the text similarity.
Optionally, each first text carries a label of the determined sensitive word it contains, and the first threshold is determined such that first texts with the same label are divided into the same category.
Optionally, determining any category as the first category comprises: determining the number and proportion of first texts in that category's text cluster; and determining the category as the first category when the number is greater than a third threshold and the proportion is greater than a fourth threshold.
According to a second aspect of embodiments of the present disclosure, a text processing apparatus is provided, comprising: a text acquisition unit configured to acquire a text set comprising a plurality of texts, the plurality of texts comprising first texts and second texts, where a first text is a text containing a determined sensitive word and a second text is a text to be processed; a category dividing unit configured to divide texts whose text similarity satisfies a first preset requirement into the same category to obtain text clusters of a plurality of categories; a first determining unit configured to determine any category as a first category when the number and/or proportion of first texts in its text cluster satisfies a second preset requirement; and a second determining unit configured to determine a new sensitive word based on the second texts in the first-category text cluster.
Optionally, the category dividing unit is configured to: cluster the first texts among the plurality of texts so that first texts whose text similarity satisfies the first preset requirement are divided into the same category; and cluster the second texts according to the clustering result of the first texts, so that a second text whose similarity to a first text satisfies the first preset requirement is divided into the category of that first text.
Optionally, the category dividing unit is configured to: when at least one category of text cluster exists, calculate, for any text in the text set, the average text similarity of that text with respect to the current text cluster of each category; divide the text into the category corresponding to the maximum of these average similarities when that maximum is greater than or equal to a first threshold; and otherwise create a new category and divide the text into the newly created category.
Optionally, the category dividing unit is further configured to: for the current text cluster of any category, calculate the text similarity between the text and each seed text in that cluster, where the seed texts comprise the first texts and/or first texts already divided into the category; and take the average of these similarities as the average text similarity of the text with respect to that cluster.
Optionally, the category dividing unit is further configured to: when the number of first texts is less than or equal to a second threshold, calculate the glyph similarity between the text and each seed text in the cluster and use it as the text similarity.
Optionally, the apparatus further comprises a text updating unit configured to: based on the determination result of the new sensitive word, update each second text containing the new sensitive word to a first text and update the number of first texts in the text set.
Optionally, the apparatus further comprises a vectorization unit configured to, when the updated number of first texts is greater than the second threshold, vectorize each text in the text set to obtain a corresponding word-vector sequence, and the category dividing unit is further configured to: calculate, based on each text's word-vector sequence, the probability that the text belongs to a preset category, and divide the plurality of texts into text clusters of the preset categories based on those probabilities.
Optionally, the category dividing unit is further configured to: for each seed text in the current cluster of any category, calculate the semantic similarity between the text and the seed text based on their respective word-vector sequences and use it as the text similarity.
Optionally, each first text carries a label of the determined sensitive word it contains, and the first threshold is determined such that first texts with the same label are divided into the same category.
Optionally, the first determining unit is configured to: determine the number and proportion of first texts in any category's text cluster, and determine the category as the first category when the number is greater than a third threshold and the proportion is greater than a fourth threshold.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement a text processing method according to the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a text processing method as described above according to the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement a text processing method according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the text processing method, apparatus, electronic device, and medium provided by embodiments of the present disclosure make full use of the sensitive words already determined in the historical sensitive-word lexicon to mine new sensitive words from text accurately and efficiently, which on the one hand rapidly expands the historical lexicon and on the other hand further improves the accuracy and efficiency of sensitive-word mining.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a flowchart illustrating a text processing method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow diagram illustrating unsupervised clustering of multiple texts according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating supervised classification of a plurality of texts according to an exemplary embodiment of the present disclosure;
FIG. 4 is a flow chart illustrating unsupervised clustering based on Single-Pass according to an exemplary embodiment of the present disclosure;
FIG. 5 is an illustration showing a text processing flow at one stage according to an exemplary embodiment of the present disclosure;
FIG. 6 is an illustration showing a text processing flow at another stage according to an exemplary embodiment of the present disclosure;
fig. 7 is a block diagram illustrating a text processing apparatus according to an exemplary embodiment of the present disclosure;
fig. 8 is a block diagram illustrating an electronic device 800 according to an example embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the present disclosure, the expression "at least one of the items" covers three parallel cases: "any one of the items", "any combination of several of the items", and "all of the items". For example, "including at least one of A and B" covers three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "performing at least one of step one and step two" covers three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
The present disclosure provides a text processing method, device, electronic device, and medium, which can accurately and efficiently mine new sensitive words from a text for different application scenarios, such as search and comment application scenarios.
Hereinafter, a text processing method, apparatus, and electronic device according to exemplary embodiments of the present disclosure will be described in detail with reference to fig. 1 to 8.
Fig. 1 is a flowchart illustrating a text processing method according to an exemplary embodiment of the present disclosure.
Referring to fig. 1, in step S101, a text set may be acquired. The text set may include a plurality of texts comprising first texts and second texts, where a first text is a text containing a determined sensitive word and a second text is a text to be processed, i.e., it is not yet confirmed whether it contains a sensitive word. A text may be short (a word, phrase, or sentence) or long (a paragraph), so the texts in the text set may be preprocessed in advance according to actual needs, for example by performing word segmentation on each text. The determined sensitive words may come from a pre-compiled historical sensitive-word lexicon. The text set itself may also be produced by preprocessing: first, an original text set is obtained; then, the weight of each word in the original text set is calculated; finally, words whose weight is below a preset threshold are filtered out to obtain the text set.
As an example, the original text set may be preprocessed using TF-IDF (Term Frequency-Inverse Document Frequency) to obtain the text set, from which target words and phrases are then mined. TF-IDF is a statistical weighting technique for assessing how important a word is to one text in a text set or corpus. In general, a word's importance is proportional to the number of times it appears in the text but inversely proportional to its frequency across the corpus. TF-IDF combines TF (Term Frequency) and IDF (Inverse Document Frequency). TF is the frequency of a word within a text; since a word tends to occur more often in a long text than in a short one regardless of its importance, the count is usually normalized (the word count divided by the total number of words in the text) to avoid a bias toward long texts. TF can thus be expressed by the following formula (1):
$\mathrm{TF}_{i,j} = \dfrac{n_{i,j}}{\sum_{k} n_{k,j}}$ (1)

Here, $n_{i,j}$ denotes the number of times word $t_i$ occurs in text $d_j$, $\sum_{k} n_{k,j}$ denotes the total number of words in text $d_j$, and $\mathrm{TF}_{i,j}$ denotes the frequency of word $t_i$ in text $d_j$.
The IDF reflects how rare a word is: the fewer the texts that contain it, the larger its IDF, indicating that the word has good category-distinguishing capability. The IDF of a word is obtained by dividing the total number of texts by the number of texts containing the word and taking the logarithm, i.e., IDF can be expressed by the following formula (2):
$\mathrm{IDF}_{i} = \log\dfrac{|D|}{|\{\, j : t_i \in d_j \,\}|}$ (2)

Here, $|D|$ denotes the total number of texts and $|\{\, j : t_i \in d_j \,\}|$ denotes the number of texts containing word $t_i$.
On this basis, TF and IDF are multiplied to obtain TF-IDF as the weight of the corresponding word. For a word that is frequent in a given text but rare across the whole text set, the TF-IDF value is large; TF-IDF therefore tends to filter out common words and retain important ones. TF-IDF can be expressed by the following formula (3):
$\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i}$ (3)
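As a minimal sketch of formulas (1)-(3) and the low-weight filtering described for step S101 (function names are hypothetical; the disclosure does not prescribe an implementation):

```python
import math
from collections import Counter

def tf_idf_weights(texts):
    """Compute a TF-IDF weight for every word of every text.

    `texts` is a list of already-segmented texts (lists of words).
    Returns one dict per text mapping word -> TF-IDF weight.
    """
    n_texts = len(texts)
    # Document frequency: number of texts containing each word.
    df = Counter()
    for words in texts:
        df.update(set(words))

    weights = []
    for words in texts:
        counts = Counter(words)
        total = len(words)
        w = {}
        for word, n in counts.items():
            tf = n / total                       # formula (1)
            idf = math.log(n_texts / df[word])   # formula (2)
            w[word] = tf * idf                   # formula (3)
        weights.append(w)
    return weights

def filter_low_weight(texts, threshold):
    """Drop words whose TF-IDF weight falls below `threshold`."""
    weights = tf_idf_weights(texts)
    return [[word for word in words if weights[i][word] >= threshold]
            for i, words in enumerate(texts)]
```

Note that a word occurring in every text gets IDF 0 and is filtered out, which matches the intent of removing common, uninformative words before mining.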
next, in step S102, the texts with the text similarity satisfying the first preset requirement in the plurality of texts may be classified into the same category to obtain a plurality of categories of text clusters. Here, the texts whose text similarity satisfies the first preset requirement may be considered as similar texts, and then unsupervised clustering may be performed on the plurality of texts or supervised classification may be performed on the plurality of texts according to the situation, so as to obtain text clusters of a plurality of categories. For example, under the condition that the inventory of the history sensitive words is small and no label exists, unsupervised clustering can be carried out on a plurality of texts on the basis of a Single-Pass model so as to rapidly expand the history sensitive words; under the condition that the inventory of the history sensitive words is sufficient and the history sensitive words have labels, unsupervised clustering can be carried out on the plurality of texts based on a Single-Pass model, or supervised classification can be carried out on the plurality of texts based on a TextCNN model, so that the accuracy and the efficiency of sensitive word mining are further improved. The present disclosure is not limited thereto, and a person skilled in the art may select a specific model for clustering or classification according to actual circumstances. 
It should be understood that the specific test for whether text similarity meets the first preset requirement depends on the clustering or classification model actually chosen, and the disclosure does not limit it. As one example, for clustering, pairwise text similarities may be computed, and texts whose similarity meets a threshold condition treated as similar and divided into the same category. As another example, for classification, the similarity between a text and a category may be predicted in the form of a probability, and texts whose predicted probability meets a threshold condition divided into that category.
According to an exemplary embodiment of the present disclosure, for unsupervised clustering, the first texts among the plurality of texts may be clustered first, dividing first texts whose similarity satisfies the first preset requirement into the same category; the second texts may then be clustered against that result, dividing each second text whose similarity to a first text satisfies the first preset requirement into that first text's category. Processing the first texts, which contain determined sensitive words, before the second texts to be mined makes full use of the pre-compiled historical sensitive-word lexicon, so the clustering result better matches the requirement. The process of unsupervised clustering of multiple texts is described below in conjunction with fig. 2.
Fig. 2 is a flowchart illustrating unsupervised clustering of multiple texts according to an exemplary embodiment of the present disclosure.
Referring to fig. 2, in step S201, when at least one category of text cluster exists, the average text similarity of any text in the text set with respect to each category's current cluster may be calculated: for each current category, the text similarity between the text and every seed text in that category's cluster is computed, and the average of these similarities is taken as the average text similarity of the text with respect to that cluster. The seed texts may include the first texts and/or first texts already divided into the category. Using first texts that contain determined sensitive words as seed texts lets the historical lexicon further guide the clustering process, so the result better matches the requirement.
Next, in step S202, when the maximum of these average text similarities is greater than or equal to the first threshold, the text is divided into the category corresponding to that maximum. Otherwise, in step S203, when the maximum is smaller than the first threshold, a new category is created and the text is divided into it. When the first texts carry labels of the determined sensitive words, the first threshold may be determined so that first texts with the same label fall into the same category; that is, when the lexicon is sufficient and labeled, the first threshold may be validated against the labels, and if the clustering result matches the category division in the labels, the current first threshold is considered valid. The disclosure is not limited to this; the first threshold may be set by those skilled in the art according to the actual situation.
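The Single-Pass flow of steps S201-S203 can be sketched as follows (a minimal illustration; the function name and the shape of the `similarity` callback are assumptions, not the disclosure's prescribed interface):

```python
def single_pass_cluster(texts, similarity, first_threshold):
    """Single-Pass clustering sketch of steps S201-S203.

    `texts` is an iterable of texts, `similarity(a, b)` returns a
    text-similarity score in [0, 1], and `first_threshold` is the
    first threshold of step S202. Every text assigned to a cluster
    acts as a seed text for subsequent texts. Returns a list of
    clusters, each a list of texts.
    """
    clusters = []
    for text in texts:
        best_idx, best_avg = -1, -1.0
        # S201: average similarity of the text to each cluster's seeds.
        for idx, seeds in enumerate(clusters):
            avg = sum(similarity(text, s) for s in seeds) / len(seeds)
            if avg > best_avg:
                best_idx, best_avg = idx, avg
        if best_avg >= first_threshold:
            clusters[best_idx].append(text)   # S202: join best cluster
        else:
            clusters.append([text])           # S203: open a new cluster
    return clusters
```

In the embodiment described above, the first texts would be fed through this loop before the second texts, so the clusters they seed are already in place when the texts to be mined arrive.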
According to an exemplary embodiment of the present disclosure, in the case that the number of first texts is less than or equal to a second threshold, the glyph similarity between any text and any seed text in the current text cluster of any category may be calculated and taken as the text similarity between that text and that seed text. Here, the glyph similarity may be calculated based on a fuzzy string matching tool, which may be, for example, the fuzzywuzzy fuzzy-matching module for Python, so that the glyph similarity may be calculated using a weighted fuzzywuzzy matching configuration (for example, with a weight of 0.8 applied to one of the matching scores). The present disclosure is not limited thereto, and those skilled in the art may calculate the glyph similarity using different tools and configurations in different situations. In addition, the skilled person may determine the second threshold according to the actual situation, so as to delimit the specific cases in which the glyph similarity needs to be calculated. Dividing categories by calculating the glyph similarity of texts makes the method suitable for multilingual text mining and improves the compatibility of text processing; moreover, since the glyph similarity is relatively easy to calculate, the historical sensitive word bank can be efficiently expanded when the inventory of historical sensitive words is small.
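A minimal sketch of the glyph-similarity step, assuming a fuzz.ratio-style score: fuzzywuzzy's pure-Python ratio is built on the standard library's difflib.SequenceMatcher, so an equivalent 0–100 score can be sketched with stdlib only (the exact weighted configuration mentioned above is not recoverable from the text and is omitted here):

```python
from difflib import SequenceMatcher


def glyph_similarity(a, b):
    # fuzz.ratio-style score in [0, 100]; fuzzywuzzy's pure-Python backend
    # computes the same SequenceMatcher ratio underneath.
    return 100.0 * SequenceMatcher(None, a, b).ratio()
```

Identical strings score 100, disjoint strings score near 0; the first threshold of the clustering step would be expressed on the same scale.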
Referring back to fig. 1, in step S103, when the number and/or proportion of first texts in the text cluster of any category satisfies a second preset requirement, that category may be determined as a first category. Here, the number and proportion of first texts in the text cluster of any category may be determined; when the number is greater than a third threshold and the proportion is greater than a fourth threshold, that category may be determined as a first category. The present disclosure is not limited thereto, however, and the second preset requirement for determining the first category may be set by those skilled in the art according to the actual situation. For example, only the number of first texts in the text cluster of any category may be determined, and that category determined as a first category when the number is greater than the third threshold; as another example, only the proportion of first texts in the text cluster of any category may be determined, and that category determined as a first category when the proportion is greater than the fourth threshold. Further, the third threshold and the fourth threshold may be set by those skilled in the art according to practical situations; for example, the third threshold may be 5 and the fourth threshold may be 80%, but they are not limited thereto.
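The second preset requirement described above (count above a third threshold and proportion above a fourth threshold) can be sketched as follows; the defaults of 5 and 80% follow the example values given in the text:

```python
def is_first_category(cluster_texts, first_texts,
                      count_threshold=5, ratio_threshold=0.8):
    # A cluster is a "first category" when it holds enough texts containing
    # already-determined sensitive words, in both count and proportion.
    hits = sum(1 for t in cluster_texts if t in first_texts)
    return hits > count_threshold and hits / len(cluster_texts) > ratio_threshold
```

Dropping either condition yields the count-only or proportion-only variants also mentioned above.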
Next, in step S104, a new sensitive word may be determined based on the second texts in the text cluster of the first category. Here, as an example, the second texts in the text cluster of the first category may be placed in a target database, and a person skilled in the art may then determine, according to the actual situation, a specific way to determine new sensitive words from the target database: for example, new sensitive words may be selected from the target database through manual review or a specific algorithm, or non-sensitive words may be deleted from it so as to finally determine the new sensitive words, but the disclosure is not limited thereto. Mining sensitive words only within the text clusters determined as first categories among the text clusters of the multiple categories effectively reduces the number of texts that need to be processed, thereby improving the efficiency of sensitive word mining. According to an exemplary embodiment of the present disclosure, after a new sensitive word is determined based on the second texts in the text cluster of the first category, the second texts including the determined new sensitive word may further be updated to first texts, and the number of first texts in the text set updated, based on the determination result of the new sensitive word. Further, when the number of updated first texts is greater than the second threshold, the historical sensitive word bank is considered sufficient, so that each text in the text set may be vectorized to obtain a word vector sequence corresponding to each text. Here, the vectorization may be performed using, but not limited to, a LaBSE (Language-agnostic BERT Sentence Embedding) model, so as to support semantic analysis of multiple languages.
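The promotion of second texts to first texts after new sensitive words are confirmed can be sketched as follows. Substring containment is an assumption of this sketch; the patent does not fix how "including" a sensitive word is tested:

```python
def update_first_texts(first_texts, second_texts, new_sensitive_words):
    # Promote any pending (second) text that contains a newly confirmed
    # sensitive word into the set of first texts.
    promoted = {t for t in second_texts
                if any(w in t for w in new_sensitive_words)}
    return first_texts | promoted
```

The size of the returned set is then compared with the second threshold to decide whether to switch to the vectorized, label-driven stage.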
On the basis, texts with text similarity meeting a first preset requirement in the plurality of texts can be classified into the same category in a supervised classification mode.
A process of supervised classification of a plurality of texts according to an exemplary embodiment of the present disclosure is described below with reference to fig. 3.
Fig. 3 is a flowchart illustrating supervised classification of a plurality of texts according to an exemplary embodiment of the present disclosure.
Referring to fig. 3, in step S301, the probability that each text belongs to a preset category may be calculated based on the word vector sequence corresponding to that text. Here, as an example, the word vector sequence may be input to a TextCNN model trained in advance on manually labeled category labels, so that the TextCNN model outputs the probability that each text belongs to a preset category. Specifically, the word vector sequence may be input to the convolution layer of the TextCNN model to extract text features, yielding a new matrix after the convolution operation; the new matrix may then be input to a max-pooling layer, which reduces the number of parameters while preserving the main features, thereby speeding up computation and reducing the risk of over-fitting; the outputs of the max-pooling layer may then be concatenated and input to a softmax layer, finally yielding the probabilities of the preset categories. During training, the TextCNN model may calculate a loss function from the probabilities predicted by the model and the manually labeled category labels, and adjust the model parameters according to the value of the loss function to complete training.
Next, in step S302, the plurality of texts may be divided into text clusters of the preset categories based on the probability that each text belongs to each preset category. When the inventory of the historical sensitive word bank is sufficient and labels are available, performing supervised classification on the plurality of texts in this way can further improve both the efficiency and the accuracy of sensitive word mining.
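Steps S301–S302 reduce to taking a softmax over per-category model scores and assigning each text to the argmax category. A minimal sketch, with the TextCNN itself abstracted away as given logits (an assumption of this sketch; the patent obtains them from the conv/max-pooling/softmax stack described above):

```python
import math


def softmax(logits):
    # Numerically stable softmax over one text's per-category scores.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def assign_category(logits_per_text):
    # Each text goes to the preset category with the highest probability.
    out = {}
    for text, logits in logits_per_text.items():
        probs = softmax(logits)
        out[text] = probs.index(max(probs))
    return out
```

Grouping texts by the assigned index yields the text clusters of the preset categories.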
According to the exemplary embodiment of the present disclosure, in the case that the text set is subjected to the vectorization processing described above, the texts with the text similarity meeting the first preset requirement in the plurality of texts may still be classified into the same category in an unsupervised clustering manner. Specifically, when the text similarity between any text and all the seed texts in the text cluster of any current category is calculated for the text cluster of any current category, the semantic similarity between any text and any seed text can be calculated for any seed text in the text cluster of any current category based on the word vector sequence corresponding to any text and any seed text, and the semantic similarity is used as the text similarity between any text and any seed text. By performing semantic analysis after text vectorization, the method can be suitable for multi-language text mining, and improves the compatibility of text processing. Further, as an example, the above semantic similarity may be calculated by using a cosine similarity, which may be expressed by the following equation (4):
$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\;\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{4}$$

Here, $x_i$ and $y_i$ respectively denote the components of the two vectors used for calculating the cosine similarity, and $n$ denotes the dimension of the vectors.
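Equation (4) translates directly into code; a stdlib-only sketch over plain lists of floats:

```python
import math


def cosine_similarity(x, y):
    # Equation (4): dot product divided by the product of Euclidean norms.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```

Parallel vectors score 1, orthogonal vectors 0; in the unsupervised stage this score serves as the semantic text similarity between a text and a seed text.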
For ease of understanding, the process of unsupervised clustering based on Single-Pass according to an exemplary embodiment of the present disclosure is described below in conjunction with fig. 4.
Fig. 4 is a flowchart illustrating unsupervised clustering based on Single-Pass according to an exemplary embodiment of the present disclosure.
Referring to FIG. 4, as an example, a Single-Pass model may be used to perform unsupervised clustering on the plurality of texts in the text set. Specifically, for a text one in the text set, after text one enters the Single-Pass model, the average text similarity of text one with respect to the current text cluster of each category may be calculated based on the current seed texts. If text one has not been vectorized, a fuzzy string matching tool is invoked to calculate the glyph similarity; if text one has been vectorized, the cosine similarity is calculated as the semantic similarity based on the vectorized text one. Next, the maximum of the average text similarities may be compared against the first threshold described above. When the maximum is greater than or equal to the first threshold, text one may be classified into the category corresponding to the maximum. It is then judged whether text one is a first text (i.e., a text containing a determined sensitive word); when text one is a first text, it is taken as a seed text of the category corresponding to the maximum, and otherwise it does not become a seed text. In addition, when the maximum is smaller than the first threshold, a new category is created from text one, text one is divided into the new category, and text one is taken as a seed text of the new category. After the seed texts are updated on the basis of the above judgment, the clustering operation for text one ends, and the next text two waits to enter the Single-Pass model. By performing the clustering operation described above on each text in the text set using Single-Pass, the unsupervised clustering task of the present disclosure can be completed efficiently.
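The whole Single-Pass loop of fig. 4 can be sketched as follows. SequenceMatcher again stands in for the glyph/semantic similarity (an assumption of this sketch), and, per the flow above, a text seeds a newly created category unconditionally but joins an existing category's seeds only if it is a first text:

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for the glyph or semantic similarity used by the disclosure.
    return SequenceMatcher(None, a, b).ratio()


def single_pass(texts, first_texts, threshold):
    clusters = []  # each cluster: {"members": [...], "seeds": [...]}
    for text in texts:
        # Average similarity against each cluster's current seed texts.
        best, best_avg = None, -1.0
        for c in clusters:
            avg = sum(similarity(text, s) for s in c["seeds"]) / len(c["seeds"])
            if avg > best_avg:
                best, best_avg = c, avg
        if best is not None and best_avg >= threshold:
            best["members"].append(text)
            if text in first_texts:  # only first texts join existing seeds
                best["seeds"].append(text)
        else:  # open a new category, seeded by the incoming text
            clusters.append({"members": [text], "seeds": [text]})
    return clusters
```

Each text is visited exactly once, which is what makes the single pass efficient on large corpora.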
According to an exemplary embodiment of the present disclosure, the text set used for text processing may include basic corpora such as top-ranked search term suggestions, corpora produced by high-risk accounts, screened search term corpora, historical sensitive word corpora, and historically audited search term corpora. In processing the basic corpora, both a stage in which the historical sensitive word bank is small and unlabeled and a stage in which it is sufficient and labeled are considered, so that the text processing flow can be adjusted according to the characteristics of each stage and sensitive words can be better mined. Text processing flows at the different stages according to exemplary embodiments of the present disclosure are described below in conjunction with fig. 5 and fig. 6.
Fig. 5 is an explanatory view showing a text processing flow at one stage according to an exemplary embodiment of the present disclosure, and fig. 6 is an explanatory view showing a text processing flow at another stage according to an exemplary embodiment of the present disclosure.
Referring to fig. 5, for the stage in which the historical sensitive word bank is small and unlabeled, the preprocessed basic corpora are input into a clustering model (i.e., the Single-Pass model) for unsupervised clustering; a clustering result of the basic corpora is obtained based on text matching analysis of glyph similarity, new sensitive words are determined from the clustering result, and the historical sensitive word bank can be updated with the determined new sensitive words, thereby efficiently expanding it. After the historical sensitive word bank has been expanded, the historical sensitive words can be labeled, entering the text processing stage in which the historical sensitive word bank is sufficient and labeled. Referring to fig. 6, for the stage in which the historical sensitive word bank is sufficient and labeled, the preprocessed basic corpora are vectorized using a BERT model, and the vectorized basic corpora are input into a clustering model (i.e., the Single-Pass model) for unsupervised clustering, a clustering result of the basic corpora being obtained based on semantic analysis of semantic similarity, from which new sensitive words are determined; alternatively, the vectorized basic corpora are input into a trained classification model (i.e., a TextCNN model) for supervised classification, a classification result of the basic corpora being obtained based on semantic analysis of semantic similarity, from which new sensitive words are determined. It should be understood that the above text processing flows for the different stages are merely examples, and those skilled in the art may make corresponding adjustments according to the actual situation.
According to the text processing method of the present disclosure, the determined sensitive words in the historical sensitive word bank can be fully utilized and new sensitive words can be mined from text accurately and efficiently: on the one hand, the historical sensitive word bank can be rapidly expanded when its inventory is small and unlabeled; on the other hand, the accuracy and efficiency of sensitive word mining can be further improved when its inventory is sufficient and labeled. In addition, by calculating the glyph similarity of texts through fuzzy string matching, or by performing semantic analysis after vectorizing the texts, the method is suitable for mining multilingual texts, improving the compatibility of text processing.
Fig. 7 is a block diagram illustrating a text processing apparatus according to an exemplary embodiment of the present disclosure. Referring to fig. 7, the text processing apparatus 700 may include: a text acquisition unit 701, a category classification unit 702, a first determination unit 703, and a second determination unit 704.
The text acquisition unit 701 may acquire a text set. Here, the text set may include a plurality of texts, and the plurality of texts may include a first text and a second text, and as described above, the first text may be a text including the determined sensitive word, and the second text may be a text to be processed.
The category dividing unit 702 may divide the texts with the text similarity satisfying the first preset requirement among the plurality of texts into the same category to obtain a plurality of categories of text clusters.
When the number and/or the ratio of the first texts in the text cluster of any category satisfies the second preset requirement, the first determining unit 703 may determine any category as the first category.
The second determining unit 704 may determine a new sensitive word based on the second text in the text cluster of the first category.
According to an exemplary embodiment of the present disclosure, in the case of unsupervised clustering, the category classification unit 702 may cluster a first text of the plurality of texts, so that the first text whose text similarity satisfies a first preset requirement is classified into the same category; then, according to the clustering result of the first text in the plurality of texts, clustering can be performed on the second text in the plurality of texts, so that the second text with the text similarity meeting the first preset requirement with the first text is divided into corresponding categories corresponding to the first text.
According to an exemplary embodiment of the present disclosure, when there is a text cluster of at least one category, the category dividing unit 702 may calculate, for any one text in the text set, an average text similarity of the any one text with respect to the text cluster of each current category, respectively; when the maximum value of the average text similarity of any text with respect to the text cluster of each current category is greater than or equal to the first threshold, the category classification unit 702 may classify any text into the category corresponding to the maximum value; when the maximum value is smaller than the first threshold value, the category classification unit 702 may newly create a category and classify any one text into the newly created category. Here, in the case where the first text includes a label of the determined sensitive word, the first threshold may be determined based on the first text of the same label being classified into the same category.
According to an exemplary embodiment of the present disclosure, the category dividing unit 702 may further calculate, for a current text cluster of any one category, text similarities between any one text and all seed texts in the current text cluster of any one category; then, the average of the text similarity of any text and all the seed texts in the text cluster of any current category can be used as the average text similarity of any text relative to the text cluster of any current category. Here, the seed text may include the first text and/or the first text classified into various categories.
According to an exemplary embodiment of the present disclosure, in a case that the number of the first texts is less than or equal to the second threshold, the category classification unit 702 may further calculate a font similarity between any one text and any one seed text for any one seed text in a text cluster of any one category at present, and take the font similarity as a text similarity between any one text and any one seed text.
According to an exemplary embodiment of the present disclosure, the text processing apparatus 700 may further include a text updating unit (not shown). The text updating unit may update the second text including the determined new sensitive word to the first text and update the number of the first texts in the text set based on a determination result of the new sensitive word.
According to an exemplary embodiment of the present disclosure, the text processing apparatus 700 may further include a vectorization unit (not shown). The vectorization unit may perform vectorization processing on each text in the text set to obtain a word vector sequence corresponding to each text, when the number of the updated first texts is greater than the second threshold. On this basis, the category classification unit 702 may calculate the probability that each text belongs to a preset category based on the word vector sequence corresponding to each text; then, the plurality of texts may be divided into text clusters of the preset category based on a probability that each text belongs to the preset category.
According to an exemplary embodiment of the present disclosure, when the text similarity between any text and all the seed texts in any current category of text clusters is calculated respectively for any current category of text clusters, the category classification unit 702 may further calculate the semantic similarity between any text and any seed text based on the word vector sequence corresponding to any text and any seed text for any seed text in any current category of text clusters, and use the semantic similarity as the text similarity between any text and any seed text.
According to an exemplary embodiment of the present disclosure, the first determination unit 703 may determine the number and proportion of the first texts in any category of text clusters; when the number is greater than the third threshold and the ratio is greater than the fourth threshold, the first determination unit 703 may determine any one of the categories as the first category.
According to an exemplary embodiment of the present disclosure, an electronic device may be provided. Fig. 8 is a block diagram of an electronic device 800 including at least one memory 801 and at least one processor 802 having a set of computer-executable instructions stored therein that, when executed by the at least one processor, perform a text processing method according to an embodiment of the disclosure, according to an embodiment of the disclosure.
By way of example, the electronic device 800 may be a PC, a tablet device, a personal digital assistant, a smart phone, or any other device capable of executing the above set of instructions. Here, the electronic device 800 need not be a single electronic device, but can be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or jointly. The electronic device 800 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 800, the processor 802 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, the processor 802 may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 802 may execute instructions or code stored in memory, wherein the memory 801 may also store data. The instructions and data may also be transmitted or received over a network via the network interface device, which may employ any known transmission protocol.
The memory 801 may be integrated with the processor 802, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 801 may include a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 801 and the processor 802 may be operatively coupled or may communicate with each other, such as through I/O ports, network connections, etc., so that the processor 802 can read files stored in the memory 801.
In addition, the electronic device 800 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the text processing method of the embodiments of the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer such that the processor or computer can execute the computer program.
The computer program in the computer-readable storage medium described above can be run in an environment deployed in a computer apparatus, such as a client, a host, a proxy device, a server, and the like, and further, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there is provided a computer program product including computer instructions that, when executed by a processor, implement a text processing method of an embodiment of the present disclosure.
According to the text processing method and apparatus, the electronic device, and the medium of the present disclosure, the determined sensitive words in the historical sensitive word bank can be fully utilized and new sensitive words can be mined from text accurately and efficiently: on the one hand, the historical sensitive word bank can be rapidly expanded when its inventory is small and unlabeled; on the other hand, the accuracy and efficiency of sensitive word mining can be further improved when its inventory is sufficient and labeled. In addition, by calculating the glyph similarity of texts through fuzzy string matching, or by performing semantic analysis after vectorizing the texts, the method is suitable for mining multilingual texts, improving the compatibility of text processing.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (22)

1. A method of text processing, comprising:
acquiring a text set, wherein the text set comprises a plurality of texts, the plurality of texts comprise a first text and a second text, the first text is a text comprising the determined sensitive words, and the second text is a text to be processed;
dividing texts of which the text similarity meets a first preset requirement in the plurality of texts into the same category to obtain a plurality of categories of text clusters;
when the number and/or the proportion of the first texts in the text clusters of any category meet second preset requirements, determining any category as a first category;
determining a new sensitive word based on the second text in the first category of text cluster.
2. The method for processing texts according to claim 1, wherein the classifying texts of which the similarity between texts meets a first preset requirement into the same category comprises:
clustering first texts in the plurality of texts, so that the first texts with text similarity meeting the first preset requirement are divided into the same category;
and clustering second texts in the texts according to the clustering result of the first texts in the texts, so that the second texts with the text similarity to the first texts meeting the first preset requirement are divided into corresponding categories corresponding to the first texts.
3. The method for processing texts according to claim 1, wherein the classifying texts of which the similarity between texts meets a first preset requirement into the same category comprises:
when at least one category of text cluster exists, respectively calculating the average text similarity of any text relative to the current text cluster of each category aiming at any text in the text set;
when the maximum value in the average text similarity of the any text relative to the text cluster of each current category is larger than or equal to a first threshold value, dividing the any text into the category corresponding to the maximum value;
and when the maximum value is smaller than the first threshold value, newly creating a category and dividing the any text into the newly created category.
4. The method according to claim 3, wherein said calculating, for any one text in the text set, an average text similarity of the any one text with respect to a text cluster of each current category respectively comprises:
respectively calculating the text similarity between the any text and all seed texts in the text cluster of any current category aiming at the text cluster of any current category, wherein the seed texts comprise the first text and/or a first text divided into each category;
and taking the average value of the text similarity of the any text and all the seed texts in the current text cluster of any category as the average text similarity of the any text relative to the current text cluster of any category.
5. The text processing method according to claim 4, wherein the calculating the text similarity of the any one text and all the seed texts in the text cluster of any one current category for the text cluster of any one current category respectively comprises:
and under the condition that the number of the first texts is smaller than or equal to a second threshold, calculating the font similarity of any text and any seed text aiming at any seed text in the current text cluster of any category, and taking the font similarity as the text similarity of any text and any seed text.
6. The text processing method of claim 4, further comprising, after determining a new sensitive word based on the second text in the first category of text cluster:
updating a second text including the determined new sensitive word to a first text and updating a number of the first text in the text set based on a determination result of the new sensitive word.
7. The text processing method of claim 6, further comprising:
performing vectorization on each text in the text set in a case where the updated number of the first texts is greater than a second threshold, to obtain a word vector sequence corresponding to each text,
wherein the dividing of the texts whose text similarity meets the first preset requirement among the plurality of texts into the same category further comprises:
calculating, based on the word vector sequence corresponding to each text, a probability that each text belongs to a preset category;
and dividing the plurality of texts into text clusters of the preset categories based on the probability that each text belongs to each preset category.
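The probability-based division in claim 7 might look like the following sketch. The probabilities are assumed to come from some trained classifier (the claims do not specify one), and the `min_prob` cut-off is an invented illustration, not part of the patent:

```python
from typing import Dict

def assign_by_probability(probs: Dict[str, float],
                          min_prob: float = 0.5) -> str:
    """Pick the preset category with the highest predicted probability;
    texts below the (assumed) cut-off stay unassigned."""
    category, p = max(probs.items(), key=lambda kv: kv[1])
    return category if p >= min_prob else "unassigned"
```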
8. The text processing method according to claim 7, wherein the respectively calculating, for the text cluster of any current category, the text similarities between the text and all the seed texts in the text cluster further comprises:
for any seed text in the text cluster of the current category, calculating a semantic similarity between the text and the seed text based on the word vector sequences corresponding to the text and the seed text, and taking the semantic similarity as the text similarity between the text and the seed text.
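Claim 8 leaves the semantic similarity computation open. A common concrete choice, assumed here only for illustration, is the cosine similarity of the mean word vectors of the two texts:

```python
import math
from typing import List

Vector = List[float]

def mean_vector(word_vectors: List[Vector]) -> Vector:
    """Element-wise mean of a text's word vector sequence."""
    n, dim = len(word_vectors), len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / n for i in range(dim)]

def semantic_similarity(seq_a: List[Vector], seq_b: List[Vector]) -> float:
    """Cosine similarity of the two texts' mean word vectors -- one
    assumed realisation of the claimed semantic similarity."""
    u, v = mean_vector(seq_a), mean_vector(seq_b)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```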
9. The text processing method of claim 3, wherein each first text includes a label of the determined sensitive word it contains, and wherein the first threshold is determined such that first texts with the same label are divided into the same category.
10. The text processing method according to claim 1, wherein the determining of any category as the first category when the number and/or the proportion of the first texts in the text cluster of the category meets a second preset requirement comprises:
determining the number and the proportion of the first texts in the text cluster of the category;
and determining the category as the first category when the number is greater than a third threshold and the proportion is greater than a fourth threshold.
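The two-threshold test of claim 10 is straightforward to sketch. The concrete threshold values below are invented placeholders, since the claims leave them unspecified:

```python
from typing import Callable, Sequence

def is_first_category(cluster: Sequence[str],
                      is_first_text: Callable[[str], bool],
                      third_threshold: int = 3,
                      fourth_threshold: float = 0.5) -> bool:
    """A cluster is a 'first category' when it holds more than
    `third_threshold` known-sensitive (first) texts AND they form more
    than `fourth_threshold` of the cluster."""
    count = sum(1 for t in cluster if is_first_text(t))
    return count > third_threshold and count / len(cluster) > fourth_threshold
```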
11. A text processing apparatus, comprising:
a text acquisition unit configured to acquire a text set, wherein the text set comprises a plurality of texts, the plurality of texts comprise first texts and second texts, a first text being a text that includes a determined sensitive word and a second text being a text to be processed;
a category classification unit configured to divide texts whose text similarity meets a first preset requirement among the plurality of texts into the same category, to obtain text clusters of a plurality of categories;
a first determination unit configured to determine any category as a first category when the number and/or the proportion of the first texts in the text cluster of the category meets a second preset requirement;
and a second determination unit configured to determine a new sensitive word based on the second texts in the text cluster of the first category.
12. The text processing apparatus according to claim 11, wherein the category classification unit is configured to:
clustering the first texts among the plurality of texts, so that first texts whose text similarity meets the first preset requirement are divided into the same category;
and clustering the second texts among the plurality of texts according to the clustering result of the first texts, so that each second text whose text similarity to a first text meets the first preset requirement is divided into the category of that first text.
13. The text processing apparatus according to claim 11, wherein the category classification unit is configured to:
when a text cluster of at least one category exists, calculating, for any text in the text set, the average text similarity of the text relative to the text cluster of each current category;
when the maximum value among the average text similarities of the text relative to the text clusters of the current categories is greater than or equal to a first threshold, dividing the text into the category corresponding to the maximum value;
and when the maximum value is smaller than the first threshold, creating a new category and dividing the text into the newly created category.
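The incremental assignment described in claim 13 (divide the text into the best-matching existing cluster, or open a new one) can be sketched as follows. This is an illustrative sketch only; the integer cluster ids and the callable interface are assumptions, not the patent's implementation:

```python
from typing import Callable, Dict, List

def assign_to_cluster(text: str,
                      clusters: Dict[int, List[str]],
                      avg_sim: Callable[[str, List[str]], float],
                      first_threshold: float) -> int:
    """Place `text` in the cluster with the highest average similarity
    if that maximum reaches `first_threshold`; otherwise create a new
    cluster. Returns the id of the chosen cluster."""
    if clusters:
        best_id = max(clusters, key=lambda cid: avg_sim(text, clusters[cid]))
        if avg_sim(text, clusters[best_id]) >= first_threshold:
            clusters[best_id].append(text)
            return best_id
    new_id = max(clusters, default=-1) + 1
    clusters[new_id] = [text]
    return new_id
```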
14. The text processing apparatus according to claim 13, wherein the category classification unit is further configured to:
for the text cluster of any current category, respectively calculating a text similarity between the text and each of the seed texts in the text cluster of the current category, wherein the seed texts comprise the first texts and/or the first texts already divided into the respective categories;
and taking the average of the text similarities between the text and all the seed texts in the text cluster of the current category as the average text similarity of the text relative to the text cluster of the current category.
15. The text processing apparatus according to claim 14, wherein the category classification unit is further configured to:
in a case where the number of the first texts is smaller than or equal to a second threshold, calculating, for any seed text in the text cluster of the current category, a glyph similarity between the text and the seed text, and taking the glyph similarity as the text similarity between the text and the seed text.
16. The text processing apparatus of claim 14, further comprising:
a text update unit configured to update, based on a determination result of the new sensitive word, each second text that includes the determined new sensitive word to a first text, and to update the number of the first texts in the text set.
17. The text processing apparatus of claim 16, further comprising:
a vectorization unit configured to perform vectorization on each text in the text set in a case where the updated number of the first texts is greater than a second threshold, to obtain a word vector sequence corresponding to each text,
wherein the category classification unit is further configured to:
calculating, based on the word vector sequence corresponding to each text, a probability that each text belongs to a preset category;
and dividing the plurality of texts into text clusters of the preset categories based on the probability that each text belongs to each preset category.
18. The text processing apparatus according to claim 17, wherein the category classification unit is further configured to:
for any seed text in the text cluster of the current category, calculating a semantic similarity between the text and the seed text based on the word vector sequences corresponding to the text and the seed text, and taking the semantic similarity as the text similarity between the text and the seed text.
19. The text processing apparatus of claim 13, wherein each first text includes a label of the determined sensitive word it contains, and wherein the first threshold is determined such that first texts with the same label are divided into the same category.
20. The text processing apparatus according to claim 11, wherein the first determination unit is configured to:
determining the number and the proportion of the first texts in the text cluster of the category;
and determining the category as the first category when the number is greater than a third threshold and the proportion is greater than a fourth threshold.
21. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the text processing method of any one of claims 1 to 10.
22. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the text processing method of any one of claims 1 to 10.
CN202210910508.XA 2022-07-29 2022-07-29 Text processing method and device, electronic equipment and medium Active CN114969349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210910508.XA CN114969349B (en) 2022-07-29 2022-07-29 Text processing method and device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN114969349A true CN114969349A (en) 2022-08-30
CN114969349B CN114969349B (en) 2022-12-27

Family

ID=82969620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210910508.XA Active CN114969349B (en) 2022-07-29 2022-07-29 Text processing method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114969349B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006301959A (en) * 2005-04-20 2006-11-02 Just Syst Corp Document processing device, document processing method, document processing program, and computer-readable recording medium
CN112084306A (en) * 2020-09-10 2020-12-15 北京天融信网络安全技术有限公司 Sensitive word mining method and device, storage medium and electronic equipment
CN113221554A (en) * 2021-04-27 2021-08-06 北京字跳网络技术有限公司 Text processing method and device, electronic equipment and storage medium
CN114003682A (en) * 2021-10-29 2022-02-01 同盾科技有限公司 Text classification method, device, equipment and storage medium
CN114297381A (en) * 2021-12-27 2022-04-08 北京达佳互联信息技术有限公司 Text processing method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361351A (en) * 2022-12-01 2023-06-30 重庆科创职业学院 Data mining method for health management of industrial equipment
CN116361351B (en) * 2022-12-01 2024-05-17 重庆科创职业学院 Data mining method for health management of industrial equipment

Also Published As

Publication number Publication date
CN114969349B (en) 2022-12-27

Similar Documents

Publication Publication Date Title
US20200081899A1 (en) Automated database schema matching
CN110457708B (en) Vocabulary mining method and device based on artificial intelligence, server and storage medium
US11048732B2 (en) Systems and methods for records tagging based on a specific area or region of a record
US20150100308A1 (en) Automated Formation of Specialized Dictionaries
JP6705318B2 (en) Bilingual dictionary creating apparatus, bilingual dictionary creating method, and bilingual dictionary creating program
CN108536868B (en) Data processing method and device for short text data on social network
WO2020147409A1 (en) Text classification method and apparatus, computer device, and storage medium
US20220180066A1 (en) Machine learning processing pipeline optimization
CN110597978B (en) Article abstract generation method, system, electronic equipment and readable storage medium
WO2015007175A1 (en) Subject-matter analysis of tabular data
US9355091B2 (en) Systems and methods for language classification
CN111695349A (en) Text matching method and text matching system
Romanov et al. Application of natural language processing algorithms to the task of automatic classification of Russian scientific texts
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN110781955A (en) Method and device for classifying label-free objects and detecting nested codes and computer-readable storage medium
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN110569349A (en) Big data-based method, system, equipment and storage medium for pushing articles for education
CN114969349B (en) Text processing method and device, electronic equipment and medium
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN108763258B (en) Document theme parameter extraction method, product recommendation method, device and storage medium
CN113032575B (en) Document blood relationship mining method and device based on topic model
CN108733702B (en) Method, device, electronic equipment and medium for extracting upper and lower relation of user query
CN108733733B (en) Biomedical text classification method, system and storage medium based on machine learning
JP2013222418A (en) Passage division method, device and program
CN111144122A (en) Evaluation processing method, evaluation processing device, computer system, and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant