CN115983253A - Illegal word expansion method, device, equipment and storage medium - Google Patents
- Publication number
- CN115983253A CN202111195304.4A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Machine Translation (AREA)
Abstract
The invention belongs to the technical field of computers, and discloses a method, a device, equipment and a storage medium for expanding illegal words. The method comprises the following steps: generating a plurality of expansion words respectively corresponding to preset root words through a preset illegal word expansion model; determining a similarity score between each expansion word and the preset root word; determining industry initial scores of each expansion word in multiple industries based on a preset industry classification model; determining a corresponding weight according to the expansion type of the preset root word; determining an industry final score for each expansion word according to the similarity score, the industry initial score and the weight; and taking each expansion word as an illegal word of the corresponding industry according to the industry final score. In this way, the preset root words are expanded automatically and the industries of the expansion words are distinguished, which provides data support for judging advertisement violations, improves the efficiency of filling the illegal word library, avoids repeated manual searching for illegal words in large amounts of text data, and reduces labor cost.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method, a device, equipment and a storage medium for expanding illegal words.
Background
In existing advertisement violation judgment, advertisements are generally matched against existing illegal words, and those illegal words are found by manually and repeatedly searching large amounts of text data for words with a strong violating nature. This approach is highly subjective, fills the word library inefficiently, and consumes considerable labor.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The main object of the invention is to provide a method, a device, equipment and a storage medium for expanding illegal words, so as to solve the technical problem of how to expand advertisement illegal words automatically, avoiding repeated manual searching for illegal words in large amounts of text data and thereby reducing labor cost.
In order to achieve the purpose, the invention provides a method for expanding illegal words, which comprises the following steps:
generating a plurality of expansion words respectively corresponding to preset word roots through a preset illegal word expansion model;
determining similarity scores corresponding to the expansion words and the preset root words;
determining industry initial scores of the expansion words in multiple industries based on a preset industry classification model;
determining corresponding weight according to the expansion type corresponding to the preset root word;
determining an industry final score corresponding to each expansion word according to the similarity score, the industry initial score and the weight;
and taking each expansion word as the violation word of the corresponding industry according to the industry final score.
Optionally, the generating, through the preset illegal word expansion model, a plurality of expansion words respectively corresponding to preset root words includes:
acquiring a plurality of related words corresponding to a preset root word through the preset illegal word expansion model;
and carrying out duplication elimination processing on the plurality of related words according to a preset database to obtain a plurality of expansion words.
Optionally, the removing duplication of the plurality of related words according to a preset database to obtain a plurality of expansion words includes:
determining a plurality of current words from a preset database;
respectively determining editing distances between the current words and the related words;
and carrying out duplicate removal processing on the plurality of related words according to the editing distance to obtain a plurality of expansion words.
Optionally, the performing deduplication processing on the plurality of related words according to the edit distance to obtain a plurality of expansion words includes:
deleting the target related words when the target editing distance corresponding to the target related words is smaller than a preset distance threshold;
and taking the rest related words as a plurality of expansion words.
Optionally, after each expansion word is used as an illegal word of the corresponding industry according to the industry final score, the method further includes:
and storing each expansion word and the corresponding industry into the preset database.
Optionally, after each expansion word is used as an illegal word of the corresponding industry according to the industry final score, the method further includes:
acquiring a deleting instruction input by a user, and deleting the corresponding target expansion word according to the deleting instruction;
and storing the remaining expansion words and the corresponding industries in the preset database.
Optionally, the determining the similarity score between each expansion word and the preset root includes:
determining the probability value, output by the preset illegal word expansion model, of each expansion word with respect to the preset root word;
respectively determining the current editing distance between each expansion word and the preset root;
and determining the similarity score of each expansion word corresponding to the preset root word according to the probability value and the current editing distance.
Optionally, the preset illegal Word expansion model comprises a preset Word2Vec model and a preset Bert model.
Optionally, the determining, according to the probability value and the current editing distance, a similarity score corresponding to each expansion word and the preset root word includes:
determining corresponding control parameters according to a preset illegal word expansion model for generating each expansion word;
and determining similarity scores corresponding to the expansion words and the preset root words according to the control parameters, the probability values and the current editing distance.
Optionally, before generating a plurality of expansion words respectively corresponding to the preset root words through the preset illegal word expansion model, the method further includes:
training the initial Word2Vec model according to the text data to obtain a trained preset Word2Vec model;
and training the initial Bert model according to the text data, and fine-tuning the trained Bert model on a text classification task to obtain a preset Bert model.
In addition, in order to achieve the above object, the present invention further provides a device for expanding illegal words, where the device for expanding illegal words includes:
the generating module is used for generating a plurality of expansion words corresponding to the preset root respectively through the preset illegal word expansion model;
the determining module is used for determining similarity scores corresponding to the expansion words and the preset root;
the industry classification module is used for determining industry initial scores of the expansion words in multiple industries based on a preset industry classification model;
the determining module is further configured to determine a corresponding weight according to the expansion type corresponding to the preset root word;
the industry classification module is further used for determining industry final scores corresponding to the expansion words according to the similarity scores, the industry initial scores and the weights;
and the expansion module is used for taking each expansion word as the violation word of the corresponding industry according to the industry final score.
Optionally, the generating module is further configured to obtain a plurality of related words corresponding to the preset root of word through a preset violating word expansion model; and carrying out duplication elimination processing on the plurality of related words according to a preset database to obtain a plurality of expansion words.
Optionally, the generating module is further configured to determine a plurality of current words from a preset database; respectively determining editing distances between the current words and the related words; and carrying out duplication elimination processing on the plurality of related words according to the editing distance to obtain a plurality of expansion words.
Optionally, the generating module is further configured to delete the target related word when a target edit distance corresponding to the target related word is smaller than a preset distance threshold; and taking the rest related words as a plurality of expansion words.
Optionally, the violating word expanding device further includes a storage module;
the storage module is used for storing each expansion word and the corresponding industry into the preset database.
Optionally, the illegal word expansion device further comprises a storage module;
the storage module is used for acquiring a deleting instruction input by a user and deleting the corresponding target expansion word according to the deleting instruction; and storing the remaining expansion words and the corresponding industries in the preset database.
Optionally, the determining module is further configured to determine a probability value corresponding to each expansion word output by the preset illegal word expansion model and the preset root word; respectively determining the current editing distance between each expansion word and the preset root; and determining similarity scores corresponding to the expansion words and the preset root words according to the probability values and the current editing distance.
Optionally, the preset illegal Word expansion model includes a preset Word2Vec model and a preset Bert model.
In addition, in order to achieve the above object, the present invention further provides an illegal word expansion device, where the illegal word expansion device includes: a memory, a processor, and an illegal word expansion program that is stored on the memory and executable on the processor, wherein the illegal word expansion program is configured to implement the illegal word expansion method described above.
In addition, in order to achieve the above object, the present invention further provides a storage medium, where an illegal word expansion program is stored on the storage medium, and when the illegal word expansion program is executed by a processor, the illegal word expansion method is implemented as described above.
The invention generates a plurality of expansion words respectively corresponding to preset root words through a preset illegal word expansion model; determines the similarity score between each expansion word and the preset root word; determines industry initial scores of each expansion word in multiple industries based on a preset industry classification model; determines a corresponding weight according to the expansion type of the preset root word; determines an industry final score for each expansion word according to the similarity score, the industry initial score and the weight; and takes each expansion word as an illegal word of the corresponding industry according to the industry final score. In this way, the preset root words are expanded automatically and the industries of the expansion words are distinguished according to the expansion-word similarity and the expansion type, which provides data support for judging advertisement violations, improves the efficiency of filling the illegal word library, avoids repeated manual searching for illegal words in large amounts of text data, and reduces labor cost.
Drawings
FIG. 1 is a schematic structural diagram of an illegal word expansion device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart of a first embodiment of the illegal word expansion method of the present invention;
FIG. 3 is a schematic flowchart of determining an industry score in an embodiment of the illegal word expansion method of the present invention;
FIG. 4 is a flowchart of a second embodiment of the illegal word expansion method of the present invention;
FIG. 5 is a flowchart of a third embodiment of the illegal word expansion method of the present invention;
FIG. 6 is a schematic diagram of the expansion flow in an embodiment of the illegal word expansion method of the present invention;
FIG. 7 is a structural block diagram of a first embodiment of the illegal word expansion device of the present invention.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an illegal word expansion device of a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the illegal word expansion device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a Display and an input unit such as a Keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. Optionally, the network interface 1004 includes a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. Alternatively, the memory 1005 may be a storage device independent of the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the illegal word expansion device, which may include more or fewer components than those shown, combine some components, or arrange the components differently.
As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a network communication module, a user interface module, and an illegal word extension program.
In the illegal word expansion device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 may be disposed in the illegal word expansion device, which calls the illegal word expansion program stored in the memory 1005 through the processor 1001 and executes the illegal word expansion method provided by the embodiments of the present invention.
An embodiment of the present invention provides a method for expanding a violating word, and referring to fig. 2, fig. 2 is a flowchart of a first embodiment of the method for expanding violating words of the present invention.
In this embodiment, the method for expanding the violating terms includes the following steps:
Step S10: generating a plurality of expansion words respectively corresponding to preset root words through a preset illegal word expansion model.
It can be understood that the execution subject of the embodiment is an illegal word expansion device, and the illegal word expansion device may be a device such as a computer and a server, and may also be another device with the same or similar function, which is not limited in this embodiment.
It should be noted that the preset illegal word expansion model may be a Word2Vec model or a Bert model, and a plurality of expansion words related to the preset root word are recalled through the Word2Vec model and/or the Bert model. In a specific implementation, in order to further improve the expansion efficiency of the illegal word library, a plurality of preset root words may be input into the preset illegal word expansion model so as to generate a large number of expansion words. The preset root words may be obtained from a preset database, which stores a large number of existing illegal words.
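The recall step can be pictured with a small sketch. It does not use a trained Word2Vec or Bert model; instead it assumes a toy embedding table (`EMBEDDINGS`, with made-up vectors and English words) and ranks candidates by cosine similarity, which is the kind of nearest-neighbour lookup a word-vector model typically provides. The function name `recall_related_words` and the parameter `top_k` are hypothetical.

```python
import math

# Toy embedding table (assumption): a real system would load vectors
# from a trained word-vector model instead of hard-coding them.
EMBEDDINGS = {
    "guaranteed": [0.9, 0.1, 0.0],
    "certain": [0.8, 0.2, 0.1],
    "assured": [0.85, 0.15, 0.05],
    "potato": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recall_related_words(root, embeddings, top_k=2):
    """Return the top_k words closest to `root`, with their similarity."""
    root_vec = embeddings[root]
    scored = [
        (word, cosine(root_vec, vec))
        for word, vec in embeddings.items()
        if word != root  # the root itself is not an expansion candidate
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for word, score in recall_related_words("guaranteed", EMBEDDINGS):
        print(word, round(score, 3))
```

The similarity returned alongside each word plays the role of the model output that later steps combine with the edit distance.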
Step S20: determining the similarity score between each expansion word and the preset root word.
It can be understood that the similarity score may be determined based on the edit distance between each expansion word and the preset root word: the smaller the edit distance, the lower the similarity score. In a specific implementation, the output probability of the preset illegal word expansion model is also taken into account, and the similarity score is determined from the output probability and the edit distance together: the larger the output probability and the larger the edit distance, the higher the similarity score.
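A minimal sketch of such a score follows. The text states only the direction of the relationship (the score grows with both the output probability and the edit distance), not the formula, so the linear blend below is an assumption, as are the names `similarity_score` and `alpha`. The edit distance is the standard Levenshtein distance, normalised by the longer string so that both terms lie in [0, 1].

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity_score(probability, root, word, alpha=0.5):
    """Blend model probability and normalised edit distance.

    `alpha` is a hypothetical control parameter; the linear form is an
    assumption, chosen only so that the score increases with both inputs.
    """
    longest = max(len(root), len(word), 1)
    normalised = edit_distance(root, word) / longest
    return alpha * probability + (1 - alpha) * normalised


if __name__ == "__main__":
    print(similarity_score(0.8, "abc", "abd"))
    print(similarity_score(0.8, "abc", "xyz"))
```

With a fixed probability, a candidate that differs more from the root word scores higher, which matches the stated behaviour.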
Step S30: determining the industry initial scores of each expansion word in multiple industries based on a preset industry classification model.
It should be noted that the preset industry classification model may be a lightweight fastText model, which classifies each expansion word by industry to determine the industry initial score of each expansion word in each preset industry.
Step S40: determining a corresponding weight according to the expansion type corresponding to the preset root word.
It can be understood that, in this embodiment, illegal words are divided into three categories: class A words, class B words, and class C words. Class A words are derived from words that are both highly violating and highly triggering; class B words are derived from words that are highly violating but rarely triggered; class C words are new words collected from customer demands and business departments. Different categories of illegal words are expanded in different ways.
For class A words, the type A expansion generalizes the word from its own industry to the other industries. For example, in the advertisement "buy a house and get household registration handled for free", the item belongs to the real estate industry and the illegal word in that industry is "household registration handled for free". Generalizing this word takes it from its own industry to other industries, for example from real estate to education and training, which yields an illegal word for the education and training industry.
Class B words may be highly violating, but they are usually triggered less often within their industry. The type B expansion therefore keeps expanding the word within its own industry. In the example above, the illegal word belongs to the "real estate" industry; if its triggering rate is low, expansion continues within "real estate" to enrich that industry's illegal words and raise the triggering rate. For instance, the expanded illegal word "handling household registration is not expensive" still belongs to the "real estate" industry.
For class C words, the type C expansion increases the number of violation matches by enriching the illegal word library. For example, the word "hundred percent" is an advertisement illegal word in almost all industries, so expansion words with a similarly affirmative meaning, such as "affirmation" and "certain implementation", are all illegal words.
It should be noted that, in this embodiment, different industry weights are set according to the expansion type of the preset root word. For example, for a type A expansion, the weight of the root word's own industry is much smaller than the weights of the other industries; for a type B expansion, the weight of the root word's own industry is much larger than the weights of the other industries; and for a type C expansion, the weights of all industries are similar.
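The weighting rule can be sketched as a small lookup. Only the ordering of the weights comes from the text; the concrete values `low=0.1` and `high=1.0`, the function name `industry_weights`, and the industry labels are assumptions made for illustration.

```python
def industry_weights(expansion_type, root_industry, industries,
                     low=0.1, high=1.0):
    """Return one weight per industry for the given expansion type.

    Type A: the root word's own industry is down-weighted.
    Type B: the root word's own industry is up-weighted.
    Type C: all industries get a similar weight.
    The numeric values are placeholders, not taken from the source.
    """
    if expansion_type not in ("A", "B", "C"):
        raise ValueError(f"unknown expansion type: {expansion_type!r}")
    weights = {}
    for industry in industries:
        is_own = industry == root_industry
        if expansion_type == "A":
            weights[industry] = low if is_own else high
        elif expansion_type == "B":
            weights[industry] = high if is_own else low
        else:
            weights[industry] = high
    return weights


if __name__ == "__main__":
    industries = ["real_estate", "education", "finance"]
    print(industry_weights("A", "real_estate", industries))
    print(industry_weights("B", "real_estate", industries))
    print(industry_weights("C", "real_estate", industries))
```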
Step S50: determining the industry final score corresponding to each expansion word according to the similarity score, the industry initial score and the weight.
It should be understood that, assuming the similarity score is denoted $\mathrm{sim\_score}$, the industry initial score is denoted $F(q, q_i, \mathrm{cate})$, and the weights are denoted $\gamma_i$ and $\delta_i$, the industry final score corresponding to each expansion word is determined according to formula (1):
$$\mathrm{final\_score} = \gamma_i \cdot \mathrm{sim\_score} + \delta_i \cdot F(q, q_i, \mathrm{cate}) \tag{1}$$
Step S60: taking each expansion word as an illegal word of the corresponding industry according to the industry final score.
It should be noted that, in this embodiment, the industry final score is constrained by the expansion type and the similarity score. The industry final score of each expansion word in each industry is determined, the final scores of each expansion word are sorted, the largest score and its industry are identified, and each expansion word is bound to that industry, yielding the expanded illegal words of each industry.
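Formula (1) and the selection of the highest-scoring industry can be sketched as follows. The text does not spell out how the per-industry weights of step S40 map onto the two coefficients, so this sketch assumes a single constant `gamma` on the similarity score and a per-industry `delta` on the initial score. The function names and example numbers are hypothetical.

```python
def final_scores(sim_score, initial_scores, delta, gamma=1.0):
    """Apply final = gamma * sim_score + delta[industry] * initial score.

    `initial_scores` and `delta` are dicts keyed by industry name.
    Treating gamma as a constant and delta as per-industry is an assumption.
    """
    return {
        industry: gamma * sim_score + delta[industry] * score
        for industry, score in initial_scores.items()
    }


def assign_industry(sim_score, initial_scores, delta, gamma=1.0):
    """Return the industry with the largest final score, and that score."""
    scores = final_scores(sim_score, initial_scores, delta, gamma)
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    initial = {"real_estate": 0.7, "education": 0.2}
    # Type A style weights: the root word's own industry is down-weighted.
    type_a = {"real_estate": 0.1, "education": 1.0}
    # Type B style weights: the root word's own industry is up-weighted.
    type_b = {"real_estate": 1.0, "education": 0.1}
    print(assign_industry(0.6, initial, type_a))
    print(assign_industry(0.6, initial, type_b))
```

In the example, the same expansion word lands in a different industry depending on the weights, which is how the expansion type steers the final assignment.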
For example, referring to fig. 3, fig. 3 is a schematic flowchart of determining an industry score in an embodiment of the illegal word expansion method of the present invention. The preset root word is an illegal root word q selected from the preset database; its word category is class A and it belongs to industry 1. A plurality of expansion words corresponding to the preset root word are generated by the preset illegal word expansion model, giving a candidate set of expansion words q1, q2, q3 …. The similarity scores s1, s2, s3 … between each expansion word and the preset root word q are determined, and the industry initial scores f1, f2, f3 … of each expansion word in industry 1 to industry n are determined based on the preset industry classification model. The weights, that is, the coefficients, are determined according to the expansion type corresponding to class A words. The industry final scores fs1, fs2, fs3 … of the expansion words are then determined from the similarity scores s1, s2, s3 …, the coefficients, and the industry initial scores f1, f2, f3 …, and each expansion word is taken as an illegal word of the corresponding industry based on its industry final score; for example, the expansion words corresponding to industry 1 are q3, q5, q1 ….
In the embodiment, a plurality of expansion words respectively corresponding to preset roots are generated through a preset illegal word expansion model; determining similarity scores corresponding to the expansion words and preset roots; determining industry initial scores of the expansion words in multiple industries based on a preset industry classification model; determining corresponding weight according to the expansion type corresponding to the preset root word; determining an industry final score corresponding to each expansion word according to the similarity score, the industry initial score and the weight; and taking each expansion word as a violation word of the corresponding industry according to the final industry score. Through the method, the preset word root is automatically expanded, industries corresponding to the expanded words are distinguished according to the expanded word similarity and the expanded types, data support is provided for judging the violation of the advertisement, the filling efficiency of the violation word bank is improved, the phenomenon that the violation words are searched from a large amount of text data repeatedly by manpower is avoided, and the labor cost is reduced.
Referring to fig. 4, fig. 4 is a flowchart illustrating a method for expanding a violating word according to a second embodiment of the present invention.
Based on the first embodiment, the step S10 of the method for expanding illegal words in this embodiment includes:
Step S101: acquiring a plurality of related words corresponding to the preset root word through a preset illegal word expansion model.
Step S102: performing deduplication on the plurality of related words according to a preset database to obtain a plurality of expansion words.
It can be understood that a large number of existing violation words are stored in the preset database, and related words which are the same as the existing violation words are deleted to obtain a plurality of expansion words.
Further, the step S102 includes: determining a plurality of current words from a preset database; respectively determining editing distances between the current words and the related words; and carrying out duplication elimination processing on the plurality of related words according to the editing distance to obtain a plurality of expansion words.
Further, the removing duplication of the plurality of related words according to the edit distance to obtain a plurality of expansion words includes: deleting the target related words when the target editing distance corresponding to the target related words is smaller than a preset distance threshold; and taking the rest related words as a plurality of expansion words.
It should be noted that, in order to further reduce repeated related words and avoid the waste of computing resources caused by performing industry division on the repeated related words, in this embodiment, the related words are deduplicated by determining the edit distance between the related words and the existing illegal words, so as to obtain a plurality of expanded words.
It can be understood that related words whose edit distance is smaller than the preset distance threshold are deleted. The preset distance threshold may be set according to business requirements and is used to decide whether a related word is highly similar to an existing illegal word. The deduplication based on edit distance removes related words whose character strings are highly similar; for example, if "what the potato is" is expanded into "what the potato is dry", the expanded related word is removed.
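The deduplication rule can be sketched as below, assuming the edit distance is the Levenshtein distance and that a related word is dropped as soon as it is closer than the threshold to any existing word. The function name `deduplicate`, the threshold value, and the English example words are illustrative assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def deduplicate(related_words, existing_words, threshold=2):
    """Keep related words that are not near-duplicates of existing words.

    A related word is deleted when its edit distance to any existing word
    is smaller than `threshold` (a business-defined value, assumed here).
    """
    kept = []
    for word in related_words:
        too_close = any(
            edit_distance(word, existing) < threshold
            for existing in existing_words
        )
        if not too_close:
            kept.append(word)
    return kept


if __name__ == "__main__":
    existing = ["guaranteed"]
    related = ["guaranteed", "guarantee", "certain"]
    print(deduplicate(related, existing))
```

An exact duplicate has distance 0 and a one-character variant has distance 1, so both fall under a threshold of 2 and only the genuinely new word remains.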
Further, after the step S60, the method further includes: and storing each expansion word and the corresponding industry into the preset database.
It should be noted that the expansion words obtained by expansion and their corresponding industries are stored in the preset database, which further supplements the existing illegal word library. When illegal words are expanded, the preset root words are obtained from the preset database: the user selects preset root words from the various industries according to business requirements and labels each selected root word with an expansion type, and the illegal words are then expanded in the manner corresponding to that expansion type. When the expansion words are stored, auditors may also be prompted to label each expansion word with its word category, that is, to determine whether it is a class A, class B, or class C word, and the expansion words, their industries, and their word categories are stored in the preset database. In that case, the preset root words and their word categories are obtained from the preset database when illegal words are expanded, and the expansion is performed according to the expansion type corresponding to the word category. In a specific implementation, the expansion words may also be classified directly according to the expansion type of the preset root word; that is, expansion words obtained by expanding a class A word are marked as class A words.
Further, after the step S60, the method further includes: acquiring a deleting instruction input by a user, and deleting the corresponding target expansion word according to the deleting instruction; and storing the remaining expansion words and the corresponding industries in the preset database.
It should be noted that this embodiment provides a manual review process: after a plurality of expansion words are obtained through expansion, a manually input deletion instruction is received, which further avoids duplicate illegal words and ensures the accuracy of the data in the illegal word bank.
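The review-then-store step can be illustrated with a short sketch. Everything concrete here is an assumption made for illustration: the patent does not specify a database engine or schema, so the SQLite table `violation_words`, its columns, and the function name `store_expansion_words` are hypothetical.

```python
import sqlite3


def store_expansion_words(conn, scored_words, deleted_words=()):
    """Apply the reviewer's deletions, then store the remaining words.

    scored_words: iterable of (word, industry, category) tuples.
    deleted_words: words named in the user's deletion instruction.
    """
    # Hypothetical schema: one row per (word, industry) pair.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS violation_words ("
        "word TEXT, industry TEXT, category TEXT, "
        "PRIMARY KEY (word, industry))"
    )
    rejected = set(deleted_words)
    remaining = [row for row in scored_words if row[0] not in rejected]
    # INSERT OR IGNORE skips pairs that are already in the word bank.
    conn.executemany(
        "INSERT OR IGNORE INTO violation_words (word, industry, category) "
        "VALUES (?, ?, ?)",
        remaining,
    )
    conn.commit()
    return remaining
```

Calling it with `sqlite3.connect(":memory:")` and an empty `deleted_words` corresponds to the variant in which every expansion word is stored without manual deletion.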
In this embodiment, a plurality of related words corresponding to a preset root word are obtained through a preset illegal word expansion model; deduplication processing is performed on the related words according to a preset database to obtain a plurality of expansion words; similarity scores between the expansion words and the preset root word are determined; industry initial scores of the expansion words in a plurality of industries are determined based on a preset industry classification model; a corresponding weight is determined according to the expansion type of the preset root word; an industry final score for each expansion word is determined according to the similarity score, the industry initial score and the weight; and each expansion word is taken as an illegal word of the corresponding industry according to the industry final score. In this way, the preset root words are expanded automatically and deduplicated against the preset database, which avoids duplicate illegal words and further improves the efficiency of filling the illegal word bank. The industry of each expansion word is distinguished according to its similarity and expansion type, which provides data support for judging whether an advertisement violates the rules, avoids repeated manual searching for illegal words in large amounts of text data, and reduces labor cost.
Referring to fig. 5, fig. 5 is a flowchart illustrating a method for expanding a violating word according to a third embodiment of the present invention.
Based on the first embodiment, step S20 of the method for expanding illegal words in this embodiment includes:
Step S201: determining the probability value, output by the preset illegal word expansion model, of each expansion word corresponding to the preset root word.
Step S202: determining the current edit distance between each expansion word and the preset root word.
Step S203: determining the similarity score of each expansion word with respect to the preset root word according to the probability value and the current edit distance.
It will be appreciated that, assuming the probability value output by the model is denoted $M(q, q_i)$ and the current edit distance is denoted $L(q, q_i)$, where $q$ is the input preset root word and $q_i$ is an expansion word recalled by the model, the similarity score of each expansion word with respect to the preset root word is determined according to formula (2):
$$\mathrm{sim\_score} = \alpha M(q, q_i) + \beta L(q, q_i) \tag{2}$$
Further, the preset illegal word expansion model comprises a preset Word2Vec model and a preset Bert model.
Specifically, the step S203 includes: determining corresponding control parameters according to a preset illegal word expansion model for generating each expansion word; and determining similarity scores corresponding to the expansion words and the preset root words according to the control parameters, the probability values and the current editing distance.
It should be noted that the preset Word2Vec model and the preset Bert model differ, so the control parameters α and β corresponding to each model are stored in a preset storage region in advance. The α and β to use are determined according to the model that generated each expansion word, and the similarity score of each expansion word with respect to the preset root word is then determined according to formula (2) from the control parameters α and β, the probability value M(q, q_i) and the current edit distance L(q, q_i).
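A minimal sketch of formula (2) with per-model control parameters is given below. The numeric values of α and β are placeholders chosen only for illustration, since the text does not disclose them; in particular, the negative β (so that a larger edit distance lowers the score) is an assumption, and the names `CONTROL_PARAMS` and `similarity_score` are hypothetical.

```python
# Hypothetical (alpha, beta) pairs per generating model. The patent stores
# such pairs in a preset storage region but does not give their values.
CONTROL_PARAMS = {
    "word2vec": (1.0, -0.1),
    "bert": (0.8, -0.05),
}


def similarity_score(probability, edit_dist, model_name, params=CONTROL_PARAMS):
    """Formula (2): sim_score = alpha * M(q, q_i) + beta * L(q, q_i).

    probability: M(q, q_i), the value output by the expansion model.
    edit_dist:   L(q, q_i), the edit distance between root and expansion word.
    model_name:  which model generated the expansion word.
    """
    alpha, beta = params[model_name]
    return alpha * probability + beta * edit_dist
```

Looking the parameters up by the generating model mirrors the statement that α and β are determined according to the model that produced each expansion word.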
Further, before the step S10, the method further includes: training an initial Word2Vec model on text data to obtain the trained preset Word2Vec model; and training an initial Bert model on the text data and fine-tuning the trained Bert model on a text classification task to obtain the preset Bert model.
It can be understood that, before illegal word expansion is carried out, the Word2Vec model is trained in an unsupervised manner on a large amount of text data, and the Bert model is fine-tuned by text classification on the basis of the pre-trained Bert model, thereby obtaining the preset Word2Vec model and the preset Bert model.
For example, referring to fig. 6, fig. 6 is a schematic view of the expansion flow of an embodiment of the method for expanding illegal words according to the present invention. Illegal root words of classes A, B and C are input into an illegal word expansion model, which is trained on a large amount of text data and comprises a Word2Vec model and a Bert model, to obtain a plurality of approximate expansion words. A deduplication operation is performed on these expansion words against the existing illegal word bank. The expansion words are then classified by a fastText model to determine the industry initial score of each expansion word, and the industry of each expansion word is determined based on the expansion category of the corresponding illegal root word. Finally, the expanded words are manually screened to confirm whether they are in fact illegal, labeled with the corresponding category, and stored in the existing illegal word bank.
In this embodiment, a plurality of expansion words corresponding to a preset root word are generated through a preset illegal word expansion model; the probability value, output by the preset illegal word expansion model, of each expansion word corresponding to the preset root word is determined; the current edit distance between each expansion word and the preset root word is determined; the similarity score of each expansion word with respect to the preset root word is determined according to the probability value and the current edit distance; industry initial scores of the expansion words in a plurality of industries are determined based on a preset industry classification model; a corresponding weight is determined according to the expansion type of the preset root word; an industry final score for each expansion word is determined according to the similarity score, the industry initial score and the weight; and each expansion word is taken as an illegal word of the corresponding industry according to the industry final score. In this way, the preset root word is expanded automatically, and the similarity is determined from the model output probability value and the edit distance. The industry of each expansion word is distinguished according to its similarity and expansion type, and constraining the industry score by the similarity and the expansion type improves the accuracy of the industry division. This provides data support for judging whether an advertisement violates the rules, improves the efficiency of filling the illegal word bank, avoids repeated manual searching for illegal words in large amounts of text data, and reduces labor cost.
In addition, an embodiment of the present invention further provides a storage medium, where an illegal word expansion program is stored on the storage medium, and when being executed by a processor, the illegal word expansion program implements the illegal word expansion method described above.
Since the storage medium adopts all technical solutions of all the embodiments, at least all the beneficial effects brought by the technical solutions of the embodiments are achieved, and no further description is given here.
Referring to fig. 7, fig. 7 is a block diagram illustrating a structure of a first embodiment of an illegal word expansion device according to the present invention.
As shown in fig. 7, the apparatus for expanding an illegal word according to the embodiment of the present invention includes:
The generating module 10 is configured to generate a plurality of expansion words corresponding to the preset root word through the preset illegal word expansion model.
The determining module 20 is configured to determine similarity scores between the expansion words and the preset root word.
The industry classification module 30 is configured to determine industry initial scores of the expansion words in a plurality of industries based on a preset industry classification model.
The determining module 20 is further configured to determine a corresponding weight according to the expansion type corresponding to the preset root word.
The industry classification module 30 is further configured to determine an industry final score corresponding to each expansion word according to the similarity score, the industry initial score, and the weight.
The expansion module 40 is configured to take each expansion word as an illegal word of the corresponding industry according to the industry final score.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in a specific application, a person skilled in the art may set the technical solution as needed, and the present invention is not limited thereto.
In this embodiment, a plurality of expansion words corresponding to a preset root word are generated through a preset illegal word expansion model; similarity scores between the expansion words and the preset root word are determined; industry initial scores of the expansion words in a plurality of industries are determined based on a preset industry classification model; a corresponding weight is determined according to the expansion type of the preset root word; an industry final score for each expansion word is determined according to the similarity score, the industry initial score and the weight; and each expansion word is taken as an illegal word of the corresponding industry according to the industry final score. In this way, the preset root word is expanded automatically and the industry of each expansion word is distinguished according to its similarity and expansion type. This provides data support for judging whether an advertisement violates the rules, improves the efficiency of filling the illegal word bank, avoids repeated manual searching for illegal words in large amounts of text data, and reduces labor cost.
It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.
In addition, for technical details not described in this embodiment, reference may be made to the method for expanding illegal words provided in any embodiment of the present invention; they are not repeated here.
In an embodiment, the generating module 10 is further configured to obtain a plurality of related words corresponding to a preset root of a word through a preset violating word expansion model; and carrying out duplication elimination processing on the plurality of related words according to a preset database to obtain a plurality of expansion words.
In an embodiment, the generating module 10 is further configured to determine a plurality of current words from a preset database; respectively determining editing distances between the current words and the related words; and carrying out duplication elimination processing on the plurality of related words according to the editing distance to obtain a plurality of expansion words.
In an embodiment, the generating module 10 is further configured to delete the target related word when a target editing distance corresponding to the target related word is smaller than a preset distance threshold; and taking the rest related words as a plurality of expansion words.
In an embodiment, the illegal word expanding device further comprises a storage module;
the storage module is used for storing each expansion word and the corresponding industry into the preset database.
In an embodiment, the illegal word expanding device further comprises a storage module;
the storage module is used for acquiring a deleting instruction input by a user and deleting the corresponding target expansion word according to the deleting instruction; and storing the remaining expansion words and the corresponding industries in the preset database.
In an embodiment, the determining module 20 is further configured to determine a probability value corresponding to each expansion word output by the preset illegal word expansion model and the preset root word; respectively determining the current editing distance between each expansion word and the preset root; and determining similarity scores corresponding to the expansion words and the preset root words according to the probability values and the current editing distance.
In an embodiment, the preset illegal word expansion model includes a preset Word2Vec model and a preset Bert model.
In an embodiment, the determining module 20 is further configured to determine a corresponding control parameter according to a preset illegal term expansion model for generating each expansion term; and determining similarity scores corresponding to the expansion words and the preset root words according to the control parameters, the probability values and the current editing distance.
In an embodiment, the illegal word expanding device further comprises a storage module;
the storage module is used for training the initial Word2Vec model according to the text data to obtain a trained preset Word2Vec model; and training the initial Bert model according to the text data, and finely adjusting the trained Bert model based on text classification to obtain a preset Bert model.
Further, it is to be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are only for description, and do not represent the advantages and disadvantages of the embodiments.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solution of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk), and includes several instructions for enabling a terminal device (e.g. a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are also included in the scope of the present invention.
The invention discloses A1. A method for expanding illegal words, wherein the method comprises the following steps:
generating a plurality of expansion words respectively corresponding to preset word roots through a preset illegal word expansion model;
determining similarity scores corresponding to the expansion words and the preset root words;
determining industry initial scores of the expansion words in multiple industries based on a preset industry classification model;
determining corresponding weight according to the expansion type corresponding to the preset root word;
determining an industry final score corresponding to each expansion word according to the similarity score, the industry initial score and the weight;
and taking each expansion word as the violation word of the corresponding industry according to the industry final score.
A2. The method for expanding illegal words according to A1, wherein generating a plurality of expansion words respectively corresponding to preset root words through a preset illegal word expansion model comprises:
acquiring a plurality of related words corresponding to a preset root of a word through a preset illegal word expansion model;
and carrying out duplication elimination processing on the plurality of related words according to a preset database to obtain a plurality of expansion words.
A3, according to the method for expanding the illegal word as described in the A2, the duplication elimination processing is performed on the plurality of related words according to a preset database to obtain a plurality of expanded words, and the method comprises the following steps:
determining a plurality of current words from a preset database;
respectively determining editing distances between the current words and the related words;
and carrying out duplication elimination processing on the plurality of related words according to the editing distance to obtain a plurality of expansion words.
A4, the method for expanding illegal words according to A3, wherein the removing of the duplication of the plurality of related words according to the editing distance to obtain a plurality of expanded words comprises the following steps:
deleting the target related words when the target editing distance corresponding to the target related words is smaller than a preset distance threshold;
and taking the rest related words as a plurality of expansion words.
A5, the method for expanding illegal words according to A2, wherein after each expanded word is used as the illegal word of the corresponding industry according to the industry final score, the method further comprises:
and storing each expansion word and the corresponding industry into the preset database.
A6, the method for expanding illegal words according to the A2, wherein after each expanded word is used as the illegal word of the corresponding industry according to the final industry score, the method further comprises the following steps:
acquiring a deleting instruction input by a user, and deleting the corresponding target expansion word according to the deleting instruction;
and storing the remaining expansion words and the corresponding industries into the preset database.
A7. The method for expanding illegal words according to A1, wherein determining the similarity scores corresponding to the expansion words and the preset root word comprises:
determining probability values corresponding to the preset root of each expansion word output by the preset illegal word expansion model;
respectively determining the current editing distance between each expansion word and the preset root;
and determining similarity scores corresponding to the expansion words and the preset root words according to the probability values and the current editing distance.
A8. The method for expanding illegal words according to A7, wherein the preset illegal word expansion model comprises a preset Word2Vec model and a preset Bert model.
A9, the method for expanding illegal words according to A8, wherein determining similarity scores corresponding to the preset root words for each expanded word according to the probability value and the current editing distance includes:
determining corresponding control parameters according to a preset illegal word expansion model for generating each expansion word;
and determining similarity scores corresponding to the expansion words and the preset root words according to the control parameters, the probability values and the current editing distance.
A10, the method for expanding illegal words according to A8, before generating a plurality of expansion words respectively corresponding to preset roots of words through a preset illegal word expansion model, the method further includes:
training the initial Word2Vec model according to the text data to obtain a trained preset Word2Vec model;
and training the initial Bert model according to the text data, and finely adjusting the trained Bert model based on text classification to obtain a preset Bert model.
The invention also discloses B11. An illegal word expansion device, wherein the illegal word expansion device comprises:
the generating module is used for generating a plurality of expansion words corresponding to the preset root respectively through the preset illegal word expansion model;
the determining module is used for determining similarity scores corresponding to the expansion words and the preset root;
the industry classification module is used for determining industry initial scores of the expansion words in multiple industries based on a preset industry classification model;
the determining module is further configured to determine a corresponding weight according to the expansion type corresponding to the preset root word;
the industry classification module is further used for determining industry final scores corresponding to the expansion words according to the similarity scores, the industry initial scores and the weights;
and the expansion module is used for taking each expansion word as the violation word of the corresponding industry according to the industry final score.
B12. The illegal word expansion device according to B11, wherein the generating module is further configured to obtain a plurality of related words corresponding to the preset root word through a preset illegal word expansion model; and to perform deduplication processing on the plurality of related words according to a preset database to obtain a plurality of expansion words.
B13. The illegal word expansion device according to B12, wherein the generating module is further configured to determine a plurality of current words from a preset database; to determine the edit distances between the current words and the related words; and to perform deduplication processing on the plurality of related words according to the edit distances to obtain a plurality of expansion words.
B14, as for the illegal word expanding device described in B13, the generating module is further configured to delete the target related word when a target editing distance corresponding to the target related word is smaller than a preset distance threshold; and taking the rest related words as a plurality of expansion words.
B15, the illegal word expanding device as B12, wherein the illegal word expanding device further comprises a storage module;
the storage module is used for storing each expansion word and the corresponding industry into the preset database.
B16, the illegal word expanding device as B12, wherein the illegal word expanding device further comprises a storage module;
the storage module is used for acquiring a deleting instruction input by a user and deleting the corresponding target expansion word according to the deleting instruction; and storing the remaining expansion words and the corresponding industries in the preset database.
B17, as for the illegal word expansion device described in B11, the determining module is further configured to determine a probability value corresponding to the preset root word for each expanded word output by the preset illegal word expansion model; respectively determining the current editing distance between each expansion word and the preset root; and determining similarity scores corresponding to the expansion words and the preset root words according to the probability values and the current editing distance.
B18. The illegal word expansion device according to B17, wherein the preset illegal word expansion model comprises a preset Word2Vec model and a preset Bert model.
The invention also discloses C19. Illegal word expansion equipment, wherein the equipment comprises: a memory, a processor, and an illegal word expansion program stored on the memory and executable on the processor, the illegal word expansion program being configured to implement the illegal word expansion method according to any one of A1 to A10.
The invention also discloses D20. A storage medium, wherein an illegal word expansion program is stored on the storage medium, and the illegal word expansion program, when executed by a processor, implements the illegal word expansion method according to any one of A1 to A10.
Claims (10)
1. A method for expanding illegal words is characterized by comprising the following steps:
generating a plurality of expansion words respectively corresponding to preset word roots through a preset illegal word expansion model;
determining similarity scores corresponding to the expansion words and the preset root words;
determining industry initial scores of the expansion words in multiple industries based on a preset industry classification model;
determining corresponding weight according to the expansion type corresponding to the preset root word;
determining an industry final score corresponding to each expansion word according to the similarity score, the industry initial score and the weight;
and taking each expansion word as a violation word of the corresponding industry according to the final industry score.
2. The method for expanding illegal words according to claim 1, wherein the generating of the plurality of expansion words respectively corresponding to the preset root of word by the preset illegal word expansion model comprises:
acquiring a plurality of related words corresponding to a preset root of a word through a preset illegal word expansion model;
and carrying out duplicate removal processing on the plurality of related words according to a preset database to obtain a plurality of expansion words.
3. The method for expanding illegal words according to claim 2, wherein the step of performing deduplication processing on the plurality of related words according to a preset database to obtain a plurality of expanded words comprises the steps of:
determining a plurality of current words from a preset database;
respectively determining editing distances between the current words and the related words;
and carrying out duplication elimination processing on the plurality of related words according to the editing distance to obtain a plurality of expansion words.
4. The method for expanding illegal words according to claim 3, wherein the removing the duplicate of the related words according to the edit distance to obtain expanded words comprises:
deleting the target related words when the target editing distance corresponding to the target related words is smaller than a preset distance threshold;
and taking the rest related words as a plurality of expansion words.
5. The method for expanding illegal words according to claim 2, wherein after each expansion word is used as the illegal word of the corresponding industry according to the industry final score, the method further comprises:
and storing each expansion word and the corresponding industry into the preset database.
6. The method for expanding illegal words according to claim 2, wherein after each expanded word is used as the illegal word of the corresponding industry according to the industry final score, the method further comprises:
acquiring a deleting instruction input by a user, and deleting the corresponding target expansion word according to the deleting instruction;
and storing the remaining expansion words and the corresponding industries in the preset database.
7. The method for extending illegal words according to claim 1, wherein the determining the similarity score of each extended word corresponding to the preset root word comprises:
determining probability values corresponding to the expansion words output by the preset illegal word expansion model and the preset root of word;
respectively determining the current editing distance between each expansion word and the preset root;
and determining the similarity score of each expansion word corresponding to the preset root word according to the probability value and the current editing distance.
8. An illegal word expansion device, characterized in that the illegal word expansion device comprises:
the generating module is used for generating a plurality of expansion words corresponding to the preset root respectively through the preset illegal word expansion model;
the determining module is used for determining similarity scores corresponding to the expansion words and the preset root;
the industry classification module is used for determining industry initial scores of the expansion words in multiple industries based on a preset industry classification model;
the determining module is further configured to determine a corresponding weight according to the expansion type corresponding to the preset root word;
the industry classification module is further used for determining industry final scores corresponding to the expansion words according to the similarity scores, the industry initial scores and the weights;
and the expansion module is used for taking each expansion word as the violation word of the corresponding industry according to the industry final score.
9. An illegal word expansion device, characterized in that the device comprises: a memory, a processor, and an offending word extension program stored on the memory and executable on the processor, the offending word extension program configured to implement the offending word extension method of any of claims 1-7.
10. A storage medium having stored thereon an illegal word expansion program that, when executed by a processor, implements an illegal word expansion method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111195304.4A CN115983253A (en) | 2021-10-13 | 2021-10-13 | Illegal word expansion method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111195304.4A CN115983253A (en) | 2021-10-13 | 2021-10-13 | Illegal word expansion method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115983253A true CN115983253A (en) | 2023-04-18 |
Family
ID=85972553
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111195304.4A (pending) CN115983253A (en) | 2021-10-13 | 2021-10-13 | Illegal word expansion method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115983253A (en) |
- 2021-10-13: CN202111195304.4A filed in CN; published as CN115983253A (en); status: active, pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111797210A (en) | | Information recommendation method, device and equipment based on user portrait and storage medium |
CN111767716B (en) | | Method and device for determining enterprise multi-level industry information and computer equipment |
KR102104316B1 (en) | | Apparatus for predicting stock price of company by analyzing news and operating method thereof |
CN114119058B (en) | | User portrait model construction method, device and storage medium |
CN111078835A (en) | | Resume evaluation method and device, computer equipment and storage medium |
US20120046937A1 (en) | | Semantic classification of variable data campaign information |
CN110110218A (en) | | Identity association method and terminal |
US20090094174A1 (en) | | Method, system and program product for on demand data mining server with dynamic mining models |
CN114625834A (en) | | Enterprise industry information determination method and device and electronic equipment |
CN107402886B (en) | | Storehouse analysis method and relevant apparatus |
CN113095723A (en) | | Coupon recommendation method and device |
JP2007157058A (en) | | Classification model learning device, classification model learning method, and program for learning classification model |
CN115983253A (en) | | Illegal word expansion method, device, equipment and storage medium |
JP6763967B2 (en) | | Data conversion device and data conversion method |
CN115878864A (en) | | Data retrieval method, device and equipment and readable storage medium |
CN111309953B (en) | | Image recognition method and device |
CN115309995A (en) | | Scientific and technological resource pushing method and device based on demand text |
WO2021260865A1 (en) | | Classification device, classification method, and classification program |
US20210064862A1 (en) | | System and a method for developing a tool for automated data capture |
CN112182218A (en) | | Text data classification method and device |
CN117648635B (en) | | Sensitive information classification and classification method and system and electronic equipment |
CN117271653B (en) | | Multi-dimensional patent map construction method and system |
CN116451787B (en) | | Content risk identification method, device, system and equipment |
JP2014038392A (en) | | Spam account score calculation device, spam account score calculation method and program |
US20240283820A1 (en) | | Automated machine learning using large language models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||