CN115983253A

CN115983253A - Illegal word expansion method, device, equipment and storage medium

Info

Publication number: CN115983253A
Application number: CN202111195304.4A
Authority: CN
Inventors: 杭浩然; 吴天昊; 何丙南
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2021-10-13
Filing date: 2021-10-13
Publication date: 2023-04-18

Abstract

The invention belongs to the field of computer technology and discloses a violation word expansion method, device, equipment and storage medium. The method comprises: generating, through a preset violation word expansion model, a number of expanded words for each preset root word; determining a similarity score between each expanded word and its preset root word; determining an industry initial score for each expanded word in multiple industries based on a preset industry classification model; determining a weight according to the expansion type of the preset root word; determining an industry final score for each expanded word from the similarity score, the industry initial score and the weight; and taking each expanded word as a violation word of the corresponding industry according to the industry final score. In this way, preset root words are expanded automatically and the industry of each expanded word is identified, which provides data support for judging whether advertisements are illegal or non-compliant, fills the violation lexicon more efficiently, avoids repeated manual searches for violation words in large volumes of text data, and reduces labor costs.

Description

Violation word expansion method, device, equipment and storage medium

Technical field

The present invention relates to the field of computer technology, and in particular to a violation word expansion method, device, equipment and storage medium.

Background

Existing checks of whether an advertisement is illegal or non-compliant usually match the advertisement against known violation words, and those known violation words are found by people repeatedly searching large volumes of text data for words that strongly indicate an illegal advertisement. This approach is highly subjective, fills the lexicon inefficiently, and is labor-intensive.

The above is provided only to aid understanding of the technical solution of the present invention and is not an admission that it constitutes prior art.

Summary of the invention

The main purpose of the present invention is to provide a violation word expansion method, device, equipment and storage medium, aiming to solve the technical problem of how to expand advertising violation words automatically, so that people no longer have to search large volumes of text data for violation words repeatedly, thereby reducing labor costs.

To achieve the above purpose, the present invention provides a violation word expansion method, the method comprising the following steps:

generating, through a preset violation word expansion model, a number of expanded words for each preset root word;

determining a similarity score between each expanded word and the preset root word;

determining an industry initial score for each expanded word in multiple industries based on a preset industry classification model;

determining a corresponding weight according to the expansion type of the preset root word;

determining an industry final score for each expanded word according to the similarity score, the industry initial score and the weight;

taking each expanded word as a violation word of the corresponding industry according to the industry final score.

Optionally, generating, through the preset violation word expansion model, a number of expanded words for each preset root word comprises:

obtaining a number of related words for the preset root word through the preset violation word expansion model;

deduplicating the related words against a preset database to obtain a number of expanded words.

Optionally, deduplicating the related words against the preset database to obtain a number of expanded words comprises:

determining a number of current words from the preset database;

determining the edit distance between each current word and each related word;

deduplicating the related words according to the edit distance to obtain a number of expanded words.

Optionally, deduplicating the related words according to the edit distance to obtain a number of expanded words comprises:

deleting a target related word when the target edit distance corresponding to that target related word is less than a preset distance threshold;

taking the remaining related words as the expanded words.

Optionally, after taking each expanded word as a violation word of the corresponding industry according to the industry final score, the method further comprises:

storing each expanded word and its corresponding industry in the preset database.

obtaining a deletion instruction input by a user, and deleting the corresponding target expanded word according to the deletion instruction;

storing the remaining expanded words and their corresponding industries in the preset database.

Optionally, determining the similarity score between each expanded word and the preset root word comprises:

determining the probability value, output by the preset violation word expansion model, of each expanded word with respect to the preset root word;

determining the current edit distance between each expanded word and the preset root word;

determining the similarity score between each expanded word and the preset root word according to the probability value and the current edit distance.

Optionally, the preset violation word expansion model includes a preset Word2Vec model and a preset BERT model.

Optionally, determining the similarity score between each expanded word and the preset root word according to the probability value and the current edit distance comprises:

determining a corresponding control parameter according to the preset violation word expansion model that generated each expanded word;

determining the similarity score between each expanded word and the preset root word according to the control parameter, the probability value and the current edit distance.

Optionally, before generating, through the preset violation word expansion model, a number of expanded words for each preset root word, the method further comprises:

training an initial Word2Vec model on text data to obtain the trained preset Word2Vec model;

training an initial BERT model on the text data, and fine-tuning the trained BERT model on a text classification task to obtain the preset BERT model.

In addition, to achieve the above purpose, the present invention also provides a violation word expansion device, the violation word expansion device comprising:

a generation module, configured to generate, through a preset violation word expansion model, a number of expanded words for each preset root word;

a determination module, configured to determine a similarity score between each expanded word and the preset root word;

an industry classification module, configured to determine an industry initial score for each expanded word in multiple industries based on a preset industry classification model;

the determination module being further configured to determine a corresponding weight according to the expansion type of the preset root word;

the industry classification module being further configured to determine an industry final score for each expanded word according to the similarity score, the industry initial score and the weight;

an expansion module, configured to take each expanded word as a violation word of the corresponding industry according to the industry final score.

Optionally, the generation module is further configured to obtain a number of related words for the preset root word through the preset violation word expansion model, and to deduplicate the related words against a preset database to obtain a number of expanded words.

Optionally, the generation module is further configured to determine a number of current words from the preset database; determine the edit distance between each current word and each related word; and deduplicate the related words according to the edit distance to obtain a number of expanded words.

Optionally, the generation module is further configured to delete a target related word when the target edit distance corresponding to that target related word is less than a preset distance threshold, and to take the remaining related words as the expanded words.

Optionally, the violation word expansion device further comprises a storage module;

the storage module is configured to store each expanded word and its corresponding industry in the preset database.

The storage module is configured to obtain a deletion instruction input by a user, delete the corresponding target expanded word according to the deletion instruction, and store the remaining expanded words and their corresponding industries in the preset database.

Optionally, the determination module is further configured to determine the probability value, output by the preset violation word expansion model, of each expanded word with respect to the preset root word; determine the current edit distance between each expanded word and the preset root word; and determine the similarity score between each expanded word and the preset root word according to the probability value and the current edit distance.

In addition, to achieve the above purpose, the present invention also provides violation word expansion equipment, comprising: a memory, a processor, and a violation word expansion program stored in the memory and executable on the processor, the violation word expansion program being configured to implement the violation word expansion method described above.

In addition, to achieve the above purpose, the present invention also provides a storage medium on which a violation word expansion program is stored; when executed by a processor, the violation word expansion program implements the violation word expansion method described above.

The present invention generates, through a preset violation word expansion model, a number of expanded words for each preset root word; determines a similarity score between each expanded word and the preset root word; determines an industry initial score for each expanded word in multiple industries based on a preset industry classification model; determines a corresponding weight according to the expansion type of the preset root word; determines an industry final score for each expanded word according to the similarity score, the industry initial score and the weight; and takes each expanded word as a violation word of the corresponding industry according to the industry final score. In this way, preset root words are expanded automatically, and the industry of each expanded word is identified from its similarity and expansion type. This provides data support for judging whether advertisements are illegal or non-compliant, fills the violation lexicon more efficiently, avoids repeated manual searches for violation words in large volumes of text data, and reduces labor costs.

Description of drawings

FIG. 1 is a schematic structural diagram of the violation word expansion equipment in the hardware operating environment involved in the embodiments of the present invention;

FIG. 2 is a schematic flowchart of the first embodiment of the violation word expansion method of the present invention;

FIG. 3 is a schematic flowchart of determining industry scores in an embodiment of the violation word expansion method of the present invention;

FIG. 4 is a schematic flowchart of the second embodiment of the violation word expansion method of the present invention;

FIG. 5 is a schematic flowchart of the third embodiment of the violation word expansion method of the present invention;

FIG. 6 is a schematic diagram of the expansion process in an embodiment of the violation word expansion method of the present invention;

FIG. 7 is a structural block diagram of the first embodiment of the violation word expansion device of the present invention.

Detailed description

It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it. The realization of the purpose, the functional characteristics and the advantages of the present invention are further described below in conjunction with the embodiments and with reference to the accompanying drawings.

Referring to FIG. 1, FIG. 1 is a schematic structural diagram of the violation word expansion equipment in the hardware operating environment involved in the embodiments of the present invention.

As shown in FIG. 1, the violation word expansion equipment may include a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 provides connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include standard wired and wireless interfaces. Optionally, the network interface 1004 may include standard wired and wireless interfaces (such as a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. Optionally, the memory 1005 may also be a storage device independent of the processor 1001.

Those skilled in the art will understand that the structure shown in FIG. 1 does not limit the violation word expansion equipment, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

As shown in FIG. 1, the memory 1005, as a storage medium, may include an operating system, a network communication module, a user interface module, and a violation word expansion program.

In the violation word expansion equipment shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with the user. The processor 1001 and the memory 1005 may be provided in the violation word expansion equipment, which calls the violation word expansion program stored in the memory 1005 through the processor 1001 and executes the violation word expansion method provided by the embodiments of the present invention.

An embodiment of the present invention provides a violation word expansion method. Referring to FIG. 2, FIG. 2 is a schematic flowchart of the first embodiment of the violation word expansion method of the present invention.

In this embodiment, the violation word expansion method comprises the following steps:

Step S10: Generate, through a preset violation word expansion model, a number of expanded words for each preset root word.

It can be understood that the execution subject of this embodiment is the violation word expansion equipment, which may be a computer, a server, or another device with the same or similar functions; this embodiment places no limitation on this.

It should be noted that the preset violation word expansion model may be a Word2Vec model or a BERT model, and the expanded words related to a preset root word are recalled through the Word2Vec model and/or the BERT model. In a specific implementation, to further improve the efficiency of expanding the violation lexicon, there may be multiple preset root words: the multiple preset root words are input into the preset violation word expansion model to generate a large number of expanded words. The preset root words can be obtained from a preset database, which stores a large number of existing violation words.
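The recall in step S10 can be illustrated with a nearest-neighbour lookup over word vectors. The following is a minimal sketch only: the embedding table `EMBEDDINGS`, its values, and the function names are hypothetical stand-ins for a trained Word2Vec model, and cosine similarity is an assumed ranking criterion that the description does not specify.

```python
import math

# Toy embedding table standing in for a trained Word2Vec model.
# All words and vector values here are hypothetical.
EMBEDDINGS = {
    "free": [0.9, 0.1, 0.0],
    "gratis": [0.85, 0.15, 0.05],
    "no-cost": [0.8, 0.2, 0.1],
    "guaranteed": [0.1, 0.9, 0.2],
    "potato": [0.0, 0.1, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_candidates(root, embeddings, top_k=2):
    """Return the top_k words closest to the root word, with their scores."""
    root_vec = embeddings[root]
    scored = [
        (word, cosine(root_vec, vec))
        for word, vec in embeddings.items()
        if word != root
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for word, score in recall_candidates("free", EMBEDDINGS):
        print(word, round(score, 3))
```

Running the same lookup for several root words and pooling the results would give the larger candidate set the description mentions.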

Step S20: Determine the similarity score between each expanded word and the preset root word.

It can be understood that the similarity score can be determined from the edit distance between each expanded word and the preset root word: the smaller the edit distance between the expanded word and the preset root word, the lower the similarity score. In a specific implementation, the output probability of the preset violation word expansion model is also taken into account, and the similarity score is determined from the output probability and the edit distance: the higher the output probability and the larger the edit distance, the higher the similarity score.
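One way to combine the two signals in step S20 is sketched below. The passage above only fixes the direction of each effect (a higher output probability and a larger edit distance both raise the score); the linear form, the length normalisation, and the parameter name `alpha` (a stand-in for the model-dependent control parameter) are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity_score(root: str, word: str, prob: float, alpha: float = 0.5) -> float:
    """Assumed form: alpha * prob + (1 - alpha) * normalised edit distance.

    Only the monotonic behaviour comes from the description; the formula
    itself is hypothetical.
    """
    longest = max(len(root), len(word)) or 1
    norm_dist = edit_distance(root, word) / longest
    return alpha * prob + (1 - alpha) * norm_dist
```

Under this assumed form, an expanded word that the model ranks highly but that differs visibly from the root word scores highest, which matches the stated preference for expansions that are not near-copies of the root.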

Step S30: Determine an industry initial score for each expanded word in multiple industries based on a preset industry classification model.

It should be noted that the preset industry classification model may be a lightweight fastText model, which classifies each expanded word by industry and determines the industry initial score of each expanded word in each preset industry.

Step S40: Determine the corresponding weight according to the expansion type of the preset root word.

It can be understood that, in this embodiment, violation words fall into three categories: A-type words, B-type words and C-type words. A-type words are words that are highly illegal and frequently triggered; B-type words are words that are highly illegal but rarely triggered; C-type words are new words that come from customer requirements relayed by the business departments or that the business departments have collected. Different types of violation words are expanded in different ways. Because A-type words are highly illegal and frequently triggered, the A expansion type generalizes an A-type word from its own industry to every other industry. For example, the entry "buy a house and get free help with household registration" belongs to the "real estate" industry, and "free help with household registration" is a violation word in that industry. Generalizing that word yields "free help getting into a key school", which carries it from "real estate" across to the "education and training" industry and produces a violation word for "education and training". B-type words may be highly illegal but are rarely triggered within their industry, so the B expansion type continues to expand the word within its own industry. In the example above, the violation word belongs to "real estate"; if it is rarely triggered, continuing to expand it within "real estate" enriches that industry's violation words and raises the trigger rate. For instance, the expansion "household registration handled at no cost" is still a violation word of the "real estate" industry. For C-type words, the C expansion type aims to enrich the violation lexicon and so increase the number of violation matches. For example, "100%" is an advertising violation word in almost every industry, so expansions that carry the same sense of absolute assurance, such as "definitely" and "sure to be achieved", are all violation words.

It should be noted that, in this embodiment, different industry weights are set according to the expansion type of the preset root word. When expanding with the A expansion type, the weight of the root word's own industry is far smaller than the weights of other industries; when expanding with the B expansion type, the weight of the root word's own industry is far larger than the weights of other industries; and when expanding with the C expansion type, the weights of all industries are similar.
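The weighting rule in step S40 can be sketched as a small lookup. The numeric values below are hypothetical; the description only states the ordering (own industry far below the others for A, far above for B, roughly equal for C), and the function name `industry_weights` is introduced here for illustration.

```python
# Hypothetical (own-industry weight, other-industry weight) pairs.
# Only their relative ordering is taken from the description.
WEIGHTS_BY_TYPE = {
    "A": (0.1, 1.0),  # generalise away from the root word's own industry
    "B": (1.0, 0.1),  # keep expanding inside the root word's own industry
    "C": (1.0, 1.0),  # treat all industries alike
}


def industry_weights(expansion_type, own_industry, industries):
    """Return a per-industry weight for one preset root word."""
    if expansion_type not in WEIGHTS_BY_TYPE:
        raise ValueError(f"unknown expansion type: {expansion_type!r}")
    own, other = WEIGHTS_BY_TYPE[expansion_type]
    return {ind: (own if ind == own_industry else other) for ind in industries}
```

For a root word from "real estate" expanded with type A, this gives "real estate" the small weight and every other industry the large one, so cross-industry assignments are favoured.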

Step S50: Determine the industry final score of each expanded word according to the similarity score, the industry initial score and the weight.

It should be understood that, with the similarity score denoted sim_score, the industry initial score denoted F(q, q_i, cate), and the weights denoted γ_i and δ_i, the industry final score of each expanded word is determined by formula (1):

$$\mathrm{final\_score} = \gamma_i \cdot \mathrm{sim\_score} + \delta_i \cdot F(q, q_i, \mathrm{cate}) \tag{1}$$

Step S60: Take each expanded word as a violation word of the corresponding industry according to the industry final score.

It should be noted that this embodiment uses the expansion type and the similarity score to constrain the industry final score, and so determines the industry final score of each expanded word in each industry. The industry final scores of each expanded word are ranked, the highest score and its industry are identified, and each expanded word is bound to that industry, giving the violation words obtained by expansion in each industry.
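Steps S50 and S60 can be sketched together as a weighted sum followed by an arg-max over industries. This is an illustrative sketch: treating γ as a single scalar and δ as a per-industry weight is an assumption, since the description does not say what the subscript i ranges over, and the function names are hypothetical.

```python
def final_scores(sim_score, initial_scores, gamma, delta):
    """Apply final = gamma * sim_score + delta[ind] * F for every industry.

    initial_scores and delta are dicts keyed by industry name; gamma is a
    scalar. This split of the weights is an assumption.
    """
    return {
        ind: gamma * sim_score + delta[ind] * f
        for ind, f in initial_scores.items()
    }


def assign_industry(sim_score, initial_scores, gamma, delta):
    """Bind an expanded word to the industry with the highest final score."""
    scores = final_scores(sim_score, initial_scores, gamma, delta)
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Hypothetical numbers for an A-type root word from "real_estate".
    industry, score = assign_industry(
        sim_score=0.6,
        initial_scores={"real_estate": 0.7, "education": 0.2},
        gamma=0.5,
        delta={"real_estate": 0.1, "education": 1.0},
    )
    print(industry, score)
```

With a single scalar γ, the similarity term shifts every industry by the same amount, so under this assumption it changes the magnitude of the final score but not which industry ranks first.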

An example is described with reference to FIG. 3, which is a schematic flowchart of determining industry scores in an embodiment of the violation word expansion method of the present invention. The preset root word is a violation root word q selected from the preset database; its word category is A-type and it belongs to industry 1. The preset violation word expansion model generates the expanded words for the preset root word, giving a candidate set that contains the expanded words q1, q2, q3, …. The similarity scores s1, s2, s3, … between each expanded word and the preset root word q are determined. Based on the preset industry classification model, the industry initial scores f1, f2, f3, … of each expanded word in industry 1 through industry n are determined. The weights, that is, the coefficients, are determined from the expansion type corresponding to A-type words. The industry final scores fs1, fs2, fs3, … of each expanded word are then determined from the similarity scores s1, s2, s3, …, the coefficients, and the industry initial scores f1, f2, f3, …, and each expanded word is taken as a violation word of the corresponding industry based on the industry final scores. For example, the expanded words assigned to industry 1 are q3, q5, q1, ….

This embodiment generates, through a preset violation word expansion model, a number of expanded words for each preset root word; determines a similarity score between each expanded word and the preset root word; determines an industry initial score for each expanded word in multiple industries based on a preset industry classification model; determines a corresponding weight according to the expansion type of the preset root word; determines an industry final score for each expanded word according to the similarity score, the industry initial score and the weight; and takes each expanded word as a violation word of the corresponding industry according to the industry final score. In this way, preset root words are expanded automatically, and the industry of each expanded word is identified from its similarity and expansion type. This provides data support for judging whether advertisements are illegal or non-compliant, fills the violation lexicon more efficiently, avoids repeated manual searches for violation words in large volumes of text data, and reduces labor costs.

参考图4，图4为本发明违规词拓展方法第二实施例的流程示意图。Referring to FIG. 4 , FIG. 4 is a schematic flowchart of a second embodiment of the method for expanding illegal words according to the present invention.

基于上述第一实施例，本实施例违规词拓展方法的所述步骤S10，包括：Based on the above-mentioned first embodiment, the step S10 of the method for expanding the violation word in this embodiment includes:

步骤S101：通过预设违规词拓展模型获取预设词根对应的若干相关词。Step S101: Obtain a number of related words corresponding to the preset word roots through the preset violation word expansion model.

步骤S102：根据预设数据库对所述若干相关词进行去重处理，得到若干拓展词。Step S102: Perform deduplication processing on the several related words according to the preset database to obtain several expanded words.

It can be understood that the preset database stores a large number of existing violation words; related words that are identical to existing violation words are deleted, and the remaining words are the expanded words.

Further, step S102 includes: determining several current words from the preset database; determining the edit distance between each current word and each related word; and deduplicating the related words according to the edit distance to obtain several expanded words.

Further, deduplicating the related words according to the edit distance to obtain several expanded words includes: deleting a target related word when the target edit distance corresponding to that word is less than a preset distance threshold; and taking the remaining related words as the expanded words.

It should be noted that, to further reduce duplicated related words and avoid wasting computing resources on industry classification of duplicates, this embodiment deduplicates the related words by determining the edit distance between each related word and the existing violation words, obtaining several expanded words.

It can be understood that related words whose edit distance is less than the preset distance threshold are deleted. The preset distance threshold can be set according to business needs and is used to decide whether a related word is highly similar to an existing violation word. Deduplication by edit distance removes related words whose strings are nearly identical to existing ones. For example, if "土豆是什么" ("what is a potato") is expanded into "土豆是干什么" ("what is a potato for"), the two are highly similar, so the expanded related word is removed.
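The edit-distance deduplication of step S102 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the edit distance is the character-level Levenshtein distance, and the function names `edit_distance` and `deduplicate` and the threshold value used in the usage note are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


def deduplicate(related_words, current_words, distance_threshold):
    """Drop related words that are too close to any existing violation word.

    A related word is deleted when its smallest edit distance to a current
    word is below the threshold; the remaining words are the expanded words.
    """
    expanded = []
    for word in related_words:
        # With an empty database nothing can be a duplicate, so fall back to
        # the threshold itself, which is never "less than the threshold".
        nearest = min((edit_distance(word, cur) for cur in current_words),
                      default=distance_threshold)
        if nearest >= distance_threshold:
            expanded.append(word)
    return expanded
```

With an assumed threshold of 2, `deduplicate(["土豆是干什么"], ["土豆是什么"], 2)` removes the related word, because the two strings differ by a single inserted character.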

Further, after step S60, the method further includes: storing each expanded word and its corresponding industry in the preset database.

It should be noted that the expanded words and their corresponding industries are stored in the preset database, which further supplements the existing violation word library. When violation words are expanded, the preset root words are obtained from the preset database. Specifically, preset root words may be selected from each industry according to the user's choice and business needs; the user labels each selected root word with an expansion type, and the violation words are expanded in the manner corresponding to that expansion type. In addition, when the expanded words are stored, reviewers may be prompted to label each expanded word with its word category, that is, to confirm whether it is a class A, class B, or class C word. Each expanded word, its corresponding industry, and its word category are then stored in the preset database. When violation words are expanded, the preset root words and their word categories are obtained from the preset database, and expansion is performed according to the expansion type corresponding to each word category. In a specific implementation, the expanded words may also be classified directly by the expansion type of the preset root word, that is, expanded words obtained from a class A word are labeled as class A words.

Further, after step S60, the method further includes: obtaining a deletion instruction input by the user, and deleting the corresponding target expanded word according to the deletion instruction; and storing the remaining expanded words and their corresponding industries in the preset database.

It should be noted that this embodiment provides a manual review process: after the expanded words are obtained, manually entered deletion instructions are received, which further avoids duplicated violation words and ensures the accuracy of the violation word library.
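The review-then-store flow described above can be sketched as follows. This is an illustration only: the in-memory dictionary stands in for the preset database, the deletion instruction is assumed to be a list of words to delete, and the names `ExpandedWord`, `apply_review`, and `store` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ExpandedWord:
    word: str
    industry: str
    category: str  # word category label: "A", "B" or "C"


def apply_review(expanded: Iterable[ExpandedWord],
                 delete_instructions: Iterable[str]) -> List[ExpandedWord]:
    """Remove the target expanded words named by the reviewer."""
    to_delete = set(delete_instructions)
    return [item for item in expanded if item.word not in to_delete]


def store(database: Dict[str, dict],
          reviewed: Iterable[ExpandedWord]) -> Dict[str, dict]:
    """Write each remaining word with its industry and category."""
    for item in reviewed:
        database[item.word] = {"industry": item.industry,
                               "category": item.category}
    return database
```

Keeping the review step separate from the storage step means the same `store` call serves both variants in the text: storing every expanded word directly, or storing only the words that survive manual deletion.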

In this embodiment, several related words corresponding to the preset root word are obtained through the preset violation word expansion model; the related words are deduplicated against the preset database to obtain several expanded words; the similarity score between each expanded word and the preset root word is determined; the initial industry score of each expanded word in multiple industries is determined based on the preset industry classification model; the corresponding weight is determined according to the expansion type of the preset root word; the final industry score of each expanded word is determined according to the similarity score, the initial industry score, and the weight; and each expanded word is taken as a violation word of the corresponding industry according to the final industry score. In this way, the preset root words are expanded automatically and deduplicated against the preset database, which avoids duplicated violation words and further improves the efficiency of filling the violation word library. The industry of each expanded word is distinguished according to its similarity and expansion type, which provides data support for judging whether advertisements violate laws and regulations, avoids repeated manual searching for violation words in large amounts of text data, and reduces labor cost.

Referring to FIG. 5, FIG. 5 is a schematic flowchart of the third embodiment of the violation word expansion method of the present invention.

Based on the first embodiment above, step S20 of the violation word expansion method in this embodiment includes:

Step S201: Determine the probability value, output by the preset violation word expansion model, of each expanded word with respect to the preset root word.

Step S202: Determine the current edit distance between each expanded word and the preset root word.

Step S203: Determine the similarity score between each expanded word and the preset root word according to the probability value and the current edit distance.

It can be understood that, letting the probability value output by the model be M(q, q_i) and the current edit distance be L(q, q_i), where q is the input preset root word and q_i is an expanded word recalled by the model, the similarity score between each expanded word and the preset root word is determined according to formula (2):

sim_score = αM(q, q_i) + βL(q, q_i)    (2)

Further, the preset violation word expansion model includes a preset Word2Vec model and a preset Bert model.

Specifically, step S203 includes: determining the corresponding control parameters according to the preset violation word expansion model that generated each expanded word; and determining the similarity score between each expanded word and the preset root word according to the control parameters, the probability value, and the current edit distance.

It should be noted that the preset Word2Vec model and the preset Bert model differ, so the control parameters α and β corresponding to each model are stored in a preset storage area in advance. The α and β to use are determined by which model generated each expanded word, and the similarity score between each expanded word and the preset root word is determined by formula (2) from the control parameters α and β, the probability value M(q, q_i), and the current edit distance L(q, q_i).
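Formula (2) with per-model control parameters can be sketched as follows. The numeric values of α and β below are placeholders, not values from this document, and the dictionary `CONTROL_PARAMS` and the function `similarity_score` are hypothetical names. The document does not state the sign of β; the negative values here are an assumption chosen so that a larger edit distance lowers the score.

```python
# Hypothetical control parameters (alpha, beta), stored per generating model.
# The values are placeholders for illustration only.
CONTROL_PARAMS = {
    "word2vec": (1.0, -0.1),
    "bert": (0.8, -0.05),
}


def similarity_score(model_name: str, probability: float, edit_dist: int) -> float:
    """sim_score = alpha * M(q, q_i) + beta * L(q, q_i), as in formula (2).

    model_name  -- which expansion model produced the expanded word q_i
    probability -- M(q, q_i), the probability value output by that model
    edit_dist   -- L(q, q_i), the edit distance between q and q_i
    """
    alpha, beta = CONTROL_PARAMS[model_name]
    return alpha * probability + beta * edit_dist
```

Looking the parameters up by model name keeps the scoring function identical for both models, so scores from Word2Vec and Bert can be placed on a comparable scale by tuning α and β rather than by branching in code.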

Further, before step S10, the method further includes: training an initial Word2Vec model on text data to obtain the trained preset Word2Vec model; and training an initial Bert model on the text data and fine-tuning the trained Bert model on a text classification task to obtain the preset Bert model.

It can be understood that, before violation words are expanded, the Word2Vec model is trained without supervision on a large amount of text data, and the Bert model is fine-tuned by text classification on top of the pre-trained Bert model, yielding the preset Word2Vec model and the preset Bert model.

An example is given with reference to FIG. 6, which is a schematic diagram of the expansion flow of an embodiment of the violation word expansion method of the present invention. Class A, B, and C violation root words are input into the violation word expansion model trained on a large amount of text data, producing several approximate expanded words; the violation word expansion model includes the Word2Vec model and the Bert model. The approximate expanded words are deduplicated against the existing violation word library, each expanded word is classified by a fastText model to determine its initial industry score, and the industry of each expanded word is determined based on the expansion category of the violation word. The expanded violation words are then manually screened to confirm whether they are actual violations and to label their categories, and are stored in the existing violation word library.

In this embodiment, several expanded words corresponding to each preset root word are generated through the preset violation word expansion model; the probability value, output by the model, of each expanded word with respect to the preset root word is determined; the current edit distance between each expanded word and the preset root word is determined; the similarity score between each expanded word and the preset root word is determined according to the probability value and the current edit distance; the initial industry score of each expanded word in multiple industries is determined based on the preset industry classification model; the corresponding weight is determined according to the expansion type of the preset root word; the final industry score of each expanded word is determined according to the similarity score, the initial industry score, and the weight; and each expanded word is taken as a violation word of the corresponding industry according to the final industry score. In this way, the preset root words are expanded automatically, the similarity is determined from the model's output probability value and the edit distance, and the industry of each expanded word is distinguished according to its similarity and expansion type. Constraining the industry score by similarity and expansion type improves the accuracy of industry assignment, provides data support for judging whether advertisements violate laws and regulations, improves the efficiency of filling the violation word library, avoids repeated manual searching for violation words in large amounts of text data, and reduces labor cost.

In addition, an embodiment of the present invention further provides a storage medium storing a violation word expansion program which, when executed by a processor, implements the violation word expansion method described above.

Because the storage medium adopts all the technical solutions of all the above embodiments, it has at least all the beneficial effects brought by those technical solutions, which are not repeated here.

Referring to FIG. 7, FIG. 7 is a structural block diagram of the first embodiment of the violation word expansion device of the present invention.

As shown in FIG. 7, the violation word expansion device provided by this embodiment of the present invention includes:

A generating module 10, configured to generate several expanded words corresponding to each preset root word through the preset violation word expansion model.

A determining module 20, configured to determine the similarity score between each expanded word and the preset root word.

An industry classification module 30, configured to determine the initial industry score of each expanded word in multiple industries based on the preset industry classification model.

The determining module 20 is further configured to determine the corresponding weight according to the expansion type of the preset root word.

The industry classification module 30 is further configured to determine the final industry score of each expanded word according to the similarity score, the initial industry score, and the weight.

An expansion module 40, configured to take each expanded word as a violation word of the corresponding industry according to the final industry score.
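The division of work among modules 10 to 40 can be sketched as a class that receives each module as an injected function. This is a structural illustration under stated assumptions: the class name `ViolationWordExpander` and all parameter names are hypothetical, the rule that combines similarity score, initial industry score, and weight is not reproduced in this passage and is therefore passed in as the `combine` callable, and assigning each word to the industry with the highest final score is an assumption about how the final score is used.

```python
from typing import Callable, Dict, List


class ViolationWordExpander:
    """Wires the four modules together; each module is an injected callable."""

    def __init__(
        self,
        generate: Callable[[str], List[str]],                # generating module 10
        similarity: Callable[[str, str], float],             # determining module 20
        type_weight: Callable[[str], float],                 # determining module 20
        industry_scores: Callable[[str], Dict[str, float]],  # classification module 30
        combine: Callable[[float, float, float], float],     # hypothetical final-score rule
    ) -> None:
        self.generate = generate
        self.similarity = similarity
        self.type_weight = type_weight
        self.industry_scores = industry_scores
        self.combine = combine

    def expand(self, root: str, expansion_type: str) -> Dict[str, str]:
        """Return a mapping from each expanded word to its assigned industry."""
        weight = self.type_weight(expansion_type)
        assigned: Dict[str, str] = {}
        for word in self.generate(root):
            sim = self.similarity(root, word)
            final = {
                industry: self.combine(sim, initial, weight)
                for industry, initial in self.industry_scores(word).items()
            }
            # Expansion module 40. Assumption: pick the industry whose
            # final score is highest.
            assigned[word] = max(final, key=final.get)
        return assigned
```

Injecting the modules as callables mirrors the statement that the device may be configured as needed: the expansion model, the classifier, and the scoring rule can each be replaced without changing the control flow.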

It should be understood that the above is only an example and does not limit the technical solution of the present invention in any way. In specific applications, those skilled in the art can configure it as needed, and the present invention places no limitation on this.

It should be noted that the workflow described above is only illustrative and does not limit the protection scope of the present invention. In practical applications, those skilled in the art can select some or all of it according to actual needs to achieve the purpose of the solution of this embodiment, and no limitation is imposed here.

In addition, for technical details not described in detail in this embodiment, refer to the violation word expansion method provided by any embodiment of the present invention; they are not repeated here.

In one embodiment, the generating module 10 is further configured to obtain several related words corresponding to the preset root word through the preset violation word expansion model, and to deduplicate the related words against the preset database to obtain several expanded words.

In one embodiment, the generating module 10 is further configured to determine several current words from the preset database; determine the edit distance between each current word and each related word; and deduplicate the related words according to the edit distance to obtain several expanded words.

In one embodiment, the generating module 10 is further configured to delete a target related word when the target edit distance corresponding to that word is less than a preset distance threshold, and to take the remaining related words as the expanded words.

In one embodiment, the violation word expansion device further includes a storage module.

In one embodiment, the determining module 20 is further configured to determine the probability value, output by the preset violation word expansion model, of each expanded word with respect to the preset root word; determine the current edit distance between each expanded word and the preset root word; and determine the similarity score between each expanded word and the preset root word according to the probability value and the current edit distance.

In one embodiment, the preset violation word expansion model includes a preset Word2Vec model and a preset Bert model.

In one embodiment, the determining module 20 is further configured to determine the corresponding control parameters according to the preset violation word expansion model that generated each expanded word, and to determine the similarity score between each expanded word and the preset root word according to the control parameters, the probability value, and the current edit distance.

The storage module is configured to train an initial Word2Vec model on text data to obtain the trained preset Word2Vec model, and to train an initial Bert model on the text data and fine-tune the trained Bert model on a text classification task to obtain the preset Bert model.

Furthermore, it should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that includes that element.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as a read-only memory (ROM)/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the patent protection scope of the present invention.

The present invention discloses A1, a violation word expansion method, the violation word expansion method including:

A2. The violation word expansion method according to A1, wherein generating several expanded words corresponding to each preset root word through the preset violation word expansion model includes:

A3. The violation word expansion method according to A2, wherein deduplicating the related words against the preset database to obtain several expanded words includes:

A4. The violation word expansion method according to A3, wherein deduplicating the related words according to the edit distance to obtain several expanded words includes:

A5. The violation word expansion method according to A2, wherein after each expanded word is taken as a violation word of the corresponding industry according to the final industry score, the method further includes:

A6. The violation word expansion method according to A2, wherein after each expanded word is taken as a violation word of the corresponding industry according to the final industry score, the method further includes:

A7. The violation word expansion method according to A1, wherein determining the similarity score between each expanded word and the preset root word includes:

A8. The violation word expansion method according to A7, wherein the preset violation word expansion model includes a preset Word2Vec model and a preset Bert model.

A9. The violation word expansion method according to A8, wherein determining the similarity score between each expanded word and the preset root word according to the probability value and the current edit distance includes:

A10. The violation word expansion method according to A8, wherein before several expanded words corresponding to each preset root word are generated through the preset violation word expansion model, the method further includes:

The present invention also discloses B11, a violation word expansion device, the violation word expansion device including:

B12. The violation word expansion device according to B11, wherein the generating module is further configured to obtain several related words corresponding to the preset root word through the preset violation word expansion model, and to deduplicate the related words against the preset database to obtain several expanded words.

B13. The violation word expansion device according to B12, wherein the generating module is further configured to determine several current words from the preset database; determine the edit distance between each current word and each related word; and deduplicate the related words according to the edit distance to obtain several expanded words.

B14. The violation word expansion device according to B13, wherein the generating module is further configured to delete a target related word when the target edit distance corresponding to that word is less than a preset distance threshold, and to take the remaining related words as the expanded words.

B15. The violation word expansion device according to B12, wherein the violation word expansion device further includes a storage module;

B16. The violation word expansion device according to B12, wherein the violation word expansion device further includes a storage module;

B17. The violation word expansion device according to B11, wherein the determining module is further configured to determine the probability value, output by the preset violation word expansion model, of each expanded word with respect to the preset root word; determine the current edit distance between each expanded word and the preset root word; and determine the similarity score between each expanded word and the preset root word according to the probability value and the current edit distance.

B18. The violation word expansion device according to B17, wherein the preset violation word expansion model includes a preset Word2Vec model and a preset Bert model.

The present invention also discloses C19, a violation word expansion equipment, the equipment including: a memory, a processor, and a violation word expansion program stored in the memory and executable on the processor, the violation word expansion program being configured to implement the violation word expansion method according to any one of A1 to A10.

The present invention also discloses D20, a storage medium storing a violation word expansion program which, when executed by a processor, implements the violation word expansion method according to any one of A1 to A10.

Claims

1. A method for expanding illegal words is characterized by comprising the following steps:

generating a plurality of expansion words respectively corresponding to preset word roots through a preset illegal word expansion model;

determining similarity scores corresponding to the expansion words and the preset root words;

determining industry initial scores of the expansion words in multiple industries based on a preset industry classification model;

determining corresponding weight according to the expansion type corresponding to the preset root word;

determining an industry final score corresponding to each expansion word according to the similarity score, the industry initial score and the weight;

and taking each expansion word as a violation word of the corresponding industry according to the final industry score.

2. The method for expanding illegal words according to claim 1, wherein generating the plurality of expansion words respectively corresponding to the preset root words through the preset illegal word expansion model comprises:

acquiring a plurality of related words corresponding to a preset root word through the preset illegal word expansion model;

and carrying out duplicate removal processing on the plurality of related words according to a preset database to obtain a plurality of expansion words.

3. The method for expanding illegal words according to claim 2, wherein the step of performing deduplication processing on the plurality of related words according to a preset database to obtain a plurality of expanded words comprises the steps of:

determining a plurality of current words from a preset database;

respectively determining editing distances between the current words and the related words;

and carrying out duplication elimination processing on the plurality of related words according to the editing distance to obtain a plurality of expansion words.

4. The method for expanding illegal words according to claim 3, wherein carrying out duplication elimination processing on the plurality of related words according to the editing distance to obtain the plurality of expansion words comprises:

deleting the target related words when the target editing distance corresponding to the target related words is smaller than a preset distance threshold;

and taking the rest related words as a plurality of expansion words.

5. The method for expanding illegal words according to claim 2, wherein after each expansion word is used as the illegal word of the corresponding industry according to the industry final score, the method further comprises:

and storing each expansion word and the corresponding industry into the preset database.

6. The method for expanding illegal words according to claim 2, wherein after each expanded word is used as the illegal word of the corresponding industry according to the industry final score, the method further comprises:

acquiring a deleting instruction input by a user, and deleting the corresponding target expansion word according to the deleting instruction;

and storing the remaining expansion words and the corresponding industries in the preset database.

7. The method for expanding illegal words according to claim 1, wherein determining the similarity score of each expansion word corresponding to the preset root word comprises:

determining, for each expansion word output by the preset illegal word expansion model, a probability value corresponding to the preset root word;

respectively determining the current editing distance between each expansion word and the preset root;

and determining the similarity score of each expansion word corresponding to the preset root word according to the probability value and the current editing distance.

8. An illegal word expansion device, characterized in that the illegal word expansion device comprises:

the generating module is used for generating a plurality of expansion words corresponding to the preset root respectively through the preset illegal word expansion model;

the determining module is used for determining similarity scores corresponding to the expansion words and the preset root;

the industry classification module is used for determining industry initial scores of the expansion words in multiple industries based on a preset industry classification model;

the determining module is further configured to determine a corresponding weight according to the expansion type corresponding to the preset root word;

the industry classification module is further used for determining industry final scores corresponding to the expansion words according to the similarity scores, the industry initial scores and the weights;

and the expansion module is used for taking each expansion word as the violation word of the corresponding industry according to the industry final score.

9. Illegal word expansion equipment, characterized in that the equipment comprises: a memory, a processor, and an illegal word expansion program stored on the memory and executable on the processor, the illegal word expansion program being configured to implement the illegal word expansion method of any one of claims 1-7.

10. A storage medium having stored thereon an illegal word expansion program that, when executed by a processor, implements an illegal word expansion method according to any one of claims 1 to 7.