CN114329112A

CN114329112A - Content review method, device, electronic device and storage medium

Info

Publication number: CN114329112A
Application number: CN202111599306.XA
Authority: CN
Inventors: 吴俊清; 刘芳彤; 汪一鸣
Original assignee: Xinao Xinzhi Technology Co ltd
Current assignee: Xinao Xinzhi Technology Co ltd
Priority date: 2021-12-24
Filing date: 2021-12-24
Publication date: 2022-04-12

Abstract

The application relates to the technical field of natural language processing, and in particular to a content review method, a device, an electronic device and a storage medium. The method comprises the following steps: searching and matching the text to be reviewed at character-level granularity to obtain a search matching result; dividing the text to be reviewed at the semantic level using a word segmentation method to obtain word-level text, and comparing the word-level text with the search matching result for sensitive words to obtain a comparison result; and performing fusion based on the comparison result, calculating the confidence of the fused sensitive words, and obtaining the content review result from the sensitive-word confidence. This addresses the problems in the related art that reviewing text content manually results in low review efficiency, large errors and increased review cost.

Description

Content review method, device, electronic device and storage medium

Technical Field

The present application relates to the technical field of natural language processing, and in particular to a content review method, device, electronic device and storage medium.

Background

With the rapid development of social life, the volume of data on network platforms and social media platforms is growing explosively. How to better supervise and clean up the online world, and provide people with a healthy and peaceful online environment, has become an urgent problem.

In the related art, text content is usually reviewed manually. However, manual review greatly increases the time required and lowers review efficiency; it is also error-prone and costly.

Summary of the Invention

The present application provides a content review method, device, electronic device and storage medium based on character-level and word-level fusion, so as to solve the problems in the related art that reviewing text content manually results in low review efficiency, large errors and increased review cost.

An embodiment of the first aspect of the present application provides a content review method based on character-level and word-level fusion, including the following steps: searching and matching the text to be reviewed at character-level granularity to obtain a search matching result; dividing the text to be reviewed at the semantic level using a word segmentation method to obtain word-level text, and comparing the word-level text with the search matching result for sensitive words to obtain a comparison result; and performing fusion based on the comparison result, calculating the confidence of the fused sensitive words, and obtaining the content review result from the sensitive-word confidence.

Further, the method also includes: performing text preprocessing on an initial text to obtain the text to be reviewed that meets the review conditions.

Further, searching and matching the text to be reviewed at character-level granularity to obtain the search matching result includes: constructing a search tree for search matching; and extracting the common prefix of the character strings of the text to be reviewed, and obtaining the search matching result from the search tree based on the common prefix.

Further, performing fusion based on the comparison result and calculating the confidence of the fused sensitive words includes: when the comparison result is that the two are identical and a hit occurs, the sensitive-word confidence is 1; when the comparison result is that the length of the sensitive word in the search matching result is less than the length of the segmented word in the word-level text and a hit occurs, the sensitive-word confidence is 0.2; and when the comparison result is that the segmented word in the word-level text is a proper subset of the sensitive word in the search matching result, and the length of the adjacent segmented words after merging equals the length of the sensitive word, the sensitive-word confidence is 1.

Further, the method also includes: acquiring a user's business requirements; and obtaining corresponding custom sensitive words according to the business requirements, and adding the custom sensitive words to the sensitive-word lexicon used for search matching.

An embodiment of the second aspect of the present application provides a content review device based on character-level and word-level fusion, including: a matching module, configured to search and match the text to be reviewed at character-level granularity to obtain a search matching result; a division module, configured to divide the text to be reviewed at the semantic level using a word segmentation method to obtain word-level text, and to compare the word-level text with the search matching result for sensitive words to obtain a comparison result; and a fusion module, configured to perform fusion based on the comparison result, calculate the confidence of the fused sensitive words, and obtain the content review result from the sensitive-word confidence.

Further, the device also includes: a preprocessing module, configured to perform text preprocessing on an initial text to obtain the text to be reviewed that meets the review conditions; and a management module, configured to acquire a user's business requirements, obtain corresponding custom sensitive words according to the business requirements, and add the custom sensitive words to the sensitive-word lexicon used for search matching.

Further, the matching module is configured to construct a search tree for search matching, extract the common prefix of the character strings of the text to be reviewed, and obtain the search matching result from the search tree based on the common prefix.

Further, the fusion module is configured such that: when the comparison result is that the two are identical and a hit occurs, the sensitive-word confidence is 1; when the comparison result is that the length of the sensitive word in the search matching result is less than the length of the segmented word in the word-level text and a hit occurs, the sensitive-word confidence is 0.2; and when the comparison result is that the segmented word in the word-level text is a proper subset of the sensitive word in the search matching result, and the length of the adjacent segmented words after merging equals the length of the sensitive word, the sensitive-word confidence is 1.

An embodiment of the third aspect of the present application provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the content review method based on character-level and word-level fusion described in the above embodiments.

An embodiment of the fourth aspect of the present application provides a computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the content review method based on character-level and word-level fusion described in the above embodiments.

Accordingly, the present application has at least the following beneficial effects:

Text content can be reviewed automatically based on character-level and word-level fusion over a dynamic lexicon, which effectively reduces the time and effort spent on manual review, improves review efficiency and lowers review cost. The fused result not only preserves character-level accuracy but also adds word-level semantic information, which makes the review result more reasonable and greatly reduces review errors. This solves the problems in the related art that reviewing text content manually results in low review efficiency, large errors and increased review cost.

Additional aspects and advantages of the present application will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the present application.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic flowchart of a content review method based on character-level and word-level fusion according to an embodiment of the present application;

FIG. 2 is a schematic flowchart of a content review method based on character-level and word-level fusion according to one embodiment of the present application;

FIG. 3 is an example diagram of a content review device based on character-level and word-level fusion according to an embodiment of the present application;

FIG. 4 is an example diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Embodiments of the present application are described in detail below, and examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present application and should not be construed as limiting it.

In the related art, text content review methods include the following:

(1) Sensitive words are matched directly against the text to be reviewed; that is, a hit occurs whenever the text contains a word that also appears in the sensitive-word lexicon. However, this approach is limited by the size of the lexicon and ignores semantic information, so the lexicon is inflexible. The review considers only identical strings and ignores the original semantics, which distorts the review result. In addition, when a large lexicon has to be compared, response is often too slow.

(2) The text to be reviewed is converted to vectors, and the similarity between the mapped semantic vector and the mapped sensitive words is computed; text that reaches a specified threshold is treated as non-compliant. However, similarity-based methods depend on large amounts of training data and an accurate trained model, the review result carries a risk of false negatives and false positives, and it is difficult for the trained model to be strongly robust.

To solve the above problems, simplify text review, improve its effectiveness and efficiency, and reduce the time and effort spent on manual review, the embodiments of the present application propose a content review method, device, electronic device and storage medium based on character-level and word-level fusion, so as to reduce review errors and lower the cost of manual review.

The content review method, device, electronic device and storage medium based on character-level and word-level fusion according to the embodiments of the present application are described below with reference to the accompanying drawings. To address the problem noted in the background above, namely that manual review of text content in the related art is inefficient, error-prone and costly, the present application provides a content review method based on character-level and word-level fusion. In this method, text content can be reviewed automatically based on character-level and word-level fusion over a dynamic lexicon, which effectively reduces the time and effort spent on manual review, improves review efficiency and lowers review cost. The fused result not only preserves character-level accuracy but also adds word-level semantic information, which makes the review result more reasonable and greatly reduces review errors. This solves the problems in the related art that reviewing text content manually results in low review efficiency, large errors and increased review cost.

Specifically, FIG. 1 is a schematic flowchart of a content review method based on character-level and word-level fusion provided by an embodiment of the present application.

As shown in FIG. 1, the content review method based on character-level and word-level fusion includes the following steps:

In step S101, the text to be reviewed is searched and matched at character-level granularity to obtain a search matching result.

Here, a character is a glyph-like unit or symbol, including letters, digits, operator symbols, punctuation marks and other symbols, as well as some functional symbols. "Character" is the general term for letters, digits and symbols in electronic computers or radio communications. It is the smallest unit of data access in a data structure, and a character is usually represented by 8 binary bits (one byte). Characters are a binary encoding form frequently used in computers, and the most commonly used form of information in computers.

Coarse granularity means a macroscopic, generalized view; fine granularity means a more microscopic view that focuses on detail.

In this embodiment, searching and matching the text to be reviewed at character-level granularity to obtain the search matching result includes: constructing a search tree for search matching; and extracting the common prefix of the character strings of the text to be reviewed, and obtaining the search matching result from the search tree based on the common prefix.

It can be understood that, in the related art, search speed depends on the size of the sensitive-word lexicon. Storing the lexicon directly in a database not only increases space overhead but also, once the lexicon grows very large, directly slows the response of review matching. The embodiments of the present application can therefore construct a search tree and use the common prefixes of character strings to reduce query time, thereby reducing response time and shortening the review cycle.

In this embodiment, the search tree can be constructed from two lexicons, a basic sensitive-word lexicon and a custom sensitive-word lexicon, which reduces space overhead. Specifically, the sensitive-word lexicon can be organized as a tree, ultimately forming a multi-way tree in which words share common prefixes. This form greatly improves the response speed of review matching and is suitable for scenarios that search large amounts of data. The embodiments of the present application can thus exploit the common-prefix property of the tree model to construct a search tree over the sensitive-word lexicon and use the common prefixes of character strings to reduce query time, which greatly shortens the response time of text review and makes text review more effective and efficient.
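The search tree described above can be sketched as a prefix tree (trie) in Python. This is a minimal illustration under stated assumptions, not the implementation of the application: the names `TrieNode`, `build_search_tree` and `search`, and the choice to report every match starting at every position, are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


class TrieNode:
    """One node of the search tree; children are keyed by a single character."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False  # True if the path from the root spells a sensitive word


def build_search_tree(*lexicons: Iterable[str]) -> TrieNode:
    """Build one tree from several lexicons (e.g. a basic and a custom lexicon)."""
    root = TrieNode()
    for lexicon in lexicons:
        for word in lexicon:
            if not word:
                continue  # ignore empty entries
            node = root
            for ch in word:
                # Words with a common prefix share the same nodes.
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True
    return root


def search(root: TrieNode, text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) for each sensitive word found; end is exclusive."""
    hits: List[Tuple[int, int, str]] = []
    for start in range(len(text)):
        node = root
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break  # no lexicon word continues with this character
            if node.is_word:
                hits.append((start, end + 1, text[start:end + 1]))
    return hits
```

With this layout the cost of a lookup at one position is bounded by the length of the longest lexicon word rather than by the number of words in the lexicon, which is the effect the common-prefix structure is meant to achieve.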

In this embodiment, the method of the embodiments of the present application further includes: acquiring a user's business requirements; and obtaining corresponding custom sensitive words according to the business requirements, and adding the custom sensitive words to the sensitive-word lexicon used for search matching.

It can be understood that a user can tailor a personalized sensitive-word lexicon, i.e., a custom sensitive-word lexicon, to business requirements. This removes the limitation of direct matching to a fixed lexicon and makes text review more flexible. The user can dynamically add to and modify the custom lexicon according to personalized business requirements and thereby supplement the general lexicon, so that the review adapts better to changing business needs. Add, delete, modify and lookup operations on the custom sensitive-word lexicon are automatically synchronized to the tree nodes, which preserves the search efficiency of the search tree and reduces search time.
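One way to keep the tree in step with lexicon edits is sketched below. It assumes the search tree is a prefix tree stored as nested dictionaries; the class `DynamicLexicon`, its method names and the `END` marker are hypothetical and only illustrate how add, delete, modify and lookup operations could each map to a change in tree nodes.

```python
from typing import Dict, List, Tuple

END = "$end"  # hypothetical marker key; never collides with a one-character key


class DynamicLexicon:
    """Custom sensitive-word lexicon whose edits are applied directly to the tree."""

    def __init__(self) -> None:
        self.root: Dict[str, dict] = {}

    def add(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def contains(self, word: str) -> bool:
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return END in node

    def remove(self, word: str) -> bool:
        """Delete a word and prune nodes that no other word still needs."""
        path: List[Tuple[dict, str]] = []
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            path.append((node, ch))
            node = node[ch]
        if END not in node:
            return False
        del node[END]
        for parent, ch in reversed(path):
            if parent[ch]:  # child still carries other words or an end marker
                break
            del parent[ch]
        return True

    def update(self, old: str, new: str) -> bool:
        """Modify an entry by removing the old word and adding the new one."""
        if not self.remove(old):
            return False
        self.add(new)
        return True
```

Pruning on removal keeps the tree from accumulating dead branches, so search cost continues to reflect only the words currently in the lexicon.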

In this embodiment, the method of the embodiments of the present application further includes: performing text preprocessing on an initial text to obtain the text to be reviewed that meets the review conditions.

The preprocessing may include: converting English letters in the text to lowercase, converting traditional Chinese characters in the text to simplified characters, deleting characters from the text, removing common stop words, and the like.
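A minimal Python sketch of such a preprocessing pass follows. Several points are assumptions made for illustration: "deleting characters" is read here as dropping punctuation, symbols, whitespace and control characters; the mapping `T2S` is a tiny hypothetical stand-in for a full traditional-to-simplified conversion table; `STOP_WORDS` is a hypothetical list; and removing stop words by plain string replacement is a simplification that a production system would likely do after word segmentation.

```python
import unicodedata
from typing import Dict, Iterable

# Hypothetical three-entry table; a real system needs a complete conversion table.
T2S: Dict[str, str] = {"審": "审", "內": "内", "詞": "词"}

# Hypothetical stop-word list for illustration only.
STOP_WORDS = ("的", "了")


def preprocess(text: str,
               t2s: Dict[str, str] = T2S,
               stop_words: Iterable[str] = STOP_WORDS) -> str:
    """Normalize an initial text into the text to be reviewed."""
    # 1. Lowercase English letters.
    text = text.lower()
    # 2. Map traditional characters to simplified ones, one character at a time.
    text = "".join(t2s.get(ch, ch) for ch in text)
    # 3. Drop punctuation (P), symbols (S), separators (Z) and control chars (C).
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ("P", "S", "Z", "C")
    )
    # 4. Remove stop words (naive substring removal, see the note above).
    for word in stop_words:
        text = text.replace(word, "")
    return text
```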

The review conditions may be set according to actual review requirements and are not specifically limited here.

In step S102, the text to be reviewed is divided at the semantic level using a word segmentation method to obtain word-level text, and the word-level text is compared with the search matching result for sensitive words to obtain a comparison result.

Here, "word-level unit" is a collective term for words and phrases: it covers words (including simple words and compound words) and word groups (also called phrases), which are the smallest structural units from which sentences and passages are composed.

It can be understood that the embodiments of the present application can divide the text to be reviewed at the semantic level using word segmentation, and then, taking the word as the smallest unit of granularity, compare the segmented text with the direct matching result for sensitive words to obtain the comparison result.

In step S103, fusion is performed based on the comparison result, the confidence of the fused sensitive words is calculated, and the content review result is obtained from the sensitive-word confidence.

It can be understood that, after obtaining the comparison result, the embodiments of the present application can compute the containment relationship between the two and use it as the basis for fusing the review results. The fused result both preserves character-level accuracy and adds word-level semantic information, which makes the review result more reasonable. Scores are then assigned according to business rules, and the fused review result is pushed to the user together with a confidence-based decision.

In this embodiment, performing fusion based on the comparison result and calculating the confidence of the fused sensitive words includes: when the comparison result is that the two are identical and a hit occurs, the sensitive-word confidence is 1; when the comparison result is that the length of the sensitive word in the search matching result is less than the length of the segmented word in the word-level text and a hit occurs, the sensitive-word confidence is 0.2; and when the comparison result is that the segmented word in the word-level text is a proper subset of the sensitive word in the search matching result, and the length of the adjacent segmented words after merging equals the length of the sensitive word, the sensitive-word confidence is 1.

It can be understood that if the matched sensitive word coincides exactly with a segmented word, the confidence of that sensitive word is 1; if the matched sensitive word is a proper subset of a segmented word, or is not completely contained in a segmented word, the confidence of that sensitive word is 0.2. The final confidence of the text is the sum of the confidences of all matched sensitive words. If the text contains no sensitive word, the returned result is "pass". If the final confidence of the text is greater than 0.5, the returned result is "reject"; if the final confidence is greater than 0 and less than 0.5, the returned result is "manual review".
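The decision rule above reduces to a sum and two thresholds, sketched here in Python. The function name `review_decision` and the result strings are hypothetical. The description does not say what happens when the total equals exactly 0.5; this sketch assumes such a text goes to manual review, which is an assumption rather than something the application states.

```python
from typing import Iterable


def review_decision(confidences: Iterable[float]) -> str:
    """Map the per-word confidences of one text to a review outcome."""
    total = sum(confidences)  # final confidence = sum over all matched sensitive words
    if total == 0:
        return "pass"  # no sensitive word was matched
    if total > 0.5:
        return "reject"
    # 0 < total <= 0.5; the boundary value 0.5 is an assumption (see above).
    return "manual_review"
```

For example, a single exact hit (confidence 1) is rejected outright, while one or two partial hits (0.2 or 0.4 in total) are routed to a human reviewer, and three partial hits together exceed the threshold and are rejected.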

With the content review method based on character-level and word-level fusion proposed in the embodiments of the present application, text content can be reviewed automatically based on character-level and word-level fusion over a dynamic lexicon, which effectively reduces the time and effort spent on manual review, improves review efficiency and lowers review cost. The fused result not only preserves character-level accuracy but also adds word-level semantic information, which makes the review result more reasonable and greatly reduces review errors. In addition, the common-prefix property of the tree model can be exploited to construct a search tree over the sensitive-word lexicon, which greatly shortens the response time of text review and improves its effectiveness and efficiency.

The content review method based on character-level and word-level fusion is described below through a specific embodiment. As shown in FIG. 2, it includes the following steps:

1. Text preprocessing

English letters in the text to be reviewed are converted to lowercase; traditional Chinese characters in the text are converted to simplified characters; characters are deleted from the text; common stop words are removed.

2. Text segmentation

Word segmentation is performed on the text to be reviewed, dividing the content into word-level granularity. The segmented text takes the semantic information of the original content into account and can provide semantic support for the subsequent review result.

3. Search method

A search tree is constructed, and the common prefixes of character strings are used to reduce query time, reduce response time and shorten the review cycle.

4. Sensitive word comparison and fusion

Determine whether the sensitive word is contained in a segmented word:

(1) If the check is true and the sensitive word has the same length as the segmented word, i.e., the sensitive word is identical to the segmentation result and is a hit, the fused sensitive-word confidence is 1;

(2) If the check is true and the sensitive word is a proper subset of the segmented word, i.e., the sensitive word is shorter than the segmented word and is a hit, the fused sensitive-word confidence is 0.2;

(3) If the check is false and the adjacent segmented words, once merged, still equal the sensitive word, i.e., each segmented word is a proper subset of the sensitive word but the merged adjacent words have the same length as the matched sensitive word, the fused sensitive-word confidence is 1. This resolves the problem of sensitive words being split apart by word segmentation.
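The three rules can be sketched in Python as a comparison of character offsets. This is an illustration under assumptions, not the method of the application: the segmented words are assumed to concatenate back to the reviewed text without gaps, a hit is given as a hypothetical `(start, end)` pair with an exclusive end, and a hit that straddles word boundaries without lining up with them falls back to 0.2, following the "not completely contained" case mentioned earlier in the description.

```python
from typing import List, Tuple


def token_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Character span (start, end) of each segmented word; end is exclusive."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def hit_confidence(hit: Tuple[int, int], tokens: List[str]) -> float:
    """Fuse one character-level hit with the word-level segmentation."""
    start, end = hit
    spans = token_spans(tokens)
    for s, e in spans:
        if s <= start and end <= e:
            # The sensitive word lies inside a single segmented word.
            # Rule (1): identical span -> 1.0. Rule (2): proper subset -> 0.2.
            return 1.0 if (s, e) == (start, end) else 0.2
    # The sensitive word spans several segmented words.
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    if start in starts and end in ends:
        # Rule (3): adjacent segmented words merge exactly into the sensitive word.
        return 1.0
    # Assumed fallback: the hit crosses a word boundary without aligning to it.
    return 0.2
```

As a usage example, with the segmentation `["敏感", "词库", "管理"]`, a hit on characters 0 to 2 scores 1.0 under rule (1), a hit on characters 0 to 1 scores 0.2 under rule (2), and a hit on characters 0 to 4 scores 1.0 under rule (3) because the first two words merge into it.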

5. Dynamic management of the sensitive-word lexicon

A user can tailor a personalized sensitive-word lexicon to business requirements, which removes the limitation of direct matching to a fixed lexicon and makes text review more flexible. Add, delete, modify and lookup operations on the custom lexicon are all synchronized to the search tree to preserve the efficiency of search matching.

In summary, the content review method of the embodiments of the present application, based on character-level and word-level fusion over a dynamic lexicon, makes the review result more adaptable and completes text review simply and efficiently. It flexibly supports dynamic management of the sensitive-word lexicon and reduces the constraint that the lexicon places on the review result. It retains the effectiveness of character-level review while adding the semantic information obtained from word segmentation, which reduces the impact of hard matching on the final result, improves the reliability of the review result, reduces manual operation and maintenance workload, and helps business applications run stably with high performance. At the same time, because search matching is used, the text content does not need to be mapped into a semantic space, and the sensitive-word lexicon is organized as a multi-way tree model with common prefixes, which shortens review response time.

Next, a content review device based on character-level and word-level fusion according to an embodiment of the present application is described with reference to the accompanying drawings.

FIG. 3 is a schematic block diagram of a content review device based on character-level and word-level fusion according to an embodiment of the present application.

As shown in FIG. 3, the content review device 10 based on character-level and word-level fusion includes: a matching module 100, a division module 200 and a fusion module 300.

其中，匹配模块100用于将待审核的文本按照字符级别的颗粒度进行搜索匹配，得到搜索匹配结果；划分模块200，用于对待审核的文本利用分词方法进行语义层面划分，得到划分后的词语级文本，并对划分后的词语级文本与搜索匹配结果进行敏感词比对，得到比对结果；融合模块300用于基于比对结果进行融合，并计算融合后的敏感词置信度，由敏感词置信度得到内容审核结果。The matching module 100 is used to search and match the text to be reviewed according to the granularity of the character level to obtain a search matching result; the dividing module 200 is used to divide the text to be reviewed at the semantic level by using the word segmentation method to obtain the divided words level text, and compares the divided word level text with the search matching results for sensitive words to obtain a comparison result; the fusion module 300 is used to fuse based on the comparison results, and calculate the confidence of the fused sensitive words, which is determined by the sensitivity The word confidence gets the content audit result.

Further, the apparatus 10 of the embodiment of the present application also includes a preprocessing module and a management module.

The preprocessing module is configured to preprocess the initial text to obtain text to be reviewed that meets the review conditions. The management module is configured to obtain the user's business requirements, derive the corresponding custom sensitive words from those requirements, and add the custom sensitive words to the sensitive-word lexicon used for search matching.
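The two auxiliary modules can be illustrated with a minimal Python sketch. The source does not say which preprocessing steps are applied or how the lexicon is stored, so everything here is an assumption made for illustration: the names `preprocess` and `SensitiveLexicon`, the choice of Unicode NFKC normalization, lowercasing and whitespace removal as the "review conditions", and the use of an in-memory set.

```python
import unicodedata
from typing import Iterable, Set


def preprocess(raw: str) -> str:
    """Hypothetical preprocessing: the patent does not list concrete steps."""
    # NFKC folds full-width letters and digits into their ASCII forms.
    text = unicodedata.normalize("NFKC", raw)
    text = text.lower()
    # Drop all whitespace so that spacing cannot be used to dodge matching.
    return "".join(ch for ch in text if not ch.isspace())


class SensitiveLexicon:
    """Hypothetical in-memory lexicon that accepts custom words at runtime."""

    def __init__(self, base_words: Iterable[str]) -> None:
        self._words: Set[str] = {preprocess(w) for w in base_words} - {""}

    def add_custom(self, words: Iterable[str]) -> Set[str]:
        """Add business-specific words; return only the newly added ones."""
        candidates = {preprocess(w) for w in words} - {""}
        added = candidates - self._words
        self._words |= added
        return added

    def __contains__(self, word: str) -> bool:
        return word in self._words

    def __len__(self) -> int:
        return len(self._words)
```

Normalizing lexicon entries with the same function that normalizes the input text keeps the two sides comparable; a production system would also need persistence and a way to rebuild the search structure after the lexicon changes.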

Further, the matching module 100 is configured to construct a search tree for search matching, extract the common prefix of the character strings of the text to be reviewed, and obtain the search matching result from the search tree based on the common prefix.
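One common way to realize a search tree with shared prefixes is a trie (prefix tree). The following Python sketch is one possible reading of the matching module, not the implementation disclosed in the application; the names `TrieNode`, `SensitiveTrie`, `add` and `search` are hypothetical, and the scan-from-every-offset strategy is an assumption.

```python
from typing import Dict, Iterable, List, Tuple


class TrieNode:
    """One node of the multi-way tree; children are keyed by character."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False  # True if a sensitive word ends at this node


class SensitiveTrie:
    """Hypothetical prefix tree built over the sensitive-word lexicon."""

    def __init__(self, words: Iterable[str] = ()) -> None:
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        """Insert a word; words with a common prefix share the same path."""
        if not word:
            return
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, text: str) -> List[Tuple[int, int, str]]:
        """Return (start, end, word) for every lexicon word found in text."""
        hits: List[Tuple[int, int, str]] = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.children.get(text[end])
                if node is None:
                    break  # no lexicon word continues with this character
                if node.is_word:
                    hits.append((start, end + 1, text[start:end + 1]))
        return hits
```

Because `add` can be called at any time, custom sensitive words can be inserted without rebuilding the tree. Scanning from every offset costs at most the text length multiplied by the longest lexicon word; an Aho-Corasick automaton would avoid the repeated restarts, but the source does not mention it.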

Further, the fusion module 300 is configured to set the sensitive-word confidence as follows: when the comparison result is that the sensitive word in the search matching result is identical to a segmented word and is hit, the confidence is 1; when the comparison result is that the sensitive word in the search matching result is shorter than the segmented word in the word-level text and is hit, the confidence is 0.2; and when the comparison result is that the segmented words of the word-level text are proper subsets of the sensitive word in the search matching result, and adjacent segmented words, once merged, have the same length as the sensitive word, the confidence is 1.
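The three confidence rules can be sketched in Python as a comparison between the character span of a hit and the spans of the segmented words. This is an illustrative interpretation only: the names `token_spans` and `fuse_confidence` are hypothetical, the segmented words are assumed to be supplied by an external word segmenter and to concatenate back into the original text, and returning `None` for a hit that crosses word boundaries is an assumption, since the source does not define that case.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)


def token_spans(tokens: List[str]) -> List[Span]:
    """Character span of each segmented word, assuming they tile the text."""
    spans: List[Span] = []
    pos = 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def fuse_confidence(hit: Span, spans: List[Span]) -> Optional[float]:
    """Confidence of one character-level hit given the word-level spans."""
    start, end = hit
    # Rule 1: the sensitive word coincides exactly with one segmented word.
    if hit in spans:
        return 1.0
    # Rule 2: the sensitive word lies inside a longer segmented word.
    for s, e in spans:
        if s <= start and end <= e:
            return 0.2
    # Rule 3: adjacent segmented words merge into exactly the sensitive word.
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    if start in starts and end in ends:
        return 1.0
    # Not covered by the source (assumption): the hit cuts across boundaries.
    return None
```

For example, with the segmented words `["scamper", "spam", "black", "mail"]`, a hit on "spam" falls under rule 1, a hit on "scam" inside "scamper" falls under rule 2, and a hit on "blackmail" spanning "black" and "mail" falls under rule 3.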

It should be noted that the foregoing explanation of the embodiment of the content review method based on character-level and word-level fusion also applies to the content review apparatus of this embodiment, and details are not repeated here.

The content review apparatus based on character-level and word-level fusion according to the embodiment of the present application can review text content automatically by fusing character-level and word-level results on top of a dynamic lexicon. This reduces the time and effort spent on manual review, improves review efficiency and lowers review cost. The fused result preserves character-level accuracy while adding word-level semantic information, which makes the review results more reasonable and greatly reduces review errors. In addition, a search tree built over the sensitive-word lexicon exploits the common-prefix property of the tree model, which greatly shortens the response time of text review and improves its effectiveness and efficiency.

FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic device may include:

a memory 401, a processor 402, and a computer program stored in the memory 401 and executable on the processor 402.

When executing the program, the processor 402 implements the content review method based on character-level and word-level fusion provided in the above embodiments.

Further, the electronic device also includes:

a communication interface 403, used for communication between the memory 401 and the processor 402.

The memory 401 is used to store the computer program executable on the processor 402.

The memory 401 may include high-speed RAM, and may also include non-volatile memory, for example at least one disk memory.

If the memory 401, the processor 402 and the communication interface 403 are implemented independently, they may be connected to one another and communicate through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. Buses may be classified as address buses, data buses, control buses and so on. For ease of illustration, the bus is represented by a single thick line in FIG. 4, but this does not mean that there is only one bus or one type of bus.

Optionally, in a specific implementation, if the memory 401, the processor 402 and the communication interface 403 are integrated on a single chip, they may communicate with one another through an internal interface.

The processor 402 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.

An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the content review method based on character-level and word-level fusion described above.

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification, as well as the features of those embodiments or examples.

In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly specifying the number of the technical features referred to. Accordingly, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "N" means at least two, for example two or three, unless otherwise expressly and specifically defined.

Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that comprises one or more executable instructions for implementing the steps of a custom logical function or process. The scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.

The logic and/or steps represented in a flowchart or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transport a program for use by, or in connection with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or N wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then compiling, interpreting or otherwise processing it as necessary, and then stored in a computer memory.

It should be understood that the various parts of the present application may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, N steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and so on.

Those of ordinary skill in the art will understand that all or some of the steps of the methods of the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, the functional units in the embodiments of the present application may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as a stand-alone product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present application have been shown and described above, it should be understood that these embodiments are exemplary and shall not be construed as limiting the present application; those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the present application.

Claims

1. A content auditing method based on character-level and word-level fusion, characterized by comprising the following steps:

searching and matching the text to be audited according to the granularity of the character level to obtain a search matching result;

performing semantic level division on the text to be audited by using a word segmentation method to obtain a divided word-level text, and performing sensitive word comparison on the divided word-level text and the search matching result to obtain a comparison result; and

performing fusion based on the comparison result, calculating the confidence of the fused sensitive words, and obtaining a content auditing result according to the confidence of the sensitive words.

2. The method of claim 1, further comprising:

performing text preprocessing on an initial text to obtain the text to be audited that meets the auditing conditions.

3. The method of claim 1, wherein the searching and matching of the text to be audited according to the granularity of the character level to obtain a search matching result comprises:

constructing a search tree for searching for matches;

and extracting a common prefix of the character strings of the text to be audited, and obtaining the search matching result from the search tree based on the common prefix.

4. The method of claim 1, wherein the fusing based on the comparison result and calculating the confidence of the fused sensitive words comprises:

when the comparison result is that the sensitive word of the search matching result and the segmented word of the word-level text are identical and hit, the confidence of the sensitive word is 1;

when the comparison result is that the length of the sensitive word of the search matching result is smaller than the length of the segmented word of the word-level text and the sensitive word is hit, the confidence of the sensitive word is 0.2;

and when the comparison result is that the segmented words of the word-level text are proper subsets of the sensitive word of the search matching result, and the length of the adjacent segmented words after merging is the same as that of the sensitive word, the confidence of the sensitive word is 1.

5. The method according to any one of claims 1-4, further comprising:

acquiring the service requirement of a user;

and obtaining corresponding custom sensitive words according to the service requirement, and adding the custom sensitive words to a sensitive word lexicon used for search matching.

6. A content auditing device based on character-level and word-level fusion, comprising:

a matching module, used for searching and matching the text to be audited according to the granularity of the character level to obtain a search matching result;

a division module, used for performing semantic level division on the text to be audited by using a word segmentation method to obtain a divided word-level text, and performing sensitive word comparison on the divided word-level text and the search matching result to obtain a comparison result; and

a fusion module, used for fusing based on the comparison result, calculating the confidence of the fused sensitive words, and obtaining a content auditing result according to the confidence of the sensitive words.

7. The apparatus of claim 6, further comprising:

a preprocessing module, used for performing text preprocessing on an initial text to obtain the text to be audited that meets the auditing conditions; and

a management module, used for acquiring the service requirement of a user, obtaining corresponding custom sensitive words according to the service requirement, and adding the custom sensitive words to a sensitive word lexicon used for search matching.

8. The apparatus of claim 6, wherein:

the matching module is used for constructing a search tree for searching for matches, extracting a common prefix of the character strings of the text to be audited, and obtaining the search matching result from the search tree based on the common prefix; and

the fusion module is used for setting the confidence of the sensitive word to 1 when the comparison result is that the sensitive word of the search matching result and the segmented word of the word-level text are identical and hit; setting the confidence of the sensitive word to 0.2 when the comparison result is that the length of the sensitive word of the search matching result is smaller than the length of the segmented word of the word-level text and the sensitive word is hit; and setting the confidence of the sensitive word to 1 when the comparison result is that the segmented words of the word-level text are proper subsets of the sensitive word of the search matching result, and the length of the adjacent segmented words after merging is the same as that of the sensitive word.

9. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the character-level and word-level fusion based content auditing method according to any one of claims 1-5.

10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the content auditing method based on character-level and word-level fusion according to any one of claims 1-5.