WO2018000273A1 - Device and method for detecting bad corpus content - Google Patents

Device and method for detecting bad corpus content

Info

Publication number
WO2018000273A1
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
detected
bad
content
semantic
Prior art date
Application number
PCT/CN2016/087758
Other languages
English (en)
French (fr)
Inventor
杨新宇
王昊奋
邱楠
Original Assignee
深圳狗尾草智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳狗尾草智能科技有限公司
Priority to PCT/CN2016/087758 (WO2018000273A1)
Priority to CN201680001769.2A (CN106716397A)
Publication of WO2018000273A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • The present invention relates to the field of text processing, and in particular to a device and method for detecting bad corpus content.
  • In the prior art, bad corpus content is generally detected by statistical methods, which mainly decide whether content is bad by looking it up in a lexicon of bad-information words.
  • The disadvantage of the prior art is its low accuracy: it cannot accurately and comprehensively detect all the bad content in the content to be checked, which easily leads to missed detections.
  • The technical problem mainly solved by the present invention is to provide a device and method for detecting bad corpus content that, by comparison with the known types of semantic framework, can tell whether the semantic framework to be detected belongs to bad content, can accurately determine whether the corpus to be detected is bad content, and prevents missed detections.
  • To solve the above technical problem, one technical solution adopted by the present invention is to provide a bad corpus content detecting device, comprising: a semantic framework determining module, configured to perform word segmentation on the corpus to be detected and to determine the semantic framework of the corpus to be detected;
  • a detection standard setting module, connected to a corpus and to the semantic framework determining module, configured to transmit the entries of the corpus to the semantic framework determining module so as to extract their semantic frameworks, and to extract the bad content vocabulary obtained when the corpus is segmented;
  • a detection module, configured to compare the word segmentation result of the corpus to be detected with the bad content vocabulary, and to compare the semantic framework to be detected with all the semantic frameworks, so as to determine whether the corpus to be detected is bad corpus content.
  • To solve the above technical problem, one technical solution adopted by the present invention is to provide a method for detecting bad corpus content, the method comprising: performing word segmentation on the corpus to be detected and determining its semantic framework; extracting the semantic frameworks of the entries in the corpus, and extracting the bad content vocabulary obtained when the corpus is segmented; and comparing the word segmentation result of the corpus to be detected with the bad content vocabulary, and comparing the semantic framework to be detected with all the semantic frameworks, so as to determine whether the corpus to be detected is bad corpus content.
  • In contrast to the prior art, the bad corpus content detecting device of the present invention performs word segmentation on the corpus to be detected, determines its semantic framework from the semantics of each segmented word, and compares that framework with the known semantic frameworks to determine whether the corpus is bad corpus content.
  • With the present invention, comparison with the known types of semantic framework makes it possible to tell whether the semantic framework to be detected belongs to bad content, to determine accurately whether the corpus to be detected is bad content, and to prevent missed detections.
  • FIG. 1 is a schematic structural diagram of an embodiment of a bad corpus content detecting apparatus provided by the present invention
  • FIG. 2 is a schematic flow chart of an embodiment of a method for detecting a bad corpus content provided by the present invention.
  • Corpus construction is an important foundation of statistical learning methods. In recent years the great value of corpus resources for natural language research has been increasingly recognized. Bilingual corpora in particular have become an indispensable resource for machine translation, machine-assisted translation, and research on translation knowledge acquisition. On the one hand, the emergence of bilingual corpora has directly driven the development of new machine translation technologies: parallel corpora provide essential training data for building statistical machine translation models, and corpus-based translation methods, such as statistic-based and example-based methods, have brought new ideas to machine translation research, effectively improved translation quality, and set off a new wave of research in the field.
  • On the other hand, the bilingual corpus is an important source of translation knowledge, from which various kinds of fine-grained translation knowledge, such as translation dictionaries and translation templates, can be mined and learned to improve traditional machine translation technology.
  • In addition, the bilingual corpus is an important basic resource for cross-language information retrieval, translation dictionary compilation, automatic extraction of bilingual terms, and multilingual comparative research.
  • In today's networks, creating a healthy network environment requires checking both the corpus content already on the network and the content entered by network users in real time. The continual growth of corpus content makes that content difficult to check.
  • FIG. 1 is a schematic structural diagram of an embodiment of a bad corpus content detecting apparatus provided by the present invention.
  • the apparatus 100 includes a semantic framework determination module 110, a detection standard setting module 120, and a detection module 130, wherein the detection standard setting module 120 is connected to the semantic framework determination module 110 and the corpus 101.
  • Corpus 101 is a large-scale electronic text library that has been scientifically sampled and processed. With computer analysis tools, researchers can use it for research in linguistic theory and its applications. There are many types of corpora; the type is determined mainly by the research purpose and intended use, which is often reflected in the principles and methods of corpus collection.
  • Corpora are usually divided into four types: (1) heterogeneous: there is no specific collection principle, and all kinds of texts are widely collected and stored as they are; (2) homogeneous: only texts of the same kind of content are collected; (3) systematic: texts are collected according to predetermined principles and proportions, so that the corpus is balanced and systematic and can represent the linguistic facts of a certain range; (4) specialized: only texts for a specific purpose are collected.
  • The semantic framework determining module 110 performs word segmentation on the corpus to be detected and extracts its semantic framework, i.e. the semantic framework to be detected.
  • The semantic framework determining module 110 includes a word segmentation unit 111 and a semantic framework determining unit 112.
  • Based on the word segmentation result obtained by the word segmentation unit 111, the semantic framework determining unit 112 determines the semantic frameworks of the corpus to be detected and of the entries in the corpus, and determines the scene to which the corpus to be detected belongs from its context.
  • When a user enters text, that text is checked: the word segmentation unit 111 first segments the corpus to be detected, which can be done with an existing word segmentation tool.
  • Segmentation yields semantically independent words.
  • In this embodiment, the semantic frameworks of the existing entries in the corpus 101 must be determined, so those entries are first segmented by the word segmentation unit 111.
  • After segmentation, the semantics of all the segmented words are identified, the words with bad semantics are filtered out, and all of them are collected and stored.
  • The semantic framework determining unit 112 determines the semantic framework of the corpus to be detected from the segmentation result produced by the word segmentation unit 111, combined with the semantic type of each segmented word. Likewise, after a known entry of the existing corpus 101 has been segmented by the word segmentation unit 111, the semantic framework determining unit 112 determines the entry's semantic framework from its semantics, and determines the scene to which it belongs according to the context of the corpus to be detected. The semantic frameworks are then collected and grouped by scene, and within each group the semantic frameworks of normal entries are distinguished from those of bad entries. All types of semantic framework are stored.
  • The detection standard setting module 120 is connected to the corpus 101 and the semantic framework determining module 110. It transmits the entries of the corpus to the semantic framework determining module 110 so as to extract their semantic frameworks, determines the types of semantic framework, extracts the known bad content vocabulary, and stores all the semantic frameworks and the known bad content vocabulary.
  • The detection standard setting module 120 includes a bad content vocabulary acquiring unit 121 and a semantic framework classification unit 122.
  • The bad content vocabulary acquiring unit 121 is connected to the corpus 101 and acquires the known bad content vocabulary from the corpus 101.
  • In this embodiment, after the word segmentation unit 111 has segmented the entries of the existing corpus 101, the segmentation result is examined and the bad content words in it are filtered out, collected, and stored.
  • The bad content vocabulary acquiring unit 121, connected to the corpus 101, extracts the bad content vocabulary filtered and collected from the corpus 101.
  • In other embodiments, a lexicon of bad content vocabulary is already stored in the network cloud, and the bad content vocabulary acquiring unit 121 can connect directly to the network cloud and extract the known bad content vocabulary from that lexicon.
  • The semantic framework classification unit 122 classifies all the semantic frameworks, according to the semantic frameworks of the entries in the corpus 101, into normal semantic frameworks and bad semantic frameworks. Words of reactionary, violent, obscene, politically sensitive, and similar types are all bad content vocabulary. An entry that contains such words, or that contains none of them but whose semantic type is found on analysis to be attacking or abusive, is classified as having a bad semantic framework; the semantic frameworks of all other entries are normal semantic frameworks. The normal and bad semantic frameworks are then grouped according to the scene to which each entry belongs.
  • The detection module 130 compares the word segmentation result of the corpus to be detected with the known bad content vocabulary, and compares the semantic framework to be detected with all the semantic frameworks, to determine whether the corpus to be detected is bad corpus content.
  • After the word segmentation unit 111 has segmented the corpus to be detected, its segmented words are compared with the bad content vocabulary acquired by the bad content vocabulary acquiring unit 121 to check whether any bad content word is included. If so, the corpus is determined to be bad corpus content. If not, the semantic framework determining unit 112 determines the semantic framework of the corpus to be detected, which is compared with the semantic frameworks of the entries in the existing corpus 101 to establish whether it is a normal or a bad semantic framework, and thus whether the corpus to be detected is bad corpus content.
  • During this comparison, if the segmented words of the corpus to be detected are found to include at least one word of the bad content vocabulary, but its semantic framework is a normal semantic framework, the corpus to be detected is determined to be normal content. If the semantic framework of the corpus to be detected does not belong to any of the semantic frameworks of the entries in the corpus, whether it is bad corpus content is determined by the result of comparing its segmented words with the bad content vocabulary.
  • That is, if the detection module 130 finds that the segmented words of the corpus to be detected include at least one word of the bad content vocabulary, but comparison with all the semantic frameworks of the same scene in the corpus 101 shows that its semantic framework belongs to the semantic frameworks of existing normal entries in that scene, the content is determined to be normal corpus content.
  • If the detection module 130, after comparing the semantic framework of the corpus to be detected with the semantic frameworks of entries of the same scene in the existing corpus, finds that it does not belong to the semantic frameworks of that scene, whether the corpus to be detected is bad content is determined by the result of comparing its word segmentation result with the bad content vocabulary: if it contains a bad content word, it is bad content.
  • In contrast to the prior art, the bad corpus content detecting device of the present invention performs word segmentation on the corpus to be detected, determines its semantic framework from the semantics of each segmented word, and compares that framework with the known semantic frameworks to determine whether the corpus is bad corpus content.
  • With the present invention, comparison with the known types of semantic framework makes it possible to tell whether the semantic framework to be detected belongs to bad content, to determine accurately whether the corpus to be detected is bad content, and to prevent missed detections.
  • FIG. 2 is a schematic flowchart diagram of an implementation manner of a method for detecting a bad corpus content provided by the present invention. The steps of the method include:
  • S210: Perform word segmentation on the corpus to be detected and determine the semantic framework of the corpus to be detected.
  • The corpus to be detected is segmented and its semantic framework, i.e. the semantic framework to be detected, is extracted.
  • Based on the word segmentation result, the semantic frameworks of the corpus to be detected and of the entries in the corpus are determined, and the scene to which the corpus to be detected belongs is determined from its context.
  • When a user enters text, that text is checked: the corpus to be detected is first segmented, which can be done with an existing word segmentation tool.
  • Segmentation yields semantically independent words.
  • In this embodiment, the semantic frameworks of the existing entries in the corpus must be determined, so those entries are first segmented.
  • After segmentation, the semantics of all the segmented words are identified, the words with bad semantics are filtered out, and all of them are collected and stored.
  • The semantic framework of the corpus to be detected is determined from its segmentation result, combined with the semantic type of each segmented word.
  • Likewise, after a known entry of the existing corpus has been segmented, its semantic framework is determined from its semantics, and the scene to which it belongs is determined according to the context of the corpus to be detected.
  • The semantic frameworks are then collected and grouped by scene, and within each group the semantic frameworks of normal entries are distinguished from those of bad entries. All types of semantic framework are stored.
  • S220: Extract the semantic frameworks of the entries in the corpus, and extract the bad content vocabulary obtained when the corpus is segmented.
  • The semantic frameworks of the entries in the corpus are extracted, the types of semantic framework are determined, the known bad content vocabulary is extracted, and all the semantic frameworks and the known bad content vocabulary are stored.
  • In other embodiments, a lexicon of bad content vocabulary is already stored in the network cloud; the method can connect directly to the network cloud and extract the known bad content vocabulary from that lexicon.
  • According to the semantic frameworks of the entries in the corpus, all the semantic frameworks are classified into normal semantic frameworks and bad semantic frameworks. Words of reactionary, violent, obscene, politically sensitive, and similar types are all bad content vocabulary. An entry that contains such words, or that contains none of them but whose semantic type is found on analysis to be attacking or abusive, is classified as having a bad semantic framework; the semantic frameworks of all other entries are normal semantic frameworks. The normal and bad semantic frameworks are then grouped according to the scene to which each entry belongs.
  • S230: Compare the word segmentation result of the corpus to be detected with the bad content vocabulary, and compare the semantic framework to be detected with all the semantic frameworks, to determine whether the corpus to be detected is bad corpus content.
  • The word segmentation result of the corpus to be detected is compared with the known bad content vocabulary, and the semantic framework to be detected is compared with all the semantic frameworks, to determine whether the corpus to be detected is bad corpus content.
  • After the corpus to be detected has been segmented, its segmented words are compared with the acquired bad content vocabulary to check whether any bad content word is included. If so, the corpus is determined to be bad corpus content. If not, the semantic framework of the corpus to be detected is determined and compared with the semantic frameworks of the entries in the existing corpus to establish whether it is a normal or a bad semantic framework, and thus whether the corpus to be detected is bad corpus content.
  • During this comparison, if the segmented words of the corpus to be detected are found to include at least one word of the bad content vocabulary, but its semantic framework is a normal semantic framework, the corpus to be detected is determined to be normal content. If the semantic framework of the corpus to be detected does not belong to any of the semantic frameworks of the entries in the corpus, whether it is bad corpus content is determined by the result of comparing its segmented words with the bad content vocabulary.
  • That is, if the segmented words of the corpus to be detected are found to include at least one word of the bad content vocabulary, but comparison with all the semantic frameworks of the same scene in the corpus shows that its semantic framework belongs to the semantic frameworks of existing normal entries in that scene, the content is determined to be normal corpus content.
  • If comparison with the semantic frameworks of entries of the same scene in the existing corpus shows that the semantic framework of the corpus to be detected does not belong to the semantic frameworks of that scene, whether the corpus to be detected is bad content is determined by the result of comparing its word segmentation result with the bad content vocabulary: if it contains a bad content word, it is bad content.
  • In contrast to the prior art, the bad corpus content detecting method of the present invention performs word segmentation on the corpus to be detected, determines its semantic framework from the semantics of each segmented word, and compares that framework with the known semantic frameworks to determine whether the corpus is bad corpus content.
  • With the present invention, comparison with the known types of semantic framework makes it possible to tell whether the semantic framework to be detected belongs to bad content, to determine accurately whether the corpus to be detected is bad content, and to prevent missed detections.

Abstract

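A device and method for detecting bad corpus content. The device comprises: a semantic framework determining module (110), configured to perform word segmentation on the corpus to be detected and to determine the semantic framework of the corpus to be detected; a detection standard setting module (120), connected to a corpus (101) and to the semantic framework determining module (110), configured to transmit the entries of the corpus (101) to the semantic framework determining module (110) so as to determine the semantic frameworks of the entries in the corpus (101), and to extract the bad content vocabulary obtained when the corpus (101) is segmented; and a detection module (130), configured to compare the word segmentation result of the corpus to be detected with the bad content vocabulary, and to compare the semantic framework to be detected with all the semantic frameworks, so as to determine whether the corpus to be detected is bad corpus content. With this solution, comparison with the known types of semantic framework makes it possible to tell whether the semantic framework to be detected belongs to bad content, to determine accurately whether the corpus to be detected is bad content, and to prevent missed detections.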

Description

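Device and method for detecting bad corpus content
Technical Field
The present invention relates to the field of text processing, and in particular to a device and method for detecting bad corpus content.
Background Art
With the development of the Internet, demand for web search keeps growing, so more keywords and corpus texts need to be stocked and stored in cloud-based corpora for Internet users to search. To improve the network environment, the words or texts entered by network users often need to be checked for bad content, and words or texts with bad content blocked.
In the prior art, bad corpus content is generally detected by statistical methods, which mainly decide whether content is bad by looking it up in a lexicon of bad-information words. The disadvantage of the prior art is its low accuracy: it cannot accurately and comprehensively detect all the bad content in the content to be checked, which easily leads to missed detections.
Summary of the Invention
The technical problem mainly solved by the present invention is to provide a device and method for detecting bad corpus content that, by comparison with the known types of semantic framework, can tell whether the semantic framework to be detected belongs to bad content, can accurately determine whether the corpus to be detected is bad content, and prevents missed detections.
To solve the above technical problem, one technical solution adopted by the present invention is to provide a bad corpus content detecting device, comprising: a semantic framework determining module, configured to perform word segmentation on the corpus to be detected and to determine the semantic framework of the corpus to be detected; a detection standard setting module, connected to a corpus and to the semantic framework determining module, configured to transmit the entries of the corpus to the semantic framework determining module so as to extract their semantic frameworks, and to extract the bad content vocabulary obtained when the corpus is segmented; and a detection module, configured to compare the word segmentation result of the corpus to be detected with the bad content vocabulary, and to compare the semantic framework to be detected with all the semantic frameworks, so as to determine whether the corpus to be detected is bad corpus content.
To solve the above technical problem, one technical solution adopted by the present invention is to provide a method for detecting bad corpus content, the method comprising: performing word segmentation on the corpus to be detected and determining its semantic framework; extracting the semantic frameworks of the entries in the corpus, and extracting the bad content vocabulary obtained when the corpus is segmented; and comparing the word segmentation result of the corpus to be detected with the bad content vocabulary, and comparing the semantic framework to be detected with all the semantic frameworks, so as to determine whether the corpus to be detected is bad corpus content.
In contrast to the prior art, the bad corpus content detecting device of the present invention performs word segmentation on the corpus to be detected, determines its semantic framework from the semantics of each segmented word, and compares that framework with the known semantic frameworks to determine whether the corpus is bad corpus content. With the present invention, comparison with the known types of semantic framework makes it possible to tell whether the semantic framework to be detected belongs to bad content, to determine accurately whether the corpus to be detected is bad content, and to prevent missed detections.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of an embodiment of a bad corpus content detecting device provided by the present invention;
FIG. 2 is a schematic flowchart of an embodiment of a bad corpus content detecting method provided by the present invention.
Detailed Description
The technical solutions of the present invention are described in further detail below with reference to specific embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
Corpus construction is an important foundation of statistical learning methods. In recent years the great value of corpus resources for natural language research has been increasingly recognized. Bilingual corpora in particular have become an indispensable resource for machine translation, machine-assisted translation, and research on translation knowledge acquisition. On the one hand, the emergence of bilingual corpora has directly driven the development of new machine translation technologies: parallel corpora provide essential training data for building statistical machine translation models, and corpus-based translation methods, such as statistic-based and example-based methods, have brought new ideas to machine translation research, effectively improved translation quality, and set off a new wave of research in the field. On the other hand, the bilingual corpus is an important source of translation knowledge, from which various kinds of fine-grained translation knowledge, such as translation dictionaries and translation templates, can be mined and learned to improve traditional machine translation technology. In addition, the bilingual corpus is an important basic resource for cross-language information retrieval, translation dictionary compilation, automatic extraction of bilingual terms, and multilingual comparative research. In today's networks, creating a healthy network environment requires checking both the corpus content already on the network and the content entered by network users in real time. The continual growth of corpus content makes that content difficult to check.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of an embodiment of a bad corpus content detecting device provided by the present invention. The device 100 includes a semantic framework determining module 110, a detection standard setting module 120, and a detection module 130, wherein the detection standard setting module 120 is connected to the semantic framework determining module 110 and the corpus 101.
Corpus 101 is a large-scale electronic text library that has been scientifically sampled and processed. With computer analysis tools, researchers can use it for research in linguistic theory and its applications. There are many types of corpora; the type is determined mainly by the research purpose and intended use, which is often reflected in the principles and methods of corpus collection. Corpora are usually divided into four types: (1) heterogeneous: there is no specific collection principle, and all kinds of texts are widely collected and stored as they are; (2) homogeneous: only texts of the same kind of content are collected; (3) systematic: texts are collected according to predetermined principles and proportions, so that the corpus is balanced and systematic and can represent the linguistic facts of a certain range; (4) specialized: only texts for a specific purpose are collected.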
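The semantic framework determining module 110 performs word segmentation on the corpus to be detected and extracts its semantic framework, i.e. the semantic framework to be detected. The semantic framework determining module 110 includes a word segmentation unit 111 and a semantic framework determining unit 112. Based on the word segmentation result obtained by the word segmentation unit 111, the semantic framework determining unit 112 determines the semantic frameworks of the corpus to be detected and of the entries in the corpus, and determines the scene to which the corpus to be detected belongs from its context. When a user enters text, that text is checked: the word segmentation unit 111 first segments the corpus to be detected, which can be done with an existing word segmentation tool. Segmentation yields semantically independent words. In this embodiment, the semantic frameworks of the existing entries in the corpus 101 must be determined, so those entries are first segmented by the word segmentation unit 111. After segmentation, the semantics of all the segmented words are identified, the words with bad semantics are filtered out, and all of them are collected and stored.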
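The text leaves word segmentation to an existing tool and does not name one. As a stand-in, the following minimal sketch uses forward maximum matching over a tiny lexicon. The function name `segment`, the `LEXICON` contents, and the `max_len` parameter are illustrative assumptions and are not taken from the patent; a real deployment would use a full segmentation tool.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon entry.

    Characters that start no lexicon entry are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical lexicon, for illustration only.
LEXICON = {"语料", "检测", "不良", "内容"}

tokens = segment("检测不良内容", LEXICON)
print(tokens)
```

Each token produced this way would then be looked up for its semantic type before the semantic framework is determined.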
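The semantic framework determining unit 112 determines the semantic framework of the corpus to be detected from the segmentation result produced by the word segmentation unit 111, combined with the semantic type of each segmented word. Likewise, after a known entry of the existing corpus 101 has been segmented by the word segmentation unit 111, the semantic framework determining unit 112 determines the entry's semantic framework from its semantics, and determines the scene to which it belongs according to the context of the corpus to be detected. The semantic frameworks are then collected and grouped by scene, and within each group the semantic frameworks of normal entries are distinguished from those of bad entries. All types of semantic framework are stored.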
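The patent does not say how a semantic framework is represented. One possible reading, sketched below, treats a framework as the sequence of semantic types of the segmented words and stores known frameworks grouped by scene, split into normal and bad within each scene. `SEMANTIC_TYPES`, `frame_of`, and `FrameStore` are hypothetical names introduced only for this illustration.

```python
from collections import defaultdict

# Hypothetical token -> semantic type table; the patent does not define one.
SEMANTIC_TYPES = {"you": "PERSON", "are": "COPULA", "kind": "PRAISE", "stupid": "INSULT"}


def frame_of(tokens, semantic_types=SEMANTIC_TYPES):
    """Assumed representation: the tuple of semantic types of the tokens, in order."""
    return tuple(semantic_types.get(tok, "OTHER") for tok in tokens)


class FrameStore:
    """Known frameworks grouped by scene, split into normal and bad within each scene."""

    def __init__(self):
        self._by_scene = defaultdict(lambda: {"normal": set(), "bad": set()})

    def add(self, scene, frame, label):
        if label not in ("normal", "bad"):
            raise ValueError("label must be 'normal' or 'bad'")
        self._by_scene[scene][label].add(frame)

    def lookup(self, scene, frame):
        """Return 'normal', 'bad', or None when the framework is unknown in this scene."""
        groups = self._by_scene.get(scene)  # .get() does not create a new scene
        if groups is None:
            return None
        for label in ("bad", "normal"):
            if frame in groups[label]:
                return label
        return None


store = FrameStore()
store.add("chat", frame_of(["you", "are", "kind"]), "normal")
store.add("chat", frame_of(["you", "are", "stupid"]), "bad")
print(store.lookup("chat", frame_of(["you", "are", "stupid"])))
```

Keying the store by scene first mirrors the grouping step in the text: the same framework can be normal in one scene and unknown in another.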
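The detection standard setting module 120 is connected to the corpus 101 and the semantic framework determining module 110. It transmits the entries of the corpus to the semantic framework determining module 110 so as to extract their semantic frameworks, determines the types of semantic framework, extracts the known bad content vocabulary, and stores all the semantic frameworks and the known bad content vocabulary.
The detection standard setting module 120 includes a bad content vocabulary acquiring unit 121 and a semantic framework classification unit 122. The bad content vocabulary acquiring unit 121 is connected to the corpus 101 and acquires the known bad content vocabulary from the corpus 101. In this embodiment, after the word segmentation unit 111 has segmented the entries of the existing corpus 101, the segmentation result is examined and the bad content words in it are filtered out, collected, and stored. The bad content vocabulary acquiring unit 121, connected to the corpus 101, extracts the bad content vocabulary filtered and collected from the corpus 101. In other embodiments, a lexicon of bad content vocabulary is already stored in the network cloud, and the bad content vocabulary acquiring unit 121 can connect directly to the network cloud and extract the known bad content vocabulary from that lexicon. The semantic framework classification unit 122 classifies all the semantic frameworks, according to the semantic frameworks of the entries in the corpus 101, into normal semantic frameworks and bad semantic frameworks. Words of reactionary, violent, obscene, politically sensitive, and similar types are all bad content vocabulary. An entry that contains such words, or that contains none of them but whose semantic type is found on analysis to be attacking or abusive, is classified as having a bad semantic framework; the semantic frameworks of all other entries are normal semantic frameworks. The normal and bad semantic frameworks are then grouped according to the scene to which each entry belongs.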
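The labelling rule applied by the semantic framework classification unit 122 can be sketched as follows. This is a minimal illustration under stated assumptions: the patent names the categories but gives no concrete word list or type inventory, so `BAD_VOCAB`, `ABUSIVE_TYPES`, and `classify_frame` are hypothetical.

```python
# Hypothetical inputs: the patent names the categories but lists no concrete values.
BAD_VOCAB = {"bomb"}
ABUSIVE_TYPES = {"ATTACK", "ABUSE"}


def classify_frame(tokens, semantic_types, bad_vocab=BAD_VOCAB, abusive_types=ABUSIVE_TYPES):
    """Label a corpus entry's semantic framework as 'bad' or 'normal'.

    tokens: segmented words of the corpus entry
    semantic_types: semantic type of each token, in the same order
    """
    if any(tok in bad_vocab for tok in tokens):
        return "bad"  # the entry contains a bad content word
    if any(sem in abusive_types for sem in semantic_types):
        return "bad"  # no listed word, but attacking or abusive semantics
    return "normal"


print(classify_frame(["go", "away"], ["ACTION", "ABUSE"]))
```

The second check is what lets an entry be classified as bad even when none of its words appears in the bad content vocabulary.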
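The detection module 130 compares the word segmentation result of the corpus to be detected with the known bad content vocabulary, and compares the semantic framework to be detected with all the semantic frameworks, to determine whether the corpus to be detected is bad corpus content. After the word segmentation unit 111 has segmented the corpus to be detected, its segmented words are compared with the bad content vocabulary acquired by the bad content vocabulary acquiring unit 121 to check whether any bad content word is included. If so, the corpus is determined to be bad corpus content. If not, the semantic framework determining unit 112 determines the semantic framework of the corpus to be detected, which is compared with the semantic frameworks of the entries in the existing corpus 101 to establish whether it is a normal or a bad semantic framework, and thus whether the corpus to be detected is bad corpus content.
During this comparison, if the segmented words of the corpus to be detected are found to include at least one word of the bad content vocabulary, but its semantic framework is a normal semantic framework, the corpus to be detected is determined to be normal content. If the semantic framework of the corpus to be detected does not belong to any of the semantic frameworks of the entries in the corpus, whether it is bad corpus content is determined by the result of comparing its segmented words with the bad content vocabulary. That is, if the detection module 130 finds that the segmented words of the corpus to be detected include at least one word of the bad content vocabulary, but comparison with all the semantic frameworks of the same scene in the corpus 101 shows that its semantic framework belongs to the semantic frameworks of existing normal entries in that scene, the content is determined to be normal corpus content. If the detection module 130, after comparing the semantic framework of the corpus to be detected with the semantic frameworks of entries of the same scene in the existing corpus, finds that it does not belong to the semantic frameworks of that scene, whether the corpus to be detected is bad content is determined by the result of comparing its word segmentation result with the bad content vocabulary: if it contains a bad content word, it is bad content.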
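The two paragraphs on the detection module describe a vocabulary check that is then refined by a framework check. One way to read them together, assumed here, is that a known framework in the same scene takes precedence and the vocabulary match decides only when the framework is unknown. The sketch below encodes that reading; `is_bad_content`, the shape of `scene_frames`, and the sample values are hypothetical.

```python
def is_bad_content(tokens, frame, scene_frames, bad_vocab):
    """Decide whether a segmented text is bad content.

    tokens: word segmentation result of the text to be detected
    frame: its semantic framework (any hashable representation)
    scene_frames: {"normal": set, "bad": set} of known frameworks for the text's scene
    bad_vocab: set of known bad content words
    """
    has_bad_word = any(tok in bad_vocab for tok in tokens)
    if frame in scene_frames["normal"]:
        return False  # normal framework wins even if a bad word is present
    if frame in scene_frames["bad"]:
        return True   # known bad framework
    return has_bad_word  # unknown framework: fall back to the vocabulary match


# Hypothetical data for a gaming scene.
scene_frames = {"normal": {("ACTION", "TARGET")}, "bad": {("THREAT", "PERSON")}}
bad_vocab = {"kill"}

print(is_bad_content(["kill", "boss"], ("ACTION", "TARGET"), scene_frames, bad_vocab))
```

In this example the word "kill" is in the bad content vocabulary, but the framework is a known normal one for the scene, so the text is treated as normal content.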
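In contrast to the prior art, the bad corpus content detecting device of the present invention performs word segmentation on the corpus to be detected, determines its semantic framework from the semantics of each segmented word, and compares that framework with the known semantic frameworks to determine whether the corpus is bad corpus content. With the present invention, comparison with the known types of semantic framework makes it possible to tell whether the semantic framework to be detected belongs to bad content, to determine accurately whether the corpus to be detected is bad content, and to prevent missed detections.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of an embodiment of a bad corpus content detecting method provided by the present invention. The steps of the method include:
S210: Perform word segmentation on the corpus to be detected and determine the semantic framework of the corpus to be detected.
The corpus to be detected is segmented and its semantic framework, i.e. the semantic framework to be detected, is extracted. Based on the word segmentation result, the semantic frameworks of the corpus to be detected and of the entries in the corpus are determined, and the scene to which the corpus to be detected belongs is determined from its context. When a user enters text, that text is checked: the corpus to be detected is first segmented, which can be done with an existing word segmentation tool. Segmentation yields semantically independent words. In this embodiment, the semantic frameworks of the existing entries in the corpus must be determined, so those entries are first segmented. After segmentation, the semantics of all the segmented words are identified, the words with bad semantics are filtered out, and all of them are collected and stored.
The semantic framework of the corpus to be detected is determined from its segmentation result, combined with the semantic type of each segmented word. Likewise, after a known entry of the existing corpus has been segmented, its semantic framework is determined from its semantics, and the scene to which it belongs is determined according to the context of the corpus to be detected. The semantic frameworks are then collected and grouped by scene, and within each group the semantic frameworks of normal entries are distinguished from those of bad entries. All types of semantic framework are stored.
S220: Extract the semantic frameworks of the entries in the corpus, and extract the bad content vocabulary obtained when the corpus is segmented.
The semantic frameworks of the entries in the corpus are extracted, the types of semantic framework are determined, the known bad content vocabulary is extracted, and all the semantic frameworks and the known bad content vocabulary are stored.
The known bad content vocabulary is acquired from the corpus. In this embodiment, after the entries of the existing corpus have been segmented, the segmentation result is examined and the bad content words in it are filtered out, collected, and stored. The bad content vocabulary filtered and collected from the corpus is then extracted. In other embodiments, a lexicon of bad content vocabulary is already stored in the network cloud; the method can connect directly to the network cloud and extract the known bad content vocabulary from that lexicon. According to the semantic frameworks of the entries in the corpus, all the semantic frameworks are classified into normal semantic frameworks and bad semantic frameworks. Words of reactionary, violent, obscene, politically sensitive, and similar types are all bad content vocabulary. An entry that contains such words, or that contains none of them but whose semantic type is found on analysis to be attacking or abusive, is classified as having a bad semantic framework; the semantic frameworks of all other entries are normal semantic frameworks. The normal and bad semantic frameworks are then grouped according to the scene to which each entry belongs.
S230: Compare the word segmentation result of the corpus to be detected with the bad content vocabulary, and compare the semantic framework to be detected with all the semantic frameworks, to determine whether the corpus to be detected is bad corpus content.
The word segmentation result of the corpus to be detected is compared with the known bad content vocabulary, and the semantic framework to be detected is compared with all the semantic frameworks, to determine whether the corpus to be detected is bad corpus content. After the corpus to be detected has been segmented, its segmented words are compared with the acquired bad content vocabulary to check whether any bad content word is included. If so, the corpus is determined to be bad corpus content. If not, the semantic framework of the corpus to be detected is determined and compared with the semantic frameworks of the entries in the existing corpus to establish whether it is a normal or a bad semantic framework, and thus whether the corpus to be detected is bad corpus content.
During this comparison, if the segmented words of the corpus to be detected are found to include at least one word of the bad content vocabulary, but its semantic framework is a normal semantic framework, the corpus to be detected is determined to be normal content. If the semantic framework of the corpus to be detected does not belong to any of the semantic frameworks of the entries in the corpus, whether it is bad corpus content is determined by the result of comparing its segmented words with the bad content vocabulary. That is, if the segmented words of the corpus to be detected are found to include at least one word of the bad content vocabulary, but comparison with all the semantic frameworks of the same scene in the corpus shows that its semantic framework belongs to the semantic frameworks of existing normal entries in that scene, the content is determined to be normal corpus content. If comparison with the semantic frameworks of entries of the same scene in the existing corpus shows that the semantic framework of the corpus to be detected does not belong to the semantic frameworks of that scene, whether the corpus to be detected is bad content is determined by the result of comparing its word segmentation result with the bad content vocabulary: if it contains a bad content word, it is bad content.
In contrast to the prior art, the bad corpus content detecting method of the present invention performs word segmentation on the corpus to be detected, determines its semantic framework from the semantics of each segmented word, and compares that framework with the known semantic frameworks to determine whether the corpus is bad corpus content. With the present invention, comparison with the known types of semantic framework makes it possible to tell whether the semantic framework to be detected belongs to bad content, to determine accurately whether the corpus to be detected is bad content, and to prevent missed detections.
The above are only embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.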

Claims (10)

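  1. A device for detecting bad corpus content, characterized by comprising:
    a semantic framework determining module, configured to perform word segmentation on a corpus to be detected and to determine the semantic framework of the corpus to be detected;
    a detection standard setting module, connected to a corpus and to the semantic framework determining module, configured to transmit the entries of the corpus to the semantic framework determining module so as to determine the semantic frameworks of the entries in the corpus, and to extract the bad content vocabulary obtained when the corpus is segmented;
    a detection module, configured to compare the word segmentation result of the corpus to be detected with the bad content vocabulary, and to compare the semantic framework to be detected with all of the semantic frameworks, so as to determine whether the corpus to be detected is bad corpus content.
  2. The bad corpus content detecting device according to claim 1, characterized in that the semantic framework determining module comprises:
    a word segmentation unit, configured to perform word segmentation on the corpus to be detected and on the entries in the corpus;
    a semantic framework determining unit, configured to determine, from the word segmentation result obtained by the word segmentation unit, the semantic frameworks of the corpus to be detected and of the entries in the corpus, and to determine the scene to which the corpus to be detected belongs from its context.
  3. The bad corpus content detecting device according to claim 2, characterized in that the detection standard setting module comprises:
    a bad content vocabulary acquiring unit, connected to the corpus and configured to acquire the bad content vocabulary from the corpus;
    a semantic framework classification unit, configured to classify all the semantic frameworks, according to the semantic frameworks of the entries in the corpus, into normal semantic frameworks and bad semantic frameworks, and to group the normal semantic frameworks and bad semantic frameworks according to the scene to which each entry belongs.
  4. The bad corpus content detecting device according to claim 3, characterized in that, if comparison detects that the segmented words of the corpus to be detected include at least one word of the bad content vocabulary, while the semantic framework of the corpus to be detected is a normal semantic framework, the corpus to be detected is determined to be normal content.
  5. The bad corpus content detecting device according to claim 4, characterized in that, if the semantic framework of the corpus to be detected does not belong to the semantic frameworks of the entries in the corpus, whether the corpus to be detected is bad corpus content is determined by the result of comparing the segmented words with the bad content vocabulary.
  6. A method for detecting bad corpus content, characterized by comprising:
    performing word segmentation on a corpus to be detected and determining the semantic framework of the corpus to be detected;
    extracting the semantic frameworks of the entries in the corpus, and extracting the bad content vocabulary obtained when the corpus is segmented;
    comparing the word segmentation result of the corpus to be detected with the bad content vocabulary, and comparing the semantic framework to be detected with all of the semantic frameworks, so as to determine whether the corpus to be detected is bad corpus content.
  7. The bad corpus content detecting method according to claim 6, characterized in that the step of extracting the semantic framework to be detected of the corpus to be detected comprises the steps of:
    performing word segmentation on the corpus to be detected and on the entries in the corpus;
    determining, from the word segmentation result, the semantic frameworks of the corpus to be detected and of the entries in the corpus, and determining the scene to which the corpus to be detected belongs from its context.
  8. The bad corpus content detecting method according to claim 7, characterized in that the step of determining the types of semantic framework and extracting the known bad content vocabulary comprises the steps of:
    acquiring the bad content vocabulary from the corpus;
    classifying all the semantic frameworks, according to the semantic frameworks of the entries in the corpus, into normal semantic frameworks and bad semantic frameworks, and grouping the normal semantic frameworks and bad semantic frameworks according to the scene to which each entry belongs.
  9. The bad corpus content detecting method according to claim 8, characterized in that, if comparison detects that the segmented words of the corpus to be detected include at least one word of the bad content vocabulary, and the semantic framework of the corpus to be detected is a normal semantic framework, the corpus to be detected is determined to be normal content.
  10. The bad corpus content detecting method according to claim 9, characterized in that, if the semantic framework of the corpus to be detected does not belong to the semantic frameworks of the entries in the corpus, whether the corpus to be detected is bad corpus content is determined by the result of comparing the segmented words with the bad content vocabulary.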
PCT/CN2016/087758 2016-06-29 2016-06-29 Device and method for detecting bad corpus content WO2018000273A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2016/087758 WO2018000273A1 (zh) 2016-06-29 2016-06-29 Device and method for detecting bad corpus content
CN201680001769.2A CN106716397A (zh) 2016-06-29 2016-06-29 Device and method for detecting bad corpus content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/087758 WO2018000273A1 (zh) 2016-06-29 2016-06-29 Device and method for detecting bad corpus content

Publications (1)

Publication Number Publication Date
WO2018000273A1 (zh) 2018-01-04

Family

ID=58906768

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/087758 WO2018000273A1 (zh) 2016-06-29 2016-06-29 Device and method for detecting bad corpus content

Country Status (2)

Country Link
CN (1) CN106716397A (zh)
WO (1) WO2018000273A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362659A (zh) * 2019-07-16 2019-10-22 北京洛必德科技有限公司 Method and system for filtering abnormal sentences from a robot's open corpus

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010000064A1 (en) * 2008-07-01 2010-01-07 Dossierview Inc. Information processing with integrated semantic contexts
CN102693236A (zh) * 2011-03-24 2012-09-26 苏州风采信息技术有限公司 Method for filtering bad information based on content understanding
CN105574090A (zh) * 2015-12-10 2016-05-11 北京中科汇联科技股份有限公司 Sensitive word filtering method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
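CN102279875B (zh) * 2011-06-24 2013-04-24 华为数字技术(成都)有限公司 Method and device for identifying phishing websites
CN102929897A (zh) * 2011-08-12 2013-02-13 北京千橡网景科技发展有限公司 Method and device for detecting bad information in text
CN102609516A (zh) * 2012-02-08 2012-07-25 苏州中联互通信息科技有限公司 Method for filtering bad information based on content understanding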

Also Published As

Publication number Publication date
CN106716397A (zh) 2017-05-24

Similar Documents

Publication Publication Date Title
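KR100961717B1 Method and apparatus for detecting machine translation errors using a parallel corpus
US10657325B2 (en) Method for parsing query based on artificial intelligence and computer device
CN112347244B (zh) Method for detecting pornography-related and gambling-related websites based on mixed feature analysis
US11521603B2 (en) Automatically generating conference minutes
CN107832229A (zh) NLP-based method for automatically generating system test cases
CN102779135B (zh) Method and device for obtaining search resources across languages, and corresponding search method and device
CN106570180A (zh) Artificial-intelligence-based voice search method and device
CN109522396B (zh) Knowledge processing method and system for the field of national defense science and technology
CN112765974B (zh) Service assistance method, electronic device and readable storage medium
CN105389303B (zh) Method for automatic fusion of heterogeneous corpora
CN110738033B (zh) Report template generation method, device and storage medium
CN107341142B (zh) Enterprise relationship calculation method and system based on keyword extraction and analysis
CN112836067B (zh) Intelligent search method based on a knowledge graph
CN104572619A (zh) Application of an intelligent robot interaction system in the field of investment and financing
WO2021012684A1 (zh) Method and system for establishing a market sentiment monitoring system
WO2018000273A1 (zh) Device and method for detecting bad corpus content
CN113033217A (zh) Method and device for automatically masking and translating sensitive information in subtitles
CN108021595B (zh) Method and device for verifying knowledge base triples
CN110929509B (zh) Domain event trigger word clustering method based on the Louvain community detection algorithm
CN111859032A (zh) Method and device for detecting sensitive words split into character components in short messages, and computer storage medium
CN106776590A (zh) Method and system for obtaining translations of entries
CN106294315A (zh) Natural language predicate verb recognition method based on fusion of syntactic features and statistics
CN107577667B (zh) Entity word processing method and device
CN113722421A (zh) Contract auditing method and system, and computer-readable storage medium
CN111341404A (zh) Method and system for parsing electronic medical record data groups based on the ERNIE model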

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 16906674; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 16906674; Country of ref document: EP; Kind code of ref document: A1)