CN106716397A

CN106716397A - Device and method for detecting bad corpus data content

Info

Publication number: CN106716397A
Application number: CN201680001769.2A
Authority: CN
Inventors: 杨新宇; 王昊奋; 邱楠
Original assignee: Shenzhen Green Bristlegrass Intelligence Science And Technology Ltd
Current assignee: Shenzhen Green Bristlegrass Intelligence Science And Technology Ltd
Priority date: 2016-06-29
Filing date: 2016-06-29
Publication date: 2017-05-24
Also published as: WO2018000273A1

Abstract

The invention discloses a device and method for detecting bad corpus content. The device comprises: a semantic frame determining module, for performing word segmentation on corpus data to be detected and determining the semantic frame of that corpus data; a detection standard setting module, connected to a corpus and to the semantic frame determining module, for transmitting the corpus data in the corpus to the semantic frame determining module so as to determine the semantic frames of the corpus data in the corpus, and for extracting the harmful-content words obtained when the corpus is segmented; and a detection module, for comparing the word segmentation result of the corpus data to be detected with the harmful-content words, comparing the semantic frame to be detected with all the semantic frames, and determining whether the corpus data to be detected is bad corpus content. By comparing the semantic frame to be detected with known semantic frames, the invention judges whether that frame is the frame of bad corpus content, so that it can accurately judge whether the corpus data to be detected is harmful content and prevent missed detections.

Description

Device and method for detecting bad corpus content

Technical field

The present invention relates to the field of text processing, and more particularly to a device and method for detecting bad corpus content.

Background

With the development of the internet, the demand for network retrieval keeps growing, so more keywords and corpus data need to be stored in cloud-based corpora for use when users search the internet. To improve the network environment, the words or corpus data entered by network users usually need to be checked for harmful content, and words or corpus data with harmful content need to be blocked.

In the prior art, bad corpus data is generally detected by statistical methods, which mainly judge whether content is harmful according to a dictionary of harmful information. The shortcoming of the prior art is its low accuracy: it cannot accurately and comprehensively detect all the harmful content in the content to be detected, which easily leads to missed detections.

Summary of the invention

The main technical problem solved by the present invention is to provide a device and method for detecting bad corpus content that, by comparison with the known kinds of semantic frames, can distinguish whether a semantic frame to be detected is that of harmful-content corpus data, can accurately judge whether the corpus data to be detected is harmful content, and can prevent missed detections.

To solve the above technical problem, one technical solution adopted by the present invention is to provide a device for detecting bad corpus content. The device comprises: a semantic frame determining module, for performing word segmentation on corpus data to be detected and determining the semantic frame of the corpus data to be detected; a detection standard setting module, connected to a corpus and to the semantic frame determining module, for transmitting the corpus data in the corpus to the semantic frame determining module so as to extract the semantic frames of the corpus data in the corpus, and for extracting the harmful-content words obtained when the corpus is segmented; and a detection module, for comparing the word segmentation result of the corpus data to be detected with the harmful-content words, comparing the semantic frame to be detected with all the semantic frames, and determining whether the corpus data to be detected is bad corpus content.

To solve the above technical problem, another technical solution adopted by the present invention is to provide a method for detecting bad corpus content. The method comprises the steps of: performing word segmentation on corpus data to be detected and determining the semantic frame of the corpus data to be detected; extracting the semantic frames of the corpus data in a corpus, and extracting the harmful-content words obtained when the corpus is segmented; and comparing the word segmentation result of the corpus data to be detected with the harmful-content words, comparing the semantic frame to be detected with all the semantic frames, and determining whether the corpus data to be detected is bad corpus content.

In contrast to the prior art, the device of the present invention performs word segmentation on the corpus data to be detected, determines its semantic frame from the semantics of each segmented word, and compares that frame with known semantic frames to determine whether the data is bad corpus content. By comparison with the known kinds of semantic frames, the invention can distinguish whether the semantic frame to be detected is that of harmful-content corpus data, can accurately judge whether the corpus data to be detected is harmful content, and can prevent missed detections.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of an embodiment of a device for detecting bad corpus content provided by the present invention;

Fig. 2 is a schematic flowchart of an embodiment of a method for detecting bad corpus content provided by the present invention.

Detailed description

The technical solution of the present invention is described in further detail below with reference to specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

The construction of corpora is an important foundation of statistical learning methods, and in recent years the great value of corpus resources for natural language research has been increasingly recognized. Bilingual corpora in particular have become an indispensable resource for research on machine translation, machine-aided translation, and translation knowledge acquisition. On the one hand, the emergence of bilingual corpora has directly promoted the development of new machine translation techniques: parallel corpora provide the essential training data for building statistical machine translation models, and corpus-based translation methods such as statistic-based and example-based methods have brought new ideas to machine translation research, effectively improved translation quality, and set off a new upsurge in the field. On the other hand, bilingual corpora are an important source of translation knowledge, from which various kinds of fine-grained translation knowledge, such as translation dictionaries and translation templates, can be mined and learned in order to improve traditional machine translation techniques. In addition, bilingual corpora are an important basic resource for cross-language information retrieval, translation dictionary compilation, automatic extraction of bilingual terminology, and multilingual contrastive studies. In today's networks, creating a healthy network environment requires checking the content of both the corpus data already on the network and the corpus data that network users enter in real time. The continuous growth and enrichment of corpus content makes detecting that content difficult.

Referring to Fig. 1, Fig. 1 is a schematic structural diagram of an embodiment of a device for detecting bad corpus content provided by the present invention. The device 100 comprises a semantic frame determining module 110, a detection standard setting module 120, and a detection module 130, where the detection standard setting module 120 is connected to the semantic frame determining module 110 and to a corpus 101.

The corpus 101 is a large-scale electronic text collection that has been scientifically sampled and processed. With computer analysis tools, researchers can use it to carry out research on language theory and its applications. Corpora come in many types, and the type is determined mainly by the research purpose and intended use, which is often reflected in the principles and manner of collecting the corpus data. Corpora are generally divided into four types: (1) heterogeneous: there are no specific collection principles, and all kinds of corpus data are collected widely and stored as is; (2) homogeneous: only corpus data of the same kind of content is collected; (3) systematic: corpus data is collected according to predetermined principles and proportions, so that it is balanced and systematic and can represent the linguistic facts within a certain scope; (4) specialized: only corpus data for a particular purpose is collected.

The semantic frame determining module 110 performs word segmentation on the corpus data to be detected and extracts the semantic frame to be detected. The semantic frame determining module 110 comprises a word segmentation unit 111 and a semantic frame determining unit 112. The semantic frame determining unit 112 determines the semantic frames of the corpus data to be detected and of the corpus data in the corpus from the segmentation results produced by the word segmentation unit 111, and determines the scene to which the corpus data to be detected belongs from its context. When a user inputs corpus data, that input is detected: the word segmentation unit 111 first segments the corpus data to be detected, which can be done with an existing word segmentation tool. Segmentation yields semantically independent words. In this embodiment, the semantic frames of the existing corpus data in the corpus 101 must be determined, so the existing corpus data in the corpus 101 is first segmented by the word segmentation unit 111. After segmentation, the semantics of all the segmented words are identified, the words with bad semantics are screened out, and all the bad-semantics words are collected and stored.

The semantic frame determining unit 112 determines the semantic frame of the corpus data to be detected from the segmentation result produced by the word segmentation unit 111, in combination with the semantic type of each segmented word. Likewise, after the known corpus data in the existing corpus 101 has been segmented by the word segmentation unit 111, the semantic frame determining unit 112 determines the semantic frame of that corpus data from its semantics and determines the scene to which it belongs from the context. The semantic frames of the corpus data are then collected and grouped by scene, and within each group the semantic frames of normal corpus data are distinguished from those of bad corpus data. Semantic frames of all kinds are stored.
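By way of illustration only, the segmentation, frame determination, and grouping by scene described above can be sketched in Python. The description does not specify how a semantic frame is represented, so the sketch assumes that a frame is simply the sequence of semantic types of the segmented words. The lexicon `SEMANTIC_TYPES`, the whitespace segmenter, the scene labels, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lexicon mapping a word to its semantic type (an assumption for
# illustration; the description does not say how semantics are identified).
SEMANTIC_TYPES = {
    "i": "AGENT", "you": "TARGET", "will": "MODAL",
    "hit": "HARM_ACT", "thank": "POLITE_ACT", "the": "DET", "ball": "OBJECT",
}

def segment(text):
    """Toy segmenter: lowercase and split on whitespace.
    A real system would call an existing word segmentation tool here."""
    return text.lower().split()

def semantic_frame(words):
    """Assumed representation: the frame is the tuple of semantic types."""
    return tuple(SEMANTIC_TYPES.get(w, "OTHER") for w in words)

def build_frame_store(labelled_corpus):
    """Group frames by scene, keeping normal and bad frames apart.
    labelled_corpus: iterable of (text, scene, is_bad) triples."""
    store = defaultdict(lambda: {"normal": set(), "bad": set()})
    for text, scene, is_bad in labelled_corpus:
        frame = semantic_frame(segment(text))
        store[scene]["bad" if is_bad else "normal"].add(frame)
    return store

# Hypothetical labelled corpus data.
corpus = [
    ("I will hit the ball", "sports", False),
    ("I will hit you", "chat", True),
    ("I thank you", "chat", False),
]
store = build_frame_store(corpus)
```

Storing the frames per scene means that the same word sequence can be normal in one scene and bad in another, which is the distinction the grouping step is meant to preserve.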

The detection standard setting module 120 is connected to the corpus 101 and to the semantic frame determining module 110. It transmits the corpus data in the corpus to the semantic frame determining module 110 so as to extract the semantic frames of the corpus data in the corpus, determines the kinds of semantic frames, extracts the known harmful-content words, and stores all the semantic frames together with the known harmful-content words.

The detection standard setting module 120 comprises a harmful-content vocabulary acquisition unit 121 and a semantic frame classification unit 122. The harmful-content vocabulary acquisition unit 121 is connected to the corpus 101 and obtains the known harmful-content words from the corpus 101. In this embodiment, after the word segmentation unit 111 has segmented the corpus data in the existing corpus 101, the segmentation results are examined, and the harmful-content words among them are screened out, collected, and stored. The harmful-content vocabulary acquisition unit 121, being connected to the corpus 101, retrieves the harmful-content words screened and collected from the corpus 101. In other embodiments, a lexicon of harmful-content words is already stored in the network cloud, and the harmful-content vocabulary acquisition unit 121 may connect directly to the network cloud and extract the known harmful-content words from that lexicon. The semantic frame classification unit 122 classifies all the semantic frames into normal semantic frames and bad semantic frames according to the semantic frames of the corpus data in the corpus 101. Words of types such as reactionary, violent, obscene, or politically sensitive are harmful-content words. The semantic frame of corpus data that contains such words, or that contains no such words but is found on semantic analysis to be of an attacking or abusive type, can be classified as a bad semantic frame; the semantic frames of all other corpus data are normal semantic frames. The normal semantic frames and bad semantic frames are then grouped according to the scene to which each item of corpus data belongs.
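A minimal Python sketch of how such a detection standard might be assembled follows. It assumes that every word already carries a type label and that every item of corpus data carries a frame, a scene, and an overall semantic type. The label sets `HARMFUL_TYPES` and `ABUSIVE_SEMANTICS`, the dictionary keys, and the function names are hypothetical and are not taken from this description.

```python
from collections import defaultdict

# Word types treated as harmful content (labels are illustrative assumptions).
HARMFUL_TYPES = {"reactionary", "violent", "obscene", "politically_sensitive"}
# Overall semantic types that make corpus data bad even without a harmful word.
ABUSIVE_SEMANTICS = {"attack", "abuse"}

def collect_harmful_words(word_types):
    """word_types: dict of word -> type label (hypothetical annotation)."""
    return {w for w, t in word_types.items() if t in HARMFUL_TYPES}

def classify_frames(items, harmful_words):
    """Classify each item's frame as normal or bad and group by scene.
    items: list of dicts with keys 'words', 'frame', 'scene', 'semantic_type'."""
    groups = defaultdict(lambda: {"normal": set(), "bad": set()})
    for item in items:
        has_harmful = any(w in harmful_words for w in item["words"])
        is_abusive = item["semantic_type"] in ABUSIVE_SEMANTICS
        kind = "bad" if has_harmful or is_abusive else "normal"
        groups[item["scene"]][kind].add(item["frame"])
    return groups

# Hypothetical annotated corpus data.
word_types = {"bomb": "violent", "hello": "neutral", "idiot": "neutral"}
harmful = collect_harmful_words(word_types)
items = [
    {"words": ["hello", "friend"], "frame": ("GREETING", "PERSON"),
     "scene": "chat", "semantic_type": "greeting"},
    {"words": ["you", "idiot"], "frame": ("TARGET", "INSULT"),
     "scene": "chat", "semantic_type": "abuse"},
    {"words": ["bomb", "them"], "frame": ("HARM_ACT", "TARGET"),
     "scene": "news", "semantic_type": "statement"},
]
groups = classify_frames(items, harmful)
```

In this sketch the second item contains no harmful-content word but is still classified as bad because its overall semantic type is abusive, mirroring the second classification criterion described above.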

The detection module 130 compares the word segmentation result of the corpus data to be detected with the known harmful-content words, compares the semantic frame to be detected with all the semantic frames, and determines whether the corpus data to be detected is bad corpus content. After the word segmentation unit 111 has segmented the corpus data to be detected, the segmented words are compared with the harmful-content words obtained by the harmful-content vocabulary acquisition unit 121 to detect whether they include any harmful-content word; if so, the corpus data is regarded as bad corpus content. If not, the semantic frame determining unit 112 determines the semantic frame of the corpus data to be detected, that frame is compared with the semantic frames of the corpus data in the existing corpus 101, and it is analyzed whether the frame belongs to the normal semantic frames or to the bad semantic frames, thereby detecting whether the corpus data to be detected is bad corpus content.

In the comparison and judgment process, if the comparison detects that the segmented words of the corpus data to be detected include at least one of the harmful-content words, and the semantic frame of the corpus data to be detected belongs to the normal semantic frames, the corpus data to be detected is determined to be normal content. If the semantic frame of the corpus data to be detected does not belong to any of the semantic frames of the corpus data in the corpus, whether the corpus data to be detected is bad corpus data is determined from the result of comparing the segmented words with the harmful-content words. That is, even if the detection module 130 finds that the segmented words of the corpus data to be detected include at least one harmful-content word, the corpus data is deemed normal content when a comparison of its semantic frame with all the semantic frames under the same scene in the corpus 101 shows that its frame belongs to the semantic frames of the existing normal corpus data under that scene. If, by comparing the semantic frame of the corpus data to be detected with the semantic frames of the corpus data under the same scene in the existing corpus, the detection module 130 finds that the frame does not belong to the semantic frames under that scene in the corpus, then whether the corpus data is harmful content is determined from the result of comparing its segmented words with the harmful-content words: if it contains a harmful-content word, it is harmful-content corpus data.
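The judgment rules above can be summarized in a short Python sketch. It assumes that frames are hashable values stored per scene in "normal" and "bad" sets; the function name `is_bad_content` and the data layout are hypothetical. The sketch follows the refined rule of this paragraph, in which a frame known to be normal under the same scene takes precedence over a harmful-content word match, and the word comparison decides only when the frame is unknown for that scene.

```python
def is_bad_content(words, frame, scene, harmful_words, frame_store):
    """Return True if the corpus data to be detected is judged bad content.

    words         -- segmented words of the corpus data to be detected
    frame         -- its semantic frame (any hashable representation)
    scene         -- the scene determined from its context
    harmful_words -- set of known harmful-content words
    frame_store   -- dict: scene -> {"normal": set of frames, "bad": set of frames}
    """
    scene_frames = frame_store.get(scene, {"normal": set(), "bad": set()})
    if frame in scene_frames["normal"]:
        # A known normal frame wins, even if a harmful-content word appears.
        return False
    if frame in scene_frames["bad"]:
        return True
    # Unknown frame for this scene: fall back to the word comparison.
    return any(w in harmful_words for w in words)

# Hypothetical detection standard.
harmful_words = {"kill"}
frame_store = {
    "gaming": {"normal": {("AGENT", "HARM_ACT", "OBJECT")}, "bad": set()},
    "chat": {"normal": set(), "bad": {("AGENT", "HARM_ACT", "TARGET")}},
}
```

For example, under these assumptions `is_bad_content(["i", "kill", "boss"], ("AGENT", "HARM_ACT", "OBJECT"), "gaming", harmful_words, frame_store)` is judged normal because the frame is a known normal frame in the gaming scene, whereas the same words with an unrecognized frame would be judged bad on the strength of the word match alone.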

Referring to Fig. 2, Fig. 2 is a schematic flowchart of an embodiment of a method for detecting bad corpus content provided by the present invention. The method comprises the following steps:

S210: Perform word segmentation on the corpus data to be detected, and determine the semantic frame of the corpus data to be detected.

Word segmentation is performed on the corpus data to be detected, and the semantic frame to be detected is extracted. The semantic frames of the corpus data to be detected and of the corpus data in the corpus are determined from the segmentation results, and the scene to which the corpus data to be detected belongs is determined from its context. When a user inputs corpus data, that input is detected: the corpus data to be detected is first segmented, which can be done with an existing word segmentation tool. Segmentation yields semantically independent words. In this embodiment, the semantic frames of the existing corpus data in the corpus must be determined, so the existing corpus data in the corpus is segmented first. After segmentation, the semantics of all the segmented words are identified, the words with bad semantics are screened out, and all the bad-semantics words are collected and stored.

The semantic frame of the corpus data to be detected is determined from its segmentation result, in combination with the semantic type of each segmented word. Likewise, after the known corpus data in the existing corpus has been segmented, the semantic frame of that corpus data is determined from its semantics, and the scene to which it belongs is determined from the context. The semantic frames of the corpus data are then collected and grouped by scene, and within each group the semantic frames of normal corpus data are distinguished from those of bad corpus data. Semantic frames of all kinds are stored.

S220: Extract the semantic frames of the corpus data in the corpus, and extract the harmful-content words obtained when the corpus is segmented.

The semantic frames of the corpus data in the corpus are extracted, the kinds of semantic frames are determined, the known harmful-content words are extracted, and all the semantic frames are stored together with the known harmful-content words.

The known harmful-content words are obtained from the corpus. In this embodiment, after the corpus data in the existing corpus has been segmented, the segmentation results are examined, and the harmful-content words among them are screened out, collected, and stored. The harmful-content words screened and collected from the corpus are then retrieved. In other embodiments, a lexicon of harmful-content words is already stored in the network cloud, and the known harmful-content words may be extracted from that lexicon by connecting directly to the network cloud. All the semantic frames are classified into normal semantic frames and bad semantic frames according to the semantic frames of the corpus data in the corpus. Words of types such as reactionary, violent, obscene, or politically sensitive are harmful-content words. The semantic frame of corpus data that contains such words, or that contains no such words but is found on semantic analysis to be of an attacking or abusive type, can be classified as a bad semantic frame; the semantic frames of all other corpus data are normal semantic frames. The normal semantic frames and bad semantic frames are then grouped according to the scene to which each item of corpus data belongs.

S230: Compare the word segmentation result of the corpus data to be detected with the harmful-content words, compare the semantic frame to be detected with all the semantic frames, and determine whether the corpus data to be detected is bad corpus content.

The word segmentation result of the corpus data to be detected is compared with the known harmful-content words, the semantic frame to be detected is compared with all the semantic frames, and it is determined whether the corpus data to be detected is bad corpus content. After the corpus data to be detected has been segmented, its segmented words are compared with the harmful-content words obtained, to detect whether they include any harmful-content word; if so, the corpus data is regarded as bad corpus content. If not, the semantic frame of the corpus data to be detected is determined and compared with the semantic frames of the corpus data in the existing corpus, and it is analyzed whether the frame belongs to the normal semantic frames or to the bad semantic frames, thereby detecting whether the corpus data to be detected is bad corpus content.

In the comparison and judgment process, if the comparison detects that the segmented words of the corpus data to be detected include at least one of the harmful-content words, and the semantic frame of the corpus data to be detected belongs to the normal semantic frames, the corpus data to be detected is determined to be normal content. If the semantic frame of the corpus data to be detected does not belong to any of the semantic frames of the corpus data in the corpus, whether the corpus data to be detected is bad corpus data is determined from the result of comparing the segmented words with the harmful-content words. That is, even if the comparison finds that the segmented words of the corpus data to be detected include at least one harmful-content word, the corpus data is deemed normal content when a comparison of its semantic frame with all the semantic frames under the same scene in the corpus shows that its frame belongs to the semantic frames of the existing normal corpus data under that scene. If a comparison of the semantic frame of the corpus data to be detected with the semantic frames of the corpus data under the same scene in the existing corpus shows that the frame does not belong to the semantic frames under that scene in the corpus, then whether the corpus data is harmful content is determined from the result of comparing its segmented words with the harmful-content words: if it contains a harmful-content word, it is harmful-content corpus data.
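Purely as an illustrative sketch, steps S210 to S230 can be chained in a small Python class. The class name `BadContentDetector`, its methods, the whitespace segmenter, the representation of a frame as a tuple of semantic types, and the sample data are all assumptions made for the example. In particular, the harmful-content words are passed in directly here rather than screened from the corpus.

```python
class BadContentDetector:
    """Toy end-to-end pipeline for steps S210 to S230 (illustrative only)."""

    def __init__(self, semantic_types, harmful_words):
        self.semantic_types = semantic_types      # word -> semantic type (assumed lexicon)
        self.harmful_words = set(harmful_words)   # known harmful-content words
        self.frames = {}                          # scene -> {"normal": set, "bad": set}

    def _frame(self, text):
        # S210: segment the text and derive its (assumed) semantic frame.
        words = text.lower().split()
        frame = tuple(self.semantic_types.get(w, "OTHER") for w in words)
        return words, frame

    def fit(self, labelled_corpus):
        # S220: store the frames of known corpus data, grouped by scene.
        for text, scene, is_bad in labelled_corpus:
            _, frame = self._frame(text)
            group = self.frames.setdefault(scene, {"normal": set(), "bad": set()})
            group["bad" if is_bad else "normal"].add(frame)
        return self

    def is_bad(self, text, scene):
        # S230: compare the frame first, then fall back to the word comparison.
        words, frame = self._frame(text)
        group = self.frames.get(scene, {"normal": set(), "bad": set()})
        if frame in group["normal"]:
            return False
        if frame in group["bad"]:
            return True
        return any(w in self.harmful_words for w in words)


# Hypothetical lexicon and labelled corpus data.
types = {"i": "AGENT", "you": "TARGET", "kill": "HARM_ACT",
         "the": "DET", "boss": "OBJECT", "dragon": "OBJECT"}
detector = BadContentDetector(types, {"kill"}).fit([
    ("I kill the boss", "gaming", False),
    ("I kill you", "chat", True),
])
```

With this sample data, "I kill the dragon" in the gaming scene shares the frame of a known normal item and is therefore treated as normal despite containing a listed word, while text whose frame is unknown for its scene is decided by the word comparison alone.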

In contrast to the prior art, the method of the present invention performs word segmentation on the corpus data to be detected, determines its semantic frame from the semantics of each segmented word, and compares that frame with known semantic frames to determine whether the data is bad corpus content. By comparison with the known kinds of semantic frames, the invention can distinguish whether the semantic frame to be detected is that of harmful-content corpus data, can accurately judge whether the corpus data to be detected is harmful content, and can prevent missed detections.

The foregoing describes only embodiments of the present invention and does not thereby limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A device for detecting bad corpus content, characterized by comprising:

a semantic frame determining module, for performing word segmentation on corpus data to be detected and determining the semantic frame of the corpus data to be detected;

a detection standard setting module, connected to a corpus and to the semantic frame determining module, for transmitting the corpus data in the corpus to the semantic frame determining module so as to determine the semantic frames of the corpus data in the corpus, and for extracting the harmful-content words obtained when the corpus is segmented;

a detection module, for comparing the word segmentation result of the corpus data to be detected with the harmful-content words, comparing the semantic frame to be detected with all the semantic frames, and determining whether the corpus data to be detected is bad corpus content.

2. The device for detecting bad corpus content according to claim 1, characterized in that the semantic frame determining module comprises:

a word segmentation unit, for performing word segmentation on the corpus data to be detected and on the corpus data in the corpus;

a semantic frame determining unit, for determining the semantic frames of the corpus data to be detected and of the corpus data in the corpus from the segmentation results produced by the word segmentation unit, and for determining the scene to which the corpus data to be detected belongs from its context.

3. The device for detecting bad corpus content according to claim 2, characterized in that the detection standard setting module comprises:

a harmful-content vocabulary acquisition unit, connected to the corpus, for obtaining the harmful-content words from the corpus;

a semantic frame classification unit, for classifying all the semantic frames into normal semantic frames and bad semantic frames according to the semantic frames of the corpus data in the corpus, and for grouping the normal semantic frames and bad semantic frames according to the scene to which each item of corpus data belongs.

4. The device for detecting bad corpus content according to claim 3, characterized in that if the comparison detects that the segmented words of the corpus data to be detected include at least one of the harmful-content words, and the semantic frame of the corpus data to be detected belongs to the normal semantic frames, the corpus data to be detected is determined to be normal content.

5. The device for detecting bad corpus content according to claim 4, characterized in that if the semantic frame of the corpus data to be detected does not belong to the semantic frames of the corpus data in the corpus, whether the corpus data to be detected is bad corpus data is determined from the result of comparing the segmented words with the harmful-content words.

6. A method for detecting bad corpus content, characterized by comprising:

performing word segmentation on corpus data to be detected, and determining the semantic frame of the corpus data to be detected;

extracting the semantic frames of the corpus data in a corpus, and extracting the harmful-content words obtained when the corpus is segmented;

comparing the word segmentation result of the corpus data to be detected with the harmful-content words, comparing the semantic frame to be detected with all the semantic frames, and determining whether the corpus data to be detected is bad corpus content.

7. The method for detecting bad corpus content according to claim 6, characterized in that the step of extracting the semantic frame to be detected of the corpus data to be detected comprises the steps of:

performing word segmentation on the corpus data to be detected and on the corpus data in the corpus;

determining the semantic frames of the corpus data to be detected and of the corpus data in the corpus from the segmentation results, and determining the scene to which the corpus data to be detected belongs from its context.

8. The method for detecting bad corpus content according to claim 7, characterized in that the step of determining the kinds of semantic frames and extracting the known harmful-content words comprises the steps of:

obtaining the harmful-content words from the corpus;

classifying all the semantic frames into normal semantic frames and bad semantic frames according to the semantic frames of the corpus data in the corpus, and grouping the normal semantic frames and bad semantic frames according to the scene to which each item of corpus data belongs.

9. The method for detecting bad corpus content according to claim 8, characterized in that if the comparison detects that the segmented words of the corpus data to be detected include at least one of the harmful-content words, and the semantic frame of the corpus data to be detected belongs to the normal semantic frames, the corpus data to be detected is determined to be normal content.

10. The method for detecting bad corpus content according to claim 9, characterized in that if the semantic frame of the corpus data to be detected does not belong to the semantic frames of the corpus data in the corpus, whether the corpus data to be detected is bad corpus data is determined from the result of comparing the segmented words with the harmful-content words.