CN106716397A - Device and method for detecting bad corpus data content - Google Patents

Info

Publication number
CN106716397A
Authority
CN
China
Prior art keywords
language material
detected
semantic frame
corpus
bad
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201680001769.2A
Other languages
Chinese (zh)
Inventor
杨新宇
王昊奋
邱楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Green Bristlegrass Intelligence Science And Technology Ltd
Original Assignee
Shenzhen Green Bristlegrass Intelligence Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Green Bristlegrass Intelligence Science And Technology Ltd
Publication of CN106716397A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a device and method for detecting bad corpus data content. The device comprises: a semantic frame determining module for performing word segmentation on corpus data to be detected and determining the semantic frame of the corpus data to be detected; a detection standard setting module, connected to a corpus and to the semantic frame determining module, for transmitting the corpus data in the corpus to the semantic frame determining module in order to determine the semantic frames of the corpus data in the corpus, and for extracting the bad content words obtained during word segmentation of the corpus; and a detecting module for comparing the word segmentation result of the corpus data to be detected with the bad content words, comparing the semantic frame to be detected with all the semantic frames, and determining whether the corpus data to be detected is bad corpus data content. By comparing the semantic frame to be detected with known semantic frames, the invention identifies whether the semantic frame to be detected belongs to bad corpus data content, so that whether the corpus data to be detected is bad content can be judged accurately and missed detections can be prevented.

Description

A device and method for detecting bad corpus data content
Technical field
The present invention relates to the field of text processing, and more particularly to a device and method for detecting bad corpus data content.
Background
With the development of the Internet, the demand for network retrieval keeps growing, so more keywords and corpus data need to be accumulated and stored in cloud corpora for use when users search the Internet. To improve the network environment, the words or corpus data entered by network users usually need to be checked for bad content, and words or corpus data containing bad content need to be blocked.
In the prior art, bad corpus data is generally detected by statistical methods, which mainly judge whether content is bad according to a dictionary of bad information. The drawback of the prior art is its low accuracy: it cannot accurately and comprehensively detect all the bad content in the content to be detected, which easily leads to missed detections.
Summary of the invention
The main technical problem solved by the present invention is to provide a device and method for detecting bad corpus data content that can identify whether a semantic frame to be detected belongs to bad-content corpus data by comparing it with known types of semantic frames, so that whether the corpus data to be detected is bad content can be judged accurately and missed detections are prevented.
To solve the above technical problem, one technical solution adopted by the present invention is to provide a device for detecting bad corpus data content. The device includes: a semantic frame determining module for performing word segmentation on corpus data to be detected and determining the semantic frame of the corpus data to be detected; a detection standard setting module, connected to a corpus and to the semantic frame determining module, for transmitting the corpus data in the corpus to the semantic frame determining module so as to extract the semantic frames of the corpus data in the corpus, and for extracting the bad content words obtained when word segmentation is performed on the corpus; and a detecting module for comparing the word segmentation result of the corpus data to be detected with the bad content words, comparing the semantic frame to be detected with all the semantic frames, and determining whether the corpus data to be detected is bad corpus data content.
To solve the above technical problem, a further technical solution adopted by the present invention is to provide a method for detecting bad corpus data content. The method includes the steps of: performing word segmentation on corpus data to be detected and determining the semantic frame of the corpus data to be detected; extracting the semantic frames of the corpus data in a corpus, and extracting the bad content words obtained when word segmentation is performed on the corpus; and comparing the word segmentation result of the corpus data to be detected with the bad content words, comparing the semantic frame to be detected with all the semantic frames, and determining whether the corpus data to be detected is bad corpus data content.
Different from the prior art, the device for detecting bad corpus data content of the present invention performs word segmentation on the corpus data to be detected, determines its semantic frame from the semantics of each segmented word, and compares that frame with known semantic frames to determine whether the data is bad corpus data content. With the present invention, whether a semantic frame to be detected belongs to bad-content corpus data can be identified by comparison with known types of semantic frames, so that whether the corpus data to be detected is bad content can be judged accurately and missed detections are prevented.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of an embodiment of the device for detecting bad corpus data content provided by the present invention;
Fig. 2 is a schematic flowchart of an embodiment of the method for detecting bad corpus data content provided by the present invention.
Detailed description
The technical solution of the present invention is described in further detail below with reference to specific embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
The construction of corpora is an important foundation of statistical learning methods, and in recent years the great value of corpus resources for natural language research has been increasingly recognized. Bilingual corpora in particular have become an indispensable resource for research on machine translation, machine-aided translation and translation knowledge acquisition. On the one hand, the emergence of bilingual corpora has directly driven the development of new machine translation techniques: parallel corpora provide the essential training data for building statistical machine translation models, and corpus-based translation methods such as statistics-based and example-based approaches have brought new ideas to machine translation research, effectively improved translation quality, and set off a new wave of activity in the field. On the other hand, bilingual corpora are also an important source of translation knowledge, from which various kinds of fine-grained translation knowledge, such as translation dictionaries and translation templates, can be mined and learned to improve traditional machine translation techniques. In addition, bilingual corpora are an important basic resource for cross-language information retrieval, the compilation of translation dictionaries, automatic extraction of bilingual terminology, and multilingual contrastive studies. In today's networks, creating a healthy network environment requires checking the content of the corpus data that already exists on the network and of the corpus data that network users enter in real time. The continuous growth of corpus content makes detecting that content difficult.
Referring to Fig. 1, Fig. 1 is a schematic structural diagram of an embodiment of the device for detecting bad corpus data content provided by the present invention. The device 100 includes a semantic frame determining module 110, a detection standard setting module 120 and a detecting module 130, where the detection standard setting module 120 is connected to the semantic frame determining module 110 and to a corpus 101.
The corpus 101 is a large-scale electronic text collection that has been scientifically sampled and processed. With the help of computer analysis tools, researchers can use it to carry out research on language theory and related applications. There are many types of corpora; the type is determined mainly by the research purpose and use of the corpus, which is usually reflected in the principles and methods of data collection. Corpora are generally divided into four types: (1) heterogeneous: there are no specific collection rules, and all kinds of corpus data are widely collected and stored as they are; (2) homogeneous: only corpus data of the same kind of content is collected; (3) systematic: corpus data is collected according to predetermined principles and proportions so that it is balanced and systematic and can represent the facts of language use within a certain scope; (4) specialized: only corpus data for a particular purpose is collected.
The semantic frame determining module 110 performs word segmentation on the corpus data to be detected and extracts its semantic frame to be detected. The semantic frame determining module 110 includes a word segmentation unit 111 and a semantic frame determining unit 112. The semantic frame determining unit 112 determines the semantic frames of the corpus data to be detected and of the corpus data in the corpus according to the word segmentation results produced by the word segmentation unit 111, and determines the scene to which the corpus data to be detected belongs according to its context. When a user enters corpus data, that data is checked: the word segmentation unit 111 first segments the corpus data to be detected, which can be done with an existing word segmentation tool. Segmentation yields semantically independent words. In this embodiment, the semantic frames of the existing corpus data in the corpus 101 also need to be determined, so the existing corpus data in the corpus 101 should first be segmented by the word segmentation unit 111. After segmentation, the semantics of all the segmented words are identified, words with bad semantics are screened out, and all of these words are collected and stored.
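The segmentation and screening step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: `segment` is a hypothetical stand-in for an existing word segmentation tool (a simple regular-expression split that only suits space-delimited text; Chinese text would need a real segmenter), and `is_bad_semantics` is an assumed predicate standing in for the semantic recognition of each segmented word.

```python
import re
from typing import Callable, Iterable, List, Set


def segment(text: str) -> List[str]:
    """Hypothetical stand-in for an existing word segmentation tool."""
    return re.findall(r"\w+", text.lower())


def collect_bad_words(corpus: Iterable[str],
                      is_bad_semantics: Callable[[str], bool]) -> Set[str]:
    """Segment every corpus entry and collect the words with bad semantics."""
    bad_words: Set[str] = set()
    for entry in corpus:
        for token in segment(entry):
            if is_bad_semantics(token):
                bad_words.add(token)
    return bad_words


# Toy data: a seed set plays the role of semantic recognition (assumption).
SEED = {"idiot", "kill"}
corpus = ["You are an idiot", "Nice weather today", "I will kill you"]
bad_words = collect_bad_words(corpus, lambda token: token in SEED)
```

In this toy run, `bad_words` holds the two seed words that actually occur in the corpus; the stored set is what the later word comparison would consult.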
The semantic frame determining unit 112 determines the semantic frame of the corpus data to be detected from the segmentation result produced by the word segmentation unit 111, in combination with the semantic type of each segmented word. Likewise, after the known corpus data in the existing corpus 101 has been segmented by the word segmentation unit 111, the semantic frame determining unit 112 determines the semantic frame of that corpus data from its semantics, and determines the scene to which the corpus data to be detected belongs according to its context. The semantic frames of the corpus data are then collected and grouped by scene, and within each group the semantic frames of normal corpus data are distinguished from those of bad corpus data. Semantic frames of all types are stored.
The detection standard setting module 120 is connected to the corpus 101 and the semantic frame determining module 110. It transmits the corpus data in the corpus to the semantic frame determining module 110 so as to extract the semantic frames of the corpus data in the corpus, determines the types of semantic frames, extracts the known bad content words, and stores all the semantic frames and the known bad content words.
The detection standard setting module 120 includes a bad content word acquisition unit 121 and a semantic frame classification unit 122. The bad content word acquisition unit 121 is connected to the corpus 101 and obtains the known bad content words from it. In this embodiment, after the word segmentation unit 111 has segmented the corpus data in the existing corpus 101, the segmentation results are examined, and the bad content words in them are screened out, collected and stored. The bad content word acquisition unit 121, connected to the corpus 101, retrieves the bad content words screened and collected from the corpus 101. In other embodiments, a lexicon of bad content words is already stored in the cloud, and the bad content word acquisition unit 121 can connect directly to the cloud and extract the known bad content words from that lexicon. The semantic frame classification unit 122 classifies all the semantic frames into normal semantic frames and bad semantic frames according to the semantic frames of the corpus data in the corpus 101. Words of types such as reactionary, violent, obscene or politically sensitive are bad content words. The semantic frame of corpus data that contains such words, or that does not contain them but is found by analysis of its semantic type to be attacking or abusive, is classified as a bad semantic frame; the semantic frames of all other corpus data are normal semantic frames. The normal and bad semantic frames are then grouped according to the scene to which each piece of corpus data belongs.
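The classification and grouping performed by the semantic frame classification unit can be sketched as below. The text does not specify how a semantic frame is represented, so this sketch assumes a frame is a tuple of semantic type labels; `CorpusEntry`, its `abusive` flag (standing in for the result of semantic type analysis) and `classify_frames` are all hypothetical names introduced for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Frame = Tuple[str, ...]  # assumed representation: a sequence of semantic types


@dataclass(frozen=True)
class CorpusEntry:
    scene: str               # scene determined from context
    frame: Frame             # semantic frame of the entry
    tokens: Tuple[str, ...]  # segmented words of the entry
    abusive: bool = False    # outcome of semantic type analysis (assumed given)


def classify_frames(entries: List[CorpusEntry],
                    bad_words: Set[str]) -> Dict[str, Dict[str, Set[Frame]]]:
    """Group frames by scene and split each group into 'normal' and 'bad'."""
    store: Dict[str, Dict[str, Set[Frame]]] = defaultdict(
        lambda: {"normal": set(), "bad": set()})
    for entry in entries:
        # Bad if the entry contains a bad content word or is attacking/abusive.
        is_bad = entry.abusive or any(t in bad_words for t in entry.tokens)
        store[entry.scene]["bad" if is_bad else "normal"].add(entry.frame)
    return dict(store)


entries = [
    CorpusEntry("chat", ("PERSON", "BE", "INSULT"), ("you", "are", "idiot")),
    CorpusEntry("chat", ("PERSON", "BE", "PRAISE"), ("you", "are", "kind")),
    CorpusEntry("game", ("ORDER", "TARGET"), ("get", "lost"), abusive=True),
]
store = classify_frames(entries, {"idiot"})
```

Keying the store by scene first keeps the later comparison restricted to frames from the same scene, which is how the description frames the lookup.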
The detecting module 130 compares the segmentation result of the corpus data to be detected with the known bad content words, compares the semantic frame to be detected with all the semantic frames, and determines whether the corpus data to be detected is bad corpus data content. After the word segmentation unit 111 has segmented the corpus data to be detected, the segmented words are compared with the bad content words obtained by the bad content word acquisition unit 121 to check whether any bad content word is present. If so, the data is deemed bad corpus data content. If not, the semantic frame determining unit 112 determines the semantic frame of the corpus data to be detected, this frame is compared with the semantic frames of the corpus data in the existing corpus 101, and analyzing whether it is a normal or a bad semantic frame shows whether the corpus data to be detected is bad corpus data content.
In the comparison, if the segmented words of the corpus data to be detected are found to contain at least one of the bad content words but the semantic frame of the corpus data to be detected is a normal semantic frame, the corpus data to be detected is determined to be normal content. If the semantic frame of the corpus data to be detected does not match any of the semantic frames of the corpus data in the corpus, whether the corpus data to be detected is bad is determined from the comparison of its segmented words with the bad content words. That is, even if the detecting module 130 finds that the segmented words of the corpus data to be detected contain at least one bad content word, the content is deemed normal when, on comparing the semantic frame of the corpus data to be detected with all the semantic frames for the same scene in the corpus 101, that frame is found to be the semantic frame of existing normal corpus data for that scene. If, on comparing the semantic frame of the corpus data to be detected with the semantic frames for the same scene in the existing corpus, the detecting module 130 finds that the frame does not belong to the semantic frames for that scene, whether the data is bad content is determined from the comparison of its segmentation result with the bad content words: if it contains a bad content word, it is bad-content corpus data.
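The decision rule described above can be sketched as a single function. This is one possible reading of the description, in which a frame match within the same scene takes precedence over the word comparison and the word comparison is the fallback for unknown frames. The frame representation (a tuple of semantic type labels), the store layout and the name `is_bad_content` are assumptions made for illustration.

```python
from typing import Dict, Sequence, Set, Tuple

Frame = Tuple[str, ...]  # assumed representation: a sequence of semantic types


def is_bad_content(tokens: Sequence[str], frame: Frame, scene: str,
                   bad_words: Set[str],
                   store: Dict[str, Dict[str, Set[Frame]]]) -> bool:
    """Return True if the corpus data to be detected is judged bad content."""
    group = store.get(scene, {"normal": set(), "bad": set()})
    if frame in group["normal"]:
        # Known normal frame for this scene: normal content,
        # even if a bad content word occurs among the tokens.
        return False
    if frame in group["bad"]:
        # Known bad frame for this scene.
        return True
    # Frame not known for this scene: fall back to the word comparison.
    return any(token in bad_words for token in tokens)


store = {"cooking": {"normal": {("ACTION", "FOOD")},
                     "bad": {("THREAT", "PERSON")}}}
bad_words = {"kill"}

normal_despite_word = is_bad_content(
    ["kill", "fish"], ("ACTION", "FOOD"), "cooking", bad_words, store)
unknown_with_word = is_bad_content(
    ["kill", "time"], ("ACTION", "ABSTRACT"), "cooking", bad_words, store)
```

Here `normal_despite_word` is `False` because the frame matches a normal frame for the scene, while `unknown_with_word` is `True` because the frame is unknown and a bad content word is present.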
Different from the prior art, the device for detecting bad corpus data content of the present invention performs word segmentation on the corpus data to be detected, determines its semantic frame from the semantics of each segmented word, and compares that frame with known semantic frames to determine whether the data is bad corpus data content. With the present invention, whether a semantic frame to be detected belongs to bad-content corpus data can be identified by comparison with known types of semantic frames, so that whether the corpus data to be detected is bad content can be judged accurately and missed detections are prevented.
Referring to Fig. 2, Fig. 2 is a schematic flowchart of an embodiment of the method for detecting bad corpus data content provided by the present invention. The method includes the following steps:
S210: Perform word segmentation on the corpus data to be detected and determine the semantic frame of the corpus data to be detected.
Word segmentation is performed on the corpus data to be detected, and its semantic frame to be detected is extracted. The semantic frames of the corpus data to be detected and of the corpus data in the corpus are determined according to the word segmentation results, and the scene to which the corpus data to be detected belongs is determined according to its context. When a user enters corpus data, that data is checked: the corpus data to be detected is first segmented, which can be done with an existing word segmentation tool. Segmentation yields semantically independent words. In this embodiment, the semantic frames of the existing corpus data in the corpus also need to be determined, so the existing corpus data in the corpus should first be segmented. After segmentation, the semantics of all the segmented words are identified, words with bad semantics are screened out, and all of these words are collected and stored.
The semantic frame of the corpus data to be detected is determined from its segmentation result, in combination with the semantic type of each segmented word. Likewise, after the known corpus data in the existing corpus has been segmented, the semantic frame of that corpus data is determined from its semantics, and the scene to which the corpus data to be detected belongs is determined according to its context. The semantic frames of the corpus data are then collected and grouped by scene, and within each group the semantic frames of normal corpus data are distinguished from those of bad corpus data. Semantic frames of all types are stored.
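Step S210 can be illustrated with a small sketch of frame and scene determination. The description does not say how semantic types or scenes are computed, so everything here is an assumption: a frame is taken to be the sequence of semantic types of the segmented words, and the scene is chosen by a simple keyword vote over the context. `SEMANTIC_TYPES`, `SCENE_KEYWORDS`, `determine_frame` and `determine_scene` are hypothetical names and toy data.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Frame = Tuple[str, ...]  # assumed representation: a sequence of semantic types

# Hypothetical lexicons; a real system would use a semantic resource instead.
SEMANTIC_TYPES: Dict[str, str] = {
    "i": "PERSON", "you": "PERSON", "cook": "ACTION",
    "fish": "FOOD", "hate": "EMOTION",
}
SCENE_KEYWORDS: Dict[str, str] = {
    "cook": "kitchen", "fish": "kitchen", "hate": "chat",
}


def determine_frame(tokens: Sequence[str]) -> Frame:
    """Map each segmented word to its semantic type to form the frame."""
    return tuple(SEMANTIC_TYPES.get(token, "UNKNOWN") for token in tokens)


def determine_scene(context_tokens: Sequence[str],
                    default: str = "general") -> str:
    """Pick the scene that the most context words vote for."""
    votes = Counter(SCENE_KEYWORDS[t] for t in context_tokens
                    if t in SCENE_KEYWORDS)
    return votes.most_common(1)[0][0] if votes else default


tokens = ["i", "cook", "fish"]
frame = determine_frame(tokens)
scene = determine_scene(tokens)
```

With this toy data the frame is `("PERSON", "ACTION", "FOOD")` and the scene is `"kitchen"`; words missing from the lexicon map to `"UNKNOWN"`, and a context with no scene keywords falls back to the default scene.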
S220: Extract the semantic frames of the corpus data in the corpus, and extract the bad content words obtained when word segmentation is performed on the corpus.
The semantic frames of the corpus data in the corpus are extracted, the types of semantic frames are determined, the known bad content words are extracted, and all the semantic frames and the known bad content words are stored.
The known bad content words are obtained from the corpus. In this embodiment, after the corpus data in the existing corpus has been segmented, the segmentation results are examined, and the bad content words in them are screened out, collected and stored. The bad content words screened and collected from the corpus are then retrieved. In other embodiments, a lexicon of bad content words is already stored in the cloud, and the known bad content words can be extracted from that lexicon by connecting directly to the cloud. All the semantic frames are classified into normal semantic frames and bad semantic frames according to the semantic frames of the corpus data in the corpus. Words of types such as reactionary, violent, obscene or politically sensitive are bad content words. The semantic frame of corpus data that contains such words, or that does not contain them but is found by analysis of its semantic type to be attacking or abusive, is classified as a bad semantic frame; the semantic frames of all other corpus data are normal semantic frames. The normal and bad semantic frames are then grouped according to the scene to which each piece of corpus data belongs.
S230: Compare the word segmentation result of the corpus data to be detected with the bad content words, compare the semantic frame to be detected with all the semantic frames, and determine whether the corpus data to be detected is bad corpus data content.
The segmentation result of the corpus data to be detected is compared with the known bad content words, the semantic frame to be detected is compared with all the semantic frames, and whether the corpus data to be detected is bad corpus data content is determined. After the corpus data to be detected has been segmented, the segmented words are compared with the bad content words obtained, to check whether any bad content word is present. If so, the data is deemed bad corpus data content. If not, the semantic frame of the corpus data to be detected is determined and compared with the semantic frames of the corpus data in the existing corpus, and analyzing whether it is a normal or a bad semantic frame shows whether the corpus data to be detected is bad corpus data content.
In the comparison, if the segmented words of the corpus data to be detected are found to contain at least one of the bad content words but the semantic frame of the corpus data to be detected is a normal semantic frame, the corpus data to be detected is determined to be normal content. If the semantic frame of the corpus data to be detected does not match any of the semantic frames of the corpus data in the corpus, whether the corpus data to be detected is bad is determined from the comparison of its segmented words with the bad content words. That is, even if the segmented words of the corpus data to be detected are found to contain at least one bad content word, the content is deemed normal when, on comparing the semantic frame of the corpus data to be detected with all the semantic frames for the same scene in the corpus, that frame is found to be the semantic frame of existing normal corpus data for that scene. If comparing the semantic frame of the corpus data to be detected with the semantic frames for the same scene in the existing corpus shows that the frame does not belong to the semantic frames for that scene, whether the data is bad content is determined from the comparison of its segmentation result with the bad content words: if it contains a bad content word, it is bad-content corpus data.
Different from the prior art, the method for detecting bad corpus data content of the present invention performs word segmentation on the corpus data to be detected, determines its semantic frame from the semantics of each segmented word, and compares that frame with known semantic frames to determine whether the data is bad corpus data content. With the present invention, whether a semantic frame to be detected belongs to bad-content corpus data can be identified by comparison with known types of semantic frames, so that whether the corpus data to be detected is bad content can be judged accurately and missed detections are prevented.
The foregoing describes only embodiments of the present invention and does not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. a kind of detection means of bad language material content, it is characterised in that including:
Semantic frame determining module, for carrying out participle to language material to be detected, determines the semantic frame of the language material to be detected;
Examination criteria setting module, connects corpus and the semantic frame determining module, for by the language in the corpus Material is transferred to the semantic frame determining module, to determine the semantic frame of language material in the corpus, while extracting to language material Storehouse carries out the harmful content vocabulary obtained during word segmentation processing;
Detection module, word segmentation result and the harmful content vocabulary for comparing the language material to be detected, and treated described in comparison Detection semantic frame and all semantic frame, determine whether the language material to be detected is bad language material content.
2. bad language material content detection device according to claim 1, it is characterised in that the semantic frame determining module bag Include:
Participle unit, for carrying out participle to the language material in the language material to be detected and the corpus;
Semantic frame determining unit, for determining the language to be detected according to the word segmentation result obtained after the participle unit participle The semantic frame of the language material in material and the corpus, and its affiliated scene is determined according to the context of the language material to be detected.
3. bad language material content detection device according to claim 2, it is characterised in that examination criteria setting module includes:
Harmful content bilingual lexicon acquisition unit, connects the corpus, for obtaining the harmful content word from the corpus Converge;
, be categorized as whole semantic frames for the semantic frame according to language material in the corpus by semantic frame taxon Normal semantic frame and bad semantic frame, and according to each affiliated scene of language material to the normal semantic frame and not Good semantic frame is grouped.
4. bad language material content detection device according to claim 3, it is characterised in that if detecting by contrast described to be detected Comprising at least one of harmful content vocabulary person in the participle of language material, and the semantic frame of the language material to be detected belongs to During normal semantic frame, the language material that the language material to be detected is normal content is determined.
5. The bad corpus content detection device according to claim 4, characterized in that, if the semantic frame of the corpus text to be detected does not belong to the semantic frames of the corpus entries in the corpus, whether the corpus text to be detected is bad corpus content is determined from the result of comparing its word segmentation result with the harmful-content vocabulary.
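The decision rules of claims 4 and 5 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The names `Verdict` and `judge` are hypothetical, a semantic frame is assumed to be any hashable value (here a tuple of category labels), and the branch for a frame already classified as bad is an assumption, since claims 4 and 5 do not state that case explicitly.

```python
from enum import Enum


class Verdict(Enum):
    NORMAL = "normal"
    BAD = "bad"


def judge(tokens, frame, harmful_words, normal_frames, bad_frames):
    """Apply the rules of claims 4 and 5 to one segmented text.

    tokens: word segmentation result of the text to be detected.
    frame: its semantic frame (assumed hashable, e.g. a tuple of labels).
    harmful_words, normal_frames, bad_frames: the detection criteria.
    """
    has_harmful_word = any(tok in harmful_words for tok in tokens)

    if frame in normal_frames:
        # Claim 4: a normal frame yields normal content even when a
        # harmful word is present in the segmentation result.
        return Verdict.NORMAL
    if frame in bad_frames:
        # Assumption, not stated in claims 4-5: a known bad frame is bad content.
        return Verdict.BAD
    # Claim 5: the frame is unknown to the corpus, so only the
    # vocabulary comparison decides.
    return Verdict.BAD if has_harmful_word else Verdict.NORMAL
```

For example, with `harmful_words = {"kill"}` and the frame `("ACTION", "DET", "TECH")` registered as normal, the tokens `["kill", "the", "process"]` are judged normal, whereas the same harmful word under an unknown frame is judged bad.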
6. A bad corpus content detection method, characterized by comprising:
Performing word segmentation on corpus text to be detected, and determining the semantic frame of the corpus text to be detected;
Extracting the semantic frames of the corpus entries in a corpus, and extracting the harmful-content vocabulary obtained when word segmentation is performed on the corpus;
Comparing the word segmentation result of the corpus text to be detected with the harmful-content vocabulary, and comparing the semantic frame to be detected with all of the semantic frames, so as to determine whether the corpus text to be detected is bad corpus content.
7. The bad corpus content detection method according to claim 6, characterized in that the step of extracting the semantic frame to be detected of the corpus text to be detected comprises the steps of:
Performing word segmentation on the corpus text to be detected and on the corpus entries in the corpus;
Determining, from the word segmentation results, the semantic frames of the corpus text to be detected and of the corpus entries in the corpus, and determining the scene to which the corpus text to be detected belongs according to its context.
8. The bad corpus content detection method according to claim 7, characterized in that the step of determining the semantic frame categories and extracting the known harmful-content vocabulary comprises the steps of:
Obtaining the harmful-content vocabulary from the corpus;
Classifying all of the semantic frames, according to the semantic frames of the corpus entries in the corpus, into normal semantic frames and bad semantic frames, and grouping the normal semantic frames and bad semantic frames according to the scene to which each corpus entry belongs.
9. The bad corpus content detection method according to claim 8, characterized in that, if the comparison finds that the word segmentation result of the corpus text to be detected contains at least one word of the harmful-content vocabulary, and the semantic frame of the corpus text to be detected is a normal semantic frame, the corpus text to be detected is determined to be normal content.
10. The bad corpus content detection method according to claim 9, characterized in that, if the semantic frame of the corpus text to be detected does not belong to the semantic frames of the corpus entries in the corpus, whether the corpus text to be detected is bad corpus content is determined from the result of comparing its word segmentation result with the harmful-content vocabulary.
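The preparatory steps of the method claims (word segmentation, semantic frame determination, and building detection criteria grouped by scene) might look roughly like the Python sketch below. Everything in it is an illustrative assumption rather than the claimed implementation: whitespace splitting stands in for a real Chinese word segmenter, a frame is modeled as the tuple of word categories looked up in a hypothetical `CATEGORY` lexicon, corpus entries are assumed to carry a scene label and a bad/normal label, and the harmful-content vocabulary is assumed to be the corpus tokens that match a seed list.

```python
from collections import defaultdict

# Hypothetical word-category lexicon; a real system would rely on a
# segmenter and a tagger instead of a hand-written table.
CATEGORY = {"kill": "ACTION", "process": "TECH", "neighbor": "PERSON",
            "the": "DET", "my": "DET"}


def segment(text):
    """Stand-in for word segmentation: lowercase whitespace split."""
    return text.lower().split()


def frame_of(tokens):
    """Assumed frame representation: the tuple of word categories."""
    return tuple(CATEGORY.get(tok, "OTHER") for tok in tokens)


def build_criteria(corpus, seed_harmful_words):
    """Build the detection criteria from (text, scene, is_bad) entries.

    Returns the harmful-content vocabulary found in the corpus and the
    normal/bad semantic frames grouped by scene.
    """
    harmful_words = set()
    frames = defaultdict(lambda: {"normal": set(), "bad": set()})
    for text, scene, is_bad in corpus:
        tokens = segment(text)
        harmful_words.update(t for t in tokens if t in seed_harmful_words)
        frames[scene]["bad" if is_bad else "normal"].add(frame_of(tokens))
    return harmful_words, frames


def classify_frame(frame, scene, frames):
    """Compare a frame to be detected with the frames known for its scene."""
    group = frames.get(scene, {"normal": set(), "bad": set()})
    if frame in group["normal"]:
        return "normal"
    if frame in group["bad"]:
        return "bad"
    return "unknown"
```

A usage example under the same assumptions: after `harmful, frames = build_criteria([("kill the process", "tech", False), ("kill my neighbor", "chat", True)], {"kill"})`, the call `classify_frame(frame_of(segment("Kill the process")), "tech", frames)` reports a normal frame even though the text contains a harmful word, which is the situation claim 9 addresses.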
CN201680001769.2A 2016-06-29 2016-06-29 Device and method for detecting bad corpus data content Pending CN106716397A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/087758 WO2018000273A1 (en) 2016-06-29 2016-06-29 Device and method for detecting unacceptable corpus data content

Publications (1)

Publication Number Publication Date
CN106716397A CN106716397A (en) 2017-05-24

Family

ID=58906768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201680001769.2A Pending CN106716397A (en) 2016-06-29 2016-06-29 Device and method for detecting bad corpus data content

Country Status (2)

Country Link
CN (1) CN106716397A (en)
WO (1) WO2018000273A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362659A (en) * 2019-07-16 2019-10-22 北京洛必德科技有限公司 The abnormal statement filter method and system of the open corpus of robot

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102279875A (en) * 2011-06-24 2011-12-14 成都市华为赛门铁克科技有限公司 Method and device for identifying phishing website
CN102609516A (en) * 2012-02-08 2012-07-25 苏州中联互通信息科技有限公司 Content understanding-based bad information filter method
CN102929897A (en) * 2011-08-12 2013-02-13 北京千橡网景科技发展有限公司 Method and equipment for detecting bad information from text
CN105574090A (en) * 2015-12-10 2016-05-11 北京中科汇联科技股份有限公司 Sensitive word filtering method and system

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US8060513B2 (en) * 2008-07-01 2011-11-15 Dossierview Inc. Information processing with integrated semantic contexts
CN102693236A (en) * 2011-03-24 2012-09-26 苏州风采信息技术有限公司 Bad information filtering method based on content understanding

Non-Patent Citations (1)

Title
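曲泷玉: "基于框架匹配的网络文本分析" (Web text analysis based on frame matching), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *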

Also Published As

Publication number Publication date
WO2018000273A1 (en) 2018-01-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 301, Building 39, 239 Renmin Road, Gusu District, Suzhou City, Jiangsu Province, 215000

Applicant after: Suzhou Dogweed Intelligent Technology Co., Ltd.

Address before: 518000 Dongfang Science and Technology Building 1307-09, 16 Keyuan Road, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: Shenzhen green bristlegrass intelligence Science and Technology Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20170524