CN106716397A - Device and method for detecting bad corpus data content - Google Patents
Device and method for detecting bad corpus data content
- Publication number
- CN106716397A, CN201680001769.2A
- Authority
- CN
- China
- Prior art keywords
- language material
- detected
- semantic frame
- corpus
- bad
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/353—Clustering; Classification into predefined classes
- G06F16/9535—Search customisation based on user profiles and personalisation
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30—Semantic analysis
Abstract
The invention discloses a device and method for detecting bad corpus data content. The device comprises: a semantic frame determining module, which performs word segmentation on the corpus data to be detected and determines its semantic frame; a detection standard setting module, connected to a corpus and to the semantic frame determining module, which transmits the corpus data in the corpus to the semantic frame determining module in order to determine the semantic frames of that corpus data, and which extracts the bad content words obtained while the corpus is segmented; and a detection module, which compares the word segmentation result of the corpus data to be detected with the bad content words, compares the semantic frame to be detected with all of the semantic frames, and determines whether the corpus data to be detected is bad corpus data content. By comparing the semantic frame to be detected with known semantic frames, the invention can judge accurately whether the corpus data to be detected is bad content and can prevent missed detections.
Description
Technical field
The present invention relates to the field of text processing, and more particularly to a device and method for detecting bad corpus data content.
Background
With the development of the Internet, the demand for network retrieval keeps growing, so more keywords and corpus data need to be stocked and stored in a cloud corpus for Internet users to draw on when searching. To keep the network environment healthy, the words or corpus data entered by network users generally need to be checked for bad content, and words or corpus data containing bad content need to be blocked.
In the prior art, bad corpus data is generally detected with statistical methods, which mainly judge whether content is bad on the basis of a dictionary of bad information. The shortcoming of the prior art is its low accuracy: it cannot accurately and comprehensively detect all of the bad content in the content to be detected, which easily leads to missed detections.
Summary of the invention
The main technical problem solved by the present invention is to provide a device and method for detecting bad corpus data content that can, by comparison with the known kinds of semantic frame, determine whether a semantic frame to be detected belongs to bad content, so that whether the corpus data to be detected is bad content can be judged accurately and missed detections are prevented.
To solve the above technical problem, one technical solution adopted by the present invention is to provide a device for detecting bad corpus data content, the device comprising: a semantic frame determining module, configured to perform word segmentation on the corpus data to be detected and determine its semantic frame; a detection standard setting module, connected to a corpus and to the semantic frame determining module, configured to transmit the corpus data in the corpus to the semantic frame determining module so as to extract the semantic frames of the corpus data in the corpus, and at the same time to extract the bad content words obtained when the corpus is segmented; and a detection module, configured to compare the word segmentation result of the corpus data to be detected with the bad content words, compare the semantic frame to be detected with all of the semantic frames, and determine whether the corpus data to be detected is bad corpus data content.
To solve the above technical problem, another technical solution adopted by the present invention is to provide a method for detecting bad corpus data content, the method comprising the steps of: performing word segmentation on the corpus data to be detected and determining its semantic frame; extracting the semantic frames of the corpus data in the corpus, and at the same time extracting the bad content words obtained when the corpus is segmented; and comparing the word segmentation result of the corpus data to be detected with the bad content words, comparing the semantic frame to be detected with all of the semantic frames, and determining whether the corpus data to be detected is bad corpus data content.
In contrast to the prior art, the device of the present invention for detecting bad corpus data content performs word segmentation on the corpus data to be detected, determines its semantic frame from the semantics of each segmented word, and determines whether it is bad corpus data content by comparison with known semantic frames. With the present invention, comparison with the known kinds of semantic frame distinguishes whether the semantic frame to be detected belongs to bad content, so that whether the corpus data to be detected is bad content can be judged accurately and missed detections are prevented.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of an embodiment of the device for detecting bad corpus data content provided by the present invention;
Fig. 2 is a schematic flowchart of an embodiment of the method for detecting bad corpus data content provided by the present invention.
Detailed description
The technical solution of the present invention is described in further detail below with reference to specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the scope of protection of the present invention.
The construction of corpora is an important foundation of statistical learning methods, and in recent years the great value of corpus resources for natural language research has been increasingly recognized. Bilingual corpora in particular have become an indispensable resource for research on machine translation, machine-aided translation and translation knowledge acquisition. On the one hand, the emergence of bilingual corpora has directly driven the development of new machine translation technology: parallel corpora provide the essential training data for building statistical machine translation models, and corpus-based translation methods, such as statistics-based and example-based methods, have brought new ideas to machine translation research, effectively improved translation quality and set off a new wave of activity in the field. On the other hand, bilingual corpora are also an important source of translation knowledge, from which various kinds of fine-grained translation knowledge, such as translation dictionaries and translation templates, can be mined and learned in order to improve traditional machine translation technology. In addition, bilingual corpora are an important basic resource for cross-language information retrieval, translation dictionary compilation, automatic extraction of bilingual terminology, multilingual contrastive studies and so on. In today's networks, creating a healthy network environment requires checking the content of both the corpus data already on the network and the corpus data that network users enter in real time. The continual growth of corpus content makes detecting that content difficult.
Referring to Fig. 1, Fig. 1 is a structural schematic diagram of an embodiment of the device for detecting bad corpus data content provided by the present invention. The device 100 comprises a semantic frame determining module 110, a detection standard setting module 120 and a detection module 130, wherein the detection standard setting module 120 is connected to the semantic frame determining module 110 and to a corpus 101.
The corpus 101 is a large-scale electronic text library that has been scientifically sampled and processed. With the help of computer analysis tools, researchers can use it to carry out related work in language theory and application. Corpora come in many types; the type is determined mainly by the research purpose and use of the corpus, which is usually reflected in the principles and manner in which its data is collected. Corpora are generally divided into four types: (1) heterogeneous: there is no specific selection principle, and all kinds of corpus data are collected widely and stored as they are; (2) homogeneous: only corpus data of the same kind of content is collected; (3) systematic: corpus data is collected according to predetermined principles and proportions, so that it is balanced and systematic and can represent the linguistic facts within a certain scope; (4) specialized: only corpus data for a particular purpose is collected.
The semantic frame determining module 110 performs word segmentation on the corpus data to be detected and extracts its semantic frame to be detected. The semantic frame determining module 110 comprises a word segmentation unit 111 and a semantic frame determining unit 112. The semantic frame determining unit 112 determines the semantic frames of the corpus data to be detected and of the corpus data in the corpus from the word segmentation results produced by the word segmentation unit 111, and determines the scene to which the corpus data to be detected belongs from its context. When a user enters corpus data, that data is checked: the word segmentation unit 111 first segments the corpus data to be detected, which can be done with an existing word segmentation tool. Segmentation yields semantically independent words. In this embodiment the semantic frames of the existing corpus data in the corpus 101 need to be determined, so the word segmentation unit 111 should first segment the existing corpus data in the corpus 101. After segmentation, the semantics of all the segmented words are identified, the words with bad semantics are screened out, and all of the bad-semantics words are collected and stored.
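The segmentation and screening performed by the word segmentation unit can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: `segment` is a whitespace splitter standing in for whatever existing segmentation tool is used, and `SEMANTIC_TYPES`, `BAD_TYPES` and `collect_bad_words` are hypothetical names filled with toy data.

```python
# Hypothetical toy lexicon: word -> semantic type (not taken from the patent).
SEMANTIC_TYPES = {
    "kill": "violence",
    "stupid": "abuse",
    "weather": "topic",
    "nice": "evaluation",
}
# Semantic types treated as bad in this sketch (an assumption).
BAD_TYPES = {"violence", "abuse"}


def segment(text):
    """Stand-in for an existing segmentation tool: lowercase, split on whitespace."""
    return text.lower().split()


def collect_bad_words(corpus):
    """Segment every corpus entry and gather the words whose semantic type is bad."""
    bad_words = set()
    for text in corpus:
        for word in segment(text):
            if SEMANTIC_TYPES.get(word) in BAD_TYPES:
                bad_words.add(word)
    return bad_words


corpus = ["Nice weather today", "I will kill you", "You are stupid"]
print(sorted(collect_bad_words(corpus)))  # ['kill', 'stupid']
```

For Chinese text the whitespace splitter would have to be replaced by a real segmenter; the screening loop itself would stay the same.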
The semantic frame determining unit 112 determines the semantic frame of the corpus data to be detected from the word segmentation unit 111's segmentation result for that data, in combination with the semantic type of each segmented word. Likewise, after the known corpus data in the existing corpus 101 has been segmented by the word segmentation unit 111, the semantic frame determining unit 112 determines the semantic frame of that corpus data from the semantics of the known corpus data, and determines the scene to which it belongs from the context of the corpus data to be detected. The semantic frames of the corpus data are collected and grouped by scene, and within each group the semantic frames of normal corpus data are distinguished from those of bad corpus data. The semantic frames of all kinds are then stored.
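One possible way to hold the stored frames is sketched below. The patent does not say how a semantic frame is represented, so the tuple of per-word semantic types used here is purely an assumption, as are the names `SEMANTIC_TYPES`, `semantic_frame` and `build_frame_store`. Scene and normal/bad labels are supplied by hand rather than derived from context.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> coarse semantic type (not from the patent).
SEMANTIC_TYPES = {
    "i": "agent", "we": "agent",
    "beat": "action", "kill": "action",
    "boss": "target", "you": "target",
}


def semantic_frame(words):
    """Assumed representation: the sequence of semantic types of the segmented words."""
    return tuple(SEMANTIC_TYPES.get(w, "other") for w in words)


def build_frame_store(entries):
    """Group frames by scene; within each scene keep normal and bad frames apart.

    entries: iterable of (segmented_words, scene, label), label "normal" or "bad".
    """
    store = defaultdict(lambda: {"normal": set(), "bad": set()})
    for words, scene, label in entries:
        store[scene][label].add(semantic_frame(words))
    return store


store = build_frame_store([
    (["we", "beat", "the", "boss"], "gaming", "normal"),
    (["i", "kill", "you"], "chat", "bad"),
])
print(store["gaming"]["normal"])  # {('agent', 'action', 'other', 'target')}
```

Keying the store by scene first means a later lookup only ever compares a frame against frames from the same scene, which is what the description asks for.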
The detection standard setting module 120 is connected to the corpus 101 and to the semantic frame determining module 110. It transmits the corpus data in the corpus to the semantic frame determining module 110 so as to extract the semantic frames of the corpus data in the corpus and determine the kinds of semantic frame, extracts the known bad content words, and stores all of the semantic frames together with the known bad content words.
The detection standard setting module 120 comprises a bad content vocabulary acquisition unit 121 and a semantic frame classification unit 122. The bad content vocabulary acquisition unit 121 is connected to the corpus 101 and obtains the known bad content words from the corpus 101. In this embodiment, after the word segmentation unit 111 has segmented the corpus data in the existing corpus 101, the segmentation results are examined, and the bad content words among them are screened out, collected and stored. The bad content vocabulary acquisition unit 121, being connected to the corpus 101, retrieves the bad content words that were screened out and collected from the corpus 101. In other embodiments, a lexicon of bad content words is already stored in the cloud, and the bad content vocabulary acquisition unit 121 can connect directly to the cloud and extract the known bad content words from that lexicon. The semantic frame classification unit 122 classifies all of the semantic frames into normal semantic frames and bad semantic frames according to the semantic frames of the corpus data in the corpus 101. Words of reactionary, violent, obscene, politically sensitive and similar types are bad content words. The semantic frame of corpus data that contains such words, or that contains no such words but is found by analysis of its semantic type to be of an attacking or abusive type, can be classified as a bad semantic frame; the semantic frames of all other corpus data are normal semantic frames. The normal and bad semantic frames are then grouped according to the scene to which each item of corpus data belongs.
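The normal/bad labelling rule described for the semantic frame classification unit can be sketched as follows. It assumes that the bad content words and an overall semantic type for each corpus entry are already available; how that overall type is obtained is not specified in the description. `ABUSIVE_TYPES` and `classify_entry` are hypothetical names.

```python
# Entry-level semantic types that make a frame bad even without a bad word.
# The two type names follow the description; the identifiers are our own.
ABUSIVE_TYPES = {"attack", "abuse"}


def classify_entry(words, entry_type, bad_words):
    """Label one corpus entry's semantic frame as "bad" or "normal".

    words:      segmented words of the entry
    entry_type: overall semantic type of the entry (assumed to be given)
    bad_words:  known bad content words
    """
    if any(w in bad_words for w in words):
        return "bad"      # contains a bad content word
    if entry_type in ABUSIVE_TYPES:
        return "bad"      # no bad word, but attacking or abusive overall
    return "normal"


bad_words = {"kill"}
print(classify_entry(["i", "kill", "you"], "threat", bad_words))        # bad
print(classify_entry(["you", "are", "a", "pig"], "abuse", bad_words))   # bad
print(classify_entry(["nice", "weather"], "statement", bad_words))      # normal
```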
The detection module 130 compares the word segmentation result of the corpus data to be detected with the known bad content words, compares the semantic frame to be detected with all of the semantic frames, and determines whether the corpus data to be detected is bad corpus data content. After the word segmentation unit 111 has segmented the corpus data to be detected, the segmented words are compared with the bad content words obtained by the bad content vocabulary acquisition unit 121 to detect whether any bad content word is present; if so, the data is regarded as bad corpus data content. If not, the semantic frame determining unit 112 determines the semantic frame of the corpus data to be detected, that frame is compared with the semantic frames of the corpus data in the existing corpus 101, and it is analysed whether the frame belongs to the normal semantic frames or to the bad semantic frames, thereby detecting whether the corpus data to be detected is bad corpus data content.
In the comparison and judgment process, if the comparison detects that the segmented words of the corpus data to be detected contain at least one of the bad content words, and the semantic frame of the corpus data to be detected belongs to the normal semantic frames, the corpus data to be detected is determined to be normal content. If the semantic frame of the corpus data to be detected does not belong to any of the semantic frames of the corpus data in the corpus, whether the corpus data to be detected is bad is determined from the result of comparing its segmented words with the bad content words. That is, even if the detection module 130 finds that the segmented words of the corpus data to be detected contain at least one of the bad content words, the content is still regarded as normal corpus data content when a comparison of its semantic frame with all of the semantic frames under the same scene in the corpus 101 shows that the frame belongs to the semantic frames of existing normal corpus data under that scene. If, by comparing the semantic frame of the corpus data to be detected with the semantic frames of the corpus data under the same scene in the existing corpus, the detection module 130 finds that the frame does not belong to the semantic frames under that scene in the corpus, then whether the corpus data to be detected is bad content is determined from the result of comparing its word segmentation result with the bad content words: if it contains a bad content word, it is corpus data with bad content.
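The judgment rules of the detection module can be condensed into a small decision function. The sketch below follows the rule stated in the paragraph above, where a match with a known normal frame under the same scene outweighs a vocabulary hit. Treating a match with a known bad frame as bad is our reading of the description, and the function name `detect`, the frame tuples and the store layout are all assumptions.

```python
def detect(words, frame, scene, bad_words, frame_store):
    """Return "bad" or "normal" for one piece of corpus data to be detected.

    frame_store maps scene -> {"normal": set of frames, "bad": set of frames}.
    """
    has_bad_word = any(w in bad_words for w in words)
    known = frame_store.get(scene, {"normal": set(), "bad": set()})
    if frame in known["bad"]:
        return "bad"
    if frame in known["normal"]:
        # A known normal frame under the same scene outweighs a vocabulary hit.
        return "normal"
    # Frame unknown under this scene: fall back on the vocabulary comparison.
    return "bad" if has_bad_word else "normal"


bad_words = {"kill"}
frame_store = {
    "gaming": {"normal": {("agent", "action", "other", "target")}, "bad": set()},
}
boss = ["i", "kill", "the", "boss"]
print(detect(boss, ("agent", "action", "other", "target"), "gaming",
             bad_words, frame_store))  # normal
print(detect(["i", "kill", "you"], ("agent", "action", "target"), "chat",
             bad_words, frame_store))  # bad
```

The first call shows the case the description singles out: the text contains a bad content word, yet it is accepted because its frame is a known normal frame in that scene.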
In contrast to the prior art, the device of the present invention for detecting bad corpus data content performs word segmentation on the corpus data to be detected, determines its semantic frame from the semantics of each segmented word, and determines whether it is bad corpus data content by comparison with known semantic frames. With the present invention, comparison with the known kinds of semantic frame distinguishes whether the semantic frame to be detected belongs to bad content, so that whether the corpus data to be detected is bad content can be judged accurately and missed detections are prevented.
Referring to Fig. 2, Fig. 2 is a schematic flowchart of an embodiment of the method for detecting bad corpus data content provided by the present invention. The steps of the method include:
S210: Perform word segmentation on the corpus data to be detected and determine the semantic frame of the corpus data to be detected.
Word segmentation is performed on the corpus data to be detected, and its semantic frame to be detected is extracted. The semantic frames of the corpus data to be detected and of the corpus data in the corpus are determined from the word segmentation results, and the scene to which the corpus data to be detected belongs is determined from its context. When a user enters corpus data, that data is checked: it is first segmented, which can be done with an existing word segmentation tool. Segmentation yields semantically independent words. In this embodiment the semantic frames of the existing corpus data in the corpus need to be determined, so the existing corpus data in the corpus should be segmented first. After segmentation, the semantics of all the segmented words are identified, the words with bad semantics are screened out, and all of the bad-semantics words are collected and stored.
The semantic frame of the corpus data to be detected is determined from its word segmentation result, in combination with the semantic type of each segmented word. Likewise, after the known corpus data in the existing corpus has been segmented, the semantic frame of that corpus data is determined from the semantics of the known corpus data, and the scene to which it belongs is determined from the context of the corpus data to be detected. The semantic frames of the corpus data are collected and grouped by scene, and within each group the semantic frames of normal corpus data are distinguished from those of bad corpus data. The semantic frames of all kinds are then stored.
S220: Extract the semantic frames of the corpus data in the corpus, and at the same time extract the bad content words obtained when the corpus is segmented.
The semantic frames of the corpus data in the corpus are extracted, the kinds of semantic frame are determined, the known bad content words are extracted, and all of the semantic frames and the known bad content words are stored.
The known bad content words are obtained from the corpus. In this embodiment, after the corpus data in the existing corpus has been segmented, the segmentation results are examined, and the bad content words among them are screened out, collected and stored. The bad content words screened out and collected from the corpus are then retrieved. In other embodiments, a lexicon of bad content words is already stored in the cloud, so it is possible to connect directly to the cloud and extract the known bad content words from that lexicon. All of the semantic frames are classified into normal semantic frames and bad semantic frames according to the semantic frames of the corpus data in the corpus. Words of reactionary, violent, obscene, politically sensitive and similar types are bad content words. The semantic frame of corpus data that contains such words, or that contains no such words but is found by analysis of its semantic type to be of an attacking or abusive type, can be classified as a bad semantic frame; the semantic frames of all other corpus data are normal semantic frames. The normal and bad semantic frames are then grouped according to the scene to which each item of corpus data belongs.
S230: Compare the word segmentation result of the corpus data to be detected with the bad content words, compare the semantic frame to be detected with all of the semantic frames, and determine whether the corpus data to be detected is bad corpus data content.
The word segmentation result of the corpus data to be detected is compared with the known bad content words, the semantic frame to be detected is compared with all of the semantic frames, and it is determined whether the corpus data to be detected is bad corpus data content. After the corpus data to be detected has been segmented, its segmented words are compared with the bad content words obtained to detect whether any bad content word is present; if so, the data is regarded as bad corpus data content. If not, the semantic frame of the corpus data to be detected is determined and compared with the semantic frames of the corpus data in the existing corpus, and it is analysed whether the frame belongs to the normal semantic frames or to the bad semantic frames, thereby detecting whether the corpus data to be detected is bad corpus data content.
In the comparison and judgment process, if the comparison detects that the segmented words of the corpus data to be detected contain at least one of the bad content words, and the semantic frame of the corpus data to be detected belongs to the normal semantic frames, the corpus data to be detected is determined to be normal content. If the semantic frame of the corpus data to be detected does not belong to any of the semantic frames of the corpus data in the corpus, whether the corpus data to be detected is bad is determined from the result of comparing its segmented words with the bad content words. That is, even if the comparison finds that the segmented words of the corpus data to be detected contain at least one of the bad content words, the content is still regarded as normal corpus data content when a comparison of its semantic frame with all of the semantic frames under the same scene in the corpus shows that the frame belongs to the semantic frames of existing normal corpus data under that scene. If a comparison of the semantic frame of the corpus data to be detected with the semantic frames of the corpus data under the same scene in the existing corpus shows that the frame does not belong to the semantic frames under that scene in the corpus, then whether the corpus data to be detected is bad content is determined from the result of comparing its word segmentation result with the bad content words: if it contains a bad content word, it is corpus data with bad content.
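Steps S210 to S230 can be chained into one small pipeline, sketched below. Everything concrete in it is an assumption made for illustration: the toy `ROLES` lexicon, the seed set of bad words, whitespace segmentation, the tuple representation of a frame, and the hand-labelled corpus entries with their scenes.

```python
# Hypothetical toy data; none of it comes from the patent.
ROLES = {"i": "agent", "we": "agent", "beat": "action", "kill": "action",
         "boss": "target", "you": "target"}
BAD_WORD_SEED = {"kill"}


def s210(text):
    """S210: segment the text and derive its semantic frame (tuple of word types)."""
    words = text.lower().split()
    return words, tuple(ROLES.get(w, "other") for w in words)


def s220(corpus):
    """S220: collect per-scene normal/bad frames and the bad content words."""
    frames, bad_words = {}, set()
    for text, scene, label in corpus:
        words, frame = s210(text)
        frames.setdefault(scene, {"normal": set(), "bad": set()})[label].add(frame)
        bad_words.update(w for w in words if w in BAD_WORD_SEED)
    return frames, bad_words


def s230(text, scene, frames, bad_words):
    """S230: compare the frame under the same scene, then fall back on vocabulary."""
    words, frame = s210(text)
    known = frames.get(scene, {"normal": set(), "bad": set()})
    if frame in known["bad"]:
        return "bad"
    if frame in known["normal"]:
        return "normal"
    return "bad" if any(w in bad_words for w in words) else "normal"


corpus = [("We beat the boss", "gaming", "normal"), ("I kill you", "chat", "bad")]
frames, bad_words = s220(corpus)
print(s230("I kill the boss", "gaming", frames, bad_words))  # normal
print(s230("I kill you", "chat", frames, bad_words))         # bad
print(s230("We kill", "forum", frames, bad_words))           # bad
```

The third call has no stored frames for its scene, so the result comes from the vocabulary comparison alone, which is the fallback the description prescribes.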
In contrast to the prior art, the method of the present invention for detecting bad corpus data content performs word segmentation on the corpus data to be detected, determines its semantic frame from the semantics of each segmented word, and determines whether it is bad corpus data content by comparison with known semantic frames. With the present invention, comparison with the known kinds of semantic frame distinguishes whether the semantic frame to be detected belongs to bad content, so that whether the corpus data to be detected is bad content can be judged accurately and missed detections are prevented.
The foregoing is merely embodiments of the present invention and does not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
1. A device for detecting bad corpus data content, characterized by comprising:
a semantic frame determining module, configured to perform word segmentation on corpus data to be detected and determine the semantic frame of the corpus data to be detected;
a detection standard setting module, connected to a corpus and to the semantic frame determining module, configured to transmit the corpus data in the corpus to the semantic frame determining module so as to determine the semantic frames of the corpus data in the corpus, and at the same time to extract the bad content words obtained when the corpus is segmented;
a detection module, configured to compare the word segmentation result of the corpus data to be detected with the bad content words, compare the semantic frame to be detected with all of the semantic frames, and determine whether the corpus data to be detected is bad corpus data content.
2. The device for detecting bad corpus data content according to claim 1, characterized in that the semantic frame determining module comprises:
a word segmentation unit, configured to perform word segmentation on the corpus data to be detected and on the corpus data in the corpus;
a semantic frame determining unit, configured to determine the semantic frames of the corpus data to be detected and of the corpus data in the corpus from the word segmentation results produced by the word segmentation unit, and to determine the scene to which the corpus data to be detected belongs from its context.
3. The device for detecting bad corpus data content according to claim 2, characterized in that the detection standard setting module comprises:
a bad content vocabulary acquisition unit, connected to the corpus and configured to obtain the bad content words from the corpus;
a semantic frame classification unit, configured to classify all of the semantic frames into normal semantic frames and bad semantic frames according to the semantic frames of the corpus data in the corpus, and to group the normal semantic frames and bad semantic frames according to the scene to which each item of corpus data belongs.
4. The device for detecting bad corpus data content according to claim 3, characterized in that, if comparison detects that the segmented words of the corpus data to be detected contain at least one of the bad content words and the semantic frame of the corpus data to be detected belongs to the normal semantic frames, the corpus data to be detected is determined to be corpus data with normal content.
5. The device for detecting bad corpus data content according to claim 4, characterized in that, if the semantic frame of the corpus data to be detected does not belong to the semantic frames of the corpus data in the corpus, whether the corpus data to be detected is bad corpus data is determined from the result of comparing the segmented words with the bad content words.
6. A method for detecting bad corpus data content, characterized by comprising:
performing word segmentation on corpus data to be detected and determining the semantic frame of the corpus data to be detected;
extracting the semantic frames of the corpus data in a corpus, and at the same time extracting the bad content words obtained when the corpus is segmented;
comparing the word segmentation result of the corpus data to be detected with the bad content words, comparing the semantic frame to be detected with all of the semantic frames, and determining whether the corpus data to be detected is bad corpus data content.
7. The bad language material content detection algorithm according to claim 6, characterized in that the step of extracting the semantic frame to be detected of the language material to be detected comprises the steps of:
performing word segmentation on the language material to be detected and on the language material in the corpus;
determining, according to the word segmentation result, the semantic frames of the language material to be detected and of the language material in the corpus, and determining the scene to which the language material to be detected belongs according to its context.
8. The bad language material content detection algorithm according to claim 7, characterized in that the step of determining the semantic frame categories and extracting the known harmful content vocabulary comprises the steps of:
obtaining the harmful content vocabulary from the corpus;
classifying all of the semantic frames into normal semantic frames and bad semantic frames according to the semantic frames of the language material in the corpus, and grouping the normal semantic frames and bad semantic frames according to the scene to which each language material belongs.
9. The bad language material content detection algorithm according to claim 8, characterized in that, if the comparison detects that the word segments of the language material to be detected contain at least one item of the harmful content vocabulary, and the semantic frame of the language material to be detected belongs to the normal semantic frames, the language material to be detected is determined to be language material with normal content.
10. The bad language material content detection algorithm according to claim 9, characterized in that, if the semantic frame of the language material to be detected does not belong to the semantic frames of the language material in the corpus, whether the language material to be detected is bad language material is determined according to the result of comparing the word segments with the harmful content vocabulary.
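The criteria-building step of claim 8 (collecting the harmful content vocabulary and grouping normal and bad semantic frames by scene) might be sketched as follows. This is an illustration under stated assumptions, not the claimed method itself: the `CorpusEntry` type, its fields, and `build_criteria` are hypothetical names, and the sketch assumes each corpus entry already carries its segmented tokens, a frame string, a scene label, a bad/normal label, and its annotated harmful words. How frames and labels are actually derived is not specified in the claims.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusEntry:
    """One pre-annotated corpus item (all fields are assumed inputs)."""
    tokens: tuple
    frame: str
    scene: str
    is_bad: bool
    harmful_words: frozenset = frozenset()


def build_criteria(corpus):
    """Return (harmful_vocab, normal_frames_by_scene, bad_frames_by_scene)."""
    harmful_vocab = set()
    normal = defaultdict(set)
    bad = defaultdict(set)
    for entry in corpus:
        # Collect the harmful content vocabulary found in the corpus.
        harmful_vocab |= entry.harmful_words
        # Classify the entry's frame, then group it under its scene.
        target = bad if entry.is_bad else normal
        target[entry.scene].add(entry.frame)
    return harmful_vocab, dict(normal), dict(bad)
```

Grouping by scene keeps the later frame lookup narrow: only frames recorded for the scene of the text under test need to be compared.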
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2016/087758 WO2018000273A1 (en) | 2016-06-29 | 2016-06-29 | Device and method for detecting unacceptable corpus data content |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106716397A (en) | 2017-05-24 |
Family
ID=58906768
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201680001769.2A Pending CN106716397A (en) | 2016-06-29 | 2016-06-29 | Device and method for detecting bad corpus data content |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106716397A (en) |
WO (1) | WO2018000273A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362659A (en) * | 2019-07-16 | 2019-10-22 | 北京洛必德科技有限公司 | Abnormal sentence filtering method and system for a robot's open corpus |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102279875A (en) * | 2011-06-24 | 2011-12-14 | 成都市华为赛门铁克科技有限公司 | Method and device for identifying phishing website |
CN102609516A (en) * | 2012-02-08 | 2012-07-25 | 苏州中联互通信息科技有限公司 | Content understanding-based bad information filter method |
CN102929897A (en) * | 2011-08-12 | 2013-02-13 | 北京千橡网景科技发展有限公司 | Method and equipment for detecting bad information from text |
CN105574090A (en) * | 2015-12-10 | 2016-05-11 | 北京中科汇联科技股份有限公司 | Sensitive word filtering method and system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8060513B2 (en) * | 2008-07-01 | 2011-11-15 | Dossierview Inc. | Information processing with integrated semantic contexts |
CN102693236A (en) * | 2011-03-24 | 2012-09-26 | 苏州风采信息技术有限公司 | Bad information filtering method based on content understanding |
2016
- 2016-06-29: CN application CN201680001769.2A, published as CN106716397A, status: active, pending
- 2016-06-29: WO application PCT/CN2016/087758, published as WO2018000273A1, status: active, application filing
Non-Patent Citations (1)
Title |
---|
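曲泷玉: "基于框架匹配的网络文本分析" ("Network text analysis based on frame matching"), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology series) * |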
Also Published As
Publication number | Publication date |
---|---|
WO2018000273A1 (en) | 2018-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP3856778B2 (en) | Document classification apparatus and document classification method for multiple languages | |
CN112926405A (en) | Method, system, equipment and storage medium for detecting wearing of safety helmet | |
CN102779135B (en) | Method and device for obtaining cross-linguistic search resources and corresponding search method and device | |
CN107909088B (en) | Method, apparatus, device and computer storage medium for obtaining training samples | |
CN106959998B (en) | Test question recommendation method and device | |
CN105095091B (en) | Software defect code file localization method based on inverted index technique | |
CN106570109A (en) | Method for automatically generating knowledge points of question bank through text analysis | |
CN107832290B (en) | Method and device for identifying Chinese semantic relation | |
US11551151B2 (en) | Automatically generating a pipeline of a new machine learning project from pipelines of existing machine learning projects stored in a corpus | |
CN110119441A (en) | Method for recognizing and filling text-click CAPTCHAs based on Chinese character structure | |
CN106227836B (en) | Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters | |
EP3968244A1 (en) | Automatically curating existing machine learning projects into a corpus adaptable for use in new machine learning projects | |
CN106909600A (en) | Method and device for collecting user context information | |
CN105389303B (en) | Automatic fusion method for heterogeneous corpora | |
CN110968664A (en) | Document retrieval method, device, equipment and medium | |
CN108511064A (en) | System for automatically analyzing health data based on deep learning | |
CN104331361B (en) | Test device and method for visualizing white-box test coverage calculation | |
CN107977454A (en) | Method, apparatus and computer-readable recording medium for bilingual corpus cleaning | |
CN109522413B (en) | Construction method and device of medical term library for guided medical examination | |
CN112836067B (en) | Intelligent searching method based on knowledge graph | |
CN107451433A (en) | Information source identification method and apparatus based on text content | |
CN106716397A (en) | Device and method for detecting bad corpus data content | |
CN103034657B (en) | Document summary generation method and apparatus | |
CN103019924B (en) | Intelligent evaluation system and method for input methods | |
Pienaar et al. | Spelling checker-based language identification for the eleven official South African languages |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: Room 301, Building 39, 239 Renmin Road, Gusu District, Suzhou City, Jiangsu Province, 215000. Applicant after: Suzhou Dogweed Intelligent Technology Co., Ltd. Address before: 518000 Dongfang Science and Technology Building 1307-09, 16 Keyuan Road, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province. Applicant before: Shenzhen green bristlegrass intelligence Science and Technology Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170524 |