CN109446288A - A Spark-based detection algorithm for classified internet maps
- Publication number
- CN109446288A; CN201811216505.6A
- Authority
- CN
- China
- Prior art keywords
- map
- text
- sensitive
- feature words
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a Spark-based algorithm for detecting classified maps on the internet, and belongs to the fields of big data applications and natural language processing. The method first preprocesses the data: it performs Chinese word segmentation on the map text data and extracts the feature words of the map file. It then extracts text features, chiefly four: the similarity between a feature word and the sensitive vocabulary, the weight of the feature word in the text, the position attribute of the feature word in the POI text, and the weight that the corresponding sensitive word carries in the sensitive lexicon. Finally, the sensitivity of the map file is computed statistically from these features. Combined with a crawler for internet map file data, the invention enables automatic detection of classified internet maps, speeds up the detection of problem maps on the internet, and reduces the burden of manual inspection.
Description
Technical field
The invention belongs to the fields of big data applications and natural language processing. It mainly concerns methods for detecting sensitive information in internet maps, and is a study of a Spark-based algorithm for detecting classified internet maps.
Background art
With the rapid development of the internet and information technology, map services have become an indispensable part of daily life. The large number of electronic maps now available online makes life more convenient, but it has also created problems that cannot be ignored, among them the security of national geographic information. In September 2015, the National Administration of Surveying, Mapping and Geoinformation inspected more than 13,000 items of geographic information on large domestic and foreign commercial websites, forums and microblogs, and found 275 "map" services, 321 "problem map" pictures and 2,336 non-compliant point-of-interest (POI) annotations. The main carrier of classified content in a map is POI information annotated in violation of regulations. A POI includes information such as a name and a longitude and latitude; it may represent a house or a shop, but it may equally represent a military base or a restricted military zone. If classified POI information is annotated on an electronic map and published on the internet, it seriously harms national interests and endangers national security.
For the detection of classified internet maps, one existing study combines the amounts of sensitive position information, sensitive symbol information, sensitive geometric information, sensitive topological information, sensitive annotation information and sensitive attribute information in a vector digital map into an overall amount of sensitive information, and uses it to assess the sensitivity level of the digital map. Another study measures the sensitivity of a map file by computing the sensitivity of the POIs in the map, and proposes solutions from two angles, administration and software identification technology. Research in China on classified map detection is still scarce, but there is a good deal of related work on sensitive word detection. One approach builds a sensitive lexicon, applies Chinese word segmentation, and uses string matching to detect sensitive words in e-mail and so determine the sensitivity level of a message. Another extracts the feature words of topic text, labels the feature items of the document under test for sensitivity with a conditional random field detection model built on a sensitive lexicon, and proposes a kernel-based model for detecting sensitive information.
With the map-oriented methods above, the amount of information for each map attribute is difficult to extract, so detecting map sensitivity is hard. The e-mail and document methods, for their part, rely mainly on direct matching between feature words and sensitive words, which fails when the sensitive lexicon is incomplete or when a near-synonym of a sensitive word is used. We therefore propose computing the similarity between feature words and sensitive words in order to determine the sensitivity of a feature word. Because our detection target is the short text of POIs in a map, we additionally extract three features suited to place-name POIs: the weight of the feature word in the text, the position attribute of the feature word in the POI text, and the weight of the corresponding sensitive word in the sensitive lexicon. The four features together are used to compute the sensitivity of a map POI. Internet map files also usually come with some attached information, which serves as a further attribute for classified map detection.
With the rapid development of internet technology, data in every industry is growing explosively, and map data on the network is growing just as fast; the traditional single-machine processing model can no longer keep up. Distributed processing has greatly advanced the analysis and processing of big data, and Spark and Hadoop are currently popular distributed parallel computing frameworks. We use Spark as the big data processing framework for classified map detection because it keeps the advantages of MapReduce in Hadoop while computing in memory, and it offers an optimized scheduling mechanism and a richer set of operators. Much related research has also been built on Spark.
We consider both the POIs annotated in a map and the attached information of the map file, extract the sensitive information from each text separately, and finally combine the results into the sensitivity level of the map file. This document proposes a Spark-based algorithm for detecting classified internet maps that improves both the accuracy and the running time of map detection.
Summary of the invention
Technical problem solved by the invention: based on the characteristics of the attached information and the annotated-place information of internet electronic maps, the invention proposes an algorithm model for detecting classified internet maps. The sensitivity level of a map is measured from the sensitivity of its attached information and of the places annotated in it. Because the internet holds a large amount of map data, the detection algorithm is implemented on the Spark processing framework to improve performance; map data is processed in parallel, yielding an efficient and accurate model for detecting classified maps.
Technical solution of the invention: the processing of a map file is divided into three parts: data preprocessing, text feature extraction, and map sensitivity calculation. Data preprocessing parses the map files of different formats obtained from the internet, yielding the place POIs in each map file and the attached information that describes the map on the internet. Text feature extraction is applied to the attached information and to the POI text in the map file, and extracts four features. (1) The similarity between a feature word and the sensitive vocabulary. Computing lexical similarity addresses the cases in which the sensitive lexicon is incomplete or a synonym would otherwise go unrecognized. (2) The weight of the feature word in the text. Words carry different weights in a text; the keywords carry the most weight and best represent its meaning, so we use the weight of a feature word in the text to gauge how strongly its sensitivity characterizes the text. (3) The position attribute of the feature word in the POI text. Within a POI place name, the position of a feature word also influences whether the place is sensitive. (4) The weight of the corresponding sensitive word in the sensitive lexicon. Different sensitive places have different sensitivities; a military base, for example, is more sensitive than infrastructure, so the weight of the matched sensitive word is also extracted as a feature. From these four features the sensitivity value of each map file is computed statistically, and the algorithm outputs the map files of the data set under test in descending order of sensitivity. Because there is a large amount of map data on the internet, we run the algorithm on the Spark parallel processing framework to improve its processing performance.
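The final ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: `score_map` is a hypothetical stand-in for the sensitivity calculation, the sample records are invented, and `rank_on_spark` assumes that PySpark is installed.

```python
from typing import Dict, List, Tuple


def score_map(record: Dict) -> Tuple[str, float]:
    """Hypothetical stand-in for the map sensitivity calculation.

    The score here is simply the largest per-POI score stored in the record;
    the described algorithm derives it from the four text features instead.
    """
    scores = record["poi_scores"]
    return record["name"], max(scores) if scores else 0.0


def rank_locally(records: List[Dict]) -> List[Tuple[str, float]]:
    """Score every map file and sort from most to least sensitive."""
    return sorted((score_map(r) for r in records), key=lambda kv: kv[1], reverse=True)


def rank_on_spark(records: List[Dict]) -> List[Tuple[str, float]]:
    """Same ranking, distributed with Spark (requires PySpark)."""
    # Imported lazily so that the local path has no Spark dependency.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("map-sensitivity-sketch").getOrCreate()
    try:
        rdd = spark.sparkContext.parallelize(records)
        return rdd.map(score_map).sortBy(lambda kv: kv[1], ascending=False).collect()
    finally:
        spark.stop()


# Invented sample data: three map files with precomputed per-POI scores.
SAMPLE = [
    {"name": "a.dwg", "poi_scores": [0.1, 0.9]},
    {"name": "b.jpg", "poi_scores": []},
    {"name": "c.tab", "poi_scores": [0.4]},
]
```

Keeping the scoring function free of Spark objects means the same function can be exercised locally and then passed unchanged to `rdd.map`.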
Brief description of the drawings
Fig. 1 is the system architecture diagram of the invention;
Fig. 2 is the overall flow chart of the invention.
Detailed description of the embodiments
To better explain the content of the invention, its specific implementation is further described below with reference to the accompanying drawings and examples.
As shown in Fig. 1 (the algorithm framework), the proposed algorithm is executed on the Spark computing framework; both data storage and algorithm execution take place on Spark. Fig. 2 shows the overall flow of the invention. The Spark-based algorithm for detecting classified internet maps comprises four modules: sensitive lexicon construction, data preprocessing, data feature extraction, and map file sensitivity calculation.
The implementation of the invention is described in detail below.
S1: Constructing the sensitive lexicon. Map sensitivity is computed mainly from the similarity between feature words and sensitive words, so the quality of the sensitive lexicon has a major effect on the detection algorithm. We first define a lexicon of single sensitive words, S = {s1, s2, ..., sn}. However, once Chinese word segmentation has been applied to the map text, the text is split into individual words, and a classified place may be split into words that are not sensitive on their own. For example, "Chinese rocket research base" is segmented into the four words "China", "rocket", "research" and "base". None of the four is classified when checked independently, yet a POI for a national rocket research base may not be marked directly on a map file. For such POIs we detect combinations of sensitive words, that is, we define a second lexicon whose entries are combinations of sensitive words. When building the lexicons, we assign each sensitive word a weight according to the sensitivity of the place it represents, V = {υ1, υ2, ..., υn}, where υi is the weight of the corresponding sensitive word.
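The two lexicons can be pictured with the following minimal sketch. The entries, the weights and the function name `match_lexicons` are invented for illustration, and the text is assumed to have been segmented already.

```python
from typing import Dict, FrozenSet, List

# Single-word lexicon: sensitive word -> weight (values are invented).
SINGLE_WORDS: Dict[str, float] = {"military base": 1.0, "missile": 1.0}

# Combination lexicon: a set of words that is sensitive only when all of
# them occur together in the same text (entry and weight are invented).
COMBINED_WORDS: Dict[FrozenSet[str], float] = {
    frozenset({"rocket", "research", "base"}): 1.0,
}


def match_lexicons(tokens: List[str]) -> Dict[str, float]:
    """Return every lexicon entry triggered by the segmented text, with its weight."""
    present = set(tokens)
    hits = {word: weight for word, weight in SINGLE_WORDS.items() if word in present}
    for combo, weight in COMBINED_WORDS.items():
        if combo <= present:  # every word of the combination occurs in the text
            hits["+".join(sorted(combo))] = weight
    return hits
```

With these entries, the segmented text "China", "rocket", "research", "base" triggers the combination entry even though no single word is in the single-word lexicon, while a text containing only "rocket" triggers nothing.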
S2: Data preprocessing. Map files on the internet come in many formats, such as jpg, dwg and MapInfo. We represent a map file of any one format as M = (P, F), where P is the set of places annotated as POIs in the map and F is the attached information of the map file, with P = {p1, p2, ..., pn} and pi an annotated place in the map. We define MS(pi) as the sensitivity of the annotated place pi, and MS(P) and MS(F) as the sensitivity of the map's annotated places and of its attached information, respectively.
The sensitivity of a map file is obtained by combining the sensitivity of its POI information with that of its attached information. To compute the two, the data is first preprocessed, and the main task of the preprocessing module is Chinese word segmentation of the text. The segmenter used here is the open-source Ansj, run on Spark. Ansj is a Chinese word segmenter built on n-gram, CRF and HMM models; its segmentation accuracy exceeds 96%, and it currently supports Chinese word segmentation, Chinese name recognition, keyword extraction and keyword tagging. We use Ansj to extract the feature words of the POI annotation text and of the attached information, and define the following: the feature words obtained after segmenting a text form a set c1, c2, ..., cn, and L = {l1, l2, ..., ln} gives the position attribute of each feature word in the POI text. Each li takes a value in {B, I, E}, meaning that the feature word is at the beginning, in the middle or at the end of the POI text. The position attribute vector is defined for use in the later feature extraction from POI text.
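The position attribute can be illustrated with a short sketch. It assumes the POI text has already been segmented (for example by Ansj, which is not called here); the function name `tag_positions` and the handling of a one-word text are assumptions, since the source does not cover that case.

```python
from typing import List, Tuple


def tag_positions(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pair each feature word with B (beginning), I (inside) or E (end).

    Assumption: a text consisting of a single word is tagged B.
    """
    last = len(tokens) - 1
    tagged = []
    for i, token in enumerate(tokens):
        if i == 0:
            tag = "B"
        elif i == last:
            tag = "E"
        else:
            tag = "I"
        tagged.append((token, tag))
    return tagged
```

For the segmented POI name "Caiyuanba", "military camp", "bus station", the sensitive keyword "military camp" receives the tag I and "bus station" the tag E.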
S3: Text feature extraction. Data preprocessing yields the feature word vectors of the POI annotation text and of the attached information. From the feature words we extract four text features: the similarity between a feature word and the sensitive vocabulary, the weight of the feature word in the text, the position attribute of the feature word in the POI text, and the weight of the corresponding sensitive word in the sensitive lexicon. The four extraction methods are described below.
S31: Similarity between feature words and sensitive words. The preprocessing above extracts the feature words of a map file; the question is then how to determine their sensitivity. Much related work obtains it by matching feature words directly against the entries of a sensitive lexicon, but such detection inevitably suffers when the lexicon is incomplete or a near-synonym goes unrecognized. This document therefore measures the similarity between feature words and sensitive words: the similarity is computed first, and when the similarity of two words exceeds a given threshold, the sensitivity of the feature word is quantified from it.
The words in internet map files are mainly place categories, so the feature words we extract are mainly words denoting kinds of place, and their sensitivity is obtained by computing their similarity to the sensitive vocabulary. Word similarity has been studied extensively, and two kinds of measure are common: methods based on world knowledge or on a classification system, and statistical methods based on a context vector space model. Dictionaries commonly used by the first kind include HowNet, WordNet and the Chinese thesaurus (Tongyici Cilin), each constructed differently. Because our text is mainly short text, namely the place information annotated in maps, the suitable approach is a word similarity measure based on a dictionary classification system; after comparison we adopted the measure based on the Chinese thesaurus. The Chinese thesaurus was compiled by Mei Jiaju et al. in 1983. It contains not only the synonyms of a word but also a number of words of the same kind. The Information Retrieval Laboratory of Harbin Institute of Technology later used a large body of lexical resources to produce the extended edition of the Chinese thesaurus, on which our word similarity calculation is based.
Using the word similarity algorithm based on the Chinese thesaurus, we compute the similarity mij between feature word ci and sensitive word sj. Because the sensitive lexicon contains many sensitive words, we take the one most similar to the feature word, that is, Mij = max(mij, j ∈ S). A feature word is considered to have some sensitivity only when its similarity to a sensitive word exceeds the threshold θ; the defining formula is not reproduced in this text.
When this test shows that a feature word is sensitive, we go on to extract the weight of that word in the text, its position in the POI text, and the weight of the corresponding sensitive word.
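The selection of the closest sensitive word and the threshold test can be sketched as follows. The similarity table is invented and merely stands in for a thesaurus-based measure; `toy_similarity`, `best_match` and the threshold value used in the usage note are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

# Invented similarity scores standing in for a thesaurus-based measure.
TOY_SIMILARITY: Dict[Tuple[str, str], float] = {
    ("barracks", "military camp"): 0.9,
    ("barracks", "airport"): 0.2,
    ("park", "military camp"): 0.1,
}


def toy_similarity(a: str, b: str) -> float:
    """Symmetric lookup; identical words score 1.0, unknown pairs 0.0."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get((a, b), TOY_SIMILARITY.get((b, a), 0.0))


def best_match(
    feature_word: str,
    lexicon: Dict[str, float],
    theta: float,
    similarity: Callable[[str, str], float] = toy_similarity,
) -> Optional[Tuple[str, float]]:
    """Return (closest sensitive word, similarity) if it exceeds theta, else None."""
    if not lexicon:
        return None
    # Maximum over all sensitive words, as in Mij = max(mij, j in S).
    word, score = max(
        ((s, similarity(feature_word, s)) for s in lexicon), key=lambda pair: pair[1]
    )
    return (word, score) if score > theta else None
```

With a lexicon containing "military camp" and "airport" and an assumed threshold of 0.8, "barracks" is matched to "military camp" although it is not itself in the lexicon, which is the case direct string matching would miss.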
S32: Weight of feature words in the text. Different words carry different weights in a text; the keywords carry the most weight and best represent its meaning. The same holds for the text we extract from map files. If a heavily weighted feature word is a sensitive word, the map file is more likely to be a classified map file; if only a lightly weighted feature word is sensitive, that likelihood is correspondingly lower. A commonly used way to obtain term weights is the TF-IDF algorithm, a statistical method for assessing how important a word is to a document. It is computed from the term frequency (TF) and the inverse document frequency (IDF):
TF-IDF = TF * IDF (2)
where the term frequency TF and the inverse document frequency IDF are given by:
TF_{i,j} = \frac{n_{i,j}}{n_j}
IDF_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}
Here n_{i,j} is the number of times word t_i occurs in document d_j, and n_j is the total number of words in d_j. |D| is the total number of documents in the corpus and |{j : t_i ∈ d_j}| is the number of documents in the corpus that contain t_i; 1 is added to the denominator so that the divisor is never 0.
In existing TF-IDF calculations of the weight of a word c, the inverse document frequency is obtained from the number of documents found by searching the web, |D|, and the number of those documents that contain the word, |{j : t_i ∈ d_j}|. Our target text, however, consists mainly of text with geographic attributes, and the extracted feature words are generally place nouns. We therefore compute the IDF over the feature texts of our own data set, using the general place text we extract as the corpus, and take the resulting TF-IDF value as the weight w_i of the word:
w_i = \frac{n_{i,j}}{n_j} \times \log \frac{|P|}{1 + |\{j : c_i \in p_j\}|}
Here n_{i,j} is the number of times word c_i occurs in text p_j, n_j is the total number of words in p_j, |P| is the total number of texts in our data set, and |{j : c_i ∈ p_j}| is the number of texts in the corpus that contain c_i.
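A minimal sketch of this weight calculation, with the inverse document frequency taken over the data set's own texts and 1 added to the denominator as described. The function name `tf_idf`, the use of the natural logarithm and the handling of empty input are assumptions; texts are assumed to be lists of already segmented words.

```python
import math
from typing import List


def tf_idf(word: str, text: List[str], corpus: List[List[str]]) -> float:
    """TF-IDF weight of `word` in `text`, with IDF computed over `corpus`.

    TF  = occurrences of word in text / number of words in text
    IDF = log(number of texts / (1 + number of texts containing word))
    """
    if not text or not corpus:
        return 0.0  # assumption: empty input carries no weight
    tf = text.count(word) / len(text)
    containing = sum(1 for doc in corpus if word in doc)
    idf = math.log(len(corpus) / (1 + containing))
    return tf * idf
```

Note that with 1 added to the denominator, a word that occurs in all or all but one of the texts receives a weight of zero or below, so the measure favors words that are rare in the data set.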
S33: Position of feature words in the POI text. A particular situation arises when judging the sensitivity of map POI text. The POI name "Caiyuanba military camp bus station", for example, contains the classified keyword "military camp", so the POI would be flagged as a classified place; a human reviewer, however, recognizes that it is not a classified place but a bus station. Analysing the cause shows that the position of a classified keyword within the POI text also affects whether the POI is a classified place. During data preprocessing we therefore record the position of each feature word in the text when segmenting it, using L = {l1, l2, ..., ln} for the position attributes of the feature words, where li ∈ {B, I, E} and B, I and E mean that the feature word is at the head, in the middle or at the tail of the POI text.
S34: Weight of the sensitive word corresponding to a feature word. Places on a map differ in sensitivity according to how closely they are associated with military bases and how they affect the security of the national geographic environment. The "Supplementary Provisions on the Content Representation of Public Maps" issued by the map review center of the State Bureau of Surveying and Mapping also specify that places such as military airfields and military engineering works are highly sensitive. Other work measures the sensitivity of geographic objects by constructing sensitivity coefficients for them. We divide geographic objects into three classes: objects directly related to military use, objects indirectly related to military use, and large national infrastructure. The sensitivity weight of a geographic object is denoted V, with V ∈ {1, 0.7, 0.4}; the specific values are given in Table 1.1. To decide which class a feature word belongs to, we rely on the sensitive vocabulary: when the sensitive lexicon is built, each sensitive word is given a sensitivity weight, which is V, and when the similarity between a feature word and a sensitive word meets the threshold, the weight of that sensitive word is extracted.
Table 1.1 Classification of geographic objects
S4: Map file sensitivity calculation. The carriers of classified content in a map file are mainly the attached information that describes the file on the internet and the place POIs annotated in the map file. We combine the sensitivity MS(F) of the attached information with the sensitivity MS(P) of the annotated place POIs to compute the sensitivity MS(M) of the map:
MS(M) = α MS(P) + β MS(F) (6)
where α and β are the weights of the annotated-place POI sensitivity and of the attached-information sensitivity, respectively.
To compute the sensitivity of the attached information and of the annotated place POIs in a map file, Chinese word segmentation is used to obtain the feature word sets of both. The sensitivity of a feature word is computed from its four extracted features. Because an annotated place POI is a short place-name text, we take the position of the feature word in the short text into account; the attached information of a map file is a whole passage of text, so for it we do not consider the position of feature words. The formulas for the feature word sensitivity of the attached information and of the annotated place POIs are not reproduced in this text. In the former, ω_i is the TF-IDF weight of the feature word in the text, M_ij is the similarity between feature word i and sensitive word j, and V_j is the weight of sensitive word j; the latter additionally uses l_i, the position attribute of feature word i in the POI text.
A map contains many annotated place POIs. We obtain the annotated-place sensitivity MS(P) of the whole map from the sensitivity MS(pi) of each single annotated place. The sensitivity of a single annotated place is computed from the sensitivity of the feature words in its POI text, and MS(P) is then computed from the sensitivities of the single annotated places. The attached-information sensitivity MS(F) is computed from the sensitivity of the feature words extracted from the map's attached text. Together these give the sensitivity of an internet map file; the corresponding formulas are not reproduced in this text.
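The overall calculation can be sketched as follows. Only the weighted sum MS(M) = α MS(P) + β MS(F) comes from the text; everything else is an assumption made for illustration, because the other formulas are not available here: the product of features for a single word, the numeric position factors, the sum over the words of a text, and the mean over POIs. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Invented numeric factors for the position attribute; the source gives no values.
POSITION_FACTOR = {"B": 0.5, "I": 0.5, "E": 1.0}


@dataclass
class FeatureWord:
    tfidf: float                    # omega_i: weight of the word in its text
    similarity: float               # M_ij: similarity to the closest sensitive word
    lexicon_weight: float           # V_j: weight of that sensitive word
    position: Optional[str] = None  # l_i: "B", "I" or "E" in POI text, None otherwise


def word_sensitivity(w: FeatureWord) -> float:
    """Assumed combination: product of the available features."""
    score = w.tfidf * w.similarity * w.lexicon_weight
    if w.position is not None:  # position is only considered for POI text
        score *= POSITION_FACTOR[w.position]
    return score


def text_sensitivity(words: List[FeatureWord]) -> float:
    """Assumed aggregation: sum over the feature words of one text."""
    return sum(word_sensitivity(w) for w in words)


def map_sensitivity(pois: List[List[FeatureWord]], attached: List[FeatureWord],
                    alpha: float, beta: float) -> float:
    """MS(M) = alpha * MS(P) + beta * MS(F), as in formula (6).

    MS(P) is assumed here to be the mean of the single-POI sensitivities.
    """
    ms_p = sum(text_sensitivity(p) for p in pois) / len(pois) if pois else 0.0
    ms_f = text_sensitivity(attached)
    return alpha * ms_p + beta * ms_f
```

Under these assumptions, a sensitive word at the end of a POI name contributes twice as much as the same word at the beginning or in the middle, which reflects the bus station example; other choices of factors or aggregation would be equally compatible with the text.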
The invention starts from the observation that the carriers of sensitivity in a map file are mainly the attached information of the map and the POI annotation text in the map file. Using natural language processing algorithms such as Chinese word segmentation, word similarity calculation and word weight calculation, it extracts four features from the text of the map information. From these four features the sensitivities of the attached information and of the set of POI texts are computed separately, and the sensitivity value of the whole map file is obtained statistically. Because the internet holds a large amount of map data and map file formats differ, the algorithm is implemented on the Spark parallel processing framework to improve its execution efficiency; simulation tests show that the algorithm improves both detection accuracy and execution performance. It should be emphasized that the invention is a detection algorithm for classified content in map files and can better solve the problem of detecting classified map files on the internet.
It should be understood that the specific embodiments above are intended to help those skilled in the art and other readers understand how the invention is implemented, and that the scope of protection of the invention is not limited to these particular statements and embodiments. Although the invention has been described in detail with reference to the drawings and embodiments, those skilled in the art will understand that it may still be modified or replaced by equivalents. In short, all technical solutions and improvements that do not depart from the spirit and scope of the invention fall within the scope of protection of this patent.
The invention discloses a Spark-based algorithm for detecting classified internet maps, in the fields of natural language processing and big data analysis. It comprises four implementation stages: sensitive lexicon construction, map text preprocessing, map file feature extraction, and statistical calculation of map file sensitivity. First, the original map data is preprocessed with the Chinese word segmenter Ansj to obtain the feature words of the map file. Second, a word similarity algorithm based on the Chinese thesaurus and the TF-IDF word weighting algorithm are used to compute the sensitivity of the feature words and extract the map file features. Finally, from these features the sensitivities of the map's attached information and of the POI-annotated places in the map file are computed statistically, giving the sensitivity value of the map file. To improve execution efficiency, the Ansj segmentation, the TF-IDF algorithm and the statistical sensitivity calculation are all implemented on the Spark parallel computing framework, which makes timely monitoring of map data on the internet practical. It should be emphasized that the invention detects map files of different formats on the internet and is an effective algorithm for detecting classified internet maps.
Claims (7)
1. A Spark-based algorithm for detecting classified internet maps, comprising the following modules. A data preprocessing module: the main carriers of classified information in an internet map file are the attached information of the map file and the POI annotation information in it; the attached information is mainly the description of the map file given by the person who published the data, and the POI annotations are mainly the names of places in the map. A sensitive lexicon construction module: the sensitive lexicon plays an important role in extracting the sensitivity of feature words, and some place information is sensitive only as a combination of words, none of which is sensitive when considered alone. A text feature extraction module, which extracts the similarity between feature words and the sensitive vocabulary, the weight of each feature word in the text, the position attribute of each feature word in the POI text, and the weight of the corresponding sensitive word in the sensitive lexicon; the sensitivity of the map file is constructed from the sensitivity of the feature words and their attributes in the text. A map file sensitivity calculation module, which computes the sensitivity of the map file statistically from the extracted feature word features. The detection algorithm first extracts the text data of the map file, then extracts text features using natural language processing algorithms, and finally combines the feature word sensitivities to compute the sensitivity value of the map file.
2. The Spark-based algorithm for detecting classified internet maps according to claim 1, characterized in that the sensitive lexicon is constructed as follows. The objects tested by the algorithm are mainly map files, so the sensitive words are mainly words for places that the state should not disclose, such as military bases and large national infrastructure sites. Observation of map POIs shows that some sensitive information is classified not through a single sensitive word but through a combination of words. For example, the word "rocket" in a place name does not necessarily indicate classified content, but if the place name also contains words such as "research base", classified content may be involved. Therefore, when constructing the sensitive lexicon, a lexicon of word combinations is defined in addition to the lexicon of single words; with both lexicons, sensitive place information in maps is detected more comprehensively.
3. The Spark-based internet classified-map detection algorithm according to claim 1, characterized in that the sensitivity of each feature word extracted after data preprocessing is quantified by computing the similarity between the feature word and the sensitive words. If the feature word is identical to a sensitive word, its sensitivity is 1; if the similarity between the feature word and a sensitive word reaches a certain threshold, the sensitivity of the feature word is taken to be that similarity value. Computing this similarity addresses incomplete coverage of the sensitive dictionary and unrecognized near-synonyms. It also allows words whose sensitivity exceeds a certain threshold to be extracted for subsequent processing, in order to judge the sensitivity of the text that contains them.
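The quantification rule in claim 3 can be sketched as below. The claim does not name a similarity measure, so this sketch substitutes the character-level ratio from Python's standard `difflib.SequenceMatcher`; a semantic measure would be needed to catch true near-synonyms. The function name `word_sensitivity` and the threshold value 0.6 are assumptions.

```python
from difflib import SequenceMatcher


def word_sensitivity(word, sensitive_words, threshold=0.6):
    """Sensitivity of a feature word: 1.0 on an exact match, the best
    similarity if it reaches the (hypothetical) threshold, otherwise 0.0."""
    if word in sensitive_words:
        return 1.0
    # Stand-in similarity: character-level ratio, not a semantic measure.
    best = max(
        (SequenceMatcher(None, word, s).ratio() for s in sensitive_words),
        default=0.0,
    )
    return best if best >= threshold else 0.0


lexicon = {"军事基地"}
print(word_sensitivity("军事基地", lexicon))  # exact match
print(word_sensitivity("军事基", lexicon))    # close variant
print(word_sensitivity("公园", lexicon))      # unrelated word
```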
4. The Spark-based internet classified-map detection algorithm according to claim 1, characterized in that the weight of each feature word within its text is extracted. Different words contribute differently to the meaning of a text: the more representative a word is of the text, the greater its weight in that text. In map text, if a feature word is a sensitive word and its weight in the text is large, the sensitivity value of the map file is correspondingly high. This weight therefore plays an important role in the subsequent sensitivity grading of map files.
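Claim 4 does not say how the weight of a feature word in its text is computed. One common choice for "how representative is this word of this text" is TF-IDF, sketched here purely as an assumption; the function name `tfidf_weights`, the smoothing, and the normalization to a sum of 1 are illustrative choices, not taken from the patent.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus):
    """Normalized TF-IDF weight of every word in one segmented text.

    `corpus` is a list of token lists and is assumed to contain the text.
    """
    if not doc_tokens:
        return {}
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    raw = {}
    for word, count in counts.items():
        doc_freq = sum(1 for doc in corpus if word in doc)
        # Smoothed inverse document frequency, always positive.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        raw[word] = (count / len(doc_tokens)) * idf
    total = sum(raw.values())
    return {word: value / total for word, value in raw.items()}


corpus = [
    ["rocket", "research base", "road"],
    ["road", "park"],
    ["road", "school"],
]
weights = tfidf_weights(corpus[0], corpus)
# "road" appears in every text, so it weighs less than "rocket".
print(weights["rocket"] > weights["road"])
```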
5. The Spark-based internet classified-map detection algorithm according to claim 1, characterized in that the position weight of each feature word in the POI text, and the weight in the sensitive dictionary of the sensitive word corresponding to the feature word, are extracted. Analysis of POI text in maps shows that feature words at different positions in the POI text carry different weights. The invention defines three position attributes for a feature word in the text, {B, I, E}, indicating that the feature word is at the beginning, in the middle, or at the end of the POI text, respectively. In addition, different classified places in a map have different sensitivities; for example, a military POI is more sensitive than some infrastructure POIs. Weights are therefore defined for the sensitive words so that different feature words can be mapped to POIs of different sensitivity, with the weights assigned to sensitive words according to the type of geographic object.
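The {B, I, E} position attribute and the per-category dictionary weight of claim 5 can be sketched as follows. All numeric weights, the category names, and the function names are hypothetical; the claim only states that such weights exist and that military POIs rank above some infrastructure POIs.

```python
# Hypothetical weights; the patent does not publish concrete values.
POSITION_WEIGHTS = {"B": 0.8, "I": 0.6, "E": 1.0}
CATEGORY_WEIGHTS = {"military": 1.0, "infrastructure": 0.6}
SENSITIVE_CATEGORY = {
    "military base": "military",
    "power plant": "infrastructure",
}


def position_attribute(index, length):
    """Map a token index to B (beginning), I (middle) or E (end)."""
    if index == 0:
        return "B"
    if index == length - 1:
        return "E"
    return "I"


def position_and_lexicon_weight(tokens, index):
    """Return (attribute, position weight, dictionary weight) for one token."""
    attribute = position_attribute(index, len(tokens))
    category = SENSITIVE_CATEGORY.get(tokens[index])
    lexicon_weight = CATEGORY_WEIGHTS.get(category, 0.0)
    return attribute, POSITION_WEIGHTS[attribute], lexicon_weight


poi = ["north", "suburb", "military base"]
print(position_and_lexicon_weight(poi, 2))
```

A single-token POI text is treated as B in this sketch, since index 0 is checked first.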
6. The Spark-based internet classified-map detection algorithm according to claim 1, characterized in that the map file sensitivity is calculated as follows: using the text features defined in claims 3 to 5, the sensitivity of the map satellite information and the sensitivity of the POI location information in the map are computed statistically, and the two parts are then combined to compute the sensitivity of the corresponding map file.
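Claim 6 says the two partial sensitivities are computed "statistically" and then combined, without giving a formula. The sketch below is one possible reading under stated assumptions: each feature word contributes the product of its four features, a text's sensitivity is the sum of those products, the POI part is the maximum over POI entries, and the two parts are mixed with a hypothetical coefficient `alpha`.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    similarity: float       # feature word vs. sensitive word (claim 3)
    text_weight: float      # weight of the word in its text (claim 4)
    position_weight: float  # B/I/E position weight (claim 5)
    lexicon_weight: float   # weight of the matched sensitive word (claim 5)


def text_sensitivity(features):
    """Assumed aggregation: sum of per-word feature products."""
    return sum(
        f.similarity * f.text_weight * f.position_weight * f.lexicon_weight
        for f in features
    )


def map_file_sensitivity(satellite_features, poi_feature_lists, alpha=0.5):
    """Assumed combination of the satellite part and the most sensitive POI."""
    satellite = text_sensitivity(satellite_features)
    poi = max((text_sensitivity(fs) for fs in poi_feature_lists), default=0.0)
    return alpha * satellite + (1 - alpha) * poi


satellite = [Feature(1.0, 0.5, 1.0, 1.0)]
pois = [[Feature(0.8, 0.5, 1.0, 0.5)], []]
print(map_file_sensitivity(satellite, pois))
```

Taking the maximum over POIs means a single highly sensitive place dominates the file score; averaging would be an equally plausible reading of the claim.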
7. The Spark-based internet classified-map detection algorithm according to claim 1, characterized in that the algorithm specifically comprises: first parsing two parts of text information from the internet map file, namely the map file satellite information and the several POI location entries in the map file; then preprocessing the two parts of text information separately to extract the feature words in the text and their attributes in the text; extracting the four kinds of features of the map text feature words according to the characteristics of map location text; computing, from these features, the sensitivity of the map satellite information and of the several POI location entries in the map; and then jointly computing the sensitivity value of the corresponding map file. For the map files in the database, the algorithm can output them sorted by map file sensitivity value.
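The final step of claim 7, scoring every map file and outputting the files ordered by sensitivity, can be sketched in plain Python. The record layout (`satellite_info`, `pois`), the toy scoring function, whitespace splitting as a stand-in for Chinese word segmentation, and descending order are all assumptions made for illustration.

```python
SENSITIVE_WORDS = {"military", "base"}  # hypothetical lexicon


def toy_score(record):
    """Placeholder score: share of tokens that are sensitive words."""
    texts = [record["satellite_info"]] + record["pois"]
    tokens = [tok for text in texts for tok in text.lower().split()]
    if not tokens:
        return 0.0
    return sum(tok in SENSITIVE_WORDS for tok in tokens) / len(tokens)


def rank_map_files(map_files, score=toy_score):
    """Score each map file and return (name, score) pairs, highest first."""
    scored = [(name, score(record)) for name, record in map_files.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


files = {
    "map_a": {"satellite_info": "city park map", "pois": ["central park", "lake"]},
    "map_b": {"satellite_info": "north area", "pois": ["military base"]},
}
for name, value in rank_map_files(files):
    print(name, value)
```

On Spark, the same shape would presumably become a per-file `map` followed by a `sortBy` on the score, so that scoring runs in parallel across the map files in the database.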
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811216505.6A CN109446288A (en) | 2018-10-18 | 2018-10-18 | One kind being based on the internet Spark concerning security matters map detection algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109446288A (en) | 2019-03-08 |
Family ID=65546775
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811216505.6A Pending CN109446288A (en) | 2018-10-18 | 2018-10-18 | One kind being based on the internet Spark concerning security matters map detection algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109446288A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104008169A (en) * | 2014-05-30 | 2014-08-27 | 中国测绘科学研究院 | Semanteme based geographical label content safe checking method and device |
CN106445998A (en) * | 2016-05-26 | 2017-02-22 | 达而观信息科技(上海)有限公司 | Text content auditing method and system based on sensitive word |
CN108319630A (en) * | 2017-07-05 | 2018-07-24 | 腾讯科技(深圳)有限公司 | Information processing method, device, storage medium and computer equipment |
CN108519970A (en) * | 2018-02-06 | 2018-09-11 | 平安科技(深圳)有限公司 | The identification method of sensitive information, electronic device and readable storage medium storing program for executing in text |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334228A (en) * | 2019-07-09 | 2019-10-15 | 广西壮族自治区基础地理信息中心 | A kind of Internet Problems map screening method based on deep learning |
CN110580416A (en) * | 2019-09-11 | 2019-12-17 | 国网浙江省电力有限公司信息通信分公司 | sensitive data automatic identification method based on artificial intelligence |
CN110888972A (en) * | 2019-10-27 | 2020-03-17 | 北京明朝万达科技股份有限公司 | Sensitive content identification method and device based on Spark Streaming |
CN111209735A (en) * | 2020-01-03 | 2020-05-29 | 广州杰赛科技股份有限公司 | Document sensitivity calculation method and device |
CN111209735B (en) * | 2020-01-03 | 2023-06-02 | 广州杰赛科技股份有限公司 | Document sensitivity calculation method and device |
WO2021142600A1 (en) * | 2020-01-14 | 2021-07-22 | 华为技术有限公司 | Image recognition method and related device |
CN113396410A (en) * | 2020-01-14 | 2021-09-14 | 华为技术有限公司 | Image identification method and related equipment |
CN112463804A (en) * | 2021-02-02 | 2021-03-09 | 湖南大学 | KDTree-based image database data processing method |
CN112463804B (en) * | 2021-02-02 | 2021-06-15 | 湖南大学 | KDTree-based image database data processing method |
CN114372152A (en) * | 2022-01-05 | 2022-04-19 | 自然资源部地图技术审查中心 | Rapid safety inspection method and device for electronic map POI |
CN114372152B (en) * | 2022-01-05 | 2024-08-16 | 自然资源部地图技术审查中心 | Rapid security inspection method and device for electronic map POI |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114610515B (en) | Multi-feature log anomaly detection method and system based on log full semantics | |
CN109446288A (en) | One kind being based on the internet Spark concerning security matters map detection algorithm | |
CN109800310B (en) | Electric power operation and maintenance text analysis method based on structured expression | |
CN110334213B (en) | Method for identifying time sequence relation of Hanyue news events based on bidirectional cross attention mechanism | |
CN103853738B (en) | A kind of recognition methods of info web correlation region | |
CN107463658B (en) | Text classification method and device | |
CN110516067A (en) | Public sentiment monitoring method, system and storage medium based on topic detection | |
CN108280130A (en) | A method of finding sensitive data in text big data | |
CN108628828A (en) | A kind of joint abstracting method of viewpoint and its holder based on from attention | |
CN102298635A (en) | Method and system for fusing event information | |
CN106844331A (en) | Sentence similarity calculation method and system | |
CN110222250B (en) | Microblog-oriented emergency trigger word identification method | |
CN110362678A (en) | A kind of method and apparatus automatically extracting Chinese text keyword | |
CN104063387A (en) | Device and method abstracting keywords in text | |
CN106599054A (en) | Method and system for title classification and push | |
JP5426868B2 (en) | Numerical expression processing device | |
CN111061939B (en) | Scientific research academic news keyword matching recommendation method based on deep learning | |
CN109918621A (en) | Newsletter archive infringement detection method and device based on digital finger-print and semantic feature | |
CN113449111B (en) | Social governance hot topic automatic identification method based on time-space semantic knowledge migration | |
CN113590810A (en) | Abstract generation model training method, abstract generation device and electronic equipment | |
CN114997288A (en) | Design resource association method | |
CN116719683A (en) | Abnormality detection method, abnormality detection device, electronic apparatus, and storage medium | |
CN111079582A (en) | Image recognition English composition running question judgment method | |
Indarapu et al. | Comparative analysis of machine learning algorithms to detect fake news | |
CN113591476A (en) | Data label recommendation method based on machine learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190308 |