CN109446288A - A Spark-based detection algorithm for classified maps on the Internet - Google Patents

A Spark-based detection algorithm for classified maps on the Internet

Info

Publication number
CN109446288A
Authority
CN
China
Prior art keywords
map
text
sensitive
feature words
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811216505.6A
Other languages
Chinese (zh)
Inventor
胡敏
崔永胜
黄宏程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN201811216505.6A
Publication of CN109446288A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a Spark-based algorithm for detecting classified maps on the Internet, in the fields of big-data applications and natural language processing. The method first preprocesses the data: it performs Chinese word segmentation on the map text data and extracts the feature words of the map file. It then extracts text features, chiefly four: the similarity between a feature word and the sensitive vocabulary, the weight of the feature word in the text, the position attribute of the feature word in the POI text, and the weight in the sensitive dictionary of the sensitive word corresponding to the feature word. Finally, the sensitivity of the map file is computed statistically from the extracted features. Combined with a method for crawling Internet map file data, the invention enables automatic detection of classified maps on the Internet, speeds up the detection of problem maps, and reduces the burden of manual inspection.

Description

A Spark-based detection algorithm for classified maps on the Internet
Technical field
The invention belongs to the fields of big-data applications and natural language processing. It relates to the detection of sensitive information in Internet maps, and specifically to a Spark-based detection algorithm for classified maps on the Internet.
Background
With the rapid development of the Internet and information technology, map services have become an indispensable part of daily life. The large number of electronic maps on the network brings convenience, but also problems that cannot be ignored, such as the security of national geographic information. In September 2015, the National Administration of Surveying, Mapping and Geoinformation inspected more than 13,000 items of geographic information on large domestic and foreign commercial websites, forums, and microblogs, and found 275 "map" services, 321 "problem map" pictures, and 2,336 illegal point-of-interest (POI) labels. The main carrier of classified content in a map is illegally labeled POI information. A POI includes a name, longitude and latitude, and other information; it may represent a house or a shop, but it may also represent a military base or a military restricted zone. If classified POI information is labeled on an electronic map and published on the Internet, it seriously damages national interests and endangers national security.
For the detection of classified Internet maps, one line of research combines the amounts of sensitive position information, sensitive symbol information, sensitive geometric information, sensitive topological information, sensitive annotation information, and sensitive attribute information in a vector digital map to obtain its total amount of sensitive information, and uses this to assess the sensitivity level of the map. Other research measures the sensitivity of a map file by computing the sensitivity of the POIs in the map, and proposes solutions from two directions: administration and software identification technology. Research on classified-map detection in China is still limited, but there is much related work on sensitive-word detection. For example, some work builds a sensitive dictionary and applies Chinese word segmentation and string matching to detect sensitive words in mail and thereby determine the sensitivity level of the mail. Other work extracts the feature words of topic text, labels the sensitivity of the feature items of the document under test with a conditional random field detection model combined with a sensitive lexicon, and proposes a sensitive-information detection model based on kernel methods.
The map-oriented methods above depend on the information content of each map attribute, which is difficult to extract, so map sensitivity detection remains hard. The mail and document methods above rely mainly on direct matching between feature words and sensitive words, which fails when the sensitive dictionary is incomplete or when near-synonyms are used. We therefore propose computing the similarity between feature words and sensitive words to obtain the sensitivity of a feature word. Because our detection target is the short POI text in a map, we extract three further features based on the properties of place POIs: the weight of the feature word in the text, the position attribute of the feature word in the POI text, and the weight in the sensitive dictionary of the sensitive word corresponding to the feature word. The four features are combined to compute the sensitivity of a map POI. In addition, Internet map files usually carry some attached information, which is also used as an attribute for classified-map detection.
With the rapid development of Internet technology, data in every industry is growing explosively, and map data on the network is growing just as fast. The traditional single-machine processing model can no longer meet demand. Distributed processing has greatly advanced the analysis and processing of big data, and Spark and Hadoop are currently popular distributed parallel computing frameworks. We use Spark as the big-data processing framework for classified-map detection because Spark not only has the advantages of MapReduce in Hadoop but also computes in memory, and offers an optimized scheduling mechanism and a richer set of operators. Much related research has also been built on Spark.
By considering both the labeled place POIs in a map and the attached information of the map file, we extract the sensitive information of each text and then compute the overall sensitivity level of the map file. This document proposes a Spark-based detection algorithm for classified maps on the Internet that improves both the accuracy and the running time of map detection.
Summary of the invention
Technical problem solved by the invention: based on the characteristics of the attached information of Internet electronic maps and of the labeled points in those maps, the invention proposes a detection algorithm model for classified Internet maps. The sensitivity level of a map is measured from the sensitivity of its attached information and of its labeled places. Because the Internet currently contains a large amount of map data, we implement the detection algorithm on the Spark processing framework and process the map data in parallel to improve performance, yielding an efficient and accurate classified-map detection model.
Technical solution of the invention: the processing of a map file is divided into three parts: data preprocessing, text feature extraction, and map sensitivity calculation. Data preprocessing parses the map files of different formats obtained from the Internet to get the place POIs in the map file and the attached information that describes the map on the Internet. Text feature extraction operates on the map's attached information and on the POI text in the map file, and extracts four features. 1. The similarity between a feature word and the sensitive vocabulary. Computing word similarity handles the cases in which the sensitive dictionary is incomplete or a synonym would otherwise go unrecognized. 2. The weight of the feature word in the text. Different words carry different weights in a text; keywords carry the largest weight and best represent the meaning of the text, so we measure the sensitivity represented by a word through its weight in the text. 3. The position attribute of the feature word in the POI text. In POI place information, the position of a feature word within the POI text also influences whether the place is sensitive. 4. The weight in the sensitive dictionary of the sensitive word corresponding to the feature word. Different sensitive places have different sensitivities; for example, a military base is more sensitive than infrastructure, so we also extract the weight of the corresponding sensitive word as a feature. From these four features the sensitivity value of a map file is computed statistically, and the algorithm then outputs the map files of the data set under test in descending order of sensitivity value. Because there is a large amount of map data on the Internet, we run the algorithm on the Spark parallel processing framework to improve its processing performance.
Brief description of the drawings
Fig. 1 is the system architecture diagram of the invention;
Fig. 2 is the overall flow chart of the invention.
Detailed description of the embodiments
To better explain the invention, its specific implementation is described in further detail below with reference to the drawings and examples.
As shown in Fig. 1 (algorithm framework), the proposed algorithm runs on the Spark computing framework: both data storage and algorithm execution are carried out on Spark. Fig. 2 is the overall flow chart of the invention. The Spark-based detection algorithm for classified Internet maps comprises four modules: construction of the sensitive dictionary, data preprocessing, data feature extraction, and map file sensitivity calculation.
The implementation process of the invention is described in detail below.
S1: Construct the sensitive dictionary. Map sensitivity is computed mainly from the similarity between feature words and sensitive words, so the quality of the sensitive dictionary strongly affects the classified-map detection algorithm. We first define a dictionary of single sensitive words, S = {s1, s2, ..., sn}. However, after Chinese word segmentation the map text is split into individual words, and some classified places are split into words that are not sensitive on their own. For example, the text "Chinese rocket research base" is segmented into the four words "China", "rocket", "research", and "base". None of these is classified when checked independently, yet a POI for a national rocket research base is not allowed to be labeled directly on a map file. For such POIs we detect combinations of sensitive words, that is, we define a second sensitive dictionary whose entries are combined sensitive words. When constructing the sensitive dictionaries, we assign each sensitive word a weight according to the sensitivity of the place it represents, V = {υ1, υ2, ..., υn}, where υi is the weight of the corresponding sensitive word.
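The two dictionaries described in S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the entries, the weights, and the names `SINGLE_WORDS`, `COMBINED_WORDS`, and `match_sensitive` are hypothetical, and exact matching is used here only to show the data structure (similarity-based matching is described in S31).

```python
# Hypothetical single-word dictionary S with weights V (illustrative values only).
SINGLE_WORDS = {"military camp": 1.0, "airfield": 0.7}

# Hypothetical combined-word dictionary: an entry is sensitive only when
# all of its words occur together in the segmented text.
COMBINED_WORDS = {frozenset({"rocket", "research", "base"}): 1.0}


def match_sensitive(tokens):
    """Return (entry, weight) pairs from both dictionaries that match the tokens."""
    token_set = set(tokens)
    # Single sensitive words, in the order they appear in the text.
    hits = [(w, SINGLE_WORDS[w]) for w in tokens if w in SINGLE_WORDS]
    # Combined sensitive words: every word of the combination must be present.
    for combo, weight in COMBINED_WORDS.items():
        if combo <= token_set:
            hits.append((" + ".join(sorted(combo)), weight))
    return hits


print(match_sensitive(["China", "rocket", "research", "base"]))
```

No single token in the example is in the single-word dictionary, so only the combined entry fires, which is the case the combined dictionary is meant to cover.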
S2: Data preprocessing. Map files on the Internet come in many formats, such as jpg, dwg, and MapInfo. We write a map file as M = (P, F), where P is the set of POI labeled places in the map and F is the attached information of the map file, with P = {p1, p2, ..., pn} and pi a labeled place in the map. We define MS(pi) as the sensitivity of labeled place pi, and MS(P) and MS(F) as the sensitivity of the map's labeled places and of the map's attached information, respectively.
The sensitivity information of a map file is obtained by combining the sensitivity of the map's POI information with the sensitivity of its attached information. To compute these two sensitivities, the data is first preprocessed; the main task of the data preprocessing module is Chinese word segmentation of the text. The segmenter used here is the open-source Ansj, run on Spark. Ansj is a Chinese word segmenter based on n-gram, CRF, and HMM models; its segmentation accuracy is reported to exceed 96%, and it currently supports Chinese word segmentation, Chinese name recognition, keyword extraction, and keyword tagging. We use Ansj to extract the feature words of the POI label text and of the map's attached information. We define the set of feature words ci obtained after segmenting a text, and L = {l1, l2, ..., ln} as the position attributes of the feature words in the POI text. Each li takes a value in {B, I, E}, indicating that the feature word is at the beginning, in the middle, or at the end of the POI text. The position attribute vector is defined for use in the later feature extraction on POI text.
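The data model M = (P, F) and the position attributes L can be sketched as below. This is a sketch under assumptions: the class name `MapFile`, the function `position_attributes`, and the rule that a one-word POI text is tagged B are all hypothetical, and the tokens are assumed to have already been produced by a segmenter such as Ansj (which is not called here).

```python
from dataclasses import dataclass


@dataclass
class MapFile:
    """M = (P, F): POI label texts and attached information (already segmented)."""
    pois: list           # P: one token list per labeled place p_i
    attached_info: list  # F: token list of the attached description


def position_attributes(tokens):
    """Tag each token B (beginning), I (middle), or E (end) of the POI text."""
    last = len(tokens) - 1
    tags = []
    for i in range(len(tokens)):
        if i == 0:
            tags.append("B")   # assumption: a single-word text is tagged B
        elif i == last:
            tags.append("E")
        else:
            tags.append("I")
    return tags


poi = ["Caiyuanba", "military camp", "bus station"]
print(list(zip(poi, position_attributes(poi))))
```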
S3: Text feature extraction. Data preprocessing yields the feature word vectors of the POI label text and of the map's attached information. From the feature words we extract four features of the text: the similarity between a feature word and the sensitive vocabulary, the weight of the feature word in the text, the position attribute of the feature word in the POI text, and the weight in the sensitive dictionary of the sensitive word corresponding to the feature word. The four extraction methods are described below.
S31: Similarity between feature words and sensitive words. Preprocessing extracts the feature words of the map file; the question is then how to determine their sensitivity. Much existing research extracts the sensitivity of a feature word by directly matching it against the sensitive words in a sensitive dictionary, but this inevitably fails when the dictionary is incomplete or when near-synonyms are used. We therefore measure the similarity between feature words and sensitive words: the similarity is computed first, and when the similarity of two words exceeds a threshold, the sensitivity of the feature word is quantified from the similarity.
The words in Internet map files mainly describe place categories, so the feature words we extract are mainly place-category words, and their sensitivity is obtained by computing their similarity to the sensitive vocabulary. Word similarity has been studied extensively, and there are two common families of measures: methods based on world knowledge or a classification system, and statistical methods based on context vector space models. Common dictionaries for the first family are HowNet, WordNet, and the Chinese thesaurus (Tongyici Cilin), each constructed differently. Because our text is mainly short text, namely the place labels in a map, the suitable approach is a similarity measure based on a dictionary classification system; after comparison we use a word similarity measure based on the Chinese thesaurus. The "Chinese thesaurus" was compiled by Mei Jiaju et al. in 1983. It contains not only the synonyms of a word but also a number of related words. The Information Retrieval Laboratory of Harbin Institute of Technology later used a large body of word resources to complete the "Chinese thesaurus, extended edition", on which our word similarity computation is based.
Using the word similarity algorithm based on the Chinese thesaurus, we compute the similarity mij between feature word ci and sensitive word sj. Because the sensitive dictionary contains many sensitive words, we take the sensitive word most similar to the feature word, that is, Mij = max(mij, j ∈ S). A feature word is considered to have a certain sensitivity only when its similarity to a sensitive word exceeds a threshold θ, defined as:
For feature words found to be sensitive by the formula above, we then extract the weight of the feature word in the text, the position of the feature word in the POI text, and the weight of the corresponding sensitive word.
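The thresholded maximum similarity of S31 can be sketched as follows. The sketch makes several assumptions that are not in the source: words below the threshold are given sensitivity 0, the comparison is "greater than or equal to θ", the default θ = 0.8 is arbitrary, and `toy_similarity` is a hypothetical stand-in for a thesaurus-based similarity measure.

```python
# Hypothetical similarity table standing in for a thesaurus-based measure.
_TOY_TABLE = {("barracks", "military camp"): 0.9, ("runway", "airfield"): 0.6}


def toy_similarity(word, sensitive_word):
    """Return 1.0 for identical words, a table value if known, else 0.0."""
    if word == sensitive_word:
        return 1.0
    return _TOY_TABLE.get((word, sensitive_word), 0.0)


def feature_sensitivity(word, sensitive_words, similarity, theta=0.8):
    """Return (best sensitive word, M) where M is the maximum similarity,
    or (None, 0.0) if the maximum does not reach the threshold theta."""
    best_word, best_sim = None, 0.0
    for s in sensitive_words:
        sim = similarity(word, s)
        if sim > best_sim:
            best_word, best_sim = s, sim
    if best_sim >= theta:          # assumption: reaching theta counts as sensitive
        return best_word, best_sim
    return None, 0.0               # assumption: below theta means not sensitive


dictionary = ["military camp", "airfield"]
print(feature_sensitivity("barracks", dictionary, toy_similarity))
print(feature_sensitivity("runway", dictionary, toy_similarity))
```

Returning the matched sensitive word alongside the score makes it possible to look up that word's dictionary weight later, as S34 requires.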
S32: Weight of feature words in the text. Different words carry different weights in a text; keywords carry the largest weight and best represent its meaning. The same holds for the map file text we extract: different feature words carry different weights. If a feature word with a large weight is a sensitive word, the map file is more likely to be a classified map file; conversely, if only a feature word with a small weight is a sensitive word, the map file is less likely to be classified. A commonly used method for extracting term weights is the TF-IDF algorithm. TF-IDF is a statistical method for assessing how important a word is to a file, computed from the term frequency (TF) and the inverse document frequency (IDF):
TF-IDF=TF*IDF (2)
The term frequency TF and the inverse document frequency IDF are given by:
TF = n_{i,j} / n_j
IDF = log( |D| / (1 + |{j : t_i ∈ d_j}|) )
where n_{i,j} is the number of times word t_i occurs in file d_j, n_j is the total number of words in file d_j, |D| is the total number of files in the corpus, and |{j : t_i ∈ d_j}| is the number of files in the corpus that contain word t_i. 1 is added to the denominator to prevent division by zero.
When existing TF-IDF implementations compute the weight of a word c, the inverse document frequency IDF is obtained from the number of documents |D| found by a web search and the number of those documents that contain the word, |{j : t_i ∈ d_j}|. Our target text, however, is mainly text with geographic attributes, and the extracted feature words are generally place nouns. We therefore compute the IDF from the feature text collection of our own data set: the place texts we extract serve as the corpus, and the TF-IDF weight w_i of a word is computed against it. The formula is:
where n_{i,j} is the number of times word c_i occurs in text p_j, n_j is the total number of words in text p_j, |P| is the total number of texts in our data set, and |{j : c_i ∈ p_j}| is the number of texts in the corpus that contain word c_i.
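The corpus-specific TF-IDF weight described in S32 can be sketched as below. This is an illustration under assumptions: the function name `tfidf_weights` is hypothetical, the natural logarithm is used, and the "+1" in the denominator (stated in the source for the standard IDF) is assumed to carry over to the data-set-specific variant.

```python
import math
from collections import Counter


def tfidf_weights(texts):
    """texts: list of token lists, one per POI text or attached text.
    Returns one {word: weight} dict per text, with IDF computed over `texts`."""
    n_texts = len(texts)
    # Number of texts that contain each word at least once.
    doc_freq = Counter()
    for tokens in texts:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in texts:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_texts / (1 + doc_freq[word]))
            for word, count in counts.items()
        })
    return weights


corpus = [["park", "gate"], ["park", "station"], ["park", "bridge"], ["airfield"]]
for tokens, w in zip(corpus, tfidf_weights(corpus)):
    print(tokens, w)
```

With this smoothing, a word that appears in all but one text gets weight 0, and a word that appears in every text gets a slightly negative weight; a production version might clip weights at zero.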
S33: Position of feature words in the POI text. When judging the sensitivity of map POI text, the following situation arises. The POI name "Caiyuanba military camp bus station" contains the classified keyword "military camp", so the POI would be identified as a classified place; manual inspection shows, however, that it is not a classified place but a bus station. Analysis of such cases shows that the position of a classified keyword in the POI text also influences whether the POI is a classified place. During data preprocessing we therefore record the position of each feature word in the text when performing Chinese word segmentation, using L = {l1, l2, ..., ln} for the position attributes of the feature words, where li ∈ {B, I, E}, and B, I, and E indicate that the feature word is at the head, in the middle, or at the tail of the POI text, respectively.
S34: Weight of the sensitive word corresponding to a feature word. Places on a map differ in sensitivity according to how closely they are associated with military bases and how much they affect the security of the country's geographic environment. The map review center of the State Bureau of Surveying and Mapping also specifies, in the "Supplementary Provisions on the Representation of Public Map Content", that locations such as military airfields and military installations have high sensitivity. Some literature also measures the sensitivity of geographic objects by constructing sensitivity coefficients for them. We divide geographic objects into three classes: objects directly related to military use, objects indirectly related to military use, and large national infrastructure. The sensitivity weight of a geographic object is denoted V, with V ∈ {1, 0.7, 0.4}; the specific values are given in Table 3.1. To decide which class of geographic object a feature word belongs to, we rely on the sensitive vocabulary: when constructing the sensitive lexicon, each sensitive word is assigned a sensitivity weight V, and when the similarity between a feature word and a sensitive word meets the threshold, the weight of that sensitive word is extracted.
Table 1.1 Classification of geographic objects
S4: Map file sensitivity calculation. The carriers of classified content in a map file are mainly the attached information that describes the file on the Internet and the labeled place POIs in the map file. We combine the sensitivity MS(F) of the map file's attached information with the sensitivity MS(P) of the map's labeled place POIs to compute the sensitivity MS(M) of the map:
MS(M) = α MS(P) + β MS(F) (6)
In the formula, α and β are the weights of the sensitivity of the map's labeled place POIs and of the sensitivity of the map's attached information, respectively.
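Formula (6), together with the descending-order output described earlier, can be sketched as follows. The default values of `alpha` and `beta` and the function names are hypothetical; the source does not state how the weights are chosen.

```python
def map_sensitivity(ms_p, ms_f, alpha=0.6, beta=0.4):
    """MS(M) = alpha * MS(P) + beta * MS(F); alpha and beta are illustrative."""
    return alpha * ms_p + beta * ms_f


def rank_maps(scores, alpha=0.6, beta=0.4):
    """scores: {map_id: (MS(P), MS(F))}.
    Returns map ids sorted by MS(M), most sensitive first."""
    return sorted(
        scores,
        key=lambda map_id: map_sensitivity(*scores[map_id], alpha=alpha, beta=beta),
        reverse=True,
    )


example = {"map_a": (0.1, 0.1), "map_b": (0.9, 0.2), "map_c": (0.4, 0.8)}
print(rank_maps(example))
```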
To compute the sensitivity of the attached information and of the labeled place POI information in a map file, Chinese word segmentation is first used to obtain the feature word sets of the attached information and of the labeled place POIs. The sensitivity of a feature word is computed from the four extracted features. Because a labeled place POI is a short place text, we take the position of the feature word within the short text into account; the attached information of a map file is a whole passage of text, so there we do not consider the position of the feature word. From the extracted features, the sensitivity of a feature word in the map file's attached information and the sensitivity of a feature word in a labeled place POI are expressed as:
In the formula, ωi is the TF-IDF weight of the feature word in the text, Mij is the similarity between feature word i and sensitive word j, and Vj is the weight of sensitive word j.
In the formula, li is the position attribute of feature word i in the POI text.
A map contains many labeled place POIs. We obtain the labeled-place sensitivity MS(P) of the whole map from the sensitivities of the individual labeled places. The sensitivity of a single labeled place is computed from the sensitivities of the feature words in its POI text:
The labeled-place sensitivity MS(P) of the map is then computed from the sensitivities of the individual labeled places:
The sensitivity MS(F) of the attached information is computed from the sensitivities of the feature words extracted from it:
From these, the sensitivity of an Internet map file is computed as:
The invention starts from the observation that the carriers of sensitivity in a map file are mainly the map's attached information and the POI label text in the map file. Using natural language processing algorithms such as Chinese word segmentation, word similarity, and word weight computation, it extracts four features of the map text. From these four features it computes the sensitivity of the map's attached information and of the map's POI text set separately, and then computes the sensitivity value of the whole map file statistically. Because the Internet contains a large amount of map data in differing file formats, we implement the algorithm on the Spark parallel processing framework to improve execution efficiency; simulation tests show improvements in both detection accuracy and execution performance. It should be emphasized that the invention is a classified-content detection algorithm for map files and can better address the detection of classified map files on the Internet.
It should be understood that the specific embodiments above are intended to help those skilled in the art and other readers understand how the invention is implemented, and that the scope of protection of the invention is not limited to these particular statements and embodiments. Although the invention has been described in detail with reference to the drawings and embodiments, those skilled in the art will understand that it may still be modified or equivalently replaced. All technical solutions and improvements that do not depart from the spirit and scope of the invention shall fall within the scope of protection of this patent.
The invention discloses a Spark-based detection algorithm for classified maps on the Internet, in the fields of natural language processing and big-data analysis. It comprises four implementation stages: construction of the sensitive dictionary, map text preprocessing, map file feature extraction, and statistical calculation of map file sensitivity. First, the original map data is preprocessed with the Ansj Chinese word segmentation algorithm to obtain the feature words of the map file. Second, the sensitivity of the feature words is computed with a word similarity algorithm based on the Chinese thesaurus and with the TF-IDF word weight algorithm, and the map file features are extracted. Finally, from the map file features, the sensitivity of the map's attached information and of the POI labeled places in the map file is computed statistically to obtain the sensitivity value of the map file. To improve execution efficiency, the Ansj segmentation, the TF-IDF algorithm, and the statistical sensitivity calculation are all implemented on the Spark parallel computing framework, which makes it practical to monitor map data on the Internet in a timely manner. It should be emphasized that the invention detects map files of different formats on the Internet and is an effective detection algorithm for classified Internet maps.

Claims (7)

1. A Spark-based detection algorithm for classified maps on the Internet, comprising: a data preprocessing module, wherein the carriers of classified information in an Internet map file are mainly the attached information of the map file and the POI label information in the map file, the attached information being mainly the publisher's description of the map file and the POI labeled places being mainly the names of places in the map; a sensitive dictionary construction module, wherein the sensitive dictionary plays an important role in extracting the sensitivity of feature words, and some place information is sensitive only as a combination of words while each word considered alone is not sensitive; a text feature extraction module, which extracts the similarity between a feature word and the sensitive vocabulary, the weight of the feature word in the text, the position attribute of the feature word in the POI text, and the weight in the sensitive dictionary of the sensitive word corresponding to the feature word, and which determines the sensitivity of the corresponding map file from the sensitivity of the feature words and their attributes in the text; and a map file sensitivity calculation module, which computes the sensitivity of the map file statistically from the extracted feature word features. The detection algorithm first extracts the text data of the map file, extracts text features using natural language processing algorithms, and computes the sensitivity value of the corresponding map file jointly from the sensitivities of the extracted feature words.
2. The Spark-based detection algorithm for classified maps on the Internet according to claim 1, characterized in that the sensitive dictionary is constructed as follows: the object of detection is mainly map files, so the sensitive words are mainly words for places that the state does not allow to be disclosed, such as military bases and large national infrastructure sites. Observation of map POIs shows that some sensitive information is classified not through a single sensitive word but through a combination of words. For example, the word "rocket" appearing in place information may not by itself indicate classified content, but if the place information also contains words such as "research base", classified content may be present. Therefore, in addition to a sensitive dictionary of single words, a sensitive dictionary of combined words is defined, and constructing both dictionaries allows the sensitive place information in a map to be detected more comprehensively.
3. The Spark-based detection algorithm for classified Internet maps according to claim 1, characterized in that the sensitivity of each feature word is extracted after data preprocessing: the sensitivity value of a feature word is quantified by calculating its similarity to the sensitive words. If the feature word is identical to a sensitive word, its sensitivity is 1; if its similarity to a sensitive word reaches a given threshold, its sensitivity is taken to be that similarity value. Calculating similarity, rather than requiring an exact match, addresses incomplete coverage of the sensitive lexicon and near-synonyms that would otherwise go unrecognized. Words whose sensitivity exceeds the threshold are extracted for subsequent feature extraction, in order to judge the sensitivity of the text in which the feature word occurs.
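The thresholding rule in claim 3 might be sketched as follows. The claim does not name a similarity measure, so this sketch substitutes the character-level ratio from Python's standard `difflib.SequenceMatcher`; recognizing true near-synonyms would presumably need a semantic measure instead. The function name, the threshold of 0.8, and the choice to return 0 below the threshold are all assumptions.

```python
from difflib import SequenceMatcher


def word_sensitivity(feature_word, sensitive_words, threshold=0.8):
    """Quantify a feature word's sensitivity against a sensitive lexicon.

    Exact match -> 1.0; best similarity at or above the threshold -> that
    similarity; otherwise 0.0 (the below-threshold value is an assumption).
    """
    best = 0.0
    for sensitive in sensitive_words:
        if feature_word == sensitive:
            return 1.0
        # Stand-in measure: character-level similarity in [0, 1].
        ratio = SequenceMatcher(None, feature_word, sensitive).ratio()
        best = max(best, ratio)
    return best if best >= threshold else 0.0


print(word_sensitivity("rocket", ["rocket"]))
print(word_sensitivity("rockets", ["rocket"]))
print(word_sensitivity("museum", ["rocket"]))
```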
4. The Spark-based detection algorithm for classified Internet maps according to claim 1, characterized in that the weight of each feature word within its text is extracted. Different words contribute differently to the meaning of a text: the more representative a word is of the text, the greater its weight in that text. In a map text label, if a feature word is a sensitive word and carries a large weight in the text, the sensitivity value of the map file is correspondingly high; this weight therefore plays an important role in the subsequent sensitivity grading of map files.
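Claim 4 does not state how a word's weight in its text is computed. As one illustrative possibility only, the sketch below uses normalized term frequency; the function name `term_weights` is hypothetical, and a scheme such as TF-IDF could equally be intended.

```python
from collections import Counter


def term_weights(words):
    """Weight each word by its share of all word occurrences in the text.

    Normalized term frequency is an assumed stand-in; the claim does not
    specify the weighting scheme.
    """
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


print(term_weights(["rocket", "base", "rocket", "museum"]))
```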
5. The Spark-based detection algorithm for classified Internet maps according to claim 1, characterized in that the position weight of each feature word in the POI text, and the weight in the sensitive lexicon of the sensitive word corresponding to the feature word, are extracted. Analysis of POI text in maps shows that feature words at different positions in a POI text carry different weights. The invention defines three position attributes for a feature word in the text, {B, I, E}, indicating that the feature word is at the beginning, in the middle, or at the end of the POI text, respectively. In addition, different classified places in a map differ in sensitivity; for example, military POIs are more sensitive than some infrastructure POIs. Weights are therefore defined for the sensitive words and assigned according to the type of geographic object, so that different feature words can be associated with sensitive POI places of different kinds.
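The {B, I, E} position attribute and the per-word lexicon weight from claim 5 could be sketched as below. All numeric weights, the lexicon entries, and the function names are hypothetical; the claim states only that positions and geographic object types carry different weights, and that military POIs rank above some infrastructure POIs. Treating a single-word POI as "B" is also an assumption.

```python
# Hypothetical weights; the claim gives no numeric values.
POSITION_WEIGHT = {"B": 0.8, "I": 0.6, "E": 1.0}
# Hypothetical lexicon weights by geographic object type
# (military entry weighted above the infrastructure entry).
LEXICON_WEIGHT = {"military base": 1.0, "power station": 0.6}


def position_attribute(index, length):
    """Tag a word as B (beginning), I (middle) or E (end) of a POI text."""
    if index == 0:
        return "B"  # a single-word POI is tagged B by assumption
    if index == length - 1:
        return "E"
    return "I"


def positional_features(poi_words):
    """List (word, tag, position weight, lexicon weight) for lexicon words."""
    features = []
    for i, word in enumerate(poi_words):
        if word in LEXICON_WEIGHT:
            tag = position_attribute(i, len(poi_words))
            features.append((word, tag, POSITION_WEIGHT[tag], LEXICON_WEIGHT[word]))
    return features


print(positional_features(["east", "district", "military base"]))
```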
6. The Spark-based detection algorithm for classified Internet maps according to claim 1, characterized in that the map-file sensitivity is calculated as follows: using the text features defined in claims 3-5, the sensitivity of the map's attached ("satellite") information and the sensitivity of the POI location information in the map are each calculated statistically, and the sensitivity of the corresponding map file is then computed jointly from these two parts.
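Claim 6 says the two partial sensitivities are computed statistically and then combined, but it gives no formula. The following sketch shows one assumed way this could look: each feature word contributes the product of its four features, a text's score is the sum of those products, and the file score is a weighted blend of the attached ("satellite") information score and the highest POI score. The product, the sum, the use of `max`, and the blending factor `alpha` are all assumptions, not the patent's method.

```python
def text_sensitivity(features):
    """Score one text from its feature words.

    `features` is a list of tuples
    (similarity, text_weight, position_weight, lexicon_weight).
    Multiplying and summing is an assumed combination rule.
    """
    return sum(s * t * p * l for s, t, p, l in features)


def file_sensitivity(attached_score, poi_scores, alpha=0.5):
    """Blend the attached-information score with the POI scores.

    `alpha` and the use of the maximum POI score are hypothetical choices.
    """
    poi_part = max(poi_scores) if poi_scores else 0.0
    return alpha * attached_score + (1 - alpha) * poi_part


attached = text_sensitivity([(1.0, 0.5, 1.0, 1.0)])
print(file_sensitivity(attached, [0.2, 0.8]))
```

For attached information, which has no POI position, the position weight is simply set to 1.0 in this sketch.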
7. The Spark-based detection algorithm for classified Internet maps according to claim 1, characterized in that the algorithm proceeds as follows: first, the two parts of text information in an Internet map file are parsed, namely the attached ("satellite") information of the map file and the POI location entries in the map file. Both parts are then preprocessed, and the feature words in each text and their corresponding attributes in that text are extracted; according to the characteristics of map location text, four types of features are extracted for each feature word of the map text labels. The sensitivity of the attached information and of the POI location entries in the map is calculated from these features, and the two are then combined to obtain the sensitivity value of the corresponding map file. For the map files in a database, the algorithm outputs them sorted by sensitivity value.
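The final step of claim 7, scoring every map file and outputting the files in order of sensitivity, can be sketched in plain Python. The scoring function here is a deliberately simple placeholder (share of text entries containing a hypothetical sensitive term), not the four-feature computation of the claims; the data layout, names, and descending order are assumptions. On Spark, the same shape would correspond to an `RDD.map` over the files followed by `RDD.sortBy(..., ascending=False)`.

```python
SENSITIVE_TERMS = {"military base", "rocket"}  # hypothetical lexicon


def toy_score(map_file):
    """Placeholder score: share of text entries containing a sensitive term."""
    texts = [map_file["attached"]] + map_file["pois"]
    hits = sum(any(term in text for term in SENSITIVE_TERMS) for text in texts)
    return hits / len(texts)


def rank_map_files(map_files, score=toy_score):
    """Score every map file and return (name, score) pairs, highest first."""
    scored = [(name, score(doc)) for name, doc in map_files.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


files = {
    "city_parks": {"attached": "parks in the city", "pois": ["north park", "lake"]},
    "test_range": {"attached": "rocket test range", "pois": ["rocket research base"]},
}
print(rank_map_files(files))
```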
CN201811216505.6A 2018-10-18 2018-10-18 One kind being based on the internet Spark concerning security matters map detection algorithm Pending CN109446288A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811216505.6A CN109446288A (en) 2018-10-18 2018-10-18 One kind being based on the internet Spark concerning security matters map detection algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811216505.6A CN109446288A (en) 2018-10-18 2018-10-18 One kind being based on the internet Spark concerning security matters map detection algorithm

Publications (1)

Publication Number Publication Date
CN109446288A true CN109446288A (en) 2019-03-08

Family

ID=65546775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811216505.6A Pending CN109446288A (en) 2018-10-18 2018-10-18 One kind being based on the internet Spark concerning security matters map detection algorithm

Country Status (1)

Country Link
CN (1) CN109446288A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008169A (en) * 2014-05-30 2014-08-27 中国测绘科学研究院 Semanteme based geographical label content safe checking method and device
CN106445998A (en) * 2016-05-26 2017-02-22 达而观信息科技(上海)有限公司 Text content auditing method and system based on sensitive word
CN108319630A (en) * 2017-07-05 2018-07-24 腾讯科技(深圳)有限公司 Information processing method, device, storage medium and computer equipment
CN108519970A (en) * 2018-02-06 2018-09-11 平安科技(深圳)有限公司 The identification method of sensitive information, electronic device and readable storage medium storing program for executing in text

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334228A (en) * 2019-07-09 2019-10-15 广西壮族自治区基础地理信息中心 A kind of Internet Problems map screening method based on deep learning
CN110580416A (en) * 2019-09-11 2019-12-17 国网浙江省电力有限公司信息通信分公司 sensitive data automatic identification method based on artificial intelligence
CN110888972A (en) * 2019-10-27 2020-03-17 北京明朝万达科技股份有限公司 Sensitive content identification method and device based on Spark Streaming
CN111209735A (en) * 2020-01-03 2020-05-29 广州杰赛科技股份有限公司 Document sensitivity calculation method and device
CN111209735B (en) * 2020-01-03 2023-06-02 广州杰赛科技股份有限公司 Document sensitivity calculation method and device
WO2021142600A1 (en) * 2020-01-14 2021-07-22 华为技术有限公司 Image recognition method and related device
CN113396410A (en) * 2020-01-14 2021-09-14 华为技术有限公司 Image identification method and related equipment
CN112463804A (en) * 2021-02-02 2021-03-09 湖南大学 KDTree-based image database data processing method
CN112463804B (en) * 2021-02-02 2021-06-15 湖南大学 KDTree-based image database data processing method
CN114372152A (en) * 2022-01-05 2022-04-19 自然资源部地图技术审查中心 Rapid safety inspection method and device for electronic map POI
CN114372152B (en) * 2022-01-05 2024-08-16 自然资源部地图技术审查中心 Rapid security inspection method and device for electronic map POI

Similar Documents

Publication Publication Date Title
CN114610515B (en) Multi-feature log anomaly detection method and system based on log full semantics
CN109446288A (en) One kind being based on the internet Spark concerning security matters map detection algorithm
CN109800310B (en) Electric power operation and maintenance text analysis method based on structured expression
CN110334213B (en) Method for identifying time sequence relation of Hanyue news events based on bidirectional cross attention mechanism
CN103853738B (en) A kind of recognition methods of info web correlation region
CN107463658B (en) Text classification method and device
CN110516067A (en) Public sentiment monitoring method, system and storage medium based on topic detection
CN108280130A (en) A method of finding sensitive data in text big data
CN108628828A (en) A kind of joint abstracting method of viewpoint and its holder based on from attention
CN102298635A (en) Method and system for fusing event information
CN106844331A (en) Sentence similarity calculation method and system
CN110222250B (en) Microblog-oriented emergency trigger word identification method
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN104063387A (en) Device and method abstracting keywords in text
CN106599054A (en) Method and system for title classification and push
JP5426868B2 (en) Numerical expression processing device
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN109918621A (en) Newsletter archive infringement detection method and device based on digital finger-print and semantic feature
CN113449111B (en) Social governance hot topic automatic identification method based on time-space semantic knowledge migration
CN113590810A (en) Abstract generation model training method, abstract generation device and electronic equipment
CN114997288A (en) Design resource association method
CN116719683A (en) Abnormality detection method, abnormality detection device, electronic apparatus, and storage medium
CN111079582A (en) Image recognition English composition running question judgment method
Indarapu et al. Comparative analysis of machine learning algorithms to detect fake news
CN113591476A (en) Data label recommendation method based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190308