CN109508557A

CN109508557A - A file path keyword recognition method associated with user privacy

Info

Publication number: CN109508557A
Application number: CN201811228942.XA
Authority: CN
Inventors: 冯云; 崔翔; 刘宝旭; 刘潮歌; 刘奇旭
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2018-10-22
Filing date: 2018-10-22
Publication date: 2019-03-22

Abstract

The present invention provides a method for recognizing file path keywords associated with user privacy, comprising the following steps: obtaining the set of file paths to be processed, with all file paths from one user's computer system forming one group; preprocessing the file paths, including case unification, entry segmentation, and stop-word filtering; recognizing file path keywords using three algorithms, namely a context relation method applied to complete paths, a regular-expression matching method, and a term frequency-inverse document frequency (TF-IDF) method applied to the entries obtained by segmenting the file paths; assigning different weights to the three algorithms by an expert scoring method, normalizing the weights, and scoring each keyword; and, based on the keyword scores, obtaining the keywords of this group of file paths that are associated with user privacy, ranked by score.

Description

A file path keyword recognition method associated with user privacy

Technical field

The present invention relates to the fields of computer big data and text processing, and in particular to a method for recognizing file path keywords associated with user privacy.

Background technique

A keyword is a word that expresses one or more topics of a passage of text, and it plays a key role in determining the text's category and conveying its content. In the big data era, keyword recognition plays an important role in text mining, information retrieval, natural language processing, and related fields. Many techniques for keyword recognition are already fairly mature or steadily improving, including traditional statistical methods such as term frequency-inverse document frequency (TF-IDF) and information gain, and machine learning algorithms such as the LDA topic model and RAKE. Recognizing keywords in text and then processing and analyzing them further makes possible large-scale text classification, text summarization, text sentiment analysis, and inference of a text's source.

Current keyword recognition techniques all center on natural-language text, and they need text of a certain length to achieve good recognition results. Besides natural-language text, however, there are many other kinds of text: programming-language text that carries semantics, such as code and database commands, and text that has structure but no semantics, such as network links and file paths. For these kinds of text, keywords differ from those of natural-language text; in the general sense the concept of a keyword may not even exist, and it applies only in particular scenarios. Most such texts are also short. Corresponding keyword recognition techniques are therefore rare.

As for file paths, every computer holds a large number of files and therefore a large number of file paths. Because a computer belongs to an individual or an organization, its file paths readily contain clues related to the owner's identity that can be used to identify a person or an organization, that is, user privacy. Put simply, in the scenario of association with user privacy, a file path keyword is a word that can serve as a clue for identifying the owner. Since file paths can remain in documents the user edits and in programs the user develops, leaking user privacy, research on recognizing file path keywords associated with user privacy is of practical value.

Summary of the invention

In view of the above problems, the present invention proposes a method for recognizing file path keywords associated with user privacy. The method recognizes keywords in file paths; these keywords can identify the system owner and are thus associated with user privacy.

To achieve the above object, the present invention adopts the following technical solution:

A method for recognizing file path keywords associated with user privacy, comprising the following steps:

obtaining the set of file paths to be processed, with all file paths from one user's computer system forming one group;

preprocessing the file paths, including case unification, entry segmentation, and stop-word filtering;

recognizing file path keywords using three algorithms: a context relation method applied to complete paths, a regular-expression matching method, and a term frequency-inverse document frequency method applied to the entries obtained by segmenting the file paths;

assigning different weights to the three algorithms, normalizing the weights, and scoring each keyword;

obtaining the keywords of this group of file paths according to the keyword scores, ranked by score.

Further, entry segmentation means that, according to the characteristics of file paths, entries are split at the backslash "\", the forward slash "/", and the colon ":", while whitespace contained in a directory name or file name at any level is not split.

Further, the stop words removed by the stop-word filtering include default drive letters and file suffixes.

Further, the context relation method includes the following three specific algorithms:

1) recognizing keywords using adjacent-position marker words, an adjacent-position marker word being a word located immediately before or after the keyword in the same sequence;

2) recognizing keywords using scope marker words, a scope marker word being a word that denotes a category of files;

3) recognizing keywords using end words, an end word being the last entry of each path, i.e. the file name;

the sequence being obtained by numbering each entry in the path in directory-structure order, from parent to child.

Further, the regular-expression matching method matches all entries appearing in the file paths that have particular textual features; such entries include email addresses, dates, and purely numeric entries.

Further, the term frequency-inverse document frequency method comprises the steps of:

when the number of file path groups obtained is less than a threshold, calculating the inverse document frequency of all entries in the target file paths using an authoritative data set together with the file paths being processed;

when the number of file path groups obtained is greater than or equal to the threshold, calculating the inverse document frequency of the entries directly from the file paths being processed;

calculating the term frequency-inverse document frequency value of each entry with respect to each group;

for each group, taking the average of the term frequency-inverse document frequency values of all entries;

taking the entries whose term frequency-inverse document frequency value is higher than that average as keywords.

Further, the authoritative data set consists of multiple groups of file paths collected in advance from different sources.

Further, an expert scoring method is used to assign different weights to the regular-expression matching method, the term frequency-inverse document frequency method, and the three specific algorithms of the context relation method;

The expert scoring method is as follows: the above algorithms are evaluated on three indicators, namely accuracy, validity, and stability; each algorithm is given a score on each indicator; the three scores are added; and the total for each algorithm is normalized to a value between 0 and 1, which gives the weight of that algorithm. Here, accuracy refers to whether the algorithm can accurately recognize the desired results; validity refers to how effective an entry recognized by the algorithm is as evidence that the entry is a keyword; and stability refers to how strongly the algorithm is affected by changes in the input data set.

A system for recognizing file path keywords associated with user privacy, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program comprising instructions for performing each step of the above method.

A computer-readable storage medium storing a computer program, the computer program comprising instructions which, when executed by a processor of a server, cause the server to perform each step of the above method.

Because file paths are generally considered not to have keywords, the prior art can hardly use keywords to discover the user privacy contained in file paths. The present invention addresses the user privacy scenario and proposes a definition of file path keywords. By recognizing the keywords in file paths that relate to the system owner's privacy, it identifies the system owner and establishes the association with user privacy, remedying the deficiency of the prior art.

Brief description of the drawings

Fig. 1 is an overall flowchart of the method for recognizing file path keywords associated with user privacy in one embodiment of the invention.

Fig. 2 is a schematic structural diagram of the file path keyword recognition algorithms in one embodiment of the invention.

Fig. 3 is the flow diagram of context relation method in one embodiment of the invention.

Fig. 4 is the flow diagram of term frequency-inverse document frequency method in one embodiment of the invention.

Detailed description of the embodiments

To help those skilled in the art better understand the technical solutions in the embodiments of the present invention, and to make the objects, features, and advantages of the invention clearer, the technical core of the invention is described in further detail below with reference to the drawings and examples.

This embodiment provides a method for recognizing file path keywords associated with user privacy. Its flowchart is shown in Fig. 1, and it comprises the following steps:

Step 100: Obtain the set of file paths to be processed, with all file paths from one user's computer system forming one group.

Step 200: Preprocess the file paths, including case unification, entry segmentation, and stop-word filtering. Specifically, paths are converted uniformly to upper case or lower case; according to the characteristics of file paths, entries are split at the backslash "\", the forward slash "/", and the colon ":", while whitespace contained in a directory name or file name at any level is not split; stop words include default drive letters such as "C" and "D", file suffixes, and the like.
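
The preprocessing of Step 200 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess`, the choice of lower case, the treatment of any single leading letter as a drive letter, and the rule that the suffix is the text after the last dot of the final entry are all assumptions made for the example.

```python
import re
import string

# Assumed stop words: any single-letter drive name in first position.
# The patent itself names only "C" and "D" as examples.
DRIVE_LETTERS = set(string.ascii_lowercase)


def preprocess(path: str) -> list[str]:
    """Lower-case a path, split it into entries, and drop stop words."""
    path = path.lower()
    # Split only at backslash, forward slash, and colon; whitespace inside
    # a directory or file name is deliberately left untouched.
    entries = [e for e in re.split(r"[\\/:]", path) if e]
    cleaned = []
    for i, entry in enumerate(entries):
        if i == 0 and entry in DRIVE_LETTERS:
            continue  # drop the default drive letter
        if i == len(entries) - 1:
            # Assumed suffix rule: strip the text after the last dot.
            stem, dot, _suffix = entry.rpartition(".")
            if dot and stem:
                entry = stem
        cleaned.append(entry)
    return cleaned
```

For a path such as `C:\Users\Alice\My Documents\Report 2018.docx`, this sketch keeps "my documents" and "report 2018" as single entries because whitespace is never a split point.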

Step 300: Perform keyword recognition on the file paths using three algorithms: the context relation method applied to complete paths, the regular-expression matching method, and the term frequency-inverse document frequency method applied to the entries obtained by segmenting the file paths.

Step 400: Apply the expert scoring method: assign each algorithm a different weight according to the three indicators of accuracy, validity, and stability, normalize the weights, and score each keyword.

Step 500: Based on the keyword scores, obtain the keywords of this group of file paths, ranked by score. The resulting keywords are words that distinguish and mark the system owner; they can expose information such as the owner's name, nickname, internet platform accounts, industry, and affiliated organization, and are thereby associated with user privacy.
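
The text does not state how the per-algorithm weights are combined into a keyword score. The sketch below assumes one plausible rule, namely that a keyword's score is the sum of the weights of the algorithms that recognized it; the function name `score_keywords`, the algorithm labels, and the weights in the usage note are hypothetical.

```python
from collections import defaultdict


def score_keywords(hits: dict[str, set[str]],
                   weights: dict[str, float]) -> list[tuple[str, float]]:
    """Rank keywords by summed algorithm weight (assumed combination rule).

    hits maps an algorithm label to the keywords that algorithm recognized;
    weights maps the same labels to normalized weights between 0 and 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for algorithm, keywords in hits.items():
        for keyword in keywords:
            scores[keyword] += weights[algorithm]
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

With hypothetical weights `{"context": 1.0, "tfidf": 0.75, "regex": 0.5}`, an entry found by both the context relation method and the TF-IDF method would rank above one found by a single algorithm.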

Fig. 2 is a schematic diagram of the file path keyword recognition algorithms, described as follows:

Step 310: the context relation method. It decides keywords by using feature words that occupy specific contextual positions across all file paths. A file path is text without semantics, but it has a definite structure, and the context relation method is built on that structure.

Step 320: the term frequency-inverse document frequency (TF-IDF) method. A TF-IDF value is calculated for each entry in order to find the entries most representative of a document while discounting common everyday words. The importance of an entry is directly proportional to the number of times it occurs within one group of data and inversely proportional to the number of times it occurs across the overall data. The more frequently an entry occurs within one group of data, the higher its term frequency; the more groups of the overall data an entry appears in, the lower its inverse document frequency. For example, "Users" may occur very frequently in the file paths of one system, but it has little discriminating power because it also occurs very frequently in the overall data. A proper noun such as a company name, by contrast, typically occurs frequently in the file paths of one system but rarely in the overall data, so its TF-IDF value is relatively high and it is therefore more important.
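
The text describes TF-IDF only qualitatively. For reference, one common textbook formulation is given below; it is an assumption that the method uses this exact form. Here $n_{t,g}$ is the number of occurrences of entry $t$ in group $g$, and $N$ is the total number of groups in the overall data.

```latex
\mathrm{tf}(t, g) = \frac{n_{t,g}}{\sum_{k} n_{k,g}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\lvert \{\, g' : t \in g' \,\} \rvert}, \qquad
\mathrm{tfidf}(t, g) = \mathrm{tf}(t, g) \cdot \mathrm{idf}(t)
```

Under this form, an entry such as "Users" that appears in every group has $\mathrm{idf} = \log 1 = 0$, which matches the observation that it carries no discriminating power.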

Step 330: the regular-expression matching method. Regular-expression matching is used to match all entries appearing in the file paths that have particular textual features, for example: email addresses, which usually have the structure "username@domain"; dates, which take many forms but all have recognizable textual features; and purely numeric entries within a limited length range, such as Tencent QQ account numbers.
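
A minimal sketch of Step 330 follows. The patent does not give the expressions themselves, so the three patterns, the 5-to-11-digit length range, and the function name `regex_keywords` are illustrative assumptions; a practical pattern set would need to cover many more date formats.

```python
import re

# Illustrative patterns only; none of them comes from the patent text.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
DATE = re.compile(r"\d{4}[-_.]?\d{1,2}[-_.]?\d{1,2}")
# Assumed length range for account-like numeric entries.
NUMBER = re.compile(r"\d{5,11}")


def regex_keywords(entries: list[str]) -> list[tuple[str, str]]:
    """Return (entry, kind) pairs for entries matching one of the patterns."""
    found = []
    for entry in entries:
        if EMAIL.fullmatch(entry):
            found.append((entry, "email"))
        elif DATE.fullmatch(entry):
            found.append((entry, "date"))
        elif NUMBER.fullmatch(entry):
            found.append((entry, "number"))
    return found
```

Because the date pattern is tried before the numeric one, an eight-digit entry such as a compact date is labeled as a date; this ordering is a design choice of the sketch, and the ambiguity it hides is one reason pattern matching cannot be entirely accurate.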

Fig. 3 is a flowchart of the context relation method, as follows:

Step 311: Serialize each path, that is, number the entries in each path in directory-structure order from parent to child.

Step 312: Recognize keywords using adjacent-position marker words, which comprise preceding marker words and following marker words. A preceding marker word is a word positioned before the keyword in the sequence. Taking Windows 7 as an example, the user folder named after the operating system account name is used to store files and software data, and this folder is a subfolder of the folder named "Users"; the entry after "Users" is therefore very likely the operating system account name. A following marker word is a word positioned after the keyword in the sequence. For example, for Tencent's QQ platform, the user folder named after the QQ account number sits above the folder named "QQ".

Step 313: Recognize keywords using scope marker words. People tend to organize documents by category; for example, a folder named "work" mostly stores work-related files. When such a noun appears in a file path, the path entries that follow it may therefore relate to the system owner's industry or organization.

Step 314: Recognize keywords using end words. The last entry of each path, i.e. the file name, is also treated as a keyword. It usually has a low term frequency and a fixed position, which makes it difficult to obtain with the other algorithms, yet it may well relate directly to the system owner's occupation.
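
Steps 311 to 314 can be sketched as below, operating on one path that has already been split into entries. The marker word lists contain only the examples mentioned in the text ("Users", "QQ", "work"); the list names, the function name `context_keywords`, and the rule that every entry after a scope marker word is collected are assumptions of this sketch.

```python
# Hypothetical marker word lists built from the examples in the text.
PRECEDING_MARKERS = {"users"}  # the keyword is the entry right after
FOLLOWING_MARKERS = {"qq"}     # the keyword is the entry right before
SCOPE_MARKERS = {"work"}       # later entries fall within this scope


def context_keywords(entries: list[str]) -> dict[str, set[str]]:
    """Apply the three context relation algorithms to one serialized path.

    The list index plays the role of the parent-to-child sequence number.
    """
    found: dict[str, set[str]] = {"adjacent": set(), "scope": set(), "end": set()}
    for i, entry in enumerate(entries):
        if entry in PRECEDING_MARKERS and i + 1 < len(entries):
            found["adjacent"].add(entries[i + 1])
        if entry in FOLLOWING_MARKERS and i > 0:
            found["adjacent"].add(entries[i - 1])
        if entry in SCOPE_MARKERS:
            found["scope"].update(entries[i + 1:])
    if entries:
        found["end"].add(entries[-1])  # the file name
    return found
```

Keeping the three result sets separate lets each specific algorithm receive its own weight later, which is how the expert scoring treats them.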

Fig. 4 is a flowchart of the term frequency-inverse document frequency method; the detailed process is as follows:

Step 321: By the principle of this algorithm, term frequency-inverse document frequency values calculated from a single group of file paths are not representative; a large amount of data is needed. A threshold is therefore set. When the number of system groups obtained is less than the threshold, the data of an authoritative data set must be used together with the input file paths to calculate the inverse document frequency of all entries in the target system's paths. The authoritative data set can be multiple groups of file paths collected deliberately. The collection sources should be diverse, to avoid the situation where multiple groups from the same source give some proper noun an inflated document frequency. Applying the term frequency-inverse document frequency method described here to extract privacy-associated keywords from the authoritative data set itself should yield results generally accepted as valid. The authoritative data set needs to reach a certain order of magnitude so that it can tolerate some abnormal samples without adverse effect on the results. The threshold should be determined by repeated experiments: keyword extraction with the term frequency-inverse document frequency method is run multiple times on data sets of different sizes to determine the number of data groups at which valid results are obtained, and the threshold is set from the minimum or the mean over those experiments.

Step 322: When the number of file path groups obtained is greater than or equal to the threshold, no other data set is needed and the calculation uses the input data directly.

Step 323: Calculate the term frequency-inverse document frequency value of each entry with respect to each group.

Step 324: For each group, take the average of the term frequency-inverse document frequency values of all entries.

Step 325: Entries whose term frequency-inverse document frequency value is higher than the average are regarded as keywords.
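
Steps 321 to 325 can be sketched as follows, assuming the common formulation in which term frequency is a relative count and inverse document frequency is the logarithm of the number of groups divided by the number of groups containing the entry. The function name `tfidf_keywords`, the parameter names, and the default threshold of 3 are hypothetical; the text says the real threshold should come from repeated experiments.

```python
import math
from collections import Counter


def tfidf_keywords(groups: list[list[str]],
                   reference_groups: list[list[str]] = (),
                   threshold: int = 3) -> list[set[str]]:
    """Return, for each input group, the entries with above-average TF-IDF.

    Each group is the flat list of entries from all paths of one system.
    """
    corpus = list(groups)
    if len(groups) < threshold:
        # Step 321: too few groups, so borrow the authoritative data set.
        corpus += list(reference_groups)
    n_groups = len(corpus)
    doc_freq = Counter()
    for group in corpus:
        doc_freq.update(set(group))  # count each entry once per group

    results = []
    for group in groups:
        counts = Counter(group)
        total = sum(counts.values())
        # Step 323: TF-IDF of each entry with respect to this group.
        scores = {t: (c / total) * math.log(n_groups / doc_freq[t])
                  for t, c in counts.items()}
        # Step 324: per-group mean; Step 325: keep entries above it.
        mean = sum(scores.values()) / len(scores) if scores else 0.0
        results.append({t for t, s in scores.items() if s > mean})
    return results
```

Reference groups contribute only to the document frequencies; keywords are reported for the input groups alone, in line with the goal of scoring the target system's entries.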

Table 1 below shows the scores given by the expert scoring method in one embodiment, described as follows:

Table 1

The expert scoring method scores the above five algorithms on the three indicators using a ten-point scale.

Accuracy assesses whether the algorithm can accurately recognize the desired results; it evaluates the performance of the algorithm itself. Because the regular-expression matching method relies on pattern matching, its results cannot be entirely accurate, so points are deducted as appropriate; the other four algorithms are given 10 points.

Validity assesses how effective the entries recognized by the algorithm are as evidence of being keywords; it evaluates how well the algorithm performs at keyword recognition, that is, whether the keywords it recognizes really are keywords associated with user privacy. Scores are given according to the principle, characteristics, and effect of each algorithm.

Stability assesses whether the algorithm is easily affected by changes in the input data set. Only the term frequency-inverse document frequency method needs the overall data for its calculation and may therefore be affected, so points are deducted as appropriate.

The scores of each algorithm on the three indicators are added to give a total, which is then normalized to a value between 0 and 1.
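
The summation and normalization can be sketched as follows. The normalization rule is not specified in the text; this sketch assumes division by the maximum possible total of 30 (three indicators on a ten-point scale). The scores in the usage note are hypothetical placeholders, not the values of Table 1.

```python
MAX_TOTAL = 30  # three indicators, each on a ten-point scale


def normalize_weights(scores: dict[str, tuple[int, int, int]]) -> dict[str, float]:
    """Sum the (accuracy, validity, stability) scores and scale into [0, 1].

    Dividing by the maximum possible total is an assumed normalization.
    """
    return {name: sum(triple) / MAX_TOTAL for name, triple in scores.items()}
```

For example, hypothetical scores of (8, 8, 10) for the regular-expression matching method and (10, 8, 7) for the term frequency-inverse document frequency method would yield weights of about 0.87 and 0.83.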

It should be noted that the expert scoring is not unique and can be changed according to the user's experience and how well each algorithm is implemented. For example, as the matching patterns of the regular-expression matching method are refined, its accuracy score can be raised.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with examples, those of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solution that do not depart from its spirit and scope shall all be covered by the claims of the present invention.

Claims

1. A method for recognizing file path keywords associated with user privacy, comprising the following steps:

recognizing file path keywords using three algorithms: a context relation method applied to complete paths, a regular-expression matching method, and a term frequency-inverse document frequency method applied to the entries obtained by segmenting the file paths;

2. The method according to claim 1, characterized in that the entry segmentation means that, according to the characteristics of file paths, entries are split at the backslash "\", the forward slash "/", and the colon ":", while whitespace contained in a directory name or file name at any level is not split.

3. The method according to claim 1, characterized in that the stop words removed by the stop-word filtering include default drive letters and file suffixes.

4. The method according to claim 1, characterized in that the regular-expression matching method matches all entries appearing in the file paths that have particular textual features, such entries including email addresses, dates, and purely numeric entries.

5. The method according to claim 1, characterized in that the term frequency-inverse document frequency method comprises the steps of:

when the number of file path groups obtained is less than a threshold, calculating the inverse document frequency of all entries in the target file paths using an authoritative data set together with the file paths being processed;

when the number of file path groups obtained is greater than or equal to the threshold, calculating the inverse document frequency of the entries directly from the file paths being processed;

6. The method according to claim 5, characterized in that the authoritative data set consists of multiple groups of file paths collected in advance from different sources.

7. The method according to claim 1, characterized in that the context relation method includes the following three specific methods:

1) recognizing keywords using adjacent-position marker words, an adjacent-position marker word being a word located immediately before or after the keyword in the same sequence;

3) recognizing keywords using end words, an end word being the last entry of each path;

8. The method according to claim 7, characterized in that an expert scoring method is used to assign different weights to the regular-expression matching method, the term frequency-inverse document frequency method, and the three specific algorithms of the context relation method;

the expert scoring method being as follows: the above algorithms are evaluated on three indicators, namely accuracy, validity, and stability; each algorithm is given a score on each indicator; the three scores are added; and the total for each algorithm is normalized to a value between 0 and 1, which gives the weight of that algorithm; wherein accuracy refers to whether the algorithm can accurately recognize the desired results, validity refers to how effective an entry recognized by the algorithm is as evidence that the entry is a keyword, and stability refers to how strongly the algorithm is affected by changes in the input data set.

9. A system for recognizing file path keywords associated with user privacy, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program comprising instructions for performing each step of the method according to any one of claims 1 to 8.

10. A computer-readable storage medium storing a computer program, the computer program comprising instructions which, when executed by a processor of a server, cause the server to perform each step of the method according to any one of claims 1 to 8.