CN109508557A - Method for identifying file path keywords associated with user privacy - Google Patents

Publication number: CN109508557A
Authority: CN (China)
Prior art keywords: keyword, file path, entry, word, path
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201811228942.XA
Other languages: Chinese (zh)
Inventors: 冯云, 崔翔, 刘宝旭, 刘潮歌, 刘奇旭
Current Assignee: Institute of Information Engineering of CAS (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Institute of Information Engineering of CAS
Priority date: 2018-10-22 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2018-10-22
Publication date: 2019-03-22
Application filed by: Institute of Information Engineering of CAS

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Abstract

The present invention provides a method for identifying file path keywords associated with user privacy, comprising the following steps: obtaining the set of file paths to be processed, with all file paths from one user's computer system forming one group; preprocessing the file paths, including case unification, entry segmentation, and stop-word filtering; identifying file path keywords with three algorithms, namely a context relation method applied to full paths, a regular expression matching method, and a term frequency-inverse document frequency (TF-IDF) method applied to the entries obtained by segmenting the file paths; assigning different weights to the three algorithms by expert scoring, normalizing the weights, and scoring each keyword; and, according to the keyword scores, obtaining the keywords of the group of file paths that are associated with user privacy, in order of score.

Description

Method for identifying file path keywords associated with user privacy
Technical field
The present invention relates to the fields of computer big data and text processing, and in particular to a method for identifying file path keywords associated with user privacy.
Background
A keyword is one of one or more words that can express the subject content of a piece of text, and it plays a key role in determining the category of a text and conveying its content. In the era of big data, keyword identification technology plays an important role in fields such as text mining, information retrieval, and natural language processing. Many techniques for keyword identification are by now fairly mature and continue to improve, including traditional statistical methods such as term frequency-inverse document frequency (TF-IDF) and information gain, and machine learning algorithms such as the LDA topic model and RAKE. Identifying the keywords of a text and analyzing them further enables large-scale text classification, text summarization, sentiment analysis, and inference of text sources.
Current keyword identification techniques all center on natural language text, and the text must reach a certain length before good results are achieved. Besides natural language text, however, there are many other kinds of text, for example programming language text that carries semantics, such as code and database instructions, and structured text without semantics, such as network links and file paths. For such text, keywords differ from those of natural language text; in the general sense the concept of a keyword does not even exist for it and applies only in special scenarios. Most of these texts are also short, so corresponding keyword identification techniques are rare.
As for file paths, every computer holds a large number of files and therefore a large number of file paths. A computer belongs to a person or an organization, so its file paths readily contain clues related to the owner's identity that can be used to identify a person or an organization, that is, user privacy. Put simply, in the scenario of association with user privacy, a file path keyword is a word that can serve as a clue for identifying the owner. Because file paths can be retained in documents the user edits and in programs the user develops, leaking user privacy, research on identifying file path keywords associated with user privacy is worthwhile.
Summary of the invention
In view of the above problem, the present invention proposes a method for identifying file path keywords associated with user privacy. The method identifies keywords in file paths that can reveal the identity of the system owner and are thus associated with user privacy.
To achieve the above object, the present invention adopts the following technical solution:
A method for identifying file path keywords associated with user privacy, comprising the following steps:
obtaining the set of file paths to be processed, with all file paths from one user's computer system forming one group;
preprocessing the file paths, including case unification, entry segmentation, and stop-word filtering;
identifying file path keywords with three algorithms: a context relation method applied to full paths, a regular expression matching method, and a term frequency-inverse document frequency (TF-IDF) method applied to the entries obtained by segmenting the file paths;
assigning different weights to the three algorithms, normalizing the weights, and scoring each keyword;
according to the keyword scores, obtaining the keywords of the group of file paths in order of score.
Further, entry segmentation means splitting a path into entries at the backslash "\", the forward slash "/", and the colon ":", according to the characteristics of file paths; whitespace contained in a directory name or file name at any level is not used for splitting.
Further, the stop words removed by stop-word filtering include default drive letters and file extensions.
Further, the context relation method comprises the following three specific algorithms:
1) identifying keywords using adjacent marker words, an adjacent marker word being the word immediately before or after the keyword in the same sequence;
2) identifying keywords using scope marker words, a scope marker word being a word that denotes a category of files;
3) identifying keywords using end words, the end word being the last entry of each path, i.e. the file name;
wherein the sequence is obtained by numbering the entries of a path from parent to child following the directory structure.
Further, the regular expression matching method matches all entries in the file paths that have certain textual features; such entries include e-mail addresses, dates, and purely numeric entries.
Further, the TF-IDF method comprises the following steps:
when the number of file path groups obtained is less than a threshold, computing the inverse document frequency of every entry in the target file paths using an authoritative dataset together with the file paths being processed;
when the number of file path groups obtained is greater than or equal to the threshold, computing the inverse document frequency of the entries directly from the file paths being processed;
computing the TF-IDF value of each entry with respect to each group;
for each group, taking the average of the TF-IDF values of all entries;
taking the entries whose TF-IDF value is higher than that average as keywords.
Further, the authoritative dataset consists of multiple groups of file paths collected in advance from different sources.
Further, expert scoring is used to assign different weights to the regular expression matching method, the TF-IDF method, and the three specific algorithms of the context relation method;
The expert scoring is as follows: the above algorithms are evaluated on three indicators, namely accuracy, validity, and stability; each algorithm is given a score on each indicator, the three scores are added, and the total of each algorithm is normalized to a value between 0 and 1, which gives the weight of that algorithm. Accuracy refers to whether the algorithm can accurately identify the result it is looking for; validity refers to how effective the entries identified by the algorithm are as evidence that they are keywords; stability refers to how strongly the algorithm is affected by changes in the input dataset.
A system for identifying file path keywords associated with user privacy, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program comprising instructions for executing each step of the above method.
A computer-readable storage medium storing a computer program, the computer program comprising instructions which, when executed by a processor of a server, cause the server to execute each step of the above method.
Because file paths are generally considered to have no keywords, the prior art has difficulty discovering user privacy in file paths by way of keywords. For the user privacy scenario, the present invention proposes a definition of file path keywords and identifies the keywords in file paths that relate to the system owner's privacy, thereby identifying the owner and associating the paths with user privacy, which remedies the deficiency of the prior art.
Brief description of the drawings
Fig. 1 is the overall flowchart of the method for identifying file path keywords associated with user privacy in one embodiment of the invention.
Fig. 2 is a structural schematic diagram of the file path keyword identification algorithms in one embodiment of the invention.
Fig. 3 is a flow diagram of the context relation method in one embodiment of the invention.
Fig. 4 is a flow diagram of the TF-IDF method in one embodiment of the invention.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solution in the embodiments of the present invention, and to make the objects, features, and advantages of the invention clearer, the core of the technique is described in further detail below with reference to the drawings and examples.
This embodiment provides a method for identifying file path keywords associated with user privacy, whose flowchart is shown in Fig. 1, specifically comprising the following steps:
Step 100: Obtain the set of file paths to be processed, with all file paths from one user's computer system forming one group.
Step 200: Preprocess the file paths, including case unification, entry segmentation, and stop-word filtering. Specifically, paths are converted uniformly to upper case or lower case; according to the characteristics of file paths, entries are split at the backslash "\", the forward slash "/", and the colon ":", and whitespace contained in a directory name or file name at any level is not used for splitting; stop words include default drive letters such as "C" and "D", file extensions, and the like.
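As a non-limiting illustration, the preprocessing of step 200 might be sketched as follows. The function name `preprocess`, the set `DRIVE_LETTERS`, the choice of lower case, and the decision to strip the extension only from the last entry are assumptions made for this sketch and are not specified by the embodiment.

```python
import os
import re
from typing import List

# Hypothetical stop-word list of default drive letters (an assumption).
DRIVE_LETTERS = {"c", "d", "e", "f"}


def preprocess(path: str) -> List[str]:
    """Return the entries of one file path after the three preprocessing steps."""
    lowered = path.lower()  # case unification (lower case chosen here)
    # Entry segmentation: split only on "\", "/" and ":"; spaces inside a
    # directory or file name are deliberately not split points.
    entries = [e for e in re.split(r"[\\/:]", lowered) if e]
    # Stop-word filtering, part 1: drop a leading drive letter.
    if entries and entries[0] in DRIVE_LETTERS:
        entries = entries[1:]
    # Stop-word filtering, part 2: strip the extension from the file name.
    if entries:
        stem, _extension = os.path.splitext(entries[-1])
        entries[-1] = stem
    return entries


windows_entries = preprocess(r"C:\Users\Zhang San\Documents\Report 2018.docx")
unix_entries = preprocess("/home/alice/notes.txt")
```

Note that "zhang san" and "report 2018" each remain a single entry, because whitespace is not a split point.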
Step 300: Identify keywords in the file paths with three algorithms: a context relation method applied to full paths, a regular expression matching method, and a TF-IDF method applied to the entries obtained by segmenting the file paths.
Step 400: Apply expert scoring: assign each algorithm a different weight according to the three indicators of accuracy, validity, and stability, normalize the weights, and score each keyword.
Step 500: According to the keyword scores, obtain the keywords of the group of file paths in order of score. The resulting keywords are words that distinguish and identify the system owner; they can expose information such as the owner's name, nickname, internet platform accounts, industry, and employer, and are thereby associated with user privacy.
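The embodiment does not spell out how the algorithm weights turn into keyword scores. One plausible reading, shown below purely as an assumption, is that a keyword's score is the sum of the normalized weights of every algorithm that identified it. The function `score_keywords`, the algorithm names, the weights, and the sample keywords are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def score_keywords(hits: Dict[str, Set[str]],
                   weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sum, per keyword, the weights of the algorithms that identified it."""
    scores: Dict[str, float] = defaultdict(float)
    for algorithm, keywords in hits.items():
        for keyword in keywords:
            scores[keyword] += weights[algorithm]
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical normalized weights and algorithm outputs, for illustration only.
weights = {"adjacent_marker": 0.5, "regex": 0.25, "tfidf": 0.25}
hits = {
    "adjacent_marker": {"zhang san"},
    "regex": {"123456789"},
    "tfidf": {"zhang san", "acme corp"},
}
ranking = score_keywords(hits, weights)
```

Under this assumption, an entry found by several algorithms ranks above an entry found by only one of them.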
Fig. 2 is a schematic diagram of the file path keyword identification algorithms, described as follows:
Step 310: the context relation method. It makes keyword decisions using feature words that occur at specific contextual positions across all file paths. A file path is text without semantics, but it has a definite structure, and the context relation method is built on that structure.
Step 320: the TF-IDF method. TF-IDF is computed for each entry to assess which entries are most representative of a document while avoiding the influence of words in general use. The importance of an entry is directly proportional to how often it occurs within one group of data and inversely proportional to how often it occurs in the overall data. The more frequently an entry occurs within a group, the higher its term frequency; the more groups of the overall data an entry occurs in, the lower its inverse document frequency. For example, "Users" may occur very frequently in the file paths of one system, but it is not highly distinctive because it also occurs very frequently in the overall data. A proper noun such as a company name, by contrast, usually occurs frequently in the file paths of one system and infrequently in the overall data, so its TF-IDF value is relatively high and it has higher importance.
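The embodiment does not state a formula. Assuming the common textbook definition of TF-IDF, with each group of file paths playing the role of a document, the relationship described above can be written as follows, where $n_{t,g}$ is the number of occurrences of entry $t$ in group $g$ and $N$ is the total number of groups (both symbols are introduced here for illustration):

```latex
\mathrm{tf}(t, g) = \frac{n_{t,g}}{\sum_{k} n_{k,g}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\lvert \{\, g' : t \in g' \,\} \rvert}, \qquad
\text{tf-idf}(t, g) = \mathrm{tf}(t, g) \cdot \mathrm{idf}(t)
```

Under this definition, an entry such as "Users" that occurs in every group has $\mathrm{idf} = \log 1 = 0$, so its TF-IDF value is zero however often it occurs within one group.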
Step 330: the regular expression matching method. Regular expression matching is used to match all entries in the file paths that have certain textual features, for example: e-mail addresses, which usually have the structure "user name@domain name"; dates, which take many forms but all have recognizable textual features; and purely numeric entries within a limited length range, such as Tencent QQ account numbers.
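A minimal sketch of step 330 follows. The three patterns are simplified assumptions written for this illustration (the embodiment does not give its patterns): the date pattern covers only year-month-day forms, and the 5-to-11-digit length range for numeric account entries is likewise assumed.

```python
import re
from typing import Optional

# Simplified, hypothetical patterns; real deployments would need broader ones.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
DATE = re.compile(r"(19|20)\d{2}[-_.]?(0[1-9]|1[0-2])[-_.]?(0[1-9]|[12]\d|3[01])")
NUMERIC_ID = re.compile(r"\d{5,11}")  # assumed length range for account numbers


def classify_entry(entry: str) -> Optional[str]:
    """Return the kind of textual feature an entry shows, or None."""
    if EMAIL.fullmatch(entry):
        return "email"
    # Dates are tested before numeric IDs, so an ambiguous entry such as
    # "20181022" is reported as a date in this sketch.
    if DATE.fullmatch(entry):
        return "date"
    if NUMERIC_ID.fullmatch(entry):
        return "numeric"
    return None


kinds = [classify_entry(e) for e in
         ["alice@example.com", "2018-10-22", "20181022", "123456789", "users"]]
```

Because such patterns can both miss valid forms and accept ambiguous ones, the results of pattern matching are not fully accurate, which is consistent with the accuracy deduction applied to this method in the expert scoring.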
Fig. 3 is a flow diagram of the context relation method, as follows:
Step 311: Serialize each path, that is, number the entries of each path from parent to child following the directory structure.
Step 312: Identify keywords using adjacent marker words, which comprise preceding marker words and following marker words. A preceding marker word is a word whose position in the sequence is just before the keyword. Taking Windows 7 as an example, the user folder named after the operating system account is used to store files and software data, and this folder is a subfolder of the folder named "Users"; the entry after "Users" is therefore very likely the operating system account name. A following marker word is a word whose position in the sequence is just after the keyword. For example, for Tencent's QQ platform, the user folder named after the QQ account sits above the folder named "QQ" in the path.
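Steps 311 and 312 can be sketched as below, assuming the path has already been preprocessed into lower-case entries. The marker sets contain only the two examples mentioned in the text; the names `PRECEDING_MARKERS`, `FOLLOWING_MARKERS`, and `adjacent_keywords` are hypothetical.

```python
from typing import List

# Marker words taken from the examples in the text; a real list would be longer.
PRECEDING_MARKERS = {"users"}  # the keyword is the entry right after the marker
FOLLOWING_MARKERS = {"qq"}     # the keyword is the entry right before the marker


def adjacent_keywords(entries: List[str]) -> List[str]:
    """Return entries that sit next to a marker word in one serialized path."""
    found = []
    # enumerate() gives the parent-to-child numbering of step 311.
    for index, entry in enumerate(entries):
        if entry in PRECEDING_MARKERS and index + 1 < len(entries):
            found.append(entries[index + 1])
        if entry in FOLLOWING_MARKERS and index > 0:
            found.append(entries[index - 1])
    return found


sample = ["users", "zhang san", "documents", "123456789", "qq", "files"]
neighbours = adjacent_keywords(sample)
```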
Step 313: Identify keywords using scope marker words. People tend to organize documents by category; for example, a folder named "work" mostly holds work-related files. Therefore, when a noun of this kind appears in a file path, the path entries that follow it may be related to the system owner's industry or employer.
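A sketch of step 313 under the same assumption of preprocessed entries: every entry after the first scope marker word is returned as a candidate keyword. The marker list and the function name `scope_keywords` are hypothetical; only "work" comes from the text.

```python
from typing import List

# "work" is the example from the text; "projects" is an assumed addition.
SCOPE_MARKERS = {"work", "projects"}


def scope_keywords(entries: List[str]) -> List[str]:
    """Return all entries that follow the first scope marker word in a path."""
    for index, entry in enumerate(entries):
        if entry in SCOPE_MARKERS:
            return entries[index + 1:]
    return []


candidates = scope_keywords(["users", "zhang san", "work", "acme corp", "budget"])
```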
Step 314: Identify keywords using end words. The last entry of each path, i.e. the file name, is also taken as a keyword. It usually has a low term frequency and a fixed position, which makes it hard to obtain with the other algorithms, yet it is quite likely to be directly related to the system owner's occupation.
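Step 314 reduces to taking the last entry of each preprocessed path. The sketch below additionally removes duplicates while keeping first-seen order, which is an assumption of this illustration and not something the embodiment requires.

```python
from typing import List


def end_words(paths: List[List[str]]) -> List[str]:
    """Collect the last entry (the file name) of each path, without duplicates."""
    seen = set()
    result = []
    for entries in paths:
        if entries and entries[-1] not in seen:
            seen.add(entries[-1])
            result.append(entries[-1])
    return result


file_names = end_words([
    ["users", "zhang san", "thesis draft"],
    ["users", "zhang san", "work", "budget"],
    ["data", "thesis draft"],
])
```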
Fig. 4 is a flow diagram of the TF-IDF method; the detailed process is as follows:
Step 321: By the principle of the algorithm, TF-IDF values computed from only one group of file paths are not representative; a large amount of data is needed. A threshold is therefore set. When the number of system groups obtained is less than the threshold, the data of an authoritative dataset must be used in the calculation together with the input file paths to obtain the inverse document frequency of every entry in the paths of the target system. The authoritative dataset may be multiple groups of file paths collected for this purpose; the sources should be diverse, to avoid an inflated document frequency for some proper noun caused by several groups of file paths coming from the same source. Applying the TF-IDF method described here to extract privacy-associated keywords from the authoritative dataset itself should produce results that are accepted as valid. The authoritative dataset must reach a certain order of magnitude so that it can tolerate some abnormal samples without harming the results of the algorithm. The threshold should be determined by repeated experiments: TF-IDF keyword extraction is run multiple times on datasets of different sizes to determine the number of data groups at which valid results are obtained, and the threshold is set from the minimum or the mean over those experiments.
Step 322: When the number of file path groups obtained is greater than or equal to the threshold, no other dataset is needed and the calculation uses the input data directly.
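The choice made in steps 321 and 322 can be sketched as a small helper that decides which groups take part in the inverse document frequency calculation. The names `select_corpus`, `target`, and `reference`, the dictionary layout (group name mapped to that group's entries), and the sample threshold of 3 are assumptions of this sketch; the embodiment determines the real threshold by experiment.

```python
from typing import Dict, List

Groups = Dict[str, List[str]]  # group name -> entries of all paths in the group


def select_corpus(target: Groups, reference: Groups, threshold: int) -> Groups:
    """Return the groups to use when computing inverse document frequency."""
    if len(target) >= threshold:
        # Enough input groups: use the input data directly (step 322).
        return dict(target)
    # Too few input groups: add the authoritative dataset (step 321).
    merged = dict(reference)
    merged.update(target)
    return merged


target = {"system_a": ["users", "acme", "docs"]}
reference = {"ref_1": ["users", "docs"], "ref_2": ["users", "music"]}
small_corpus = select_corpus(target, reference, threshold=3)
large_corpus = select_corpus(target, reference, threshold=1)
```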
Step 323: Compute the TF-IDF value of each entry with respect to each group.
Step 324: For each group, take the average of the TF-IDF values of all entries.
Step 325: Entries whose TF-IDF value is higher than the average are regarded as keywords.
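Steps 323 to 325 can be sketched as follows, assuming the textbook TF-IDF definition (relative term frequency times the logarithm of the inverse group frequency), which the embodiment does not state explicitly. The function name `tfidf_keywords` and the sample groups are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tfidf_keywords(groups: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """For each group, return the entries whose TF-IDF exceeds the group mean."""
    group_count = len(groups)
    # Document frequency: in how many groups does each entry occur?
    doc_freq = Counter()
    for entries in groups.values():
        doc_freq.update(set(entries))

    keywords: Dict[str, Set[str]] = {}
    for name, entries in groups.items():
        if not entries:
            keywords[name] = set()
            continue
        counts = Counter(entries)
        total = len(entries)
        # Step 323: TF-IDF of every entry in this group.
        scores = {term: (count / total) * math.log(group_count / doc_freq[term])
                  for term, count in counts.items()}
        # Step 324: mean over all distinct entries of the group.
        mean = sum(scores.values()) / len(scores)
        # Step 325: keep entries strictly above the mean.
        keywords[name] = {term for term, score in scores.items() if score > mean}
    return keywords


groups = {
    "a": ["users", "acme", "acme", "docs"],
    "b": ["users", "docs", "photos"],
    "c": ["users", "music"],
}
result = tfidf_keywords(groups)
```

In this toy data, "users" occurs in every group and therefore scores zero, while the entry that is frequent in one group and absent from the others ("acme" in group "a") stands out.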
Table 1 shows the scores given by expert scoring in one embodiment, described as follows:
Table 1
Expert scoring rates the above five algorithms on the three indicators using a ten-point scale.
Accuracy assesses whether the algorithm can accurately identify the result it is looking for; it evaluates the performance of the algorithm itself. Because the regular expression matching method relies on pattern matching, its results cannot be entirely accurate, so points are deducted as appropriate, while the other four algorithms are given 10 points.
Validity assesses how effective the entries identified by the algorithm are as evidence that they are keywords; it evaluates how well the algorithm performs for keyword identification, that is, whether the keywords it identifies really are keywords associated with user privacy. Scores are given according to the principle, characteristics, and effect of each algorithm.
Stability assesses whether the algorithm is easily affected by changes in the input dataset. Only the TF-IDF method depends on the overall data for its calculation and may therefore be affected, so points are deducted as appropriate.
The scores of each algorithm on the three indicators are added to obtain a total score, which is then normalized to a value between 0 and 1.
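The weight calculation can be sketched as below. The scores in `index_scores` are hypothetical placeholders, not the values of Table 1; they merely follow the stated pattern (a deduction on accuracy for regular expression matching, a deduction on stability for TF-IDF). Dividing each total by the sum of all totals is one possible normalization and is an assumption here, since the text only requires values between 0 and 1.

```python
from typing import Dict, Tuple

# Hypothetical ten-point scores as (accuracy, validity, stability).
# These are illustrative placeholders, NOT the values of Table 1.
index_scores: Dict[str, Tuple[int, int, int]] = {
    "adjacent_marker": (10, 9, 10),
    "scope_marker": (10, 6, 10),
    "end_word": (10, 5, 10),
    "regex": (8, 7, 10),
    "tfidf": (10, 7, 7),
}


def normalize_weights(scores: Dict[str, Tuple[int, int, int]]) -> Dict[str, float]:
    """Add the three indicator scores per algorithm and scale totals to sum to 1."""
    totals = {algorithm: sum(values) for algorithm, values in scores.items()}
    grand_total = sum(totals.values())
    return {algorithm: total / grand_total for algorithm, total in totals.items()}


weights = normalize_weights(index_scores)
```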
It should be noted that the expert scoring is not unique and may be changed according to the user's experience and how well the algorithms are implemented. For example, as the patterns of the regular expression matching method are refined, its accuracy score can be raised.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with examples, those skilled in the art should understand that modifications or equivalent substitutions of the technical solution that do not depart from its spirit and scope shall all fall within the scope of the claims of the present invention.

Claims (10)

1. A method for identifying file path keywords associated with user privacy, comprising the following steps:
obtaining the set of file paths to be processed, with all file paths from one user's computer system forming one group;
preprocessing the file paths, including case unification, entry segmentation, and stop-word filtering;
identifying file path keywords with three algorithms: a context relation method applied to full paths, a regular expression matching method, and a term frequency-inverse document frequency (TF-IDF) method applied to the entries obtained by segmenting the file paths;
assigning different weights to the three algorithms, normalizing the weights, and scoring each keyword;
according to the keyword scores, obtaining the keywords of the group of file paths in order of score.
2. The method of claim 1, wherein the entry segmentation means splitting a path into entries at the backslash "\", the forward slash "/", and the colon ":", according to the characteristics of file paths, whitespace contained in a directory name or file name at any level not being used for splitting.
3. The method of claim 1, wherein the stop words removed by the stop-word filtering include default drive letters and file extensions.
4. The method of claim 1, wherein the regular expression matching method matches all entries in the file paths that have certain textual features, such entries including e-mail addresses, dates, and purely numeric entries.
5. The method of claim 1, wherein the TF-IDF method comprises the following steps:
when the number of file path groups obtained is less than a threshold, computing the inverse document frequency of every entry in the target file paths using an authoritative dataset together with the file paths being processed;
when the number of file path groups obtained is greater than or equal to the threshold, computing the inverse document frequency of the entries directly from the file paths being processed;
computing the TF-IDF value of each entry with respect to each group;
for each group, taking the average of the TF-IDF values of all entries;
taking the entries whose TF-IDF value is higher than that average as keywords.
6. The method of claim 5, wherein the authoritative dataset consists of multiple groups of file paths collected in advance from different sources.
7. The method of claim 1, wherein the context relation method comprises the following three specific methods:
1) identifying keywords using adjacent marker words, an adjacent marker word being the word immediately before or after the keyword in the same sequence;
2) identifying keywords using scope marker words, a scope marker word being a word that denotes a category of files;
3) identifying keywords using end words, the end word being the last entry of each path;
wherein the sequence is obtained by numbering the entries of a path from parent to child following the directory structure.
8. The method of claim 7, wherein expert scoring is used to assign different weights to the regular expression matching method, the TF-IDF method, and the three specific algorithms of the context relation method;
the expert scoring being as follows: the above algorithms are evaluated on three indicators, namely accuracy, validity, and stability; each algorithm is given a score on each indicator, the three scores are added, and the total of each algorithm is normalized to a value between 0 and 1, which gives the weight of that algorithm; wherein accuracy refers to whether the algorithm can accurately identify the result it is looking for; validity refers to how effective the entries identified by the algorithm are as evidence that they are keywords; and stability refers to how strongly the algorithm is affected by changes in the input dataset.
9. A system for identifying file path keywords associated with user privacy, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program comprising instructions for executing each step of the method of any one of claims 1 to 8.
10. A computer-readable storage medium storing a computer program, the computer program comprising instructions which, when executed by a processor of a server, cause the server to execute each step of the method of any one of claims 1 to 8.
CN201811228942.XA 2018-10-22 2018-10-22 Method for identifying file path keywords associated with user privacy Pending CN109508557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811228942.XA CN109508557A (en) 2018-10-22 2018-10-22 Method for identifying file path keywords associated with user privacy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811228942.XA CN109508557A (en) 2018-10-22 2018-10-22 Method for identifying file path keywords associated with user privacy

Publications (1)

Publication Number Publication Date
CN109508557A true CN109508557A (en) 2019-03-22

Family

ID=65746930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811228942.XA Pending CN109508557A (en) 2018-10-22 2018-10-22 Method for identifying file path keywords associated with user privacy

Country Status (1)

Country Link
CN (1) CN109508557A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610090A (en) * 2019-08-28 2019-12-24 北京小米移动软件有限公司 Information processing method and device, and storage medium
CN112925755A (en) * 2021-02-18 2021-06-08 安徽中科美络信息技术有限公司 Intelligent storage method and device for ultra-long path of file system
CN114826732A (en) * 2022-04-25 2022-07-29 南京大学 Dynamic detection and tracing method for android system privacy stealing behavior

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101291304A (en) * 2008-06-13 2008-10-22 清华大学 Transplantable network information sharing method
CN104750852A (en) * 2015-04-14 2015-07-01 海量云图(北京)数据技术有限公司 Method for finding and classifying Chinese address data
US9215243B2 (en) * 2013-09-30 2015-12-15 Globalfoundries Inc. Identifying and ranking pirated media content
CN105488100A (en) * 2015-11-18 2016-04-13 国信司南(北京)地理信息技术有限公司 Efficient detection and discovery system for secret-associated geographic data in non secret-associated environment
CN106202556A (en) * 2016-07-28 2016-12-07 中国电子科技集团公司第二十八研究所 A kind of mass text key word rapid extracting method based on Spark
CN107918740A (en) * 2017-12-02 2018-04-17 北京明朝万达科技股份有限公司 A kind of sensitive data decision-making decision method and system
CN108427767A (en) * 2018-03-28 2018-08-21 广州市创新互联网教育研究院 A kind of correlating method of knowledget opic and resource file

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUN FENG, BAOXU LIU 等: ""A Systematic Method on PDF Privacy Leakage Issues"", 《2018 17TH IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS/ 12TH IEEE INTERNATIONAL CONFERENCE ON BIG DATA SCIENCE AND ENGINEERING (TRUSTCOM/BIGDATASE)》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (Application publication date: 20190322)
WD01 Invention patent application deemed withdrawn after publication