CN109657070B - A Construction Method of Terminal Assisted SWOT Index System - Google Patents

A Construction Method of Terminal Assisted SWOT Index System

Info

Publication number
CN109657070B
Authority
CN
China
Prior art keywords
keywords
word
text data
keyword
index system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811515374.1A
Other languages
Chinese (zh)
Other versions
CN109657070A (en
Inventor
石进
韩进
金鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201811515374.1A
Publication of CN109657070A
Application granted
Publication of CN109657070B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of terminal centralized storage, and in particular to a method for constructing a terminal-assisted SWOT index system, comprising: step S100, extracting keywords from a text data set on a terminal; step S200, keyword clustering and SWOT index system mapping; step S300, generating index system weight suggestions. The automatic extraction and clustering of keywords effectively saves expert labor and, to a certain extent, avoids the influence of human interference during construction of the SWOT system.

Description

A Construction Method of Terminal Assisted SWOT Index System

Technical field

The invention relates to the field of terminal centralized storage, and in particular to a method for constructing a terminal-assisted SWOT index system.

Background

SWOT analysis (Strengths: internal strengths; Weaknesses: internal weaknesses; Opportunities: external opportunities; Threats: external threats) is a classic competitive intelligence analysis tool, proposed by K. J. Andrews of Harvard Business School in his 1971 book "The Concept of Corporate Strategy". The method centers on extensive investigation and information collection around the analysis target, followed by analysis of the collected information to identify four kinds of factors: the external opportunities and external threats that affect the target, and the internal strengths and weaknesses in carrying it out. SWOT analysis can support a simple preliminary analysis that gives a qualitative overview of the target, and it can also support the formation, implementation, or control of strategy for the target.

Because SWOT analysis starts from the target as a whole, it can clearly list the strengths, weaknesses, opportunities, and threats that affect the target, analyze them together, and make the complex factors affecting the target explicit. Decision makers can then see the risks and opportunities that may arise in pursuing the target, which improves the accuracy of their decisions. SWOT analysis has therefore become the most commonly used analysis tool in the management and decision-making of modern government departments and enterprises, and it has been widely applied and studied.

Based on the above technical problems, a new method for constructing a terminal-assisted SWOT index system needs to be designed.

Summary of the invention

The purpose of the present invention is to provide a method for constructing a terminal-assisted SWOT index system.

To solve the above technical problems, the present invention provides a method for constructing a terminal-assisted SWOT index system, comprising:

Step S100, extracting keywords from a text data set;

Step S200, keyword clustering and SWOT index system mapping; and

Step S300, generating index system weight suggestions.

Further, the method for extracting keywords from the text data set in step S100 comprises:

Step S101, stop-word filtering: after Chinese word segmentation is performed on the collected text data set, the stop words in the text data are filtered out using a stop-word list built up through accumulation and selection;

Step S102, specific-word filtering: each word is searched with a search engine, a word whose number of search results is below a threshold is judged to be a specific word, and the specific words are then filtered out;

Step S103, keyword extraction: keywords are extracted with an improved TF/IDF algorithm.

Further, the improved TF/IDF algorithm is:

Formula 1: (Figure SMS_1)

Formula 2: $W_i = \{\, w \mid \mathrm{TF/IDF}(w_i) > \eta \,\}$;

Formula 3: $W = \bigcup_i W_i$;

Formula 4: (Figure SMS_2)

where TF/IDF(w_i) is the TF/IDF weight of word w in the text data labeled i; TF(w_i) is the number of times word w occurs in the text data labeled i; N is the number of text data items in the text data set; and d is the number of text data items that contain word w.

The method for extracting keywords with the improved TF/IDF algorithm comprises:

calculating, by Formula 1, the TF/IDF weight of the keywords contained in each text data item of the text data set;

sorting the keywords in each text data item by TF/IDF weight;

extracting the keywords whose weight is greater than the threshold η to form the keyword set W_i of the text data labeled i, and merging the W_i sets of all text data into the keyword set W of the text data set;

pairing the keywords in the W set two by two and calculating the ratio C.

In Formula 4, TF_sum(W_a) is the cumulative frequency of keyword a in the W set, and TF_sum(W_b) is the cumulative frequency of keyword b in the W set; G(W_a) is the number of search result pages obtained for keyword a in a search engine, and G(W_b) is the number of search result pages obtained for keyword b. The ratio C is the ratio between the products of the TF_sum value and the G value for the pair of keywords a and b. The keywords in the W set are sorted by this ratio and displayed in that order so that they can be corrected.
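The extraction steps covered by Formulas 1 to 3 can be sketched in Python as follows. This is an illustrative sketch only, not the patented implementation: Formula 1 is an image that is not reproduced in this text, so the sketch assumes the conventional form TF(w_i) × log(N/d), and the function name `extract_keywords` and its input layout (one token list per text data item) are hypothetical.

```python
import math
from collections import Counter


def extract_keywords(docs, eta):
    """Return (per-document keyword lists, merged keyword set).

    docs: list of token lists, one per text data item, already segmented
          and filtered (hypothetical input layout).
    eta:  the threshold from Formula 2.
    """
    n_docs = len(docs)  # N: number of text data items
    # d: number of text data items that contain each word
    doc_freq = Counter(word for doc in docs for word in set(doc))

    per_doc = []
    for doc in docs:
        tf = Counter(doc)  # TF(w_i): occurrences of w in item i
        # ASSUMPTION: Formula 1 has the conventional form TF * log(N / d).
        weights = {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf}
        # Formula 2: keep words whose weight exceeds eta, highest first.
        kept = sorted((w for w in weights if weights[w] > eta),
                      key=lambda w: -weights[w])
        per_doc.append(kept)

    # Formula 3: W is the union of all per-document sets W_i.
    merged = set().union(*per_doc) if per_doc else set()
    return per_doc, merged


docs = [
    ["market", "market", "brand", "price"],
    ["price", "cost", "cost", "cost"],
    ["price", "brand"],
]
per_doc, merged = extract_keywords(docs, eta=0.5)
```

A word that occurs in every item, such as "price" here, gets weight zero under the assumed form and never passes the threshold.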

Further, the method for keyword clustering and SWOT index system mapping in step S200 comprises:

Step S201, performing an initial classification of the keywords against the Chinese Classified Thesaurus: the keywords extracted from the current text data set are classified by reference to the Chinese Classified Thesaurus, and an initial keyword classification structure is established;

Step S202, for the keywords that remain after the initial classification because they have no corresponding category in the Chinese Classified Thesaurus, clustering them with the K_MEANS clustering method, using the degree of synonymy between words as the distance measure;

Step S203, after the terminal-assisted clustering is complete, displaying the clustered keywords by class and correcting them;

Step S204, after repeated iterations of keyword clustering and correction of the clustered keyword classes, mapping the word classes to corresponding indices according to the classification information of the clustered word classes, thereby establishing the index system for SWOT analysis.

Further, the method for generating index system weight suggestions in step S300 comprises: selecting the factors that influence the judgment of index system weights.

The factors that influence the judgment of index system weights include:

the number of keywords contained in a word class: the number of keywords in each word class generated during keyword clustering is analyzed to judge the weight of the index that the word class maps to; the more keywords a word class contains, the larger the weight of its corresponding index;

the word frequency of the keywords contained in a word class: the cumulative sum of the occurrence counts, in the text data set, of all keywords contained in the word class; and

the timeliness of the keywords contained in a word class: word frequency statistics over the time dimension for the keywords in a word class show how much attention those keywords receive over time.

Further, the method for generating index system weight suggestions in step S300 also comprises: constructing, from the factors that influence the judgment of index system weights, a formula for generating index system weight suggestions, namely

(Figure SMS_3)

where R(W) is the suggested index weight for a word class; i runs from 1 to k, where k is the number of keywords contained in the word class, so that every keyword in the class is processed in turn; j runs from 1 to d over the text data items that contain a given word w of the class, so that every text data item containing w is processed in turn and the time decay function is evaluated for the j-th such item; TF(w_j) is the number of times word w occurs in text data j; $e^{-\mu(t - t_c)}$ is the time decay function; μ is the decay constant; t is the time at which the text data appeared; and t_c is the current time.

After the suggested R(W) value of each word class has been calculated, the index weight suggestions are generated.

The beneficial effect of the present invention is that it extracts keywords from a text data set on a terminal, performs keyword clustering and SWOT index system mapping, and generates index system weight suggestions, thereby achieving automatic extraction and clustering of keywords.

Brief description of the drawings

The present invention is further described below with reference to the drawings and embodiments.

Fig. 1 is a flow chart of the method for constructing a terminal-assisted SWOT index system according to the present invention.

Detailed description

The present invention is now described in further detail with reference to the drawings. The drawings are simplified schematic diagrams that illustrate only the basic structure of the invention, and therefore show only the components relevant to the invention.

Embodiment 1

Fig. 1 is a flow chart of the method for constructing a terminal-assisted SWOT index system according to the present invention.

As shown in Fig. 1, this embodiment provides a method for constructing a terminal-assisted SWOT index system, comprising:

Step S100, extracting keywords from a text data set on a terminal;

Step S200, keyword clustering and SWOT index system mapping; and

Step S300, generating index system weight suggestions.

In this embodiment, the terminal may be, but is not limited to, a computer, which assists in constructing the SWOT index system. The automatic extraction and clustering of keywords effectively saves expert labor and, to a certain extent, avoids the influence of human interference during construction of the SWOT system.

In this embodiment, the terminal-based method for extracting keywords from the text data set in step S100 comprises the following steps.

Step S101, stop-word filtering: after Chinese word segmentation is performed on the text data set collected by the terminal, the stop words in the text data are filtered out using a stop-word list built up through accumulation and selection. Stop words are typically modal particles, function words, quantifiers, and the like.

Step S102, specific-word filtering: each word is searched with a search engine, a word whose number of search results is below a threshold is judged to be a specific word, and the specific words are then filtered out. Specific words are typically highly referential terms such as place names and personal names. Unlike stop words, specific words are hard to filter with a custom word list. Much related research uses word classification inference to categorize specific words, judging whether a word is a place name, a personal name, and so on, but such inference is not fully reliable. This embodiment instead uses a search engine such as Google or Baidu to identify specific words. Google, for example, shows the number of results for every search, and a search for a specific word returns comparatively few result pages, so a word whose result count falls below a certain threshold can be judged to be a specific word and filtered out. The lookup of specific words on Google can be completed automatically through Google's algorithm.

Step S103, keyword extraction: keywords are extracted with an improved TF/IDF algorithm. TF/IDF is currently the mainstream keyword extraction algorithm; TF (Term Frequency) is the number of times a word occurs in a given text, and IDF stands for Inverse Document Frequency.
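Steps S101 and S102 amount to two successive filters over the segmented tokens. The Python sketch below illustrates that pipeline under stated assumptions: the embodiment does not specify how the search result count is retrieved, so the lookup is passed in as a function, and `fake_hit_count`, the `FAKE_HITS` table, and the threshold value are hypothetical stand-ins rather than a real search-engine API.

```python
def filter_tokens(tokens, stop_words, hit_count, min_hits):
    """Apply stop-word filtering (S101), then specific-word filtering (S102).

    tokens:     segmented words of one text data item
    stop_words: set built up through accumulation and selection
    hit_count:  callable returning the search result count for a word
                (hypothetical; the source does not name an API)
    min_hits:   words with fewer results are treated as specific words
    """
    kept = []
    for token in tokens:
        if token in stop_words:
            continue  # S101: drop stop words
        if hit_count(token) < min_hits:
            continue  # S102: too few results, judged a specific word
        kept.append(token)
    return kept


# Hypothetical result counts standing in for a search-engine lookup.
FAKE_HITS = {"market": 5_000_000, "Zhangjiagang": 1_200, "price": 9_000_000}


def fake_hit_count(word):
    return FAKE_HITS.get(word, 0)


tokens = ["the", "market", "Zhangjiagang", "price"]
filtered = filter_tokens(tokens, {"the"}, fake_hit_count, min_hits=10_000)
```

Passing the lookup as a parameter keeps the filter testable without network access; a real deployment would need to cache results and respect the search provider's terms of use.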

In this embodiment, the goal is to extract the keywords of the entire text data set. The traditional TF/IDF algorithm extracts the keywords of a single document, so the traditional algorithm is improved. The improved TF/IDF algorithm is:

Formula 1: (Figure SMS_4)

Formula 2: $W_i = \{\, w \mid \mathrm{TF/IDF}(w_i) > \eta \,\}$;

Formula 3: $W = \bigcup_i W_i$;

Formula 4: (Figure SMS_5)

where TF/IDF(w_i) is the TF/IDF weight of word w in the text data labeled i; TF(w_i) is the number of times word w occurs in the text data labeled i; N is the number of text data items in the text data set; and d is the number of text data items that contain word w.

The method for extracting keywords with the improved TF/IDF algorithm comprises: calculating, by Formula 1, the TF/IDF weight of the keywords contained in each text data item of the text data set; sorting the keywords in each text data item by TF/IDF weight; extracting the keywords whose weight is greater than the threshold η to form the keyword set W_i of the text data labeled i, and merging the W_i sets of all text data into the keyword set W of the text data set; and pairing the keywords in the W set two by two and calculating the ratio C. In Formula 4, TF_sum(W_a) is the cumulative frequency of keyword a in the W set, and TF_sum(W_b) is the cumulative frequency of keyword b in the W set; G(W_a) is the number of search result pages obtained for keyword a in a search engine, and G(W_b) is the number of search result pages obtained for keyword b. The ratio C is the ratio between the products of the TF_sum value and the G value for the pair of keywords a and b (the G value takes the same form as TF_sum and refers to the number of search result pages obtained for a pair of keywords in a search engine). The keywords in the W set are sorted by this ratio and displayed in that order so that they can be corrected.
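The pairwise ratio can be illustrated with a short Python sketch. Formula 4 is an image that is not reproduced in this text, so the sketch rests on one reading of the prose description: C = (TF_sum(W_a) × G(W_a)) / (TF_sum(W_b) × G(W_b)). That reading, the function name `pair_ratios`, and the sample numbers are assumptions made for illustration.

```python
from itertools import combinations


def pair_ratios(tf_sum, hits):
    """Compute the ratio C for every unordered pair of keywords.

    tf_sum: keyword -> cumulative frequency in the W set
    hits:   keyword -> search result page count (the G value)

    ASSUMPTION: C = (TF_sum(a) * G(a)) / (TF_sum(b) * G(b)), read from
    the prose description because the formula image is unavailable.
    """
    ratios = {}
    for a, b in combinations(sorted(tf_sum), 2):
        denominator = tf_sum[b] * hits[b]
        if denominator == 0:
            continue  # ratio undefined for this pair; skip it
        ratios[(a, b)] = (tf_sum[a] * hits[a]) / denominator
    # Present pairs from largest to smallest ratio for manual review.
    return sorted(ratios.items(), key=lambda item: -item[1])


# Hypothetical sample values.
tf_sum = {"brand": 4, "cost": 2, "price": 1}
hits = {"brand": 100, "cost": 50, "price": 400}
ranked = pair_ratios(tf_sum, hits)
```

Sorting the pairs by ratio gives a reviewer an ordered list to inspect, which matches the stated purpose of displaying keywords in order so they can be corrected.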

In this embodiment, the method for keyword clustering and SWOT index system mapping in step S200 comprises: step S201, performing an initial classification of the keywords against the Chinese Classified Thesaurus, in which the keywords extracted from the current text data set are classified by reference to the Chinese Classified Thesaurus and an initial keyword classification structure is established; step S202, for the keywords that remain after the initial classification because they have no corresponding category in the Chinese Classified Thesaurus, clustering them with the K_MEANS clustering method, using the degree of synonymy between words as the distance measure; step S203, after the terminal-assisted clustering is complete, displaying the clustered keywords by class and correcting them, where the correction may be, but is not limited to, manual correction; step S204, after repeated iterations of keyword clustering and correction of the clustered keyword classes, mapping the word classes to corresponding indices according to the classification information of the clustered word classes, thereby establishing the index system for SWOT analysis.
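Steps S201 and S202 form a two-stage classification: a thesaurus lookup first, then clustering of whatever the thesaurus does not cover. The Python sketch below shows that structure under explicit assumptions. The embodiment does not say how the degree of synonymy is measured, so each leftover keyword is represented here by a hypothetical numeric vector and squared Euclidean distance stands in for the synonymy-based distance; the thesaurus is modeled as a plain dictionary, and all names and sample data are illustrative.

```python
def classify_with_thesaurus(keywords, thesaurus):
    """S201: split keywords into thesaurus classes and unmatched leftovers."""
    classes, leftover = {}, []
    for keyword in keywords:
        label = thesaurus.get(keyword)
        if label is None:
            leftover.append(keyword)
        else:
            classes.setdefault(label, []).append(keyword)
    return classes, leftover


def kmeans(vectors, k, iterations=20):
    """S202: plain k-means over keyword vectors.

    vectors: keyword -> tuple of floats (ASSUMED proxy for synonymy).
    The first k keywords seed the centroids, so results are deterministic.
    """
    names = list(vectors)
    centroids = [list(vectors[name]) for name in names[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: attach each keyword to its nearest centroid.
        for name in names:
            distances = [sum((x - c) ** 2 for x, c in zip(vectors[name], centroid))
                         for centroid in centroids]
            assignment[name] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for index in range(k):
            members = [vectors[n] for n in names if assignment[n] == index]
            if members:
                centroids[index] = [sum(col) / len(members) for col in zip(*members)]
    clusters = [[] for _ in range(k)]
    for name in names:
        clusters[assignment[name]].append(name)
    return clusters


thesaurus = {"price": "economics"}  # hypothetical thesaurus excerpt
classes, leftover = classify_with_thesaurus(
    ["price", "rival", "subsidy", "competitor", "grant"], thesaurus)
vectors = {"rival": (0.0, 0.1), "subsidy": (5.0, 5.0),
           "competitor": (0.1, 0.0), "grant": (5.1, 4.9)}
clusters = kmeans({w: vectors[w] for w in leftover}, k=2)
```

The resulting clusters would then be shown to a reviewer for correction (S203) before being mapped to indices (S204); those manual steps are outside this sketch.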

In this embodiment, the method for generating index system weight suggestions in step S300 comprises: selecting the factors that influence the judgment of index system weights. The indices in the index system do not support the analysis result to the same degree: some indices are primary factors and others are secondary. This embodiment generates weight suggestions from three factors that influence the judgment of index system weights. These factors include:

the number of keywords contained in a word class: the number of keywords in each word class generated during keyword clustering is analyzed to judge the weight of the index that the word class maps to; the more keywords a word class contains, the larger the weight of its corresponding index;

the word frequency of the keywords contained in a word class: besides the number of keywords, the word frequency of the keywords in a word class is also a basis for judging the weight of the index that the word class maps to; this word frequency is the cumulative sum of the occurrence counts, in the text data set, of all keywords contained in the word class;

the timeliness of the keywords contained in a word class: that is, how often the keywords appear within a given time period. Text data sets collected from open-source data all carry a time attribute, and the words in a text data item inherit the time attribute of that item. The time attribute is not examined when keywords are analyzed and extracted, yet word frequency statistics over the time dimension for the keywords in a word class show how much attention those keywords receive over time, so the timeliness of the keywords in a word class is also a factor in judging the weight of the corresponding index.
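The three factors can be computed directly from a word class and a time-stamped text data set. The Python sketch below is a minimal illustration: the function name `class_factors`, the use of a year as the time bucket, and the sample data are assumptions, since the embodiment does not fix a time granularity or a data layout.

```python
from collections import Counter


def class_factors(class_keywords, docs):
    """Compute the three weight-related factors for one word class.

    class_keywords: keywords belonging to the word class
    docs:           list of (year, tokens) pairs; the year is the time
                    attribute of the text data item (assumed granularity)
    """
    members = set(class_keywords)
    total_frequency = 0
    frequency_by_year = Counter()
    for year, tokens in docs:
        occurrences = sum(1 for token in tokens if token in members)
        total_frequency += occurrences
        frequency_by_year[year] += occurrences
    return {
        "keyword_count": len(members),                 # factor 1
        "word_frequency": total_frequency,             # factor 2
        "frequency_by_year": dict(frequency_by_year),  # factor 3
    }


# Hypothetical time-stamped text data.
docs = [
    (2016, ["price", "cost", "brand"]),
    (2018, ["price", "price", "market"]),
]
factors = class_factors(["price", "cost"], docs)
```

The per-year counts make the timeliness factor visible: a class whose frequency is concentrated in recent years is receiving more current attention than one whose mentions are old.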

In this embodiment, the method for generating index system weight suggestions in step S300 also comprises: constructing, from the factors that influence the judgment of index system weights, a formula for generating index system weight suggestions, namely

(Figure SMS_6)

where R(W) is the suggested index weight for a word class; i runs from 1 to k, where k is the number of keywords contained in the word class, so that every keyword in the class is processed in turn; j runs from 1 to d over the text data items that contain a given word w of the class, so that every text data item containing w is processed in turn and the time decay function is evaluated for the j-th such item; TF(w_j) is the number of times word w occurs in text data j; $e^{-\mu(t - t_c)}$ is the time decay function; μ is the decay constant; t is the time at which the text data appeared; and t_c is the current time. After the suggested R(W) value of each word class has been calculated, the index weight suggestions are generated.
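A time-decayed weight of this kind can be sketched in Python as a double sum over keywords and text data items. The formula image is not reproduced in this text, so the sketch assumes the form R(W) = Σ_i Σ_j TF(w_j) × decay, which is what the symbol descriptions suggest. It also assumes the decay is applied to the age of the item, t_c − t, so that older text data contributes less; the source writes the exponent as −μ(t − t_c), and the sign convention used here is an interpretation. The function name and sample data are hypothetical.

```python
import math


def weight_suggestion(class_keywords, docs, mu, now):
    """Suggested weight for one word class.

    class_keywords: the k keywords of the word class
    docs:           list of (timestamp, tokens) pairs
    mu:             decay constant
    now:            current time t_c, in the same unit as the timestamps

    ASSUMPTION: R(W) = sum_i sum_j TF(w_j) * exp(-mu * (now - t)),
    i.e. decay grows with the age of the text data item.
    """
    score = 0.0
    for word in class_keywords:          # i = 1..k
        for timestamp, tokens in docs:   # j over all text data items
            frequency = tokens.count(word)  # TF(w_j)
            if frequency:                   # only items containing w count
                score += frequency * math.exp(-mu * (now - timestamp))
    return score


# Hypothetical text data with timestamps expressed in years.
docs = [
    (2016.0, ["price", "cost"]),
    (2018.0, ["price", "price"]),
]
undecayed = weight_suggestion(["price", "cost"], docs, mu=0.0, now=2018.0)
decayed = weight_suggestion(["price", "cost"], docs, mu=math.log(2), now=2018.0)
```

With μ = 0 the value reduces to the plain cumulative word frequency of the class; with μ = ln 2 each year of age halves an item's contribution, so the two-year-old item counts for a quarter.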

In summary, the present invention extracts keywords from a text data set, performs keyword clustering and SWOT index system mapping, and finally generates index system weight suggestions. It achieves automatic extraction and clustering of keywords, effectively saves expert labor, and to a certain extent avoids the influence of human interference during construction of the SWOT system. The invention can also effectively reduce the workload of intelligence analysts and, to a certain extent, reduce interfering factors in the construction of the SWOT index system, which in turn helps promote the application of SWOT analysis.

The first iteration obtains the keywords related to the SWOT analysis target and clusters them; the second iteration maps the keyword classes to SWOT evaluation indices; finally, once the SWOT evaluation indices have been generated, an algorithm generates suggestions for the SWOT index weights.

Taking the above preferred embodiment of the present invention as guidance, and on the basis of the above description, those skilled in the relevant field can make various changes and modifications without departing from the technical idea of the invention. The technical scope of the invention is not limited to the content of the description and must be determined according to the scope of the claims.

Claims (3)

1. The construction method of the terminal-assisted SWOT index system is characterized by comprising the following steps of:
step S100, extracting keywords of a text data set;
step S200, keyword clustering and SWOT index system mapping; and
step S300, generating an index system weight suggestion;
the method for extracting the keywords of the text data set in the step S100 comprises the following steps:
step S101, filtering stop words, namely after Chinese word segmentation is carried out on an acquired text data set, filtering the stop words in the text data by accumulating stop word lists formed by selection;
step S102, filtering specific words, searching the words through a search engine, judging the words with search results less than a threshold value as specific words, and filtering the specific words;
step S103, extracting keywords, namely extracting the keywords through an improved TF/IDF algorithm;
the improved TF/IDF algorithm is:
Formula 1: (Figure FDA0004159262550000011)
Formula 2: $W_n = \{\, w \mid \mathrm{TF/IDF}(w_n) > \eta \,\}$;
Formula 3: $W = \bigcup_n W_n$;
Formula 4: (Figure FDA0004159262550000012)
wherein TF/IDF(w_n) is the TF/IDF weight of word w in the text data numbered n; TF(w_n) is the frequency of occurrence of word w in the text data numbered n; N is the number of text data contained in the text data set; d is the number of text data containing word w;
the method for extracting the keywords through the improved TF/IDF algorithm comprises the following steps:
calculating the TF/IDF weight of the keywords contained in each text data in the text data set through Formula 1;
sorting according to the size of the TF/IDF weight of the keywords in each text data;
extracting keywords with weights larger than a threshold value η to form the keyword set W_n of the text data numbered n, and merging the W_n sets of all text data into the keyword set W of the text data set;
aiming at pairwise pairing of keywords in the W set, calculating a ratio C;
in Formula 4, TF_sum(W_a) refers to the cumulative sum of the frequency of occurrence of a certain keyword a in the W set, TF_sum(W_b) refers to the cumulative sum of the frequency of occurrence of a certain keyword b in the W set, G(W_a) refers to the number of search page results obtained by the keyword a in the search engine; G(W_b) refers to the number of search page results obtained by the keyword b in the search engine; the ratio C is the ratio of the product of the TF_sum value and the G value of a pair of keywords a and b, and the keywords in the W set are ordered according to the result of the ratio and displayed in sequence to correct the keywords;
in step S200, the method for mapping the keyword clusters to the SWOT index system comprises:
step S201: performing a preliminary classification of the keywords according to the Chinese Classified Thesaurus, wherein the keywords extracted from the current text data set are classified by comparison with the Chinese Classified Thesaurus and an initial keyword classification structure is established;
step S202: for the keywords that cannot be matched to a class in the Chinese Classified Thesaurus after the preliminary classification, clustering the remaining keywords by the K-means clustering method, using word proximity as the word-to-word distance measure;
step S203: after the terminal-assisted clustering is completed, displaying the clustered keywords by class for correction;
step S204: after repeated iterations of keyword clustering and classification correction of the clustered keywords, mapping the word classes to the corresponding indexes according to the classification information of the clustered word classes, thereby establishing the index system for SWOT analysis.
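Steps S201 and S202 can be illustrated with a short Python sketch. It is a simplified reading, not the claimed implementation: the dictionary `thesaurus` is a hypothetical stand-in for the subject word list named in step S201, the two-dimensional `vectors` are made-up coordinates standing in for word proximity, Euclidean distance is an assumed distance measure, and the manual correction of steps S203 and S204 is not modeled.

```python
import math

def classify_with_thesaurus(keywords, thesaurus):
    """Step S201: split keywords into thesaurus classes and unmatched words."""
    classes, unmatched = {}, []
    for w in keywords:
        if w in thesaurus:
            classes.setdefault(thesaurus[w], []).append(w)
        else:
            unmatched.append(w)
    return classes, unmatched

def kmeans(points, k, iterations=20):
    """Step S202: plain K-means; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return labels

# Hypothetical inputs, for illustration only.
thesaurus = {"cost": "Economics", "policy": "Politics"}
vectors = {
    "brand": (0.9, 0.1), "reputation": (0.8, 0.2),
    "tariff": (0.1, 0.9), "quota": (0.2, 0.8),
}
keywords = ["cost", "brand", "policy", "tariff", "reputation", "quota"]

classes, unmatched = classify_with_thesaurus(keywords, thesaurus)
labels = kmeans([vectors[w] for w in unmatched], k=2)
clusters = {}
for w, lab in zip(unmatched, labels):
    clusters.setdefault(lab, []).append(w)
```

In this toy run the thesaurus places "cost" and "policy" directly, and the four unmatched words fall into two clusters that a reviewer could then rename or correct before they are mapped to indexes.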
2. The construction method according to claim 1, wherein,
the method for generating the index system weight suggestion in step S300 comprises: selecting the factors that influence the weight judgment of the index system;
the factors that influence the weight judgment of the index system comprise:
the number of keywords contained in a word class: the weight of the index mapped from each word class is judged by analyzing the number of keywords contained in each word class generated during keyword clustering, that is, a word class containing more keywords corresponds to a larger index weight;
the word frequency of the keywords contained in a word class: the frequencies in the text data set of all keywords contained in the word class are summed;
the timeliness of the keywords contained in a word class: word frequency statistics over the time dimension for the keywords contained in a word class show how much attention the keywords receive over time.
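The three factors can be computed for one word class as in the following Python sketch. The function name `class_factors`, the use of the publication year as the time bucket, and the sample data are all assumptions made for illustration; the source does not specify a time granularity.

```python
from collections import defaultdict

def class_factors(word_class, docs):
    """Compute the three weighting factors for one word class.

    docs is a list of (year, tokens) pairs; the year bucket is an assumption.
    """
    members = set(word_class)
    total = 0
    by_year = defaultdict(int)
    for year, tokens in docs:
        count = sum(1 for t in tokens if t in members)
        total += count          # summed frequency over the whole data set
        by_year[year] += count  # frequency along the time dimension
    return {
        "keyword_count": len(members),
        "frequency": total,
        "by_year": dict(by_year),
    }

docs = [
    (2016, ["brand", "cost", "brand"]),
    (2017, ["reputation", "risk"]),
    (2018, ["brand", "reputation", "reputation"]),
]
factors = class_factors(["brand", "reputation"], docs)
```

A rising count in `by_year` suggests growing attention to the class, which is the kind of signal the timeliness factor is meant to capture.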
3. The construction method according to claim 2, wherein,
the method for generating the index system weight suggestion in step S300 further comprises: constructing a generation formula for the index system weight suggestion based on the factors that influence the weight judgment of the index system, namely
[Equation image FDA0004159262550000031: generation formula for R(W)]
Wherein R(W) is the index weight suggestion corresponding to a word class; i runs from 1 to k over the keywords contained in the word class, so that all keywords in the word class are calculated in turn; j runs from 1 to d over the text data containing a word w of the word class, so that all text data containing word w are calculated in turn; the text data containing word w are traversed and the time decay function is calculated for the j-th text data containing word w; TF(w_j) is the frequency with which word w appears in text data j; e^(-μ(t-tc)) is the time decay function; μ is the decay constant; t is the time at which the text data appeared; tc is the current time;
an index weight suggestion is generated after the R(W) weight suggestion value has been calculated for each word class.
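The weight suggestion can be illustrated with a short Python sketch. The equation image is not reproduced in this text, so the sketch assumes a double-sum form read off the symbol definitions: R(W) is the sum, over the k keywords of the class and the d text data items containing each keyword, of TF(w_j) multiplied by the time decay. It also evaluates the decay on the age tc - t so that older text contributes less, which is an interpretation of the exponent rather than a quotation of it. The names `weight_suggestion` and `mu`, and the use of years as the time unit, are hypothetical.

```python
import math

def weight_suggestion(word_class, docs, mu, tc):
    """R(W) under the assumed double-sum form.

    docs is a list of (t, tokens) pairs, t being the time the text appeared.
    """
    r = 0.0
    for w in word_class:           # i = 1..k: keywords in the word class
        for t, tokens in docs:     # j = 1..d: text data containing w
            tf = tokens.count(w)   # TF(w_j)
            if tf:
                # Assumed decay on the age (tc - t): older text counts less.
                r += tf * math.exp(-mu * (tc - t))
    return r

docs = [
    (2016, ["brand", "brand"]),
    (2018, ["brand", "reputation"]),
]
r_image = weight_suggestion(["brand", "reputation"], docs, mu=0.5, tc=2018)
r_other = weight_suggestion(["cost"], docs, mu=0.5, tc=2018)
```

Here the two 2016 occurrences of "brand" are discounted by e^(-1) while the 2018 occurrences count in full, so a class discussed recently receives a larger suggested weight than one with the same raw frequency in older text.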
CN201811515374.1A 2018-12-11 2018-12-11 A Construction Method of Terminal Assisted SWOT Index System Active CN109657070B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811515374.1A CN109657070B (en) 2018-12-11 2018-12-11 A Construction Method of Terminal Assisted SWOT Index System

Publications (2)

Publication Number Publication Date
CN109657070A (en) 2019-04-19
CN109657070B (en) 2023-06-09

Family

ID=66113770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811515374.1A Active CN109657070B (en) 2018-12-11 2018-12-11 A Construction Method of Terminal Assisted SWOT Index System

Country Status (1)

Country Link
CN (1) CN109657070B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532357B (en) * 2019-09-04 2024-03-12 深圳前海微众银行股份有限公司 ESG scoring system generation method, device, equipment and readable storage medium
CN110991785B (en) * 2019-10-11 2023-07-25 平安科技(深圳)有限公司 Index extraction method and device based on text, computer equipment and storage medium
CN111767401B (en) * 2020-07-02 2023-04-28 中国标准化研究院 A method for automatic generation of NQI index
CN112508376A (en) * 2020-11-30 2021-03-16 中国科学院深圳先进技术研究院 Index system construction method
CN113849464A (en) * 2021-09-29 2021-12-28 联想(北京)有限公司 Information processing method and apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593200A (en) * 2009-06-19 2009-12-02 淮海工学院 Chinese Web Page Classification Method Based on Keyword Frequency Analysis
CN104008143A (en) * 2014-05-09 2014-08-27 启秀科技(北京)有限公司 Vocational ability index system establishment method based on data mining
CN107958344A (en) * 2017-12-13 2018-04-24 国网陕西省电力公司经济技术研究院 A kind of power distribution network Development Strategy Analysis method based on AHP and SWOT
CN108062306A (en) * 2017-12-29 2018-05-22 国信优易数据有限公司 A kind of index system establishment system and method for business environment evaluation
CN108241652A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 Keyword clustering method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cluster validity index: Comparative study and a new validity index with high performance; Chaimae Ouchicha et al.; LOPAL '18: Proceedings of the International Conference on Learning and Optimization Algorithms: Theory and Applications; 20180530; pp. 1-6 *
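Potential strategic leading industr... evaluation and screening in the Yangtze River Delta region: taking Jiangsu Province as an example; Huang Min; Times Finance (时代金融); 20120830; Vol. 24 (No. 8); pp. 79, 83 *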

Also Published As

Publication number Publication date
CN109657070A (en) 2019-04-19

Similar Documents

Publication Publication Date Title
CN109657070B (en) A Construction Method of Terminal Assisted SWOT Index System
CN107220295B (en) Searching and mediating strategy recommendation method for human-human contradiction mediating case
CN103605665B (en) Keyword based evaluation expert intelligent search and recommendation method
CN107577785B (en) Hierarchical multi-label classification method suitable for legal identification
WO2022156328A1 (en) Restful-type web service clustering method fusing service cooperation relationships
CN103631859B (en) Intelligent review expert recommending method for science and technology projects
CN106682172A (en) Keyword-based document research hotspot recommending method
CN111260223A (en) A trial risk intelligent identification and early warning method, system, medium and equipment
CN102495892A (en) Webpage information extraction method
CN111899890B (en) Medical data similarity detection system and method based on bit string hash
CN110569273A (en) A patent retrieval system and method based on relevance ranking
CN108563773A (en) The accurate search ordering method of legal provision of knowledge based collection of illustrative plates
CN103049575A (en) Topic-adaptive academic conference searching system
CN110633365A (en) A hierarchical multi-label text classification method and system based on word vectors
CN108520038B (en) Biomedical literature retrieval method based on sequencing learning algorithm
CN104346379A (en) Method for identifying data elements on basis of logic and statistic technologies
CN106227788A (en) Database query method based on Lucene
CN107977393A (en) A kind of recommended engine design method based on data collection of illustrative plates, Information Atlas, knowledge mapping and wisdom collection of illustrative plates towards 5W question and answer
CN113377957B (en) Classification method and system of national economic industry based on knowledge map
CN104881398A (en) Method for extracting author affiliation information of English literature published by Chinese authors
CN109684484B (en) A Construction System of SWOT Index System
CN114943216B (en) Attribute-level opinion mining method for case microblogs based on graph attention network
CN114817567A (en) Construction method of classification symbol co-occurrence network, method and system for identifying technical opportunity
CN107992524B (en) Expert information searching and domain scoring computing method
CN101408893A (en) Method for rapidly clustering documents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant