WO2022105123A1 - Text classification method, topic generation method, apparatus, device, and medium - Google Patents

Text classification method, topic generation method, apparatus, device, and medium

Info

Publication number
WO2022105123A1
Authority
WO
WIPO (PCT)
Prior art keywords: node, nodes, common, vector, classified
Application number
PCT/CN2021/090711
Other languages
French (fr)
Chinese (zh)
Inventor
刘金克
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022105123A1

Classifications

    • G06F 16/35: Information retrieval of unstructured textual data; Clustering; Classification
    • G06F 16/3334: Query processing; Query translation; Selection or weighting of terms from queries, including natural language queries
    • G06F 16/3346: Query processing; Query execution; Query execution using probabilistic model
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided are a text classification method, topic generation method, apparatus, device, and medium. The text classification method comprises: crawling online articles and obtaining keywords corresponding to each article (S1); obtaining common keywords between pairs of articles and constructing a representation graph on the basis of the common keywords, each node in the representation graph representing one article, with lines connecting pairs of nodes that have common keywords (S2); calculating the closeness between each node and the other connected nodes on the basis of the common keywords, and obtaining a node vector of each node on the basis of the closeness (S3); and feeding the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model (S4). The method can classify texts accurately.

Description

Text classification method, topic generation method, apparatus, device, and medium
This application claims priority to Chinese patent application No. CN202011305385.4, filed with the China Patent Office on November 19, 2020 and entitled "Text classification method, topic generation method, apparatus, device, and medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a text classification method and a topic generation method, apparatus, device, and medium.
Background
At present, a large amount of information is produced on the Internet every day, covering breaking events, event analyses, public opinion predictions, social development events, and so on. Information spreads rapidly through the Internet, and everyone can quickly obtain large amounts of it. Text classification plays an important role in information processing, and classifying information accurately through effective methods is of great value. Traditional text classification methods fall into two kinds: one is based on clustering and similarity, grouping related texts together by computing the similarity of their titles or abstracts; the other is based on classification models, for example modeling texts such as articles with algorithms like RNN or Text-CNN and outputting the text categories.
The inventor realized that the above methods all process serialized representation features of the text and can achieve certain results, but a text carries far more information than that. For example, a given article is associated with multiple other articles, and each pairwise association is relative to that article: it characterizes the relative degree of association between that article and each of the other articles. Methods based on serialized representation features cannot mine this intrinsic relationship and therefore cannot classify texts accurately, so the technology for accurately classifying texts needs further improvement.
Summary of the Invention
A text classification method, comprising:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector of each node based on the closeness; and
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model.
A topic generation method based on the above text classification method, the topic generation method comprising:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector of each node based on the closeness;
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model; and
selecting a preset number of nodes from the set of each category, extracting common information of the corresponding articles based on the selected nodes, and generating a topic based on the common information.
A text classification apparatus, comprising:
a crawling module, configured to crawl web articles and obtain keywords corresponding to each article;
a construction module, configured to obtain common keywords between pairs of articles and construct a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
a processing module, configured to calculate the closeness between each node and the other connected nodes based on the common keywords, and obtain a node vector of each node based on the closeness; and
a classification module, configured to input the node vector of each node into a predetermined classification model for training, and obtain the set of classified nodes output by the classification model.
A computer device, comprising a memory and a processor connected to the memory, the memory storing a computer program executable on the processor, where the processor, when executing the computer program, implements the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining a node vector of each node based on the closeness; and
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
or implements the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining a node vector of each node based on the closeness;
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model; and
selecting a preset number of nodes from the set of each category, extracting common information of the corresponding articles based on the selected nodes, and generating a topic based on the common information.
A computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining a node vector of each node based on the closeness; and
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
or implements the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining a node vector of each node based on the closeness;
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model; and
selecting a preset number of nodes from the set of each category, extracting common information of the corresponding articles based on the selected nodes, and generating a topic based on the common information.
The present application can accurately group the most similar articles into one category, achieving better classification.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an embodiment of the text classification method of the present application;
FIG. 2 is a schematic diagram of the representation graph in FIG. 1;
FIG. 3 is a detailed flowchart of the step in FIG. 1 of calculating the closeness between each node and the other connected nodes based on the common keywords and obtaining the node vector of each node based on the closeness;
FIG. 4 is a detailed flowchart of the step in FIG. 1 of inputting the node vector of each node into a predetermined classification model for training and obtaining the set of classified nodes output by the classification model;
FIG. 5 is a schematic flowchart of an embodiment of the topic generation method of the present application;
FIG. 6 is a schematic structural diagram of an embodiment of the text classification apparatus of the present application;
FIG. 7 is a schematic diagram of the hardware architecture of an embodiment of the computer device of the present application.
Detailed Description of the Embodiments
To make the purposes, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application, not to limit it. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
It should be noted that descriptions such as "first" and "second" in this application are for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments can be combined with one another, but only on the basis that they can be realized by those of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be realized, such a combination should be deemed not to exist and falls outside the protection scope claimed by this application.
Referring to FIG. 1, a schematic flowchart of an embodiment of the text classification method of the present application, the method includes:
Step S1: crawl web articles and obtain keywords corresponding to each article.
Web articles may be crawled from the Internet at regular intervals (for example, daily) to generate topics for the corresponding period. The web articles include articles of different tag categories, for example headline news, finance, education, and sports.
First, each article is segmented into words. A word segmentation tool may be used to process the articles one by one, for example the Stanford Chinese word segmenter or the jieba segmenter. For each article, word segmentation yields a corresponding word list.
Keywords are then extracted by a predetermined keyword extraction algorithm. For example, any one of the TF-IDF (Term Frequency-Inverse Document Frequency), LSA (Latent Semantic Analysis), or PLSA (Probabilistic Latent Semantic Analysis) algorithms may be applied to the word list of each article, and the words with higher scores are taken as the article's keywords. As another implementation, multiple keyword extraction algorithms may be applied to the same article simultaneously, and the keywords extracted in common by all of them are taken as the article's keywords.
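By way of illustration only, and not as part of the original disclosure, the segmentation and TF-IDF scoring described above might look like the following Python sketch; the function name, the top_k cutoff of 10, and the choice of jieba over the Stanford segmenter are assumptions made for the example.

```python
import math
from collections import Counter

import jieba  # one of the segmenters mentioned above

def extract_keywords(articles, top_k=10):
    """Segment each article, score its words by TF-IDF, and keep the
    top_k highest-scoring words as that article's keywords."""
    word_lists = [list(jieba.cut(text)) for text in articles]
    n_docs = len(word_lists)
    # document frequency of each word across all articles
    df = Counter(w for words in word_lists for w in set(words))
    keywords = []
    for words in word_lists:
        tf = Counter(words)
        scores = {w: (tf[w] / len(words)) * math.log(n_docs / (1 + df[w]))
                  for w in tf}
        keywords.append(set(sorted(scores, key=scores.get, reverse=True)[:top_k]))
    return word_lists, keywords  # word lists are reused by later sketches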
Step S2: obtain common keywords between pairs of articles, and construct a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords.
Whether two articles share common keywords is analyzed; if they do, each article is taken as a node and a line is drawn between the two nodes. After all the crawled articles have been analyzed, every pair of nodes with common keywords is connected, thereby constructing the representation graph. The constructed representation graph is shown in FIG. 2.
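Continuing the sketch under the same assumptions (the keyword sets come from the hypothetical extract_keywords above), the representation graph can be held as a map from node pairs to their shared keywords:

```python
from itertools import combinations

def build_representation_graph(keywords):
    """One node per article; an edge between every pair of articles
    whose keyword sets intersect. Returns {(a, b): common_keyword_set}."""
    edges = {}
    for a, b in combinations(range(len(keywords)), 2):
        common = keywords[a] & keywords[b]
        if common:
            edges[(a, b)] = common
    return edges
```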
Step S3: calculate the closeness between each node and the other connected nodes based on the common keywords, and obtain the node vector of each node based on the closeness.
In one embodiment, as shown in FIG. 3, step S3 includes:
Step S31: count the number of common keywords in the two articles corresponding to two connected nodes;
Step S32: count the number of times each common keyword appears in each of the two articles corresponding to the two connected nodes;
Step S33: calculate the closeness S between each node and the other connected nodes based on the number of common keywords and the respective occurrence counts:
S = μ·∑_{i=1}^{n} (A_i / B_i)
where A and B denote connected nodes in the representation graph, n is the number of common keywords in the two articles corresponding to nodes A and B, i is the index of a common keyword, A_i is the number of times the i-th common keyword appears in the article corresponding to node A, B_i is the number of times the i-th common keyword appears in the article corresponding to node B, and μ is the reciprocal of the number of common keywords.
The summation ∑_{i=1}^{n} (A_i / B_i) is the sum of all the ratios; multiplying it by μ yields the ratio averaged over the common keywords. The closeness S expresses the association between two articles that share common keywords and how close that association is. When two articles are very similar, the value of S approaches 1; for example, two identical articles have S = 1. If two articles are not similar, S approaches 0 or is much larger than 1, i.e., it fluctuates widely around the value 1.
Step S34: vectorize the closeness between each node and the other connected nodes to obtain the node vector corresponding to each node.
In this embodiment, the closeness between each node and the other connected nodes is vectorized to obtain the node vector corresponding to the node. For example, denote all the crawled article nodes as A0, A1, A2, …, An; the closeness between node A0 and node A1 is S1, the closeness between A0 and A2 is S2, and so on, giving the node vector representation (S1, S2, …, Sn) for node A0. The node vector expression of each article is constructed in the same way; this completes the vectorization of node A0 and ultimately yields a vector expression for every node in the representation graph. The vector expression of each node contains not only the sequence features of the keywords but also the degree of closeness between that node and the other nodes.
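An illustrative sketch of steps S31 to S34, assuming the ratio formula reconstructed above; storing 0 for unconnected node pairs is an assumption of this sketch, since the text does not specify a value for them:

```python
from collections import Counter

def closeness(counts_a, counts_b, common):
    """Closeness S of step S33: the per-keyword count ratio A_i / B_i,
    averaged over the n common keywords (mu = 1/n)."""
    mu = 1.0 / len(common)
    return mu * sum(counts_a[w] / counts_b[w] for w in common)

def node_vectors(word_lists, edges):
    """Step S34: one vector per node whose j-th entry is the closeness
    to node j (0 where the two nodes are not connected)."""
    counts = [Counter(words) for words in word_lists]
    n = len(word_lists)
    vectors = [[0.0] * n for _ in range(n)]
    for (a, b), common in edges.items():
        vectors[a][b] = closeness(counts[a], counts[b], common)
        vectors[b][a] = closeness(counts[b], counts[a], common)
    return vectors
```

Every common keyword was extracted from both articles' word lists, so each count is at least 1 and the ratios are well defined.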
Step S4: input the node vector of each node into a predetermined classification model for training, and obtain the set of classified nodes output by the classification model.
The predetermined classification model may be any of a Naive Bayes (NB) model, a random forest (RF) model, an SVM classifier, a KNN classifier, or a neural network classifier, and may of course also be another deep learning text classification model such as fastText or TextCNN. The classification model in this embodiment adopts a graph neural network (GNN). A graph neural network is a connectionist model for learning over graphs that contain a large number of connections; as information propagates among the nodes of the graph, it captures the dependencies between the nodes. Unlike other classification models, a graph neural network maintains a state that can represent information from a manually specified depth. Moreover, the goal of a graph neural network is to learn a state embedding of each node's neighbors; this state embedding is a vector and can be used to produce the output. This embodiment specifically adopts Graph Attention Networks, which introduce an attention mechanism into the graph neural network and use it to give greater weight to the more important nodes.
In one embodiment, as shown in FIG. 4, step S4 includes:
Step S41: input the node vector of each node into the graph attention network, take the nodes whose node vectors are input into the graph attention network as the nodes to be classified, and calculate the loss function of each node to be classified;
Step S42: for each node to be classified, when the loss function is minimized, calculate the contribution of its neighbor nodes to the node to be classified based on the node vector of the node to be classified, where the neighbor nodes are the nodes connected to the node to be classified in the representation graph;
Step S43: aggregate the neighbor nodes based on the contributions.
The loss function adopted encourages more similar nodes to aggregate, while less similar nodes are kept apart in the embedding space. The formula of the loss function is:
J(Z_u) = -log(σ(Z_u^T·Z_v)) - Q·E_{v_n∼P_n(v)}[log(σ(-Z_u^T·Z_{v_n}))]
where Z_u is the embedding vector generated for node u; node v is a neighbor node reached from node u by random walk; Z_v is the embedding vector generated for node v; σ denotes the sigmoid function; T denotes transposition; the negative samples are nodes that cannot become neighbor nodes under the random walk; Q is the number of negative samples; E is the expected value over the probability distribution; P_n(v) is the probability distribution of the negative samples; n is the node index; and "∼" means "distributed according to".
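For illustration, the loss can be read numerically as in the sketch below; using a Monte-Carlo mean over drawn negatives in place of the expectation over P_n(v) is an assumption of this sketch:

```python
import numpy as np

def unsupervised_loss(z_u, z_v, z_negatives, Q):
    """Loss for one node u: pull the random-walk neighbor v close
    (-log sigma(z_u . z_v)) and push Q negative samples away."""
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    positive = -np.log(sigmoid(z_u @ z_v))
    negative = -Q * np.mean([np.log(sigmoid(-(z_u @ z_n))) for z_n in z_negatives])
    return positive + negative
```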
The node vector of each node is input into the graph attention network, and these nodes are taken as the nodes to be classified. For each node to be classified, when its loss function is minimized, the contributions of its neighbor nodes are calculated, the neighbor nodes are aggregated based on the contributions, and several categories are output; the nodes contained in each category are the most similar nodes. Classification here means classification by the similarity of article content: the more similar two articles are, the greater the probability that they belong to the same category.
Calculating the contribution of a neighbor node to the node to be classified based on the node vector of the node to be classified includes:
e_AB = LeakyReLU(α^T [W_A || W_B]), where A and B are connected nodes in the representation graph, node A is the node to be classified, node B is a neighbor node of node A, e_AB is the contribution of neighbor node B to node A, LeakyReLU is the leaky rectified linear unit function, which performs a nonlinear activation, W_A is the node vector of node A, W_B is the node vector of node B, || denotes the concatenation of the node vectors W_A and W_B, α is the shared attention function, and α^T is the transpose of the shared attention function.
When generating the new features of the next hidden layer, node A relies on the contribution e_AB of neighbor node B; the larger e_AB is, the greater the probability that the nodes are aggregated together.
The contribution e_AB of neighbor node B to the generation of node A's new features is computed by the graph attention network with a feed-forward neural network; the graph attention network computes the contributions of the neighbor nodes and aggregates similar nodes. A node may be aggregated into only one category, or into multiple different categories.
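A minimal sketch of the contribution formula above; in a real graph attention network the shared attention vector alpha would be learned, and the 0.2 negative slope is the conventional LeakyReLU default rather than a value given in the text:

```python
import numpy as np

def attention_contribution(w_a, w_b, alpha, negative_slope=0.2):
    """e_AB = LeakyReLU(alpha^T [W_A || W_B]) for one neighbor pair."""
    concat = np.concatenate([w_a, w_b])  # [W_A || W_B]
    score = float(alpha @ concat)        # alpha^T [W_A || W_B]
    return score if score > 0 else negative_slope * score  # LeakyReLU
```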
The above step S4 further includes: using a normalized exponential function to calculate, after aggregation, the score of each node when it is aggregated under the current category, and determining the category of the node based on the score.
The normalized exponential function is calculated as follows:
p(y|x) = exp(W_y·x) / ∑_{c∈C} exp(W_c·x)
where p(y|x) is the probability that node x belongs to category y, C is the set of categories, c is the index of a category, and W is the vector mapping matrix. The larger p(y|x) is, the greater the probability that the node is assigned to the corresponding category. In this embodiment, the probability p(y|x) of a node being assigned to each category is obtained and used as the node's score for that category, and the category with the largest score is taken as the node's final category.
In this embodiment, a representation graph is constructed from the common keywords between articles, the closeness between each node in the graph and the other connected nodes is calculated to obtain the node vector of each node, and the node vectors are input into the classification model for training to obtain the sets of classified nodes. By constructing a representation graph of the articles, taking the closeness between a node and its connected nodes as the node vector, and classifying the nodes by training the classification model on the node vectors, this embodiment can mine the relative closeness of association between one node and multiple other nodes. This closeness is a further intrinsic or spatial connection between the article and the other articles, and through it the most similar articles can be accurately grouped into one category, achieving better classification.
The present application further provides a topic generation method based on the above text classification method. As shown in FIG. 5, the topic generation method includes:
Step S101: crawl web articles and obtain keywords corresponding to each article;
Step S102: obtain common keywords between pairs of articles, and construct a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
Step S103: calculate the closeness between each node and the other connected nodes based on the common keywords, and obtain the node vector of each node based on the closeness;
Step S104: input the node vector of each node into a predetermined classification model for training, and obtain the set of classified nodes output by the classification model;
Step S105: select a preset number of nodes from the set of each category, extract common information of the corresponding articles based on the selected nodes, and generate a topic based on the common information.
For the definitions of steps S101 to S104 above, reference may be made to the embodiments of the text classification method described above. In step S105, in one implementation, a preset number of nodes are selected from the set of each category: the nodes may be sorted in descending order by the score of their assignment to the category, and the preset number of nodes with the largest scores are selected, for example the 5 nodes with the largest scores. The common information of the articles corresponding to the selected nodes is then obtained, and a topic is generated based on that common information. Among the preset number of nodes, the common information may be taken from the articles of two or more of the nodes, or directly from the articles of all the nodes, and the topic is generated according to the category of these nodes and the common information. Obtaining the common information of the articles can be implemented with existing means of extracting text features, which are not elaborated here.
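As a closing illustration, the category scoring and node selection described above might be combined as follows; treating each row of the mapping matrix W as one category and using 5 as the preset number are assumptions mirroring the example in the text:

```python
import numpy as np

def category_scores(x, W):
    """Normalized exponential function above: p(y|x) for every category,
    with one row of W per category."""
    logits = W @ x
    logits -= logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def select_topic_nodes(node_embeddings, W, preset=5):
    """Step S105: for each category, the `preset` nodes whose scores
    p(y|x) under that category are largest."""
    scores = np.stack([category_scores(x, W) for x in node_embeddings])
    top = {}
    for c in range(W.shape[0]):
        ranked = np.argsort(-scores[:, c])
        top[c] = ranked[:preset].tolist()
    return top
```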
As in the preceding embodiment, a representation graph is constructed from the common keywords between articles, the closeness between each node and the other connected nodes is calculated to obtain the node vector of each node, and the node vectors are input into the classification model for training to obtain the sets of classified nodes. By mining the relative closeness of association between one node and multiple other nodes, i.e., the further intrinsic or spatial connection between an article and the other articles, the most similar articles can be accurately grouped into one category; extracting the common information of the nodes based on this classification and generating topics from it yields high-quality topics.
In one embodiment, the present application provides a text classification apparatus, which corresponds one-to-one with the text classification method of the above embodiment. As shown in FIG. 6, the text classification apparatus includes:
a crawling module 101, configured to crawl web articles and obtain keywords corresponding to each article;
a construction module 102, configured to obtain common keywords between pairs of articles and construct a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
a processing module 103, configured to calculate the closeness between each node and the other connected nodes based on the common keywords, and obtain the node vector of each node based on the closeness;
a classification module 104, configured to input the node vector of each node into a predetermined classification model for training, and obtain the set of classified nodes output by the classification model.
For the specific definition of the text classification apparatus, reference may be made to the definition of the text classification method above, which is not repeated here. Each module of the above text classification apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, the present application provides a topic generation apparatus, which corresponds one-to-one with the topic generation method of the above embodiment. The topic generation apparatus includes:
a crawling module, configured to crawl web articles and obtain keywords corresponding to each article;
a construction module, configured to obtain common keywords between pairs of articles and construct a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
a processing module, configured to calculate the closeness between each node and the other connected nodes based on the common keywords, and obtain the node vector of each node based on the closeness;
a classification module, configured to input the node vector of each node into a predetermined classification model for training, and obtain the set of classified nodes output by the classification model; and
a generation module, configured to select a preset number of nodes from the set of each category, extract common information of the corresponding articles based on the selected nodes, and generate a topic based on the common information.
For the specific definition of the topic generation apparatus, reference may be made to the definition of the topic generation method above, which is not repeated here. Each module of the above topic generation apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, namely a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The computer device may be a PC (Personal Computer), a smartphone, a tablet computer, or a computer; it may also be a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
As shown in FIG. 7, the computer device may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can be communicatively connected to one another through a system bus, with the memory 11 storing a computer program that can run on the processor 12. It should be noted that FIG. 7 only shows a computer device with components 11-13, but it should be understood that not all the illustrated components are required to be implemented, and more or fewer components may be implemented instead.
The memory 11 may be a non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM). In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the various kinds of application software installed on the computer device, for example the program code of the computer program in an embodiment of the present application. In addition, the memory 11 may also be used to temporarily store various kinds of data that have been output or are to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to run the computer program.
The network interface 13 may include a standard wireless network interface and a wired network interface, and is generally used to establish a communication connection between the computer device and other electronic devices.
The computer program is stored in the memory 11 and includes at least one computer-readable instruction stored in the memory 11, and the at least one computer-readable instruction can be executed by the processor 12 to implement the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining the node vector of each node based on the closeness; and
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model; or
the at least one computer-readable instruction can be executed by the processor 12 to implement the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining the node vector of each node based on the closeness;
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model; and
selecting a preset number of nodes from the set of each category, extracting common information of the corresponding articles based on the selected nodes, and generating a topic based on the common information.
In one embodiment, the present application provides a computer-readable storage medium, which may be a non-volatile and/or volatile memory, on which a computer program is stored, the computer program, when executed by a processor, implementing the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining the node vector of each node based on the closeness; and
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
or implementing the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining the node vector of each node based on the closeness;
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model; and
selecting a preset number of nodes from the set of each category, extracting common information of the corresponding articles based on the selected nodes, and generating a topic based on the common information.
For example, steps S1 to S4 shown in FIG. 1, or steps S101 to S105 shown in FIG. 5, are implemented. Alternatively, when executed by the processor, the computer program implements the functions of the modules/units of the text classification apparatus of the above embodiment, for example the functions of modules 101 to 104 shown in FIG. 6. To avoid repetition, they are not described again here.
It should be emphasized that, to further ensure the privacy and security of data such as the above common information, topics, and representation graph, such data may also be stored in a node of a blockchain.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association using cryptographic methods; each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may comprise an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be accomplished by instructing the relevant hardware through a computer program, and the computer program, when executed, may include the processes of the embodiments of the above methods.
The above serial numbers of the embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments.
It should be noted that, herein, the terms "comprising", "including", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, apparatus, article, or method that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method.
The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. A text classification method, comprising:
    crawling web articles and obtaining keywords corresponding to each article;
    obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
    calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector of each node based on the closeness; and
    inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model.
  2. The text classification method according to claim 1, wherein the step of calculating the closeness between each node and the other connected nodes based on the common keywords and obtaining the node vector of each node based on the closeness specifically comprises:
    counting the number of common keywords in the two articles corresponding to two connected nodes;
    counting the number of times each common keyword appears in each of the two articles corresponding to the two connected nodes;
    calculating the closeness S between each node and the other connected nodes based on the number of common keywords and the respective occurrence counts:
    S = μ·∑_{i=1}^{n} (A_i / B_i)
    where A and B denote connected nodes in the representation graph, n is the number of common keywords in the two articles corresponding to nodes A and B, i is the index of a common keyword, A_i is the number of times the i-th common keyword appears in the article corresponding to node A, B_i is the number of times the i-th common keyword appears in the article corresponding to node B, and μ is the reciprocal of the number of common keywords; and
    vectorizing the closeness between each node and the other connected nodes to obtain the node vector corresponding to each node.
  3. The method of text classification according to claim 1 or 2, wherein the step of inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model, specifically comprises:
    inputting the node vector of each node into a graph attention network, taking each node whose node vector is input into the graph attention network as a node to be classified, and calculating a loss function for each node to be classified;
    for each node to be classified, when the loss function is minimized, calculating the contribution of its neighbor nodes to the node to be classified based on the node vector of the node to be classified, the neighbor nodes being the nodes connected to the node to be classified in the representation graph;
    aggregating the neighbor nodes based on the contribution.
  4. The method of text classification according to claim 3, wherein calculating the contribution of the neighbor nodes to the node to be classified based on the node vector of the node to be classified comprises:
    e_AB = LeakyReLU(α^T [W_A || W_B]), wherein LeakyReLU is the leaky rectified linear unit function, A and B are connected nodes in the representation graph, W_A is the node vector of node A, W_B is the node vector of node B, || denotes the concatenation of the node vectors W_A and W_B, α is the shared attention computation function, and α^T is the transpose of α.
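Claims 3 and 4 together describe one graph-attention step. A NumPy sketch follows, using an illustrative vector dimension; the softmax normalization of the coefficients is an assumption borrowed from standard graph attention networks, since the claims only say the neighbors are aggregated by contribution:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # illustrative node-vector dimension

def leaky_relu(x: np.ndarray, slope: float = 0.2) -> np.ndarray:
    return np.where(x > 0, x, slope * x)

# Node vector of the node to be classified (A) and of its neighbors.
w_a = rng.normal(size=dim)
neighbors = [rng.normal(size=dim) for _ in range(3)]

# Shared attention vector alpha applied to the concatenation
# [W_A || W_B]: e_AB = LeakyReLU(alpha^T [W_A || W_B]) as in claim 4.
alpha = rng.normal(size=2 * dim)
e = np.array([leaky_relu(alpha @ np.concatenate([w_a, w_b]))
              for w_b in neighbors])

# Contribution-based aggregation (claim 3): normalize the raw
# coefficients, then take the contribution-weighted sum of the
# neighbor vectors as the updated representation of node A.
contrib = np.exp(e) / np.exp(e).sum()
h_a = sum(c * w_b for c, w_b in zip(contrib, neighbors))
```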
  5. The method of text classification according to claim 3, wherein the step of inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model, further comprises:
    calculating, using a normalized exponential function, the score corresponding to each aggregated node when it is aggregated under the current category;
    determining the category corresponding to the node based on the score.
  6. The method of text classification according to claim 5, wherein calculating, using the normalized exponential function, the score corresponding to each aggregated node when it is aggregated under the current category comprises:
    p(y|x) = exp(W_y · x) / Σ_{c∈C} exp(W_c · x)   [formula image PCTCN2021090711-appb-100002]
    wherein p(y|x) is the probability that node x belongs to category y, C is the set of categories, c is the index of a category, and W is the vector mapping matrix.
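A sketch of the normalized exponential scoring follows; treating W as holding one row per category and forming the logits by a dot product are assumptions consistent with the variable definitions in the claim:

```python
import numpy as np

def category_scores(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Normalized exponential (softmax) over the category set C.

    x is the aggregated node vector and W the vector mapping matrix,
    assumed to hold one row W_c per category; returns p(y|x) for
    every category y in C.
    """
    logits = W @ x              # one raw score per category c in C
    logits -= logits.max()      # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Example: 4 categories, 8-dimensional aggregated node vector.
rng = np.random.default_rng(1)
p = category_scores(rng.normal(size=8), rng.normal(size=(4, 8)))
print(int(p.argmax()), p.round(3))  # chosen category and its scores
```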
  7. A method of topic generation based on the method of text classification according to any one of claims 1 to 6, wherein the method of topic generation comprises:
    crawling web articles and obtaining the keywords corresponding to each article;
    obtaining the common keywords between each pair of articles and constructing a representation graph based on the common keywords, wherein each node in the representation graph represents an article, and an edge connects any two nodes having common keywords;
    calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector for each node based on the closeness;
    inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
    selecting a preset number of nodes from the set of each category, extracting the common information of the articles corresponding to the selected nodes, and generating a topic based on the common information.
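The final step of claim 7, sketched with keyword sets standing in for the "common information" of the selected articles; both the node selection and the reduction of common information to frequent shared keywords are assumed simplifications, as the claim leaves them open:

```python
from collections import Counter

def generate_topic(category_nodes: dict[str, set[str]], k: int = 3) -> str:
    """Generate a topic label for one classified set of nodes.

    Selecting the first k nodes and using the most frequent shared
    keywords as the "common information" are assumed simplifications.
    """
    selected = list(category_nodes.items())[:k]  # preset number of nodes
    counts = Counter(kw for _, kws in selected for kw in kws)
    shared = [kw for kw, c in counts.most_common() if c > 1]
    return " / ".join(shared[:3]) if shared else "miscellaneous"

# Example: one category set produced by the classification model.
category = {
    "a1": {"graph", "attention", "classification"},
    "a2": {"graph", "classification", "benchmark"},
    "a3": {"graph", "topic"},
}
print(generate_topic(category))  # graph / classification
```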
  8. An apparatus for text classification, comprising:
    a crawling module, configured to crawl web articles and obtain the keywords corresponding to each article;
    a construction module, configured to obtain the common keywords between each pair of articles and construct a representation graph based on the common keywords, wherein each node in the representation graph represents an article, and an edge connects any two nodes having common keywords;
    a processing module, configured to calculate the closeness between each node and the other connected nodes based on the common keywords, and obtain the node vector of each node based on the closeness;
    a classification module, configured to input the node vector of each node into a predetermined classification model for training, and obtain the set of classified nodes output by the classification model.
  9. A computer device, comprising a memory and a processor connected to the memory, the memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    crawling web articles and obtaining the keywords corresponding to each article;
    obtaining the common keywords between each pair of articles and constructing a representation graph based on the common keywords, wherein each node in the representation graph represents an article, and an edge connects any two nodes having common keywords;
    calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector for each node based on the closeness;
    inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
    or implements the following steps:
    crawling web articles and obtaining the keywords corresponding to each article;
    obtaining the common keywords between each pair of articles and constructing a representation graph based on the common keywords, wherein each node in the representation graph represents an article, and an edge connects any two nodes having common keywords;
    calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector for each node based on the closeness;
    inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
    selecting a preset number of nodes from the set of each category, extracting the common information of the articles corresponding to the selected nodes, and generating a topic based on the common information.
  10. The computer device according to claim 9, wherein the step of calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining the node vector of each node based on the closeness, specifically comprises:
    counting the number of common keywords in the two articles corresponding to two connected nodes;
    counting the number of times each common keyword appears in each of the two articles corresponding to the two connected nodes;
    calculating the closeness S between each node and the other connected nodes based on the number of common keywords and the respective numbers of occurrences:
    [formula image PCTCN2021090711-appb-100003: the closeness S]
    wherein A and B represent connected nodes in the representation graph, n is the number of common keywords in the two articles corresponding to nodes A and B, i is the index of a common keyword, A_i is the number of times the i-th common keyword appears in the article corresponding to node A, B_i is the number of times the i-th common keyword appears in the article corresponding to node B, and μ is the reciprocal of the number of common keywords;
    vectorizing the closeness between each node and the other connected nodes to obtain the node vector corresponding to each node.
  11. The computer device according to claim 9 or 10, wherein the step of inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model, specifically comprises:
    inputting the node vector of each node into a graph attention network, taking each node whose node vector is input into the graph attention network as a node to be classified, and calculating a loss function for each node to be classified;
    for each node to be classified, when the loss function is minimized, calculating the contribution of its neighbor nodes to the node to be classified based on the node vector of the node to be classified, the neighbor nodes being the nodes connected to the node to be classified in the representation graph;
    aggregating the neighbor nodes based on the contribution.
  12. The computer device according to claim 11, wherein calculating the contribution of the neighbor nodes to the node to be classified based on the node vector of the node to be classified comprises:
    e_AB = LeakyReLU(α^T [W_A || W_B]), wherein LeakyReLU is the leaky rectified linear unit function, A and B are connected nodes in the representation graph, W_A is the node vector of node A, W_B is the node vector of node B, || denotes the concatenation of the node vectors W_A and W_B, α is the shared attention computation function, and α^T is the transpose of α.
  13. The computer device according to claim 11, wherein the step of inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model, further comprises:
    calculating, using a normalized exponential function, the score corresponding to each aggregated node when it is aggregated under the current category;
    determining the category corresponding to the node based on the score.
  14. The computer device according to claim 13, wherein calculating, using the normalized exponential function, the score corresponding to each aggregated node when it is aggregated under the current category comprises:
    p(y|x) = exp(W_y · x) / Σ_{c∈C} exp(W_c · x)   [formula image PCTCN2021090711-appb-100004]
    wherein p(y|x) is the probability that node x belongs to category y, C is the set of categories, c is the index of a category, and W is the vector mapping matrix.
  15. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the following steps:
    crawling web articles and obtaining the keywords corresponding to each article;
    obtaining the common keywords between each pair of articles and constructing a representation graph based on the common keywords, wherein each node in the representation graph represents an article, and an edge connects any two nodes having common keywords;
    calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector for each node based on the closeness;
    inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
    or implements the following steps:
    crawling web articles and obtaining the keywords corresponding to each article;
    obtaining the common keywords between each pair of articles and constructing a representation graph based on the common keywords, wherein each node in the representation graph represents an article, and an edge connects any two nodes having common keywords;
    calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector for each node based on the closeness;
    inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
    selecting a preset number of nodes from the set of each category, extracting the common information of the articles corresponding to the selected nodes, and generating a topic based on the common information.
  16. The computer-readable storage medium according to claim 15, wherein the step of calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining the node vector of each node based on the closeness, specifically comprises:
    counting the number of common keywords in the two articles corresponding to two connected nodes;
    counting the number of times each common keyword appears in each of the two articles corresponding to the two connected nodes;
    calculating the closeness S between each node and the other connected nodes based on the number of common keywords and the respective numbers of occurrences:
    [formula image PCTCN2021090711-appb-100005: the closeness S]
    wherein A and B represent connected nodes in the representation graph, n is the number of common keywords in the two articles corresponding to nodes A and B, i is the index of a common keyword, A_i is the number of times the i-th common keyword appears in the article corresponding to node A, B_i is the number of times the i-th common keyword appears in the article corresponding to node B, and μ is the reciprocal of the number of common keywords;
    vectorizing the closeness between each node and the other connected nodes to obtain the node vector corresponding to each node.
  17. The computer-readable storage medium according to claim 15 or 16, wherein the step of inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model, specifically comprises:
    inputting the node vector of each node into a graph attention network, taking each node whose node vector is input into the graph attention network as a node to be classified, and calculating a loss function for each node to be classified;
    for each node to be classified, when the loss function is minimized, calculating the contribution of its neighbor nodes to the node to be classified based on the node vector of the node to be classified, the neighbor nodes being the nodes connected to the node to be classified in the representation graph;
    aggregating the neighbor nodes based on the contribution.
  18. The computer-readable storage medium according to claim 17, wherein calculating the contribution of the neighbor nodes to the node to be classified based on the node vector of the node to be classified comprises:
    e_AB = LeakyReLU(α^T [W_A || W_B]), wherein LeakyReLU is the leaky rectified linear unit function, A and B are connected nodes in the representation graph, W_A is the node vector of node A, W_B is the node vector of node B, || denotes the concatenation of the node vectors W_A and W_B, α is the shared attention computation function, and α^T is the transpose of α.
  19. The computer-readable storage medium according to claim 17, wherein the step of inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model, further comprises:
    calculating, using a normalized exponential function, the score corresponding to each aggregated node when it is aggregated under the current category;
    determining the category corresponding to the node based on the score.
  20. The computer-readable storage medium according to claim 19, wherein calculating, using the normalized exponential function, the score corresponding to each aggregated node when it is aggregated under the current category comprises:
    p(y|x) = exp(W_y · x) / Σ_{c∈C} exp(W_c · x)   [formula image PCTCN2021090711-appb-100006]
    wherein p(y|x) is the probability that node x belongs to category y, C is the set of categories, c is the index of a category, and W is the vector mapping matrix.
PCT/CN2021/090711 2020-11-19 2021-04-28 Text classification method, topic generation method, apparatus, device, and medium WO2022105123A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011305385.4 2020-11-19
CN202011305385.4A CN112380344B (en) 2020-11-19 2020-11-19 Text classification method, topic generation method, device, equipment and medium

Publications (1)

Publication Number Publication Date
WO2022105123A1

Family

ID=74584415

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090711 WO2022105123A1 (en) 2020-11-19 2021-04-28 Text classification method, topic generation method, apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN112380344B (en)
WO (1) WO2022105123A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493490A (en) * 2023-11-17 2024-02-02 南京信息工程大学 Topic detection method, device, equipment and medium based on heterogeneous multi-relation graph
CN117493490B (en) * 2023-11-17 2024-05-14 南京信息工程大学 Topic detection method, device, equipment and medium based on heterogeneous multi-relation graph

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380344B (en) * 2020-11-19 2023-08-22 平安科技(深圳)有限公司 Text classification method, topic generation method, device, equipment and medium
CN113254603B (en) * 2021-07-08 2021-10-01 北京语言大学 Method and device for automatically constructing field vocabulary based on classification system
CN113722483B (en) * 2021-08-31 2023-08-22 平安银行股份有限公司 Topic classification method, device, equipment and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526785A (en) * 2017-07-31 2017-12-29 广州市香港科大霍英东研究院 Text classification method and device
CN108804432A (en) * 2017-04-26 2018-11-13 慧科讯业有限公司 Method, system and device for discovering and tracking hot topics based on network media data streams
CN109471937A (en) * 2018-10-11 2019-03-15 平安科技(深圳)有限公司 Text classification method and terminal device based on machine learning
US20190095425A1 (en) * 2017-09-28 2019-03-28 Oracle International Corporation Enabling autonomous agents to discriminate between questions and requests
CN109977223A (en) * 2019-03-06 2019-07-05 中南大学 Method for classifying papers using a graph convolutional network fused with a capsule mechanism
CN110032606A (en) * 2019-03-29 2019-07-19 阿里巴巴集团控股有限公司 Sample clustering method and device
CN110134764A (en) * 2019-04-26 2019-08-16 中国地质大学(武汉) Automatic classification method and system for text data
CN110175224A (en) * 2019-06-03 2019-08-27 安徽大学 Patent recommendation method and device based on semantic-link heterogeneous information network embedding
CN110543563A (en) * 2019-08-20 2019-12-06 暨南大学 Hierarchical text classification method and system
CN110704626A (en) * 2019-09-30 2020-01-17 北京邮电大学 Short text classification method and device
CN111125358A (en) * 2019-12-17 2020-05-08 北京工商大学 Text classification method based on hypergraph
CN112380344A (en) * 2020-11-19 2021-02-19 平安科技(深圳)有限公司 Text classification method, topic generation method, device, equipment and medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591988B (en) * 2012-01-16 2014-10-15 西安电子科技大学 Short text classification method based on semantic graphs
CN108228587A (en) * 2016-12-13 2018-06-29 北大方正集团有限公司 Stock discrimination method and Stock discrimination device
CN110019659B (en) * 2017-07-31 2021-07-30 北京国双科技有限公司 Method and device for searching judgment documents
CN110196920B (en) * 2018-05-10 2024-02-09 腾讯科技(北京)有限公司 Text data processing method and device, storage medium and electronic device
CN109299379B (en) * 2018-10-30 2021-02-05 东软集团股份有限公司 Article recommendation method and device, storage medium and electronic equipment
CN109522410B (en) * 2018-11-09 2021-02-09 北京百度网讯科技有限公司 Document clustering method and platform, server and computer readable medium
CN110489558B (en) * 2019-08-23 2022-03-18 网易传媒科技(北京)有限公司 Article aggregation method and device, medium and computing equipment
CN110781275B (en) * 2019-09-18 2022-05-10 中国电子科技集团公司第二十八研究所 Question answering distinguishing method based on multiple characteristics and computer storage medium
CN111428488A (en) * 2020-03-06 2020-07-17 平安科技(深圳)有限公司 Resume data information analyzing and matching method and device, electronic equipment and medium


Also Published As

Publication number Publication date
CN112380344A (en) 2021-02-19
CN112380344B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
WO2022105123A1 (en) Text classification method, topic generation method, apparatus, device, and medium
Gui et al. Large-scale embedding learning in heterogeneous event data
US11341417B2 (en) Method and apparatus for completing a knowledge graph
Gui et al. Embedding learning with events in heterogeneous information networks
WO2021120677A1 (en) Warehousing model training method and device, computer device and storage medium
Ma et al. Learn to forget: Machine unlearning via neuron masking
Cheng et al. Fblg: A simple and effective approach for temporal dependence discovery from time series data
Song et al. eXtreme gradient boosting for identifying individual users across different digital devices
US20210117454A1 (en) Decentralized Latent Semantic Index Using Distributed Average Consensus
US11449788B2 (en) Systems and methods for online annotation of source data using skill estimation
Sun et al. Extreme learning machine for classification over uncertain data
EP3839764A1 (en) Method and system for detecting duplicate document using vector quantization
Cai et al. Network linear discriminant analysis
Wang et al. Semi-supervised node classification on graphs: Markov random fields vs. graph neural networks
US11941867B2 (en) Neural network training using the soft nearest neighbor loss
Song et al. Top-k link recommendation in social networks
Tu et al. Crowdwt: Crowdsourcing via joint modeling of workers and tasks
Xu et al. Latent interest and topic mining on user-item bipartite networks
Ding et al. User identification across multiple social networks based on naive Bayes model
Meeus et al. Achilles’ heels: vulnerable record identification in synthetic data publishing
CN114386604A (en) Model distillation method, device, equipment and storage medium based on multi-teacher model
Wu et al. Collaborative filtering recommendation based on conditional probability and weight adjusting
Hajdu et al. Use of artificial neural networks to identify fake profiles
Xu et al. Cluster-aware multiplex InfoMax for unsupervised graph representation learning
CN115099875A (en) Data classification method based on decision tree model and related equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893279

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21893279

Country of ref document: EP

Kind code of ref document: A1