CN109684483B - Knowledge graph construction method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN109684483B
Authority
CN
China
Prior art keywords
text corpus
topic model
wechat
keyword
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811510375.7A
Other languages
Chinese (zh)
Other versions
CN109684483A (en)
Inventor
吴壮伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201811510375.7A
Publication of CN109684483A
Application granted
Publication of CN109684483B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a knowledge graph construction method, a knowledge graph construction device, computer equipment and a computer-readable storage medium. The method comprises the following steps: acquiring a WeChat public number list in a preset manner; accessing the official interface of the WeChat server according to the WeChat public number list, and acquiring an article list of each WeChat public number in the list; crawling WeChat articles according to the article list to obtain the text corpus required for building a knowledge graph; analyzing the text corpus with a preset tool to obtain a topic model of the text corpus, a time distribution map of the topic model and a keyword combination of the topic model; acquiring the objects and object attributes contained in the text corpus according to the topic model of the text corpus, the time distribution map of the topic model and the keyword combination of the topic model; and drawing the association relations between the objects and the attributes to construct a knowledge graph. The embodiments can thereby manage the content of WeChat public numbers efficiently and visually on the basis of data analysis.

Description

Knowledge graph construction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method and apparatus for constructing a knowledge graph, a computer device, and a computer readable storage medium.
Background
In daily use, each person's WeChat account follows a number of designated public numbers and subscription numbers. When many of them are followed, it becomes difficult for the user to know, on the whole, which kinds of content he or she is following; with too many public numbers and subscription numbers, some of their information is never browsed at all. The efficiency of processing public number and subscription number information is therefore reduced.
Disclosure of Invention
The embodiments of the application provide a method, a device, computer equipment and a computer-readable storage medium for constructing a knowledge graph, which can solve the problem in the conventional technology that the information of followed WeChat public numbers and subscription numbers is processed inefficiently.
In a first aspect, an embodiment of the present application provides a method for constructing a knowledge graph, where the method includes: acquiring a WeChat public number list in a preset manner; accessing the official interface of the WeChat server according to the WeChat public number list, and acquiring an article list of each WeChat public number in the list; crawling WeChat articles according to the article list to obtain the text corpus required for knowledge graph construction; analyzing the text corpus with a preset tool to obtain a topic model of the text corpus, a time distribution map of the topic model and a keyword combination of the topic model; acquiring the objects and object attributes contained in the text corpus according to the topic model of the text corpus, the time distribution map of the topic model and the keyword combination of the topic model; and drawing the association relations between the objects and the attributes to construct a knowledge graph.
In a second aspect, an embodiment of the present application further provides a device for constructing a knowledge graph, including: a first acquisition unit, used for acquiring a WeChat public number list in a preset manner; a second acquisition unit, used for accessing the official interface of the WeChat server according to the WeChat public number list and acquiring an article list of each WeChat public number in the list; a crawling unit, used for crawling WeChat articles according to the article list to obtain the text corpus required for building a knowledge graph; an analysis unit, used for analyzing the text corpus with a preset tool to obtain a topic model of the text corpus, a time distribution map of the topic model and a keyword combination of the topic model; a third acquisition unit, used for acquiring the objects and object attributes contained in the text corpus according to the topic model of the text corpus, the time distribution map of the topic model and the keyword combination of the topic model; and a first construction unit, used for drawing the association relations between the objects and the attributes so as to construct a knowledge graph.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the knowledge graph construction method when executing the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer readable storage medium, where a computer program is stored, where the computer program when executed by a processor causes the processor to execute the method for constructing a knowledge graph.
The embodiments of the application provide a knowledge graph construction method, a knowledge graph construction device, computer equipment and a computer-readable storage medium. Based on data analysis of WeChat public numbers, the embodiments obtain the WeChat public number list in a preset manner, access the official interface of the WeChat server, obtain and analyze the WeChat articles of each public number, and construct a knowledge graph based on the WeChat public numbers according to the analysis result. The content of the WeChat public numbers can thereby be organized efficiently and managed visually with less manual work. At the same time, the article lists are organized in an orderly way by means of the relationship graph of the WeChat public number knowledge graph, which can improve the efficiency of processing the information of followed public numbers and subscription numbers.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an application scenario of a knowledge graph construction method according to an embodiment of the present application;
Fig. 2 is a flow chart of a knowledge graph construction method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of the data flow of a knowledge graph construction method according to an embodiment of the present application;
Fig. 4 is a schematic sub-flowchart of a knowledge graph construction method according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a topic model in a knowledge graph construction method according to an embodiment of the present application;
Fig. 6 is a schematic diagram of another sub-process in a knowledge graph construction method according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a co-occurrence matrix in a knowledge graph construction method according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a third sub-process in a knowledge graph construction method according to an embodiment of the present application;
Fig. 9 is a schematic diagram of a knowledge graph in a knowledge graph construction method according to an embodiment of the present application;
Fig. 10 is a schematic block diagram of a knowledge graph construction device according to an embodiment of the present application;
Fig. 11 is another schematic block diagram of a knowledge graph construction device according to an embodiment of the present application; and
Fig. 12 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to fig. 1, fig. 1 is a schematic diagram of an application scenario of a knowledge graph construction method according to an embodiment of the present application. The application scenario comprises:
(1) A terminal. The terminal shown in fig. 1 is installed with an application program for displaying the knowledge graph of the followed WeChat public numbers. Through this application program the knowledge graph can be displayed, so that a user gains an overall view of the followed WeChat public number and subscription number information, which improves the efficiency of processing public numbers and subscription numbers. The application program may be a WeChat plug-in, a WeChat applet or an independent application program. The terminal may be an electronic device such as a notebook computer, a tablet computer or a desktop computer. The terminal in fig. 1 is connected with the application server.
(2) An application server. The application server shown in fig. 1 serves the application program installed on the terminal in fig. 1 that displays the knowledge graph of WeChat public numbers. A tool for constructing the knowledge graph of the followed WeChat public numbers is installed on the application server to execute the steps of the knowledge graph construction method. The application server in fig. 1 is connected to the terminal that uses the application program and to the WeChat server, respectively.
(3) A WeChat server, that is, a server providing WeChat services. The application server in fig. 1 is connected with the WeChat server. The application server crawls the WeChat articles contained in the WeChat public numbers from the WeChat server, takes the WeChat articles as the corpus for constructing the knowledge graph, analyzes the corpus to obtain an analysis result, and constructs the knowledge graph of the WeChat public numbers according to the analysis result.
The entities in fig. 1 work as follows: when the application program on the terminal is to display the knowledge graph of WeChat public numbers, it calls the application server to provide the service. The application server receives the WeChat public number list entered through an input device in the application program on the terminal, or retrieves the WeChat public number list through an application programming interface; accesses the official interface of the WeChat server according to the WeChat public number list and acquires an article list of each WeChat public number in the list; crawls WeChat articles according to the article list to obtain the text corpus required for knowledge graph construction; analyzes the text corpus with a preset tool to obtain a topic model of the text corpus, a time distribution map of the topic model and a keyword combination of the topic model; acquires the objects and object attributes contained in the text corpus according to the topic model of the text corpus, the time distribution map of the topic model and the keyword combination of the topic model; and draws the association relations between the objects and the attributes to construct a knowledge graph, which visualizes the WeChat public number information. The visualized knowledge graph of the WeChat public numbers is then displayed through the application program on the terminal.
It should be noted that fig. 1 illustrates only a mobile phone as the terminal. In actual operation the type of terminal is not limited to that illustrated in fig. 1; the terminal may also be an electronic device such as a notebook computer or a tablet computer. The application scenario of the knowledge graph construction method serves only to illustrate the technical solution of the present application, and the connection relationships may also take other forms.
Fig. 2 is a schematic flow chart of a knowledge graph construction method provided by an embodiment of the present application. The knowledge graph construction method is applied to the application program server in fig. 1 to complete all or part of functions of the knowledge graph construction method.
Referring to fig. 2 and fig. 3, fig. 2 is a flow chart of a knowledge graph construction method according to an embodiment of the present application, and fig. 3 is a data flow processing schematic diagram of the knowledge graph construction method according to an embodiment of the present application. As shown in fig. 2, the method includes the following steps S210 to S260:
S210, acquiring a WeChat public number list in a preset mode.
The knowledge graph uses visualization technology to describe the information contained in the followed WeChat public numbers. Through mining, analysis, construction and drawing, it displays the knowledge a WeChat user follows and the interrelations among that knowledge, so the topics or information the user is interested in can be reflected by the knowledge graph. Further, a personal knowledge graph describes, for an individual user and by means of visualization technology, the information contained in the WeChat public numbers that the user follows; through mining, analysis, construction and drawing it displays the individual's knowledge and the interrelations among that knowledge, and can reflect the topics or information the individual follows.
The preset manner refers to obtaining the public number list either through an application programming interface (API) or by receiving user input. When a public number is searched for in the WeChat public number interface, a link corresponding to the public number appears, and the public number is accessed through that link. Obtaining the public number list by receiving user input therefore means receiving the public numbers entered by the user and obtaining the link corresponding to each of them, which yields the public number list provided by the user. An API (Application Programming Interface) is a set of predefined functions intended to give applications and developers access to a set of routines of certain software or hardware without accessing source code or understanding the details of the internal working mechanism.
Specifically, the application server obtaining the WeChat public number list in a preset manner means that the application server obtains the public number list to be crawled either through an API or by receiving, through the terminal, public numbers provided by the user. Acquiring the public number list through an API means that, when the application server starts the application program for constructing a knowledge graph based on WeChat public numbers, it obtains through the API the permission to read the individual's public number list on WeChat and then automatically obtains that list through the API, which yields the public number list to be crawled. Acquiring the public number list from the user means that the application program displaying the personal knowledge graph on the terminal provides an interface for adding public numbers, for example by entering the name of a public number, and the public numbers added there yield the public number list to be crawled.
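As a minimal sketch of step S210, the two routes described above (an API call and user input) can be merged into a single list to crawl. The function name `get_public_number_list` and the `api_fetch` callable are hypothetical placeholders; the source does not name a concrete WeChat API, so the API route is represented by an injected function.

```python
from typing import Callable, Iterable, List, Optional


def get_public_number_list(
    api_fetch: Optional[Callable[[], Iterable[str]]] = None,
    user_input: Iterable[str] = (),
) -> List[str]:
    """Merge public numbers from the API route and the user-input route.

    api_fetch is a hypothetical stand-in for the API call; the order of
    first appearance is kept and duplicates are dropped.
    """
    sources = []
    if api_fetch is not None:
        sources.append(api_fetch())
    sources.append(user_input)

    names: List[str] = []
    seen = set()
    for source in sources:
        for raw in source:
            name = raw.strip()
            if name and name not in seen:
                seen.add(name)
                names.append(name)
    return names
```

Either route may be used alone: passing only `user_input` covers the case where the user adds public numbers by name in the application program.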
S220, accessing the official interface of the WeChat server according to the WeChat public number list, and acquiring an article list of each WeChat public number in the list.
The article list of the WeChat public number refers to WeChat articles which are contained in the WeChat public number and are presented in a list form.
Specifically, the application program server accesses each WeChat public number one by one through the WeChat server official interface according to the WeChat public number list, and obtains an article list of each WeChat public number in the WeChat public number list.
S230, crawling WeChat articles according to the article list to obtain text corpus required by knowledge graph construction.
Specifically, when a knowledge graph of the WeChat public numbers followed by a certain WeChat user is to be constructed, the application server acquires the list of WeChat public numbers followed by that user, acquires the article list of each public number by accessing the WeChat server according to the list, crawls the WeChat articles contained in the article list of each WeChat public number with a web crawler program, and takes the text contained in the WeChat articles as the text corpus for constructing the knowledge graph. A web crawler program, also called a Spider, Web Crawler or Robot, is a program that roams a collection of Web documents along links. It generally resides on a server, reads the document at each given URL using standard protocols such as HTTP, and then continues roaming with all unvisited URLs found in that document as new starting points, until no new URLs that satisfy the conditions remain.
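The link-following behaviour described above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: `crawl` and the injected `fetch` callable are hypothetical names, and `fetch` stands in for the HTTP request and HTML parsing that a real crawler would perform.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple


def crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], Tuple[str, List[str]]],
    max_pages: int = 100,
) -> Dict[str, str]:
    """Roam documents along links, starting from the seed URLs.

    fetch(url) is assumed to return (text, outgoing_links) and to raise
    an exception when the page cannot be read.
    """
    queue = deque(seeds)
    visited = set()
    corpus: Dict[str, str] = {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            text, links = fetch(url)
        except Exception:
            continue  # unreadable page: skip it and keep roaming
        corpus[url] = text
        # Unvisited URLs found in this document become new starting points.
        for link in links:
            if link not in visited:
                queue.append(link)
    return corpus
```

The traversal stops when the queue is empty, which corresponds to no new URLs remaining, or when the hypothetical `max_pages` limit is reached.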
S240, analyzing the text corpus by using a preset tool to obtain a topic model of the text corpus, a time distribution map of the topic model and a keyword combination of the topic model.
The preset tool is a tool for analyzing the text corpus to obtain target data, such as a three-layer Bayesian probability model, also called an LDA model, or a term frequency-inverse document frequency matrix, also called TF-IDF.
Specifically, the application server analyzes the text corpus to acquire the keywords it contains, screens those keywords according to preset rules to obtain the keywords that meet preset conditions, and uses them as the content data from which the knowledge graph is generated, thereby obtaining the analysis result. For example, the text corpus may be input into a three-layer Bayesian probability model to generate a topic model of the text corpus; a time distribution map of the topic model is generated according to the time distribution of the text corpus; the text corpus is analyzed with a term frequency-inverse document frequency matrix to obtain a keyword co-occurrence map of the text corpus; and the keyword combinations whose frequency in the keyword co-occurrence map exceeds a preset frequency are obtained and stored as the keyword combinations of the topic model, which yields the analysis result.
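The TF-IDF and co-occurrence part of this step might look roughly like the sketch below. It is an illustration, not the patented procedure: the function names, the `top_k` cut-off and the use of per-document keyword sets are assumptions, and the documents are assumed to be already tokenized (word segmentation of Chinese text is out of scope here).

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def tfidf_keywords(docs: List[List[str]], top_k: int = 3) -> List[Set[str]]:
    """Return the top_k terms of each tokenized document ranked by TF-IDF."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    keyword_sets: List[Set[str]] = []
    for doc in docs:
        if not doc:
            keyword_sets.append(set())
            continue
        term_freq = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keyword_sets.append(set(ranked[:top_k]))
    return keyword_sets


def frequent_pairs(
    keyword_sets: List[Set[str]], preset_frequency: int
) -> Dict[Tuple[str, str], int]:
    """Count keyword pairs that co-occur in a document and keep those
    whose count exceeds preset_frequency."""
    pair_counts: Counter = Counter()
    for keywords in keyword_sets:
        for pair in combinations(sorted(keywords), 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c > preset_frequency}
```

For example, with three short documents in which "loan" and "rate" appear together twice, `frequent_pairs(tfidf_keywords(docs), 1)` keeps only that pair.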
S250, acquiring the objects and object attributes contained in the text corpus according to the topic model of the text corpus, the time distribution map of the topic model and the keyword combination of the topic model.
S260, drawing the association relations between the objects and the attributes to construct a knowledge graph.
Specifically, the server classifies the text corpus according to the topic model and the keyword combination of the topic model, obtains an article list under a corresponding topic, determines an object of the topic, extracts sentences containing the object according to the article list to form a sentence set, analyzes the sentence set, screens out attributes of the sentence set and subordinate attributes of the attributes, and draws association relations among the object, the attributes and the subordinate attributes to construct a knowledge graph.
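One simple way to hold the association relations among objects, attributes and subordinate attributes is a set of labelled edges. The sketch below assumes the extraction step has already produced `(object, attribute, subordinate attribute)` triples; the function name and the relation labels `has_attribute` and `has_sub_attribute` are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

Triple = Tuple[str, str, Optional[str]]
Edge = Tuple[str, str, str]


def build_knowledge_graph(triples: Iterable[Triple]) -> List[Edge]:
    """Turn (object, attribute, sub_attribute) triples into labelled edges.

    Each edge is (source node, relation label, target node); duplicates
    are merged and the result is sorted for stable output.
    """
    edges = set()
    for obj, attribute, sub_attribute in triples:
        edges.add((obj, "has_attribute", attribute))
        if sub_attribute is not None:
            edges.add((attribute, "has_sub_attribute", sub_attribute))
    return sorted(edges)
```

The resulting edge list can be handed to any graph drawing tool; the source does not specify which one is used.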
According to the embodiment of the application, the WeChat public number list is obtained in a preset manner, the official interface of the WeChat server is accessed to obtain the article list of each WeChat public number in the list, the WeChat articles contained in each article list are crawled to obtain the text corpus for constructing the personal knowledge graph, the text corpus is analyzed, and a knowledge graph based on the WeChat public numbers is constructed according to the analysis result. The content of the WeChat public numbers can thus be organized efficiently with less manual work. At the same time, the article lists are organized in an orderly way by means of the relationship graph, which realizes visual management of the content of the WeChat public numbers and can improve the efficiency of processing the information of followed public numbers and subscription numbers.
In one embodiment, before the step of crawling WeChat articles according to the article list to obtain the parsed text corpus required for building the knowledge graph, the method further includes:
A crawler is constructed that includes a proxy internet protocol address pool and a cache data pool.
Here, IP is the abbreviation of Internet Protocol. An Internet Protocol address (IP address) is a numerical label assigned to a device that uses the Internet Protocol on the internet. The proxy internet protocol address pool refers to a proxy IP pool, also called an IP proxy pool, which consists of multiple proxy IPs. Because a large number of requests from the same IP to one website in a short time usually causes that IP to be blocked, the blocking problem can be addressed when crawling data by using proxy IPs, in addition to adding delay (which is suitable when the crawl volume is small or there is no requirement on crawl speed).
The cache data pool refers to a Cookies pool, which consists of multiple Cookies. Cookies (singular: Cookie) are data, usually encrypted, that a website stores on the user's local terminal in order to identify the user and track the session.
Specifically, because many websites adopt anti-crawler policies and may limit the access frequency of each IP, a proxy IP pool and a Cookies pool are constructed to keep crawling effective and avoid being restricted by those policies. To build the proxy IP pool, a crawler can collect proxy IPs from multiple free websites in advance, check whether each IP is available, and store the available ones in the pool. Alternatively, a paid proxy service can be purchased, or proxy servers can be self-built; self-built proxy servers are stable but require a large amount of server resources.
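The "check, then store if available" step for building the proxy IP pool can be sketched as follows. The names `build_proxy_pool` and `is_available` are hypothetical; in practice the availability check would send a test request through the proxy, which is represented here by an injected function.

```python
from typing import Callable, Iterable, List


def build_proxy_pool(
    candidates: Iterable[str],
    is_available: Callable[[str], bool],
) -> List[str]:
    """Keep only the candidate proxies that pass the availability check."""
    pool: List[str] = []
    for proxy in candidates:
        if proxy in pool:
            continue  # already stored
        try:
            ok = is_available(proxy)
        except Exception:
            ok = False  # a failing check counts as unavailable
        if ok:
            pool.append(proxy)
    return pool
```

The candidates may come from any of the three sources named above (free websites, a paid service, or self-built servers); the pool logic is the same in each case.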
Further, in one embodiment, after the step of building the crawler program including the proxy internet protocol address pool and the cache data pool, the method further includes:
Updating the proxy IP of the proxy IP pool and the Cookies in the Cookies pool.
Because proxy IPs and Cookies are time-limited, the proxy IPs and Cookies used for crawling need to be updated from time to time to keep crawling continuous and to ensure that the entries in the proxy IP pool and the Cookies pool remain valid. The proxy IP pool can be updated in several ways: a crawler collects proxy IPs from multiple free websites in advance, checks whether each IP is available, and stores the available ones in the pool; or a paid proxy service is purchased and the purchased proxy IPs are stored in the pool; or proxy servers are self-built and their proxy IPs are stored in the pool. In addition, if the server determines that a proxy IP in the pool is invalid, the invalid proxy IP is removed from the pool. Because a Cookie is data that a website stores on the user's local terminal to identify the user and track the session, the Cookies in the Cookies pool are updated correspondingly after the proxy IP data in the proxy IP pool is updated.
After the crawler program comprising the proxy internet protocol address pool and the cache data pool is constructed, the public number list is obtained through the API, or the public number list provided by the user is obtained through the interface, which yields the public number list to be crawled.
After the public number list is obtained, in order to crawl WeChat articles more efficiently, the code for crawling public number articles, which takes the target public number list as input, is packaged into a Docker container; the Docker containers are started and deployed on multiple machines, and the crawled article list files are stored in a directory on the main server. Specifically, in the embodiment of the application, a distributed system is built with Docker containers: multiple Docker containers are distributed across different machines, each packaged with the article-crawling code that takes the target public number list as input, and the crawled article list files are stored in the directory on the main server. Docker is an open-source application container engine that lets developers package an application and its dependencies into a portable container and then publish it to any popular Linux machine; it can also provide virtualization. Containers are fully sandboxed, have no interfaces with one another, are independent of any language, framework and system, incur little performance overhead, and can easily be run on machines and in data centers.
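The update described above (drop invalid proxies, add newly validated ones, then bring the Cookies pool in line) can be sketched as one refresh function. Keying the Cookies pool by proxy is an assumption made for this illustration, and `refresh_pools`, `is_available` and `new_cookie` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def refresh_pools(
    proxies: List[str],
    cookies: Dict[str, str],
    new_candidates: Iterable[str],
    is_available: Callable[[str], bool],
    new_cookie: Callable[[str], str],
) -> Tuple[List[str], Dict[str, str]]:
    """Return updated (proxy pool, Cookies pool).

    Assumption: each Cookie is stored under the proxy it was obtained with,
    so Cookies of removed proxies are dropped and new proxies get new Cookies.
    """
    # Remove proxies that are no longer valid.
    kept = [p for p in proxies if is_available(p)]
    # Add newly collected proxies that pass the availability check.
    for proxy in new_candidates:
        if proxy not in kept and is_available(proxy):
            kept.append(proxy)
    # Update the Cookies pool to match the updated proxy pool.
    refreshed = {
        p: cookies[p] if p in cookies else new_cookie(p) for p in kept
    }
    return kept, refreshed
```

Running such a refresh on a timer is one possible way to update the pools from time to time; the source does not specify the schedule.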
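Since each container takes a public number list as input, the target list has to be divided among the containers in some way. The source does not say how; the sketch below assumes a simple round-robin split, and `shard_public_numbers` is a hypothetical name.

```python
from typing import List


def shard_public_numbers(public_numbers: List[str], n_workers: int) -> List[List[str]]:
    """Split the target list into n_workers slices, one per container.

    Round-robin assignment keeps the slice sizes within one item of
    each other.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [public_numbers[i::n_workers] for i in range(n_workers)]
```

Each slice would then be passed to one container as its input list, and every container writes its crawled article list files to the shared directory on the main server.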
Referring to fig. 4, fig. 4 is a schematic sub-flowchart of a knowledge graph construction method according to an embodiment of the application. In this embodiment, the step of analyzing the text corpus by using a preset tool to obtain a topic model of the text corpus, a time distribution map of the topic model, and a keyword combination of the topic model includes:
S241, inputting the text corpus into a three-layer Bayesian probability model to generate a topic model of the text corpus.
The three-layer Bayesian probability model is the Latent Dirichlet Allocation model, abbreviated as the LDA model. It is a document topic generation model with a three-layer structure of words, topics and documents.
According to the embodiment of the application, based on the text corpus and the LDA model of the WeChat article, the topic model is obtained, and based on the probability distribution of the topic model related to the text corpus of the WeChat article, the probability distribution data under different topics are stored.
Specifically, the step of generating the topic model includes:
Firstly, training an LDA model through training corpus to obtain a topic model.
LDA is an unsupervised machine learning technique. The topic model is obtained by training an existing LDA model on the training corpus. The topic model classifies the topics of input text passages: it takes a text passage as input and outputs a probability distribution over different topics.
The training corpus used for training the LDA model can be an acquired WeChat article, and the acquired topic model, namely the WeChat topic model trained based on the WeChat article, can improve the accuracy of the topic model related to the WeChat user. Further, the training corpus during training the LDA model may be not only WeChat articles, but also article corpus obtained from other channels, such as websites, books, newspaper magazines, etc., in order to diversify the sources of the training corpus and thus improve the accuracy of the LDA model training.
Secondly, the WeChat articles of the WeChat public numbers are input into the topic model to obtain the probability distribution of the topics related to the WeChat articles.
With continued reference to fig. 3, the server obtains a list of WeChat public numbers of a user, obtains a list of articles of WeChat public numbers from each WeChat public number according to the list of WeChat public numbers, obtains a WeChat article of each public number according to the list of WeChat articles of each WeChat public number, uses the obtained WeChat article as article corpus, inputs the article corpus into an obtained topic model, and the topic model automatically outputs probability distribution of topics related to the WeChat article.
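To make the "probability distribution of topics" concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling using only the Python standard library. It is not the implementation of the application: the function name `train_lda`, the hyperparameters `alpha` and `beta`, and the toy documents are assumptions, and the sketch returns distributions only for the documents it is fitted on (inference on unseen articles is omitted). Articles are assumed to be already segmented into words.

```python
import random
from typing import Dict, List


def train_lda(docs: List[List[str]], num_topics: int, iters: int = 100,
              alpha: float = 0.1, beta: float = 0.01, seed: int = 0) -> List[List[float]]:
    """Fit LDA by collapsed Gibbs sampling; return per-document topic distributions."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: document-topic counts, topic-word counts, topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word: List[Dict[str, int]] = [dict() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign: List[List[int]] = []
    for d, doc in enumerate(docs):  # random initial topic for every word
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] = topic_word[z].get(w, 0) + 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] = topic_word[z].get(w, 0) + 1
                topic_total[z] += 1
    # Smoothed topic proportions per document.
    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


# Toy corpus: two medical-style and two financial-style documents.
docs = [
    ["hospital", "doctor", "medicine", "doctor"],
    ["medicine", "hospital", "patient"],
    ["stock", "fund", "market", "stock"],
    ["market", "fund", "bank"],
]
theta = train_lda(docs, num_topics=2, iters=50)
```

Each row of `theta` is the topic distribution of one document and sums to 1. In practice a library implementation of LDA would normally be used instead of a hand-written sampler.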
S242, generating a time distribution map of the topic model according to the time distribution of the text corpus.
The time distribution map shows how a given subject is distributed over different time periods. The subject includes different topics or different events. The time distribution map is used to observe how different topics, different events and the like evolve over different time periods. Its purpose is to present the distribution of different subjects over time, as well as the contribution of different subjects to popularity at the same point in time.
Specifically, the time distribution map of the topic model is generated according to the time distribution of the text corpus: based on the release time of the WeChat articles, the probability distributions output by the LDA model for different time periods are obtained and stored. Referring to fig. 5, fig. 5 is a schematic diagram of a topic model in a knowledge graph construction method according to an embodiment of the present application. Fig. 5 is an example of a time distribution map, illustrating the distribution of articles on medical and financial topics followed by a WeChat user from January to June.
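One way to derive such a time distribution is to group the per-article topic probabilities by release month and average them. The text does not specify the aggregation, so the averaging, the function name `topic_time_distribution` and the sample values below are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def topic_time_distribution(
    articles: List[Tuple[str, List[float]]]
) -> Dict[str, List[float]]:
    """Average topic probabilities per release month ('YYYY-MM')."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for month, probs in articles:
        if month not in sums:
            sums[month] = [0.0] * len(probs)
        for k, p in enumerate(probs):
            sums[month][k] += p
        counts[month] += 1
    # Months are returned in chronological order.
    return {m: [s / counts[m] for s in sums[m]] for m in sorted(sums)}


# Each entry: (release month, [P(medical), P(financial)]); values are made up.
articles = [
    ("2018-01", [0.8, 0.2]),
    ("2018-01", [0.6, 0.4]),
    ("2018-02", [0.1, 0.9]),
]
timeline = topic_time_distribution(articles)
```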
S243, analyzing the text corpus by using a word frequency-inverse document frequency matrix to obtain a keyword co-occurrence map of the text corpus, and obtaining keyword combinations exceeding a preset frequency in the keyword co-occurrence map as the keyword combinations of the topic model.
Here, the word frequency-inverse document frequency matrix is known in English as Term Frequency-Inverse Document Frequency, abbreviated as TF-IDF. The main idea of TF-IDF is that if a word or phrase appears frequently in one article (its TF is high) and rarely in other articles, it is considered to have good class-distinguishing ability and is suitable for classification.
Specifically, the word frequency-inverse document frequency matrix is used to analyze the text corpus and construct a keyword co-occurrence map of the text corpus, and the keyword combinations exceeding a preset frequency in the keyword co-occurrence map are obtained as the keyword combinations of the topic model. These combinations serve as the basis for determining the topic model to which the text corpus belongs, so that the text corpus is classified into different attribute categories such as a voice recognition category, an image recognition category or a deep learning category.
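The text does not give a formula for the weighting. A commonly used textbook form, stated here as an assumption rather than as the formula of the application, is shown below, where n_{t,d} is the number of occurrences of term t in article d and N is the number of articles:

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\lvert \{ d : t \in d \} \rvert}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

A term that occurs in every article has idf equal to log 1 = 0, so it receives no weight, which matches the idea that such a term does not distinguish between classes.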
Further, referring to fig. 6 and fig. 7, fig. 6 is another schematic sub-flowchart of the method for constructing a knowledge graph according to an embodiment of the present application, and fig. 7 is a schematic co-occurrence matrix of the method for constructing a knowledge graph according to an embodiment of the present application. In this embodiment, the step of analyzing the text corpus using a word frequency-inverse document frequency matrix to obtain a keyword co-occurrence map of the text corpus includes:
S2431, based on the text corpus, obtaining preset number of keywords of each WeChat article through a word frequency-inverse document frequency matrix;
S2432, aggregating and deduplicating all the preset numbers of keywords to obtain a non-repeated keyword vocabulary;
S2433, constructing a keyword co-occurrence matrix according to the keyword vocabulary to obtain a keyword co-occurrence map.
Specifically, based on the text corpus, a preset number of keywords of each WeChat article is obtained through a word frequency-inverse document frequency matrix, all the preset numbers of keywords are aggregated and deduplicated to obtain a non-repeated keyword vocabulary, and a keyword co-occurrence matrix is constructed according to the keyword vocabulary to obtain a keyword co-occurrence map.
For example, based on the text corpus, the Top10 keywords of each article are obtained through the TF-IDF matrix; the Top10 keywords of all articles are aggregated and deduplicated to obtain a non-repeated keyword vocabulary {w1, w2, ..., wm}, where m is the number of keywords. The process of obtaining the non-repeated keyword vocabulary includes:
The step of obtaining 10 keywords for each article through TF-IDF includes: taking each WeChat article as the text corpus, performing Chinese word segmentation on the text corpus to obtain the vocabulary of each WeChat article, and obtaining the Top10 keywords of each article according to the TF-IDF matrix.
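The keyword extraction step can be sketched as follows, assuming the articles have already been segmented into words (the segmentation tool is not shown) and using the common tf × log(N/df) weighting, which the text does not specify. The function name `top_keywords` and the toy documents are hypothetical; ties are broken alphabetically so the result is deterministic.

```python
import math
from collections import Counter
from typing import List


def top_keywords(docs: List[List[str]], k: int = 10) -> List[List[str]]:
    """Return the k highest-scoring TF-IDF keywords of each tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores = {w: (c / total) * math.log(n_docs / df[w]) for w, c in tf.items()}
        # Sort by descending score, then alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:k])
    return result


docs = [
    ["speech", "recognition", "speech", "model"],
    ["image", "recognition", "image", "model"],
    ["speech", "decoding", "model"],
]
top2 = top_keywords(docs, k=2)
```

In this toy corpus "model" appears in every document, so its score is zero and it never reaches the top of any list.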
The Top10 keywords of all articles are aggregated and deduplicated to obtain a non-repeated keyword vocabulary {w1, w2, ..., wm}, where m is a natural number of 10 or more. For example, if there are 10 WeChat articles, the 10 keywords obtained from each of the 10 articles are pooled to give 100 keywords; if these 100 keywords contain duplicates, the duplicates are removed so that every keyword appears exactly once, which yields the non-repeated keyword vocabulary.
After the non-repeated keyword vocabulary is obtained, it is used to construct a keyword co-occurrence matrix as follows: the rows and columns of the matrix are both indexed by the non-repeated keyword vocabulary {w1, w2, ..., wm}; the combinations of Top10 keywords in all articles are traversed, and 1 is added at the position of the corresponding keyword pair; the resulting keyword co-occurrence matrix is the keyword co-occurrence map. That is, the keyword co-occurrence map refers to the co-occurrence keyword matrix.
Specifically, the construction process of the co-occurrence keyword matrix comprises the following steps: firstly, all keywords are collected and deduplicated to generate a non-repeated vocabulary list, and an initialized co-occurrence keyword matrix is generated with the vocabulary list as both the horizontal axis and the vertical axis; secondly, the keyword lists of all texts are traversed, the words appearing in each keyword list are combined two by two, and 1 is added to the corresponding value of the co-occurrence keyword matrix; finally, the co-occurrence keyword matrix is obtained.
For example, referring to fig. 7, suppose the non-repeated keyword vocabulary obtained includes a, b, c, d, e, f, j, h, i and g. A co-occurrence matrix is constructed with a, b, c, d, e, f, j, h, i and g as the first row and the first column respectively, so that the intersection of each row and each column corresponds to a pairwise combination of keywords, for example aa, ab, ac, ad ... ba, bb, bc .... All articles are then traversed to check whether each pairwise combination occurs; each time a combination occurs in an article, 1 is added at the position of that combination. For example, if the combination aa appears in one article, 1 is added at the position of aa; if the combination de appears in six articles, the value at the position of de accumulates to 6. Here ab and ba are the same combination. This continues until the co-occurrence keyword matrix is constructed; the final result is shown in fig. 7.
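The construction described above can be sketched as follows. The function name `build_cooccurrence` and the sample keyword lists are hypothetical. Two points are assumptions rather than statements of the text: a diagonal entry such as aa is taken to count the articles whose keyword list contains a, and because ab and ba are the same combination the matrix is kept symmetric.

```python
from itertools import combinations
from typing import List, Tuple


def build_cooccurrence(keyword_lists: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Build a symmetric keyword co-occurrence matrix from per-article keyword lists."""
    # Aggregate and deduplicate all keywords into a sorted vocabulary.
    vocab = sorted({w for kws in keyword_lists for w in kws})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for kws in keyword_lists:
        unique = sorted(set(kws))
        for w in unique:
            # Diagonal entry: the keyword occurs in this article (assumed reading of "aa").
            matrix[index[w]][index[w]] += 1
        for a, b in combinations(unique, 2):
            # Off-diagonal entries: both keywords occur in the same article.
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return vocab, matrix


keyword_lists = [["d", "e", "a"], ["d", "e"], ["e", "f"]]
vocab, matrix = build_cooccurrence(keyword_lists)
d, e = vocab.index("d"), vocab.index("e")
```

Here `matrix[d][e]` is 2 because d and e appear together in two keyword lists.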
After the keyword co-occurrence map is obtained, the keyword combinations exceeding a preset frequency in the map are obtained and stored as the keyword combinations of the topic model, and serve as the basis for determining the topic model to which the text corpus belongs, so that the text corpus can be classified. For example, the keyword combinations whose frequency exceeds 5 in the keyword co-occurrence matrix are obtained and stored.
Here, the frequency (also called the "count") is defined as follows: when samples are divided into several groups according to some method, the number of individual samples contained in each group is called the frequency. For example, in fig. 7, the frequency of aa is 1 and the frequency of de is 6.
Specifically, based on the co-occurrence keyword matrix, two pieces of information can be obtained from the keyword combinations that satisfy the condition: 1) which keywords are high-frequency keywords; 2) which keywords the hotspot keywords are similar to.
With continued reference to fig. 7, the keyword combinations whose frequency exceeds a preset value in the keyword co-occurrence matrix are obtained and stored, for example those whose frequency exceeds 5. As shown in fig. 7, if the keyword combination whose frequency exceeds 5 is de, it can be determined that keywords d and e are high-frequency keywords, and it can further be determined which keywords in the keyword co-occurrence matrix the hotspot keywords are similar to, or whether a hotspot keyword is similar to d and e.
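Selecting the combinations above the preset frequency amounts to scanning one triangle of the symmetric matrix. The sketch below uses a small hand-written matrix whose values are illustrative (they are not taken from fig. 7); the function name `frequent_pairs` is hypothetical, and "exceeding" is read as strictly greater than the threshold.

```python
from typing import List, Tuple


def frequent_pairs(vocab: List[str], matrix: List[List[int]],
                   min_count: int) -> List[Tuple[str, str, int]]:
    """Return keyword pairs whose co-occurrence count exceeds min_count."""
    pairs = []
    for i in range(len(vocab)):
        # Only the upper triangle is scanned, since ab and ba are the same pair.
        for j in range(i + 1, len(vocab)):
            if matrix[i][j] > min_count:
                pairs.append((vocab[i], vocab[j], matrix[i][j]))
    # Most frequent pairs first.
    return sorted(pairs, key=lambda p: -p[2])


vocab = ["d", "e", "f"]
matrix = [
    [6, 6, 1],
    [6, 7, 2],
    [1, 2, 2],
]
hot = frequent_pairs(vocab, matrix, min_count=5)
```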
S244, acquiring and storing the topic model, the time distribution map of the topic model and the keyword combination of the topic model.
Specifically, the topic model, the time distribution map of the topic model and the keyword combination of the topic model are obtained and stored, and are used together as the analysis result of the text corpus.
In one embodiment, the step of inputting the text corpus into a three-layer bayesian probabilistic model to generate a topic model of the text corpus comprises:
obtaining a trained three-layer Bayesian probability model;
and inputting the text corpus into the three-layer Bayesian probability model to generate a topic model of the text corpus.
Specifically, the three-layer Bayesian probability model, namely the LDA model, is a document topic generation model with a three-layer structure of words, topics and documents. A training text corpus is input into the LDA model, which analyzes it automatically by unsupervised machine learning and outputs the probability distribution of the text corpus over different topics. After training, the LDA model has a relatively accurate recognition rate and can generate the topic model corresponding to an input text corpus. For example, according to the attributes of the text corpus of the WeChat articles contained in the WeChat public numbers, the WeChat articles mainly relate to medical topics, financial topics, time management topics, historical topics and the like, and the LDA model is trained with training text corpora of these topics to improve its accuracy. After the recognition accuracy of the LDA model reaches the preset accuracy, the text corpus to be recognized is input into the three-layer Bayesian probability model so that the topic model of the text corpus can be generated accurately. For example, if the text corpus to be recognized contains WeChat articles on financial topics, the LDA model can generate a financial topic model for the text corpus.
Referring to fig. 8 and fig. 9, fig. 8 is a schematic diagram of a third sub-flowchart in the knowledge graph construction method according to the embodiment of the present application, and fig. 9 is a schematic diagram of a knowledge graph in the knowledge graph construction method according to the embodiment of the present application. In this embodiment, the step of obtaining the object and the attribute of the object included in the text corpus according to the topic model of the text corpus, the time distribution map of the topic model, and the keyword combination of the topic model includes:
S2501, determining an object of the topic according to the topic model;
S2502, classifying the text corpus according to the topic model to obtain an article list under a corresponding topic;
S2503, extracting sentences containing the objects to form a sentence set according to the article list;
S2504, analyzing the sentence set to screen out the attribute of the sentence set.
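As an illustration of the sentence extraction step (S2503), the sketch below splits each article into sentences and keeps those that mention the object. The function name `extract_sentences`, the punctuation-based splitting rule and the case-insensitive substring match are all assumptions; the text does not specify how sentences are delimited or matched.

```python
import re
from typing import List

# A sentence is a run of non-terminator characters plus an optional terminator;
# both Western and Chinese sentence-ending punctuation are covered.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def extract_sentences(articles: List[str], obj: str) -> List[str]:
    """Collect the sentences from the article list that contain the object."""
    sentences = []
    for text in articles:
        for match in _SENTENCE.findall(text):
            sentence = match.strip()
            if sentence and obj.lower() in sentence.lower():
                sentences.append(sentence)
    return sentences


articles = [
    "Speech recognition converts audio to text. Image models classify pictures.",
    "Modern speech recognition uses deep learning! Is it accurate?",
]
sentence_set = extract_sentences(articles, "speech recognition")
```

The resulting sentence set would then be passed to word segmentation and part-of-speech tagging to screen out attributes, which is not shown here.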
Further, in one embodiment, the sentence set is analyzed, the attribute of the sentence set and the attribute of the lower level are screened out, and then the association relationship among the object, the attribute and the attribute of the lower level is drawn to construct a knowledge graph, so that the knowledge graph of the WeChat article is described in more detail, and the information efficiency of the WeChat article is further improved.
The object refers to a topic related to WeChat articles, such as an entertainment, financial, medical or administrative topic. An attribute describes the characteristics of a particular object and is static; for example, with continued reference to fig. 9, the attributes under entertainment may include movie, sports, literature, etc. A secondary attribute is a subordinate concept of an attribute and describes the attribute more specifically; for example, the attribute movie includes secondary attributes such as heat map.
Specifically, referring to fig. 9, the objects contained in the text corpus and the attributes of the objects are obtained according to the topic model of the text corpus, the time distribution map of the topic model and the keyword combination of the topic model, and the association relationships between the objects and the attributes are drawn to construct a knowledge graph, that is, to construct the ontology structure of object, attribute and secondary attribute. For example, if the topics related to WeChat articles are divided into blocks such as entertainment, sports and society, each block is constructed as an object, and an object can be understood as a node.
Each block has event content, which can be understood as the value of an attribute on the node, so an attribute is built on the object.
When the event is taken as an attribute, it may in turn have secondary attributes, such as the latest event or the hottest event, and the secondary attributes are built on the node as well.
Obtaining keywords meeting preset conditions, constructing an ontology structure of an object, an attribute and a secondary attribute according to the keywords, and specifically comprising the following steps:
First, the text corpus data is classified according to the field of the topic model, for example into an image recognition class, a speech recognition class, and so on. The first ten keywords of each WeChat article can be extracted through the TF-IDF matrix and matched against the core keywords of the corresponding topic type. For example, if the core keywords of the speech recognition text corpus include "speech recognition" and the keywords extracted from a WeChat article also include "speech recognition", the WeChat article is classified into the article list of the speech recognition class.
Secondly, the object of the similar keywords is determined, and the object is constructed according to the keywords. For example, the object "deep learning speech recognition" is determined according to keywords associated with speech recognition, such as speech document, recognition, decoding, encoding, natural language processing and learning, and the list of articles in the text corpus that belong to the speech recognition object is further obtained according to these keywords.
Thirdly, according to the article list of the object, sentences containing the object are extracted to form a sentence set; the sentences are consolidated, segmented into words and tagged with parts of speech; the attributes are screened out automatically, and all subordinate attributes are searched for until no further subordinate attributes exist.
Fourth, the knowledge graph is constructed, with the structure object-attribute-secondary attribute. With continued reference to fig. 9, the object "entertainment" contains the attributes "movie" and "sports game", the attribute "movie" includes the secondary attributes "recent" and "heat map", and the attribute "sports game" includes the secondary attributes "basketball" and "football".
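The object-attribute-secondary attribute structure of the fig. 9 example can be represented as a nested mapping and flattened into (head, relation, tail) triples, a common way of storing a knowledge graph. The sketch below is illustrative only: the function name `to_triples` and the relation labels `has_attribute` and `has_secondary_attribute` are hypothetical and do not appear in the text.

```python
from typing import Dict, List, Tuple

# object -> attribute -> list of secondary attributes
Ontology = Dict[str, Dict[str, List[str]]]


def to_triples(ontology: Ontology) -> List[Tuple[str, str, str]]:
    """Flatten the ontology into (head, relation, tail) triples."""
    triples = []
    for obj, attributes in ontology.items():
        for attribute, secondary in attributes.items():
            triples.append((obj, "has_attribute", attribute))
            for sub in secondary:
                triples.append((attribute, "has_secondary_attribute", sub))
    return triples


# The example from fig. 9 as described in the text.
ontology: Ontology = {
    "entertainment": {
        "movie": ["recent", "heat map"],
        "sports game": ["basketball", "football"],
    }
}
triples = to_triples(ontology)
```

Each triple corresponds to one edge of the graph, so the example yields two object-attribute edges and four attribute-secondary attribute edges.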
It should be noted that the technical features included in the different embodiments of the knowledge graph construction method may be recombined as needed to obtain combined embodiments, all of which fall within the scope of protection claimed by the present application.
Referring to fig. 10, fig. 10 is a schematic block diagram of a knowledge graph construction apparatus according to an embodiment of the present application. Corresponding to the knowledge graph construction method, the embodiment of the application also provides a knowledge graph construction device. As shown in fig. 10, the knowledge graph construction apparatus includes a unit for performing the knowledge graph construction method described above, and the apparatus may be configured in a computer device such as a server. Specifically, referring to fig. 10, the knowledge graph construction apparatus 1000 includes a first obtaining unit 1001, a second obtaining unit 1002, a crawling unit 1003, an analyzing unit 1004, a third obtaining unit 1005, and a first construction unit 1006.
The first obtaining unit 1001 is configured to obtain a WeChat public number list in a preset manner;
A second obtaining unit 1002, configured to access an official interface of the WeChat server according to the WeChat public number list, and obtain an article list of each WeChat public number in the WeChat public number list;
a crawling unit 1003, configured to crawl WeChat articles according to the article list to obtain text corpus required for building a knowledge graph;
a parsing unit 1004, configured to parse the text corpus by using a preset tool to obtain a topic model of the text corpus, a time distribution map of the topic model, and a keyword combination of the topic model;
A third obtaining unit 1005, configured to obtain an object and an attribute of the object included in the text corpus according to a topic model of the text corpus, a time distribution map of the topic model, and a keyword combination of the topic model;
A first construction unit 1006, configured to draw an association relationship between the object and the attribute to construct a knowledge graph.
Referring to fig. 11, fig. 11 is another schematic block diagram of a knowledge graph construction apparatus according to an embodiment of the present application. As shown in fig. 11, in this embodiment, the knowledge graph construction apparatus 1000 further includes:
a second construction unit 1007 is configured to construct a crawler including a proxy internet protocol address pool and a cache data pool.
With continued reference to fig. 11, in this embodiment, the knowledge graph construction apparatus 1000 further includes:
And an updating unit 1008, configured to update the proxy internet protocol address of the proxy internet protocol address pool and the cache data in the cache data pool.
With continued reference to fig. 11, in this embodiment, the parsing unit 1004 includes:
A first generation subunit 1041, configured to input the text corpus into a three-layer bayesian probability model to generate a topic model of the text corpus;
A second generating subunit 1042, configured to generate a time distribution map of the topic model according to the time distribution of the text corpus;
An analysis subunit 1043, configured to analyze the text corpus by using a word frequency-inverse document frequency matrix to obtain a keyword co-occurrence map of the text corpus, and obtain a keyword combination exceeding a preset frequency in the keyword co-occurrence map as a keyword combination of the topic model;
the first obtaining subunit 1044 is configured to obtain and store the topic model, a time distribution map of the topic model, and a keyword combination of the topic model.
In one embodiment, the first generating subunit 1041 includes:
The third acquisition subunit is used for acquiring the trained three-layer Bayesian probability model;
and the input subunit is used for inputting the text corpus into the three-layer Bayesian probability model to generate a topic model of the text corpus.
In one embodiment, the analysis subunit 1043 includes:
a fourth obtaining subunit, configured to obtain, based on the text corpus, a preset number of keywords of each WeChat article through a word frequency-inverse document frequency matrix;
The second acquisition subunit is used for aggregating and deduplicating all the preset numbers of keywords to obtain a non-repeated keyword vocabulary;
And the construction subunit is used for constructing a keyword co-occurrence matrix by using the keyword vocabulary so as to acquire a keyword co-occurrence map.
With continued reference to fig. 11, in this embodiment, the third obtaining unit 1005 includes:
a determining subunit 1051, configured to determine an object of the topic according to the topic model;
a second obtaining subunit 1052, configured to classify the text corpus according to the topic model to obtain a list of articles under a corresponding topic;
An extraction subunit 1053, configured to extract sentences containing the objects to form a sentence set according to the article list;
a screening subunit 1054, configured to analyze the sentence set to screen out the attribute of the sentence set.
It should be noted that, as those skilled in the art can clearly understand, for the specific implementation processes of the above knowledge graph construction device and of each unit, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity of description, details are not repeated here.
Meanwhile, the division and connection modes of the units in the knowledge graph construction device are only used for illustration, in other embodiments, the knowledge graph construction device can be divided into different units according to the needs, and different connection sequences and modes can be adopted for the units in the knowledge graph construction device to complete all or part of functions of the knowledge graph construction device.
The knowledge graph construction means described above may be implemented in the form of a computer program which can be run on a computer device as shown in fig. 12.
Referring to fig. 12, fig. 12 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 1200 may be an electronic device such as a desktop computer or a tablet computer, or may be a component or part of another device.
With reference to fig. 12, the computer device 1200 includes a processor 1202, a memory and a network interface 1205 connected by a system bus 1201, wherein the memory may include a non-volatile storage medium 1203 and an internal memory 1204.
The non-volatile storage medium 1203 may store an operating system 12031 and a computer program 12032. The computer programs 12032, when executed, enable the processor 1202 to perform a method of knowledge graph construction as described above.
The processor 1202 is operative to provide computing and control capabilities to support operation of the entire computer device 1200.
The internal memory 1204 provides an environment for the execution of a computer program 12032 in the nonvolatile storage medium 1203, and the computer program 12032, when executed by the processor 1202, can cause the processor 1202 to execute a knowledge graph construction method as described above.
The network interface 1205 is used to communicate with other devices over a network. It will be appreciated by those skilled in the art that the structure shown in FIG. 12 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device 1200 to which the present inventive arrangements may be applied, and that a particular computer device 1200 may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components. For example, in some embodiments, the computer device may include only a memory and a processor, and in such embodiments, the structure and function of the memory and the processor are consistent with the embodiment shown in fig. 12, and will not be described again.
Wherein the processor 1202 is configured to execute a computer program 12032 stored in the memory, so as to implement the following steps: acquiring a WeChat public number list in a preset mode; accessing an official interface of the WeChat server according to the WeChat public number list, and acquiring an article list of each WeChat public number in the WeChat public number list; crawling WeChat articles according to the article list to obtain text corpus required by knowledge graph construction; analyzing the text corpus by using a preset tool to obtain a topic model of the text corpus, a time distribution map of the topic model and a keyword combination of the topic model; according to the topic model of the text corpus, the time distribution map of the topic model and the keyword combination of the topic model, obtaining objects and the attributes of the objects contained in the text corpus, and drawing the association relationship between the objects and the attributes to construct a knowledge graph.
In an embodiment, before implementing the step of crawling WeChat articles according to the article list to obtain text corpus required for building a knowledge graph, the processor 1202 further implements the following steps:
A crawler is constructed that includes a proxy internet protocol address pool and a cache data pool.
In one embodiment, after implementing the step of building a crawler comprising a proxy internet protocol address pool and a cache data pool, the processor 1202 further implements the steps of:
Updating the proxy internet protocol address of the proxy internet protocol address pool and the cache data in the cache data pool.
In an embodiment, when the step of parsing the text corpus using a preset tool to obtain a topic model of the text corpus, a time distribution map of the topic model, and a keyword combination of the topic model is implemented by the processor 1202, the following steps are specifically implemented:
Inputting the text corpus into a three-layer Bayesian probability model to generate a topic model of the text corpus;
Generating a time distribution map of the topic model according to the time distribution of the text corpus;
Analyzing the text corpus by using a word frequency-inverse document frequency matrix to obtain a keyword co-occurrence map of the text corpus, and obtaining keyword combinations exceeding a preset frequency in the keyword co-occurrence map as keyword combinations of the topic model;
and acquiring and storing the topic model, the time distribution map of the topic model and the keyword combination of the topic model.
In one embodiment, when implementing the step of inputting the text corpus into a three-layer bayesian probability model to generate a topic model of the text corpus, the processor 1202 specifically implements the following steps:
obtaining a trained three-layer Bayesian probability model;
and inputting the text corpus into the three-layer Bayesian probability model to generate a topic model of the text corpus.
In one embodiment, when the step of analyzing the text corpus using the word frequency-inverse document frequency matrix to obtain the keyword co-occurrence map of the text corpus is implemented by the processor 1202, the following steps are specifically implemented:
Based on the text corpus, obtaining preset number of keywords of each WeChat article through a word frequency-inverse document frequency matrix;
Aggregating and deduplicating all the preset numbers of keywords to obtain a non-repeated keyword vocabulary;
and constructing a keyword co-occurrence matrix by using the keyword vocabulary to acquire a keyword co-occurrence map.
In an embodiment, when implementing the step of obtaining the object and the attribute of the object included in the text corpus according to the topic model of the text corpus, the time distribution map of the topic model, and the keyword combination of the topic model, the processor 1202 specifically implements the following steps:
Determining an object of the theme according to the theme model;
classifying the text corpus according to the topic model to obtain an article list under a corresponding topic;
extracting sentences containing the objects to form a sentence set according to the article list;
analyzing the sentence set to screen out the attribute of the sentence set.
It is to be appreciated that, in an embodiment of the application, the processor 1202 may be a central processing unit (CPU); the processor 1202 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
It will be appreciated by those skilled in the art that all or part of the flow of the method of the above embodiments may be implemented by a computer program, which may be stored on a computer readable storage medium. The computer program is executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present application also provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the knowledge graph construction method described in the above embodiments.
The present application also provides a computer program product which, when run on a computer, causes the computer to perform the steps of the knowledge graph construction method described in the above embodiments.
The computer-readable storage medium may be an internal storage unit of the aforementioned device, such as a hard disk or a memory of the device. The computer-readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The computer-readable storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.
The steps in the methods of the embodiments of the application may be reordered, combined, or deleted according to actual needs. The units in the devices of the embodiments of the application may be combined, divided, or deleted according to actual needs. In addition, the functional units in the embodiments of the present application may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing an electronic device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present application.
While the application has been described with reference to certain preferred embodiments, those skilled in the art will readily conceive of various equivalent changes and substitutions that may be made without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (9)

1. A knowledge graph construction method, characterized by comprising the following steps:
acquiring a WeChat public number list in a preset mode;
accessing the official WeChat server interface according to the WeChat public number list, and acquiring an article list of each WeChat public number in the WeChat public number list;
crawling WeChat articles according to the article list to obtain text corpus required by knowledge graph construction;
Analyzing the text corpus by using a preset tool to obtain a topic model of the text corpus, a time distribution map of the topic model and a keyword combination of the topic model;
acquiring an object and an attribute of the object contained in the text corpus according to the topic model of the text corpus, the time distribution map of the topic model and the keyword combination of the topic model, so as to draw an association relationship between the object and the attribute to construct a knowledge graph, namely, to construct an ontology structure of object, attribute and lower-level attribute, wherein the object refers to a topic covered in the WeChat articles;
wherein drawing the association relationship between the object and the attribute to construct the knowledge graph comprises: classifying the text corpus data by domain according to the topic model; extracting the top ten keywords of each WeChat article through a TF-IDF matrix; checking whether the top ten keywords of a WeChat article include the core keywords of a topic type and, if they do, assigning the WeChat article to the article list of that topic type; determining the object to which similar keywords belong, constructing the object from the keywords, and obtaining, according to the keywords, the article list covered by the object of the corresponding topic type in the text corpus; extracting, according to the article list covered by the object, sentences containing the object to form a sentence set; and aggregating the sentences, performing word segmentation and part-of-speech tagging, automatically screening out attributes, and repeatedly searching for lower-level attributes until none remain, thereby forming the knowledge graph, wherein the structure of the knowledge graph is object-attribute-lower-level attribute;
the step of analyzing the text corpus by using a preset tool to obtain a topic model of the text corpus, a time distribution map of the topic model and a keyword combination of the topic model comprises the following steps:
Inputting the text corpus into a three-layer Bayesian probability model to generate a topic model of the text corpus;
Generating a time distribution map of the topic model according to the time distribution of the text corpus;
Analyzing the text corpus by using a term frequency-inverse document frequency (TF-IDF) matrix to obtain a keyword co-occurrence map of the text corpus, and obtaining keyword combinations exceeding a preset frequency in the keyword co-occurrence map as keyword combinations of the topic model;
And acquiring and storing the topic model, the time distribution map of the topic model and the keyword combination of the topic model.
2. The method for constructing a knowledge graph according to claim 1, wherein before the step of crawling WeChat articles according to the article list to obtain text corpus required for constructing a knowledge graph, the method further comprises:
constructing a crawler program that includes a proxy Internet Protocol (IP) address pool and a cache data pool.
3. The knowledge-graph construction method according to claim 2, wherein after the step of constructing the crawler program including the proxy internet protocol address pool and the cache data pool, further comprising:
Updating the proxy internet protocol address of the proxy internet protocol address pool and the cache data in the cache data pool.
4. The method for constructing a knowledge graph according to claim 1, wherein the step of inputting the text corpus into a three-layer Bayesian probability model to generate a topic model of the text corpus comprises:
obtaining a trained three-layer Bayesian probability model;
and inputting the text corpus into the three-layer Bayesian probability model to generate a topic model of the text corpus.
5. The knowledge-graph construction method according to claim 1, wherein the step of analyzing the text corpus using a term frequency-inverse document frequency (TF-IDF) matrix to obtain a keyword co-occurrence map of the text corpus comprises:
Based on the text corpus, obtaining a preset number of keywords for each WeChat article through the TF-IDF matrix;
merging the keywords of all WeChat articles and removing duplicates to obtain a vocabulary of unique keywords;
and constructing a keyword co-occurrence matrix by using the keyword vocabulary to acquire a keyword co-occurrence map.
6. The method for constructing a knowledge graph according to claim 1, wherein the step of obtaining the object and the attribute of the object included in the text corpus according to the topic model of the text corpus, the time distribution graph of the topic model, and the keyword combination of the topic model includes:
Determining an object of the topic according to the topic model;
classifying the text corpus according to the topic model to obtain an article list under a corresponding topic;
extracting sentences containing the objects to form a sentence set according to the article list;
analyzing the sentence set to screen out the attribute of the sentence set.
7. A knowledge graph construction device, characterized by comprising:
the first acquisition unit is used for acquiring a WeChat public number list in a preset mode;
The second obtaining unit is used for accessing the official WeChat server interface according to the WeChat public number list and obtaining an article list of each WeChat public number in the WeChat public number list;
the crawling unit is used for crawling WeChat articles according to the article list to obtain text corpus required by building a knowledge graph;
The analysis unit is used for analyzing the text corpus by using a preset tool to obtain a topic model of the text corpus, a time distribution map of the topic model and a keyword combination of the topic model;
The third obtaining unit is used for obtaining an object and an attribute of the object contained in the text corpus according to the topic model of the text corpus, the time distribution map of the topic model and the keyword combination of the topic model, so as to draw an association relationship between the object and the attribute to construct a knowledge graph, namely, to construct an ontology structure of object, attribute and lower-level attribute, wherein the object refers to a topic covered in the WeChat articles;
The first construction unit is used for drawing the association relationship between the object and the attribute to construct the knowledge graph, which comprises: classifying the text corpus data by domain according to the topic model; extracting the top ten keywords of each WeChat article through a TF-IDF matrix; checking whether the top ten keywords of a WeChat article include the core keywords of a topic type and, if they do, assigning the WeChat article to the article list of that topic type; determining the object to which similar keywords belong, constructing the object from the keywords, and obtaining, according to the keywords, the article list covered by the object of the corresponding topic type in the text corpus; extracting, according to the article list covered by the object, sentences containing the object to form a sentence set; and aggregating the sentences, performing word segmentation and part-of-speech tagging, automatically screening out attributes, and repeatedly searching for lower-level attributes until none remain, thereby forming the knowledge graph, wherein the structure of the knowledge graph is object-attribute-lower-level attribute;
The first generation subunit is used for inputting the text corpus into a three-layer Bayesian probability model to generate a topic model of the text corpus;
the second generation subunit is used for generating a time distribution map of the topic model according to the time distribution of the text corpus;
The analysis subunit is used for analyzing the text corpus by using a term frequency-inverse document frequency (TF-IDF) matrix to obtain a keyword co-occurrence map of the text corpus, and obtaining a keyword combination exceeding a preset frequency in the keyword co-occurrence map as a keyword combination of the topic model;
The first acquisition subunit is used for acquiring and storing the topic model, the time distribution map of the topic model and the keyword combination of the topic model.
8. A computer device comprising a memory and a processor coupled to the memory; the memory is used for storing a computer program; the processor is configured to execute a computer program stored in the memory to perform the steps of the knowledge-graph construction method according to any one of claims 1-6.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of the knowledge-graph construction method according to any one of claims 1-6.
CN201811510375.7A 2018-12-11 2018-12-11 Knowledge graph construction method and device, computer equipment and storage medium Active CN109684483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811510375.7A CN109684483B (en) 2018-12-11 2018-12-11 Knowledge graph construction method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811510375.7A CN109684483B (en) 2018-12-11 2018-12-11 Knowledge graph construction method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109684483A CN109684483A (en) 2019-04-26
CN109684483B true CN109684483B (en) 2024-07-02

Family

ID=66186665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811510375.7A Active CN109684483B (en) 2018-12-11 2018-12-11 Knowledge graph construction method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109684483B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347894A (en) * 2019-05-31 2019-10-18 平安科技(深圳)有限公司 Knowledge mapping processing method, device, computer equipment and storage medium based on crawler
CN110377891B (en) * 2019-06-19 2023-01-06 北京百度网讯科技有限公司 Method, device and equipment for generating event analysis article and computer readable storage medium
CN110442713A (en) * 2019-07-08 2019-11-12 深圳壹账通智能科技有限公司 Abstract generation method, apparatus, computer equipment and storage medium
CN110543574B (en) * 2019-08-30 2022-05-17 北京百度网讯科技有限公司 Knowledge graph construction method, device, equipment and medium
US11397859B2 (en) * 2019-09-11 2022-07-26 International Business Machines Corporation Progressive collocation for real-time discourse
CN111090801B (en) * 2019-12-18 2023-06-09 创新奇智(青岛)科技有限公司 Expert human relation map drawing method and system
CN111353019A (en) * 2020-02-25 2020-06-30 上海昌投网络科技有限公司 WeChat public number topic classification method and device
CN111488741A (en) * 2020-04-14 2020-08-04 税友软件集团股份有限公司 Tax knowledge data semantic annotation method and related device
CN113569051A (en) * 2020-04-29 2021-10-29 北京金山数字娱乐科技有限公司 Knowledge graph construction method and device
CN111641621B (en) * 2020-05-21 2022-05-20 杭州安恒信息技术股份有限公司 Internet of things security event identification method and device and computer equipment
CN112100405B (en) * 2020-09-23 2024-01-30 中国农业大学 Veterinary drug residue knowledge graph construction method based on weighted LDA
CN112765367B (en) * 2021-01-28 2023-06-30 浙江富润数链科技有限公司 Method and device for constructing topic knowledge graph
CN113065657A (en) * 2021-04-09 2021-07-02 顶象科技有限公司 Knowledge graph construction method and device based on public data of bank
CN113297388B (en) * 2021-04-25 2023-08-11 中国人民解放军军事科学院战争研究院 Strategic event chained visualization method oriented to game analysis
CN114564636B (en) * 2021-12-29 2024-06-25 东方财富信息股份有限公司 Recall ordering algorithm and laminated technical architecture for financial information searching center

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122444A (en) * 2017-04-24 2017-09-01 北京科技大学 A kind of legal knowledge collection of illustrative plates method for auto constructing
CN107633044A (en) * 2017-09-14 2018-01-26 国家计算机网络与信息安全管理中心 A kind of public sentiment knowledge mapping construction method based on focus incident

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956052A (en) * 2016-04-27 2016-09-21 青岛海尔软件有限公司 Building method of knowledge map based on vertical field
WO2018045101A1 (en) * 2016-08-30 2018-03-08 Gluck Robert Francis Systems and methods for issue management
CN107168943B (en) * 2017-04-07 2018-07-03 平安科技(深圳)有限公司 The method and apparatus of topic early warning
CN108763333B (en) * 2018-05-11 2022-05-17 北京航空航天大学 Social media-based event map construction method
CN108897857B (en) * 2018-06-28 2021-08-27 东华大学 Chinese text subject sentence generating method facing field

Also Published As

Publication number Publication date
CN109684483A (en) 2019-04-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant