CN112395391A - Concept graph construction method and device, computer equipment and storage medium

Concept graph construction method and device, computer equipment and storage medium

Info

Publication number
CN112395391A
Authority
CN
China
Prior art keywords
data
candidate
concept
preset
concept data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011288931.8A
Other languages
Chinese (zh)
Other versions
CN112395391B (en)
Inventor
白祚
董光喆
孙梓淇
莫洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202011288931.8A
Publication of CN112395391A
Application granted
Publication of CN112395391B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application belongs to the technical field of big data and relates to a concept graph construction method and related equipment that can be applied in the field of intelligent education. The method comprises the following steps: acquiring text data and performing phrase extraction on the text data to obtain first candidate concept data, second candidate concept data and third candidate concept data; combining the first, second and third candidate concept data into a candidate data set, scoring the candidate concept data in the candidate data set through a preset scoring model, and determining the candidate concept data whose score is greater than or equal to a preset threshold as preferred concept data; and matching the preferred concept data with stored preset knowledge, determining the preset knowledge successfully matched with the preferred concept data as preferred knowledge, and storing the preferred concept data and the preferred knowledge in association. In addition, the application relates to blockchain technology: the preferred concept data may be stored in a blockchain. The method and the device realize adaptive construction of a concept graph.

Description

Concept graph construction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of big data technologies, and in particular, to a method and an apparatus for constructing a concept graph, a computer device, and a storage medium.
Background
Currently, people tend to perceive things in terms of concepts, whether in conversation, in reading an article, or in describing an event. An ideal concept should describe a class of events accurately while retaining enough generality to convey broader information beyond the events themselves. The appropriate level of a concept hierarchy differs across applications and scenarios; for search applications, for example, the concept hierarchy suitable for an e-commerce system may be completely different from the one suitable for searching medical literature.
Existing concept graphs are often too general or too formal to support different concept associations for different scenarios. For example, some concept graphs are constructed by mining very formal content, such as Wikipedia pages. Such content differs from the scenarios that real applications face: in an intelligent dialog system, the user's language is colloquial; in insurance vertical search, user input is highly specific to the insurance industry. Existing concept graphs therefore cannot supply suitable, customized knowledge for different applications, which limits the applications' level of intelligent cognition and prevents them from offering users the best experience.
Disclosure of Invention
An embodiment of the present application aims to provide a concept graph construction method, an apparatus, a computer device, and a storage medium, to solve the technical problem that conventional concept graphs cannot supply customized knowledge.
In order to solve the above technical problem, an embodiment of the present application provides a concept graph construction method, which adopts the following technical solutions:
a concept graph construction method comprises the following steps:
acquiring text data, performing phrase extraction on the text data based on a basic template of a preset template library to obtain first candidate concept data, performing phrase extraction on the text data based on a preset phrase extraction algorithm to obtain second candidate concept data, and performing phrase extraction on the text data based on a preset language model to obtain third candidate concept data;
combining the first candidate concept data, the second candidate concept data and the third candidate concept data into a candidate data set, scoring the candidate concept data in the candidate data set through a preset scoring model, and determining the candidate concept data with the score larger than or equal to a preset threshold value as preferred concept data;
and matching the preferred concept data with stored preset knowledge one by one, determining the preset knowledge successfully matched with the preferred concept data as preferred knowledge, and storing the preferred concept data and the preferred knowledge in association.
Further, before the step of performing phrase extraction on the text data based on the preset language model to obtain third candidate concept data, the method includes:
acquiring preset training data, and training a basic language model based on the preset training data, wherein the basic language model comprises a pre-training model and a conditional random field model;
acquiring the learning rate of the basic language model, and determining whether the convergence values of a pre-training model and a conditional random field model in the basic language model are corresponding optimal values under the learning rate;
and when the convergence value of any one of the pre-training model and the conditional random field model is a non-corresponding optimal value, adjusting the learning rate until the convergence values of the pre-training model and the conditional random field model reach the corresponding optimal value, and determining the basic language model as a preset language model.
Further, the step of scoring the candidate concept data in the candidate data set through a preset scoring model includes:
obtaining the semantic aggregation degree, length, part-of-speech tagging result and language model features of the candidate concept data in the candidate data set;
and inputting the length, the part-of-speech tagging result, the semantic aggregation degree and the language model features into a preset scoring model, and calculating the score corresponding to the candidate concept data.
Further, the step of obtaining the semantic aggregation degree of the candidate concept data in the candidate data set includes:
obtaining large-scale corpus data, and determining the articles comprising the candidate concept data based on the large-scale corpus data;
obtaining the categories corresponding to all the articles, counting the number of times the candidate concept data appears in each category and the total number of times it appears across all the categories, and taking the maximum ratio of a per-category count to the total count as the semantic aggregation degree of the candidate concept data.
Further, the step of matching the preferred concept data with the stored preset knowledge one by one includes:
calculating the glyph (surface-form) similarity between the preferred concept data and the preset knowledge, taking the preferred concept data whose glyph similarity is greater than or equal to a first matching degree as primary concept data, and determining the preset knowledge corresponding to the primary concept data as the preferred knowledge;
and taking the preferred concept data other than the primary concept data as secondary concept data, calculating the semantic similarity between the secondary concept data and the preset knowledge, and determining the preset knowledge whose semantic similarity is greater than or equal to a second matching degree as the preferred knowledge.
Further, after the step of performing phrase extraction on the text data based on the basic template of the preset template library to obtain first candidate concept data, the method includes:
detecting whether the first candidate concept data exists in a preset concept library, and determining the first candidate concept data as new concept data when the first candidate concept data does not exist in the concept library;
generating a candidate template corresponding to the new concept data based on a preset template generation system and the text data;
and performing quality detection on the candidate template, and saving the candidate template as a basic template in a preset template library when the quality detection of the candidate template passes.
Further, the step of performing quality detection on the candidate template includes:
performing concept extraction on verification data based on the candidate template to obtain a first quantity of extracted new concept data and a second quantity of extracted old concept data;
and calculating the ratio of the first quantity to the second quantity, and determining that the quality detection of the candidate template is passed when the ratio is greater than or equal to a first preset threshold and the second quantity is greater than or equal to a second preset threshold.
In order to solve the above technical problem, an embodiment of the present application further provides a concept graph constructing apparatus, which adopts the following technical solutions:
the acquisition module is used for acquiring text data, performing phrase extraction on the text data based on a basic template of a preset template library to obtain first candidate concept data, performing phrase extraction on the text data based on a preset phrase extraction algorithm to obtain second candidate concept data, and performing phrase extraction on the text data based on a preset language model to obtain third candidate concept data;
the combination module is used for combining the first candidate concept data, the second candidate concept data and the third candidate concept data into a candidate data set, scoring the candidate concept data in the candidate data set through a preset scoring model, and determining the candidate concept data with the score larger than or equal to a preset threshold value as preferred concept data;
and the matching module is used for matching the preferred concept data with stored preset knowledge one by one, determining the preset knowledge successfully matched with the preferred concept data as preferred knowledge, and storing the preferred concept data and the preferred knowledge in association.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the concept graph construction method when executing the computer program.
In order to solve the technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the concept graph construction method.
In the concept graph construction method, text data is acquired, and phrase extraction is performed on the text data based on a basic template of a preset template library to obtain first candidate concept data, based on a preset phrase extraction algorithm to obtain second candidate concept data, and based on a preset language model to obtain third candidate concept data. This extracts concepts from the text data from different angles. The first, second and third candidate concept data are then combined into a candidate data set, so that the candidate data set covers a wide range of candidate concepts, broadening the screening scope and avoiding the omission of candidate concept data. The candidate concept data in the candidate data set is then scored through a preset scoring model, and candidate concept data whose score is greater than or equal to a preset threshold is determined to be preferred concept data. Finally, the preferred concept data is matched one by one with stored preset knowledge, the preset knowledge successfully matched with the preferred concept data is determined to be preferred knowledge, and the preferred concept data and the preferred knowledge are stored in association. The concept graph is thus constructed adaptively: an applicable concept graph can be built for different applications, supplying a concept system and knowledge at the most appropriate granularity, endowing applications with more intelligent cognitive services, and further improving the accuracy of the knowledge content provided for different application scenarios.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a schematic flow diagram of one embodiment of a concept graph construction method;
FIG. 3 is a schematic block diagram of one embodiment of a concept graph construction apparatus according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Reference numerals: concept graph construction apparatus 500, acquisition module 501, combination module 502, and matching module 503.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III, mpeg compression standard Audio Layer 3), MP4 players (Moving Picture Experts Group Audio Layer IV, mpeg compression standard Audio Layer 4), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the concept graph construction method provided in the embodiments of the present application is generally executed by a server/terminal, and accordingly, the concept graph construction apparatus is generally disposed in the server/terminal device.
It should be understood that the number of terminals, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a concept graph construction method according to the present application is shown. The concept graph construction method comprises the following steps:
step S201, collecting text data, performing phrase extraction on the text data based on a basic template of a preset template library to obtain first candidate concept data, performing phrase extraction on the text data based on a preset phrase extraction algorithm to obtain second candidate concept data, and performing phrase extraction on the text data based on a preset language model to obtain third candidate concept data;
in this embodiment, text data is collected, where the text data is data of a text class corresponding to a current application scenario; and when the text data is acquired, extracting the text data based on a basic template in a preset template library to obtain corresponding first concept data. Specifically, the preset template library is a template repository in which a plurality of basic templates are stored, and concepts in the text data can be extracted according to the basic templates, so that corresponding first concept data is obtained. The basic template is a preset relation template, trigger words with different relations are defined in the basic template, when the trigger words exist in the text data, the relation corresponding to the trigger words is determined based on the basic template, and then the participants of the relation are identified and determined through the named entities. Therefore, the first concept data corresponding to the current text data can be extracted and obtained.
When phrase extraction is performed on the text data based on the preset phrase extraction algorithm, the preset phrase extraction algorithm is a higher-quality extraction method than template extraction, for example AutoPhrase, a high-quality phrase mining algorithm that can extract phrases from the text data under the guidance of a knowledge graph and word segmentation to obtain a plurality of phrase data. The extracted phrases are combined, either randomly or according to a preset combination scheme, to obtain the second candidate concept data corresponding to the current text data.
When phrase extraction is performed on the text data based on the preset language model, the preset language model adopts BERT (Bidirectional Encoder Representations from Transformers); phrases can be extracted from the text data automatically by combining BERT with a conditional random field (CRF), and the extracted data is the third candidate concept data.
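What the extraction step consumes from a BERT+CRF tagger is its BIO tag sequence. The following sketch decodes such a sequence into candidate phrases; the tagger itself is omitted, and the example tokens and tags are supplied by hand as assumptions:

```python
def decode_bio(tokens: list, tags: list) -> list:
    """Turn a BIO tag sequence (as a BERT+CRF tagger would emit) into
    candidate phrases: 'B' begins a phrase, 'I' continues it, 'O' is outside."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                      # begin a new candidate phrase
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:        # continue the current phrase
            current.append(token)
        else:                               # 'O' (or a stray 'I'): close any open phrase
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:                             # flush a phrase that ends the sentence
        phrases.append(" ".join(current))
    return phrases

tokens = ["whole", "life", "insurance", "pays", "a", "death", "benefit"]
tags   = ["B",     "I",    "I",         "O",    "O", "B",     "I"]
print(decode_bio(tokens, tags))  # ['whole life insurance', 'death benefit']
```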
Step S202, combining the first candidate concept data, the second candidate concept data and the third candidate concept data into a candidate data set, scoring the candidate concept data in the candidate data set through a preset scoring model, and determining the candidate concept data with the score larger than or equal to a preset threshold value as preferred concept data;
in this embodiment, when the first candidate concept data, the second candidate concept data, and the third candidate concept data are obtained, the first candidate concept data, the second candidate concept data, and the third candidate concept data are combined into the candidate data set. And performing score screening on the candidate concept data in the candidate data set based on a preset score model. The preset scoring model is a preset concept scoring model, and corresponding scoring results can be obtained by inputting each candidate concept data in the candidate concept set into the preset scoring model. Determining whether the current candidate concept data is the optimal concept data according to the scoring result, and determining the candidate concept data as the optimal concept data if the scoring of the candidate concept data is greater than or equal to a preset threshold value; and if the score of the candidate concept data is smaller than a preset threshold value, determining that the candidate concept data is not the preferred concept data.
Step S203, matching the preferred concept data with stored preset knowledge one by one, determining the preset knowledge successfully matched with the preferred concept data as preferred knowledge, and storing the preferred concept data and the preferred knowledge in association.
In this embodiment, after the preferred concept data is obtained, all of it is matched one by one against all the stored preset knowledge, where the matching includes exact matching and fuzzy matching. Exact matching identifies the preferred concept data that matches the stored preset knowledge exactly; that data is set aside, and fuzzy matching is performed on the remaining preferred concept data. Fuzzy matching computes the matching degree between the remaining preferred concept data and the preset knowledge; if the matching degree is greater than or equal to a preset matching degree, the remaining preferred concept data is determined to have been successfully matched. The exactly matched preferred concept data is stored in association with its corresponding preset knowledge, as is the successfully fuzzy-matched preferred concept data. Consequently, when the preferred concept data is used for semantic recall, the matched preferred knowledge can be retrieved accurately.
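The two-pass flow could be sketched as follows, with exact matches resolved by set lookup and the remainder by a fuzzy similarity function; difflib's ratio is used here as a stand-in for the glyph and semantic similarity measures detailed later in this description:

```python
import difflib

def match_and_store(preferred_concepts: list, preset_knowledge: set,
                    min_degree: float) -> dict:
    """Pass 1: exact matching against stored preset knowledge. Pass 2: fuzzy
    matching for the remainder; a pair is kept only when its matching degree
    reaches the preset matching degree. Matched pairs are stored in association."""
    associations = {}                        # preferred concept -> preferred knowledge
    remaining = []
    for concept in preferred_concepts:
        if concept in preset_knowledge:      # exact match
            associations[concept] = concept
        else:
            remaining.append(concept)
    for concept in remaining:                # fuzzy match the rest
        def degree(knowledge):
            return difflib.SequenceMatcher(None, concept, knowledge).ratio()
        best = max(preset_knowledge, key=degree)
        if degree(best) >= min_degree:
            associations[concept] = best
    return associations

print(match_and_store(["life insurance", "life insurence", "tennis"],
                      {"life insurance", "annuity"}, min_degree=0.8))
# {'life insurance': 'life insurance', 'life insurence': 'life insurance'}
```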
It is emphasized that the preferred concept data may also be stored in a node of a blockchain in order to further ensure privacy and security of the preferred concept data.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
This embodiment realizes the adaptive construction of the concept graph, so that an applicable concept graph can be constructed for different applications, providing a concept system and knowledge at the most appropriate granularity, endowing applications with more intelligent cognitive services, and further improving the accuracy of the knowledge content provided for different application scenarios.
In some embodiments of the present application, before performing phrase extraction on the text data based on the preset language model to obtain third candidate concept data, the method includes:
acquiring preset training data, and training a basic language model based on the preset training data, wherein the basic language model comprises a pre-training model and a conditional random field model;
acquiring the learning rate of the basic language model, and determining whether the convergence values of a pre-training model and a conditional random field model in the basic language model are corresponding optimal values under the learning rate;
and when the convergence value of any one of the pre-training model and the conditional random field model is a non-corresponding optimal value, adjusting the learning rate until the convergence values of the pre-training model and the conditional random field model reach the corresponding optimal value, and determining the basic language model as a preset language model.
In this embodiment, the learning rate, an important hyperparameter in supervised learning and deep learning, determines whether and when the objective function converges to a local minimum; an appropriate learning rate makes the objective function converge to a local minimum in an appropriate time. When the basic language model is trained, the pre-training model, i.e. BERT (Bidirectional Encoder Representations from Transformers), has already been pre-trained on very large-scale corpora, so when it is fine-tuned on the corpus of a specific application, the reasonable learning rate is generally very small, between 1e-5 and 1e-4; at this rate the BERT parameters fit and converge quickly. However, at this learning rate the conditional random field (CRF) model in the basic language model is not trained sufficiently. Therefore, when the basic language model is trained on the preset training data, the learning rate of the basic language model is obtained, and it is determined whether the convergence values of the pre-training model and the CRF in the basic language model at the current learning rate are their respective optimal values; the preset training data is a plurality of pieces of text data. When the convergence value of either the pre-training model or the CRF is not its optimal value, the learning rate is adjusted until the convergence values of both reach their optimal values, at which point the basic language model is determined to be the preset language model. For example, when the convergence value of the pre-training model reaches its optimal value but the CRF has not been trained sufficiently, the learning rate of the CRF is dynamically increased so that both BERT and the CRF can be trained sufficiently and fit-converge to the optimal parameters; experiments show that the learning rate of the conditional random field model is generally about 100 times that of BERT. When the convergence values of the pre-training model and the CRF both reach their optimal values, the basic language model at that moment is the preset language model.
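In PyTorch terms, training BERT and the CRF at different rates amounts to giving the optimizer two parameter groups. The sketch below uses small linear layers as stand-ins for the two sub-models; the concrete rates (2e-5, and 2e-3 as the roughly 100-fold larger CRF rate) are illustrative values consistent with the ranges mentioned above:

```python
import torch
from torch import nn

# Stand-ins for the two sub-models of the basic language model: in practice
# `encoder` would be a pretrained BERT and `crf` a conditional random field layer.
encoder = nn.Linear(768, 768)   # placeholder for the pre-trained BERT parameters
crf = nn.Linear(768, 5)         # placeholder for the CRF emission/transition parameters

# Two parameter groups with different learning rates: a small rate keeps the
# already-converged pretrained encoder stable, while a roughly 100x larger rate
# lets the randomly initialised CRF reach its own optimum in the same schedule.
optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 2e-5},
    {"params": crf.parameters(),     "lr": 2e-3},
])
```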
By adjusting the learning rate, this embodiment yields a better-trained preset language model, so the results of phrase extraction from the text data through the preset language model are more accurate.
In some embodiments of the present application, the scoring the candidate concept data in the candidate data set by using a preset scoring model includes:
obtaining the semantic aggregation degree, length, part-of-speech tagging result and language model features of the candidate concept data in the candidate data set;
and inputting the length, the part-of-speech tagging result, the semantic aggregation degree and the language model features into a preset scoring model, and calculating the score corresponding to the candidate concept data.
In this embodiment, when the candidate concept data in the candidate data set is scored through the preset scoring model, the candidate concept data may be measured along four dimensions: length, part-of-speech tagging result, semantic aggregation degree, and language model features. The length is the text length of the candidate concept data; the part-of-speech tagging result is the part-of-speech mark determined for each word from its context, such as verb, adverb, or noun; the semantic aggregation degree measures the expressive focus of the candidate combination, i.e. whether the content expressed by the phrase is broad or focused; the language model features represent the linguistic plausibility of the candidate concept data. The length, part-of-speech tagging result, semantic aggregation degree and language model features of each candidate concept data are input into the preset scoring model to obtain the corresponding total score. The preset scoring model is trained in advance. Specifically, multiple sets of scoring training data are collected, together with their lengths, part-of-speech tagging results, semantic aggregation degrees, language model features, and corresponding quality grades; the scoring model is trained on these features, the corresponding scores are computed, and the scores are partitioned according to the quality grades of the training data. Thus, when a score is calculated with the preset scoring model, the quality grade corresponding to the candidate concept data can also be determined from the score.
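As a sketch of how the four dimensions could be combined, the weighted sum below stands in for the trained scoring model; the weights and feature values are assumed for demonstration:

```python
def score_candidate(features: dict, weights: dict) -> float:
    """Combine the four measuring dimensions into a single score. The patent
    trains its scoring model on graded examples; a weighted sum with assumed
    weights stands in for that trained model here."""
    return sum(weights[name] * features[name] for name in weights)

weights = {"length": 0.1, "pos": 0.2, "cohesion": 0.5, "lm": 0.2}   # assumed weights
features = {
    "length":   0.6,   # normalised text length of the candidate
    "pos":      1.0,   # e.g. 1.0 when the POS pattern forms a noun phrase
    "cohesion": 0.8,   # semantic aggregation degree (see the next section)
    "lm":       0.7,   # language-model plausibility of the phrase
}
print(round(score_candidate(features, weights), 2))  # 0.8
```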
In this embodiment, calculating scores for the candidate concept data allows the candidates to be measured accurately by score, so that the concept data that best meets the requirements is screened out.
In some embodiments of the present application, the obtaining of the semantic aggregation degree of the candidate concept data in the candidate data set includes:
obtaining large-scale corpus data, and determining the articles comprising the candidate concept data based on the large-scale corpus data;
obtaining the categories corresponding to all the articles, counting the number of times the candidate concept data appears in each category and the total number of times it appears across all the categories, and taking the maximum ratio of a per-category count to the total count as the semantic aggregation degree of the candidate concept data.
In this embodiment, the semantic aggregation degree measures how concentrated the usage of a candidate combination is. The same word may be used in multiple scenarios, and some words appear in certain scenarios at a markedly higher rate than in others, so a word whose usage concentrates in one scenario category has a higher semantic aggregation degree than a word whose usage is dispersed across categories. To obtain the semantic aggregation degree of candidate concept data, large-scale corpus data is first acquired, the articles containing the candidate concept data are found in it, and the categories of all those articles are determined, such as sports, politics, entertainment, and so on. The number of occurrences of the candidate concept data under each category and its total number of occurrences across all categories are counted, and the ratio of the per-category count to the total count is computed for each category. A higher ratio under some category indicates a higher semantic aggregation of the candidate concept data in that category, so the maximum such ratio is taken as the semantic aggregation degree of the candidate concept data.
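A minimal sketch of this computation, simplifying "occurrences" to one count per containing article; the corpus contents are invented for demonstration:

```python
from collections import Counter

def semantic_aggregation(candidate: str, corpus: list) -> float:
    """corpus: (article_text, category) pairs. Count the candidate's
    occurrences per category (here simplified to one count per containing
    article); the aggregation degree is the largest per-category share of
    the total, so a term concentrated in one category scores close to 1.0."""
    per_category = Counter()
    for text, category in corpus:
        if candidate in text:
            per_category[category] += 1
    total = sum(per_category.values())
    return max(per_category.values()) / total if total else 0.0

corpus = [
    ("the premium waiver rider takes effect when ...", "insurance"),
    ("a premium waiver clause protects the insured ...", "insurance"),
    ("fans paid a premium for seats at the final ...", "sports"),
]
print(semantic_aggregation("premium waiver", corpus))  # 1.0: insurance only
```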
By calculating the semantic aggregation degree, this embodiment makes the score computed from it more accurate, improving the accuracy of screening candidate concept data by score.
In some embodiments of the present application, the matching the preferred concept data with the stored preset knowledge one by one includes:
calculating the glyph similarity between the preferred concept data and the preset knowledge, taking the preferred concept data whose glyph similarity is greater than or equal to a first matching degree as primary concept data, and determining the preset knowledge corresponding to the primary concept data as the preferred knowledge;
and taking the preferred concept data other than the primary concept data as secondary concept data, calculating the semantic similarity between the secondary concept data and the preset knowledge, and determining the preset knowledge whose semantic similarity is greater than or equal to a second matching degree as the preferred knowledge.
In this embodiment, when the preferred concept data is matched with the stored preset knowledge, the glyph similarity between the preferred concept data and the preset knowledge is computed; the glyph similarity is the exact, surface-form similarity between the two. Among all the preferred concept data, the data whose glyph similarity to the preset knowledge is greater than or equal to the first matching degree is taken as primary concept data, and the preset knowledge matched with the primary concept data is the preferred knowledge. The preferred concept data other than the primary concept data is taken as secondary concept data, and the semantic similarity between the secondary concept data and the preset knowledge is computed. If the semantic similarity is greater than or equal to the second matching degree, the preset knowledge corresponding to the secondary concept data is determined to be preferred knowledge.
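The two similarity measures could be sketched as follows; difflib's character-level ratio stands in for the glyph similarity, and the embedding function passed to the semantic measure is an assumed dependency, not something the patent names:

```python
import difflib

def glyph_similarity(a: str, b: str) -> float:
    """Surface-form similarity between two strings; difflib's ratio stands in
    for whatever character-level measure an implementation chooses."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def semantic_similarity(a: str, b: str, embed) -> float:
    """Cosine similarity between phrase embeddings; `embed` (e.g. a sentence
    encoder returning a vector) is an assumed dependency."""
    va, vb = embed(a), embed(b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm = (sum(x * x for x in va) ** 0.5) * (sum(y * y for y in vb) ** 0.5)
    return dot / norm if norm else 0.0

print(glyph_similarity("life insurance", "life insurence"))          # ~0.93
print(semantic_similarity("a", "b", embed=lambda s: [len(s), 1.0]))  # toy embedding
```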
This embodiment matches the concept data with the preset knowledge through glyph similarity and semantic similarity, improving matching precision and enabling the preferred knowledge to be obtained accurately.
In some embodiments of the present application, after performing phrase extraction on the text data based on the basic template of the preset template library to obtain first candidate concept data, the method includes:
detecting whether the first candidate concept data exists in a preset concept library, and determining the first candidate concept data as new concept data when the first candidate concept data does not exist in the concept library;
generating a candidate template corresponding to the new concept data based on a preset template generation system and the text data;
and performing quality detection on the candidate template, and saving the candidate template as a basic template in a preset template library when the quality detection of the candidate template passes.
In this embodiment, after the concept data is obtained, the current preset template library may be updated through the preset template generation system. Specifically, when first candidate concept data is acquired, whether it exists in a preset concept library is checked; if not, the current first candidate concept data is determined to be new concept data. A candidate template corresponding to the new concept data is then generated based on the preset template generation system and the text data corresponding to the new concept data. Once the candidate template is obtained, its quality is tested: the candidate template is applied to extract concepts from verification data, the number of new concept data obtained (i.e. concept data not present in the current concept library) is determined, and if that number exceeds a preset number, the candidate template passes the quality test. When the candidate template passes, it is retained as a basic template in the preset template library.
In this embodiment, testing the quality of candidate templates enables new basic templates to be created and added to the preset template library, improving the accuracy of phrase extraction from text data through the basic templates.
In some embodiments of the present application, the performing quality detection on the candidate template includes:
performing concept extraction on verification data based on the candidate template to obtain a first quantity of extracted new concept data and a second quantity of extracted old concept data;
and calculating the ratio of the first quantity to the second quantity, and determining that the quality detection of the candidate template is passed when the ratio is greater than or equal to a first preset threshold and the second quantity is greater than or equal to a second preset threshold.
In this embodiment, when the quality of the candidate template is tested, concepts may be extracted from verification data through the candidate template, and whether the candidate template passes is decided according to the ratio of the number of new concept data to the number of old concept data obtained by the extraction. The verification data is preset text data used for verification; new concept data is concept data not present in the current concept library, and old concept data is data matching concept data already in the current concept library. Specifically, concepts are extracted from the verification data based on the candidate template to obtain a concept data set, and the numbers of new and old concept data in that set are taken as the first quantity and the second quantity respectively. The ratio of the first quantity to the second quantity is computed and used to judge the quality of the current candidate template: if the ratio is greater than or equal to a first preset threshold and the second quantity is greater than or equal to a second preset threshold, the candidate template passes the quality test and is retained as a basic template in the preset template library; if the ratio is smaller than the first preset threshold, the current candidate template largely duplicates the existing basic templates and is discarded.
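A sketch of the pass/fail decision; the threshold values are illustrative, as the patent does not fix them:

```python
def template_passes(first_quantity: int, second_quantity: int,
                    first_threshold: float, second_threshold: int) -> bool:
    """A candidate template passes quality detection when the ratio of new to
    old concept data is high enough (it discovers genuinely new concepts) and
    the old-concept count is itself high enough (it still reproduces known
    concepts, so it is not merely extracting noise)."""
    if second_quantity == 0:
        return False    # nothing known was reproduced; the ratio is undefined
    ratio = first_quantity / second_quantity
    return ratio >= first_threshold and second_quantity >= second_threshold

# Threshold values below are assumed for demonstration.
print(template_passes(first_quantity=12, second_quantity=30,
                      first_threshold=0.2, second_threshold=10))   # True
```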
This embodiment achieves accurate testing and screening of candidate templates through the ratio of new to old concept data, improving the efficiency of candidate template testing.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program, which can be stored in a computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or a volatile storage medium such as a Random Access Memory (RAM).
It should be understood that although the steps in the flowcharts of the figures are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict order restriction on these steps, and they may be performed in other orders. Moreover, at least some of the steps in the flowchart may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and whose order of execution is not necessarily sequential; they may be performed in turns or alternately with other steps or with at least part of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a concept graph construction apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 3, the concept graph construction apparatus 500 according to this embodiment comprises: an acquisition module 501, a combination module 502 and a matching module 503. Wherein:
the acquisition module 501 is configured to acquire text data, perform phrase extraction on the text data based on a basic template of a preset template library to obtain first candidate concept data, perform phrase extraction on the text data based on a preset phrase extraction algorithm to obtain second candidate concept data, and perform phrase extraction on the text data based on a preset language model to obtain third candidate concept data;
in this embodiment, text data is collected, where the text data is data of a text class corresponding to a current application scenario; and when the text data is acquired, extracting the text data based on a basic template in a preset template library to obtain corresponding first concept data. Specifically, the preset template library is a template repository in which a plurality of basic templates are stored, and concepts in the text data can be extracted according to the basic templates, so that corresponding first concept data is obtained. The basic template is a preset relation template, trigger words with different relations are defined in the basic template, when the trigger words exist in the text data, the relation corresponding to the trigger words is determined based on the basic template, and then the participants of the relation are identified and determined through the named entities. Therefore, the first concept data corresponding to the current text data can be extracted and obtained.
When phrase extraction is performed on text data based on a preset phrase extraction algorithm, the preset phrase extraction algorithm is a higher-quality phrase extraction mode compared with template extraction, such as AutoPhrase. The AutoPhrase is a high-quality phrase extraction algorithm, and can automatically cut and extract phrases of text data through a knowledge map and word segmentation guidance to obtain a plurality of phrase data. And combining the extracted phrase combinations in a random or preset combination mode to obtain second concept data corresponding to the current text data.
When phrase extraction is performed on text data based on a preset language model, the preset language model adopts Bert (Bidirectional Encoder retrieval from transforms), text data can be automatically phrase extracted by combining a BERT algorithm and CRF, and the extracted data is third concept data.
A combination module 502, configured to combine the first candidate concept data, the second candidate concept data, and the third candidate concept data into a candidate data set, score the candidate concept data in the candidate data set through a preset scoring model, and determine candidate concept data with a score greater than or equal to a preset threshold as preferred concept data;
wherein the combination module includes:
the acquisition unit is used for acquiring the semantic aggregation degree, length, part-of-speech tagging result and language model features of the candidate concept data in the candidate data set;
and the first calculation unit is used for inputting the length, the part-of-speech tagging result, the semantic aggregation degree and the language model features into a preset scoring model and calculating the score corresponding to the candidate concept data.
Wherein the acquisition unit includes:
the confirmation subunit is used for acquiring large-scale corpus data and determining the articles comprising the candidate concept data based on the large-scale corpus data;
and the calculation subunit is used for acquiring the categories corresponding to all the articles, counting the number of times the candidate concept data appears in each category and the total number of times it appears across all the categories, and taking the maximum ratio of a per-category count to the total count as the semantic aggregation degree of the candidate concept data.
In this embodiment, when the first candidate concept data, the second candidate concept data, and the third candidate concept data are obtained, they are combined into the candidate data set. The candidate concept data in the candidate data set is then screened by score based on a preset scoring model. The preset scoring model is a pre-established concept scoring model; inputting each piece of candidate concept data in the candidate data set into the preset scoring model yields a corresponding score. Whether the current candidate concept data is preferred concept data is determined from this score: if the score of the candidate concept data is greater than or equal to a preset threshold, the candidate concept data is determined to be preferred concept data; if the score is below the preset threshold, it is not.
A matching module 503, configured to match the preferred concept data with stored preset knowledge one by one, determine the preset knowledge successfully matched with the preferred concept data as preferred knowledge, and store the preferred concept data and the preferred knowledge in association.
Wherein the matching module comprises:
the second calculation unit is used for calculating the glyph similarity between the preferred concept data and the preset knowledge, taking the preferred concept data whose glyph similarity is greater than or equal to the first matching degree as primary concept data, and determining the preset knowledge corresponding to the primary concept data as the preferred knowledge;
and the third calculation unit is used for taking the preferred concept data other than the primary concept data as secondary concept data, calculating the semantic similarity between the secondary concept data and the preset knowledge, and determining the preset knowledge whose semantic similarity is greater than or equal to a second matching degree as the preferred knowledge.
In this embodiment, after the preferred concept data is obtained, all of it is matched one by one against all the stored preset knowledge, where the matching includes exact matching and fuzzy matching. Exact matching identifies the preferred concept data that matches the stored preset knowledge exactly; that data is set aside, and fuzzy matching is performed on the remaining preferred concept data. Fuzzy matching computes the matching degree between the remaining preferred concept data and the preset knowledge; if the matching degree is greater than or equal to a preset matching degree, the remaining preferred concept data is determined to have been successfully matched. The exactly matched preferred concept data is stored in association with its corresponding preset knowledge, as is the successfully fuzzy-matched preferred concept data. Consequently, when the preferred concept data is used for semantic recall, the matched preferred knowledge can be retrieved accurately.
It is emphasized that the preferred concept data may also be stored in a node of a blockchain in order to further ensure privacy and security of the preferred concept data.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The concept graph construction device provided by the embodiment further comprises:
the first training module is used for acquiring preset training data and training a basic language model based on the preset training data, wherein the basic language model comprises a pre-training model and a conditional random field model;
the second training module is used for acquiring the learning rate of the basic language model and determining whether the convergence values of a pre-training model and a conditional random field model in the basic language model are corresponding optimal values under the learning rate;
and the confirming module is used for adjusting the learning rate when the convergence value of any one of the pre-training model and the conditional random field model is a non-corresponding optimal value until the convergence values of the pre-training model and the conditional random field model reach the corresponding optimal value, and determining the basic language model as a preset language model.
In this embodiment, the learning rate, an important hyperparameter in supervised learning and deep learning, determines whether and when the objective function converges to a local minimum; an appropriate learning rate makes the objective function converge to a local minimum in an appropriate time. When the basic language model is trained, the pre-training model, i.e. BERT (Bidirectional Encoder Representations from Transformers), has already been pre-trained on very large-scale corpora, so when it is fine-tuned on the corpus of a specific application, the reasonable learning rate is generally very small, between 1e-5 and 1e-4; at this rate the BERT parameters fit and converge quickly. However, at this learning rate the conditional random field (CRF) model in the basic language model is not trained sufficiently. Therefore, when the basic language model is trained on the preset training data, the learning rate of the basic language model is obtained, and it is determined whether the convergence values of the pre-training model and the CRF in the basic language model at the current learning rate are their respective optimal values; the preset training data is a plurality of pieces of text data. When the convergence value of either the pre-training model or the CRF is not its optimal value, the learning rate is adjusted until the convergence values of both reach their optimal values, at which point the basic language model is determined to be the preset language model. For example, when the convergence value of the pre-training model reaches its optimal value but the CRF has not been trained sufficiently, the learning rate of the CRF is dynamically increased so that both BERT and the CRF can be trained sufficiently and fit-converge to the optimal parameters; experiments show that the learning rate of the conditional random field model is generally about 100 times that of BERT. When the convergence values of the pre-training model and the CRF both reach their optimal values, the basic language model at that moment is the preset language model.
The first detection module is used for detecting whether the first candidate concept data exists in a preset concept library or not, and when the first candidate concept data does not exist in the concept library, determining the first candidate concept data as new concept data;
the generating module is used for generating a candidate template corresponding to the new concept data based on a preset template generating system and the text data;
and the second detection module is used for carrying out quality detection on the candidate template, and storing the candidate template as a basic template in a preset template library when the quality detection of the candidate template passes.
In this embodiment, after the concept data is obtained, the current preset template library may be updated through the preset template generation system. Specifically, when first candidate concept data is acquired, whether the first candidate concept data exists in a preset concept library is detected; if not, the current first candidate concept data is determined to be new concept data. A candidate template corresponding to the new concept data is then generated based on the preset template generation system and the text data corresponding to the new concept data. When the candidate template is obtained, quality detection is performed on it. Specifically, the quality detection may use the candidate template to perform concept extraction on verification data, determine the number of new concept data (i.e., concept data that do not exist in the current concept library) obtained by the extraction, and determine that the candidate template passes the quality detection if that number is greater than a preset number. When the candidate template passes the quality detection, it is retained as a basic template in the preset template library.
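By way of illustration only, the quality detection just described can be sketched as follows; modeling a template as a regular expression and the particular pass threshold are assumptions made for the sketch, since the template formalism is not fixed here.

```python
import re

def extract_concepts(template_pattern: str, text: str) -> set:
    """Hypothetical extractor: the template is modeled as a regex whose
    matches yield candidate concept phrases."""
    return set(re.findall(template_pattern, text))

def passes_quality_detection(template_pattern, verification_texts,
                             concept_library, preset_number=5):
    """Pass the candidate template if it extracts more than a preset
    number of concepts that do not yet exist in the concept library."""
    extracted = set()
    for text in verification_texts:
        extracted |= extract_concepts(template_pattern, text)
    new_concepts = extracted - set(concept_library)
    return len(new_concepts) > preset_number
```

Claim 7 below refines this check by additionally requiring the ratio of new to old concept data to clear a threshold.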
The concept graph construction device provided by this embodiment realizes adaptive construction of concept graphs, so that an applicable concept graph can be constructed for each different application, providing a concept system and knowledge at the most appropriate granularity, endowing applications with more intelligent cognitive services, and further improving the accuracy of the knowledge content provided for different application scenarios.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device. Refer to fig. 4, which is a block diagram of the basic structure of the computer device according to the present embodiment.
The computer device 6 comprises a memory 61, a processor 62 and a network interface 63, which are communicatively connected to one another via a system bus. It is noted that only a computer device 6 having components 61-63 is shown, but it should be understood that not all of the shown components need be implemented, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server or another computing device. The computer device can interact with a user through a keyboard, a mouse, a remote controller, a touch panel, a voice control device or the like.
The memory 61 includes at least one type of readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk or an optical disk. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit of the computer device 6 and an external storage device thereof. In this embodiment, the memory 61 is generally used for storing the operating system installed in the computer device 6 and various types of application software, such as the program code of the concept graph construction method. Further, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute the program code stored in the memory 61 or to process data, for example to execute the program code of the concept graph construction method.
The network interface 63 may comprise a wireless network interface or a wired network interface, and the network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
The computer device provided by this embodiment realizes adaptive construction of concept graphs, so that an applicable concept graph can be constructed for each different application, providing a concept system and knowledge at the most appropriate granularity, endowing applications with more intelligent cognitive services, and further improving the accuracy of the knowledge content provided for different application scenarios.
The present application further provides another embodiment, namely a computer-readable storage medium storing a concept graph construction program, the concept graph construction program being executable by at least one processor to cause the at least one processor to perform the steps of the concept graph construction method as described above.
The computer-readable storage medium provided by this embodiment realizes adaptive construction of concept graphs, so that an applicable concept graph can be constructed for each different application, providing a concept system and knowledge at the most appropriate granularity, endowing applications with more intelligent cognitive services, and further improving the accuracy of the knowledge content provided for different application scenarios.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly can also be implemented by hardware, but in many cases the former is the preferable implementation. Based on such an understanding, the technical solution of the present application may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disk) and including instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present application.
It should be understood that the above-described embodiments are merely illustrative of some, but not all, embodiments of the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; the embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described above, or that some of their features may be replaced by equivalents. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the present application.

Claims (10)

1. A concept graph construction method, characterized by comprising the following steps:
acquiring text data, performing phrase extraction on the text data based on a basic template of a preset template library to obtain first candidate concept data, performing phrase extraction on the text data based on a preset phrase extraction algorithm to obtain second candidate concept data, and performing phrase extraction on the text data based on a preset language model to obtain third candidate concept data;
combining the first candidate concept data, the second candidate concept data and the third candidate concept data into a candidate data set, scoring the candidate concept data in the candidate data set through a preset scoring model, and determining the candidate concept data with the score larger than or equal to a preset threshold value as preferred concept data;
and matching the preferred concept data with stored preset knowledge one by one, determining the preset knowledge successfully matched with the preferred concept data as preferred knowledge, and storing the preferred concept data in association with the preferred knowledge.
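By way of illustration only, claim 1 reads as a pipeline: three extraction routes are merged, the union is scored, and the surviving concepts are linked to stored knowledge. The sketch below is a deliberately toy rendering of that pipeline; every helper is a hypothetical stand-in for the components the claim presupposes, not the actual algorithms of this application.

```python
import re

def extract_by_template(text, templates):
    """Toy template route: templates modeled as regexes."""
    found = set()
    for pattern in templates:
        found |= set(re.findall(pattern, text))
    return found

def extract_by_phrase_algorithm(text):
    """Toy phrase-extraction route: keep longer whitespace-separated tokens."""
    return {token for token in text.split() if len(token) > 3}

def extract_by_language_model(text):
    """Placeholder for the pre-trained language model route (see claim 2)."""
    return set()

def score(candidate):
    """Toy stand-in for the preset scoring model (see claim 3)."""
    return min(len(candidate) / 10.0, 1.0)

def build_concept_graph(text, templates, knowledge_base, threshold=0.5):
    candidates = (extract_by_template(text, templates)
                  | extract_by_phrase_algorithm(text)
                  | extract_by_language_model(text))
    preferred = {c for c in candidates if score(c) >= threshold}
    # Store each preferred concept in association with its matched knowledge.
    return {c: [k for k in knowledge_base if c in k] for c in preferred}
```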
2. The concept graph construction method according to claim 1, wherein before the step of performing phrase extraction on the text data based on the preset language model to obtain third candidate concept data, the method comprises:
acquiring preset training data, and training a basic language model based on the preset training data, wherein the basic language model comprises a pre-training model and a conditional random field model;
acquiring the learning rate of the basic language model, and determining whether, at the learning rate, the convergence values of the pre-training model and the conditional random field model in the basic language model are the corresponding optimal values;
and when the convergence value of either the pre-training model or the conditional random field model is not the corresponding optimal value, adjusting the learning rate until the convergence values of the pre-training model and the conditional random field model reach the corresponding optimal values, and determining the basic language model as the preset language model.
3. The concept graph construction method according to claim 1, wherein the step of scoring the candidate concept data in the candidate data set by a preset scoring model comprises:
obtaining semantic aggregation, length, part-of-speech tagging results and language model characteristics of candidate concept data in the candidate data set;
and inputting the length, the part-of-speech tagging result, the semantic aggregation and the language model characteristics into a preset scoring model, and calculating to obtain a score corresponding to the candidate concept data.
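By way of illustration only, the scoring step of claim 3 can be pictured as a linear combination of the four features; the weights below are arbitrary placeholders, since the claim leaves the preset scoring model unspecified.

```python
def score_candidate(features, weights=(0.4, 0.1, 0.2, 0.3)):
    """features: (semantic_aggregation, normalized_length,
    pos_tagging_score, language_model_score), each assumed in [0, 1]."""
    return sum(w * f for w, f in zip(weights, features))
```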
4. The concept graph construction method according to claim 3, wherein the step of obtaining the semantic aggregation of the candidate concept data in the candidate data set comprises:
obtaining large-scale corpus data, and determining the articles that include the candidate concept data based on the large-scale corpus data;
obtaining the categories corresponding to all the articles, calculating the number of occurrences of the candidate concept data in each category and the total number of occurrences of the candidate concept data across all the categories, and taking the maximum ratio of a per-category occurrence count to the total count as the semantic aggregation of the candidate concept data.
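By way of illustration only, the semantic aggregation of claim 4 can be computed as follows, assuming each article in the large-scale corpus carries a category label; the counting mirrors the claim, i.e. the maximum ratio of a per-category occurrence count to the total occurrence count.

```python
from collections import Counter

def semantic_aggregation(concept: str, articles) -> float:
    """articles: iterable of (category, text) pairs drawn from the corpus.
    Returns the maximum, over categories, of the occurrences of `concept`
    in that category divided by its total occurrences in all categories."""
    per_category = Counter()
    for category, text in articles:
        per_category[category] += text.count(concept)
    total = sum(per_category.values())
    return max(per_category.values()) / total if total else 0.0
```

A concept concentrated in a single category thus scores close to 1, while a phrase scattered evenly across categories scores low, which is why the measure helps separate genuine concepts from generic strings.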
5. The concept graph construction method according to claim 1, wherein the step of matching the preferred concept data with the stored preset knowledge one by one comprises:
calculating the glyph (surface-form) similarity between the preferred concept data and the preset knowledge, taking the preferred concept data whose glyph similarity is greater than or equal to a first matching degree as primary concept data, and determining the preset knowledge corresponding to the primary concept data as the preferred knowledge;
and taking the preferred concept data other than the primary concept data as secondary concept data, calculating the semantic similarity between the secondary concept data and the preset knowledge, and determining the preset knowledge whose semantic similarity is greater than or equal to a second matching degree as the preferred knowledge.
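By way of illustration only, the two-stage matching of claim 5 can be sketched as below; `difflib`'s sequence ratio stands in for the glyph (surface-form) similarity, and the semantic similarity measure is left injectable, since the claim fixes neither measure.

```python
from difflib import SequenceMatcher

def glyph_similarity(a: str, b: str) -> float:
    """Stand-in for the surface-form similarity between two strings."""
    return SequenceMatcher(None, a, b).ratio()

def match_concepts(preferred, knowledge, semantic_similarity,
                   first_degree=0.9, second_degree=0.8):
    matches, secondary = {}, []
    # Stage 1: surface-form matching selects the primary concept data.
    for concept in preferred:
        hits = [k for k in knowledge
                if glyph_similarity(concept, k) >= first_degree]
        if hits:
            matches[concept] = hits
        else:
            secondary.append(concept)
    # Stage 2: remaining (secondary) concepts fall back to semantic similarity.
    for concept in secondary:
        matches[concept] = [k for k in knowledge
                            if semantic_similarity(concept, k) >= second_degree]
    return matches
```

Passing `glyph_similarity` as the `semantic_similarity` argument gives a runnable, if degenerate, demonstration; a real deployment would plug in an embedding-based measure.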
6. The concept graph construction method according to claim 1, wherein after the step of performing phrase extraction on the text data based on the basic template of the preset template library to obtain the first candidate concept data, the method comprises:
detecting whether the first candidate concept data exists in a preset concept library, and determining the first candidate concept data as new concept data when the first candidate concept data does not exist in the concept library;
generating a candidate template corresponding to the new concept data based on a preset template generation system and the text data;
and performing quality detection on the candidate template, and saving the candidate template as a basic template in a preset template library when the quality detection of the candidate template passes.
7. The concept graph construction method according to claim 6, wherein the step of performing quality detection on the candidate template comprises:
performing concept extraction on verification data based on the candidate template, and obtaining a first quantity of new concept data and a second quantity of old concept data from the extraction;
and calculating the ratio of the first quantity to the second quantity, and determining that the candidate template passes the quality detection when the ratio is greater than or equal to a first preset threshold and the second quantity is greater than or equal to a second preset threshold.
8. A concept graph construction apparatus, comprising:
the acquisition module is used for acquiring text data, performing phrase extraction on the text data based on a basic template of a preset template library to obtain first candidate concept data, performing phrase extraction on the text data based on a preset phrase extraction algorithm to obtain second candidate concept data, and performing phrase extraction on the text data based on a preset language model to obtain third candidate concept data;
the combination module is used for combining the first candidate concept data, the second candidate concept data and the third candidate concept data into a candidate data set, scoring the candidate concept data in the candidate data set through a preset scoring model, and determining the candidate concept data with the score larger than or equal to a preset threshold value as preferred concept data;
and the matching module is used for matching the preferred concept data with stored preset knowledge one by one, determining the preset knowledge successfully matched with the preferred concept data as preferred knowledge, and storing the preferred concept data in association with the preferred knowledge.
9. A computer device comprising a memory having stored therein a computer program and a processor which, when executing the computer program, implements the steps of the concept graph construction method according to any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps of the concept graph construction method according to any one of claims 1 to 7.
CN202011288931.8A 2020-11-17 2020-11-17 Concept graph construction method, device, computer equipment and storage medium Active CN112395391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011288931.8A CN112395391B (en) 2020-11-17 2020-11-17 Concept graph construction method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112395391A true CN112395391A (en) 2021-02-23
CN112395391B CN112395391B (en) 2023-11-03

Family

ID=74606274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011288931.8A Active CN112395391B (en) 2020-11-17 2020-11-17 Concept graph construction method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112395391B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200004832A1 (en) * 2018-07-02 2020-01-02 Babylon Partners Limited Computer Implemented Method for Extracting and Reasoning with Meaning from Text
CN110781310A (en) * 2019-09-09 2020-02-11 深圳壹账通智能科技有限公司 Target concept graph construction method and device, computer equipment and storage medium
CN111767368A (en) * 2020-05-27 2020-10-13 重庆邮电大学 Question-answer knowledge graph construction method based on entity link and storage medium
CN111930856A (en) * 2020-07-06 2020-11-13 北京邮电大学 Method, device and system for constructing domain knowledge graph ontology and data

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022227171A1 (en) * 2021-04-25 2022-11-03 平安科技(深圳)有限公司 Method and apparatus for extracting key information, electronic device, and medium
CN114741508A (en) * 2022-03-29 2022-07-12 北京三快在线科技有限公司 Concept mining method and device, electronic equipment and readable storage medium
CN114547346A (en) * 2022-04-22 2022-05-27 浙江太美医疗科技股份有限公司 Knowledge graph construction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112395391B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
WO2019184217A1 (en) Hotspot event classification method and apparatus, and storage medium
US10394956B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN112215008B (en) Entity identification method, device, computer equipment and medium based on semantic understanding
CN112395391B (en) Concept graph construction method, device, computer equipment and storage medium
WO2021135469A1 (en) Machine learning-based information extraction method, apparatus, computer device, and medium
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
CN106874253A (en) Recognize the method and device of sensitive information
CN112328761B (en) Method and device for setting intention label, computer equipment and storage medium
CN112417102A (en) Voice query method, device, server and readable storage medium
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
US20220138424A1 (en) Domain-Specific Phrase Mining Method, Apparatus and Electronic Device
CN111695338A (en) Interview content refining method, device, equipment and medium based on artificial intelligence
CN110427453B (en) Data similarity calculation method, device, computer equipment and storage medium
CN112468658B (en) Voice quality detection method and device, computer equipment and storage medium
CN112836521A (en) Question-answer matching method and device, computer equipment and storage medium
CN112686053A (en) Data enhancement method and device, computer equipment and storage medium
CN112650858A (en) Method and device for acquiring emergency assistance information, computer equipment and medium
CN113627797A (en) Image generation method and device for employee enrollment, computer equipment and storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN112085091B (en) Short text matching method, device, equipment and storage medium based on artificial intelligence
CN112528040A (en) Knowledge graph-based method for guiding textbook corpus detection and related equipment thereof
CN116881446A (en) Semantic classification method, device, equipment and storage medium thereof
CN114742058B (en) Named entity extraction method, named entity extraction device, computer equipment and storage medium
CN110705308A (en) Method and device for recognizing field of voice information, storage medium and electronic equipment
CN113051900B (en) Synonym recognition method, synonym recognition device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant