CN114328799A - Data processing method, device and computer readable storage medium

Info

Publication number: CN114328799A
Application number: CN202111324087.4A
Authority: CN
Inventors: 茅辉强; 徐世超; 王树辰; 徐钊
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-11-10
Filing date: 2021-11-10
Publication date: 2022-04-12

Abstract

The embodiments of the present application disclose a data processing method, a data processing apparatus and a computer-readable storage medium. A target preset corpus corresponding to a target knowledge type is acquired from a preset corpus database, and a target search word set corresponding to the target knowledge type is acquired from a historical search record; word information is extracted from the target preset corpus and the target search word set to obtain a target entity word set; the attributes of the target entity words in the target entity word set are updated to obtain target attribute information corresponding to the target entity words; knowledge content associated with the target attribute information is searched for, and an association relation is established between the knowledge content and the target entity word set to obtain knowledge entity data with the association relation; and a knowledge graph corresponding to the target knowledge type is established according to the knowledge entity data with the association relation. The knowledge graph can thus be constructed from entity words drawn from corpora of multiple channels, which improves the coverage of knowledge entities in the knowledge graph and expands the amount of knowledge information it contains.

Description

Data processing method, device and computer readable storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus, and a computer-readable storage medium.

Background

A knowledge graph is a way of describing knowledge entities and the relationships between them in a visual or structured form, and can be used for knowledge mining, analysis, construction, fusion and the like in various scenarios. To construct a knowledge graph for a given field, the related art determines different knowledge entities from content information published in that field, and aligns the knowledge entities based on the similarity of their attributes to construct the corresponding knowledge graph.

In the course of researching and practicing the prior art, the inventors of the present application found that, when the prior art constructs a knowledge graph, the coverage of the knowledge entities involved is limited, so the constructed knowledge graph contains insufficient knowledge information, which hinders subsequent use.

Disclosure of Invention

The embodiments of the present application provide a data processing method, a data processing apparatus and a computer-readable storage medium, which can improve the coverage of knowledge entities in a knowledge graph.

An embodiment of the present application provides a data processing method, including:

acquiring a target preset corpus corresponding to a target knowledge type from a preset corpus database, and acquiring a target search word set corresponding to the target knowledge type from a historical search record;

extracting word information from the target preset corpus and the target search word set to obtain a target entity word set;

updating the attributes of the target entity words in the target entity word set to obtain target attribute information corresponding to the target entity words;

searching knowledge content associated with the target attribute information, and establishing an association relation between the knowledge content and the target attribute information to obtain knowledge content data with the association relation;

and establishing a knowledge graph corresponding to the target knowledge type according to the knowledge content data with the association relation.

Correspondingly, an embodiment of the present application provides a data processing apparatus, including:

the acquisition unit is used for acquiring a target preset corpus corresponding to a target knowledge type from a preset corpus database and acquiring a target search word set corresponding to the target knowledge type from a historical search record;

the refining unit is used for extracting word information from the target preset corpus and the target search word set to obtain a target entity word set;

the updating unit is used for updating the attributes of the target entity words in the target entity word set to obtain target attribute information corresponding to the target entity words;

the searching unit is used for searching the knowledge content associated with the target attribute information and establishing an association relation between the knowledge content and the target attribute information to obtain knowledge content data with the association relation;

and the establishing unit is used for establishing the knowledge graph corresponding to the target knowledge type according to the knowledge content data with the association relation.

In some embodiments, the refining unit is further configured to:

establishing a word information structure tree according to the target preset corpus and the target search word set;

extracting word information contained in the word information structure tree to obtain an entity word candidate set;

and filtering word information in the entity word candidate set to obtain a target entity word set.

In some embodiments, the refining unit is further configured to:

extracting each word information in the entity word candidate set;

determining word frequency corresponding to each word information based on the target preset corpus and the target search word set;

determining the word information with the word frequency less than a preset word frequency threshold as word information to be filtered, and filtering the word information to be filtered in the entity word candidate set to obtain a filtered target entity word set.

In some embodiments, the update unit is further configured to:

extracting a target attribute word corresponding to each target entity word in the target entity word set;

acquiring a second target corpus in the preset corpus database, wherein the second target corpus is a corpus in the preset corpus database except the target preset corpus;

extracting, from the second target corpus, corpus attribute information associated with each target entity word and the corresponding target attribute word;

and updating the attribute according to the corpus attribute information and the target attribute word corresponding to each target entity word to obtain the target attribute information corresponding to each target entity word.

In some embodiments, the updating unit is further configured to:

extracting a plurality of first target entity words corresponding to the target preset corpus in the target entity word set to obtain a first target entity word set;

acquiring a first target attribute word corresponding to each first target entity word based on the target preset corpus to obtain a first attribute word set;

performing word segmentation processing on the target search word set according to the first target entity word set and the first attribute word set to obtain a target word segmentation result, and determining a second attribute word set corresponding to the target search word set according to the target word segmentation result;

and fusing the first attribute word set and the second attribute word set to obtain a fused attribute word set, wherein the fused attribute word set comprises target attribute words corresponding to each target entity word.

In some embodiments, the updating unit is further configured to:

clustering target search words in the target search word set based on each target entity word in the target entity word set to obtain a search word classification set corresponding to each target entity word;

performing word segmentation processing on the target search words in each search word classification set according to the first target entity word set and the first attribute word set to obtain a word segmentation result corresponding to each search word classification set;

and determining a target word segmentation result corresponding to the target search word set according to the word segmentation result corresponding to each search word classification set.

In some embodiments, the updating unit is further configured to:

identifying a search word classification set corresponding to each attribute word in the target word segmentation result;

acquiring the frequency of target search words contained in the search word classification set corresponding to each attribute word;

determining the search word classification set with the frequency greater than or equal to a preset search word frequency threshold value as a target search word classification set;

and generating a second attribute word set corresponding to the target search word set according to the attribute words corresponding to each target search word classification set.
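
As a rough illustration of the frequency-threshold step above, the following Python sketch keeps only those attribute words that appear in enough distinct search words. It reflects one possible reading of the step, not the claimed implementation: the function name `select_frequent_attributes`, the input mapping `segmented_terms` (search word to the attribute words produced by segmentation) and the sample data are all hypothetical.

```python
from collections import defaultdict


def select_frequent_attributes(segmented_terms, min_term_count):
    """Return attribute words backed by at least `min_term_count` search words.

    `segmented_terms` maps each search word to the attribute words that a
    (hypothetical) segmentation step produced for it.
    """
    # Group the search words under each attribute word they contain.
    terms_by_attribute = defaultdict(set)
    for term, attributes in segmented_terms.items():
        for attribute in attributes:
            terms_by_attribute[attribute].add(term)

    # Keep attribute words whose group reaches the threshold (>=).
    return {
        attribute
        for attribute, terms in terms_by_attribute.items()
        if len(terms) >= min_term_count
    }


# Hypothetical segmentation result for three search words.
segmented = {
    "python tutorial": ["tutorial"],
    "java tutorial": ["tutorial"],
    "python install": ["install"],
}
second_attribute_words = select_frequent_attributes(segmented, 2)
```

In this toy input, "tutorial" is backed by two search words and "install" by one, so only "tutorial" passes a threshold of 2.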

In some embodiments, the updating unit is further configured to:

performing word information matching on the target attribute words in the corpus attribute information to obtain target attribute words matched with the corpus attribute information;

and determining the target attribute words matched with the corpus attribute information as attribute words to be updated, and performing regularization processing according to the attribute words to be updated and the corpus attribute information to obtain target attribute information corresponding to each target entity word.

In addition, the embodiment of the present application further provides a computer device, which includes a processor and a memory, where the memory stores a computer program, and the processor is configured to execute the computer program in the memory to implement the steps in the data processing method provided in the embodiment of the present application.

In addition, a computer-readable storage medium is provided, where the computer-readable storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to perform the steps in any one of the data processing methods provided in the embodiments of the present application.

In addition, the embodiment of the present application also provides a computer program product, which includes computer instructions, and the computer instructions, when executed, implement the steps in any data processing method provided by the embodiment of the present application.

According to the embodiments of the present application, a target preset corpus corresponding to a target knowledge type can be acquired from a preset corpus database, and a target search word set corresponding to the target knowledge type can be acquired from a historical search record; word information is extracted from the target preset corpus and the target search word set to obtain a target entity word set; the attributes of the target entity words in the target entity word set are updated to obtain target attribute information corresponding to the target entity words; knowledge content associated with the target attribute information is searched for, and an association relation is established between the knowledge content and the target entity word set to obtain knowledge entity data with the association relation; and a knowledge graph corresponding to the target knowledge type is established according to the knowledge entity data with the association relation. The scheme can therefore determine the target entity word set from both the target preset corpus and the target search word set in the historical search record, expanding the number of knowledge entity words by combining the corpora in the database with the historical search records. After the target attribute information corresponding to the knowledge entity words is obtained, an association relation is established between the knowledge content associated with the target attribute information and the target entity word set, and the knowledge graph corresponding to the target knowledge type is established based on the associated knowledge entity data. The knowledge graph is thus constructed from entity words drawn from corpora of multiple channels, which improves the coverage of knowledge entities in the knowledge graph and the amount of knowledge information it contains, and in turn improves the usability of the knowledge graph.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a data processing system according to an embodiment of the present application;

FIG. 2 is a schematic flow chart illustrating steps of a data processing method according to an embodiment of the present application;

FIG. 3 is a schematic flow chart illustrating another step of a data processing method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a scenario of a data processing method provided in an embodiment of the present application;

FIG. 5 is a schematic diagram of a scenario of target entity word processing in the data processing method provided in an embodiment of the present application;

FIG. 6 is a schematic diagram of a scenario of target attribute information expansion in the data processing method provided in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The embodiment of the application provides a data processing method, a data processing device and a computer readable storage medium. Specifically, the embodiments of the present application will be described from the perspective of a data processing apparatus, where the data processing apparatus may be specifically integrated in a computer device, and the computer device may be a server, or may be a terminal or other devices. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, big data and an artificial intelligence platform. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a vehicle-mounted terminal, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.

The scheme provided by the embodiments of the present application relates to technologies such as artificial intelligence data processing, and is explained through the following embodiments.

For example, referring to FIG. 1, FIG. 1 is a schematic diagram of a scenario of a data processing system according to an embodiment of the present application. The scenario includes a computer device, which may be a terminal or a server.

The terminal or the server can acquire a target preset corpus corresponding to the target knowledge type from a preset corpus database and acquire a target search word set corresponding to the target knowledge type from a historical search record; extract word information from the target preset corpus and the target search word set to obtain a target entity word set; update the attributes of the target entity words in the target entity word set to obtain target attribute information corresponding to the target entity words; search for knowledge content associated with the target attribute information, and establish an association relation between the knowledge content and the target entity word set to obtain knowledge content data with the association relation; and establish a knowledge graph corresponding to the target knowledge type according to the knowledge content data with the association relation.

The data processing may include processing modes such as obtaining target preset corpora, obtaining target search words in historical search records, refining entity words, updating attribute information, searching associated knowledge content, and constructing a knowledge graph.

Each of these is described in detail below. The order in which the following embodiments are described is not intended to limit their preferred order.

In the embodiments of the present application, the description is made from the perspective of a data processing apparatus, which may be specifically integrated in a computer device such as a terminal or a server. Referring to FIG. 2, FIG. 2 is a schematic flow chart of the steps of a data processing method provided in an embodiment of the present application. Taking the data processing apparatus being integrated in a terminal or a server as an example, when a processor on the terminal or the server executes a program corresponding to the data processing method, the specific flow is as follows:

101. Acquiring a target preset corpus corresponding to the target knowledge type from a preset corpus database, and acquiring a target search word set corresponding to the target knowledge type from a historical search record.

The target knowledge type may be a type of a knowledge graph to be established, the knowledge type may be an education knowledge type, a travel knowledge type, a sports knowledge type, a medical knowledge type, and the like, and the knowledge type may be understood as a knowledge field, such as an education field, a sports field, a medical field, a travel field, and the like.

The preset corpus database may be a database containing a large amount of knowledge content data, which may contain knowledge content corpora of a plurality of knowledge types (knowledge fields), such as knowledge content corpora of education fields, knowledge content corpora of sports fields, knowledge content corpora of travel fields, and the like; in addition, the preset corpus database may also be a database containing knowledge content corpuses of a certain knowledge type (knowledge domain), such as knowledge content corpuses of education domain, knowledge content corpuses of sports domain, knowledge content corpuses of tourism domain, or knowledge content corpuses of other domains, which is not limited herein. It should be noted that the preset corpus database contains the corpus of the existing knowledge content in one or more domains or knowledge types.

The target preset corpus may be a structured corpus of the corresponding knowledge type, such as a knowledge corpus in the education field, a knowledge corpus in the travel field, and the like. It can be understood that the target preset corpus is an existing knowledge content corpus of the corresponding knowledge type or knowledge field.

The historical search record may be a record of a target object searching for knowledge content of a certain knowledge type in a historical time, and the historical search record may include keywords or search terms used by the target object when searching for the target knowledge content.

The target search term set includes one or more target search terms (target keywords), such as one or more search terms, keywords, etc. used when the target object searches for the knowledge content in the education domain, or one or more search terms used when the target object searches for the related knowledge content in other domains.

The primary purpose of the embodiments of the present application is to construct a knowledge graph for a field or a knowledge type. A knowledge graph (Knowledge Graph) is a kind of semantic network that describes objective things in the form of a graph data structure consisting of nodes and edges. It is a method of describing knowledge entities and entity relations in a visual or structured way, and is used to express the association relations or relevance between different knowledge entities in the same field (or knowledge type). Specifically, nodes in the knowledge graph represent concepts and entities, where concepts are abstract things and entities are concrete things; edges represent the relationships and attributes of things, where attributes represent the internal features of things and relationships represent their external connections. For ease of understanding, entities and concepts are collectively referred to as entities, and relationships and attributes are collectively referred to as relationships; that is, a knowledge graph describes entities and the relationships between entities. For example, an entity may be a person, a place, an organization, a concept, etc., and a relationship may be between persons, between a person and an organization, between a concept and an object, etc. The foregoing is by way of example only and is not limiting.
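
The node-and-edge structure described above can be pictured with a minimal Python sketch. This is only an illustration under stated assumptions, not the data model of the present application: the class `KnowledgeGraph`, its methods, the relation labels `is_a` and `used_for`, and the sample entities are all hypothetical. Following the simplification in the text, attributes and relationships are both stored as labelled edges.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Toy knowledge graph: nodes are entities, edges are labelled relations."""

    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (head, relation, tail) triples

    def add_relation(self, head: str, relation: str, tail: str) -> None:
        """Add an edge and register both endpoints as nodes."""
        self.nodes.update((head, tail))
        self.edges.add((head, relation, tail))

    def neighbors(self, entity: str) -> list:
        """Return sorted (relation, tail) pairs for edges leaving `entity`."""
        return sorted((r, t) for h, r, t in self.edges if h == entity)


# Hypothetical entities and relations from the programming field.
graph = KnowledgeGraph()
graph.add_relation("Python", "is_a", "programming language")
graph.add_relation("Python", "used_for", "data analysis")
```

Calling `graph.neighbors("Python")` lists the outgoing edges of one entity, which is the basic lookup a knowledge graph supports.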

In order to construct a knowledge graph of a certain field or knowledge type, the embodiments of the present application first acquire the knowledge content corpus of the corresponding knowledge type, that is, the target preset corpus, together with the search words, keywords and the like that target objects used within a historical time period to search for knowledge content of that knowledge type or field. The knowledge entities used to construct the knowledge graph are subsequently determined from the existing target preset corpus of the target knowledge type and from the target search words that the target objects used for that knowledge type within the historical time period.

For example, taking the education field as the target knowledge type, in order to establish a knowledge graph of the education field, an education knowledge corpus may be obtained from the corpus database of an education vertical website, where the corpus takes two forms: structured corpus and unstructured corpus. Search information such as the search words or keywords that target objects used to search for education knowledge content in historical time is acquired from the historical search records of the online education vertical website. The knowledge entity words used to construct the knowledge graph of the education field are then preliminarily determined based on the education knowledge corpus and the search words.

In this way, the existing knowledge corpus and the search words in users' search records for knowledge content can be obtained, so that the knowledge entities for constructing the knowledge graph can subsequently be determined.

102. Extracting word information from the target preset corpus and the target search word set to obtain a target entity word set.

Wherein the set of target entity words may include one or more target entity words.

A target entity word may be a knowledge entity of the target knowledge type and is used in building the knowledge graph of the target knowledge type. For example, in the field of sports knowledge, the target entity words may be track and field, basketball, horse riding, shooting, etc.; for another example, in the programming field, the target entity words may be Java, Python, Spring Boot, etc. These are merely examples and are not limiting.

In order to obtain a knowledge entity for constructing a knowledge graph, in the embodiment of the application, after obtaining a knowledge content corpus corresponding to a target knowledge type and a target search word set of a target object, word information is extracted based on the knowledge content corpus and the target search word set, so that the extracted word information is used as a knowledge entity for constructing the knowledge graph, namely the target entity word. Therefore, the knowledge entity used for constructing the knowledge graph is determined by combining the corpus and the search information in the historical search record of the target object, the knowledge entity is determined from the corpus information of various sources, the coverage range of the knowledge entity in the knowledge graph is effectively improved, and the knowledge entity is closer to the use requirement.

In some embodiments, the step of "extracting word information from the target preset corpus and the target search word set to obtain the target entity word set" may include:

(1) Establishing a word information structure tree according to the target preset corpus and the target search word set.

(2) Extracting word information contained in the word information structure tree to obtain an entity word candidate set.

(3) Filtering the word information in the entity word candidate set to obtain a target entity word set.

The word information structure tree may be a word information search tree (a dictionary tree, also called a trie or prefix tree), which may be built from specific word information, keywords, characters, etc., and is used for searching, counting and sorting word information.

The word information may be obtained by extraction from the target preset corpus and the target search word set, for example by extracting it through the established word information structure tree.

The entity word candidate set includes one or more candidate entity words, which may be word information, such as word groups, word features, and the like, and may also be short texts or characters representing semantics and meanings, such as short sentences, characters, words, and the like. It should be noted that the entity word candidate set may include all word information corresponding to the target preset corpus and the target search word set, and the same word information is stored only once in the entity word candidate set, that is, the entity word candidate set includes different candidate entity words.

The target entity word set includes one or more target entity words, and the entity words may be word information or short text. It should be noted that the target entity words included in the target entity word set belong to word information or short texts which are relatively common, popular or occur frequently, and are used for participating in the subsequent construction of the knowledge graph.

In order to determine the target entity words for constructing the knowledge graph, in the embodiments of the present application, after the target preset corpus and the target search word set are obtained, a word information structure tree may be constructed from them as follows: the word information contained in the target preset corpus and the target search word set is looked up according to part-of-speech information, and the word information structure tree is established from the word information found. For example, the tree may be established from the found word information as follows: acquire the character sequence corresponding to each piece of word information, where the character sequence is the ordered sequence of the one or more characters that make up the word information; determine the position of each character in the word information structure tree according to its order in the character sequence; and establish the word information structure tree according to the position information of each character.

Further, after the word information structure tree is obtained, word information may be extracted according to character data included in the word information structure tree to obtain a plurality of word information, and an entity word candidate set corresponding to the plurality of word information may be generated.
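
The construction and extraction steps above can be sketched with a small prefix tree in Python. This is a minimal sketch under stated assumptions rather than the implementation of the present application: the classes `TrieNode` and `WordTrie`, the sample word lists, and the assumption that word information has already been identified (the part-of-speech lookup is omitted) are all hypothetical.

```python
class TrieNode:
    """One character position in the word information structure tree."""

    def __init__(self):
        self.children = {}    # next character -> child node
        self.is_word = False  # True if a complete word ends here


class WordTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        """Place each character at the depth given by its order in the word."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def words(self) -> list:
        """Walk the tree and return every stored word exactly once, sorted."""
        found = []
        stack = [(self.root, "")]
        while stack:
            node, prefix = stack.pop()
            if node.is_word:
                found.append(prefix)
            for ch, child in node.children.items():
                stack.append((child, prefix + ch))
        return sorted(found)


# Hypothetical word information from the corpus and from search words.
corpus_words = ["java", "javascript", "python"]
search_words = ["python", "spring boot", "java"]

trie = WordTrie()
for word in corpus_words + search_words:
    trie.insert(word)

candidates = trie.words()  # entity word candidate set, duplicates collapsed
```

Because a word shared by both sources follows the same path in the tree, it is stored only once, which matches the requirement that the candidate set holds distinct candidate entity words.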

It should be noted that the obtained entity word candidate set generally includes a plurality of entity words (word information), some of which may have a low frequency of use or occurrence, for example appearing only once or a few times. Such low-frequency entity words may have little or no correlation with the knowledge content of the target knowledge type. To avoid impairing the accuracy and specificity of the knowledge graph subsequently established for the target knowledge type, the embodiments of the present application may filter out the low-frequency candidate entity words in the entity word candidate set to obtain the target entity word set, which includes one or more high-frequency target entity words.

In some embodiments, the step of "filtering word information in the entity word candidate set to obtain the target entity word set" may include:

(3.1) Extracting each piece of word information in the entity word candidate set.

(3.2) Determining the word frequency corresponding to each piece of word information based on the target preset corpus and the target search word set.

(3.3) Determining word information whose word frequency is less than a preset word frequency threshold as word information to be filtered, and filtering the word information to be filtered out of the entity word candidate set to obtain a filtered target entity word set.

The preset word frequency threshold may be a threshold for screening entity words, and is specifically used for filtering low-frequency candidate entity words (word information) in the entity word candidate set.

Specifically, in order to filter low-frequency word information in the entity word candidate set to obtain a filtered target entity word set, each word information (candidate entity word) included in the entity word candidate set may be extracted in the embodiment of the present application; determining the frequency of occurrence of each candidate entity word (word information) in the target preset corpus and the target search word set, namely determining the word frequency corresponding to each word information; and comparing the word frequency corresponding to each word information with a preset word frequency threshold value, determining the word information with the word frequency smaller than the preset word frequency threshold value as the word information to be filtered, and filtering/deleting the word information to be filtered from the entity word candidate set to obtain a filtered target entity word set.
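
The filtering described above could look roughly like the following Python sketch. It is illustrative only: the function name `filter_by_frequency`, the sample texts and the threshold value are hypothetical, and word frequency is approximated here by simple substring counting over the corpus texts and search words, which is an assumption rather than the counting method of the present application.

```python
from collections import Counter


def filter_by_frequency(candidates, texts, min_frequency):
    """Keep candidates whose total occurrence count reaches `min_frequency`.

    `texts` holds the corpus passages and search words together. Candidates
    with a count below the threshold are treated as word information to be
    filtered and are dropped.
    """
    counts = Counter()
    for text in texts:
        for word in candidates:
            # Assumption: non-overlapping substring matches approximate
            # the word frequency.
            counts[word] += text.count(word)
    return {word for word in candidates if counts[word] >= min_frequency}


# Hypothetical candidate set and source texts.
candidate_set = {"python", "rust", "haskell"}
texts = [
    "python tutorial for beginners",
    "python list comprehension",
    "learn rust basics",
]
target_entity_words = filter_by_frequency(candidate_set, texts, 2)
```

With a threshold of 2, "python" (two occurrences) is kept, while "rust" (one) and "haskell" (none) are filtered out.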

Through the above method, the target entity word set can be determined based on the target preset corpus and the target search word set of the target object in the historical time period, so that the knowledge entities corresponding to the target knowledge type are preliminarily expanded and are not limited to the preset corpus. The target entity words thus fit the usage habits of users more closely, and the knowledge graph can subsequently be constructed based on the target entity words in the target entity word set.

103. And updating the attributes of the target entity words in the target entity word set to obtain target attribute information corresponding to the target entity words.

The target attribute information refers to attribute information obtained after attribute expansion and updating of each target entity word, and the attribute information may be information including specific description of the target entity word.

It should be noted that the target entity word set includes one or more target entity words, and each target entity word may correspond to a corresponding attribute. For example, taking a neural network model as a target entity word as an example, the attribute corresponding to the neural network model may be "artificial intelligence" or "AI".

In order to expand and update the attribute of each target entity word, the embodiment of the application can supplement and refine the attribute information of each target entity word based on the target search words of a large number of target objects and the corresponding corpus, so as to obtain the target attribute information corresponding to each target entity word.

In some embodiments, the step "performing attribute update on the target entity words in the target entity word set to obtain target attribute information corresponding to the target entity words" may include:

(1) extracting a target attribute word corresponding to each target entity word in the target entity word set;

(2) acquiring a second target corpus in the preset corpus database, wherein the second target corpus is a corpus in the preset corpus database except the target preset corpus;

(3) extracting corpus attribute information associated with each target entity word and the corresponding target attribute word from the second target corpus;

(4) and updating the attribute according to the corpus attribute information and the target attribute words corresponding to each target entity word to obtain the target attribute information corresponding to each target entity word.

The target attribute word may be initial attribute information corresponding to the corresponding target entity word. For example, taking Java as the target entity word, the corresponding target attribute words may be "inheritance", "object-oriented", "distributed", or the like. It should be noted that the target attribute words may include a first target attribute word and a second target attribute word, where the first target attribute word corresponds to an entity word obtained from the target preset corpus, and the second target attribute word corresponds to an entity word obtained from the target search words.

The second target corpus may be an unstructured corpus and/or a semi-structured corpus corresponding to the target knowledge type in the preset corpus database. For example, taking Java as the target entity word, the corpus attribute information of Java in the second target corpus (semi-structured corpus/unstructured corpus) is described as "Java distributed application technology architecture includes an initial stage, application service and data service separation, performance improvement using cache, application server cluster, database read-write separation, and the like".

In order to update the attribute of each target entity word, in the embodiment of the present application, after the target entity word set is obtained, the target attribute word corresponding to each target entity word is extracted first, for example, determined according to the target preset corpus (structured corpus); then, a second target corpus, namely an unstructured corpus and/or a semi-structured corpus, is acquired from the preset corpus database, and corpus attribute information is extracted from the second target corpus; furthermore, attribute updating is performed on each target entity word based on the corpus attribute information and the target attribute words corresponding to that target entity word. The updating may be performed, for example, by regularizing the corpus attribute information and the target attribute words to obtain the target attribute information corresponding to each target entity word.

When a target entity word has a plurality of corresponding target attribute words, the attribute updating process of that target entity word may be: matching the plurality of target attribute words against the corresponding corpus attribute information to determine the target attribute words that match the corpus attribute information; determining the matched target attribute words as the attribute words to be updated of the target entity word; and integrating the corpus attribute information with the attribute words to be updated, for example through regularization, to obtain the target attribute information corresponding to the target entity word.

For example, taking Java as the target entity word, the corresponding target attribute words may be "inheritance", "object-oriented", "distributed", or the like; the corpus attribute information of Java in the second target corpus (semi-structured corpus/unstructured corpus) is described as "Java distributed application technology architecture includes an initial stage, application service and data service separation, performance improvement using cache, application server cluster, database read-write separation, and the like"; through text regularization and pattern matching, the expanded attribute (target attribute information) of the entity word Java can be obtained as follows: Java distributed architecture (initial stage, application service and data service separation, performance improvement using cache, application server cluster, database read-write separation).
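The Java example can be reproduced with a small Python sketch. The embodiment does not specify the regular expression or the output structure, so the pattern `LIST_PATTERN`, the function `update_attributes`, and the dictionary result are assumptions made purely for illustration.

```python
import re

# Hypothetical pattern for sentences shaped like
# "<subject> includes a, b, c, and the like".
LIST_PATTERN = re.compile(r"\bincludes?\s+(?P<items>.+?)(?:,\s*and the like)?\.?\s*$")

def update_attributes(attribute_words, corpus_attribute_info):
    """Map each matched attribute word to the items listed in the corpus text."""
    lowered = corpus_attribute_info.lower()
    # Attribute words found in the corpus text are the attribute words to be updated.
    to_update = [w for w in attribute_words if w.lower() in lowered]
    match = LIST_PATTERN.search(corpus_attribute_info)
    if not to_update or match is None:
        return {}
    # Regularize the enumerated items: split on commas and drop leading articles.
    items = [re.sub(r"^(?:an?|the)\s+", "", part.strip())
             for part in match.group("items").split(",") if part.strip()]
    return {word: items for word in to_update}

corpus_info = ("Java distributed application technology architecture includes an initial stage, "
               "application service and data service separation, performance improvement using cache, "
               "application server cluster, database read-write separation, and the like")
expanded = update_attributes(["inheritance", "object-oriented", "distributed"], corpus_info)
```

Only "distributed" occurs in the corpus description, so it is the only attribute word that gets expanded; "inheritance" and "object-oriented" are left unchanged.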

In some embodiments, the step of "extracting a target attribute word corresponding to each target entity word in the target entity word set" may include:

(1.1) extracting a plurality of first target entity words corresponding to target preset linguistic data in a target entity word set to obtain a first target entity word set;

(1.2) acquiring a first target attribute word corresponding to each first target entity word based on the target preset corpus to obtain a first attribute word set;

(1.3) performing word segmentation processing on the target search word set according to the first target entity word set and the first attribute word set to obtain a target word segmentation result, and determining a second attribute word set corresponding to the target search word set according to the target word segmentation result;

and (1.4) fusing the first attribute word set and the second attribute word set to obtain a fused attribute word set, wherein the fused attribute word set comprises target attribute words corresponding to each target entity word.

The target entity word set may include a target entity word corresponding to the target search word set, and may also include a target entity word corresponding to the target search word set and the target preset corpus. Taking the example that the target entity word set includes both the target search word set and the target entity word corresponding to the target preset corpus, the first target entity word may be an entity word corresponding to the target preset corpus in the target entity word set.

And the first target attribute word is an attribute word corresponding to the corresponding first target entity word.

The target word segmentation result may include a second target attribute word corresponding to each target search word.

In order to determine a target attribute word corresponding to each target entity word, in the embodiment of the present application, after a target entity word set is obtained, first, a first target entity word generated by a target preset corpus may be obtained from the target entity word set. Then, a first target attribute word corresponding to each first target entity word is extracted from the target preset corpus (structured corpus), so as to generate a first attribute word set, where the first attribute word set includes a plurality of first target attribute words, that is, includes the first target attribute word corresponding to each first target entity word. Then, taking a first target entity word contained in the first target entity word set and a first target attribute word contained in the first attribute word set as dictionaries, and performing word segmentation processing on each target search word in the target search word set to obtain a corresponding target word segmentation result; and further, determining a second target attribute word corresponding to each target search word based on the target word segmentation result, and generating a second attribute word set, wherein the second attribute word set comprises the second target attribute word corresponding to each target search word. And finally, fusing the first attribute word set and the second attribute word set to obtain a fused attribute word set so as to obtain a target attribute word corresponding to each target entity word in the target entity word set.
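Steps (1.1) to (1.4) can be illustrated with the sketch below. The embodiment does not name a segmentation algorithm; forward maximum matching is used here as an assumed stand-in, and all function names and sample data are hypothetical.

```python
def forward_max_match(text, dictionary):
    """Greedy longest-match segmentation; characters outside the dictionary are skipped."""
    max_len = max(map(len, dictionary), default=0)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary:
                tokens.append(piece)
                i += size
                break
        else:
            i += 1  # no dictionary word starts at this position
    return tokens

def second_attribute_words(queries, entity_words, attribute_words):
    """Derive attribute words for search words, using the first sets as the dictionary."""
    dictionary = set(entity_words) | set(attribute_words)
    result = {}
    for query in queries:
        tokens = forward_max_match(query.lower(), dictionary)
        attrs = {t for t in tokens if t in attribute_words}
        for entity in (t for t in tokens if t in entity_words):
            result.setdefault(entity, set()).update(attrs)
    return result

def fuse(first, second):
    """Union the first and second attribute word sets per entity word."""
    fused = {entity: set(words) for entity, words in first.items()}
    for entity, words in second.items():
        fused.setdefault(entity, set()).update(words)
    return fused

# Hypothetical first attribute word set taken from the structured corpus.
first = {"java": {"inheritance", "object-oriented"}, "hadoop": {"distributed"}}
entity_words = set(first)
attribute_words = set().union(*first.values())
queries = ["java distributed architecture", "hadoop distributed cluster"]
second = second_attribute_words(queries, entity_words, attribute_words)
fused = fuse(first, second)
```

In this sample, the search word "java distributed architecture" gives the entity word "java" the additional attribute word "distributed", which the fusion step then merges with the attribute words obtained from the structured corpus.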

In some embodiments, the step of performing word segmentation processing on the target search word set according to the first target entity word set and the first attribute word set to obtain a target word segmentation result may include:

(1.3.1) clustering target search words in the target search word set based on each target entity word in the target entity word set to obtain a search word classification set corresponding to each target entity word;

(1.3.2) performing word segmentation processing on the target search words in each search word classification set according to the first target entity word set and the first attribute word set to obtain a word segmentation result corresponding to each search word classification set;

and (1.3.3) determining a target word segmentation result corresponding to the target search word set according to the word segmentation result corresponding to each search word classification set.

It should be noted that the historical search record is a knowledge content search record generated when the target object searches for the knowledge content of the target knowledge type in the historical time period, and since the target search word is extracted from the historical search record of the target object, the target search word itself does not have a corresponding attribute. In order to subsequently perform attribute expansion and update on the entity words corresponding to the target search words, the embodiment of the application needs to give the attribute words corresponding to each target search word first.

Specifically, in order to assign an attribute word corresponding to each target search word, first, clustering/classifying each target search word in a target search word set based on a target entity word in a target entity word set, specifically, based on a target entity word corresponding to each target search word in the target entity word set, to obtain a search word classification set corresponding to each target entity word; for example, if the second target entity word is used as the target entity word of the target entity word set corresponding to the target search word, the target search words in the target search word set are clustered based on each second target entity word in the target entity word set, so as to obtain a search word classification set corresponding to each second target entity word. Then, based on the first target entity word set and the first attribute word set, for example, as a dictionary, performing word segmentation processing on the target search words in each search word classification set to obtain a word segmentation result corresponding to each search word classification set, where the word segmentation result may include an attribute word corresponding to a corresponding target search word in the search word classification set, that is, a second target attribute word corresponding to the search word classification set. Finally, the word segmentation results corresponding to each search word classification set are combined to obtain target word segmentation results corresponding to the target search word set, in addition, a corresponding target word segmentation result list can be generated according to the word segmentation results corresponding to each search word classification set, and the target word segmentation result list can comprise the mapping relation between each second target entity word and the target search word as well as the second target attribute word (attribute word).

Specifically, the word segmentation processing procedure may be: extracting a target search word from each search word classification set; matching the target search word against the first target entity words in the first target entity word set to obtain an entity word matching result, where the entity word matching result may include a first target entity word that is identical to the target search word, or a first target entity word whose semantics are the same as or similar to those of the target search word; further, searching the first attribute word set for the first target attribute word corresponding to each first target entity word included in the entity word matching result; and finally, determining the word segmentation result corresponding to each search word classification set according to the first target attribute words corresponding to the entity word matching result.
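The clustering and lookup described above might look like the following sketch. Semantic matching is replaced here by a simple case-insensitive substring test, which is an assumption; the function names and sample data are also hypothetical.

```python
from collections import defaultdict

def cluster_by_entity(search_words, entity_words):
    """Group search words under every entity word they contain (case-insensitive)."""
    clusters = defaultdict(list)
    for query in search_words:
        lowered = query.lower()
        for entity in entity_words:
            if entity in lowered:
                clusters[entity].append(query)
    return dict(clusters)

def attribute_words_per_cluster(clusters, first_attributes):
    """Look up the first target attribute words for each cluster's entity word."""
    return {entity: sorted(first_attributes.get(entity, ())) for entity in clusters}

search_words = ["Java distributed", "java cache", "Python basics"]
clusters = cluster_by_entity(search_words, ["java", "python"])
segmentation = attribute_words_per_cluster(clusters, {"java": {"inheritance", "distributed"}})
```

An entity word with no entry in the first attribute word set, such as "python" in this sample, simply receives an empty attribute list.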

In some embodiments, the step "determining a second attribute word set corresponding to the target search word set according to the target word segmentation result" may include:

(1.3.a) identifying a search word classification set corresponding to each attribute word in the target word segmentation result;

(1.3.b) obtaining the frequency of the target search words contained in the search word classification set corresponding to each attribute word;

(1.3.c) determining the search word classification set with the frequency greater than or equal to a preset search word frequency threshold value as a target search word classification set;

and (1.3.d) generating a second attribute word set corresponding to the target search word set according to the attribute words corresponding to each target search word classification set.

The preset search word frequency threshold may be a threshold used for screening attribute words, and is specifically used for filtering out the low-frequency attribute words corresponding to the search word classification sets.

In order to filter low-frequency attribute words, after target word segmentation results are obtained, the search word classification set corresponding to each attribute word in the target word segmentation results can be identified, that is, the search word classification set corresponding to each second target attribute word is identified; then, counting the number, namely frequency, of the target search terms contained in each search term classification set; and then, determining target search term classification sets with the frequency greater than or equal to a preset search term frequency threshold, and determining a second attribute term set corresponding to the target search terms according to the attribute terms (second target attribute terms) corresponding to each target search term classification set.
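Steps (1.3.a) to (1.3.d) amount to grouping and counting, as in this minimal sketch. The input shape (a mapping from each target search word to its attribute word) and the names `select_second_attribute_words` and `min_count` are assumptions for illustration.

```python
from collections import defaultdict

def select_second_attribute_words(segmentation, min_count):
    """Keep attribute words whose search word classification set is large enough."""
    groups = defaultdict(list)
    for search_word, attribute_word in segmentation.items():
        # Each attribute word identifies one search word classification set.
        groups[attribute_word].append(search_word)
    # Frequency = number of target search words in the classification set.
    return {attr for attr, words in groups.items() if len(words) >= min_count}

# Hypothetical target word segmentation result.
segmentation = {
    "java distributed": "distributed",
    "java distributed cache": "distributed",
    "java inheritance": "inheritance",
}
second_attribute_word_set = select_second_attribute_words(segmentation, min_count=2)
```

Here "distributed" is backed by two search words and is kept, while "inheritance" is backed by only one and is dropped.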

In some embodiments, the step of "performing attribute update according to the corpus attribute information and the target attribute word corresponding to each target entity word to obtain target attribute information corresponding to each target entity word" may include:

(4.1) carrying out word information matching on the target attribute words in the corpus attribute information to obtain target attribute words matched with the corpus attribute information;

and (4.2) determining the target attribute words matched with the corpus attribute information as attribute words to be updated, and performing regularization processing according to the attribute words to be updated and the corpus attribute information to obtain target attribute information corresponding to each target entity word.

Through the above method, an attribute can be given to each target search word, so that the entity words corresponding to the target search words in the target entity word set also have corresponding attribute words. This makes it convenient to update the attributes of each target entity word in the target entity word set, thereby realizing attribute extraction and expansion of the knowledge entities, and makes it convenient to subsequently use each knowledge entity in constructing the knowledge graph and/or to associate it with other knowledge entities.

104. And searching the knowledge content associated with the target attribute information, and establishing an association relation between the knowledge content and the target entity word set to obtain knowledge entity data with the association relation.

The knowledge content associated with the target attribute information may be target knowledge content related to the target attribute information and/or the corresponding target entity word acquired through other channels. That is, the target knowledge content is not a knowledge content corpus in the local preset corpus database, and the target knowledge content may specifically be a knowledge content corpus in other websites or corresponding databases (second preset corpus database) in the same domain as the knowledge domain/target knowledge type.

Specifically, in order to expand the knowledge entities (target entity words) used for constructing the target knowledge graph, in the embodiment of the application, after target attribute information corresponding to each target entity word is obtained, knowledge content associated with the target attribute information can be searched; the process of searching for the knowledge content associated with the target attribute information may be: determining each target entity word and/or corresponding target attribute information as information to be searched; and searching the target knowledge content associated with the information to be searched.

Further, after finding the associated target knowledge content, the embodiment of the present application determines the knowledge entity data having an association relationship according to the target knowledge content and the target entity word set, for example, establishes an association relationship between the knowledge content and the target entity word set. Wherein, the specific process can be as follows: extracting a plurality of entity words to be determined from the target knowledge content, and determining attribute words to be determined corresponding to each entity word to be determined according to a target preset corpus (structured corpus); determining entity words to be associated based on each entity word to be determined and the corresponding attribute word to be determined, for example, filtering the entity words to be determined and the attribute words to be determined according to the frequency statistical manner to determine the entity words to be associated, it can be understood that an entity word set to be associated may include one or more entity words to be associated; and further, establishing an incidence relation between the entity word set to be associated and the target entity word set to obtain knowledge entity data with the incidence relation.
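One way to picture the association step is the sketch below. It assumes that the entity words to be determined are drawn from a known vocabulary and filtered by occurrence count; the relation label "related_to", the function name `associate`, and the sample content are hypothetical and not taken from the embodiment.

```python
from collections import Counter

def associate(target_entity, contents, known_entities, min_freq):
    """Return (target, relation, entity) triples for frequent co-occurring entity words."""
    counts = Counter()
    for text in contents:
        lowered = text.lower()
        for candidate in known_entities:
            if candidate != target_entity:
                counts[candidate] += lowered.count(candidate)
    # Entity words reaching the threshold become the entity words to be associated.
    return [(target_entity, "related_to", candidate)
            for candidate, n in sorted(counts.items()) if n >= min_freq]

# Hypothetical knowledge content found for the target entity word "java".
contents = [
    "Spring is a Java framework. Spring Boot builds on Spring.",
    "Maven builds Java projects.",
]
triples = associate("java", contents, {"java", "spring", "maven"}, min_freq=2)
```

With this content, "spring" appears three times and is associated with "java", while "maven" appears once and is filtered out.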

Through the method, the number of the knowledge entities used for constructing the knowledge graph can be expanded, so that the coverage range of the knowledge entities related to the subsequent knowledge graph is further improved, and the usability of the subsequently constructed knowledge graph is improved.

105. And establishing a knowledge graph corresponding to the target knowledge type according to the knowledge entity data with the incidence relation.

Wherein, the Knowledge Graph (Knowledge Graph) belongs to a semantic network, and describes objective things in the form of a data structure Graph, and the data structure Graph consists of nodes and edges; belongs to a method for describing knowledge entities and entity relations through visualization or structuring, and is used for expressing the incidence relation or incidence between different knowledge entities in the same field (or knowledge type).

Specifically, in order to obtain the knowledge graph corresponding to the target knowledge type, in the embodiment of the application, after the knowledge entity data having the association relationship is obtained, the knowledge graph corresponding to the target knowledge type may be established according to that data, so that the knowledge graph can subsequently be used for services such as knowledge mining and analysis.
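Since the knowledge graph is described as a data structure graph of nodes and edges, the final step can be sketched as building an adjacency structure from associated entity data. Representing that data as (head, relation, tail) triples is an assumption, as are the relation labels and the function name `build_graph`.

```python
def build_graph(triples):
    """Build a node set and an adjacency mapping from (head, relation, tail) triples."""
    nodes, edges = set(), {}
    for head, relation, tail in triples:
        nodes.update((head, tail))
        # Each edge is stored on its head node as a (relation, tail) pair.
        edges.setdefault(head, []).append((relation, tail))
    return nodes, edges

# Hypothetical knowledge entity data with association relations.
triples = [
    ("java", "has_attribute", "distributed"),
    ("java", "related_to", "spring"),
]
nodes, edges = build_graph(triples)
```

A production system would more likely store the triples in a graph database; the in-memory dictionary here only shows the node and edge structure.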

As can be seen from the above, in the embodiment of the present application, the target preset corpus corresponding to the target knowledge type may be obtained from the preset corpus database, and the target search word set corresponding to the target knowledge type may be obtained from the historical search record; word information is extracted from the target preset corpus and the target search word set to obtain a target entity word set; the attributes of the target entity words in the target entity word set are updated to obtain target attribute information corresponding to the target entity words; knowledge content associated with the target attribute information is searched, and an association relation between the knowledge content and the target entity word set is established to obtain knowledge entity data with the association relation; and a knowledge graph corresponding to the target knowledge type is established according to the knowledge entity data with the association relation. Therefore, the method and the device can determine the target entity word set according to the target preset corpus and the target search word set corresponding to the historical search records, expand the number of knowledge entity words by combining the corpus in the database with the historical search records, further establish the association relation between the knowledge content associated with the target attribute information and the target entity word set after acquiring the target attribute information corresponding to the knowledge entity words, and establish the knowledge graph corresponding to the target knowledge type based on the associated knowledge entity data, thereby constructing the knowledge graph from corpus entity words of various channels, improving the coverage of knowledge entities in the knowledge graph and the amount of knowledge information it contains, and further improving the usability of the knowledge graph.

The method described in the above examples is further illustrated in detail below by way of example.

The embodiment of the present application takes data processing as an example, and further describes the data processing method provided in the embodiment of the present application.

Fig. 3 is a schematic flowchart of another step of the data processing method provided in the embodiment of the present application, and fig. 4 is a schematic view of a scenario of the data processing method provided in the embodiment of the present application; fig. 5 is a scene schematic diagram of target entity word processing in the data processing method provided in the embodiment of the present application; fig. 6 is a scene schematic diagram of target attribute information expansion in the data processing method according to the embodiment of the present application. For ease of understanding, please refer to fig. 3, 4, 5 and 6 together to describe the embodiments of the present application.

In the embodiments of the present application, description will be made from the perspective of a data processing apparatus, which may be specifically integrated in a computer device such as a terminal and/or a server. For example, taking the example of being integrated in a server, when a processor on the server executes a program corresponding to the data processing method, a specific flow of the data processing method is as follows:

201. and acquiring a target preset corpus corresponding to the target knowledge type from a preset corpus database, and acquiring a target search word set corresponding to the target knowledge type from a historical search record.

The preset corpus database may be a database containing a large amount of knowledge content data, which may contain knowledge content corpora of a plurality of knowledge types (knowledge fields), such as knowledge content corpora of education fields, knowledge content corpora of sports fields, knowledge content corpora of travel fields, and the like; in addition, the preset corpus database may also be a database containing knowledge content corpuses of a certain knowledge type (knowledge domain), such as knowledge content corpuses of education domain, knowledge content corpuses of sports domain, knowledge content corpuses of tourism domain, or knowledge content corpuses of other domains, which is not limited herein. It should be noted that the preset corpus database includes corpus of knowledge content existing in one or more fields or knowledge types.

For example, taking the education field as the target knowledge type, in order to establish a knowledge graph of the education field, the education knowledge corpus may be obtained from a corpus database corresponding to an education vertical website, where the corpus takes two forms: structured corpus and unstructured corpus. It should be noted that the target preset corpus obtained in this embodiment is a structured corpus. Search information, such as search words or keywords, used by the target object to search for education knowledge content in the historical time period is acquired from the historical search records of the online education vertical website, so that knowledge entity words used for constructing the knowledge graph of the education field are preliminarily determined based on the education knowledge corpus and the search words.

202. And carrying out word information extraction on the target preset linguistic data and the target search word set to obtain a target entity word set.

In order to obtain a knowledge entity for constructing a knowledge graph, in the embodiment of the application, after obtaining a knowledge content corpus corresponding to a target knowledge type and a target search word set of a target object, word information is extracted based on the knowledge content corpus and the target search word set, so that the extracted word information is used as a knowledge entity for constructing the knowledge graph, namely the target entity word.

Specifically, after the target preset corpus and the target search word set are obtained, the word information structure tree may be constructed based on the target preset corpus and the target search word set. And then, extracting word information according to the character data contained in the word information structure tree to obtain a plurality of word information, and generating an entity word candidate set corresponding to the plurality of word information. And finally, filtering word information in the entity word candidate set to obtain a target entity word set, wherein the filtering process comprises the following steps: extracting each word information (candidate entity word) contained in the entity word candidate set; determining the frequency of occurrence of each candidate entity word (word information) in the target preset corpus and the target search word set, namely determining the word frequency corresponding to each word information; and comparing the word frequency corresponding to each word information with a preset word frequency threshold value, determining the word information with the word frequency smaller than the preset word frequency threshold value as the word information to be filtered, and filtering/deleting the word information to be filtered from the entity word candidate set to obtain a filtered target entity word set.

203. And extracting a target attribute word corresponding to each target entity word in the target entity word set.

It should be noted that the target search term is obtained from a history search record of the target object in a history time period, and the target search term itself does not have a corresponding attribute. In order to subsequently perform attribute expansion and update on the entity words corresponding to the target search words, the embodiment of the application needs to give the attribute words corresponding to each target search word first.

In order to assign an attribute word corresponding to each target search word, firstly, clustering/classifying each target search word in a target search word set based on the target entity word in the target entity word set, specifically, based on the target entity word corresponding to each target search word in the target entity word set, so as to obtain a search word classification set corresponding to each target entity word; for example, if the second target entity word is used as the target entity word of the target entity word set corresponding to the target search word, the target search words in the target search word set are clustered based on each second target entity word in the target entity word set, so as to obtain a search word classification set corresponding to each second target entity word. Then, based on the first target entity word set and the first attribute word set, for example, as a dictionary, performing word segmentation processing on the target search words in each search word classification set to obtain a word segmentation result corresponding to each search word classification set, where the word segmentation result may include an attribute word corresponding to a corresponding target search word in the search word classification set, that is, a second target attribute word corresponding to the search word classification set. Finally, the word segmentation results corresponding to each search word classification set are combined to obtain target word segmentation results corresponding to the target search word set, in addition, a corresponding target word segmentation result list can be generated according to the word segmentation results corresponding to each search word classification set, and the target word segmentation result list can comprise the mapping relation between each second target entity word and the target search word as well as the second target attribute word (attribute word).

204. Acquiring a second target corpus in the preset corpus database.

The second target corpus is a corpus in the preset corpus database except the target preset corpus.

The second target corpus may be an unstructured corpus and/or a semi-structured corpus corresponding to the target knowledge type in the preset corpus database.

205. Extracting, from the second target corpus, corpus attribute information associated with each target entity word and the corresponding target attribute word.

Specifically, the step of extracting corpus attribute information associated with each target entity word and the corresponding target attribute word from the second target corpus may include:

labeling the second target corpus based on each target entity word and the corresponding target attribute word to obtain a labeled second target corpus; and extracting the labeled information from the labeled second target corpus to obtain corpus attribute information.

For example, taking Java as the target entity word, the corpus attribute information of Java in the second target corpus (unstructured corpus/semi-structured corpus) is described as "Java distributed application technology architecture includes an initial stage, application service and data service separation, performance improvement using cache, application server cluster, database read-write separation, and the like".
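By way of non-limiting illustration, the extraction of corpus attribute information for a labeled entity word and attribute word can be sketched with a regular expression. The function name `extract_extended_attribute` and the pattern itself (an entity word followed by an attribute word, then "includes" and a comma-separated enumeration) are assumptions made for this example; the embodiment only states that labeled information is extracted and does not fix a pattern.

```python
import re


def extract_extended_attribute(sentence, entity, attribute):
    """Return the enumerated items following '<entity> <attribute> ... includes'."""
    pattern = re.compile(
        rf"{re.escape(entity)}\s+{re.escape(attribute)}\b.*?\bincludes?\b\s*(?P<items>.+)",
        re.IGNORECASE,
    )
    match = pattern.search(sentence)
    if match is None:
        return []
    # Drop a trailing "and the like" / "etc." before splitting the enumeration.
    items = re.sub(r",?\s*(and the like|etc)\.?\s*$", "", match.group("items").strip())
    parts = re.split(r",\s*(?:and\s+)?", items)
    return [p.strip() for p in parts if p.strip()]


sentence = (
    "Java distributed application technology architecture includes an initial stage, "
    "application service and data service separation, performance improvement using cache, "
    "application server cluster, database read-write separation, and the like"
)
items = extract_extended_attribute(sentence, "Java", "distributed")
```

Here `items` holds the five enumerated phases of the sentence, which can then be attached to the entity word "Java" under the attribute word "distributed".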

206. Updating the attributes according to the corpus attribute information and the target attribute word corresponding to each target entity word, to obtain the target attribute information corresponding to each target entity word.

207. Searching for the knowledge content associated with the target attribute information, and establishing an association relation between the knowledge content and the target entity word set to obtain knowledge entity data with the association relation.

Specifically, the process of searching for the knowledge content associated with the target attribute information may be: determining each target entity word and/or the corresponding target attribute information as information to be searched; and searching for the target knowledge content associated with the information to be searched. Further, knowledge entity data with an association relation is determined according to the target knowledge content and the target entity word set, that is, the association relation between the knowledge content and the target entity word set is established. The specific process may be as follows: extracting a plurality of entity words to be determined from the target knowledge content, and determining the attribute word to be determined corresponding to each entity word to be determined according to the target preset corpus (structured corpus); determining the entity words to be associated based on each entity word to be determined and the corresponding attribute word to be determined, for example, by filtering the entity words to be determined and the attribute words to be determined by frequency statistics, where it can be understood that the entity word set to be associated may include one or more entity words to be associated; and finally, establishing an association relation between the entity word set to be associated and the target entity word set to obtain the knowledge entity data with the association relation.
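By way of non-limiting illustration, one possible frequency statistic for deciding which entity words to associate is co-occurrence counting over the retrieved knowledge content. The function name `associate_by_cooccurrence`, the `min_count` threshold, and the use of case-insensitive substring matching are assumptions made for this sketch; substring matching is deliberately naive (for instance, "java" also matches inside "javascript").

```python
from collections import Counter
from itertools import product


def associate_by_cooccurrence(documents, target_entities, candidate_entities, min_count=2):
    """Count how often a target entity and a candidate entity appear in the same document."""
    counts = Counter()
    for doc in documents:
        text = doc.lower()
        present_targets = [t for t in target_entities if t.lower() in text]
        present_candidates = [c for c in candidate_entities if c.lower() in text]
        for target, candidate in product(present_targets, present_candidates):
            if target.lower() != candidate.lower():
                counts[(target, candidate)] += 1
    # Keep only the pairs whose co-occurrence count reaches the threshold.
    return {pair: n for pair, n in counts.items() if n >= min_count}


docs = [
    "Java and Spring Boot course",
    "Spring Boot for Java developers",
    "Java basics",
    "Python and Django",
]
pairs = associate_by_cooccurrence(docs, ["Java", "Python"], ["Spring Boot", "Django"])
```

With the sample data, only the pair ("Java", "Spring Boot") reaches the threshold of two co-occurrences, so it is the only association kept.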

208. Establishing a knowledge graph corresponding to the target knowledge type according to the knowledge entity data with the association relation.
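By way of non-limiting illustration, the knowledge entity data with the association relation can be assembled into a simple in-memory graph. The adjacency-list representation, the function name `build_knowledge_graph`, and the edge labels `has_attribute` and `related_to` are hypothetical; the embodiment does not specify a storage format for the knowledge graph.

```python
from collections import defaultdict


def build_knowledge_graph(entity_attributes, associations):
    """Build an adjacency list of (relation, node) edges from attributes and associations."""
    graph = defaultdict(list)
    for entity, attributes in entity_attributes.items():
        for attribute in attributes:
            graph[entity].append(("has_attribute", attribute))
    for source, target in associations:
        # Associations are stored in both directions so either entity can be the entry point.
        graph[source].append(("related_to", target))
        graph[target].append(("related_to", source))
    return dict(graph)


graph = build_knowledge_graph(
    {"Java": ["inheritance", "distributed"]},
    [("Java", "Spring Boot")],
)
```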

By executing the above steps 201 to 208, the scenarios shown in fig. 4, fig. 5 and fig. 6 can be realized. Taking the construction of a knowledge graph in the education field as an example, the specific process is as follows:

(1) Corpus collection: an education knowledge corpus is obtained from an education vertical website, the education knowledge corpus comprising a structured knowledge corpus and an unstructured corpus; in addition, user search terms and educational content are obtained from an online education website.

(2) Structured extraction of education knowledge entities: a Trie (dictionary tree) is constructed by combining the structured corpus of the vertical website with the user search terms of the online education website, and an education knowledge entity candidate set and corresponding attribute information are determined; taking the entity words in the entity candidate set as core words, frequency distribution statistics are computed over the structured corpus and the user search words, and the entities are screened against a set threshold to serve as the finally determined knowledge entities, namely the entity words. The entity word extraction scenario is shown in fig. 5, which is a schematic view of the target entity word processing scenario.

(3) Extraction and expansion of knowledge entity attributes: a) K-means clustering is performed on the user search words based on the entity words obtained in step (2), so that similar search words corresponding to a knowledge entity are clustered into one class, that is, one knowledge entity corresponds to one class; because the structured corpus contains complete attribute information descriptions, the entities and the corresponding attribute words in the structured corpus are used as a dictionary to segment the user search words, the word segmentation result list within the class of each knowledge entity is used as an attribute candidate set, and low-frequency attribute candidate words are filtered out according to frequency distribution statistics; b) the unstructured and semi-structured corpora obtained from encyclopedia websites and education vertical websites are back-labeled with the obtained entity words and candidate attribute words, and extended attributes are obtained through text regularization and pattern matching. For example, for the entity word "Java", the attribute list of knowledge points obtained from the structured corpus and the user search words is <inheritance, object-oriented, distributed>, and the unstructured and semi-structured corpora contain the expression "Java distributed application technology architecture includes an initial stage, application service and data service separation, performance improvement using cache, application server clustering, database read-write separation, etc.". Through text regularization and pattern matching, the extended attribute of the entity word Java can be obtained: Java distributed architecture (initial phase, application service and data service separation, performance improvement using caching, application server clustering, database read-write separation).
The process of extracting and expanding the knowledge entity attributes is specifically shown in fig. 6.

(4) Knowledge association: the content titles and content directories of the online education website are segmented to obtain a set of entities and candidate knowledge attributes; the entity and attribute knowledge is submitted as queries to a search engine; association and co-occurrence statistics are computed over the entity and attribute knowledge in the search results; and knowledge association is performed based on a co-occurrence threshold.

(5) Dynamic supplementation of knowledge entity attributes: the user search words and content data of the online education website are back-labeled based on the knowledge entities obtained in step (3), and the popularity of the knowledge entities and the user behavior feedback in content search and recommendation scenarios are counted to form a knowledge heat attribute and a knowledge association attribute. The knowledge association attribute refers to a behavioral association between different knowledge entities; for example, if most users go on to learn Spring Boot courses after learning Java, a front-back dependency relation between the Java entity and the Spring Boot entity can be obtained.

(6) The construction of the knowledge graph in the education field is completed through the above steps, and step (4) and step (5) are repeated periodically to update the knowledge graph.
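By way of non-limiting illustration, the front-back dependency relation mentioned in step (5) can be estimated from ordered learning histories. The function name `learning_dependencies`, the input format (one ordered list of learned entities per user), and the thresholds `min_support` and `min_ratio` are assumptions made for this sketch; the embodiment does not specify how the behavior statistics are computed.

```python
from collections import Counter


def learning_dependencies(histories, min_support=2, min_ratio=0.6):
    """Return (earlier, later) pairs where 'earlier' is usually learned before 'later'."""
    before = Counter()
    for history in histories:
        seen = []
        for entity in history:
            for earlier in seen:
                if earlier != entity:
                    before[(earlier, entity)] += 1
            if entity not in seen:
                seen.append(entity)
    edges = []
    for (first, second), n in before.items():
        # Compare with the number of times the opposite order was observed.
        total = n + before.get((second, first), 0)
        if n >= min_support and n / total >= min_ratio:
            edges.append((first, second))
    return sorted(edges)


histories = [
    ["java", "springboot"],
    ["java", "springboot"],
    ["springboot", "java"],
    ["java", "mysql"],
]
edges = learning_dependencies(histories)
```

With the sample histories, "java" precedes "springboot" in two of three observations, so that pair is kept; the single "java" then "mysql" observation falls below the support threshold.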

The above flow achieves the following beneficial effects: the knowledge graph of the education professional domain is constructed by using the education content and user behavior data accumulated by the online education website, combined with the corpus information of open websites, which effectively addresses the graph construction challenges caused by the small coverage, high specialization and low standardization of professional domain knowledge. The scheme has a low implementation cost and does not require a large amount of manual annotation work. The constructed education knowledge graph enriches the knowledge base of the education field, can effectively improve the user experience and the operating efficiency of the platform in digital and online education, and has high practical value and benefit.

In order to better implement the above method, the present application further provides a data processing apparatus, which may be integrated in a network device, such as a server or a terminal, and the terminal may include a tablet computer, a notebook computer, and/or a personal computer.

For example, as shown in fig. 7, the data processing apparatus may include an obtaining unit 701, a refining unit 702, an updating unit 703, a searching unit 704, and an establishing unit 705.

The obtaining unit 701 is configured to obtain a target preset corpus corresponding to a target knowledge type from a preset corpus database, and obtain a target search word set corresponding to the target knowledge type from a historical search record;

the refining unit 702 is configured to extract word information from the target preset corpus and the target search word set to obtain a target entity word set;

the updating unit 703 is configured to perform attribute updating on the target entity words in the target entity word set to obtain target attribute information corresponding to the target entity words;

the searching unit 704 is configured to search for the knowledge content associated with the target attribute information, and establish an association relation between the knowledge content and the target entity word set to obtain knowledge entity data with the association relation;

the establishing unit 705 is configured to establish a knowledge graph corresponding to the target knowledge type according to the knowledge entity data with the association relation.

In some embodiments, the refining unit 702 is further specifically configured to:

establishing a word information structure tree according to the target preset corpus and the target search word set; extracting word information contained in the word information structure tree to obtain an entity word candidate set; and filtering the word information in the entity word candidate set to obtain a target entity word set.
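By way of non-limiting illustration, a word information structure tree can be realized as a dictionary tree (Trie). The class below and its method names are hypothetical, and scanning each search word for stored corpus entries is only one possible way of reading candidates out of the tree; the embodiment does not specify how the candidate set is derived from the tree.

```python
class Trie:
    """Minimal dictionary tree built from nested dicts."""

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = True  # multi-character sentinel key marking a complete word

    def find_all(self, text):
        """Return every inserted word occurring in text, scanning from each start position."""
        found = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.get(text[end])
                if node is None:
                    break
                if "$end" in node:
                    found.append(text[start:end + 1])
        return found


trie = Trie()
for entry in ["java", "javascript", "python"]:  # entries assumed to come from the structured corpus
    trie.insert(entry)
candidates = trie.find_all("learn javascript and python")
```

Because the scan continues past a complete word, both "java" and the longer "javascript" are reported as candidates; the subsequent filtering step decides which of them to keep.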

In some embodiments, the refining unit 702 is further configured to:

extracting each word information in the entity word candidate set; determining word frequency corresponding to each word information based on the target preset corpus and the target search word set; determining word information with the word frequency less than a preset word frequency threshold as word information to be filtered, and filtering the word information to be filtered in the entity word candidate set to obtain a filtered target entity word set.
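By way of non-limiting illustration, the word frequency filtering described above can be sketched as follows. The function name `filter_by_frequency`, the case-insensitive substring counting, and the sample threshold are assumptions made for this example.

```python
from collections import Counter


def filter_by_frequency(candidates, texts, min_freq):
    """Keep candidate words whose total occurrence count across texts reaches min_freq."""
    freq = Counter()
    for text in texts:
        lowered = text.lower()
        for word in candidates:
            freq[word] += lowered.count(word.lower())
    # Words with a frequency below the threshold are the ones filtered out.
    return [word for word in candidates if freq[word] >= min_freq]


texts = ["Java basics", "advanced java", "cobol history"]  # corpus entries plus search words
kept = filter_by_frequency(["java", "cobol"], texts, min_freq=2)
```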

In some embodiments, the updating unit 703 is further configured to:

extracting a target attribute word corresponding to each target entity word in the target entity word set; acquiring a second target corpus in the preset corpus database, wherein the second target corpus is a corpus in the preset corpus database except the target preset corpus; corpus attribute information associated with each target entity word and the corresponding target attribute word is extracted from the second target corpus; and updating the attribute according to the corpus attribute information and the target attribute words corresponding to each target entity word to obtain the target attribute information corresponding to each target entity word.

In some embodiments, the updating unit 703 is further configured to:

extracting a plurality of first target entity words corresponding to a target preset corpus in a target entity word set to obtain a first target entity word set; acquiring a first target attribute word corresponding to each first target entity word based on the target preset corpus to obtain a first attribute word set; performing word segmentation processing on the target search word set according to the first target entity word set and the first attribute word set to obtain a target word segmentation result, and determining a second attribute word set corresponding to the target search word set according to the target word segmentation result; and fusing the first attribute word set and the second attribute word set to obtain a fused attribute word set, wherein the fused attribute word set comprises target attribute words corresponding to each target entity word.

In some embodiments, the updating unit 703 is further configured to:

clustering target search words in the target search word set based on each target entity word in the target entity word set to obtain a search word classification set corresponding to each target entity word; performing word segmentation processing on the target search words in each search word classification set according to the first target entity word set and the first attribute word set to obtain a word segmentation result corresponding to each search word classification set; and determining a target word segmentation result corresponding to the target search word set according to the word segmentation result corresponding to each search word classification set.

In some embodiments, the updating unit 703 is further configured to:

identifying a search word classification set corresponding to each attribute word in the target word segmentation result; acquiring the frequency of target search words contained in the search word classification set corresponding to each attribute word; determining a search word classification set with the frequency greater than or equal to a preset search word frequency threshold value as a target search word classification set; and generating a second attribute word set corresponding to the target search word set according to the attribute words corresponding to each target search word classification set.

In some embodiments, the updating unit 703 is further configured to:

performing word information matching on the target attribute words in the corpus attribute information to obtain target attribute words matched with the corpus attribute information; and determining the target attribute words matched with the corpus attribute information as attribute words to be updated, and performing regularization processing according to the attribute words to be updated and the corpus attribute information to obtain target attribute information corresponding to each target entity word.

In some embodiments, the searching unit 704 is further configured to: determining each target entity word and/or corresponding target attribute information as information to be searched; and searching for the target knowledge content associated with the information to be searched.

In some embodiments, the searching unit 704 is further configured to: extracting a plurality of entity words to be determined from the target knowledge content, and determining the attribute word to be determined corresponding to each entity word to be determined according to the target preset corpus (structured corpus); determining entity words to be associated based on each entity word to be determined and the corresponding attribute word to be determined, and establishing an entity word set to be associated corresponding to the entity words to be associated; and establishing an association relation between the entity word set to be associated and the target entity word set to obtain knowledge entity data with the association relation.

As can be seen from the above, in the embodiment of the application, the obtaining unit 701 may obtain the target preset corpus corresponding to the target knowledge type from the preset corpus database, and obtain the target search word set corresponding to the target knowledge type from the historical search record; the refining unit 702 extracts word information from the target preset corpus and the target search word set to obtain a target entity word set; the updating unit 703 updates the attributes of the target entity words in the target entity word set to obtain target attribute information corresponding to the target entity words; the searching unit 704 searches for the knowledge content associated with the target attribute information, and establishes an association relation between the knowledge content and the target entity word set to obtain knowledge entity data with the association relation; and the establishing unit 705 establishes the knowledge graph corresponding to the target knowledge type according to the knowledge entity data with the association relation.
Therefore, the embodiment of the application can determine the target entity word set according to the target preset corpus and the target search word set corresponding to the historical search records, expanding the number of knowledge entity words by combining the corpus in the database with the historical search records. After the target attribute information corresponding to the knowledge entity words is acquired, the association relation between the knowledge content associated with the target attribute information and the target entity word set is established, and the knowledge graph corresponding to the target knowledge type is established based on the associated knowledge entity data. The knowledge graph is thereby constructed from corpus entity words of multiple channels, which improves the coverage of knowledge entities in the knowledge graph and the amount of knowledge information it contains, and further improves the usability of the knowledge graph.

The embodiment of the present application further provides a computer device, as shown in fig. 8, which shows a schematic structural diagram of the computer device according to the embodiment of the present application, and specifically:

the computer device may include components such as a processor 801 of one or more processing cores, memory 802 of one or more computer-readable storage media, a power supply 803, and an input unit 804. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 8 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. Wherein:

the processor 801 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 802 and calling data stored in the memory 802, thereby monitoring the computer device as a whole. Optionally, the processor 801 may include one or more processing cores; preferably, the processor 801 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may also not be integrated into the processor 801.

The memory 802 may be used to store software programs and modules, and the processor 801 executes various functional applications and data processing by running the software programs and modules stored in the memory 802. The memory 802 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to use of the computer device, and the like. Further, the memory 802 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 802 may also include a memory controller to provide the processor 801 access to the memory 802.

The computer device further includes a power supply 803 for supplying power to the various components, and preferably, the power supply 803 is logically connected to the processor 801 via a power management system, so that functions such as managing charging, discharging, and power consumption are performed via the power management system. The power supply 803 may also include one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and any like components.

The computer device may further include an input unit 804, the input unit 804 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment of the present application, the processor 801 in the computer device loads the executable file corresponding to the process of one or more application programs into the memory 802 according to the following instructions, and the processor 801 runs the application programs stored in the memory 802, thereby implementing various functions as follows:

acquiring a target preset corpus corresponding to a target knowledge type from a preset corpus database, and acquiring a target search word set corresponding to the target knowledge type from a historical search record; extracting word information from the target preset corpus and the target search word set to obtain a target entity word set; updating the attributes of the target entity words in the target entity word set to obtain target attribute information corresponding to the target entity words; searching for knowledge content associated with the target attribute information, and establishing an association relation between the knowledge content and the target entity word set to obtain knowledge entity data with the association relation; and establishing a knowledge graph corresponding to the target knowledge type according to the knowledge entity data with the association relation.

The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.

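To this end, embodiments of the present application provide a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any data processing method provided by the embodiments of the present application. For example, the instructions may perform the steps of:

acquiring a target preset corpus corresponding to a target knowledge type from a preset corpus database, and acquiring a target search word set corresponding to the target knowledge type from a historical search record; extracting word information from the target preset corpus and the target search word set to obtain a target entity word set; updating the attributes of the target entity words in the target entity word set to obtain target attribute information corresponding to the target entity words; searching for knowledge content associated with the target attribute information, and establishing an association relation between the knowledge content and the target entity word set to obtain knowledge entity data with the association relation; and establishing a knowledge graph corresponding to the target knowledge type according to the knowledge entity data with the association relation.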

The computer-readable storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.

Since the instructions stored in the computer-readable storage medium can execute the steps in any data processing method provided in the embodiments of the present application, the beneficial effects that can be achieved by any data processing method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described again here.

The foregoing detailed description has provided a data processing method, apparatus, and computer-readable storage medium according to embodiments of the present application, and specific examples are used herein to explain the principles and implementations of the present application, and the above descriptions of the embodiments are only used to help understand the method and its core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A data processing method, comprising:

searching knowledge content associated with the target attribute information, and establishing an association relation between the knowledge content and the target entity word set to obtain knowledge entity data with the association relation;

and establishing a knowledge graph corresponding to the target knowledge type according to the knowledge entity data with the association relation.

2. The method according to claim 1, wherein the extracting word information from the target preset corpus and the target search word set to obtain a target entity word set comprises:

3. The method of claim 2, wherein the filtering the word information in the entity word candidate set to obtain a target entity word set comprises:

extracting each word information in the entity word candidate set;

4. The method according to claim 1, wherein the attribute updating of the target entity words in the target entity word set to obtain target attribute information corresponding to the target entity words includes:

5. The method of claim 4, wherein the extracting the target attribute word corresponding to each target entity word in the set of target entity words comprises:

6. The method of claim 5, wherein the performing a word segmentation process on the target search word set according to the first target entity word set and the first attribute word set to obtain the target word segmentation result comprises:

7. The method according to claim 5, wherein the determining a second attribute word set corresponding to the target search word set according to the target word segmentation result comprises:

8. The method according to claim 4, wherein the performing attribute update according to the corpus attribute information and the target attribute word corresponding to each target entity word to obtain target attribute information corresponding to each target entity word comprises:

9. A data processing apparatus, comprising:

the searching unit is used for searching the knowledge content associated with the target attribute information and establishing an association relation between the knowledge content and the target entity word set to obtain knowledge entity data with the association relation;

and the establishing unit is used for establishing the knowledge graph corresponding to the target knowledge type according to the knowledge entity data with the association relation.

10. A computer-readable storage medium, wherein the computer-readable storage medium stores a plurality of instructions, the instructions being adapted to be loaded by a processor to perform the steps of the data processing method according to any one of claims 1 to 8.