CN110321437B

CN110321437B - Corpus data processing method and device, electronic equipment and medium

Info

Publication number: CN110321437B
Application number: CN201910445315.XA
Authority: CN
Inventors: 周辉阳
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-05-27
Filing date: 2019-05-27
Publication date: 2024-03-15
Anticipated expiration: 2039-05-27
Also published as: CN110321437A

Abstract

The invention discloses a corpus data processing method, a corpus data processing device, electronic equipment and a medium. The method comprises the following steps: obtaining corpus data to be processed in the target field; acquiring a target entity and a target predicate in the corpus data to be processed according to a knowledge graph corresponding to the target field, and generating a corresponding entity mapping relation to be matched; filtering the corpus data to be processed based on the matching degree between the entity mapping relation to be matched and the target entity mapping relation to obtain target corpus data with target intention; wherein the target entity mapping relationship is set based on the knowledge graph. The efficiency of obtaining the target corpus data can be effectively improved. The target corpus data obtained in this way has target intention, and then the quality of a model trained by taking the target corpus data as input can be improved.

Description

Corpus data processing method and device, electronic equipment and medium

Technical Field

The present invention relates to the field of internet communications technologies, and in particular, to a corpus data processing method, device, electronic equipment, and medium.

Background

The intelligent question-answering system is a new type of information service system that analyzes the intention behind user input in order to answer the user's questions. At present, intelligent question-answering systems are widely used in intelligent customer service, smart home appliances and other scenarios, and are popular with a large number of users. An intelligent question-answering system gives effective answers to query corpus data that users input from different fields (such as the medical, educational and legal fields).

In the prior art, target corpus data is often selected manually from the corpus data to be processed, and a model for the corresponding field of the intelligent question-answering system is built on that target corpus data. However, the labor cost is high and the processing efficiency is low; moreover, the target corpus data obtained for the corresponding field is poorly targeted and noisy, which in turn degrades the quality of the resulting intelligent question-answering system.

Disclosure of Invention

In order to solve the problems of low processing efficiency, poor processing effect and the like in the processing of corpus data in the prior art, the invention provides a corpus data processing method, a corpus data processing device, electronic equipment and a medium:

In one aspect, the present invention provides a corpus data processing method, the method including:

obtaining corpus data to be processed in the target field;

acquiring a target entity and a target predicate in the corpus data to be processed according to a knowledge graph corresponding to the target field, and generating a corresponding entity mapping relation to be matched;

filtering the corpus data to be processed based on the matching degree between the entity mapping relation to be matched and the target entity mapping relation to obtain target corpus data with target intention;

wherein the target entity mapping relationship is set based on the knowledge graph.

Another aspect provides a corpus data processing apparatus, the apparatus comprising:

an acquisition module, configured to obtain corpus data to be processed in a target field;

a generation module, configured to obtain a target entity and a target predicate in the corpus data to be processed according to a knowledge graph corresponding to the target field, and to generate a corresponding entity mapping relation to be matched;

and a filtering module, configured to filter the corpus data to be processed by using a bloom filter, based on the matching degree between the entity mapping relation to be matched and the target entity mapping relation, to obtain target corpus data with target intention.

In another aspect, an electronic device is provided, where the electronic device includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored, where the at least one instruction, the at least one program, the set of codes, or the set of instructions are loaded and executed by the processor to implement a corpus data processing method as described above.

Another aspect provides a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement a corpus data processing method as described above.

The corpus data processing method, the corpus data processing device, the electronic equipment and the medium provided by the invention have the following technical effects:

based on a knowledge graph in the target field, the invention utilizes the relationship between the entity and the predicate in the SPO (Subject Predicate Object) triplet to process the corpus data to be processed to obtain the target corpus data with target intention. The efficiency of obtaining the target corpus data can be effectively improved. The target corpus data obtained in this way has target intention, and then the quality of a model trained by taking the target corpus data as input can be improved.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a corpus data processing method according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of acquiring a target entity and a target predicate in the corpus data to be processed according to the knowledge graph corresponding to the target field, and generating the corresponding entity mapping relation to be matched, according to an embodiment of the present invention;

FIG. 4 is a schematic flow chart of acquiring, according to the knowledge graph, the target entity category to which the target entity in the corpus data to be processed belongs and the target predicate, and generating the corresponding entity category mapping relation to be matched, according to an embodiment of the present invention;

FIG. 5 is a schematic flow chart of filtering the corpus data to be processed to obtain target corpus data based on the matching degree between the mapping relationship of the entity to be matched and the mapping relationship of the target entity according to the embodiment of the invention;

FIG. 6 is a schematic flow chart of a corpus data processing method according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an application scenario of an intent recognition model provided by an embodiment of the present invention;

FIG. 8 is a schematic diagram of an application scenario for inputting query corpus data according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of an application scenario for inputting query corpus data according to an embodiment of the present invention;

FIG. 10 is a block diagram of a corpus data processing apparatus according to an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a server according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It is noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present invention and in the foregoing figures, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server comprising a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.

Referring to FIG. 1, which is a schematic diagram of an application environment provided by an embodiment of the present invention, an intelligent question-answering system may include a client 01 and a server 02, where the client and the server are connected through a network. The user sends query corpus data to the server through the client, and the server processes the received query corpus data to identify the intention of the user and thereby obtain corresponding response corpus data. It should be noted that FIG. 1 is only an example.

Specifically, the client 01 may include physical devices such as a smart phone, a desktop computer, a tablet computer, a notebook computer, a digital assistant or an intelligent wearable device, and may also include software running on such devices, for example web pages or applications that service providers offer to users.

In particular, in the embodiment of the present disclosure, the server 02 may include a server that operates independently, or a distributed server, or a server cluster that is formed by a plurality of servers. The server 02 may include a network communication unit, a processor, a memory, and the like. Specifically, the server 02 may provide a background service for the client.

In practical application, intention recognition makes it possible to determine, from the intention type, the domain to which the query corpus data belongs, such as the person, plant or animal domain. Especially for similar fields (such as novels, cartoons, movies and videos), intention recognition on the query corpus data can distinguish the fields effectively, so that more accurate response corpus data can be obtained.

A specific embodiment of the corpus data processing method of the present invention is described below. FIG. 2 is a schematic flow chart of a corpus data processing method according to an embodiment of the present invention. This specification provides the method operation steps as described in the embodiments or the flow chart, but more or fewer operation steps may be included on the basis of conventional or non-inventive labor. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only one. When implemented in an actual system or server product, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel-processor or multithreaded environment). As shown in FIG. 2, the method may include:

S201: obtaining corpus data to be processed in the target field;

In the embodiment of the present invention, the process of obtaining the corpus data to be processed in the target domain may include the following. First, candidate corpus data is acquired. The candidate corpus data may come from all domains. For example, the target domain may be the person domain, while the candidate corpus data need not be limited to the person domain. The candidate corpus data may be user search logs collected from an online application over a period of time (e.g., a day or a week). Next, screening information pointing to the target domain is selected. The screening information includes at least one selected from the group consisting of a target domain keyword, a target domain uniform resource locator, and a target domain blacklist. Then, the corpus data to be processed is screened from the candidate corpus data according to the screening information. In this way, the corpus data to be processed can be screened effectively from a large or even massive amount of candidate corpus data; the corpus data to be processed is smaller in order of magnitude than the candidate corpus data and points accurately to the target domain.

Specifically, when the target domain keyword is used as the screening information, an entity of the target domain may be selected as the target domain keyword. (An entity is a basic unit representing a concept; specifically, an entity may be a distinguishable, independently existing object, for example a person name, a place name, an organization name, a date, a time, a percentage or an amount of money.) The target domain keyword may consist of a single entity, or of at least two entities joined by a logical AND (&&) relationship. For example, taking the sports field as the target field, "Yao Ming", "Ke Jie", "World Cup", "ping-assistant", "dakart", "Country A && women's volleyball", "Nth && Figure Skating Championships && pairs short program" and the like can be selected as target domain keywords. Further, the target domain keyword can be adjusted according to the amount of corpus data to be processed that it yields when used as the screening information. For example, when filtering with "Yao Ming" as the current target domain keyword yields a large amount of corpus data to be processed, the keyword may be narrowed to "Yao Ming && NBA". When filtering with "Nth && Figure Skating Championships && pairs short program" as the current target domain keyword yields only a small amount, the keyword may be broadened to "Nth && Figure Skating Championships".

When a target domain uniform resource locator (URL, Uniform Resource Locator: a compact representation of the location and access method of a resource available on the internet, i.e., the address of a standard internet resource) is used as the screening information, a target web address of the target domain may be selected as the target domain uniform resource locator. The target web address may be selected with reference to its relevance to the target field, or to the click popularity of the related websites; the reference dimensions are of course not limited to these or their combinations. Further, the target web address may be processed (e.g., by removing the leading "www" and any unneeded trailing suffix) to obtain the target domain uniform resource locator. For example, "sports.qq.com", "sports.sohu.com", "sports.sina.com.cn", "sports.163.com", "hupu.com" and the like can be selected as target domain uniform resource locators. In practical application, user A inputs candidate corpus data A, and at least one candidate link is returned to the user according to candidate corpus data A; user A then either clicks candidate link B and stays on the page it opens for longer than a threshold, or clicks candidate link B and the page it opens is the final page the user stays on. If the character string corresponding to candidate link B (such as "movie.douban.com/subject/1291572/") includes the target domain uniform resource locator (such as "douban.com"), candidate corpus data A may be used as corpus data to be processed in the target domain.

When the target domain blacklist is used as the screening information, sensitive words (such as pornographic words) can be placed on the target domain blacklist, as can entities of non-target domains. For example, with the sports field as the target field, entities of the computer field such as "cloud computing" and "big data", and entities of the biological field such as "base" and "deoxyribonucleic acid", may be placed on the target domain blacklist.

When the corpus data to be processed is screened from the candidate corpus data according to the screening information, the candidate corpus data can first be pruned with the target domain blacklist, and the remaining candidate corpus data can then be extracted with the target domain uniform resource locator and the target domain keywords in sequence.
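The screening order described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `LogRecord` structure, the example blacklist, URL and keyword sets, and the `normalize_url` helper are all hypothetical, and the sketch assumes that "in sequence" means a record must pass the URL check and then the keyword check.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    query: str        # candidate corpus data entered by the user
    clicked_url: str  # link the user finally stayed on (may be empty)


# Hypothetical screening information for a sports target domain.
BLACKLIST = {"cloud computing", "big data"}
TARGET_URLS = {"sports.qq.com", "hupu.com"}
# Each rule is a tuple of entities joined by logical AND (&&).
KEYWORD_RULES = [("Yao Ming", "NBA"), ("World Cup",)]


def normalize_url(url: str) -> str:
    """Strip the scheme and a leading 'www.' (assumed normalization)."""
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[len("www."):]
    return url


def screen(records):
    """Return the queries that survive blacklist, URL and keyword screening."""
    kept = []
    for rec in records:
        # 1. Remove candidates that contain a blacklisted term.
        if any(term in rec.query for term in BLACKLIST):
            continue
        # 2. Keep candidates whose clicked link contains a target-domain URL.
        url = normalize_url(rec.clicked_url)
        if not any(target in url for target in TARGET_URLS):
            continue
        # 3. Keep candidates that satisfy at least one keyword rule.
        if any(all(entity in rec.query for entity in rule) for rule in KEYWORD_RULES):
            kept.append(rec.query)
    return kept
```

Tightening a rule such as `("Yao Ming",)` to `("Yao Ming", "NBA")` shrinks the result set, which mirrors the keyword adjustment described for step S201.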

Of course, the candidate corpus data from the whole domain can also be directly used as the to-be-processed corpus data of the target domain, and the target corpus data can be obtained by processing in the following steps S202-S203.

In a specific embodiment, the obtained corpus data to be processed in the target domain is presented in query sentence format (query). For example, "who is singer A's daughter", "do you know what date movie B was released", "what technical difficulties did the Hong Kong-Zhuhai-Macao Bridge solve".

In another specific embodiment, the corpus data to be processed in the target domain may be presented in the form of voice, text, image, and the like.

S202: acquiring a target entity and a target predicate in the corpus data to be processed according to a knowledge graph corresponding to the target field, and generating a corresponding entity mapping relation to be matched;

In an embodiment of the present invention, the intelligent question-answering system may provide knowledge base question answering (KBQA) based on a knowledge base (KB; data is typically stored as structured knowledge, such as knowledge stored in SPO triplets). Knowledge in different fields can have corresponding knowledge graphs. (A knowledge graph, also called a scientific knowledge graph, and known in library and information science as knowledge domain visualization or a knowledge domain map, is a series of graphs that display the development process and structural relations of knowledge; it uses visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and the interrelationships among them.) According to the knowledge graph corresponding to the target field, the relationship between the entity (see the description in step S201, not repeated here) and the predicate in an SPO triplet is used to obtain the target entity and the target predicate from the corpus data to be processed. (A predicate may represent the relationship between the subject entity and the object entity; for example, in the SPO triplet "singer A-wife-actor B", singer A is the subject entity, wife is the predicate, and actor B is the object entity.) The mapping relationship between the target entity and the target predicate reflects the intent of the corpus data to be processed more directly, and thus provides a more accurate basis for subsequently obtaining the target corpus data with the target intent.

In a specific embodiment, as shown in fig. 3, the obtaining, according to the knowledge graph corresponding to the target domain, the target entity and the target predicate in the corpus data to be processed, and generating the corresponding mapping relationship of the entity to be matched includes:

S301: acquiring, according to the knowledge graph, the target entity category to which the target entity in the corpus data to be processed belongs and the target predicate, and generating a corresponding entity category mapping relation to be matched;

Specifically, as shown in FIG. 4, acquiring, according to the knowledge graph, the target entity category to which the target entity in the corpus data to be processed belongs and the target predicate, and generating the corresponding entity category mapping relation to be matched, includes:

S401: extracting the target entity from the corpus data to be processed according to the entities contained in the knowledge graph;

Knowledge is stored in a structured form (such as relational mapping pairs) in the knowledge graph corresponding to the target field. The knowledge graph can comprise entity character strings corresponding to a plurality of entities; the entity character strings that occur in the corpus data to be processed are looked up among all the entity character strings to obtain the target entities.

S402: determining the target entity category to which the target entity belongs according to the knowledge graph and the target entity;

An entity AC automaton is established according to the relation mapping pairs in the knowledge graph. (An AC automaton implements KMP-style matching on a Trie to match multiple pattern strings at once; a Trie, also called a word search tree or key tree, is a tree structure and a variant of the hash tree; KMP is an improved string matching algorithm.) Entries in the entity AC automaton may take the following format: the entity followed by the entity categories to which it belongs. Entity categories may be represented by agreed numbers, for example "person name A|15, 21, 33", where 15 denotes the entity category of singer, 21 that of actor, and 33 that of director. The target entity categories (e.g., 15, 21, 33) to which the target entity (e.g., person name A) belongs can thus be determined by the entity AC automaton.
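For readers unfamiliar with AC automata, the following is a compact sketch of an Aho-Corasick matcher that attaches a payload (here, entity category codes) to each pattern. The class name `ACAutomaton` and its methods are illustrative choices, not names taken from the patent, and a production system would more likely use an optimized library.

```python
from collections import deque


class ACAutomaton:
    """Minimal Aho-Corasick automaton mapping patterns to payloads."""

    def __init__(self):
        self.goto = [{}]   # goto[state][char] -> next state (the Trie)
        self.fail = [0]    # failure link of each state
        self.out = [[]]    # (pattern, payload) pairs recognized at each state

    def add(self, pattern, payload):
        """Insert one pattern into the Trie."""
        state = 0
        for ch in pattern:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = nxt
            state = nxt
        self.out[state].append((pattern, payload))

    def build(self):
        """Compute failure links breadth-first (call after all add() calls)."""
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def search(self, text):
        """Return every (pattern, payload) that occurs in text."""
        state, hits = 0, []
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            hits.extend(self.out[state])
        return hits
```

With an entry such as `ac.add("person name A", (15, 21, 33))` followed by `ac.build()`, a call to `ac.search(query)` yields the entity together with its category codes in a single pass over the query, regardless of how many entities the automaton holds.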

S403: extracting the target predicate from the corpus data to be processed according to the predicates contained in the knowledge graph;

The knowledge graph can comprise predicate character strings corresponding to a plurality of predicates; the predicate character strings that occur in the corpus data to be processed are looked up among all the predicate character strings to obtain the target predicates.

S404: determining a target predicate identification corresponding to the target predicate according to the knowledge graph and the target predicate;

A predicate AC automaton is established according to the relation mapping pairs in the knowledge graph. Entries in the predicate AC automaton may take the following format: the predicate followed by the predicate identification corresponding to it. The predicate identification may be the English word corresponding to the predicate, for example "daughter|Daughter". The predicate identification (such as Daughter) corresponding to a target predicate (such as "daughter") can thus be determined through the predicate AC automaton.

S405: and generating the entity class mapping relation to be matched according to the target entity class and the target predicate identification.

Thus, the entity category mapping relations to be matched (such as 15|Daughter, 21|Daughter, 33|Daughter) can be obtained.
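Steps S401 to S405 can be illustrated with a short sketch. To keep it brief, plain dictionaries and substring checks stand in for the entity and predicate AC automata; the dictionary contents and the function name `category_mappings` are hypothetical.

```python
# Hypothetical extracts from a knowledge graph.
ENTITY_CATEGORIES = {"person name A": [15, 21, 33]}        # entity -> category codes
PREDICATE_IDS = {"daughter": "Daughter", "wife": "Wife"}   # predicate string -> identification


def category_mappings(query):
    """Build 'category|PredicateId' strings for one query (S401 to S405)."""
    entities = [e for e in ENTITY_CATEGORIES if e in query]   # S401: target entities
    predicates = [p for p in PREDICATE_IDS if p in query]     # S403: target predicates
    mappings = []
    for entity in entities:
        for category in ENTITY_CATEGORIES[entity]:            # S402: target entity categories
            for predicate in predicates:
                pred_id = PREDICATE_IDS[predicate]             # S404: predicate identification
                mappings.append(f"{category}|{pred_id}")      # S405: mapping relation
    return mappings


print(category_mappings("who is the daughter of person name A"))
# ['15|Daughter', '21|Daughter', '33|Daughter']
```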

S302: filtering the corpus data to be processed according to the matching degree between the entity category mapping relation to be matched and the target entity category mapping relation to obtain intermediate corpus data;

The target entity category mapping relation is set based on the knowledge graph, and an entity category-predicate AC automaton is established according to it. Entries in the entity category-predicate AC automaton may take the following format: the entity category followed by predicate identifications, for example "15|Album, Song, Life, Daughter, Son". The matching degree between the entity category mapping relation to be matched (such as 15|Daughter) and the target entity category mapping relation can thus be checked through the entity category-predicate AC automaton. If the target entity category (such as 15) to which the target entity belongs has no mapping relation with the predicate identification (such as Daughter), the matching degree is low. If it does have such a mapping relation, the matching degree is high. The corpus data to be processed is then filtered accordingly to obtain the intermediate corpus data.
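The category-level check of step S302 amounts to a lookup of a (category, predicate identification) pair. The sketch below uses a dictionary of sets in place of the entity category-predicate AC automaton; the table contents are hypothetical, and keeping a query when at least one of its mappings matches is an assumption, since the text does not say how several mappings for one query are combined.

```python
# Hypothetical target entity category mapping relations: category -> allowed predicate ids.
TARGET_CATEGORY_MAP = {15: {"Album", "Song", "Daughter", "Son"}}


def category_match(mapping):
    """Return True if a 'category|PredicateId' string has a target mapping."""
    category, predicate_id = mapping.split("|")
    return predicate_id in TARGET_CATEGORY_MAP.get(int(category), set())


def filter_queries(query_to_mappings):
    """Keep queries for which at least one mapping matches (assumed rule)."""
    return [
        query
        for query, mappings in query_to_mappings.items()
        if any(category_match(m) for m in mappings)
    ]
```

Because this first pass works on categories rather than individual entities, its lookup table stays small; the entity-level check in step S203 then only has to handle the intermediate corpus data.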

S303: generating the entity mapping relation to be matched based on the intermediate corpus data;

Thus, the entity mapping relation to be matched (such as person name A|Daughter) can be obtained.

S203: filtering the corpus data to be processed based on the matching degree between the entity mapping relation to be matched and the target entity mapping relation to obtain target corpus data with target intention;

In a specific embodiment, the target entity mapping relationship is set based on the knowledge graph, and an entity-predicate AC automaton can be established according to the target entity mapping relationship. Entries in the entity-predicate AC automaton may take the following format: the entity followed by predicate identifications, for example "person name A|Album, Song, Life, Daughter, Son". The matching degree between the entity mapping relation to be matched (such as person name A|Daughter) and the target entity mapping relation can thus be checked through the entity-predicate AC automaton. If the target entity (such as person name A) has no mapping relation with the predicate identification (such as Daughter), the matching degree is low. If it does have such a mapping relation, the matching degree is high. The corpus data to be processed is then filtered accordingly to obtain the target corpus data with the target intention.

In another specific embodiment, as shown in fig. 5, the filtering the corpus data to be processed to obtain the target corpus data based on the matching degree between the mapping relationship of the entity to be matched and the mapping relationship of the target entity includes:

S501: constructing a corresponding bloom filter (Bloom Filter) for the data corresponding to each target entity mapping relation;

The target entity mapping relationship is set based on the knowledge graph. A corresponding bloom filter is constructed for the data corresponding to each target entity mapping relation. For example, bloom filter 1 is constructed for the data corresponding to target entity mapping relation 1 (person name A|birthday), bloom filter 2 for the data corresponding to target entity mapping relation 2 (person name A|work), and bloom filter 3 for the data corresponding to target entity mapping relation 3 (person name A|graduation school).

S502: processing the data corresponding to the entity mapping relation to be matched according to the hash functions corresponding to the bloom filter to obtain a bit array to be matched;

The bloom filter corresponds to a reference bit array of length m, where m is a positive integer. At initialization, every bit of the reference bit array is 0. The bloom filter also corresponds to k hash functions, where k is a positive integer. To process the data corresponding to the entity mapping relation to be matched, the k hash functions are applied to that data to obtain k hash values, and the corresponding bits of a bit array are set to 1 according to the obtained k hash values and a preset rule, thereby obtaining the bit array to be matched. Alternatively, the k hash values obtained for the entity mapping relation to be matched can be matched against the reference bit array directly.

S503: determining the matching degree between the entity mapping relation to be matched and the target entity mapping relation according to the reference bit array corresponding to the bloom filter and the bit array to be matched;

The reference bit array corresponding to the bloom filter is obtained as follows: the k hash functions are applied to the data corresponding to the target entity mapping relation to obtain k hash values, and the corresponding bits of the reference bit array are set to 1 according to a preset rule and the obtained k hash values. The bit array to be matched has the same array length as the reference bit array. When the k hash values obtained for the entity mapping relation to be matched are matched against the reference bit array directly, each of the k hash values is used to query whether the corresponding bit in the reference bit array is 1; if all the bits pointed to by the k hash values are 1, the matching degree between the entity mapping relation to be matched and the target entity mapping relation can be determined to be high.
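A minimal bloom filter along the lines of steps S501 to S503 might look like the sketch below. Several details are assumptions rather than content of the patent: the k hash functions are derived by salting SHA-256 with an index, the "preset rule" is taken to be the hash value modulo m, and for simplicity a single filter holds several mapping relations instead of one filter per relation.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter over strings such as 'person name A|Daughter'."""

    def __init__(self, m, k):
        self.m = m                  # length of the reference bit array
        self.k = k                  # number of hash functions
        # One byte per position for readability; a real filter would pack bits.
        self.bits = bytearray(m)    # all positions start at 0

    def _positions(self, item):
        """Yield the k array positions selected for an item."""
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        """Record a target entity mapping relation in the reference array."""
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        """True if all k positions are 1 (the item is possibly in the set)."""
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.add("person name A|Daughter")
print("person name A|Daughter" in bf)  # True: added items are never missed
```

A membership test that returns False is definitive, whereas True only means "probably present", which is why a match is described as a high matching degree rather than a certainty.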

The bloom filter encodes and stores all data by means of hash functions, can efficiently judge whether an item is in a set, and judges items that are not in the set with high accuracy. For a knowledge graph whose entities number in the hundreds of millions, the target entity mapping relations (entity-predicate) often number 1 to 10 billion; if a corresponding AC automaton, map (dictionary mapping) or set were built, memory would easily run short. A bloom filter stores massive data in little memory and does not cause memory overflow. Constructing a corresponding bloom filter for the target entity mapping relation therefore makes it possible to filter target corpus data with the target intention effectively out of massive data and reduces the amount of data requiring manual inspection.

In another specific embodiment, a bloom filter may be used to filter the corpus data to be processed to obtain the target corpus data with the target intent; according to the relation between the probability of the target intention of the target corpus data and a preset threshold value, or the number of the target corpus data (the minimum number of the target corpus data can be set differently in different fields, such as 2000 in the hot field), the performance parameters of the bloom filter are adjusted; wherein the performance parameter includes at least one selected from the group consisting of a type of hash function corresponding to the bloom filter, the number of hash functions, and an array length of a bit array. For example, when the probability that the target corpus data has the target intent is smaller than a preset threshold, that is, the correlation between the target corpus data and the target intent is small, the number of corresponding hash functions can be increased and/or the array length of the corresponding bit array can be increased, so that the memory space occupied by the corresponding bloom filter can be increased.
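When tuning these performance parameters, the standard approximation for a bloom filter's false-positive rate, p ≈ (1 - e^(-kn/m))^k for n stored items, can serve as a guide. The helper names below are hypothetical and the formula is the textbook estimate, not something stated in the patent. Under this estimate a longer bit array always lowers the error rate, while adding hash functions helps only up to roughly (m/n)·ln 2 of them.

```python
import math


def false_positive_rate(m, k, n):
    """Textbook estimate of the false-positive rate for n items, m bits, k hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m, n):
    """Number of hash functions that minimizes the estimate above."""
    return max(1, round(m / n * math.log(2)))


# Doubling the bit array length for the same k and n lowers the estimated error.
print(false_positive_rate(10_000, 7, 1_000) > false_positive_rate(20_000, 7, 1_000))  # True
```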

As shown in fig. 6, the method further includes:

S204: inputting the target corpus data into a machine learning model for intention recognition training;

An LSTM (Long Short-Term Memory) model and an LR (Logistic Regression) model can each be used as the machine learning model for training, but the machine learning model is not limited thereto and may also be, for example, a decision tree model. Specifically, when training an intention recognition model for recognizing intentions related to the sports field, the positive sample data may be target corpus data of the sports field, and the negative sample data may be corpus data of non-sports fields (such as the medical, educational, and legal fields).

The target corpus data input to the machine learning model is not limited to the target corpus data obtained by processing the corpus data to be processed in the current stage, and may also include the target corpus data obtained by processing the corpus data to be processed before the current stage.

S205: in the training process, adjusting model parameters of the machine learning model until the type of intention output by the machine learning model is matched with the type of intention corresponding to the input target corpus data;

A loss value may be calculated between the intermediate value output by the machine learning model (the intention type as an intermediate training result) and the reference value corresponding to the target corpus data (the intention type as the correct answer), and the model parameters may be adjusted according to the loss value. Specifically, the initial network model may be trained by gradient descent, with the initial learning rate set to 0.0005-0.0015 and the learning rate adjusted once every 1000-3000 iterations. For example, the initial learning rate can be set to 0.001 and adjusted once every 2000 iterations. Of course, the manner of setting the learning rate is not limited thereto.
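
The learning-rate schedule in this step can be sketched as follows, using a logistic regression model trained by batch gradient descent. The text does not say how the learning rate is adjusted, so halving it every 2000 iterations is an assumption, as are the names `step_decay` and `train_logistic` and the feature-vector input format.

```python
import math
from typing import List, Sequence, Tuple


def step_decay(initial_lr: float, step: int, every: int = 2000, factor: float = 0.5) -> float:
    """Learning rate at iteration `step`; `factor` is an assumed decay ratio."""
    return initial_lr * factor ** (step // every)


def train_logistic(
    samples: Sequence[Tuple[Sequence[float], int]],
    num_features: int,
    iterations: int = 6000,
    initial_lr: float = 0.001,
) -> Tuple[List[float], float]:
    """Batch gradient descent on log loss; samples are (feature_vector, label)."""
    weights = [0.0] * num_features
    bias = 0.0
    n = len(samples)
    for step in range(iterations):
        lr = step_decay(initial_lr, step)
        grad_w = [0.0] * num_features
        grad_b = 0.0
        for features, label in samples:
            z = bias + sum(w * x for w, x in zip(weights, features))
            pred = 1.0 / (1.0 + math.exp(-z))
            error = pred - label  # gradient of log loss with respect to z
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        # Move the parameters against the averaged gradient.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy data: label 1 = target-field sample, label 0 = sample from another field.
toy_samples = [([1.0], 1), ([-1.0], 0)]
toy_weights, toy_bias = train_logistic(toy_samples, num_features=1, iterations=100)
```

With an initial rate of 0.001, this schedule uses 0.001 for iterations 0 to 1999, 0.0005 for iterations 2000 to 3999, and so on.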

S206: taking the machine learning model corresponding to the current values of the model parameters as the intention recognition model.

Fig. 7 is a schematic diagram of an application scenario of an intent recognition model according to an embodiment of the present invention. In fig. 7, the training data is target corpus data, and each sample data may be labeled with a corresponding intention type; correspondingly, the intent recognition model trained later can recognize the intent type of the query corpus data.

In a specific embodiment, inputting the query corpus data into the intent recognition model to obtain the corresponding intent type includes: obtaining, from the intent recognition model corresponding to each target field, a predicted value of the intent type related to that target field for the query corpus data. Each target field may have its own intent recognition model, such as one that recognizes sports-related intent and another that recognizes medical-related intent. The query corpus data is input into the intent recognition model of each target field to obtain the predicted value for each field; for example, the query corpus data may score 90 points (which can also be expressed as a probability or the like) from the model that recognizes sports-related intent, 50 points from the model that recognizes medical-related intent, and 20 points from the model that recognizes education-related intent. The predicted values are then compared to find the maximum, and the intent type related to the target field with the maximum predicted value is taken as the intent type of the query corpus data. Because the query corpus data is evaluated by the intent recognition models of several different target fields and the resulting predicted values are considered together, this helps ensure the accuracy of the recognition result.
Further, based on the identified intention type, more accurate response corpus data is returned to the user; see figs. 8 and 9. The presentation form of the response corpus data includes, but is not limited to, speech, text, image, and link.
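
The per-field scoring and maximum selection described above can be sketched as follows. The function name `recognize_field`, the scorer signature, and the stand-in lambdas that return the example scores from the text are hypothetical; a real system would call the trained intent recognition models instead.

```python
from typing import Callable, Dict, Tuple

# A scorer maps query text to a predicted value for one target field.
Scorer = Callable[[str], float]


def recognize_field(query: str, models: Dict[str, Scorer]) -> Tuple[str, Dict[str, float]]:
    """Score the query with every field's model and return the best field."""
    if not models:
        raise ValueError("at least one intent recognition model is required")
    scores = {field: model(query) for field, model in models.items()}
    best_field = max(scores, key=lambda field: scores[field])
    return best_field, scores


# Stand-in scorers returning the example values from the text, as probabilities.
models: Dict[str, Scorer] = {
    "sports": lambda query: 0.90,
    "medical": lambda query: 0.50,
    "education": lambda query: 0.20,
}

best_field, scores = recognize_field("who won the match last night", models)
```

Keeping the full score dictionary alongside the winning field allows a caller to inspect how close the competing fields were before choosing a response.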

Using the target corpus data for machine learning training gives the resulting intention recognition model strong generalization capability, which improves its adaptability to the query corpus data input by users and thus greatly improves the reliability and effectiveness of intention recognition. The intention recognition model obtained by training in the current stage can be used as an intermediate model, and target corpus data of the training in the current stage is input into the intermediate model for training. In the intelligent question-answering system, the old intention recognition model may be replaced with the new model as training proceeds.

As can be seen from the technical solutions provided in the embodiments of the present disclosure, based on the knowledge graph of the target field, the corpus data to be processed is processed using the entity and predicate relationships in SPO (Subject-Predicate-Object) triples to obtain target corpus data with the target intention. This effectively improves the efficiency of obtaining the target corpus data. Because the target corpus data obtained in this way has the target intention, the quality of a model trained with the target corpus data as input can be improved. Through intention recognition, the field to which the query corpus data belongs can be determined according to the intention type, and similar fields (such as novels, comics, films, videos and the like) can be distinguished more effectively, so that more accurate response corpus data can be obtained.

The embodiment of the invention also provides a corpus data processing device, as shown in fig. 10, which comprises:

acquisition module 1010: configured to obtain corpus data to be processed in a target field;

generating module 1020: configured to obtain a target entity and a target predicate in the corpus data to be processed according to a knowledge graph corresponding to the target field, and to generate a corresponding entity mapping relation to be matched;

filtering module 1030: configured to filter, using a bloom filter, the corpus data to be processed based on the matching degree between the entity mapping relation to be matched and the target entity mapping relation, to obtain target corpus data with target intention, wherein the target entity mapping relation is set based on the knowledge graph.

The filtering module 1030 includes:

a construction unit: configured to respectively construct corresponding bloom filters based on the data corresponding to each target entity mapping relation;

a processing unit: configured to process the data corresponding to the entity mapping relation to be matched according to the hash function corresponding to the bloom filter, to obtain a bit array to be matched;

a matching unit: configured to determine the matching degree between the entity mapping relation to be matched and the target entity mapping relation according to the reference bit array corresponding to the bloom filter and the bit array to be matched, wherein the array lengths of the bit array to be matched and the reference bit array are the same.

The apparatus further comprises:

an input module: configured to input the target corpus data into a machine learning model for intention recognition training;

a model training module: configured to adjust, during training, model parameters of the machine learning model until the type of intention output by the machine learning model matches the type of intention corresponding to the input target corpus data;

a model updating module: configured to take the machine learning model corresponding to the current values of the model parameters as the intention recognition model.

It should be noted that the apparatus embodiments and the method embodiments are based on the same inventive concept.

The embodiment of the invention provides electronic equipment, which comprises a processor and a memory, wherein at least one instruction, at least one section of program, a code set or an instruction set is stored in the memory, and the at least one instruction, the at least one section of program, the code set or the instruction set is loaded and executed by the processor to realize the corpus data processing method provided by the embodiment of the method.

The memory may be used to store software programs and modules, and the processor performs various functional applications and data processing by running the software programs and modules stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required for functions, and the like; the storage data area may store data created according to the use of the device, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide access to the memory by the processor.

The electronic device may be a server. Referring to fig. 11, an embodiment of the present invention further provides a schematic structural diagram of the server. The server 1100 is configured to implement the corpus data processing method provided in the foregoing embodiments; specifically, the server structure may include the corpus data processing device described above. The server 1100 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 1110 (e.g., one or more processors), memory 1130, and one or more storage media 1120 (e.g., one or more mass storage devices) that store applications 1123 or data 1122. The memory 1130 and the storage medium 1120 may be transitory or persistent storage. The program stored on the storage medium 1120 may include one or more modules, each of which may include a series of instruction operations on the server. Still further, the central processing unit 1110 may be configured to communicate with the storage medium 1120 and to execute, on the server 1100, the series of instruction operations in the storage medium 1120. The server 1100 may also include one or more power supplies 1160, one or more wired or wireless network interfaces 1150, one or more input/output interfaces 1140, and/or one or more operating systems 1121, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

The embodiment of the invention also provides a storage medium, which can be arranged in an electronic device to store at least one instruction, at least one section of program, a code set or an instruction set related to a corpus data processing method in the embodiment of the method, where the at least one instruction, the at least one section of program, the code set or the instruction set is loaded and executed by the processor to implement the corpus data processing method provided in the embodiment of the method.

Optionally, in this embodiment, the storage medium may be located in at least one of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.

It should be noted that: the sequence of the embodiments of the present invention is only for description, and does not represent the advantages and disadvantages of the embodiments. And the foregoing description has been directed to specific embodiments of this specification. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the apparatus and electronic device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. A method for processing corpus data, the method comprising:

obtaining corpus data to be processed in the target field;

acquiring a target entity and a target predicate in the corpus data to be processed according to the entity and the predicate contained in the knowledge graph corresponding to the target field, and establishing an entity predicate mapping relation to be matched according to the target entity and the target predicate, wherein the predicate is used for representing the relation between a subject entity and an object entity;

filtering the corpus data to be processed based on the matching degree between the entity predicate mapping relation to be matched and the target entity predicate mapping relation to obtain target corpus data with target intention;

wherein, the target entity predicate mapping relation is set based on the knowledge graph;

the establishing an entity predicate mapping relation to be matched according to the target entity and the target predicate includes: determining the target entity category to which the target entity belongs according to the knowledge graph; establishing an entity category predicate mapping relation to be matched according to the target entity category and the target predicate; filtering the corpus data to be processed according to the matching degree between the entity category predicate mapping relation to be matched and the target entity category predicate mapping relation to obtain intermediate corpus data, wherein the target entity category predicate mapping relation is set based on the knowledge graph; and generating the entity predicate mapping relation to be matched based on the intermediate corpus data.

2. The method of claim 1, wherein the filtering the corpus data to be processed to obtain the target corpus data based on a degree of matching between the entity predicate mapping relationship to be matched and the target entity predicate mapping relationship comprises:

based on the data corresponding to each target entity predicate mapping relation, respectively constructing corresponding bloom filters;

processing data corresponding to the entity predicate mapping relation to be matched according to the hash function corresponding to the bloom filter to obtain a bit array to be matched;

determining the matching degree between the entity predicate mapping relation to be matched and the target entity predicate mapping relation according to the reference bit array corresponding to the bloom filter and the bit array to be matched;

wherein the array length of the bit array to be matched is the same as that of the reference bit array.

3. The method of claim 1, wherein filtering the corpus data to be processed to obtain the target corpus data with the target intent based on a degree of matching between the entity predicate-mapping relationship to be matched and the target entity predicate-mapping relationship, comprises:

filtering the corpus data to be processed by using a bloom filter to obtain the target corpus data with target intention;

according to the relation between the probability of the target intention of the target corpus data and a preset threshold value or the number of the target corpus data, adjusting the performance parameters of the bloom filter;

wherein the performance parameter includes at least one selected from the group consisting of a type of hash function corresponding to the bloom filter, the number of hash functions, and an array length of a bit array.

4. The method of claim 1, wherein the obtaining a target entity and a target predicate in the corpus data to be processed according to the entity and the predicate included in the knowledge graph corresponding to the target domain, and establishing an entity predicate mapping relationship to be matched according to the target entity and the target predicate, comprises:

extracting the target entity from the corpus data to be processed according to the entity contained in the knowledge graph;

determining the target entity category to which the target entity belongs according to the knowledge graph and the target entity;

extracting the target predicate from the corpus data to be processed according to the predicate contained in the knowledge graph;

determining a target predicate identification corresponding to the target predicate according to the knowledge graph and the target predicate;

and establishing the entity predicate mapping relation to be matched according to the target entity category and the target predicate identification.

5. The method according to claim 1, wherein the obtaining the corpus data to be processed of the target domain includes:

obtaining candidate corpus data;

selecting screening information pointing to the target field;

screening the corpus data to be processed from the candidate corpus data according to the screening information;

the screening information comprises at least one selected from the group consisting of a target domain keyword, a target domain uniform resource locator and a target domain blacklist.

6. The method according to claim 1, wherein the method further comprises:

inputting the target corpus data into a machine learning model for intention recognition training;

in the training process, adjusting model parameters of the machine learning model until the type of intention output by the machine learning model is matched with the type of intention corresponding to the input target corpus data;

and taking the machine learning model corresponding to the current value of the model parameter as an intention recognition model.

7. A corpus data processing apparatus, the apparatus comprising:

a generation module: configured to obtain a target entity and a target predicate in the corpus data to be processed according to the entity and the predicate contained in the knowledge graph corresponding to the target field, and to establish an entity predicate mapping relation to be matched according to the target entity and the target predicate, wherein the predicate is used for representing the relation between a subject entity and an object entity;

and a filtering module: configured to filter the corpus data to be processed based on the matching degree between the entity predicate mapping relation to be matched and the target entity predicate mapping relation, to obtain target corpus data with target intention;

8. The apparatus of claim 7, wherein the filtering module comprises:

a construction unit: configured to respectively construct corresponding bloom filters based on the data corresponding to each target entity mapping relation;

a processing unit: configured to process the data corresponding to the entity mapping relation to be matched according to the hash function corresponding to the bloom filter, to obtain a bit array to be matched;

a matching unit: configured to determine the matching degree between the entity mapping relation to be matched and the target entity mapping relation according to the reference bit array corresponding to the bloom filter and the bit array to be matched, wherein the array lengths of the bit array to be matched and the reference bit array are the same.

9. The apparatus of claim 7, wherein the filtering module is configured to: filter the corpus data to be processed by using a bloom filter to obtain the target corpus data with target intention; and adjust the performance parameters of the bloom filter according to the relation between the probability of the target intention of the target corpus data and a preset threshold value, or the number of the target corpus data; wherein the performance parameter includes at least one selected from the group consisting of a type of hash function corresponding to the bloom filter, the number of hash functions, and an array length of a bit array.

10. The apparatus of claim 7, wherein the obtaining a target entity and a target predicate in the corpus data to be processed according to the entity and the predicate included in the knowledge graph corresponding to the target domain, and establishing an entity predicate mapping relationship to be matched according to the target entity and the target predicate, comprises: extracting the target entity from the corpus data to be processed according to the entity contained in the knowledge graph; determining the target entity category to which the target entity belongs according to the knowledge graph and the target entity; extracting the target predicate from the corpus data to be processed according to the predicate contained in the knowledge graph; determining a target predicate identification corresponding to the target predicate according to the knowledge graph and the target predicate; and establishing the entity predicate mapping relation to be matched according to the target entity category and the target predicate identification.

11. The apparatus of claim 7, wherein the acquisition module is configured to: obtain candidate corpus data; select screening information pointing to the target field; and screen the corpus data to be processed from the candidate corpus data according to the screening information; wherein the screening information comprises at least one selected from the group consisting of a target domain keyword, a target domain uniform resource locator and a target domain blacklist.

12. The apparatus of claim 7, wherein the apparatus further comprises:

13. An electronic device comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the corpus data processing method of any of claims 1-6.

14. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the corpus data processing method of any of claims 1-6.