CN112765330A

CN112765330A - Text data processing method and device, electronic equipment and storage medium

Info

Publication number: CN112765330A
Application number: CN202011631883.8A
Authority: CN
Inventors: 谢韬; 秦昌博; 高倩; 邵长东
Original assignee: Ecovacs Robotics Suzhou Co Ltd
Current assignee: Ecovacs Robotics Suzhou Co Ltd; Ecovacs Commercial Robotics Co Ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-05-07

Abstract

The embodiment of the invention provides a text data processing method, a text data processing device, electronic equipment and a storage medium. The method comprises: obtaining a statement to be processed and a first named entity contained in the statement; generating a template statement containing the first named entity according to a preset statement template; forming a statement pair from the statement to be processed and the template statement; and generating a triple relation containing the first named entity according to the statement pair. The method is an open-domain triple relation generation method and can obtain both the explicit triple relation and the implicit triple relation corresponding to the first named entity in the statement to be processed. Because both statements in the constructed statement pair contain the first named entity, generation is restricted to the triple relations of a single named entity, which limits the number of triple relations generated and ensures the accuracy of generating the triple relation.

Description

Text data processing method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of natural language processing, and in particular, to a text data processing method and apparatus, an electronic device, and a storage medium.

Background

Natural Language Processing (NLP) is a research hotspot in the field of artificial intelligence, and is also the core of human-computer interaction.

In human-computer interaction, an intelligent device must first understand the sentence input by the user and then generate a response based on that understanding. The intelligent device can understand sentences with the help of a knowledge graph, so the accuracy with which the knowledge graph is built directly affects the quality of human-computer interaction. A knowledge graph can be built by performing knowledge extraction on text data, and the knowledge extraction can be performed in a restricted domain or an open domain.

Disclosure of Invention

The embodiment of the invention provides a text data processing method and device, electronic equipment and a storage medium, which are used for ensuring the accuracy of generating a triple relation.

The embodiment of the invention provides a text data processing method, which comprises the following steps:

acquiring a first named entity contained in a statement to be processed;

forming a statement pair corresponding to the first named entity by the statement to be processed and the template statement containing the first named entity;

and generating a triple relation containing the first named entity according to the statement pair.

An embodiment of the present invention provides a text data processing apparatus, including:

the acquisition module is used for acquiring a first named entity contained in the statement to be processed;

the building module is used for forming a statement pair corresponding to the first named entity by using the statement to be processed and the template statement containing the first named entity;

and the generating module is used for generating the triple relation containing the first named entity according to the statement pair.

An embodiment of the present invention provides an electronic device, including: a processor and a memory; wherein the memory is to store one or more computer instructions that when executed by the processor implement:

acquiring a first named entity contained in a statement to be processed;

Embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform at least the following:

acquiring a first named entity contained in a statement to be processed;

The embodiment of the invention provides another text data processing method, which comprises the following steps:

obtaining a sample named entity contained in a sample statement;

forming a sample statement pair corresponding to the named entity by the sample statement and a template statement containing the sample named entity;

inputting the sample statement pair into a generation model, and outputting an attribute relation sequence corresponding to the sample named entity and a prediction probability matrix corresponding to the attribute relation sequence by the generation model;

and adjusting the model parameters of the generated model according to the prediction probability matrix and a preset expected probability matrix.

An embodiment of the present invention provides another text data processing apparatus, including:

the acquisition module is used for acquiring a sample named entity contained in a sample statement;

the construction module is used for forming a sample statement pair corresponding to the named entity by the sample statement and the template statement containing the sample named entity;

the input module is used for inputting the sample statement pair into a generation model, so that the generation model outputs the attribute relation sequence corresponding to the sample named entity and the prediction probability matrix corresponding to the attribute relation sequence;

and the adjusting module is used for adjusting the model parameters of the generated model according to the prediction probability matrix and a preset expected probability matrix.

An embodiment of the present invention provides another electronic device, including: a processor and a memory; wherein the memory is to store one or more computer instructions that when executed by the processor implement:

obtaining a sample named entity contained in a sample statement;

An embodiment of the present invention provides another computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform at least the following:

obtaining a sample named entity contained in a sample statement;

According to the text data processing method provided by the invention, the sentence to be processed and the first named entity contained in it are obtained, and a template sentence containing the first named entity is generated according to a preset sentence template, so that the sentence to be processed and the template sentence form a sentence pair. A triple relation containing the first named entity is then generated according to the sentence pair. The method is an open-domain triple relation generation method and can obtain both the explicit triple relation and the implicit triple relation corresponding to the first named entity in the sentence to be processed. Because both sentences in the constructed pair contain the same named entity, generation is restricted to the triple relations of a single named entity (the first named entity), which limits the number of triple relations generated each time and also ensures the accuracy of generating the triple relation.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a flowchart of a text data processing method according to an embodiment of the present invention;

Fig. 2 is a flowchart of another text data processing method according to an embodiment of the present invention;

Fig. 3 is a flowchart of another text data processing method according to an embodiment of the present invention;

Fig. 4 is a flowchart of an alternative implementation of step 304 in the embodiment shown in Fig. 3;

Fig. 5 is a flowchart of another text data processing method according to an embodiment of the present invention;

Fig. 6 is a flowchart of another text data processing method according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a text data processing apparatus according to an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of an electronic device corresponding to the text data processing apparatus provided in the embodiment shown in Fig. 7;

Fig. 9 is a schematic structural diagram of another text data processing apparatus according to an embodiment of the present invention;

Fig. 10 is a schematic structural diagram of an electronic device corresponding to the text data processing apparatus provided in the embodiment shown in Fig. 9.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well. "plurality" generally includes at least two unless the context clearly dictates otherwise.

The word "if", as used herein, may be interpreted as "when" or "upon" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.

It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a commodity or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such commodity or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the commodity or system that includes the element.

As described in the background, the man-machine conversation function is realized based on the knowledge graph, and the knowledge in the knowledge graph is extracted from a large number of sentences. In practice, knowledge extraction based on open domains tends to result in richer knowledge than knowledge extraction based on restricted domains.

Each piece of knowledge in the knowledge graph can be regarded as a Subject-Predicate-Object (SPO) triple relationship corresponding to one of the named entities in a sentence. The Subject in the SPO triple relationship can be called the first named entity, the Predicate can be called the predicate, and the Object can be called the second named entity. It should be noted that, in each of the following embodiments, the SPO triple relationship is simply referred to as a triple relationship, and any triple relationship includes a first named entity, a predicate, and a second named entity.

It is readily understood that, for a statement, both explicit and implicit triple relationships may exist between the named entities it contains. For example, for the statement "A and B have a girl C", the explicit triple relationships may be (A, a girl, C) and (B, a girl, C), and the implicit triple relationships may be (A, girl, C) and (B, girl, C).

In order to obtain the explicit triple relationship and the implicit triple relationship of the named entity in the sentence, the text data processing method provided in each embodiment of the present invention may be adopted.

Alternatively, the text data processing method provided by the embodiments of the present invention may be applied to an intelligent robot such as a service robot, a self-moving vending robot, and the like. The text processing method can also be applied to a man-machine conversation plug-in (or a man-machine conversation interface and a man-machine conversation function module) integrated in an online shopping system and a public service system; the text processing method can also be applied to intelligent terminals such as intelligent household appliances and intelligent wearable equipment. Broadly speaking, the text processing method can be applied to any device and system supporting man-machine conversation.

Based on the above description, some embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The features of the embodiments and examples described below may be combined with each other without conflict between the embodiments. In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.

Fig. 1 is a flowchart of a text data processing method according to an embodiment of the present invention, where the text data processing method according to the embodiment of the present invention may be executed by a processing device. It will be appreciated that the processing device may be implemented as software, or a combination of software and hardware. The processing device may be any of the devices mentioned above that support human-machine interaction. As shown in fig. 1, the method comprises the steps of:

101. Acquiring a first named entity contained in the statement to be processed.

The statements to be processed may optionally be collected via the Internet. In addition, as described above, the triple relationships extracted from the statements to be processed give the intelligent robot its man-machine conversation function; therefore, a history of conversations generated by the intelligent robot within a certain time period may alternatively be used as the statements to be processed.

Then, the user may trigger an input operation of the to-be-processed sentence, so that the processing device obtains the collected to-be-processed sentence, and further identifies the named entity included in the sentence. The sentence to be processed may be a single sentence or a paragraph composed of a plurality of sentences.

However, to ensure the accuracy of the generated triple relationships, the sentence to be processed that the processing device acquires should not be too long. Therefore, optionally, after obtaining the to-be-processed sentence input by the user, the processing device determines its length, treats a to-be-processed sentence whose length exceeds a first preset length as a paragraph, divides the paragraph into sentences, and then generates the triple relationships of the named entities in each sentence separately.
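The length check and sentence division described above can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the threshold value `FIRST_PRESET_LENGTH`, the function name `split_if_paragraph`, and the punctuation-based splitting rule are all assumptions made for the example.

```python
import re
from typing import List

# Hypothetical value; the description only speaks of a "first preset length".
FIRST_PRESET_LENGTH = 40


def split_if_paragraph(text: str, max_len: int = FIRST_PRESET_LENGTH) -> List[str]:
    """Return the input as a single sentence, or divide it when it is too long."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_len:
        return [text]
    # Treat the input as a paragraph: split after sentence-ending punctuation
    # (Western and Chinese), dropping the whitespace that follows it.
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [part.strip() for part in parts if part.strip()]


paragraph = "I want to apply for a credit card. What offers does it have? I want to know."
sentences = split_if_paragraph(paragraph)
```

Splitting on punctuation alone is naive (it would also split at abbreviations or decimal points), so a real system would likely use a more careful sentence segmenter.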

After receiving the statement to be processed, the processing device may further identify the named entity contained therein. Optionally, at least one named entity may be included in the statement to be processed. Alternatively, a dictionary containing the named entities may be established in advance, and after the sentence to be processed is obtained and subjected to word segmentation processing, the word segmentation result is compared with words in a preset dictionary, so as to determine the named entities contained in the sentence to be processed. In practical applications, the named entity may be an entity with a specific name, such as a person name, an organization name, a geographic location, and the like, but may also be further extended to a date, a currency, and the like.

It should be noted that, as mentioned in the above description, a triple relationship may include a first named entity, a predicate and a second named entity, and the named entity obtained in step 101 may be considered as the first named entity in the triple relationship.

For example, the sentence to be processed may be "Wu Cheng'en is the author of Journey to the West", whose word segmentation result is: Journey to the West, of, author, Wu Cheng'en. Comparing the word segmentation result with the words in the preset dictionary yields the first named entities: Journey to the West and Wu Cheng'en.
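The dictionary comparison in step 101 can be illustrated with a short sketch. It assumes the sentence has already been segmented into words; the function name `find_named_entities` and the dictionary contents are hypothetical, and the token list mirrors the example above with the names written as Journey to the West and Wu Cheng'en.

```python
from typing import Iterable, List, Set


def find_named_entities(tokens: Iterable[str], dictionary: Set[str]) -> List[str]:
    """Return the tokens found in the preset dictionary, in order, without repeats."""
    seen = set()
    entities = []
    for token in tokens:
        if token in dictionary and token not in seen:
            seen.add(token)
            entities.append(token)
    return entities


# Word segmentation result of the example sentence (assumed to be given).
tokens = ["Journey to the West", "of", "author", "Wu Cheng'en"]
# Hypothetical preset dictionary of named entities.
preset_dictionary = {"Journey to the West", "Wu Cheng'en", "Suzhou"}

first_named_entities = find_named_entities(tokens, preset_dictionary)
```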

102. Forming a statement pair corresponding to the first named entity from the statement to be processed and a template statement containing the first named entity.

The acquired first named entity is then substituted into a default template sentence to obtain a complete template sentence. The sentence to be processed and the complete template sentence, which contains the first named entity, form the sentence pair corresponding to the first named entity. That is, both sentences in a sentence pair contain the first named entity. Optionally, the complete template sentence obtained after the named entity is filled in usually contains one named entity.

Continuing the above example, if the default template sentence is "what are the relations about XXX", substituting the first named entity "Journey to the West" from the sentence to be processed yields the complete template sentence "what are the relations about Journey to the West". The sentence to be processed, "Wu Cheng'en is the author of Journey to the West", and the complete template sentence "what are the relations about Journey to the West" form one sentence pair. Similarly, the sentence to be processed and "what are the relations about Wu Cheng'en" form another sentence pair.
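Step 102 amounts to filling a template and pairing the result with the original sentence. The sketch below shows one way this might look; the template wording, the function name `build_sentence_pairs`, and the ordering of each pair (template sentence first) are assumptions for illustration only.

```python
from typing import List, Tuple

# Hypothetical wording of the default template sentence.
DEFAULT_TEMPLATE = "what are the relations about {entity}"


def build_sentence_pairs(
    sentence: str, entities: List[str], template: str = DEFAULT_TEMPLATE
) -> List[Tuple[str, str]]:
    """Build one (template sentence, sentence to be processed) pair per entity."""
    pairs = []
    for entity in entities:
        # Both sentences in a pair must contain the first named entity.
        if entity not in sentence:
            continue
        pairs.append((template.format(entity=entity), sentence))
    return pairs


sentence = "Wu Cheng'en is the author of Journey to the West"
pairs = build_sentence_pairs(sentence, ["Journey to the West", "Wu Cheng'en"])
```

Each pair concerns exactly one named entity, which is what keeps the later generation step focused on a single entity.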

103. From the statement pair, a triple relationship is generated that includes the first named entity.

Finally, a triple relationship containing the first named entity is generated, in a generative manner, from the obtained statement pair. The triple relationship is the knowledge contained in the statement to be processed; that is, open-domain knowledge extraction is realized.

Continuing with the above example, the triple relationship for the first named entity "Journey to the West" may be (Journey to the West, author, Wu Cheng'en), where "author" is the predicate and "Wu Cheng'en" is the second named entity. Similarly, the triple relationship for the first named entity "Wu Cheng'en" may be (Wu Cheng'en, written work, Journey to the West), where "written work" is the predicate and "Journey to the West" is the second named entity.

It should be noted that, because both statements in a statement pair contain the same named entity, for example "Journey to the West", the scheme provided by this embodiment restricts the generation of triple relationships to a single named entity. That is, the generated triple relationships all concern the same named entity (here, Journey to the West), which makes the generation more controllable. Compared with generating the triple relationships of several named entities at the same time, the method of this embodiment generates fewer triple relationships at a time, which further ensures the accuracy of the generated triple relationships.

In this embodiment, a sentence to be processed and a first named entity contained in it are obtained, and a template sentence containing the first named entity is generated according to a preset sentence template, so that the sentence to be processed and the template sentence form a sentence pair. A triple relationship containing the first named entity is then generated according to the sentence pair. The method is an open-domain knowledge generation method and can extract both the explicit triple relationship and the implicit triple relationship corresponding to the first named entity in the sentence to be processed. Because both sentences in the constructed pair contain the same named entity, generation is restricted to the triple relationships of the first named entity, which limits the number of triple relationships generated each time and ensures the accuracy of generating the triple relationship.

The above embodiment obtains the first named entity by comparison against a preset dictionary. Alternatively, named entities can be recognized using a sequence labeling model. Fig. 2 is a flowchart of another text data processing method according to an embodiment of the present invention, and as shown in fig. 2, the method may include the following steps:

201. Inputting the sentence to be processed into the sequence labeling model, so that the sequence labeling model outputs a labeling sequence corresponding to the sentence to be processed.

202. The first named entity is determined from the output annotation sequence.

The sentence to be processed may be input into a sequence labeling model that has been trained to convergence, and the first named entity is determined according to the labeling sequence output by the model. Optionally, the sequence labeling model adopts the BIO labeling scheme (Begin, Inside, Outside, abbreviated as BIO).

For example, for the sentence to be processed "Journey to the West, authored by Wu Cheng'en", the BIO sequence output by the sequence labeling model is "O B I I O O O B I I O", with one label per character of the original sentence. The characters covered by each B label and the I labels that immediately follow it are determined as a first named entity. The first named entities obtained from the BIO sequence are therefore Journey to the West and Wu Cheng'en.
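Reading entities off a BIO sequence can be sketched as below. The function name `decode_bio` is hypothetical, and because the labels in the example apply to the characters of the original (untranslated) sentence, the sketch uses placeholder characters of the same length: `ABC` and `DEF` stand in for the two entity names.

```python
from typing import List, Sequence


def decode_bio(chars: Sequence[str], labels: Sequence[str]) -> List[str]:
    """Collect each span that starts at a B label and continues through I labels."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    entities: List[str] = []
    current: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B":
            # A new entity begins; close any entity still open.
            if current:
                entities.append("".join(current))
            current = [ch]
        elif label == "I" and current:
            current.append(ch)
        else:
            # An O label (or a stray I with no open entity) closes the span.
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities


# Placeholder characters aligned with the 11 labels from the example.
chars = list("<ABC>xyDEF.")
labels = "O B I I O O O B I I O".split()
entities = decode_bio(chars, labels)
```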

Optionally, the sequence labeling model may be trained as follows: training sentences are input into the sequence labeling model, a loss value is calculated from the predicted labeling sequence output by the model and the manually annotated actual labeling sequence, and the model parameters of the sequence labeling model are adjusted backward according to the loss value until the model converges.
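The description does not say which loss function is used. As one common choice for sequence labeling, the sketch below computes an average cross-entropy between the predicted label distributions and the manually annotated labels; the function name `sequence_cross_entropy` and the data layout are assumptions for illustration.

```python
import math
from typing import Dict, List


def sequence_cross_entropy(predicted: List[Dict[str, float]], gold: List[str]) -> float:
    """Average negative log-probability assigned to the annotated label at each position."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and of equal length")
    eps = 1e-12  # avoids log(0) when a label receives zero probability
    total = 0.0
    for distribution, label in zip(predicted, gold):
        total -= math.log(max(distribution.get(label, 0.0), eps))
    return total / len(gold)


# A model that is uniformly unsure over the three BIO labels at two positions.
uniform = {"B": 1 / 3, "I": 1 / 3, "O": 1 / 3}
loss = sequence_cross_entropy([uniform, uniform], ["B", "I"])
```

A lower value means the predicted labeling sequence is closer to the annotated one; training would adjust the parameters to reduce this value.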

203. Forming a statement pair corresponding to the first named entity from the statement to be processed and a template statement containing the first named entity.

204. From the statement pair, a triple relationship is generated that includes the first named entity.

The specific implementation process of the steps 203 to 204 can refer to the related description in the embodiment shown in fig. 1, and is not described herein again.

In this embodiment, compared with producing the first named entity contained in the sentence to be processed by direct generation, using the sequence labeling model can further improve the recognition accuracy of the named entity, avoids uncontrolled generation of named entities, and thereby further ensures the accuracy of the subsequently generated triple relationships.

According to the embodiment shown in fig. 1 or fig. 2, whether the sentence to be processed is a paragraph can be determined from the length of the sentence input by the user. When the sentence to be processed is a paragraph, optionally, the paragraph as a whole may be input directly into the sequence labeling model, and steps 202 to 204 above are then performed.

In an actual paragraph, sentences are likely to contain pronouns, which may specifically include zero pronouns and/or personal pronouns. For example, assume that the following dialog paragraph exists:

The user: I want to apply for a credit card.

The intelligent robot: You can apply for a credit card through mobile banking.

The user: What offers does it have?

The intelligent robot: Our bank's credit card gives a 10% discount on payments. Do you want to know how to download mobile banking?

The user: I want to know.

In the above dialog paragraph, the pronoun "it" in the sentence "What offers does it have?" refers to "credit card", so a reference relationship exists in that sentence. In the sentence "I want to know", the user has deliberately omitted the content "how to download mobile banking", so an omission relationship exists there, and the omitted part is called a zero pronoun.

In this case, the processing device may alternatively divide the paragraph into sentences, and each sentence in the division result may be regarded as a to-be-processed sentence. Then, to prevent zero pronouns and/or personal pronouns in a to-be-processed sentence from affecting the generation of the triple relationships of its named entities, the processing device may also determine whether the to-be-processed sentence contains a pronoun. If it does, a model trained on the reading comprehension principle is used to determine, from the sentences preceding the to-be-processed sentence, the content that the pronoun refers to, and the to-be-processed sentence is completed with that content. The completed to-be-processed sentence is then input into the sequence labeling model, that is, step 201 is performed on the completed sentence to obtain its labeling sequence, and steps 202 to 204 are performed to generate the triple relationships of the named entities in each to-be-processed sentence.

Pronouns in the sentence to be processed can also be recognized by means of a sequence labeling model; that is, the position of a pronoun in the sentence to be processed is determined according to the BIO sequence output by the sequence labeling model. It should be noted that the sequence labeling model for identifying pronouns may be a different model from the sequence labeling model used in step 201 of the embodiment shown in fig. 2.

On the basis of the above description, in practical applications, when the to-be-processed sentence input by the user is a paragraph, a common practice for ensuring the accuracy of triple relationship generation is as follows. If the length of the paragraph exceeds the second preset length, the paragraph is first divided into sentences and the reference resolution described above is performed, so as to obtain the named entities contained in the paragraph. The generative model is then used to generate the triple relationships of the named entities contained in each sentence of the paragraph. In this way, what is input into the generative model is a sentence pair, which is clearly shorter than the whole paragraph, so the information contained in the sentence is not lost while the triple relationship is generated, and the generation is controllable and accurate. If the length of the paragraph does not exceed the second preset length, the paragraph is input directly into the sequence labeling model for subsequent processing, and pronouns in the paragraph need not be considered. The first preset length is smaller than the second preset length.
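The two length thresholds lead to three processing routes, which the following sketch makes explicit. The route names, the function name `choose_route`, and the threshold values in the usage lines are hypothetical; only the ordering of the two thresholds comes from the description.

```python
def choose_route(text: str, first_preset_length: int, second_preset_length: int) -> str:
    """Pick a processing route from the input length and the two preset lengths."""
    if first_preset_length >= second_preset_length:
        raise ValueError("the first preset length must be smaller than the second")
    length = len(text)
    if length <= first_preset_length:
        # Short input: handled as a single sentence.
        return "single_sentence"
    if length <= second_preset_length:
        # Paragraph of moderate length: fed to the sequence labeling model
        # as a whole, without pronoun handling.
        return "paragraph_direct"
    # Long paragraph: divide into sentences and resolve pronouns first.
    return "paragraph_split_and_resolve"


route_short = choose_route("x" * 10, first_preset_length=20, second_preset_length=100)
route_medium = choose_route("x" * 50, first_preset_length=20, second_preset_length=100)
route_long = choose_route("x" * 500, first_preset_length=20, second_preset_length=100)
```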

As described above, the embodiments provided by the present invention all generate the triple relationships in the statement to be processed in a generative manner. The triple relationships may be generated by a generative model. Fig. 3 is a flowchart of another text data processing method provided in the embodiment of the present invention, and as shown in fig. 3, the method may include the following steps:

301. Acquiring a first named entity contained in the statement to be processed.

302. Forming a statement pair corresponding to the first named entity from the statement to be processed and a template statement containing the first named entity.

The specific implementation processes of the steps 301 to 302 can refer to the related descriptions in the embodiment shown in fig. 1, and are not described herein again.

303. Inputting the statement pair into the generative model, so that the generative model outputs the attribute relation sequence corresponding to the first named entity.

304. Generating a triple relation containing the first named entity according to the first named entity and the attribute relation sequence, wherein the attribute relation sequence contains the predicate and the second named entity of the triple relation.

After steps 301 and 302 above, the sentence pair is available. The sentence pair is then input into a generative model, and the generative model outputs the attribute relationship sequence corresponding to the first named entity. Finally, a triple relationship for the first named entity is constructed from the first named entity and the attribute relationship sequence.

The attribute relationship sequence may include at least one attribute relationship, and the first named entity together with each attribute relationship forms one triple relationship for the first named entity. That is, the number of triple relationships equals the number of attribute relationships in the attribute relationship sequence, and both are at least one. The attribute relationship sequence is composed of a plurality of words, and one attribute relationship may contain two or more words, namely the predicate of one triple relationship and at least one second named entity.

For example, if the sentence to be processed is "Wu Cheng'en, courtesy name Ruzhong, art name Sheyang Shanren" and the first named entity is "Wu Cheng'en", the sentence pair "what are the relations about Wu Cheng'en [SEP] Wu Cheng'en, courtesy name Ruzhong, art name Sheyang Shanren" is input into the generative model, where "[SEP]" serves as the separator between the two sentences. The attribute relationship sequence output by the generative model is "courtesy name, Ruzhong" and "art name, Sheyang Shanren".

The attribute relationship sequence comprises two attribute relationships, "courtesy name, Ruzhong" and "art name, Sheyang Shanren". In this case, two triple relationships can be generated for Wu Cheng'en: (Wu Cheng'en, courtesy name, Ruzhong) and (Wu Cheng'en, art name, Sheyang Shanren). Here "courtesy name" and "art name" are the predicates of the two triple relationships, and "Ruzhong" and "Sheyang Shanren" are their respective second named entities.

Optionally, when the generative model is a sequence-to-sequence model, it may work as follows: the encoder in the model encodes the received statement pair into a fixed-length statement vector. The statement vector is then input into the decoder in the model, which generates the first word of the attribute relationship sequence from the statement vector, generates the second word from the first word, and so on until an end identifier is generated. The sequence-to-sequence model is in effect a Recurrent Neural Network (RNN) model.
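The word-by-word decoding described above can be sketched as a simple loop. This is not a trained model: `next_word` is a hypothetical stand-in for the decoder step, the end identifier `<eos>` and the `max_len` safeguard are assumptions, and `scripted_next_word` merely replays a fixed output so that the loop can be run.

```python
from typing import Callable, List, Sequence

END_IDENTIFIER = "<eos>"  # hypothetical end identifier


def greedy_decode(
    sentence_vector: Sequence[float],
    next_word: Callable[[Sequence[float], List[str]], str],
    max_len: int = 50,
) -> List[str]:
    """Generate words one at a time until the end identifier or max_len is reached."""
    words: List[str] = []
    while len(words) < max_len:
        # Each step sees the sentence vector and the words generated so far.
        word = next_word(sentence_vector, words)
        if word == END_IDENTIFIER:
            break
        words.append(word)
    return words


def scripted_next_word(sentence_vector: Sequence[float], history: List[str]) -> str:
    """Toy decoder step that replays a fixed sequence instead of using learned weights."""
    script = ["courtesy name", ",", "Ruzhong", END_IDENTIFIER]
    return script[len(history)]


generated = greedy_decode([0.0, 0.0], scripted_next_word)
```

In a real sequence-to-sequence model, the decoder step would compute a probability distribution over the vocabulary from its hidden state and pick the next word from it.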

In this embodiment, the triple relationship is generated by a generative model. In this way, both the explicit triple relationships of the first named entity that appear literally in the statement to be processed and the implicit triple relationships of the first named entity can be generated. In addition, because the triple relationship is generated from a statement pair in which both statements contain the same named entity, the scope of generation is restricted to the triple relationships of a single named entity; the attribute relationship sequence is therefore short and the number of triple relationships small, which helps ensure the accuracy of the generated triple relationships.

In practical applications, the words contained in the attribute relationship sequence output by the generative model may be separated by spacers. A triple relationship for the named entity can be generated only when the spacers between the words meet a preset condition, that is, when the attribute relationship sequence conforms to a preset format specification. Optionally, the format specification for the attribute relationship sequence may be: attribute relationships are separated from one another by ";", the predicate and the second named entity within one attribute relationship are separated by ",", and multiple second named entities within one attribute relationship are separated by "、".

Based on the above format specification, assume that the attribute relationship sequence output by the generative model is: P_1,O_{1-1}、O_{1-2}、…、O_{1-n};P_2,O_{2-1}、O_{2-2}、…、O_{2-n};…;P_m,O_{m-1}、O_{m-2}、…、O_{m-n}. The sequence can then be divided according to ";" into m attribute relationships: P_1,O_{1-1}、O_{1-2}、…、O_{1-n} is one attribute relationship, P_2,O_{2-1}、O_{2-2}、…、O_{2-n} is another, and so on. According to ",", the m predicates contained in the sequence, namely P_1, …, P_m, can be determined. According to "、", the n second named entities contained in each attribute relationship can be determined, for example O_{1-1}, O_{1-2}, …, O_{1-n} for the first attribute relationship.
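The splitting just described can be sketched as follows. This is only an illustration under the assumption that the three spacers are exactly ";", "," and "、"; the function name `parse_sequence` and the constant names are hypothetical, and no validity checking is performed here.

```python
from typing import List, Tuple

REL_SEP = ";"   # assumed spacer between attribute relationships
PRED_SEP = ","  # assumed spacer between a predicate and its second named entities
ENT_SEP = "、"  # assumed spacer between second named entities


def parse_sequence(sequence: str) -> List[Tuple[str, List[str]]]:
    """Split an attribute relationship sequence into (predicate, entities) pairs."""
    relations: List[Tuple[str, List[str]]] = []
    for chunk in sequence.split(REL_SEP):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece left by a trailing ";"
        # Everything before the first "," is the predicate; the rest lists entities.
        predicate, _, rest = chunk.partition(PRED_SEP)
        entities = [e.strip() for e in rest.split(ENT_SEP) if e.strip()]
        relations.append((predicate.strip(), entities))
    return relations
```

For a sequence with m predicates the returned list has m entries, one per attribute relationship.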

Based on the above description, after the embodiment shown in fig. 3 is executed to obtain the attribute relationship sequence of the to-be-processed statement, optionally, the validity of each attribute relationship in the attribute relationship sequence may also be verified, and the triple relationship may be generated according to the valid attribute relationship.

Then, as shown in fig. 4, an optional specific implementation manner of step 304, that is, the process of generating the triple relationship according to the first named entity and attribute relationship sequence may include the following steps:

3041. Dividing the attribute relationship sequence into at least one attribute relationship according to the spacers between the words in the attribute relationship sequence.

3042. Determining the respective validity of the at least one attribute relationship.

In practical applications, the attribute relationship sequence generated by the generative model is prone to the following problems: first, the format of the attribute relationship sequence may be non-standard; second, because the embodiments of the present invention produce the attribute relationship sequence by generation, the words that may be generated in an attribute relationship are unrestricted. Either problem can render the attribute relationship sequence wholly or partially invalid.

For the first problem, given the format specification above, a non-standard format typically manifests as: two attribute relationships are not separated by ";", the predicate and the second named entity in the same attribute relationship are not separated by ",", multiple second named entities in the same attribute relationship are not separated by "、", and so on.

The second problem typically manifests as a second named entity in the attribute relationship sequence that does not appear in the statement to be processed.

Attribute relationships that exhibit either of the above problems are typically filtered out as invalid attribute relationships.

Based on the above description, the attribute relationship sequence may be divided into at least one attribute relationship according to the spacers between its words and the above format specification; for example, the words between two adjacent ";" form one attribute relationship. Then, whether each attribute relationship is valid is determined, that is, whether it exhibits either of the above two problems.

Since the determination processes of the validity of the attribute relationships are the same, the description will be given by taking any one of the at least one attribute relationship, i.e., the target attribute relationship, as an example.

Based on the above format specification, in one optional validity determination manner, if the first spacer in the target attribute relationship is not ",", that is, the first two words are not separated by ",", or the second spacer in the target attribute relationship is not "、", that is, the words after the first word are not separated from one another by "、", or both conditions hold, the target attribute relationship is determined to be invalid and is filtered out.

Also based on the format specification, in another optional validity determination manner, after the predicate and the second named entity in the target attribute relationship are identified according to the spacers between the words in the target attribute relationship, it is further determined whether the second named entity is contained in the statement to be processed. If the second named entity is not contained in the statement to be processed, which indicates that the generative model has generated a second named entity outside the permitted range, the target attribute relationship is determined to be invalid and is filtered out.

In practice, the first named entity in the statement to be processed may also have an implicit triple relationship. The implicitness of such a relationship usually lies in its predicate: the predicate of the triple relationship may not appear in the statement to be processed, whereas the first named entity and the second named entity of the triple relationship are usually required to appear in it. Applying the above validity determination therefore avoids triple relationships whose named entities do not appear in the statement to be processed.

It should be noted that, in practical applications, the validity of the attribute relationship may be determined by using the above two methods at the same time.

3043. Constructing triple relationships containing the first named entity according to the valid attribute relationships and the first named entity, wherein the number of triple relationships is the same as the number of valid attribute relationships.

Finally, each valid attribute relationship is combined with the first named entity to form a triple relationship corresponding to the first named entity. The number of triple relationships is the same as the number of valid attribute relationships.

In this embodiment, validity verification is performed on each attribute relationship in the attribute relationship sequence through different determination rules. And finally, generating a triple relation according to the effective attribute relation and the first named entity, and ensuring the accuracy of the triple relation.
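Steps 3041 to 3043 can be sketched together as below. This is a minimal illustration, not the claimed implementation: the spacers ";", "," and "、" are assumed, the names `split_relation` and `build_triples` are hypothetical, and the containment check is a plain substring test. One triple is produced per valid attribute relationship, with all of its second named entities kept together in a tuple.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, Tuple[str, ...]]


def split_relation(relation: str) -> Optional[Tuple[str, Tuple[str, ...]]]:
    """Rule 1 (format): return (predicate, entities), or None if the spacers are wrong."""
    if relation.count(",") != 1:
        return None  # predicate and entities must be separated by exactly one ","
    predicate, rest = (part.strip() for part in relation.split(","))
    entities = tuple(e.strip() for e in rest.split("、"))
    if not predicate or not all(entities):
        return None  # empty predicate or empty entity
    return predicate, entities


def build_triples(first_entity: str, sequence: str, statement: str) -> List[Triple]:
    """Divide the sequence, drop invalid attribute relationships, build triples."""
    triples: List[Triple] = []
    for relation in sequence.split(";"):  # step 3041
        relation = relation.strip()
        if not relation:
            continue
        parsed = split_relation(relation)  # step 3042, rule 1
        if parsed is None:
            continue
        predicate, entities = parsed
        # Step 3042, rule 2 (range): every second named entity must occur in the statement.
        if not all(entity in statement for entity in entities):
            continue
        triples.append((first_entity, predicate, entities))  # step 3043
    return triples
```

With the statement "Wu Cheng'en, courtesy name Ruzhong, art name Sheyang Shanren", a generated relationship such as "birthplace,Beijing" would be discarded by the second rule because "Beijing" does not occur in the statement.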

In addition, the process of generating the attribute relationship sequence by using the generation model has already been mentioned in the embodiment shown in fig. 3, and the accuracy of the attribute relationship sequence directly affects the accuracy of the generation of the subsequent triple relationship. Fig. 5 is a flowchart of another text data processing method according to an embodiment of the present invention, and as shown in fig. 5, the method may include the following steps:

401. Obtaining a sample named entity contained in a sample statement.

402. Forming a sample statement pair corresponding to the sample named entity from the sample statement and a template statement containing the sample named entity.

A sample statement and a sample named entity contained in the sample statement may be obtained. The sample named entity is then substituted into the preset statement template to obtain a complete template statement, and the sample statement and the complete template statement form the sample statement pair corresponding to the sample named entity.

The sample statement, the sample named entity, and the sample statement pair are obtained in the same manner as the statement to be processed, the first named entity in the statement to be processed, and the statement pair corresponding to the first named entity; for details, refer to the descriptions of steps 101 to 102 in the embodiment shown in fig. 1.
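Forming a statement pair from a preset statement template can be sketched as below. The template wording, the placeholder name `entity`, and the function name `make_statement_pair` are assumptions made for illustration; the template statement is placed before "[SEP]" and the original statement after it, following the order used in this description's examples.

```python
TEMPLATE = "What are the relationships about {entity}?"  # assumed template wording
SEP = "[SEP]"  # separator between the two statements


def make_statement_pair(statement: str, entity: str) -> str:
    """Fill the named entity into the template and join it with the statement."""
    if entity not in statement:
        # The pair is only meaningful when both statements contain the entity.
        raise ValueError(f"entity {entity!r} does not occur in the statement")
    template_statement = TEMPLATE.format(entity=entity)
    return f"{template_statement} {SEP} {statement}"
```

The same helper would serve both for sample statements during training and for statements to be processed at inference time, since both kinds of pair are built in the same manner.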

403. Inputting the sample statement pair into the generative model, so that the generative model outputs an attribute relationship sequence corresponding to the sample named entity and a prediction probability matrix corresponding to the attribute relationship sequence.

404. Adjusting the model parameters of the generative model according to the prediction probability matrix and a preset expected probability matrix.

Then, the sample statement pair is input into the generative model, and the generative model outputs the attribute relationship sequence corresponding to the sample named entity and the prediction probability matrix corresponding to the attribute relationship sequence. The cross entropy between the prediction probability matrix and the expected probability matrix is then calculated and used as the loss value for adjusting the model parameters, and the model parameters are adjusted until the model converges. Both the prediction probability matrix output by the model and the preset expected probability matrix reflect, for each word position in the attribute relationship sequence, the probability of each candidate word at that position.

Assuming a preset word library containing 10,000 words, the prediction probability matrix may include, for the first word in the attribute relationship sequence, the probability that it is each of the 10,000 words, and likewise for the second word in the attribute relationship sequence; the remaining words are handled similarly.

Because the attribute relationship sequence corresponding to the sample statement can be constructed manually during training, each word in the attribute relationship sequence is known. For example, if the manually constructed attribute relationship sequence is "AB, CD;", then in the expected probability matrix the probability that the first word in the attribute relationship sequence is A is 1 and the probability that it is any other word in the word library is 0; the probability that the second word is B is 1 and the probability that it is any other word in the word library is 0; the remaining words are handled similarly.
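The expected probability matrix and the cross-entropy loss can be illustrated with a small, dependency-free sketch. The four-word word library, the function names, the averaging over word positions, and the small constant `eps` are assumptions for illustration; the description only states that cross entropy between the two matrices is used as the loss value.

```python
import math
from typing import List, Sequence


def one_hot_matrix(target: Sequence[str], vocab: Sequence[str]) -> List[List[float]]:
    """Expected probability matrix: one row per position, 1 for the known word, else 0."""
    return [[1.0 if word == t else 0.0 for word in vocab] for t in target]


def cross_entropy(
    predicted: Sequence[Sequence[float]],
    expected: Sequence[Sequence[float]],
    eps: float = 1e-12,
) -> float:
    """Cross entropy between the two matrices, averaged over word positions."""
    total = 0.0
    for p_row, e_row in zip(predicted, expected):
        # eps guards against log(0) when a predicted probability is exactly zero.
        total -= sum(e * math.log(p + eps) for p, e in zip(p_row, e_row))
    return total / len(expected)


vocab = ["A", "B", ",", ";"]                    # toy word library
expected = one_hot_matrix(["A", "B"], vocab)    # first word A, second word B
predicted = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1]]
loss = cross_entropy(predicted, expected)
```

Because each expected row is one-hot, only the predicted probability of the correct word at each position contributes to the loss, so the loss falls as those probabilities approach 1.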

In this embodiment, the generative model is trained with statement pairs containing the sample named entity, so that the generative model acquires the ability to generate the triple relationships of a single named entity (i.e., the sample named entity). The generation of triple relationships is thereby restricted; because the triple relationships generated for a single named entity are of limited length, the accuracy of the generated triple relationships can also be ensured.

The embodiment shown in fig. 5 is a description of a training process of a generative model used in a triplet relationship generation process in a scenario of triplet relationship generation. In practical applications, for other scenarios using the generative model to generate the attribute relationship sequence, training of the generative model may also be performed separately in the following manner. Fig. 6 is a flowchart of another text data processing method according to an embodiment of the present invention, and as shown in fig. 6, the method may include the following steps:

501. Obtaining a sample named entity contained in a sample statement.

502. Forming a sample statement pair corresponding to the sample named entity from the sample statement and a template statement containing the sample named entity.

503. Inputting the sample statement pair into the generative model, so that the generative model outputs an attribute relationship sequence corresponding to the sample named entity and a prediction probability matrix corresponding to the attribute relationship sequence.

504. Adjusting the model parameters of the generative model according to the prediction probability matrix and a preset expected probability matrix.

Details that are not described in detail in the embodiment shown in fig. 6 and technical effects that can be achieved can be referred to the related description in the embodiment shown in fig. 5, and are not described again here.

For ease of understanding, a specific implementation of the text data processing method provided above is exemplarily described in connection with the following application scenarios.

A user may obtain, through the Internet, the statement to be processed: "The author of Journey to the West is Wu Cheng'en. Wu Cheng'en, courtesy name Ruzhong, art name Sheyang Shanren or Sheyang Jushi." For brevity in the subsequent description, the statement "The author of Journey to the West is Wu Cheng'en." is referred to as statement A, and the statement "Wu Cheng'en, courtesy name Ruzhong, art name Sheyang Shanren or Sheyang Jushi." is referred to as statement B.

The statement to be processed is input into a sequence labeling model, and the BIO labeling sequence output by the labeling model is as follows: OBIIOOOBIIO BIIOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO. According to the labeling sequence, the first named entities contained in the statement to be processed can be obtained, namely "Journey to the West" and "Wu Cheng'en".
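Recovering named entities from a BIO labeling sequence can be sketched as follows. The word-level English tokens and their tags below are an illustrative assumption and do not reproduce the labeling sequence quoted above; the function name `extract_entities` is likewise hypothetical.

```python
from typing import List, Sequence


def extract_entities(tokens: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Collect each run of tokens tagged B followed by zero or more I tags."""
    entities: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:  # a new entity starts right after another one
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:  # "O", or a stray "I" with no preceding "B"
            if current:
                entities.append(" ".join(current))
                current = []
    if current:  # entity that runs to the end of the statement
        entities.append(" ".join(current))
    return entities


tokens = ["The", "author", "of", "Journey", "to", "the", "West", "is", "Wu", "Cheng'en", "."]
tags = ["O", "O", "O", "B", "I", "I", "I", "O", "B", "I", "O"]
entities = extract_entities(tokens, tags)
```

For character-level labeling of Chinese text, the tokens would be single characters and the join would use an empty string rather than a space.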

Next, for the first named entity "Journey to the West", statement A and template statement A1, "What are the relationships about Journey to the West?", form one statement pair. For the first named entity "Wu Cheng'en", statement B and template statement B1, "What are the relationships about Wu Cheng'en?", form another statement pair.

Further, the statement pair "What are the relationships about Journey to the West? [SEP] The author of Journey to the West is Wu Cheng'en." is input into the generative model. The attribute relationship sequence output by the generative model is: "author, Wu Cheng'en;". The attribute relationship sequence contains one ";", so it contains one attribute relationship, and that attribute relationship is valid. Therefore, a triple relationship for "Journey to the West" can be generated: (Journey to the West, author, Wu Cheng'en).

Similarly, the other statement pair "What are the relationships about Wu Cheng'en? [SEP] Wu Cheng'en, courtesy name Ruzhong, art name Sheyang Shanren or Sheyang Jushi." is input into the generative model. The attribute relationship sequence output by the generative model is: "courtesy name, Ruzhong; art name, Sheyang Shanren、Sheyang Jushi;".

The attribute relationship sequence contains two ";", so it can be divided into two attribute relationships, both of which are valid. Therefore, triple relationships for "Wu Cheng'en" can be generated: (Wu Cheng'en, courtesy name, Ruzhong), (Wu Cheng'en, art name, Sheyang Shanren), and (Wu Cheng'en, art name, Sheyang Jushi).

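The triple relationships obtained above can be converted into the following tabular form:

First named entity | Predicate | Second named entity
Journey to the West | author | Wu Cheng'en
Wu Cheng'en | courtesy name | Ruzhong
Wu Cheng'en | art name | Sheyang Shanren
Wu Cheng'en | art name | Sheyang Jushi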

In the above process, for each statement pair the generative model generates the triple relationships of only a single named entity, so the length of the generated output is limited, which further helps ensure the accuracy of the generated triple relationships.

The text data processing apparatus of one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that these text data processing devices can be constructed by configuring the steps taught in the present embodiment using commercially available hardware components.

Fig. 7 is a schematic structural diagram of a text data processing apparatus according to an embodiment of the present invention, and as shown in fig. 7, the apparatus includes:

An obtaining module 11, configured to obtain a first named entity contained in a statement to be processed.

A constructing module 12, configured to form a statement pair corresponding to the first named entity from the statement to be processed and a template statement containing the first named entity.

A generating module 13, configured to generate, according to the statement pair, a triple relationship including the first named entity.

Optionally, the obtaining module 11 includes:

An input unit 111, configured to input the statement to be processed into a sequence labeling model, so that the sequence labeling model outputs a labeling sequence corresponding to the statement to be processed.

A determining unit 112, configured to determine the first named entity from the output annotation sequence.

Optionally, the input unit 111 is specifically configured to: if the sentence to be processed contains pronouns, determining the referring content corresponding to the pronouns according to the previous sentence of the sentence to be processed; completing the sentence to be processed according to the reference content; and inputting the supplemented sentence to be processed into the sequence labeling model.

Optionally, the generating module 13 includes:

An input unit 131, configured to input the statement pair into a generative model, so that the generative model outputs an attribute relationship sequence corresponding to the first named entity.

A generating unit 132, configured to generate a triple relationship including the first named entity according to the first named entity and the attribute relationship sequence, where the attribute relationship sequence includes a predicate in the triple relationship and a second named entity.

Optionally, the generating unit 132 is specifically configured to: dividing the attribute relation sequence into at least one attribute relation according to the interval symbol among the words in the attribute relation sequence;

determining respective validity of the at least one attribute relationship;

and constructing a triple relation containing the first named entity according to the effective attribute relation and the first named entity, wherein the number of the triple relation is the same as that of the effective attribute relation.

Optionally, the generating unit 132 is specifically configured to: and if the interval symbol between the words in the target attribute relationship does not meet the preset requirement, determining that the target attribute relationship is invalid, wherein the target attribute relationship is any one of the at least one attribute relationship.

Optionally, the generating unit 132 is specifically configured to: identifying predicates and a second named entity in the triple relationship according to spacers among the words in the target attribute relationship, wherein the target attribute relationship is any one attribute relationship in the at least one attribute relationship;

and if the second named entity in the target attribute relationship is not contained in the statement to be processed, determining that the target attribute relationship is invalid.

Optionally, the apparatus further comprises:

the obtaining module 11 is configured to obtain a sample named entity included in a sample statement.

The constructing module 12 is configured to construct a sample statement pair corresponding to the named entity from the sample statement and the template statement including the sample named entity.

An input module 21, configured to input the sample statement pair into the generative model, so that the generative model outputs the attribute relationship sequence corresponding to the sample named entity and the prediction probability matrix corresponding to the attribute relationship sequence.

An adjusting module 22, configured to adjust the model parameters of the generative model according to the prediction probability matrix and a preset expected probability matrix.

The apparatus shown in fig. 7 may execute the text data processing method provided in the embodiments shown in fig. 1 to fig. 5, and for the parts not described in detail in this embodiment, reference may be made to the related descriptions of the embodiments shown in fig. 1 to fig. 5, which are not described again here.

Having described the internal functions and structure of the text data processing apparatus, in one possible design, the structure of the text data processing apparatus may be implemented as an electronic device, as shown in fig. 8, which may include: a processor 31 and a memory 32. Wherein, the memory 32 is used for storing a program for supporting the electronic device to execute the text data processing method provided in the foregoing embodiments shown in fig. 1 to 5, and the processor 31 is configured to execute the program stored in the memory 32.

The program comprises one or more computer instructions which, when executed by the processor 31, are capable of performing the steps of:

acquiring a first named entity contained in a statement to be processed;

Optionally, the processor 31 is further configured to perform all or part of the steps in the foregoing embodiments shown in fig. 1 to 5.

The electronic device may further include a communication interface 33 for communicating with other devices or a communication network.

Additionally, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform at least the following:

acquiring a first named entity contained in a statement to be processed;

Fig. 9 is a schematic structural diagram of another text data processing apparatus according to an embodiment of the present invention. As shown in fig. 9, the apparatus includes:

An obtaining module 41, configured to obtain a sample named entity contained in a sample statement.

A constructing module 42, configured to form a sample statement pair corresponding to the sample named entity from the sample statement and a template statement containing the sample named entity.

An input module 43, configured to input the sample statement pair into a generative model, so that the generative model outputs the attribute relationship sequence corresponding to the sample named entity and the prediction probability matrix corresponding to the attribute relationship sequence.

An adjusting module 44, configured to adjust the model parameters of the generative model according to the prediction probability matrix and a preset expected probability matrix.

The apparatus shown in fig. 9 may execute the text data processing method provided in the embodiment shown in fig. 6, and for parts not described in detail in this embodiment, reference may be made to the related description of the embodiment shown in fig. 6, which is not repeated herein.

Having described the internal functions and structure of the text data processing apparatus, in one possible design, the structure of the text data processing apparatus may be implemented as an electronic device, as shown in fig. 10, which may include: a processor 51 and a memory 52. Wherein, the memory 52 is used for storing a program for supporting the electronic device to execute the text data processing method provided in the foregoing embodiment shown in fig. 6, and the processor 51 is configured to execute the program stored in the memory 52.

The program comprises one or more computer instructions which, when executed by the processor 51, are capable of performing the steps of:

obtaining a sample named entity contained in a sample statement;

Optionally, the processor 51 is further configured to perform all or part of the steps in the foregoing embodiment shown in fig. 6.

The electronic device may further include a communication interface 53 for communicating with other devices or a communication network.

obtaining a sample named entity contained in a sample statement;

and calculating and adjusting the model parameters of the generated model according to the prediction probability matrix and a preset expected probability matrix.

The above-described apparatus embodiments are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by means of a necessary general-purpose hardware platform, and of course can also be implemented by a combination of hardware and software. Based on this understanding, the essence of the above technical solutions, or the part thereof that contributes to the prior art, may be embodied in the form of a computer product.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A text data processing method, comprising:

acquiring a first named entity contained in a statement to be processed;

2. The method of claim 1, wherein obtaining the first named entity contained in the to-be-processed statement comprises:

inputting the sentence to be processed into a sequence labeling model, and outputting a labeling sequence corresponding to the sentence to be processed by the sequence labeling model;

determining the first named entity from the output annotated sequence.

3. The method of claim 2, wherein the inputting the sentence into the sequence annotation model comprises:

if the sentence to be processed contains pronouns, determining the referring content corresponding to the pronouns according to the previous sentence of the sentence to be processed;

completing the sentence to be processed according to the reference content;

and inputting the supplemented sentence to be processed into the sequence labeling model.

4. The method of claim 1, wherein generating the triple relationship containing the first named entity from the statement pair comprises:

inputting the statement pair into a generative model, and outputting an attribute relationship sequence corresponding to the first named entity by the generative model;

and generating a triple relationship containing the first named entity according to the first named entity and the attribute relationship sequence, wherein the attribute relationship sequence contains a predicate in the triple relationship and a second named entity.

5. The method of claim 4, wherein generating the triple relationship containing the first named entity from the first named entity and the sequence of attribute relationships comprises:

dividing the attribute relation sequence into at least one attribute relation according to the interval symbol among the words in the attribute relation sequence;

determining respective validity of the at least one attribute relationship;

6. The method of claim 5, wherein the determining the respective validity of the at least one attribute relationship comprises:

and if the interval symbol between the words in the target attribute relationship does not meet the preset requirement, determining that the target attribute relationship is invalid, wherein the target attribute relationship is any one of the at least one attribute relationship.

7. The method according to claim 5 or 6, wherein the determining the respective validity of the at least one attribute relationship comprises:

identifying predicates and a second named entity in the triple relationship according to spacers among the words in the target attribute relationship, wherein the target attribute relationship is any one attribute relationship in the at least one attribute relationship;

8. The method of claim 4, further comprising:

obtaining a sample named entity contained in a sample statement;

inputting the sample statement pair into the generation model, and outputting the attribute relation sequence corresponding to the sample named entity and the prediction probability matrix corresponding to the attribute relation sequence by the generation model;

9. A text data processing method, comprising:

obtaining a sample named entity contained in a sample statement;

10. A text data processing apparatus, characterized by comprising:

11. An electronic device, comprising: a processor and a memory; wherein the memory is to store one or more computer instructions that when executed by the processor implement:

acquiring a first named entity contained in a statement to be processed;

12. A computer-readable storage medium storing computer instructions, which when executed by one or more processors, cause the one or more processors to perform at least the following acts:

acquiring a first named entity contained in a statement to be processed;

13. A text data processing apparatus, characterized by comprising:

14. An electronic device, comprising: a processor and a memory; wherein the memory is to store one or more computer instructions that when executed by the processor implement:

obtaining a sample named entity contained in a sample statement;

15. A computer-readable storage medium storing computer instructions, which when executed by one or more processors, cause the one or more processors to perform at least the following acts:

obtaining a sample named entity contained in a sample statement;