CN110807311B - Method and device for generating information - Google Patents

Publication number
CN110807311B
Authority
CN
China
Prior art keywords
sentence
triplet
information
processed
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810791223.2A
Other languages
Chinese (zh)
Other versions
CN110807311A (en)
Inventor
沈之锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810791223.2A
Publication of CN110807311A
Application granted
Publication of CN110807311B

Abstract

The embodiments of the application disclose a method and a device for generating information. One embodiment of the method comprises: acquiring text information to be processed, wherein the text information to be processed comprises at least one sentence; extracting sentences meeting a first preset condition from the at least one sentence to form a sentence set; extracting subjects, predicates and objects from sentences in the sentence set to form triples, wherein parallel words exist in the objects; combining the formed triples into a triplet set, and selecting triples from the triplet set as target triples; extracting parallel words from the object in a target triplet, taking the subject in the target triplet as a parent knowledge point, taking the words in the extracted parallel words as child knowledge points, and generating parent-child relationship information indicating the parent-child relationship between the parent knowledge point and the child knowledge points. The embodiment realizes the mining of parent-child relationships between knowledge points.

Description

Method and device for generating information
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method and a device for generating information.
Background
Both when a user is learning and when a knowledge graph is being constructed, there is a very frequent need to obtain the parent node or the child nodes of a given knowledge point. Doing so helps the user understand the broader field associated with the knowledge point, and also lets the user break the knowledge point down into smaller knowledge points. Mining the parent-child relationships between knowledge points is therefore of considerable significance.
Disclosure of Invention
The embodiment of the application provides a method and a device for generating information.
In a first aspect, embodiments of the present application provide a method for generating information, the method including: acquiring text information to be processed, wherein the text information to be processed comprises at least one sentence; extracting sentences meeting a first preset condition from the at least one sentence to form a sentence set; extracting subjects, predicates and objects from sentences in the sentence set to form triples, wherein parallel words exist in the objects; combining the formed triples into a triplet set, and selecting triples from the triplet set as target triples; extracting parallel words from the object in a target triplet, taking the subject in the target triplet as a parent knowledge point, taking the words in the extracted parallel words as child knowledge points, and generating parent-child relationship information indicating the parent-child relationship between the parent knowledge point and the child knowledge points.
In some embodiments, the first preset condition comprises: the sentence includes a keyword in a preset keyword set and a character in a preset character set, wherein the keywords in the keyword set are verbs for stating the subject of a sentence or adverbs for modifying such verbs; the characters in the character set are conjunctions or punctuation marks for connecting words in a parallel relationship in a sentence.
In some embodiments, extracting sentences satisfying the first preset condition from at least one sentence includes: for a sentence in the at least one sentence, determining whether the sentence includes keywords in a keyword set; if yes, further determining whether the sentence comprises characters in the character set; if the sentence includes characters in the character set, the sentence is extracted.
In some embodiments, the first preset condition further comprises: the length of the sentence is not more than the preset word number; and extracting sentences satisfying the first preset condition from at least one sentence, comprising: for a sentence in the at least one sentence, determining whether the sentence includes keywords in a keyword set; if yes, further determining whether the sentence comprises characters in the character set; if the sentence comprises characters in the character set, further determining whether the length of the sentence is greater than a preset word number; if the length of the sentence is not greater than the preset word number, extracting the sentence.
In some embodiments, extracting subjects, predicates and objects from sentences in the sentence set includes: for a sentence in the sentence set, taking the sentence as a sentence to be processed, performing syntactic analysis and semantic role analysis on the sentence to be processed to obtain an analysis result, and extracting a subject, a predicate and an object from the sentence to be processed based on the analysis result.
In some embodiments, the analysis result includes first labeling information for indicating a core verb in the sentence to be processed and second labeling information for indicating the agent portion of the core verb; and extracting a subject, a predicate and an object from the sentence to be processed based on the analysis result includes: determining whether the analysis result further includes third labeling information for indicating the patient portion of the core verb; if the third labeling information is included, determining the agent portion indicated by the second labeling information, the core verb indicated by the first labeling information and the patient portion indicated by the third labeling information as the subject, predicate and object of the sentence to be processed, in that order, and extracting the determined subject, predicate and object from the sentence to be processed.
In some embodiments, the analysis result further includes at least one piece of fourth labeling information, the fourth labeling information being used to indicate a verb-object relationship between the core verb and a word other than the core verb in the sentence to be processed; and extracting a subject, a predicate and an object from the sentence to be processed based on the analysis result further includes: in response to determining that the analysis result does not include the third labeling information, determining, among the at least one piece of fourth labeling information, target fourth labeling information meeting a second preset condition, extracting a phrase from the sentence to be processed as the object based on the target fourth labeling information, taking the agent portion indicated by the second labeling information and the core verb indicated by the first labeling information as the subject and predicate of the sentence to be processed, respectively, and extracting the determined subject, predicate and object from the sentence to be processed.
In some embodiments, selecting a triplet from the triplet set as the target triplet includes: obtaining a target classification model, wherein the target classification model is a trained classification model for predicting whether the relationship among subjects, predicates and objects in the triplet is correct; selecting triples with correct relationships among included subjects, predicates and objects from the triples based on the target classification model to form a first triples set; and selecting a triplet from the first triplet set as a target triplet.
In some embodiments, obtaining the target classification model includes: acquiring labeling information of at least one triplet in the triplet set, wherein the labeling information is used for indicating whether the relation among subjects, predicates and objects in the corresponding triplet is correct; for a triplet in at least one triplet, carrying out feature extraction on the triplet to obtain feature information, inputting the feature information of the triplet into an initial model to obtain a prediction result corresponding to the triplet, wherein the prediction result is used for indicating whether the relation among subjects, predicates and objects in the triplet is correct; comparing the prediction result with the labeling information of the triplet, and determining whether the initial model reaches a preset optimization target according to the comparison result; in response to determining that the initial model meets the optimization objective, the initial model is taken as the objective classification model.
In some embodiments, selecting a triplet from the first set of triples as the target triplet includes: performing preset disambiguation operation on triples in the first triplet set; and taking the triples in the first triplet set after the disambiguation operation as target triples.
In a second aspect, embodiments of the present application provide an apparatus for generating information, the apparatus comprising: an acquisition unit configured to acquire text information to be processed, wherein the text information to be processed includes at least one sentence; the first generation unit is configured to extract sentences meeting a first preset condition from at least one sentence to form a sentence set; a second generation unit configured to extract subjects, predicates, and objects from sentences in the sentence set, to form triples, wherein parallel words exist in the objects; a selection unit configured to combine the composed triples into a triplet set, and select a triplet from the triplet set as a target triplet; and a third generation unit configured to extract parallel words from objects in the target triplet, use a subject in the target triplet as a parent knowledge point, and use words in the extracted parallel words as child knowledge points, and generate parent-child relationship information indicating a parent-child relationship between the parent knowledge point and the child knowledge points.
In some embodiments, the first preset condition comprises: the sentence includes a keyword in a preset keyword set and a character in a preset character set, wherein the keywords in the keyword set are verbs for stating the subject of a sentence or adverbs for modifying such verbs; the characters in the character set are conjunctions or punctuation marks for connecting words in a parallel relationship in a sentence.
In some embodiments, the first generation unit is further configured to: for a sentence in the at least one sentence, determining whether the sentence includes keywords in a keyword set; if yes, further determining whether the sentence comprises characters in the character set; if the sentence includes characters in the character set, the sentence is extracted.
In some embodiments, the first preset condition further comprises: the length of the sentence is not more than the preset word number; and the first generation unit is further configured to: for a sentence in the at least one sentence, determining whether the sentence includes keywords in a keyword set; if yes, further determining whether the sentence comprises characters in the character set; if the sentence comprises characters in the character set, further determining whether the length of the sentence is greater than a preset word number; if the length of the sentence is not greater than the preset word number, extracting the sentence.
In some embodiments, the second generation unit comprises: an extraction subunit configured to, for a sentence in the sentence set, take the sentence as a sentence to be processed, perform syntactic analysis and semantic role analysis on the sentence to be processed to obtain an analysis result, and extract a subject, a predicate and an object from the sentence to be processed based on the analysis result.
In some embodiments, the analysis result includes first labeling information for indicating a core verb in the sentence to be processed and second labeling information for indicating the agent portion of the core verb; and the extraction subunit is further configured to: determine whether the analysis result further includes third labeling information for indicating the patient portion of the core verb; if the third labeling information is included, determine the agent portion indicated by the second labeling information, the core verb indicated by the first labeling information and the patient portion indicated by the third labeling information as the subject, predicate and object of the sentence to be processed, in that order, and extract the determined subject, predicate and object from the sentence to be processed.
In some embodiments, the analysis result further includes at least one piece of fourth labeling information, the fourth labeling information being used to indicate a verb-object relationship between the core verb and a word other than the core verb in the sentence to be processed; and the extraction subunit is further configured to: in response to determining that the analysis result does not include the third labeling information, determine, among the at least one piece of fourth labeling information, target fourth labeling information meeting a second preset condition, extract a phrase from the sentence to be processed as the object based on the target fourth labeling information, take the agent portion indicated by the second labeling information and the core verb indicated by the first labeling information as the subject and predicate of the sentence to be processed, respectively, and extract the determined subject, predicate and object from the sentence to be processed.
In some embodiments, the selecting unit includes: an acquisition subunit configured to acquire a target classification model, wherein the target classification model is a trained classification model for predicting whether a relationship between subjects, predicates, and objects in the triplet is correct; a generation subunit configured to select, based on the target classification model, a triplet from the triplet set in which the relationship between the included subject, predicate and object is correct, constituting a first triplet set; a selecting subunit configured to select a triplet from the first set of triples as a target triplet.
In some embodiments, the acquisition subunit is further configured to: acquiring labeling information of at least one triplet in the triplet set, wherein the labeling information is used for indicating whether the relation among subjects, predicates and objects in the corresponding triplet is correct; for a triplet in at least one triplet, carrying out feature extraction on the triplet to obtain feature information, inputting the feature information of the triplet into an initial model to obtain a prediction result corresponding to the triplet, wherein the prediction result is used for indicating whether the relation among subjects, predicates and objects in the triplet is correct; comparing the prediction result with the labeling information of the triplet, and determining whether the initial model reaches a preset optimization target according to the comparison result; in response to determining that the initial model meets the optimization objective, the initial model is taken as the objective classification model.
In some embodiments, the selecting subunit is further configured to: perform a preset disambiguation operation on the triples in the first triplet set; and take the triples in the first triplet set after the disambiguation operation as target triples.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
According to the method and the device for generating information provided by the embodiments of the application, text information to be processed that includes at least one sentence is acquired, and sentences meeting the first preset condition are then extracted from the at least one sentence to form a sentence set, so that the sentences in the sentence set serve as corpus content for mining parent-child relationships between knowledge points. Subjects, predicates and objects are then extracted from the sentences in the sentence set to form triples, the formed triples are combined into a triplet set, and triples are selected from the triplet set as target triples, so that parallel words can be extracted from the objects in the target triples, the subjects in the target triples taken as parent knowledge points, the words in the extracted parallel words taken as child knowledge points, and parent-child relationship information indicating the parent-child relationships between the parent knowledge points and the child knowledge points generated. In this way, the generation of triples comprising subjects, predicates and objects and the extraction of the parallel words included in the objects of the target triples are used effectively to mine the parent-child relationships between knowledge points.
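As a non-limiting illustration of the final step summarized above, the following Python sketch splits the object of a target triplet into parallel words and pairs each with the subject as its parent knowledge point. The separator pattern (comma, the Chinese enumeration comma and the conjunction "and") and the function names `split_parallel_words` and `parent_child_relations` are assumptions made for this example and are not specified by the application.

```python
import re
from typing import List, Tuple

# Hypothetical separators standing in for the preset character set
# (conjunctions or punctuation that connect parallel words).
SEPARATORS = re.compile(r"\s*(?:,|、|\band\b)\s*")

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def split_parallel_words(obj: str) -> List[str]:
    """Split an object phrase into its parallel words, dropping empty pieces."""
    pieces = (part.strip() for part in SEPARATORS.split(obj))
    return [piece for piece in pieces if piece]


def parent_child_relations(triple: Triple) -> List[Tuple[str, str]]:
    """Return (parent knowledge point, child knowledge point) pairs for a triple."""
    subject, _predicate, obj = triple
    return [(subject, child) for child in split_parallel_words(obj)]


relations = parent_child_relations(
    ("threads", "are classified into",
     "fastening threads, driving threads and pipe threads")
)
```

Here each element of `relations` pairs the subject "threads" with one of the three parallel words found in the object.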
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method for generating information according to the present application;
FIG. 3 is a schematic illustration of one application scenario of a method for generating information according to the present application;
FIG. 4 is a flow chart of yet another embodiment of a method for generating information according to the present application;
FIG. 5 is a schematic structural diagram of one embodiment of an apparatus for generating information according to the present application;
FIG. 6 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
The present application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the methods for generating information or the apparatus for generating information of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include information generating terminals 101, 102, 103, a network 104, and an information storage terminal 105. The network 104 is a medium used to provide communication links between the information generating terminals 101, 102, 103 and the information storage terminal 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The information generating terminals 101, 102, 103 may interact with the information storage terminal 105 via the network 104 to receive or transmit messages or the like. For example, the information generating terminals 101, 102, 103 may acquire the text information to be processed from the information storage terminal 105, and then analyze the text information to be processed to obtain a processing result (for example, generated parent-child relationship information indicating a parent-child relationship between the parent knowledge point and the child knowledge point).
The information generating terminals 101, 102, 103 may be terminal devices or servers. When the information generating terminals 101, 102, 103 are terminal devices, various communication client applications such as a web browser application, an application for mining parent-child relationships between knowledge points, and the like may be installed on the terminal devices.
The information storage terminal 105 may be a server providing various services, for example, a server for storing text information for processing by the information generation terminals 101, 102, 103.
It should be noted that, the method for generating information provided in the embodiments of the present application is generally performed by the information generating terminals 101, 102, 103, and accordingly, the apparatus for generating information is generally provided in the information generating terminals 101, 102, 103.
It should be noted that the terminal device may be hardware or software. When the terminal device is hardware, it may be any of a variety of electronic devices, including, but not limited to, smartphones, tablets, laptop computers and desktop computers. When the terminal device is software, it can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. The present invention is not particularly limited herein.
The server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., to provide distributed services), or as a single software or software module. The present invention is not particularly limited herein.
In practice, if the information generating terminals 101, 102, 103 pre-store the required text information to be processed, the system architecture 100 may not include the information storage terminal 105.
It should be understood that the numbers of information generating terminals, networks and information storage terminals in fig. 1 are merely illustrative. There may be any number of information generating terminals, networks and information storage terminals, as required by the implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for generating information according to the present application is shown. The flow 200 of the method for generating information comprises the steps of:
step 201, obtaining text information to be processed.
In this embodiment, the execution subject of the method for generating information (e.g., the information generating terminals 101, 102, 103 shown in fig. 1) may acquire the text information to be processed from the connected server (e.g., the information storage terminal 105 shown in fig. 1) in real time, or may acquire the text information to be processed locally. Wherein the text information to be processed may comprise at least one sentence.
As an example, the execution subject described above may acquire, in response to receiving an information generation request for the text information to be processed, the text information to be processed indicated by the information generation request. The sender of the information generating request may be a user side or a server side, and the embodiment is not limited in this respect.
Step 202, extracting sentences meeting a first preset condition from at least one sentence to form a sentence set.
In this embodiment, the execution body may extract sentences satisfying a first preset condition from the at least one sentence, to form a sentence set. The first preset condition may include, for example: the sentence includes a keyword in a preset keyword set and a character in a preset character set. Here, the keywords in the keyword set may be verbs for stating subjects in sentences. The characters in the character set may be conjunctions or punctuation marks for connecting words in parallel relationship in a sentence.
As an example, for a sentence in the at least one sentence, the execution body may first determine whether the sentence includes a keyword in the keyword set. Then, the execution body may further determine whether the sentence includes characters in the character set in response to determining that the sentence includes keywords in the keyword set. Finally, the executing body may determine that the sentence is a sentence satisfying the first preset condition in response to determining that the sentence includes characters in the character set, and the executing body may extract the sentence.
By using the first preset condition, it can be ensured that each extracted sentence includes the three sentence components of subject, predicate and object, as well as parallel words. Using the extracted sentences as the corpus content for mining the parent-child relationships between knowledge points can improve mining efficiency. It should be understood that parallel words include at least two words that are in a parallel relationship. Take the sentence "threads can generally be classified, according to purpose, into three categories: fastening threads, driving threads and pipe threads" as an example. In this sentence, a parallel relationship exists between "fastening threads", "driving threads" and "pipe threads", so these are parallel words.
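As a non-limiting illustration, the staged check described above (keyword first, then parallel-connecting character, then an optional length limit as in some embodiments) can be sketched in Python as follows. The example keyword and character sets, the whitespace-based word count and the names `satisfies_first_condition` and `extract_sentence_set` are assumptions made for this sketch; the application leaves the actual sets and the length measure open.

```python
from typing import Iterable, List, Optional

# Hypothetical example sets; the real preset sets are not given in the text.
KEYWORDS = {"classified into", "includes", "has"}
PARALLEL_CHARS = {",", "、", " and "}


def satisfies_first_condition(sentence: str,
                              keywords: Iterable[str] = KEYWORDS,
                              chars: Iterable[str] = PARALLEL_CHARS,
                              max_words: Optional[int] = None) -> bool:
    """Apply the checks in the order described: keyword, character, length."""
    # Stage 1: the sentence must contain a keyword from the keyword set.
    # Plain substring matching is used here, so "has" would also match "phase".
    if not any(keyword in sentence for keyword in keywords):
        return False
    # Stage 2: it must contain a character that connects parallel words.
    if not any(char in sentence for char in chars):
        return False
    # Stage 3 (optional): it must not exceed the preset word number.
    if max_words is not None and len(sentence.split()) > max_words:
        return False
    return True


def extract_sentence_set(sentences: Iterable[str], **options) -> List[str]:
    """Collect the sentences that satisfy the first preset condition."""
    return [s for s in sentences if satisfies_first_condition(s, **options)]


SENTENCES = [
    "Threads can be classified into fastening threads, driving threads and pipe threads.",
    "Threads are useful.",
    "A bolt has a head",
]
sentence_set = extract_sentence_set(SENTENCES)
```

Only the first example sentence passes both the keyword check and the character check; passing `max_words=5` would reject it as well, since it has twelve whitespace-separated words.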
In some optional implementations of this embodiment, the keywords in the keyword set described above may be verbs for stating the subject of a sentence, adverbs for modifying such verbs, or the like. It should be noted that the keyword set may be obtained by performing word expansion on a manually compiled initial keyword set. The initial keywords may include, for example, "comprises", "comprising", "includes", "including" or "having". An execution end that generates the keyword set (for example, the execution body or a server connected to the execution body) may input each of the initial keywords into a preset model for generating word vectors (for example, word2vec), so as to obtain a plurality of words related to that keyword. The execution end may extract a preset number of words (for example, 30) from the obtained words as the expanded keywords of that keyword. The execution end may then combine the initial keywords and their corresponding expanded keywords into the keyword set.
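As a non-limiting illustration of the word expansion described above, the following Python sketch ranks candidate words by cosine similarity to each initial keyword and keeps the top few. It does not call an actual word2vec model: the tiny `TOY_VECTORS` table is a made-up stand-in for trained word vectors, and the function names are assumptions for this example only.

```python
import math
from typing import Dict, Iterable, List, Set


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_keywords(initial: Iterable[str],
                    vectors: Dict[str, List[float]],
                    top_n: int = 30) -> Set[str]:
    """Return the initial keywords plus their top_n most similar words."""
    initial = list(initial)
    expanded = set(initial)
    for keyword in initial:
        if keyword not in vectors:
            continue  # no vector available for this keyword
        scored = sorted(
            ((cosine(vectors[keyword], vec), word)
             for word, vec in vectors.items() if word != keyword),
            reverse=True,
        )
        expanded.update(word for _score, word in scored[:top_n])
    return expanded


# Made-up two-dimensional vectors, for illustration only.
TOY_VECTORS = {
    "includes": [1.0, 0.1],
    "comprises": [0.9, 0.2],
    "contains": [0.8, 0.3],
    "runs": [0.0, 1.0],
}
keyword_set = expand_keywords(["includes"], TOY_VECTORS, top_n=2)
```

With these toy vectors, the two nearest neighbours of "includes" are "comprises" and "contains", so the unrelated word "runs" stays out of the keyword set.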
In some optional implementations of this embodiment, since the keyword set contains a large number of keywords, in order to improve the matching efficiency, the execution body may use a multi-pattern matching algorithm (for example, the Aho-Corasick automaton, also called the AC automaton) to match the keywords in the keyword set against the words included in a sentence of the at least one sentence, so as to determine whether the sentence includes a keyword in the keyword set.
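As a non-limiting illustration, a compact Aho-Corasick automaton can be sketched in Python as follows. It matches at the character level and reports which patterns occur in a text; the class name `AhoCorasick` and its method names are choices made for this sketch, and a production system would more likely rely on an existing library.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


class AhoCorasick:
    """Multi-pattern matcher: a trie plus failure links."""

    def __init__(self, patterns: Iterable[str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]  # trie transitions per state
        self.fail: List[int] = [0]              # failure link per state
        self.out: List[Set[str]] = [set()]      # patterns ending at each state
        for pattern in patterns:
            self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern: str) -> None:
        state = 0
        for ch in pattern:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append(set())
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].add(pattern)

    def _build_failure_links(self) -> None:
        # Breadth-first traversal; depth-1 states keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure target.
                self.out[nxt] |= self.out[self.fail[nxt]]

    def find(self, text: str) -> Set[str]:
        """Return the set of patterns that occur anywhere in text."""
        state, found = 0, set()
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            found |= self.out[state]
        return found


matcher = AhoCorasick(["includes", "has", "classified into"])
hits = matcher.find("Threads can be classified into three kinds")
```

A sentence passes the keyword check when `find` returns a non-empty set. The text is scanned once regardless of how many keywords the set contains, which is the efficiency gain the passage refers to.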
Step 203, extracting subjects, predicates and objects from sentences in the sentence set to form triples.
In this embodiment, for a sentence in the sentence set, the execution body may extract a subject, a predicate, and an object from the sentence to form a triplet.
As an example, the execution body may perform syntactic analysis (e.g., dependency syntactic analysis) on the sentence to obtain an analysis result. Dependency syntactic analysis may be used to identify grammatical components of a sentence, such as the subject, predicate, object, attributive, adverbial and complement, and to analyze the relationships between these components. The analysis result may be used to indicate the relationship type to which the relationship between different words in the sentence belongs. The relationship types may include a subject-predicate relationship, a verb-object relationship, and the like. The execution body may then, based on the analysis result, find two words in the sentence that are in a subject-predicate relationship, taking the left word of the two as the subject and the right word as the predicate. Next, the execution body may, based on the analysis result, find the target words in the sentence that have a verb-object relationship with the predicate. If there is one target word, the execution body may extract from the sentence the phrase that starts with the word immediately to the right of the predicate and ends with the target word, and use this phrase as the object. If there is more than one target word, the execution body may select the target word separated from the predicate by the largest number of characters, extract from the sentence the phrase that starts with the word immediately to the right of the predicate and ends with the selected target word, and use this phrase as the object. Finally, the execution body may extract the determined subject, predicate and object from the sentence to form a triple.
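As a non-limiting illustration, the extraction rule described above can be sketched in Python on top of an already computed dependency analysis. The parser itself is out of scope here: the analysis result is assumed to be a list of words plus arcs of the form (index, index, relation), and the relation labels `"subj-pred"` and `"verb-obj"` as well as the function name `extract_triple` are hypothetical names chosen for this example.

```python
from typing import List, Optional, Tuple

Arc = Tuple[int, int, str]  # (word index, word index, relation label)

SUBJ_PRED = "subj-pred"  # hypothetical label for a subject-predicate relation
VERB_OBJ = "verb-obj"    # hypothetical label for a verb-object relation


def extract_triple(words: List[str],
                   arcs: List[Arc]) -> Optional[Tuple[str, str, str]]:
    """Build a (subject, predicate, object) triple from a dependency analysis."""
    pair = next(((a, b) for a, b, rel in arcs if rel == SUBJ_PRED), None)
    if pair is None:
        return None
    # The left word of the pair is the subject, the right word the predicate.
    subj_i, pred_i = min(pair), max(pair)
    # Target words: those in a verb-object relation with the predicate.
    targets = [b if a == pred_i else a
               for a, b, rel in arcs
               if rel == VERB_OBJ and pred_i in (a, b)]
    targets = [t for t in targets if t > pred_i]
    if not targets:
        return None

    def char_gap(target: int) -> int:
        # Number of characters between the predicate and the target word.
        return sum(len(w) for w in words[pred_i + 1:target])

    # With several targets, keep the one farthest from the predicate.
    end = max(targets, key=char_gap)
    # Object: from the word right after the predicate up to the target word.
    obj = " ".join(words[pred_i + 1:end + 1])
    return words[subj_i], words[pred_i], obj


WORDS = ["The gauge", "has", "a checking part,", "a locating part",
         "and", "a guiding part"]
ARCS = [(0, 1, SUBJ_PRED), (1, 2, VERB_OBJ), (1, 5, VERB_OBJ)]
triple = extract_triple(WORDS, ARCS)
```

For brevity each "word" here is a short chunk, and the pieces are joined with spaces; for Chinese text the words would be concatenated directly. Selecting the farthest target word is what lets the object span the whole run of parallel words rather than stopping at the first one.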
Take the sentence "the working portion of the function gauge has the checking portion, the locating portion, and the guiding portion" as an example. If the subject "the working portion of the function gauge", the predicate "has" and the object "the checking portion, the locating portion, and the guiding portion" are extracted from the sentence, the execution body may form the following triple: <the working portion of the function gauge, has, the checking portion, the locating portion, and the guiding portion>.
Step 204, combining the formed triples into a triplet set, and selecting triples from the triplet set as target triples.
In this embodiment, the execution body may merge the triples formed in step 203 into a triplet set. The execution body may select a triplet from the triplet set as a target triplet. As an example, the execution body may select each triplet in the triplet set as a target triplet.
In some optional implementations of this embodiment, the relationship among the subject, predicate, and object in some triples may be erroneous. For example, a subject or an object may be excessively long and therefore not actually suitable as the subject or object of a sentence. Thus, the execution body may analyze the triples in the triplet set to determine whether the relationship among the subject, predicate, and object in each triplet is correct. The execution body may select, from the triplet set, the triples in which the relationship among the subject, predicate, and object is correct, to form a first triplet set. The execution body may then select triples from the first triplet set as target triples. For example, each triplet in the first triplet set may be selected as a target triplet.
Here, the execution body may acquire a target classification model and use it to predict whether the relationship among the subject, predicate, and object in each triplet of the triplet set is correct. The target classification model may be a trained classification model for predicting whether the relationship among the subject, predicate, and object in a triplet is correct. The target classification model may be obtained by training an initial model. The initial model may be, for example, an untrained or not fully trained naive Bayes model (Naive Bayesian Model, NBM), a support vector machine (Support Vector Machine, SVM), or the like. If the execution body is not executing the flow 200 for the first time, it may acquire a pre-stored target classification model, since the target classification model is generally stored locally or on a server connected to the execution body.
Step 205, extracting parallel words from the object in the target triplet, taking the subject in the target triplet as a parent knowledge point, taking the words among the extracted parallel words as child knowledge points, and generating parent-child relationship information for indicating the parent-child relationship between the parent knowledge point and the child knowledge points.
In this embodiment, for each target triplet, the execution subject may extract a parallel word from the object in the target triplet, use the subject in the target triplet as a parent knowledge point, and use the word in the extracted parallel word as a child knowledge point, and generate parent-child relationship information indicating a parent-child relationship between the parent knowledge point and the child knowledge point.
Taking the target triplet <working portion of the function gauge, has, checking portion, positioning portion and guiding portion> as an example, the execution body may segment the object at the characters that appear in the object and also exist in the character set. Here, the object includes the characters "," and "and", both of which exist in the character set. The execution body may segment the object into, for example, "checking portion|,|positioning portion|and|guiding portion", where "|" represents a separator. The execution body may take the segments other than the characters "," and "and" as the parallel words, that is, "checking portion", "positioning portion", and "guiding portion", and extract these parallel words from the object. The execution body may take the subject "working portion of the function gauge" as the parent knowledge point and take "checking portion", "positioning portion", and "guiding portion" as child knowledge points, respectively. The execution body may generate, for example, the following three pieces of parent-child relationship information:
Subject: working portion of the function gauge; predicate: divided into; object: checking portion;
Subject: working portion of the function gauge; predicate: divided into; object: positioning portion;
Subject: working portion of the function gauge; predicate: divided into; object: guiding portion.
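The segmentation of the object into parallel words can be sketched as follows. This is a minimal illustration only: the contents of `CHARACTER_SET`, the function names, and the dictionary layout of the generated parent-child records are assumptions, and the sketch records only the parent and child knowledge points.

```python
import re
from typing import Dict, List, Sequence, Tuple

# Assumed character set: conjunctions and punctuation joining parallel words.
CHARACTER_SET = [",", "、", " and "]


def split_parallel_words(obj: str, character_set: Sequence[str] = CHARACTER_SET) -> List[str]:
    """Segment the object at every character-set entry and drop the separators."""
    # Longest entries first so multi-character separators are tried first.
    ordered = sorted(character_set, key=len, reverse=True)
    pattern = "|".join(re.escape(c) for c in ordered)
    return [part.strip() for part in re.split(pattern, obj) if part.strip()]


def parent_child_info(triplet: Tuple[str, str, str]) -> List[Dict[str, str]]:
    """Subject becomes the parent; each parallel word in the object a child."""
    subject, _predicate, obj = triplet
    return [{"parent": subject, "child": word} for word in split_parallel_words(obj)]


target_triplet = ("working portion of the function gauge", "has",
                  "checking portion, positioning portion and guiding portion")
relations = parent_child_info(target_triplet)
```

Here `relations` holds one record per child knowledge point, each pointing back to the same parent knowledge point.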
In some optional implementations of this embodiment, the execution body may store the generated parent-child relationship information in a pre-specified database, so as to form a corresponding knowledge graph. The knowledge graph can then be used in scenarios such as knowledge recommendation. Taking a knowledge recommendation scenario as an example, when a user learns a certain knowledge point, the parent knowledge point and/or child knowledge points of that knowledge point can be obtained through the knowledge graph. Pushing the obtained knowledge points to the user makes learning more convenient and broadens the user's knowledge.
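As a minimal sketch of how stored parent-child relationship information could serve a recommendation lookup, the following uses an in-memory structure in place of the pre-specified database. The class `KnowledgeGraph` and its methods are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import Dict, List


class KnowledgeGraph:
    """In-memory stand-in for the database holding parent-child relations."""

    def __init__(self) -> None:
        self.children = defaultdict(set)  # parent -> set of children
        self.parents = defaultdict(set)   # child -> set of parents

    def add(self, parent: str, child: str) -> None:
        self.children[parent].add(child)
        self.parents[child].add(parent)

    def related(self, point: str) -> Dict[str, List[str]]:
        # Knowledge points that could be pushed to a user studying `point`.
        return {
            "parents": sorted(self.parents.get(point, ())),
            "children": sorted(self.children.get(point, ())),
        }


graph = KnowledgeGraph()
graph.add("interchangeability", "complete interchangeability")
graph.add("interchangeability", "incomplete interchangeability")
recommendation = graph.related("complete interchangeability")
```

Keeping both directions indexed lets a lookup start from either a parent or a child knowledge point.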
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating information according to the present embodiment. In the application scenario of fig. 3, the server 301 may acquire the text information to be processed from the server 302 in real time. The acquired text information to be processed may include sentences A, B, and C. The content of sentence A may be "interchangeability includes multiple classifications". The content of sentence B may be "interchangeability is divided into two classes, complete interchangeability and incomplete interchangeability". The content of sentence C may be "the working portion of the function gauge has a checking portion, a positioning portion, and a guiding portion".
Then, the server 301 may determine whether sentences A, B, and C each satisfy the first preset condition. The first preset condition may include: the sentence includes a keyword in a preset keyword set and a character in a preset character set. In response to determining that sentences B and C each satisfy the first preset condition, the server 301 may extract sentences B and C and compose them into the sentence set 303.
Next, the server 301 may extract the subject "interchangeability", the predicate "divided into", and the object "two classes of complete interchangeability and incomplete interchangeability" from sentence B, and compose the triplet <interchangeability, divided into, two classes of complete interchangeability and incomplete interchangeability>, as shown by reference numeral 304. The server 301 may also extract the subject "working portion of the function gauge", the predicate "has", and the object "checking portion, positioning portion and guiding portion" from sentence C, and compose the triplet <working portion of the function gauge, has, checking portion, positioning portion and guiding portion>, as shown by reference numeral 305.
Server 301 may then merge the triples indicated by reference 304 and reference 305, respectively, into a triplet set 306. Server 301 may select each triplet in triplet set 306 as a target triplet.
Finally, for the target triplet indicated by reference numeral 304, the server 301 may extract the parallel words "complete interchangeability" and "incomplete interchangeability" from the object of the target triplet. The server 301 may take the subject in the target triplet as the parent knowledge point and the parallel words "complete interchangeability" and "incomplete interchangeability" as child knowledge points, respectively. The server 301 may generate the following parent-child relationship information (as shown by reference numeral 307):
Subject: interchangeability; predicate: divided into; object: complete interchangeability;
Subject: interchangeability; predicate: divided into; object: incomplete interchangeability.
For the target triplet indicated by reference numeral 305, the server 301 may extract the parallel words "checking portion", "positioning portion", and "guiding portion" from the object of the target triplet. The server 301 may take the subject in the target triplet as the parent knowledge point and the parallel words "checking portion", "positioning portion", and "guiding portion" as child knowledge points, respectively. The server 301 may generate the following parent-child relationship information (as shown by reference numeral 308):
Subject: working portion of the function gauge; predicate: has; object: checking portion;
Subject: working portion of the function gauge; predicate: has; object: positioning portion;
Subject: working portion of the function gauge; predicate: has; object: guiding portion.
The method provided by the above embodiment of the application makes effective use of the generation of triples comprising a subject, a predicate, and an object, and of the extraction of parallel words from the object of the target triplet, thereby realizing the mining of parent-child relationships between knowledge points.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for generating information is shown. The flow 400 of the method for generating information comprises the steps of:
Step 401, obtaining text information to be processed.
In this embodiment, the execution subject of the method for generating information (e.g., the information generating terminals 101, 102, 103 shown in fig. 1) may acquire the text information to be processed from the connected server (e.g., the information storage terminal 105 shown in fig. 1), or may acquire the text information to be processed locally. Wherein the text information to be processed may comprise at least one sentence.
Step 402, extracting sentences meeting the following first preset conditions from at least one sentence: the sentences comprise keywords in a preset keyword set, characters in a preset character set, and the lengths of the sentences are not more than the preset word number, and the extracted sentences form a sentence set.
In this embodiment, the execution body may extract, from the at least one sentence, a sentence satisfying the following first preset condition: the sentence comprises keywords in a preset keyword set, characters in a preset character set, and the length of the sentence is not more than the preset word number. Then, the execution body may compose the extracted sentences into a sentence set.
Wherein, the keywords in the keyword set may be verbs for stating the subject in sentences or adverbs for modifying the verbs. The characters in the character set may be conjunctions or punctuation marks for connecting words in parallel relationship in a sentence.
As an example, for a sentence in the at least one sentence, the execution body may first determine whether the sentence includes a keyword in the keyword set. If so, the execution body may further determine whether the sentence includes a character in the character set. If the sentence includes a character in the character set, the execution body may further determine whether the length of the sentence is greater than the preset word number. If the length of the sentence is not greater than the preset word number, the execution body may extract the sentence.
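The three successive checks can be sketched as below. The keyword list, character list, and length limit are illustrative assumptions, the sentence length is measured in characters here, and plain substring tests stand in for whatever matching the execution body actually uses.

```python
from typing import List, Sequence

# Assumed example values; the actual preset sets are not specified here.
KEYWORDS = ["divided into", "has", "includes"]
CHARACTERS = [",", " and "]
MAX_LENGTH = 200  # assumed preset word number, counted in characters


def satisfies_first_condition(sentence: str,
                              keywords: Sequence[str] = KEYWORDS,
                              characters: Sequence[str] = CHARACTERS,
                              max_length: int = MAX_LENGTH) -> bool:
    # Check 1: the sentence contains a keyword from the keyword set.
    if not any(k in sentence for k in keywords):
        return False
    # Check 2: the sentence contains a character from the character set.
    if not any(c in sentence for c in characters):
        return False
    # Check 3: the sentence is not longer than the preset length.
    return len(sentence) <= max_length


def build_sentence_set(sentences: Sequence[str]) -> List[str]:
    return [s for s in sentences if satisfies_first_condition(s)]


sentence_a = "interchangeability includes multiple classifications"
sentence_b = ("interchangeability is divided into two classes, "
              "complete interchangeability and incomplete interchangeability")
sentence_c = ("the working portion of the function gauge has a checking portion, "
              "a positioning portion and a guiding portion")
sentence_set = build_sentence_set([sentence_a, sentence_b, sentence_c])
```

Sentence A contains a keyword but no character from the character set, so only sentences B and C enter the sentence set.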
It should be noted that, because the number of keywords in the keyword set may be large, in order to improve matching efficiency, the execution body may use a multi-pattern matching algorithm (for example, an Aho-Corasick (AC) automaton) to match the keywords in the keyword set against the words included in a sentence of the at least one sentence, so as to determine whether the sentence includes a keyword in the keyword set.
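For illustration, a compact Aho-Corasick automaton might look like the following sketch. It scans the sentence character by character rather than word by word, which is a simplifying assumption, and the function names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Automaton = Tuple[List[Dict[str, int]], List[int], List[Set[str]]]


def build_automaton(keywords: Iterable[str]) -> Automaton:
    # goto[s]: transitions of state s; fail[s]: fallback state of s;
    # out[s]: keywords recognized when the automaton is in state s.
    goto: List[Dict[str, int]] = [{}]
    fail: List[int] = [0]
    out: List[Set[str]] = [set()]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first traversal to fill in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_keywords(text: str, automaton: Automaton) -> Set[str]:
    """Return every keyword that occurs in text, in a single pass."""
    goto, fail, out = automaton
    found: Set[str] = set()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found


automaton = build_automaton(["has", "divided into", "includes", "include"])
hits = find_keywords("interchangeability includes multiple classifications", automaton)
```

The automaton is built once for the whole keyword set, after which each sentence is scanned in a single pass regardless of how many keywords there are.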
Step 403, regarding the sentences in the sentence set, using the sentences as the sentences to be processed, performing syntactic analysis and semantic role analysis on the sentences to be processed to obtain analysis results, and extracting subjects, predicates and objects from the sentences to be processed based on the analysis results to form triples.
In this embodiment, for a sentence in the sentence set, the sentence is taken as a sentence to be processed, and the execution body may perform syntactic analysis and semantic role analysis on the sentence to be processed, so as to obtain an analysis result. Based on the analysis result, the execution subject can extract subjects, predicates and objects from the sentence to be processed to form a triplet.
It should be noted that the syntactic analysis may include, for example, dependency syntactic analysis. Dependency syntactic analysis may be used to identify grammatical components of a sentence, such as the subject, predicate, object, attributive, adverbial, and complement, and to analyze the relationships between these components. In general, arguments can be divided into several types according to the semantic relationship between a predicate and its arguments, and these types are commonly referred to as semantic roles. An argument, which may also be called a thematic element, a term, or the like, is a semantic component that is directly related to and governed by a predicate. Semantic roles may include the agent, the patient, and the like. The agent generally refers to the subject of an action, i.e., the person or thing that performs the action or change. The patient generally refers to the object of an action, i.e., the person or thing affected by the action. Semantic role analysis may be used to analyze the types of the arguments in a sentence and to label the arguments with semantic roles.
In this embodiment, the analysis result may include first labeling information for indicating the core verb in the sentence to be processed and second labeling information for indicating the agent part of the core verb. As an example, the first labeling information may be represented by, for example, "HED", which denotes the core relationship and points to the core of the entire sentence (which may be a core verb serving as the predicate of the sentence). The second labeling information may be represented by, for example, "A0". The execution body may search the analysis result for third labeling information for indicating the patient part of the core verb. The third labeling information may be represented by, for example, "A1". If the third labeling information is found, the execution body may take the agent part indicated by the second labeling information, the core verb indicated by the first labeling information, and the patient part indicated by the third labeling information as the subject, predicate, and object of the sentence to be processed, respectively. The execution body may then extract the determined subject, predicate, and object from the sentence to be processed to form a triplet.
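A minimal sketch of this label-based extraction is given below. It assumes, purely for illustration, that the analysis result has already been reduced to a dictionary mapping the labels "HED", "A0", and "A1" to the text spans they indicate; the function name `triplet_from_roles` is hypothetical.

```python
from typing import Dict, Optional, Tuple


def triplet_from_roles(analysis: Dict[str, str]) -> Optional[Tuple[str, str, str]]:
    """Map agent / core verb / patient onto subject / predicate / object."""
    core_verb = analysis.get("HED")  # first labeling information
    agent = analysis.get("A0")       # second labeling information
    patient = analysis.get("A1")     # third labeling information
    if core_verb is None or agent is None or patient is None:
        # Without the patient part this simple mapping does not apply.
        return None
    return agent, core_verb, patient


# Hand-written analysis result (hypothetical, not real analyzer output).
analysis = {
    "A0": "working portion of the function gauge",
    "HED": "has",
    "A1": "checking portion, positioning portion and guiding portion",
}
role_triplet = triplet_from_roles(analysis)
```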
In some optional implementations of this embodiment, in addition to the first labeling information and the second labeling information, the analysis result may further include at least one piece of fourth labeling information. The fourth labeling information may be used to indicate a verb-object relationship between the core verb and a word other than the core verb in the sentence to be processed. The fourth labeling information may be represented by, for example, "VOB". If the analysis result does not include the third labeling information, the execution body may perform the following extraction operations:
First, the execution body may determine, from the at least one piece of fourth labeling information, target fourth labeling information satisfying a second preset condition. As an example, the second preset condition may include: the number of characters lying between the two indicated words is the largest. For each piece of fourth labeling information, the execution body may count the number of characters between the two words indicated by that fourth labeling information. The execution body may then compare the counted numbers and determine the fourth labeling information corresponding to the largest number as the target fourth labeling information.
Then, the execution body may extract a phrase from the sentence to be processed as the object, based on the target fourth labeling information. For example, suppose the sentence to be processed is "interchangeability is divided into two classes, complete interchangeability and incomplete interchangeability", and the two words indicated by the target fourth labeling information are "divided into" and "classes", where "divided into" is the core verb and a verb-object relationship exists between "divided into" and "classes". The execution body may extract from the sentence to be processed the phrase formed by the characters between the two words together with "classes", that is, "two classes of complete interchangeability and incomplete interchangeability", and take this phrase as the object.
Finally, the execution body may take the agent part indicated by the second labeling information and the core verb indicated by the first labeling information as the subject and predicate of the sentence to be processed, respectively. The execution body may extract the determined subject, predicate, and object from the sentence to be processed to form a triplet.
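The fallback extraction can be sketched as follows, under several assumptions: the analysis result is modelled as a dictionary in which "HED" holds the index of the core verb, "A0" holds the agent text, "A1" (if present) holds the patient text, and "VOB" holds (verb index, word index) pairs. The example uses a simplified English variant of the sentence with hand-written labels, and the function name is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def extract_with_vob_fallback(words: List[str],
                              labels: Dict[str, object]) -> Optional[Tuple[str, str, str]]:
    verb_idx = labels["HED"]
    subject = labels["A0"]
    predicate = words[verb_idx]
    if "A1" in labels:
        # Third labeling information present: use the patient part directly.
        return subject, predicate, labels["A1"]

    # Keep verb-object pairs headed by the core verb, target to its right.
    pairs = [(h, d) for h, d in labels.get("VOB", [])
             if h == verb_idx and d > verb_idx]
    if not pairs:
        return None

    def chars_between(pair: Tuple[int, int]) -> int:
        head, dep = pair
        return sum(len(w) for w in words[head + 1:dep])

    # Second preset condition: most characters between the two words.
    head, dep = max(pairs, key=chars_between)
    # Object = everything between the two words plus the target word itself.
    obj = " ".join(words[head + 1:dep + 1])
    return subject, predicate, obj


words = ["interchangeability", "is divided into", "complete interchangeability",
         "and", "incomplete interchangeability"]
labels = {"HED": 1, "A0": "interchangeability", "VOB": [(1, 2), (1, 4)]}
fallback_triplet = extract_with_vob_fallback(words, labels)
```

Choosing the pair with the most intervening characters makes the extracted object span all of the parallel words rather than only the first one.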
Step 404, combining the composed triples into a triplet set, and acquiring the labeling information of at least one triplet in the triplet set.
In this embodiment, the execution body may merge the triples formed in step 403 into a triplet set. The execution body may obtain labeling information of at least one triplet in the triplet set. Wherein the labeling information may be used to indicate whether the relationships between subjects, predicates, and objects in the corresponding triples are correct.
It should be noted that the labeling information may be set manually. The execution body may output the triplet set together with corresponding prompt information, to prompt a relevant person to set labeling information for at least one triplet in the triplet set. The execution body may receive the labeling information of the at least one triplet sent by the relevant person through an electronic device. The labeling information may be represented by, for example, 0 or 1. Here, 0 may represent that the relationship among the subject, predicate, and object in the triplet is erroneous, and 1 may represent that the relationship among the subject, predicate, and object in the triplet is correct.
Step 405, for a triplet in the at least one triplet, extracting features of the triplet to obtain feature information, and inputting the feature information of the triplet into an initial model to obtain a prediction result corresponding to the triplet; comparing the prediction result with the labeling information of the triplet, and determining whether the initial model reaches a preset optimization target according to the comparison result; in response to determining that the initial model reaches the optimization target, taking the initial model as the target classification model.
In this embodiment, after the executing body obtains the labeling information of at least one triplet, for the triplet in the at least one triplet, the executing body may execute the following model training operation with the triplet being used as the triplet to be processed:
First, the execution body may perform feature extraction on the triplet to be processed to obtain feature information. Here, the execution body may determine whether the subject in the triplet to be processed starts with an attributive-head structure, to obtain a determination result. The execution body may also count the length of the subject in the triplet to be processed, the number of nouns in the subject, the length of the object, and the number of enumeration commas (pause marks) in the object. The execution body may use the counted information, the obtained determination result, the predicate in the triplet to be processed, and the like as the feature information of the triplet to be processed. It should be noted that the analysis result may further include fifth labeling information for indicating an attributive-head relationship between words. The fifth labeling information may be represented by, for example, "ATT". The execution body may determine whether the subject in the triplet to be processed starts with an attributive-head structure based on the fifth labeling information included in the analysis result associated with the triplet to be processed.
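The feature extraction can be sketched as below. The feature names, the part-of-speech tag convention (tags beginning with "n" denote nouns), and the `PAUSE_MARKS` tuple are assumptions for illustration. Whether the subject starts with an attributive-head structure is passed in as a flag, on the assumption that it has already been derived from the "ATT" labels.

```python
from typing import Dict, List

# Assumed marks separating enumerated items inside an object.
PAUSE_MARKS = ("、", ",")


def extract_features(subject: str, subject_pos: List[str], predicate: str,
                     obj: str, starts_with_att: bool) -> Dict[str, object]:
    """Collect the counted information, the ATT determination and the predicate."""
    return {
        "subject_starts_with_att": starts_with_att,
        "subject_length": len(subject),
        # Assumed tag scheme: noun tags start with "n".
        "subject_noun_count": sum(1 for tag in subject_pos if tag.startswith("n")),
        "object_length": len(obj),
        "object_pause_count": sum(obj.count(mark) for mark in PAUSE_MARKS),
        "predicate": predicate,
    }


subject = "working portion of the function gauge"
obj = "checking portion, positioning portion and guiding portion"
features = extract_features(
    subject=subject,
    subject_pos=["a", "n", "p", "n", "n"],  # hand-written tags, one per word
    predicate="has",
    obj=obj,
    starts_with_att=True,
)
```

Before being fed to a numeric model, the predicate and the Boolean flag would still need to be encoded as numbers; that encoding is left out of this sketch.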
Then, the execution body may input the extracted feature information into an initial model to obtain a prediction result. The prediction result may be used to indicate whether the relationship among the subject, predicate, and object in the triplet to be processed is correct. The initial model may be, for example, an untrained or not fully trained naive Bayes model, a support vector machine, or the like.
Next, the execution body may compare the obtained prediction result with the labeling information of the triplet to be processed, and determine whether the initial model reaches a preset optimization target according to the comparison result. The optimization target may be that the accuracy of the prediction results output by the initial model is greater than a preset accuracy threshold.
Finally, the executing body may take the initial model as the target classification model in response to determining that the initial model reaches the optimization target.
In some optional implementations of this embodiment, the executing body may further adjust network parameters of the initial model in response to determining that the initial model does not reach the optimization target, use the adjusted initial model as the initial model, and reselect a triplet from the at least one triplet as a triplet to be processed, and continue to execute the model training operation.
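The training operation (predict, compare with the labeling information, check the optimization target, otherwise adjust parameters and continue) can be sketched with a deliberately simple stand-in. A perceptron is used here in place of the naive Bayes model or support vector machine named above, purely to keep the example short. The feature vectors, learning rate, accuracy threshold, and round limit are all assumed values.

```python
from typing import List, Optional, Tuple

Sample = Tuple[List[float], int]  # (numeric feature vector, label 0 or 1)


def predict(weights: List[float], bias: float, x: List[float]) -> int:
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0


def accuracy(weights: List[float], bias: float, samples: List[Sample]) -> float:
    return sum(predict(weights, bias, x) == y for x, y in samples) / len(samples)


def train(samples: List[Sample], threshold: float = 0.9, lr: float = 0.1,
          max_rounds: int = 100) -> Optional[Tuple[List[float], float]]:
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_rounds):
        # Optimization target: accuracy greater than the preset threshold.
        if accuracy(weights, bias, samples) > threshold:
            return weights, bias
        # Target not reached: adjust parameters on each mispredicted triplet.
        for x, y in samples:
            error = y - predict(weights, bias, x)
            if error:
                weights = [w + lr * error * v for w, v in zip(weights, x)]
                bias += lr * error
    return None  # target not reached within the assumed round limit


# Assumed toy features: [normalized subject length, pause count in object].
labeled = [([0.2, 1.0], 1), ([0.3, 2.0], 1), ([0.9, 0.0], 0), ([0.8, 0.0], 0)]
model = train(labeled)
```

When `train` returns a model, that model plays the role of the target classification model in the later selection step.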
Step 406, based on the target classification model, selecting a triplet from the triplet set that the relationship between the subject, predicate and object included is correct, constituting a first triplet set.
In this embodiment, the execution body may select, from the triplet set, a triplet in which the relationship among the subject, the predicate, and the object included is correct, based on the target classification model, to constitute a first triplet set.
For example, for each triplet in the triplet set, if the triplet has corresponding labeling information, the executing entity may determine whether the labeling information is labeling information for indicating that a relationship among a subject, a predicate, and an object in the triplet is correct, and if so, the executing entity may select the triplet to be classified into the first triplet set. If the triplet has no corresponding labeling information, the executing body can perform feature extraction on the triplet, and input the extracted feature information into the target classification model to obtain a prediction result. The execution body may determine whether the predicted result is a predicted result that indicates that a relationship among subjects, predicates, and objects in the triplet is correct, and if so, the execution body may select the triplet to be included in the first triplet set.
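The selection logic above, which prefers existing labeling information and falls back to the model otherwise, can be sketched as follows. The `classify` callable stands in for feature extraction plus the target classification model, and the rule it applies in the example is an arbitrary assumption used only to make the sketch runnable.

```python
from typing import Callable, Dict, List, Tuple

Triplet = Tuple[str, str, str]


def select_first_triplet_set(triples: List[Triplet],
                             labeling: Dict[Triplet, int],
                             classify: Callable[[Triplet], int]) -> List[Triplet]:
    first_set = []
    for triplet in triples:
        if triplet in labeling:
            verdict = labeling[triplet]  # manual labeling information (0 or 1)
        else:
            verdict = classify(triplet)  # prediction of the trained model
        if verdict == 1:
            first_set.append(triplet)
    return first_set


triples = [
    ("interchangeability", "divided into", "complete and incomplete interchangeability"),
    ("it", "has", "a, b and c"),
    ("working portion of the function gauge", "has", "checking portion and guiding portion"),
]
labeling = {triples[1]: 0}
# Stand-in classifier (assumption): accept subjects longer than two characters.
first_set = select_first_triplet_set(triples, labeling, lambda t: int(len(t[0]) > 2))
```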
Step 407, performing a preset disambiguation operation on the triples in the first triplet set.
In this embodiment, the execution body may execute a preset disambiguation operation on a triplet in the first triplet set.
As an example, the execution body may perform the following disambiguation operations on triples in the first triplet set:
For a triplet in the first triplet set, taking the triplet as the triplet to be identified, the execution body may search the first triplet set for triples in which one or two of the subject, predicate, and object are consistent with the corresponding items in the triplet to be identified while the remaining items are inconsistent with the corresponding items in the triplet to be identified. If such triples are found, the execution body may select one triplet to retain from among the triplet to be identified and the found triples, and remove the remaining triples. For example, the execution body may obtain the frequencies corresponding to the triplet to be identified and to the found triples, compare the obtained frequencies, and retain the triplet corresponding to the maximum frequency.
It should be noted that, if more than one triplet has the maximum frequency, the execution body may randomly select one of these triples to retain.
Optionally, if the number of the maximum frequencies is greater than 1, the executing body may also send a prompt message to the related person to prompt the related person to perform manual selection. The execution body may select, in response to receiving a selection result sent by the relevant person through the electronic device, a triplet indicated by the selection result to be retained.
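One possible reading of the disambiguation operation is sketched below: triples are visited in descending order of frequency, and a triplet is retained only if it does not partially overlap with a triplet already retained. This greedy ordering is a simplification. Ties are broken by input order rather than randomly or manually, and the frequency table is an assumed input.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]


def conflicts(a: Triplet, b: Triplet) -> bool:
    """True if one or two items match while the remaining items differ."""
    same = sum(x == y for x, y in zip(a, b))
    return same in (1, 2)


def disambiguate(triples: List[Triplet], frequency: Dict[Triplet, int]) -> List[Triplet]:
    kept: List[Triplet] = []
    # sorted() is stable, so equal frequencies keep their input order.
    for triplet in sorted(triples, key=lambda t: frequency[t], reverse=True):
        if not any(conflicts(triplet, other) for other in kept):
            kept.append(triplet)
    return kept


candidates = [
    ("gauge", "has", "checking portion"),
    ("gauge", "has", "checking portion and guiding portion"),
    ("interchangeability", "divided into", "complete and incomplete interchangeability"),
]
frequency = {candidates[0]: 3, candidates[1]: 5, candidates[2]: 1}
retained = disambiguate(candidates, frequency)
```

The first two candidates share the subject and predicate but differ in the object, so only the more frequent of the two is retained.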
Step 408, taking the triples in the first triplet set after the disambiguation operation as target triples, extracting parallel words from the object in each target triplet, taking the subject in the target triplet as a parent knowledge point, taking the words among the extracted parallel words as child knowledge points, and generating parent-child relationship information for indicating the parent-child relationship between the parent knowledge point and the child knowledge points.
In this embodiment, the execution body may use a triplet in the first triplet set after the disambiguation operation as the target triplet. The execution subject may extract parallel words from objects in the target triplet, use a subject in the target triplet as a parent knowledge point, and use words in the extracted parallel words as child knowledge points, thereby generating parent-child relationship information for indicating a parent-child relationship between the parent knowledge point and the child knowledge point. The method for generating the parent-child relationship information may refer to the related description of step 205 in the embodiment shown in fig. 2, which is not described herein.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for generating information in this embodiment highlights the following steps: extracting sentences satisfying the first preset condition, namely that the sentence includes a keyword in the preset keyword set and a character in the preset character set and that the length of the sentence is not greater than the preset word number; performing syntactic analysis and semantic role analysis on the sentences in the sentence set, and extracting subjects, predicates, and objects from the sentences based on the analysis results; training the initial model to obtain the target classification model; and performing the preset disambiguation operation on the triples in the first triplet set. Therefore, the scheme described in this embodiment can effectively save time cost and improve the validity of the generated parent-child relationship information.
With further reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of an apparatus for generating information, where an embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for generating information of the present embodiment includes: the obtaining unit 501 is configured to obtain text information to be processed, where the text information to be processed may include at least one sentence; the first generating unit 502 is configured to extract sentences satisfying a first preset condition from at least one sentence, and compose a sentence set; the second generating unit 503 is configured to extract subjects, predicates, and objects from sentences in the sentence set, to compose triples, wherein there are parallel words in the objects; the selecting unit 504 is configured to combine the composed triples into a triplet set, and select a triplet from the triplet set as a target triplet; the third generating unit 505 is configured to extract parallel words from objects in the target triplet, take a subject in the target triplet as a parent knowledge point, and take words in the extracted parallel words as child knowledge points, and generate parent-child relationship information indicating a parent-child relationship between the parent knowledge point and the child knowledge points.
In the present embodiment, in the apparatus 500 for generating information: the specific processing of the obtaining unit 501, the first generating unit 502, the second generating unit 503, the selecting unit 504, and the third generating unit 505 and the technical effects thereof may refer to the related descriptions of step 201, step 202, step 203, step 204, and step 205 in the corresponding embodiment of fig. 2, and are not repeated herein.
In some optional implementations of this embodiment, the first preset condition may include: the sentence includes a keyword in a preset keyword set and a character in a preset character set, wherein the keywords in the keyword set may be verbs for stating the subject of a sentence or adverbs for modifying such verbs; the characters in the character set may be conjunctions or punctuation marks for connecting words in a parallel relationship in a sentence.
In some optional implementations of the present embodiment, the first generating unit 502 may be further configured to: for a sentence in the at least one sentence, determining whether the sentence includes keywords in a keyword set; if yes, further determining whether the sentence comprises characters in the character set; if the sentence includes characters in the character set, the sentence is extracted.
In some optional implementations of this embodiment, the first preset condition may further include: the length of the sentence is not more than the preset word number; and the first generation unit 502 may be further configured to: for a sentence in the at least one sentence, determining whether the sentence includes keywords in a keyword set; if yes, further determining whether the sentence comprises characters in the character set; if the sentence comprises characters in the character set, further determining whether the length of the sentence is greater than a preset word number; if the length of the sentence is not greater than the preset word number, extracting the sentence.
In some optional implementations of the present embodiment, the second generation unit 503 may include: an extraction subunit (not shown in the figure) configured to, for a sentence in the sentence set, take the sentence as a sentence to be processed, perform syntactic analysis and semantic role analysis on the sentence to be processed to obtain an analysis result, and extract a subject, a predicate and an object from the sentence to be processed based on the analysis result.
In some optional implementations of the present embodiment, the analysis result may include first labeling information for indicating a core verb in the sentence to be processed and second labeling information for indicating the agent part of the core verb; and the extraction subunit may be further configured to: determine whether the analysis result further includes third labeling information for indicating the patient part of the core verb; and, if the third labeling information is included, determine the agent part indicated by the second labeling information, the core verb indicated by the first labeling information and the patient part indicated by the third labeling information as the subject, predicate and object of the sentence to be processed, respectively, and extract the determined subject, predicate and object from the sentence to be processed.
In some optional implementations of this embodiment, the analysis result may further include at least one piece of fourth labeling information, where the fourth labeling information may be used to indicate a verb-object relationship between the core verb and a word other than the core verb in the sentence to be processed; and the extraction subunit may be further configured to: in response to determining that the analysis result does not include the third labeling information, determine, among the at least one piece of fourth labeling information, target fourth labeling information meeting a second preset condition, extract a phrase from the sentence to be processed as the object based on the target fourth labeling information, take the agent part indicated by the second labeling information and the core verb indicated by the first labeling information as the subject and predicate of the sentence to be processed, respectively, and extract the determined subject, predicate and object from the sentence to be processed.
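The extraction logic of the two preceding paragraphs can be sketched as below, where the agent part is the part that performs the action of the core verb and the patient part is the part that receives it. The `AnalysisResult` and `VerbObjectRelation` containers are hypothetical stand-ins for the output of a syntactic and semantic role analyzer, and because this section does not spell out the second preset condition, it is passed in as a caller-supplied predicate (an assumption).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class VerbObjectRelation:
    """Hypothetical form of one piece of fourth labeling information."""
    related_word: str  # word linked to the core verb by a verb-object relation
    phrase: str        # phrase around that word, usable as the object


@dataclass
class AnalysisResult:
    """Hypothetical container for the analysis result of one sentence."""
    core_verb: str                 # first labeling information
    agent: str                     # second labeling information
    patient: Optional[str] = None  # third labeling information (may be absent)
    verb_object_relations: List[VerbObjectRelation] = field(default_factory=list)


def extract_triple(result: AnalysisResult,
                   second_condition: Callable[[VerbObjectRelation], bool] = lambda rel: True
                   ) -> Optional[Triple]:
    """Build a (subject, predicate, object) triple from the analysis result."""
    # Preferred path: the patient part is labeled, so use it as the object.
    if result.patient is not None:
        return (result.agent, result.core_verb, result.patient)
    # Fallback path: pick the first verb-object relation that meets the
    # second preset condition and use its phrase as the object.
    for relation in result.verb_object_relations:
        if second_condition(relation):
            return (result.agent, result.core_verb, relation.phrase)
    # No usable object was found for this sentence.
    return None


with_patient = AnalysisResult("include", "Fruits", "apples, pears and peaches")
print(extract_triple(with_patient))
```

Returning `None` when neither path yields an object is a design choice of this sketch; the source text does not say what happens in that case.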
In some optional implementations of the present embodiment, the selecting unit 504 may include: an acquisition subunit (not shown in the figure) configured to acquire a target classification model, wherein the target classification model may be a trained classification model for predicting whether a relationship between subjects, predicates, and objects in the triplet is correct; a generating subunit (not shown in the figure) configured to select, based on the target classification model, a triplet from the triplet set, in which the relationship between the included subject, predicate and object is correct, to constitute a first triplet set; a selection subunit (not shown) configured to select a triplet from the first set of triples as a target triplet.
In some optional implementations of this embodiment, the acquisition subunit may be further configured to: acquire labeling information of at least one triplet in the triplet set, wherein the labeling information may be used to indicate whether the relationship among the subject, predicate and object in the corresponding triplet is correct; for a triplet in the at least one triplet, perform feature extraction on the triplet to obtain feature information, and input the feature information of the triplet into an initial model to obtain a prediction result corresponding to the triplet, wherein the prediction result may be used to indicate whether the relationship among the subject, predicate and object in the triplet is correct; compare the prediction result with the labeling information of the triplet, and determine, according to the comparison result, whether the initial model reaches a preset optimization target; and, in response to determining that the initial model reaches the optimization target, take the initial model as the target classification model.
In some optional implementations of the present embodiment, the selection subunit may be further configured to: perform a preset disambiguation operation on the triples in the first triplet set; and take the triples in the first triplet set after the disambiguation operation as target triples.
The device provided by this embodiment of the application makes effective use of the generation of triples comprising a subject, a predicate and an object, and of the extraction of the parallel words contained in the object of a target triplet, thereby realizing the mining of parent-child relationships between knowledge points.
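A minimal sketch of this train-and-check loop follows. Everything specific in it is an assumption for illustration: the source names neither the model type, the features, nor the optimization target, so the sketch uses a perceptron, three hand-made features, and an accuracy threshold, and it adjusts the weights whenever the target has not yet been reached.

```python
from typing import List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def extract_features(triple: Triple) -> List[float]:
    """Hypothetical features: bias, subject word count, comma-separated items in object."""
    subject, _predicate, obj = triple
    return [1.0, float(len(subject.split())), float(obj.count(",") + 1)]


def predict(weights: Sequence[float], triple: Triple) -> bool:
    """Predict whether the subject-predicate-object relation is correct."""
    score = sum(w * x for w, x in zip(weights, extract_features(triple)))
    return score > 0


def train(labeled: List[Tuple[Triple, bool]],
          target_accuracy: float = 1.0,
          max_epochs: int = 100) -> Optional[List[float]]:
    """Train until predictions agree with the labels often enough."""
    weights = [0.0, 0.0, 0.0]  # the "initial model"
    for _ in range(max_epochs):
        # Compare predictions with the labeling information.
        correct = sum(predict(weights, t) == label for t, label in labeled)
        if correct / len(labeled) >= target_accuracy:
            return weights  # optimization target reached
        # Assumption: otherwise adjust the model with perceptron updates.
        for triple, label in labeled:
            if predict(weights, triple) != label:
                sign = 1.0 if label else -1.0
                feats = extract_features(triple)
                weights = [w + sign * x for w, x in zip(weights, feats)]
    return None  # target not reached within max_epochs


def select_correct(triples: List[Triple], weights: Sequence[float]) -> List[Triple]:
    """Form the first triplet set from triples predicted to be correct."""
    return [t for t in triples if predict(weights, t)]


labeled = [
    (("fruits", "include", "apples, pears"), True),
    (("the man who left early", "include", "rain"), False),
    (("colors", "include", "red, green, blue"), True),
    (("it was said that we", "include", "none"), False),
]
model = train(labeled)
print(select_correct([t for t, _ in labeled], model))
```

Checking the target before each update pass keeps the accuracy measurement separate from the weight changes, so the returned weights are exactly the ones that were evaluated.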
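As a rough illustration of the final step, the sketch below splits the object of a target triplet into its parallel words and emits one parent-child record per word, with the subject as the parent knowledge point. The connector pattern, the record layout, and the rule that at least two words must be found are assumptions made for this example.

```python
import re
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Assumed connectors: comma, Chinese enumeration comma, and the words "and" / "or".
_CONNECTORS = re.compile(r"\s*(?:,|、|\band\b|\bor\b)\s*")


def extract_parallel_words(obj: str) -> List[str]:
    """Split an object phrase into the words joined by connectors."""
    parts = _CONNECTORS.split(obj.strip(" .。"))
    # Drop empty pieces produced by adjacent connectors such as ", and".
    return [part.strip() for part in parts if part.strip()]


def generate_parent_child_info(triple: Triple) -> List[Dict[str, str]]:
    """Use the subject as parent and each parallel word as a child."""
    subject, _predicate, obj = triple
    children = extract_parallel_words(obj)
    # Assumption: fewer than two words means the object has no parallel words.
    if len(children) < 2:
        return []
    return [{"parent": subject, "child": child} for child in children]


for record in generate_parent_child_info(
        ("Fruits", "include", "apples, pears and peaches.")):
    print(record)
```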
Referring now to FIG. 6, a schematic diagram of a computer system 600 suitable for use in implementing an electronic device (e.g., information generating terminals 101, 102, 103 shown in FIG. 1) of an embodiment of the present application is shown. The electronic device shown in fig. 6 is only an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, mouse, etc.; an output portion 607 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The drive 610 is also connected to the I/O interface 605 as needed. Removable media 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on drive 610 so that a computer program read therefrom is installed as needed into storage section 608.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. The above-described functions defined in the system of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware. The described units may also be provided in a processor, for example, described as: a processor includes an acquisition unit, a first generation unit, a second generation unit, a selection unit, and a third generation unit. The names of these units do not constitute a limitation on the unit itself in some cases, and the acquisition unit may also be described as "a unit that acquires text information to be processed", for example.
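For a software implementation, the five units could be wired together roughly as follows. This is only a sketch: the `Processor` class, its `run` method, and the toy stand-in units in the usage example are hypothetical, and each unit is modeled as a plain callable so that it can be swapped for a real implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class Processor:
    """Hypothetical container holding the five units as callables."""
    acquisition_unit: Callable[[], List[str]]
    first_generation_unit: Callable[[List[str]], List[str]]
    second_generation_unit: Callable[[List[str]], List[Triple]]
    selection_unit: Callable[[List[Triple]], List[Triple]]
    third_generation_unit: Callable[[List[Triple]], List[Dict[str, str]]]

    def run(self) -> List[Dict[str, str]]:
        sentences = self.acquisition_unit()                  # text to be processed
        sentence_set = self.first_generation_unit(sentences)  # filtered sentences
        triples = self.second_generation_unit(sentence_set)   # triplet set
        targets = self.selection_unit(triples)                # target triples
        return self.third_generation_unit(targets)            # parent-child info


# Toy stand-in units, for illustration only.
processor = Processor(
    acquisition_unit=lambda: ["Fruits include apples and pears", "It rains"],
    first_generation_unit=lambda ss: [s for s in ss
                                      if " include " in s and " and " in s],
    second_generation_unit=lambda ss: [
        (s.split(" include ")[0], "include", s.split(" include ")[1]) for s in ss],
    selection_unit=lambda ts: ts,
    third_generation_unit=lambda ts: [
        {"parent": t[0], "child": c} for t in ts for c in t[2].split(" and ")],
)
print(processor.run())
```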
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiment, or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire text information to be processed, wherein the text information to be processed may comprise at least one sentence; extract sentences meeting a first preset condition from the at least one sentence to form a sentence set; for sentences in the sentence set, extract subjects, predicates and objects from the sentences to form triples, wherein parallel words exist in the objects; combine the formed triples into a triplet set, and select triples from the triplet set as target triples; and extract parallel words from the object in a target triplet, take the subject in the target triplet as a parent knowledge point, take the words among the extracted parallel words as child knowledge points, and generate parent-child relationship information for indicating the parent-child relationship between the parent knowledge point and the child knowledge points.
The foregoing description covers only the preferred embodiments of the present application and explains the technical principles employed. Persons skilled in the art will appreciate that the scope of the invention referred to in this application is not limited to technical solutions formed by the specific combinations of the features described above, and also covers other technical solutions formed by any combination of the features described above or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (20)

1. A method for generating information, comprising:
acquiring text information to be processed, wherein the text information to be processed comprises at least one sentence;
extracting sentences meeting a first preset condition from the at least one sentence to form a sentence set; the first preset condition includes: the sentence comprises keywords in a preset keyword set and characters in a preset character set, wherein the keywords in the keyword set are verbs used to state the subject of a sentence or adverbs used to modify such verbs; the characters in the character set are conjunctions or punctuation marks used to connect words having a parallel relationship in a sentence;
for sentences in the sentence set, extracting subjects, predicates and objects from the sentences to form triples, wherein parallel words exist in the objects;
combining the formed triples into a triplet set, and selecting the triples from the triplet set as target triples;
extracting parallel words from objects in the target triplet, taking a subject in the target triplet as a parent knowledge point, taking words in the extracted parallel words as child knowledge points, and generating parent-child relationship information for indicating the parent-child relationship between the parent knowledge point and the child knowledge points.
2. The method of claim 1, wherein the extracting sentences satisfying the first preset condition from the at least one sentence comprises:
for a sentence in the at least one sentence, determining whether the sentence includes keywords in the set of keywords; if yes, further determining whether the sentence comprises characters in the character set; if the sentence includes characters in the character set, the sentence is extracted.
3. The method of claim 1, wherein the first preset condition further comprises: the length of the sentence is not more than the preset word number; and
the extracting sentences meeting the first preset condition from the at least one sentence comprises:
for a sentence in the at least one sentence, determining whether the sentence includes keywords in the set of keywords; if yes, further determining whether the sentence comprises characters in the character set; if the sentence comprises characters in the character set, further determining whether the length of the sentence is greater than the preset word number; and if the length of the sentence is not greater than the preset word number, extracting the sentence.
4. The method of claim 1, wherein the extracting, for sentences in the sentence set, subjects, predicates and objects from the sentences comprises:
for a sentence in the sentence set, taking the sentence as a sentence to be processed, performing syntactic analysis and semantic role analysis on the sentence to be processed to obtain an analysis result, and extracting a subject, a predicate and an object from the sentence to be processed based on the analysis result.
5. The method of claim 4, wherein the analysis result includes first labeling information for indicating a core verb in the sentence to be processed and second labeling information for indicating the agent part of the core verb; and
the extracting a subject, a predicate and an object from the sentence to be processed based on the analysis result comprises:
determining whether the analysis result further comprises third labeling information for indicating the patient part of the core verb;
and if the third labeling information is included, determining the agent part indicated by the second labeling information, the core verb indicated by the first labeling information and the patient part indicated by the third labeling information as the subject, predicate and object of the sentence to be processed, respectively, and extracting the determined subject, predicate and object from the sentence to be processed.
6. The method of claim 5, wherein the analysis result further comprises at least one piece of fourth labeling information for indicating a verb-object relationship between the core verb and a word other than the core verb in the sentence to be processed; and
the extracting a subject, a predicate and an object from the sentence to be processed based on the analysis result further comprises:
in response to determining that the analysis result does not include the third labeling information, determining, among the at least one piece of fourth labeling information, target fourth labeling information meeting a second preset condition, extracting a phrase from the sentence to be processed as the object based on the target fourth labeling information, taking the agent part indicated by the second labeling information and the core verb indicated by the first labeling information as the subject and predicate of the sentence to be processed, respectively, and extracting the determined subject, predicate and object from the sentence to be processed.
7. The method of claim 1, wherein the selecting a triplet from the triplet set as a target triplet comprises:
obtaining a target classification model, wherein the target classification model is a trained classification model for predicting whether the relationship among subjects, predicates and objects in a triplet is correct;
selecting, based on the target classification model, triples from the triplet set in which the relationship among the subject, predicate and object is correct, to constitute a first triplet set;
and selecting a triplet from the first triplet set as a target triplet.
8. The method of claim 7, wherein the obtaining a target classification model comprises:
acquiring labeling information of at least one triplet in the triplet set, wherein the labeling information is used for indicating whether the relation among subjects, predicates and objects in the corresponding triplet is correct;
extracting features of the triples in the at least one triplet to obtain feature information, and inputting the feature information of the triples into an initial model to obtain a prediction result corresponding to the triples, wherein the prediction result is used for indicating whether the relation among subjects, predicates and objects in the triples is correct; comparing the prediction result with the labeling information of the triplet, and determining whether the initial model reaches a preset optimization target according to the comparison result; and in response to determining that the initial model reaches the optimization target, taking the initial model as a target classification model.
9. The method of claim 7, wherein the selecting a triplet from the first set of triples as a target triplet comprises:
performing preset disambiguation operation on triples in the first triplet set;
and taking the triples in the first triplet set after the disambiguation operation as target triples.
10. An apparatus for generating information, comprising:
an acquisition unit configured to acquire text information to be processed, wherein the text information to be processed includes at least one sentence;
a first generation unit configured to extract sentences meeting a first preset condition from the at least one sentence to form a sentence set; the first preset condition includes: the sentence comprises keywords in a preset keyword set and characters in a preset character set, wherein the keywords in the keyword set are verbs used to state the subject of a sentence or adverbs used to modify such verbs; the characters in the character set are conjunctions or punctuation marks used to connect words having a parallel relationship in a sentence;
a second generation unit configured to extract, for a sentence in the sentence set, a subject, a predicate, and an object from the sentence, to form a triplet, wherein there are parallel words in the object;
a selection unit configured to combine the formed triples into a triplet set, and select a triplet from the triplet set as a target triplet;
and a third generation unit configured to extract parallel words from objects in the target triplet, take a subject in the target triplet as a parent knowledge point, and take words in the extracted parallel words as child knowledge points, and generate parent-child relationship information for indicating a parent-child relationship between the parent knowledge point and the child knowledge point.
11. The apparatus of claim 10, wherein the first generation unit is further configured to:
for a sentence in the at least one sentence, determining whether the sentence includes keywords in the set of keywords; if yes, further determining whether the sentence comprises characters in the character set; if the sentence includes characters in the character set, the sentence is extracted.
12. The apparatus of claim 10, wherein the first preset condition further comprises: the length of the sentence is not more than the preset word number; and
the first generation unit is further configured to:
for a sentence in the at least one sentence, determining whether the sentence includes keywords in the set of keywords; if yes, further determining whether the sentence comprises characters in the character set; if the sentence comprises characters in the character set, further determining whether the length of the sentence is greater than the preset word number; and if the length of the sentence is not greater than the preset word number, extracting the sentence.
13. The apparatus of claim 10, wherein the second generation unit comprises:
an extraction subunit configured to, for a sentence in the sentence set, take the sentence as a sentence to be processed, perform syntactic analysis and semantic role analysis on the sentence to be processed to obtain an analysis result, and extract a subject, a predicate and an object from the sentence to be processed based on the analysis result.
14. The apparatus of claim 13, wherein the analysis result includes first labeling information for indicating a core verb in the sentence to be processed and second labeling information for indicating the agent part of the core verb; and
the extraction subunit is further configured to:
determine whether the analysis result further comprises third labeling information for indicating the patient part of the core verb;
and if the third labeling information is included, determine the agent part indicated by the second labeling information, the core verb indicated by the first labeling information and the patient part indicated by the third labeling information as the subject, predicate and object of the sentence to be processed, respectively, and extract the determined subject, predicate and object from the sentence to be processed.
15. The apparatus of claim 14, wherein the analysis result further comprises at least one piece of fourth labeling information for indicating a verb-object relationship between the core verb and a word other than the core verb in the sentence to be processed; and
the extraction subunit is further configured to:
in response to determining that the analysis result does not include the third labeling information, determine, among the at least one piece of fourth labeling information, target fourth labeling information meeting a second preset condition, extract a phrase from the sentence to be processed as the object based on the target fourth labeling information, take the agent part indicated by the second labeling information and the core verb indicated by the first labeling information as the subject and predicate of the sentence to be processed, respectively, and extract the determined subject, predicate and object from the sentence to be processed.
16. The apparatus of claim 10, wherein the selection unit comprises:
an acquisition subunit configured to acquire a target classification model, wherein the target classification model is a trained classification model for predicting whether a relationship between subjects, predicates, and objects in a triplet is correct;
a generation subunit configured to select, based on the target classification model, triples from the triplet set in which the relationship among the subject, predicate and object is correct, to constitute a first triplet set;
a selecting subunit configured to select a triplet from the first set of triples as a target triplet.
17. The apparatus of claim 16, wherein the acquisition subunit is further configured to:
acquiring labeling information of at least one triplet in the triplet set, wherein the labeling information is used for indicating whether the relation among subjects, predicates and objects in the corresponding triplet is correct;
extracting features of the triples in the at least one triplet to obtain feature information, and inputting the feature information of the triples into an initial model to obtain a prediction result corresponding to the triples, wherein the prediction result is used for indicating whether the relation among subjects, predicates and objects in the triples is correct; comparing the prediction result with the labeling information of the triplet, and determining whether the initial model reaches a preset optimization target according to the comparison result; and in response to determining that the initial model reaches the optimization target, taking the initial model as a target classification model.
18. The apparatus of claim 16, wherein the selecting subunit is further configured to:
performing preset disambiguation operation on triples in the first triplet set;
and taking the triples in the first triplet set after the disambiguation operation as target triples.
19. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-9.
20. A computer readable medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-9.
CN201810791223.2A 2018-07-18 2018-07-18 Method and device for generating information Active CN110807311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810791223.2A CN110807311B (en) 2018-07-18 2018-07-18 Method and device for generating information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810791223.2A CN110807311B (en) 2018-07-18 2018-07-18 Method and device for generating information

Publications (2)

Publication Number Publication Date
CN110807311A CN110807311A (en) 2020-02-18
CN110807311B true CN110807311B (en) 2023-06-23

Family

ID=69486556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810791223.2A Active CN110807311B (en) 2018-07-18 2018-07-18 Method and device for generating information

Country Status (1)

Country Link
CN (1) CN110807311B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444349B (en) * 2020-03-06 2023-09-12 深圳追一科技有限公司 Information extraction method, information extraction device, computer equipment and storage medium
CN111709248B (en) * 2020-05-28 2023-07-11 北京百度网讯科技有限公司 Training method and device for text generation model and electronic equipment
CN111859858B (en) * 2020-07-22 2024-03-01 智者四海(北京)技术有限公司 Method and device for extracting relation from text
CN111858894A (en) * 2020-07-29 2020-10-30 网易(杭州)网络有限公司 Semantic missing recognition method and device, electronic equipment and storage medium
CN112528641A (en) * 2020-12-10 2021-03-19 北京百度网讯科技有限公司 Method and device for establishing information extraction model, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105573980A (en) * 2015-12-10 2016-05-11 百度在线网络技术(北京)有限公司 Information segment generation method and device
CN106777275A (en) * 2016-12-29 2017-05-31 北京理工大学 Entity attribute and property value extracting method based on many granularity semantic chunks
CN107291687A (en) * 2017-04-27 2017-10-24 同济大学 It is a kind of based on interdependent semantic Chinese unsupervised open entity relation extraction method
CN107632979A (en) * 2017-10-13 2018-01-26 华中科技大学 The problem of one kind is used for interactive question and answer analytic method and system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7403890B2 (en) * 2002-05-13 2008-07-22 Roushar Joseph C Multi-dimensional method and apparatus for automated language interpretation
WO2007121614A1 (en) * 2006-04-26 2007-11-01 Wenhe Xu An automated translation method for translating a lagnuage into multiple languages
CN101595474B (en) * 2007-01-04 2012-07-11 思解私人有限公司 Linguistic analysis
US9400778B2 (en) * 2011-02-01 2016-07-26 Accenture Global Services Limited System for identifying textual relationships
CN103440252B (en) * 2013-07-25 2016-11-16 北京师范大学 Information extracting method arranged side by side and device in a kind of Chinese sentence
CN106844368B (en) * 2015-12-03 2020-06-16 华为技术有限公司 Method for man-machine conversation, neural network system and user equipment
CN107451164B (en) * 2016-06-01 2020-05-19 华为技术有限公司 Semantic query method and device
CN106776535A (en) * 2016-11-16 2017-05-31 金陵科技学院 Scientific and technical literature fine granularity relation excavation method based on two-stage syntax parsing
CN107798136B (en) * 2017-11-23 2020-12-01 北京百度网讯科技有限公司 Entity relation extraction method and device based on deep learning and server

Also Published As

Publication number Publication date
CN110807311A (en) 2020-02-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant