CN110807311A - Method and apparatus for generating information - Google Patents

Method and apparatus for generating information

Info

Publication number
CN110807311A
Authority
CN
China
Prior art keywords
sentence
triple
information
subject
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810791223.2A
Other languages
Chinese (zh)
Other versions
CN110807311B (en
Inventor
沈之锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810791223.2A
Publication of CN110807311A
Application granted
Publication of CN110807311B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application disclose a method and an apparatus for generating information. One embodiment of the method comprises: acquiring text information to be processed, wherein the text information to be processed comprises at least one sentence; extracting sentences meeting a first preset condition from the at least one sentence to form a sentence set; for each sentence in the sentence set, extracting a subject, a predicate and an object from the sentence to form a triple, wherein the object contains parallel words; combining the formed triples into a triple set, and selecting triples from the triple set as target triples; and extracting the parallel words from the object in a target triple, taking the subject in the target triple as a parent knowledge point, taking each of the extracted parallel words as a child knowledge point, and generating parent-child relationship information indicating the parent-child relationship between the parent knowledge point and the child knowledge points. The embodiment thereby mines parent-child relationships between knowledge points.

Description

Method and apparatus for generating information
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method and a device for generating information.
Background
Whether in user learning or in knowledge graph construction, obtaining the parent node or the child nodes of a given knowledge point is a very frequent requirement. Doing so helps the user understand the broader field associated with the knowledge point, and also lets the user decompose the knowledge point into smaller knowledge points. A method for mining the parent-child relationships among knowledge points is therefore of considerable significance.
Disclosure of Invention
The embodiment of the application provides a method and a device for generating information.
In a first aspect, an embodiment of the present application provides a method for generating information, where the method includes: acquiring text information to be processed, wherein the text information to be processed comprises at least one sentence; extracting sentences meeting a first preset condition from the at least one sentence to form a sentence set; for each sentence in the sentence set, extracting a subject, a predicate and an object from the sentence to form a triple, wherein the object contains parallel words; combining the formed triples into a triple set, and selecting triples from the triple set as target triples; and extracting the parallel words from the object in a target triple, taking the subject in the target triple as a parent knowledge point, taking each of the extracted parallel words as a child knowledge point, and generating parent-child relationship information indicating the parent-child relationship between the parent knowledge point and the child knowledge points.
In some embodiments, the first preset condition includes: the sentence comprises a keyword in a preset keyword set and a character in a preset character set, wherein the keywords in the keyword set are verbs for stating the subject of a sentence or adverbs for modifying such verbs, and the characters in the character set are conjunctions or punctuation marks for connecting words that are in a parallel relationship in a sentence.
In some embodiments, extracting a sentence satisfying a first preset condition from the at least one sentence includes: for a sentence in the at least one sentence, determining whether the sentence comprises a keyword in the keyword set; if yes, further determining whether the sentence comprises characters in the character set; and if the sentence comprises the characters in the character set, extracting the sentence.
In some embodiments, the first preset condition further comprises: the length of the sentence is not more than the preset number of words; and extracting a sentence satisfying a first preset condition from the at least one sentence, including: for a sentence in the at least one sentence, determining whether the sentence comprises a keyword in the keyword set; if yes, further determining whether the sentence comprises characters in the character set; if the sentence comprises characters in the character set, further determining whether the length of the sentence is larger than the preset word number; and if the length of the sentence is not more than the preset word number, extracting the sentence.
In some embodiments, for a sentence in the sentence set, extracting the subject, predicate and object from the sentence comprises: taking the sentence in the sentence set as a sentence to be processed, performing syntactic analysis and semantic role analysis on the sentence to be processed to obtain an analysis result, and extracting the subject, predicate and object from the sentence to be processed based on the analysis result.
In some embodiments, the analysis result includes first annotation information indicating the core verb in the sentence to be processed and second annotation information indicating the agent part of the core verb; and extracting a subject, a predicate and an object from the sentence to be processed based on the analysis result includes: determining whether the analysis result further includes third annotation information indicating the patient part of the core verb; and if the third annotation information is included, determining the agent part indicated by the second annotation information, the core verb indicated by the first annotation information and the patient part indicated by the third annotation information as, respectively, the subject, the predicate and the object in the sentence to be processed, and extracting the determined subject, predicate and object from the sentence to be processed.
In some embodiments, the analysis result further includes at least one piece of fourth annotation information, the fourth annotation information indicating a verb-object relationship between the core verb and a word other than the core verb in the sentence to be processed; and extracting a subject, a predicate and an object from the sentence to be processed based on the analysis result further includes: in response to determining that the analysis result does not include the third annotation information, determining, among the at least one piece of fourth annotation information, target fourth annotation information meeting a second preset condition, extracting a phrase from the sentence to be processed as the object based on the target fourth annotation information, taking the agent part indicated by the second annotation information and the core verb indicated by the first annotation information as, respectively, the subject and the predicate in the sentence to be processed, and extracting the determined subject, predicate and object from the sentence to be processed.
In some embodiments, selecting a triple from the triple set as the target triple includes: obtaining a target classification model, wherein the target classification model is a trained classification model used for predicting whether the relation among subjects, predicates and objects in the triples is correct or not; based on the target classification model, selecting a triple with correct relationship among the subject, the predicate and the object from the triple set to form a first triple set; and selecting a triple from the first triple set as a target triple.
In some embodiments, obtaining the target classification model comprises: obtaining labeling information of at least one triple in the triple set, wherein the labeling information indicates whether the relationship among the subject, predicate and object in the corresponding triple is correct; for a triple in the at least one triple, extracting features of the triple to obtain feature information, and inputting the feature information into an initial model to obtain a prediction result corresponding to the triple, wherein the prediction result indicates whether the relationship among the subject, predicate and object in the triple is correct; comparing the prediction result with the labeling information of the triple, and determining from the comparison result whether the initial model reaches a preset optimization target; and in response to determining that the initial model reaches the optimization target, taking the initial model as the target classification model.
In some embodiments, selecting a triple from the first triple set as the target triple includes: performing a preset disambiguation operation on the triples in the first triple set; and taking the triples in the first triple set after the disambiguation operation as target triples.
In a second aspect, an embodiment of the present application provides an apparatus for generating information, where the apparatus includes: an acquisition unit configured to acquire text information to be processed, wherein the text information to be processed includes at least one sentence; a first generation unit configured to extract sentences satisfying a first preset condition from at least one sentence to form a sentence set; the second generation unit is configured to extract a subject, a predicate and an object from a sentence in the sentence set to form a triple, wherein the object has a parallel word; the selecting unit is configured to combine the formed triples into a triple set, and select the triples from the triple set as target triples; and a third generation unit configured to extract a parallel word from the object in the target triple, use the subject in the target triple as a parent knowledge point, use a word in the extracted parallel word as a child knowledge point, and generate parent-child relationship information indicating a parent-child relationship between the parent knowledge point and the child knowledge point.
In some embodiments, the first preset condition includes: the sentence comprises a keyword in a preset keyword set and a character in a preset character set, wherein the keywords in the keyword set are verbs for stating the subject of a sentence or adverbs for modifying such verbs, and the characters in the character set are conjunctions or punctuation marks for connecting words that are in a parallel relationship in a sentence.
In some embodiments, the first generating unit is further configured to: for a sentence in the at least one sentence, determining whether the sentence comprises a keyword in the keyword set; if yes, further determining whether the sentence comprises characters in the character set; and if the sentence comprises the characters in the character set, extracting the sentence.
In some embodiments, the first preset condition further comprises: the length of the sentence is not more than the preset number of words; and the first generating unit is further configured to: for a sentence in the at least one sentence, determining whether the sentence comprises a keyword in the keyword set; if yes, further determining whether the sentence comprises characters in the character set; if the sentence comprises characters in the character set, further determining whether the length of the sentence is larger than the preset word number; and if the length of the sentence is not more than the preset word number, extracting the sentence.
In some embodiments, the second generating unit comprises: an extraction subunit configured to take a sentence in the sentence set as a sentence to be processed, perform syntactic analysis and semantic role analysis on the sentence to be processed to obtain an analysis result, and extract the subject, predicate and object from the sentence to be processed based on the analysis result.
In some embodiments, the analysis result includes first annotation information indicating the core verb in the sentence to be processed and second annotation information indicating the agent part of the core verb; and the extraction subunit is further configured to: determine whether the analysis result further includes third annotation information indicating the patient part of the core verb; and if the third annotation information is included, determine the agent part indicated by the second annotation information, the core verb indicated by the first annotation information and the patient part indicated by the third annotation information as, respectively, the subject, the predicate and the object in the sentence to be processed, and extract the determined subject, predicate and object from the sentence to be processed.
In some embodiments, the analysis result further includes at least one piece of fourth annotation information, the fourth annotation information indicating a verb-object relationship between the core verb and a word other than the core verb in the sentence to be processed; and the extraction subunit is further configured to: in response to determining that the analysis result does not include the third annotation information, determine, among the at least one piece of fourth annotation information, target fourth annotation information meeting a second preset condition, extract a phrase from the sentence to be processed as the object based on the target fourth annotation information, take the agent part indicated by the second annotation information and the core verb indicated by the first annotation information as, respectively, the subject and the predicate in the sentence to be processed, and extract the determined subject, predicate and object from the sentence to be processed.
In some embodiments, the selecting unit includes: an obtaining subunit configured to obtain a target classification model, wherein the target classification model is a trained classification model for predicting whether a relationship between a subject, a predicate, and an object in a triplet is correct; the generating subunit is configured to select a triple with a correct relationship among the included subject, predicate and object from the triple set based on the target classification model, and form a first triple set; a selecting subunit configured to select a triple from the first triple set as a target triple.
In some embodiments, the obtaining subunit is further configured to: obtain labeling information of at least one triple in the triple set, wherein the labeling information indicates whether the relationship among the subject, predicate and object in the corresponding triple is correct; for a triple in the at least one triple, extract features of the triple to obtain feature information, and input the feature information into an initial model to obtain a prediction result corresponding to the triple, wherein the prediction result indicates whether the relationship among the subject, predicate and object in the triple is correct; compare the prediction result with the labeling information of the triple, and determine from the comparison result whether the initial model reaches a preset optimization target; and in response to determining that the initial model reaches the optimization target, take the initial model as the target classification model.
In some embodiments, the selecting subunit is further configured to: perform a preset disambiguation operation on the triples in the first triple set; and take the triples in the first triple set after the disambiguation operation as target triples.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage device having one or more programs stored thereon, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which when executed by a processor implements the method described in any implementation manner of the first aspect.
According to the method and apparatus for generating information provided by the embodiments of the present application, text information to be processed that includes at least one sentence is obtained, and sentences meeting the first preset condition are extracted from the at least one sentence to form a sentence set, so that the sentences in the sentence set serve as the corpus for mining parent-child relationships among knowledge points. Then, for each sentence in the sentence set, a subject, a predicate and an object are extracted from the sentence to form a triple, the triples are combined into a triple set, and triples are selected from the triple set as target triples, so that the parallel words can be extracted from the object in a target triple, the subject in the target triple can be taken as a parent knowledge point, each of the extracted parallel words can be taken as a child knowledge point, and parent-child relationship information indicating the parent-child relationship between the parent knowledge point and the child knowledge points can be generated. In this way, the generation of triples comprising a subject, a predicate and an object, together with the extraction of the parallel words contained in the object of a target triple, is used to mine parent-child relationships among knowledge points.
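The final step summarized above, turning a target triple into parent-child relationship information, can be sketched as follows. This is only an illustrative sketch: the separator pattern, the function name `parent_child_info` and the dictionary layout of the result are assumptions made for the example and are not specified by the application.

```python
import re
from typing import Dict, List, Tuple, Union

# Hypothetical separators that connect parallel words (commas, the Chinese
# enumeration comma, semicolons, and the conjunctions "and"/"or").
_PARALLEL_SEP = re.compile(r"\s*(?:,|、|;|\band\b|\bor\b)\s*")


def parent_child_info(triple: Tuple[str, str, str]) -> Dict[str, Union[str, List[str]]]:
    """Take the subject as the parent knowledge point and each parallel
    word found in the object as a child knowledge point."""
    subject, _predicate, obj = triple
    children = [w.strip() for w in _PARALLEL_SEP.split(obj) if w.strip()]
    return {"parent": subject, "children": children}


example_triple = (
    "the working part of the function gauge",
    "has",
    "the verifying part, the positioning part and the guiding part",
)
info = parent_child_info(example_triple)
```

Splitting on a fixed pattern is the simplest possible choice; an implementation could equally rely on the coordination arcs of a syntactic parse.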
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating information according to the present application;
FIG. 3 is a schematic illustration of an application scenario of a method for generating information according to the present application;
FIG. 4 is a flow diagram of yet another embodiment of a method for generating information according to the present application;
FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for generating information according to the present application;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing an electronic device according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for generating information or the apparatus for generating information of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include information generating terminals 101, 102, 103, a network 104, and an information storage terminal 105. The network 104 is the medium that provides communication links between the information generating terminals 101, 102, 103 and the information storage terminal 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
The information generating terminals 101, 102, 103 may interact with the information storage terminal 105 through the network 104 to receive or transmit messages and the like. For example, the information generating terminals 101, 102, and 103 may obtain the text information to be processed from the information storage terminal 105, and then perform processing such as analysis on the text information to be processed, to obtain a processing result (e.g., generated parent-child relationship information indicating a parent-child relationship between a parent knowledge point and a child knowledge point).
The information generating terminals 101, 102, and 103 may be terminal devices or servers. When the information generating terminals 101, 102, 103 are terminal devices, various communication client applications, such as a web browser application, an application for mining parent-child relationships between knowledge points, and the like, may be installed on the terminal devices.
The information storage terminal 105 may be a server that provides various services, such as a server for storing text information for processing by the information generation terminals 101, 102, 103.
It should be noted that the method for generating information provided in the embodiment of the present application is generally executed by the information generating terminals 101, 102, 103, and accordingly, the apparatus for generating information is generally disposed in the information generating terminals 101, 102, 103.
It should be noted that the terminal device may be hardware or software. When the terminal device is hardware, it may be any of various electronic devices, including but not limited to a smart phone, a tablet computer, a laptop computer, a desktop computer, and the like. When the terminal device is software, it can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No particular limitation is imposed here.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No particular limitation is imposed here.
In practice, if the information generating terminals 101, 102, 103 store the required text information to be processed in advance, the system architecture 100 may not include the information storage terminal 105.
It should be understood that the numbers of information generating terminals, networks, and information storage terminals in fig. 1 are merely illustrative. There may be any number of information generating terminals, networks, and information storage terminals, as required by the implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating information in accordance with the present application is shown. The process 200 of the method for generating information comprises the following steps:
step 201, text information to be processed is obtained.
In this embodiment, the execution subject of the method for generating information (for example, the information generating terminals 101, 102, 103 shown in fig. 1) may acquire the text information to be processed in real time from a connected server (for example, the information storage terminal 105 shown in fig. 1), or may acquire the text information to be processed locally. The text information to be processed may comprise at least one sentence.
As an example, the execution subject described above may acquire the to-be-processed text information indicated by the information generation request in response to receiving the information generation request for the to-be-processed text information. The sender of the information generation request may be a user side or a server side, and this embodiment does not limit this aspect at all.
Step 202, sentences meeting a first preset condition are extracted from at least one sentence to form a sentence set.
In this embodiment, the execution subject may extract a sentence satisfying a first preset condition from the at least one sentence, and form a sentence set. The first preset condition may include, for example: the sentence comprises keywords in a preset keyword set and characters in a preset character set. Here, the keywords in the keyword set may be verbs for stating subjects in sentences. The characters in the character set may be conjunctions or punctuation marks for connecting words in a sentence in a side-by-side relationship.
As an example, for a sentence in the at least one sentence, the executing entity may first determine whether the sentence includes a keyword in the keyword set. Then, the executing body may further determine whether the sentence includes characters in the character set in response to determining that the sentence includes the keywords in the keyword set. Finally, the executing body may determine that the sentence is a sentence satisfying a first preset condition in response to determining that the sentence includes the characters in the character set, and the executing body may extract the sentence.
By using the first preset condition, it can be ensured that each extracted sentence contains the three sentence components of subject, predicate and object, as well as parallel words. Using the extracted sentences as the corpus for the subsequent mining of parent-child relationships between knowledge points can improve mining efficiency. It should be understood that parallel words include at least two words in a parallel relationship. For example, in the sentence "threads can generally be classified by purpose into three main categories: fastening threads, transmission threads and pipe threads", "fastening threads", "transmission threads" and "pipe threads" are in a parallel relationship and are therefore parallel words.
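The ordered checks of step 202 can be illustrated with a minimal sketch. The contents of `KEYWORDS` and `PARALLEL_MARKERS`, the function names, and the use of plain substring matching are all assumptions made for the example; the optional `max_words` argument corresponds to the sentence-length limit that some embodiments add to the first preset condition.

```python
from typing import Iterable, List, Optional, Set

# Hypothetical example sets; the application does not fix their contents.
KEYWORDS: Set[str] = {"classified into", "includes", "contains", "has"}
PARALLEL_MARKERS: Set[str] = {",", " and ", "、"}


def satisfies_first_condition(
    sentence: str,
    keywords: Set[str] = KEYWORDS,
    markers: Set[str] = PARALLEL_MARKERS,
    max_words: Optional[int] = None,
) -> bool:
    """Check keyword, then parallel marker, then (optionally) length."""
    # Simplification: substring matching, so "has" would also match "phase".
    if not any(k in sentence for k in keywords):
        return False
    if not any(m in sentence for m in markers):
        return False
    if max_words is not None and len(sentence.split()) > max_words:
        return False
    return True


def build_sentence_set(sentences: Iterable[str], **kwargs) -> List[str]:
    """Keep only the sentences that satisfy the first preset condition."""
    return [s for s in sentences if satisfies_first_condition(s, **kwargs)]


corpus = [
    "Threads can be classified into fastening threads, transmission threads and pipe threads.",
    "A thread is a helical ridge.",
    "The gauge has a handle.",
]
sentence_set = build_sentence_set(corpus)
```

Checking the cheapest and most selective condition first means most sentences are rejected before the later checks run.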
In some optional implementations of the present embodiment, the keywords in the keyword set may be verbs for stating the subject of a sentence or adverbs for modifying such verbs. It should be noted that the keyword set may be obtained by expanding a manually specified initial keyword set. The initial keywords may include, for example: "is divided into", "includes", "mainly", "contains", "is" and "has". The execution end that generates the keyword set (e.g., the execution subject or a server to which the execution subject is connected) may input each of the initial keywords into a preset relatedness model that generates word vectors (e.g., Word2vec) to obtain a plurality of words related to that keyword. The execution end may then take a preset number (e.g., 30) of the obtained words as the expanded keywords of that keyword. Finally, the execution end may merge the initial keywords and their expanded keywords into the keyword set.
In some optional implementations of this embodiment, because the keyword set contains a large number of keywords, the execution subject may, to improve matching efficiency, use a multi-pattern matching algorithm (for example, an AC automaton, i.e., an Aho-Corasick automaton) to match the keywords in the keyword set against the words in each of the at least one sentence, so as to determine whether the sentence includes a keyword in the keyword set.
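The expansion step can be sketched as a nearest-neighbour lookup over word vectors. The sketch below is an assumption-laden illustration: it uses a small hand-made table `TOY_VECTORS` in place of vectors produced by a trained model such as Word2vec, ranks candidates by cosine similarity (one common choice, not stated in the application), and the function names are hypothetical.

```python
import math
from typing import Dict, Iterable, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(
    initial: Iterable[str],
    vectors: Dict[str, Sequence[float]],
    top_n: int = 30,
) -> Set[str]:
    """Merge each initial keyword with its top_n most similar words."""
    initial = list(initial)
    expanded = set(initial)
    for kw in initial:
        if kw not in vectors:
            continue  # no vector available for this keyword
        scored = sorted(
            ((cosine(vectors[kw], vec), word)
             for word, vec in vectors.items() if word != kw),
            reverse=True,
        )
        expanded.update(word for _, word in scored[:top_n])
    return expanded


# Hand-made toy vectors, standing in for the output of a trained model.
TOY_VECTORS = {
    "includes": [1.0, 0.0],
    "contains": [0.9, 0.1],
    "comprises": [0.8, 0.2],
    "runs": [0.0, 1.0],
}
keyword_set = expand_keywords(["includes"], TOY_VECTORS, top_n=2)
```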
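For readers unfamiliar with the Aho-Corasick automaton, a compact sketch follows. It matches all keywords in a single pass over the sentence, character by character. The class and method names are illustrative, and matching on characters rather than on segmented words is a simplification assumed for the example.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


class AhoCorasick:
    """Multi-pattern matcher: a trie with failure links."""

    def __init__(self, patterns: Iterable[str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]  # trie transitions per state
        self.fail: List[int] = [0]              # failure link per state
        self.out: List[Set[str]] = [set()]      # patterns ending at each state
        for pattern in patterns:
            if pattern:
                self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern: str) -> None:
        state = 0
        for ch in pattern:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append(set())
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].add(pattern)

    def _build_failure_links(self) -> None:
        # Breadth-first traversal; depth-1 states keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure target.
                self.out[nxt] |= self.out[self.fail[nxt]]

    def find(self, text: str) -> Set[str]:
        """Return every pattern that occurs somewhere in text."""
        state, found = 0, set()
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            found |= self.out[state]
        return found


matcher = AhoCorasick(["classified into", "includes", "contains"])
hits = matcher.find("Threads can be classified into three main categories.")
```

A sentence passes the keyword check when `find` returns a non-empty set; the automaton is built once and reused for every sentence.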
Step 203, for the sentences in the sentence set, extracting the subject, the predicate and the object from the sentences to form a triple.
In this embodiment, for a sentence in the sentence set, the executing entity may extract a subject, a predicate, and an object from the sentence to form a triple.
As an example, the execution subject may perform syntactic analysis (e.g., dependency syntactic analysis) on the sentence to obtain an analysis result. Dependency parsing can identify grammatical components in the sentence, such as the subject, predicate, object, attributive, adverbial and complement, and analyze the relationships between these components. The analysis result may indicate the relationship type of the relationship between different words in the sentence. The relationship types may include a subject-predicate relationship, a verb-object relationship, and the like. The execution subject may then find, based on the analysis result, two words in the sentence that are in a subject-predicate relationship, and take the word on the left as the subject and the word on the right as the predicate. Next, the execution subject may find, based on the analysis result, target words in the sentence that have a verb-object relationship with the predicate. If there is exactly one target word, the execution subject may extract from the sentence the phrase that starts with the word immediately to the right of the predicate and ends with the target word, and use the phrase as the object; if there is more than one target word, the execution subject may select the target word separated from the predicate by the largest number of characters, extract from the sentence the phrase that starts with the word immediately to the right of the predicate and ends with the selected target word, and use the phrase as the object. Finally, the execution subject may extract the determined subject, predicate and object from the sentence to form a triple.
Taking the sentence "the working part of the function gauge has the verifying part, the positioning part and the guiding part" as an example, if the subject "the working part of the function gauge", the predicate "has" and the object "the verifying part, the positioning part and the guiding part" are extracted from the sentence, the execution subject may form the following triple: <the working part of the function gauge, has, the verifying part, the positioning part and the guiding part>.
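The extraction rule of step 203 can be sketched over a dependency parse given as a list of tokens. Everything specific here is an assumption for illustration: the `Token` structure, the relation labels "SBV" (subject-predicate) and "VOB" (verb-object), the restriction to targets on the right of the predicate, and the hand-written toy parse, which is not the output of any real parser and treats multi-word chunks as single tokens.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    text: str
    head: int  # index of the governing token, -1 for the root
    rel: str   # relation to the head; the label names are assumptions


def extract_triple(tokens: List[Token], sep: str = " ") -> Optional[Tuple[str, str, str]]:
    """Return (subject, predicate, object), or None if no triple is found."""
    # Subject-predicate arc: the left word is the subject, its head the predicate.
    subj_idx = next(
        (i for i, t in enumerate(tokens) if t.rel == "SBV" and i < t.head), None
    )
    if subj_idx is None:
        return None
    pred_idx = tokens[subj_idx].head
    # Target words: verb-object dependents on the right of the predicate.
    targets = [
        i for i, t in enumerate(tokens)
        if t.rel == "VOB" and t.head == pred_idx and i > pred_idx
    ]
    if not targets:
        return None
    # Character offset of each token, to measure distance from the predicate.
    offsets, pos = [], 0
    for t in tokens:
        offsets.append(pos)
        pos += len(t.text)
    # With several targets, keep the one farthest (in characters) from the predicate.
    end_idx = max(targets, key=lambda i: offsets[i] - offsets[pred_idx])
    # Object phrase: from the word right after the predicate to the chosen target.
    obj = sep.join(t.text for t in tokens[pred_idx + 1 : end_idx + 1])
    return tokens[subj_idx].text, tokens[pred_idx].text, obj


# Hand-written toy parse of the example sentence (arcs assigned by hand).
toy_parse = [
    Token("the working part of the function gauge", 1, "SBV"),
    Token("has", -1, "HED"),
    Token("the verifying part,", 5, "COO"),
    Token("the positioning part", 5, "COO"),
    Token("and", 5, "LAD"),
    Token("the guiding part", 1, "VOB"),
]
triple = extract_triple(toy_parse)
```

For Chinese text, where words are not separated by spaces, `sep=""` would be the natural choice when joining the object phrase.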
Step 204, combining the formed triples into a triple set, and selecting triples from the triple set as target triples.
In this embodiment, the execution agent may combine the triples composed in step 203 into a triple set. The execution agent may select a triple from the triple set as a target triple. As an example, the execution subject may choose each triple in the triple set as a target triple.
In some optional implementations of this embodiment, the relationship among the subject, predicate and object in some triples may be incorrect. For example, the extracted subject or object may be too long to be suitable as the actual subject or object of the sentence. The execution body may therefore analyze the triples in the triple set to determine whether the relationship among the subject, predicate and object in each triple is correct. The execution body may select, from the triple set, the triples in which this relationship is correct, to form a first triple set. The execution body may then select triples from the first triple set as target triples, for example by selecting every triple in the first triple set.
Here, the execution body may obtain a target classification model and use it to predict whether the relationship among the subject, predicate and object in each triple of the triple set is correct. The target classification model may be a trained classification model for predicting whether the relationship among the subject, predicate and object in a triple is correct. The target classification model may be obtained by training an initial model. The initial model may be, for example, an untrained or not fully trained Naive Bayes Model (NBM) or Support Vector Machine (SVM). It should be noted that, if the execution body is not executing the process 200 for the first time, it may obtain a pre-stored target classification model, since the target classification model is generally stored locally on the execution body or on a server connected to it.
Step 205, extracting parallel words from the objects in the target triple, taking the subject in the target triple as a parent knowledge point, taking the words in the extracted parallel words as child knowledge points, and generating parent-child relationship information for indicating the parent-child relationship between the parent knowledge point and the child knowledge points.
In this embodiment, for each target triple, the execution subject may extract a parallel word from an object in the target triple, use a subject in the target triple as a parent knowledge point, use a word in the extracted parallel word as a child knowledge point, and generate parent-child relationship information indicating a parent-child relationship between the parent knowledge point and the child knowledge point.
Taking the target triple <working part of the function gauge, includes, checking part, positioning part and guiding part> as an example, the execution body may split the object of the triple at the characters it contains that belong to the character set. Here, the object contains the characters "," and "and", both of which are in the character set. The execution body may split the object into, for example, "checking part | , | positioning part | and | guiding part", where "|" represents a delimiter. The execution body may take the resulting words other than the characters "," and "and" as parallel words, that is, "checking part", "positioning part" and "guiding part", and extract them from the object. The execution body may take the subject "working part of the function gauge" as the parent knowledge point, and "checking part", "positioning part" and "guiding part" as child knowledge points. The execution body may generate, for example, the following three pieces of parent-child relationship information:
subject: working part of the function gauge, predicate: includes, object: checking part;
subject: working part of the function gauge, predicate: includes, object: positioning part;
subject: working part of the function gauge, predicate: includes, object: guiding part.
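The splitting of the object into parallel words and the generation of parent-child relationship information can be sketched as below. The contents of `CHARACTER_SET` and the function names are assumptions chosen for this English example; the actual character set is whatever is preset in step 202.

```python
import re

# Hypothetical character set: coordinating conjunctions and punctuation marks.
CHARACTER_SET = [",", "、", "and"]


def split_parallel_words(obj, character_set=CHARACTER_SET):
    # Build one pattern matching any delimiter; alphabetic delimiters such as
    # "and" are wrapped in word boundaries so they do not match inside words.
    parts = []
    for ch in character_set:
        escaped = re.escape(ch)
        parts.append(rf"\b{escaped}\b" if ch.isalpha() else escaped)
    pattern = "|".join(parts)
    return [piece.strip() for piece in re.split(pattern, obj) if piece.strip()]


def parent_child_info(triple, predicate=None):
    # The subject becomes the parent knowledge point; each parallel word
    # in the object becomes one child knowledge point.
    subject, triple_predicate, obj = triple
    return [
        {"subject": subject, "predicate": predicate or triple_predicate, "object": word}
        for word in split_parallel_words(obj)
    ]


target = ("working part of the function gauge", "includes",
          "checking part, positioning part and guiding part")
records = parent_child_info(target)
```

Each element of `records` corresponds to one of the three pieces of parent-child relationship information listed above.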
In some optional implementations of this embodiment, the execution body may store the generated parent-child relationship information in a pre-specified database to form a corresponding knowledge graph. The knowledge graph can then be used in scenarios such as knowledge recommendation. Taking knowledge recommendation as an example, when a user studies a knowledge point, the knowledge graph can be used to obtain the parent knowledge points and/or child knowledge points of that knowledge point. Pushing the obtained knowledge points to the user makes learning more convenient and broadens the user's knowledge.
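As an illustration of the knowledge recommendation scenario, the sketch below keeps the parent-child relationship information in an in-memory structure and looks up the neighbours of a knowledge point. The class `KnowledgeGraph` and its methods are hypothetical; the embodiment only states that the information is stored in a pre-specified database.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy in-memory stand-in for the database holding parent-child relations."""

    def __init__(self):
        self.children = defaultdict(set)  # parent knowledge point -> children
        self.parents = defaultdict(set)   # child knowledge point -> parents

    def add(self, parent, child):
        self.children[parent].add(child)
        self.parents[child].add(parent)

    def related(self, point):
        # Knowledge points that could be pushed to a user studying `point`.
        return {
            "parents": sorted(self.parents.get(point, ())),
            "children": sorted(self.children.get(point, ())),
        }


graph = KnowledgeGraph()
graph.add("interchangeability", "complete interchangeability")
graph.add("interchangeability", "incomplete interchangeability")
recommendation = graph.related("interchangeability")
```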
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating information according to the present embodiment. In the application scenario of fig. 3, the server 301 may obtain the text information to be processed from the server 302 in real time. The obtained text information to be processed may include sentences A, B and C. The content of sentence A may be "interchangeability includes multiple classifications". The content of sentence B may be "interchangeability may be classified into two categories of complete interchangeability and incomplete interchangeability". The content of sentence C may be "the working part of the function gauge has a checking part, a positioning part and a guiding part".
Then, the server 301 may determine whether each of the sentences A, B and C satisfies the first preset condition, where the first preset condition may include: the sentence comprises a keyword in a preset keyword set and a character in a preset character set. In response to determining that the sentences B and C both satisfy the first preset condition, the server 301 may extract the sentences B and C and compose them into the sentence set 303.
Next, the server 301 may extract from the sentence B the subject "interchangeability", the predicate "classified" and the object "complete interchangeability and incomplete interchangeability", which constitute the triple <interchangeability, classified, complete interchangeability and incomplete interchangeability>, as indicated by reference numeral 304. The server 301 may also extract from the sentence C the subject "working part of the function gauge", the predicate "has" and the object "checking part, positioning part and guiding part", which constitute the triple <working part of the function gauge, has, checking part, positioning part and guiding part>, as indicated by reference numeral 305.
The server 301 may then merge the triples indicated by reference numeral 304 and reference numeral 305, respectively, into a triple set 306. The server 301 may choose each triplet in the triplet set 306 as the target triplet.
Finally, for the target triple indicated by reference numeral 304, the server 301 may extract the parallel words "complete interchangeability" and "incomplete interchangeability" from the object of the target triple. The server 301 may take the subject of the target triple as the parent knowledge point, and the parallel words "complete interchangeability" and "incomplete interchangeability" as child knowledge points. The server 301 may generate the following parent-child relationship information (as indicated by reference numeral 307):
subject: interchangeability, predicate: includes, object: complete interchangeability;
subject: interchangeability, predicate: includes, object: incomplete interchangeability.
For the target triple indicated by reference numeral 305, the server 301 may extract the parallel words "checking part", "positioning part" and "guiding part" from the object of the target triple. The server 301 may take the subject of the target triple as the parent knowledge point, and the parallel words "checking part", "positioning part" and "guiding part" as child knowledge points. The server 301 may generate the following parent-child relationship information (as indicated by reference numeral 308):
subject: working part of the function gauge, predicate: has, object: checking part;
subject: working part of the function gauge, predicate: has, object: positioning part;
subject: working part of the function gauge, predicate: has, object: guiding part.
The method provided by the above embodiment of the application effectively utilizes generation of triples including subjects, predicates and objects and extraction of parallel words included in the objects in the target triples, and realizes mining of parent-child relationships among knowledge points.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for generating information is shown. The flow 400 of the method for generating information comprises the steps of:
Step 401, obtaining text information to be processed.
In this embodiment, the executing entity (for example, the information generating end 101, 102, 103 shown in fig. 1) of the method for generating information may acquire the text information to be processed from a connected server (for example, the information storing end 105 shown in fig. 1) or may acquire the text information to be processed locally. Wherein the text information to be processed may comprise at least one sentence.
Step 402, extracting sentences meeting the following first preset conditions from at least one sentence: the sentences comprise keywords in a preset keyword set and characters in a preset character set, the length of the sentences is not more than the preset word number, and the extracted sentences form a sentence set.
In this embodiment, the executing entity may extract a sentence satisfying the following first preset condition from the at least one sentence: the sentence comprises the keywords in the preset keyword set and the characters in the preset character set, and the length of the sentence is not more than the preset word number. Then, the execution body may compose the extracted sentences into a sentence set.
The keywords in the keyword set may be verbs that elaborate on the subject of a sentence, or adverbs that modify such verbs. The characters in the character set may be conjunctions or punctuation marks that connect words standing in a parallel (coordinate) relationship in a sentence.
As an example, for a sentence in the at least one sentence, the executing body may determine whether the sentence includes a keyword in the keyword set. If so, the executing entity may further determine whether the sentence includes characters in the character set. If the sentence includes the characters in the character set, the executing entity may further determine whether the length of the sentence is greater than the preset number of words. If the length of the sentence is not greater than the predetermined number of words, the executing entity may extract the sentence.
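The three checks above can be sketched as a simple filter. The keyword set, character set and word limit below are hypothetical values chosen for an English example, and plain substring tests stand in for whatever matching the execution body actually uses.

```python
KEYWORDS = {"includes", "has", "classified into", "divided into"}  # hypothetical
CHARACTERS = {",", "、", " and "}                                   # hypothetical
MAX_WORDS = 30                                                     # hypothetical


def satisfies_first_condition(sentence, keywords=KEYWORDS,
                              characters=CHARACTERS, max_words=MAX_WORDS):
    # 1. The sentence must contain a keyword from the keyword set.
    if not any(k in sentence for k in keywords):
        return False
    # 2. It must also contain a character from the character set.
    if not any(c in sentence for c in characters):
        return False
    # 3. Its length must not exceed the preset number of words.
    return len(sentence.split()) <= max_words


def build_sentence_set(sentences, **options):
    return [s for s in sentences if satisfies_first_condition(s, **options)]


sentence_a = "interchangeability includes multiple classifications"
sentence_b = ("interchangeability may be classified into two categories of "
              "complete interchangeability and incomplete interchangeability")
sentence_c = ("the working part of the function gauge has a checking part, "
              "a positioning part and a guiding part")
sentence_set = build_sentence_set([sentence_a, sentence_b, sentence_c])
```

Sentence A contains a keyword but no character from the character set, so only sentences B and C are kept.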
It should be noted that, because the keyword set may contain a large number of keywords, the execution body may, to improve matching efficiency, use a multi-pattern matching algorithm (e.g., an Aho-Corasick (AC) automaton) to match the keywords in the keyword set against the words of each sentence in the at least one sentence, so as to determine whether the sentence includes a keyword in the keyword set.
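For readers unfamiliar with multi-pattern matching, the following is a compact character-level Aho-Corasick sketch. It is a generic textbook construction, not the implementation used in the embodiment; the function names and the keyword list are assumptions, and keywords are assumed to be non-empty strings.

```python
from collections import deque


def build_automaton(keywords):
    # Per-state tables: outgoing transitions, failure link, keywords ending here.
    goto, fail, out = [{}], [0], [[]]
    for keyword in keywords:
        node = 0
        for ch in keyword:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(keyword)
    # Breadth-first pass to fill in the failure links (depth-1 states fail to the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_keywords(text, automaton):
    goto, fail, out = automaton
    node, found = 0, set()
    for ch in text:
        # Follow failure links until a transition on `ch` exists or the root is reached.
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        found.update(out[node])
    return found


automaton = build_automaton(["has", "includes", "divided into"])
hits = find_keywords("the working part of the function gauge has a checking part", automaton)
```

The text is scanned once regardless of how many keywords the set contains, which is the efficiency gain the paragraph above refers to.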
Step 403, for each sentence in the sentence set, taking the sentence as a sentence to be processed, performing syntactic analysis and semantic role analysis on the sentence to be processed to obtain an analysis result, and extracting a subject, a predicate and an object from the sentence to be processed based on the analysis result to form a triple.
In this embodiment, for each sentence in the sentence set, the execution body may take the sentence as a sentence to be processed and perform syntactic analysis and semantic role analysis on it to obtain an analysis result. The execution body may extract a subject, a predicate and an object from the sentence to be processed based on the analysis result to form a triple.
It should be noted that the syntactic analysis may include, for example, dependency parsing. Dependency parsing can identify grammatical components of a sentence, such as the subject, predicate, object, attributive, adverbial and complement, and analyze the relationships between these components. In general, arguments can be divided into several types according to the semantic relationship between the predicate and the argument; these argument types are generally referred to as semantic roles. An argument, also called a thematic role or term, is a semantic component that is directly related to the predicate and governed by it. Semantic roles may include the agent, the patient, and the like. The agent generally refers to the doer of an action, i.e. the person or thing that performs the action or undergoes the change. The patient generally refers to the object of an action, i.e. the person or thing affected by the action. Semantic role analysis can be used to analyze the types of the arguments in a sentence and to label the arguments with semantic roles.
In this embodiment, the analysis result may include first annotation information indicating the core verb in the sentence to be processed, and second annotation information indicating the agent part of the core verb. As an example, the first annotation information may be represented by "HED" or the like, which denotes the core relationship and points to the core of the whole sentence (which may be the core verb of the predicate in the sentence). The second annotation information may be represented by, for example, "A0". The execution body may search the analysis result for third annotation information indicating the patient part of the core verb. The third annotation information may be represented by, for example, "A1". If the third annotation information is found, the execution body may take the agent part indicated by the second annotation information, the core verb indicated by the first annotation information, and the patient part indicated by the third annotation information as the subject, the predicate and the object of the sentence to be processed, respectively. The execution body may extract the determined subject, predicate and object from the sentence to be processed to form a triple.
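The mapping from annotation information to a triple can be sketched as follows, assuming the analysis result has already been reduced to a dictionary from label to text span. That dictionary layout and the function name `triple_from_roles` are assumptions for illustration; only the labels "HED", "A0" and "A1" come from the description.

```python
def triple_from_roles(analysis):
    """analysis: dict mapping an annotation label to the text span it marks."""
    core_verb = analysis.get("HED")  # first annotation information: core verb
    agent = analysis.get("A0")       # second annotation information: agent part
    patient = analysis.get("A1")     # third annotation information: patient part
    if core_verb is None or agent is None or patient is None:
        # Without all three annotations this path cannot form a triple.
        return None
    # Agent, core verb and patient become subject, predicate and object.
    return (agent, core_verb, patient)


analysis = {
    "HED": "has",
    "A0": "working part of the function gauge",
    "A1": "checking part, positioning part and guiding part",
}
triple = triple_from_roles(analysis)
```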
In some optional implementations of this embodiment, the analysis result may include, in addition to the first annotation information and the second annotation information, at least one piece of fourth annotation information. The fourth annotation information may indicate a verb-object relationship between the core verb and a word other than the core verb in the sentence to be processed. The fourth annotation information may be represented by, for example, "VOB". If the analysis result does not include the third annotation information, the execution body may perform the following extraction operation:
First, the execution body may determine, among the at least one piece of fourth annotation information, target fourth annotation information that satisfies a second preset condition. As an example, the second preset condition may be that the number of characters between the two indicated words is the largest. For each piece of fourth annotation information, the execution body may count the number of characters between the two words it indicates. The execution body may then compare the counts and determine the piece of fourth annotation information with the largest count as the target fourth annotation information.
Then, the execution body may extract a phrase from the sentence to be processed as the object based on the target fourth annotation information. For example, suppose the sentence to be processed is "interchangeability can be divided into two types of complete interchangeability and incomplete interchangeability", and the two words indicated by the target fourth annotation information are "divided" and "types", where "divided" is the core verb and there is a verb-object relationship between "divided" and "types". The execution body may extract from the sentence to be processed the phrase formed by the characters between the two words together with "types", i.e. "two types of complete interchangeability and incomplete interchangeability", and take this phrase as the object.
Finally, the execution body may take the agent part indicated by the second annotation information and the core verb indicated by the first annotation information as the subject and the predicate of the sentence to be processed, respectively. The execution body may extract the determined subject, predicate and object from the sentence to be processed to form a triple.
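The fallback extraction operation can be sketched as below. The data layout is hypothetical: `words` is the segmented sentence, `analysis["HED"]` is the index of the core verb, and `analysis["VOB"]` is a list of (verb index, object index) pairs. The word list is an English gloss that keeps the source-language word order, in which "two types" is the last word.

```python
def object_from_vob(words, vob_pairs):
    def chars_between(pair):
        # Count the characters strictly between the two words of a pair.
        lo, hi = sorted(pair)
        return sum(len(w) for w in words[lo + 1:hi])

    # Second preset condition: the pair with the most characters in between.
    lo, hi = sorted(max(vob_pairs, key=chars_between))
    # The object is everything after the verb up to and including the far word.
    return " ".join(words[lo + 1:hi + 1])


def triple_with_fallback(words, analysis):
    agent = analysis["A0"]
    verb = words[analysis["HED"]]
    if "A1" in analysis:                 # patient annotation available
        return (agent, verb, analysis["A1"])
    if not analysis.get("VOB"):          # nothing to fall back on
        return None
    return (agent, verb, object_from_vob(words, analysis["VOB"]))


# Hypothetical word-by-word gloss and hand-written annotations.
words = ["interchangeability", "can be", "divided into",
         "complete interchangeability", "and",
         "incomplete interchangeability", "two types"]
analysis = {"HED": 2, "A0": "interchangeability", "VOB": [(2, 3), (2, 6)]}
triple = triple_with_fallback(words, analysis)
```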
Step 404, combining the formed triples into a triple set, and acquiring the annotation information of at least one triple in the triple set.
In this embodiment, the execution agent may combine the triples composed in step 403 into a triple set. The execution subject may obtain annotation information of at least one triple in the triple set. And the annotation information can be used for indicating whether the relationship among the subject, the predicate and the object in the corresponding triple is correct or not.
Note that the annotation information may be set manually. The execution body may output the triple set together with corresponding prompt information, prompting the relevant personnel to set annotation information for at least one triple in the triple set. The execution body may receive the annotation information of the at least one triple sent by the relevant personnel through an electronic device. The annotation information may be represented by, for example, 0 or 1, where 0 may indicate that the relationship among the subject, predicate and object in the triple is incorrect, and 1 may indicate that the relationship is correct.
Step 405, extracting the characteristics of the triples in at least one triplet to obtain characteristic information, inputting the characteristic information of the triples into an initial model to obtain a prediction result corresponding to the triples; comparing the prediction result with the labeling information of the triple, and determining whether the initial model reaches a preset optimization target according to the comparison result; in response to determining that the initial model meets the optimization goal, the initial model is taken as a goal classification model.
In this embodiment, after obtaining the annotation information of the at least one triple, the execution body may take a triple of the at least one triple as a to-be-processed triple and perform the following model training operation:
First, the execution body may perform feature extraction on the to-be-processed triple to obtain feature information. Here, the execution body may determine whether the subject of the to-be-processed triple starts with an attributive-head structure, obtaining a determination result. The execution body may also count the length of the subject of the to-be-processed triple, the number of nouns in the subject, the length of the object, and the number of enumeration commas ("、") in the object. The execution body may use the counted values, the determination result, the predicate of the to-be-processed triple, and the like as the feature information of the to-be-processed triple. It should be noted that the analysis result may further include fifth annotation information indicating an attributive-head relationship between words. The fifth annotation information may be represented by, for example, "ATT". The execution body may determine whether the subject of the to-be-processed triple starts with an attributive-head structure based on the fifth annotation information included in the analysis result associated with the to-be-processed triple.
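A sketch of the feature extraction follows. The input layout is an assumption: the subject arrives as (word, part-of-speech) pairs with "n" marking nouns, and the attributive-head check is passed in as a precomputed flag derived from the "ATT" annotation. Because the example text is an English gloss, the enumeration comma is a parameter and the ordinary comma is passed in its place.

```python
def triple_features(subject_tokens, predicate, obj,
                    subject_starts_with_att, enum_comma="、"):
    """subject_tokens: list of (word, pos_tag) pairs; pos tag "n" marks a noun."""
    subject_text = " ".join(word for word, _ in subject_tokens)
    return {
        # Determination result: does the subject start with an attributive-head structure?
        "subject_starts_with_att": bool(subject_starts_with_att),
        "subject_length": len(subject_text),
        "subject_noun_count": sum(1 for _, pos in subject_tokens if pos == "n"),
        "object_length": len(obj),
        "object_enum_comma_count": obj.count(enum_comma),
        "predicate": predicate,
    }


# Hypothetical part-of-speech tags for the example subject.
subject_tokens = [("working", "v"), ("part", "n"), ("of", "p"),
                  ("the", "u"), ("function", "n"), ("gauge", "n")]
features = triple_features(
    subject_tokens, "has",
    "checking part, positioning part and guiding part",
    subject_starts_with_att=True, enum_comma=",",
)
```

The resulting dictionary would still need to be encoded numerically (for instance, one-hot encoding the predicate) before being fed to a naive Bayes model or a support vector machine.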
Then, the execution body may input the extracted feature information into the initial model to obtain a prediction result. The prediction result may indicate whether the relationship among the subject, predicate and object in the to-be-processed triple is correct. The initial model may be, for example, an untrained or not fully trained naive Bayes model or support vector machine.
Then, the execution subject may compare the obtained prediction result with the labeling information of the triplet to be processed, and determine whether the initial model reaches a preset optimization target according to the comparison result. The optimization target may mean that the accuracy of the prediction result output by the initial model is greater than a preset accuracy threshold.
Finally, the executing agent may treat the initial model as a target classification model in response to determining that the initial model meets the optimization goal.
In some optional implementation manners of this embodiment, the executing agent may further adjust a network parameter of the initial model in response to determining that the initial model does not meet the optimization goal, use the adjusted initial model as the initial model, reselect a triplet from the at least one triplet as a triplet to be processed, and continue to execute the model training operation.
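The control flow of the model training operation (predict, compare with the annotation, check the optimization target, otherwise adjust and continue) can be sketched as follows. To keep the example self-contained, a toy rule model stands in for the naive Bayes model or support vector machine named above; `LengthRuleModel`, its `adjust` rule, `max_rounds` and the sample data are all invented for illustration.

```python
class LengthRuleModel:
    """Toy stand-in for the initial model: predicts 1 (correct) for short subjects."""

    def __init__(self, max_subject_length=1):
        self.max_subject_length = max_subject_length

    def predict(self, features):
        return 1 if features["subject_length"] <= self.max_subject_length else 0

    def adjust(self, features, label):
        # Crude parameter update: widen the limit when a correct triple was
        # rejected, narrow it when an incorrect triple was accepted.
        prediction = self.predict(features)
        if label == 1 and prediction == 0:
            self.max_subject_length = features["subject_length"]
        elif label == 0 and prediction == 1:
            self.max_subject_length = features["subject_length"] - 1


def accuracy(model, samples):
    return sum(model.predict(f) == y for f, y in samples) / len(samples)


def train(model, samples, accuracy_threshold, max_rounds=100):
    for round_index in range(max_rounds):
        # Optimization target: accuracy above the preset threshold.
        if accuracy(model, samples) > accuracy_threshold:
            return model  # use as the target classification model
        # Otherwise pick the next to-be-processed triple and adjust the model.
        features, label = samples[round_index % len(samples)]
        model.adjust(features, label)
    return None


# Invented annotated samples: (feature information, annotation information).
samples = [({"subject_length": 5}, 1),
           ({"subject_length": 8}, 1),
           ({"subject_length": 40}, 0)]
target_model = train(LengthRuleModel(), samples, accuracy_threshold=0.9)
```

In this sketch accuracy is measured on the same annotated triples used for adjustment; a held-out set would give a more reliable check of the optimization target.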
Step 406, selecting, based on the target classification model, the triples in which the relationship among the subject, predicate and object is correct from the triple set to form a first triple set.
In this embodiment, the executing entity may select, based on the target classification model, a triple including a correct relationship among the subject, the predicate, and the object from the triple set, and form a first triple set.
For example, for each triple in the triple set, if the triple has corresponding annotation information, the executing entity may determine whether the annotation information is annotation information indicating that the relationship between the subject, the predicate, and the object in the triple is correct, and if so, the executing entity may select the triple to be included in the first triple set. If the triple has no corresponding label information, the execution main body can extract the characteristics of the triple, and the extracted characteristic information is input into the target classification model to obtain a prediction result. The executing entity may determine whether the prediction result is a prediction result indicating that the relationship between the subject, the predicate, and the object in the triple is correct, and if so, the executing entity may select the triple to be included in the first triple set.
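The selection logic above, which trusts the manual annotation when one exists and falls back to the model otherwise, can be sketched as below. The callable `predict` is a hypothetical wrapper that performs feature extraction and calls the target classification model; the dictionary of annotations and the sample triples are likewise assumptions.

```python
def select_first_triple_set(triples, annotations, predict):
    """annotations: dict triple -> 0 or 1; predict: callable triple -> 0 or 1."""
    selected = []
    for triple in triples:
        if triple in annotations:
            verdict = annotations[triple]   # manually set annotation information
        else:
            verdict = predict(triple)       # prediction of the target classification model
        if verdict == 1:
            selected.append(triple)
    return selected


triples = [("a", "has", "b, c"), ("d", "includes", "e and f"), ("g", "has", "h, i")]
annotations = {("a", "has", "b, c"): 1, ("d", "includes", "e and f"): 0}
# Hypothetical model stand-in that accepts every unlabelled triple.
first_triple_set = select_first_triple_set(triples, annotations, lambda t: 1)
```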
Step 407, a preset disambiguation operation is performed on triples in the first set of triples.
In this embodiment, the executing entity may execute a preset disambiguation operation on a triplet in the first triplet set.
As an example, the executing agent may perform the following disambiguation operations on triples of the first set of triples:
For each triple in the first triple set, taking the triple as the triple to be identified, the execution body may search the first triple set for triples in which one or two items are identical to the corresponding items of the triple to be identified while the remaining items differ. If such triples are found, the execution body may select one triple from among the triple to be identified and the found triples to be retained, and remove the others. For example, the execution body may obtain the frequency of each of the triple to be identified and the found triples, compare the obtained frequencies, and retain the triple with the highest frequency.
It should be noted that, if the number of the maximum frequencies is greater than 1, the execution main body may randomly select one maximum frequency, and select the triple corresponding to the maximum frequency to be reserved.
Optionally, if the number of the maximum frequencies is greater than 1, the execution main body may also send a prompt message to the relevant person to prompt the relevant person to perform manual selection. The execution subject may select the triple indicated by the selection result for retention in response to receiving the selection result sent by the relevant person through the electronic device.
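One possible greedy reading of the disambiguation operation is sketched below. It is a simplification, not the exact procedure of the embodiment: triples are visited in order of decreasing frequency and a triple is kept only if it does not conflict with any triple already kept, and ties are broken by first occurrence rather than randomly or manually. Frequencies are taken from how often a triple occurs in the input list, which is an assumption about where the frequencies come from.

```python
from collections import Counter


def conflicts(a, b):
    # Two triples conflict when one or two of their three items coincide
    # and the remaining items differ.
    same = sum(x == y for x, y in zip(a, b))
    return same in (1, 2)


def disambiguate(triples):
    """triples: list of (subject, predicate, object); repeats convey frequency."""
    frequency = Counter(triples)
    kept = []
    # most_common() yields triples from highest to lowest frequency.
    for triple, _ in frequency.most_common():
        if not any(conflicts(triple, other) for other in kept):
            kept.append(triple)
    return kept


observed = [("a", "has", "b"), ("a", "has", "b"),
            ("a", "has", "c"), ("x", "includes", "y")]
retained = disambiguate(observed)
```

Note that under the literal definition two triples sharing only the predicate also count as conflicting, which is why the unrelated triple in the example uses a different predicate.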
Step 408, taking the triples in the first triple set after the disambiguation operation as target triples, extracting parallel words from the objects in the target triples, taking the subject in each target triple as a parent knowledge point, taking the words in the extracted parallel words as child knowledge points, and generating parent-child relationship information for indicating the parent-child relationship between the parent knowledge point and the child knowledge points.
In this embodiment, the executing entity may use the triplet in the first triplet set after the disambiguation operation as the target triplet. The execution main body can extract parallel words from the object in the target triple, take the subject in the target triple as a father knowledge point, take the words in the extracted parallel words as a child knowledge point, and generate parent-child relationship information for indicating the parent-child relationship between the father knowledge point and the child knowledge point. For the generation method of parent-child relationship information, refer to the related description of step 205 in the embodiment shown in fig. 2, and are not described herein again.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for generating information in the present embodiment highlights the step of extracting the sentence satisfying the following first preset condition: the sentence comprises keywords in a preset keyword set and characters in a preset character set, and the length of the sentence is not more than the preset word number; performing syntactic analysis and semantic role analysis on the sentences in the sentence set, and extracting subjects, predicates and objects from the sentences based on the analysis result; training the initial model to obtain a target classification model; a step of performing a preset disambiguation operation on triples of the first set of triples. Therefore, the scheme described in the embodiment can effectively save time cost and improve the effectiveness of the generated parent-child relationship information.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating information, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for generating information of the present embodiment includes: an obtaining unit 501 configured to obtain text information to be processed, where the text information to be processed may include at least one sentence; a first generating unit 502 configured to extract sentences satisfying a first preset condition from the at least one sentence to constitute a sentence set; a second generating unit 503 configured to, for a sentence in the sentence set, extract a subject, a predicate and an object from the sentence to compose a triple, where the object contains parallel words; a selecting unit 504 configured to combine the composed triples into a triple set and select a triple from the triple set as a target triple; and a third generating unit 505 configured to extract parallel words from the object in the target triple, take the subject in the target triple as a parent knowledge point, take the words in the extracted parallel words as child knowledge points, and generate parent-child relationship information indicating the parent-child relationship between the parent knowledge point and the child knowledge points.
In the present embodiment, in the apparatus 500 for generating information: the specific processing of the obtaining unit 501, the first generating unit 502, the second generating unit 503, the selecting unit 504, and the third generating unit 505 and the technical effects thereof can refer to the related descriptions of step 201, step 202, step 203, step 204, and step 205 in the corresponding embodiment of fig. 2, which are not repeated herein.
In some optional implementations of the present embodiment, the first preset condition may include: the sentence comprises a keyword in a preset keyword set and a character in a preset character set, where the keywords in the keyword set may be verbs that elaborate on the subject of a sentence or adverbs that modify such verbs, and the characters in the character set may be conjunctions or punctuation marks that connect words standing in a parallel (coordinate) relationship in a sentence.
In some optional implementations of the present embodiment, the first generating unit 502 may be further configured to: for a sentence in the at least one sentence, determining whether the sentence comprises a keyword in the keyword set; if yes, further determining whether the sentence comprises characters in the character set; and if the sentence comprises the characters in the character set, extracting the sentence.
In some optional implementation manners of this embodiment, the first preset condition may further include: the length of the sentence is not more than the preset number of words; and the first generating unit 502 may be further configured to: for a sentence in the at least one sentence, determining whether the sentence comprises a keyword in the keyword set; if yes, further determining whether the sentence comprises characters in the character set; if the sentence comprises characters in the character set, further determining whether the length of the sentence is larger than the preset word number; and if the length of the sentence is not more than the preset word number, extracting the sentence.
In some optional implementations of this embodiment, the second generating unit 503 may include: an extraction subunit (not shown in the figure) configured to, for a sentence in the sentence set, take the sentence as a to-be-processed sentence, perform syntactic analysis and semantic role analysis on the to-be-processed sentence to obtain an analysis result, and extract a subject, a predicate and an object from the to-be-processed sentence based on the analysis result.
In some optional implementations of this embodiment, the analysis result may include first annotation information indicating the core verb in the sentence to be processed and second annotation information indicating the agent part of the core verb; and the extraction subunit may be further configured to: determine whether the analysis result further includes third annotation information indicating the patient part of the core verb; and if the third annotation information is included, take the agent part indicated by the second annotation information, the core verb indicated by the first annotation information and the patient part indicated by the third annotation information as the subject, the predicate and the object of the sentence to be processed, respectively, and extract the determined subject, predicate and object from the sentence to be processed.
In some optional implementations of this embodiment, the analysis result may further include at least one fourth label information, where the fourth label information may be used to indicate a verb-to-guest relationship between a core verb and a word other than the core word in the sentence to be processed; and the above extraction subunit may be further configured to: in response to determining that the analysis result does not include the third annotation information, target fourth annotation information meeting a second preset condition is determined in the at least one fourth annotation information, a phrase is extracted from the sentence to be processed as an object based on the target fourth annotation information, a subject part indicated by the second annotation information and a core verb indicated by the first annotation information are sequentially used as a subject and a predicate in the sentence to be processed, and the determined subject, the predicate and the object are extracted from the sentence to be processed.
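The two extraction paths above (third annotation present, or fallback to a fourth annotation) can be sketched as below. The data classes, field names and the concrete form of the second preset condition are assumptions made for illustration: here the condition is taken to be "the verb-object arc is headed by the core verb", and each arc is assumed to already carry the object phrase. The source specifies neither.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class VerbObjectArc:
    """Hypothetical form of one piece of fourth annotation information."""
    head: str    # verb that governs the verb-object arc
    phrase: str  # object phrase anchored at the dependent word


@dataclass
class AnalysisResult:
    """Hypothetical container for the annotations of one sentence."""
    core_verb: str                 # first annotation information
    agent: str                     # second annotation information
    patient: Optional[str] = None  # third annotation information, if any
    verb_object_arcs: List[VerbObjectArc] = field(default_factory=list)


def extract_triple(result: AnalysisResult) -> Optional[Tuple[str, str, str]]:
    """Return (subject, predicate, object), or None if no object is found."""
    # Path 1: the third annotation exists, so use it directly as the object.
    if result.patient is not None:
        return (result.agent, result.core_verb, result.patient)
    # Path 2: fall back to the first arc meeting the assumed condition.
    for arc in result.verb_object_arcs:
        if arc.head == result.core_verb:
            return (result.agent, result.core_verb, arc.phrase)
    return None
```

In practice the annotations would come from a syntactic parser and a semantic role labeler; the sketch starts from their output so that it stays independent of any particular toolkit.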
In some optional implementations of this embodiment, the selecting unit 504 may include: an obtaining subunit (not shown in the figure) configured to obtain a target classification model, wherein the target classification model may be a classification model trained to predict whether the relationship between the subject, the predicate, and the object in the triplet is correct; a generating subunit (not shown in the figure) configured to select, based on the target classification model, a triple of which the relationship between the included subject, predicate and object is correct from the triple set, and constitute a first triple set; a selecting subunit (not shown in the figure) configured to select a triple from the first triple set as a target triple.
In some optional implementations of this embodiment, the obtaining subunit may be further configured to: obtain annotation information of at least one triple in the triple set, wherein the annotation information may be used to indicate whether the relationship among the subject, the predicate and the object in the corresponding triple is correct; extract features of a triple in the at least one triple to obtain feature information, and input the feature information of the triple into an initial model to obtain a prediction result corresponding to the triple, wherein the prediction result may be used to indicate whether the relationship among the subject, the predicate and the object in the triple is correct; compare the prediction result with the annotation information of the triple, and determine, according to the comparison result, whether the initial model reaches a preset optimization target; and in response to determining that the initial model reaches the optimization target, take the initial model as the target classification model.
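A toy version of this training and selection loop is sketched below. Everything concrete in it is an assumption: the three hand-made features, the perceptron update used to adjust the model between comparisons, and accuracy on the annotated triples as the optimization target. The source does not name a model type, a feature set or an update rule.

```python
from typing import List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]
MARKERS = {",", "and", "or"}  # hypothetical parallel-word markers


def features(triple: Triple) -> List[float]:
    """Hypothetical features: bias, marker count in object, subject length."""
    subject, _predicate, obj = triple
    obj_tokens = obj.replace(",", " , ").split()
    marker_count = sum(1 for t in obj_tokens if t in MARKERS)
    return [1.0, float(marker_count), float(len(subject.split()))]


def predict(weights: Sequence[float], triple: Triple) -> int:
    """Return 1 if the triple is predicted correct, else 0."""
    score = sum(w * x for w, x in zip(weights, features(triple)))
    return 1 if score > 0 else 0


def train(triples: Sequence[Triple], labels: Sequence[int],
          target_accuracy: float = 1.0,
          max_epochs: int = 100) -> Optional[List[float]]:
    """Train until the assumed optimization target (accuracy) is reached."""
    weights = [0.0, 0.0, 0.0]  # the "initial model"
    for _ in range(max_epochs):
        for triple, label in zip(triples, labels):
            # Compare the prediction with the annotation of this triple.
            error = label - predict(weights, triple)
            # Perceptron update (an assumed way of adjusting the model).
            weights = [w + error * x
                       for w, x in zip(weights, features(triple))]
        correct = sum(predict(weights, t) == y
                      for t, y in zip(triples, labels))
        if correct / len(triples) >= target_accuracy:
            return weights  # taken as the target classification model
    return None  # target not reached within max_epochs


def select_correct(weights: Sequence[float],
                   triple_set: Sequence[Triple]) -> List[Triple]:
    """Form the first triple set from triples predicted as correct."""
    return [t for t in triple_set if predict(weights, t) == 1]
```

A real system would likely use richer features (for example part-of-speech tags or dependency distances) and a standard classifier; the point here is only the compare-then-check-target control flow.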
In some optional implementations of this embodiment, the selecting subunit may be further configured to: performing a preset disambiguation operation on triples in the first triplet set; and taking the triples in the first triple set after the disambiguation operation as target triples.
The apparatus provided by the above embodiment of the present application effectively utilizes generation of triples including a subject, a predicate, and an object, and extraction of parallel words included in an object in a target triplet, thereby implementing mining of parent-child relationships between knowledge points.
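The final step, splitting the object of a target triple into its parallel words and pairing each with the subject, can be sketched as follows. The splitting pattern (comma, the Chinese enumeration comma, "and", "or") and the dictionary output format are illustrative assumptions, not the format used by the embodiment.

```python
import re
from typing import Dict, List, Tuple

# Assumed parallel-word separators: ",", "、", and the words "and" / "or".
_PARALLEL_SPLIT = re.compile(r"\s*(?:,|、|\band\b|\bor\b)\s*")


def extract_parallel_words(obj: str) -> List[str]:
    """Split an object phrase into its parallel words."""
    return [part.strip() for part in _PARALLEL_SPLIT.split(obj) if part.strip()]


def parent_child_relations(triple: Tuple[str, str, str]) -> List[Dict[str, str]]:
    """Pair the subject (parent) with each parallel word (child)."""
    subject, _predicate, obj = triple
    return [{"parent": subject, "child": word}
            for word in extract_parallel_words(obj)]
```

For the target triple ("mathematics", "includes", "algebra, geometry and calculus") this yields three records, each naming "mathematics" as the parent knowledge point.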
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use in implementing an electronic device (e.g., the information generating end 101, 102, 103 shown in FIG. 1) of an embodiment of the present application is shown. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed in the storage section 608 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The above-described functions defined in the system of the present application are executed when the computer program is executed by the Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or information storage device. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a first generation unit, a second generation unit, a selection unit, and a third generation unit. The names of these units do not in some cases constitute a limitation on the unit itself, and for example, the acquiring unit may also be described as a "unit that acquires text information to be processed".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to: acquire text information to be processed, wherein the text information to be processed may comprise at least one sentence; extract sentences meeting a first preset condition from the at least one sentence to form a sentence set; for a sentence in the sentence set, extract a subject, a predicate and an object from the sentence to form a triple, wherein the object has parallel words; combine the formed triples into a triple set, and select a triple from the triple set as a target triple; and extract the parallel words from the object in the target triple, take the subject in the target triple as a parent knowledge point, take the words in the extracted parallel words as child knowledge points, and generate parent-child relationship information indicating the parent-child relationship between the parent knowledge point and the child knowledge points.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (22)

1. A method for generating information, comprising:
acquiring text information to be processed, wherein the text information to be processed comprises at least one sentence;
extracting sentences meeting a first preset condition from the at least one sentence to form a sentence set;
for the sentences in the sentence set, extracting a subject, a predicate and an object from the sentences to form a triple, wherein the object has parallel words;
combining the formed triples into a triple set, and selecting the triples from the triple set as target triples;
extracting parallel words from the object in the target triple, taking the subject in the target triple as a parent knowledge point, taking the words in the extracted parallel words as child knowledge points, and generating parent-child relationship information for indicating the parent-child relationship between the parent knowledge point and the child knowledge points.
2. The method of claim 1, wherein the first preset condition comprises: the sentence comprises keywords in a preset keyword set and characters in a preset character set, wherein the keywords in the keyword set are verbs for setting out subjects in the sentence or adverbs for modifying the verbs; the characters in the character set are conjunctions or punctuation marks used for connecting words with parallel relation in sentences.
3. The method according to claim 2, wherein said extracting a sentence satisfying a first preset condition from said at least one sentence comprises:
for a sentence of the at least one sentence, determining whether the sentence comprises a keyword in the set of keywords; if yes, further determining whether the sentence comprises characters in the character set; and if the sentence comprises the characters in the character set, extracting the sentence.
4. The method of claim 2, wherein the first preset condition further comprises: the length of the sentence is not more than the preset number of words; and
the extracting of the sentence satisfying a first preset condition from the at least one sentence includes:
for a sentence of the at least one sentence, determining whether the sentence comprises a keyword in the set of keywords; if yes, further determining whether the sentence comprises characters in the character set; if the sentence comprises the characters in the character set, further determining whether the length of the sentence is larger than the preset word number; and if the length of the sentence is not more than the preset word number, extracting the sentence.
5. The method of claim 1, wherein for a sentence in the set of sentences, extracting a subject, a predicate, and an object from the sentence comprises:
taking a sentence in the sentence set as a sentence to be processed, performing syntactic analysis and semantic role analysis on the sentence to be processed to obtain an analysis result, and extracting a subject, a predicate and an object from the sentence to be processed based on the analysis result.
6. The method of claim 5, wherein the analysis result comprises first annotation information indicating a core verb in the sentence to be processed and second annotation information indicating an agent part of the core verb; and
the extracting a subject, a predicate and an object from the sentence to be processed based on the analysis result comprises:
determining whether the analysis result further includes third annotation information indicating a patient part of the core verb;
if the third annotation information is included, determining the agent part indicated by the second annotation information, the core verb indicated by the first annotation information and the patient part indicated by the third annotation information as, respectively, a subject, a predicate and an object in the sentence to be processed, and extracting the determined subject, predicate and object from the sentence to be processed.
7. The method of claim 6, wherein the analysis result further comprises at least one piece of fourth annotation information, and the fourth annotation information is used for indicating a verb-object relationship between the core verb and a word other than the core verb in the sentence to be processed; and
the extracting of the subject, the predicate and the object from the sentence to be processed based on the analysis result further comprises:
in response to determining that the analysis result does not include the third annotation information, determining, among the at least one piece of fourth annotation information, target fourth annotation information meeting a second preset condition, extracting a phrase from the sentence to be processed as an object based on the target fourth annotation information, taking the agent part indicated by the second annotation information and the core verb indicated by the first annotation information as, respectively, a subject and a predicate in the sentence to be processed, and extracting the determined subject, predicate and object from the sentence to be processed.
8. The method of claim 1, wherein the selecting a triple from the set of triples as a target triple comprises:
obtaining a target classification model, wherein the target classification model is a trained classification model for predicting whether the relation among subjects, predicates and objects in the triples is correct or not;
based on the target classification model, selecting the triples with correct relations among the subjects, the predicates and the objects from the triple set to form a first triple set;
and selecting a triple from the first triple set as a target triple.
9. The method of claim 8, wherein the obtaining a target classification model comprises:
obtaining marking information of at least one triple in the triple set, wherein the marking information is used for indicating whether the relation among a subject, a predicate and an object in the corresponding triple is correct or not;
extracting features of a triple in the at least one triple to obtain feature information, and inputting the feature information of the triple into an initial model to obtain a prediction result corresponding to the triple, wherein the prediction result is used for indicating whether the relation among the subject, the predicate and the object in the triple is correct or not; comparing the prediction result with the annotation information of the triple, and determining whether the initial model reaches a preset optimization target according to the comparison result; and in response to determining that the initial model reaches the optimization target, taking the initial model as the target classification model.
10. The method of claim 8, wherein the selecting a triple from the first triple set as a target triple comprises:
performing a preset disambiguation operation on triples in the first triplet set;
and taking the triples in the first triple set after the disambiguation operation as target triples.
11. An apparatus for generating information, comprising:
an acquisition unit configured to acquire text information to be processed, wherein the text information to be processed includes at least one sentence;
a first generating unit configured to extract sentences satisfying a first preset condition from the at least one sentence to form a sentence set;
a second generating unit configured to extract a subject, a predicate and an object from a sentence in the sentence set to form a triple, wherein the object has parallel words;
a selecting unit configured to combine the formed triples into a triple set, and select a triple from the triple set as a target triple; and
a third generating unit configured to extract the parallel words from the object in the target triple, use the subject in the target triple as a parent knowledge point, use the words in the extracted parallel words as child knowledge points, and generate parent-child relationship information indicating a parent-child relationship between the parent knowledge point and the child knowledge points.
12. The apparatus of claim 11, wherein the first preset condition comprises: the sentence comprises keywords in a preset keyword set and characters in a preset character set, wherein the keywords in the keyword set are verbs for setting out subjects in the sentence or adverbs for modifying the verbs; the characters in the character set are conjunctions or punctuation marks used for connecting words with parallel relation in sentences.
13. The apparatus of claim 12, wherein the first generating unit is further configured to:
for a sentence of the at least one sentence, determining whether the sentence comprises a keyword in the set of keywords; if yes, further determining whether the sentence comprises characters in the character set; and if the sentence comprises the characters in the character set, extracting the sentence.
14. The apparatus of claim 12, wherein the first preset condition further comprises: the length of the sentence is not more than the preset number of words; and
the first generating unit is further configured to:
for a sentence of the at least one sentence, determining whether the sentence comprises a keyword in the set of keywords; if yes, further determining whether the sentence comprises characters in the character set; if the sentence comprises the characters in the character set, further determining whether the length of the sentence is larger than the preset word number; and if the length of the sentence is not more than the preset word number, extracting the sentence.
15. The apparatus of claim 11, wherein the second generating unit comprises:
an extraction subunit configured to take a sentence in the sentence set as a sentence to be processed, perform syntactic analysis and semantic role analysis on the sentence to be processed to obtain an analysis result, and extract a subject, a predicate and an object from the sentence to be processed based on the analysis result.
16. The apparatus of claim 15, wherein the analysis result includes first annotation information indicating a core verb in the sentence to be processed and second annotation information indicating an agent part of the core verb; and
the extraction subunit is further configured to:
determine whether the analysis result further includes third annotation information indicating a patient part of the core verb;
if the third annotation information is included, determine the agent part indicated by the second annotation information, the core verb indicated by the first annotation information and the patient part indicated by the third annotation information as, respectively, a subject, a predicate and an object in the sentence to be processed, and extract the determined subject, predicate and object from the sentence to be processed.
17. The apparatus of claim 16, wherein the analysis result further comprises at least one piece of fourth annotation information, the fourth annotation information being used for indicating a verb-object relationship between the core verb and a word other than the core verb in the sentence to be processed; and
the extraction subunit is further configured to:
in response to determining that the analysis result does not include the third annotation information, determine, among the at least one piece of fourth annotation information, target fourth annotation information meeting a second preset condition, extract a phrase from the sentence to be processed as an object based on the target fourth annotation information, take the agent part indicated by the second annotation information and the core verb indicated by the first annotation information as, respectively, a subject and a predicate in the sentence to be processed, and extract the determined subject, predicate and object from the sentence to be processed.
18. The apparatus of claim 11, wherein the selecting unit comprises:
an obtaining subunit configured to obtain a target classification model, wherein the target classification model is a trained classification model for predicting whether a relationship between a subject, a predicate, and an object in a triplet is correct;
a generating subunit configured to select, based on the target classification model, a triple in which a relationship between the included subject, predicate, and object is correct from the triple set, and constitute a first triple set;
a selecting subunit configured to select a triple from the first triple set as a target triple.
19. The apparatus of claim 18, wherein the obtaining subunit is further configured to:
obtaining marking information of at least one triple in the triple set, wherein the marking information is used for indicating whether the relation among a subject, a predicate and an object in the corresponding triple is correct or not;
extracting features of a triple in the at least one triple to obtain feature information, and inputting the feature information of the triple into an initial model to obtain a prediction result corresponding to the triple, wherein the prediction result is used for indicating whether the relation among the subject, the predicate and the object in the triple is correct or not; comparing the prediction result with the annotation information of the triple, and determining whether the initial model reaches a preset optimization target according to the comparison result; and in response to determining that the initial model reaches the optimization target, taking the initial model as the target classification model.
20. The apparatus of claim 18, wherein the selecting subunit is further configured to:
performing a preset disambiguation operation on triples in the first triplet set;
and taking the triples in the first triple set after the disambiguation operation as target triples.
21. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-10.
22. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-10.
CN201810791223.2A 2018-07-18 2018-07-18 Method and device for generating information Active CN110807311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810791223.2A CN110807311B (en) 2018-07-18 2018-07-18 Method and device for generating information

Publications (2)

Publication Number Publication Date
CN110807311A (en) 2020-02-18
CN110807311B (en) 2023-06-23

Family

ID=69486556

Country Status (1)

Country Link
CN (1) CN110807311B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1363200A2 (en) * 2002-05-13 2003-11-19 Knowledgenetica Corporation Multi-dimensional method and apparatus for automated language interpretation
WO2007121614A1 (en) * 2006-04-26 2007-11-01 Wenhe Xu An automated translation method for translating a lagnuage into multiple languages
WO2008080190A1 (en) * 2007-01-04 2008-07-10 Thinking Solutions Pty Ltd Linguistic analysis
CN103440252A (en) * 2013-07-25 2013-12-11 北京师范大学 Method and device for extracting parallel information in Chinese sentence
CN103443787A (en) * 2011-02-01 2013-12-11 埃森哲环球服务有限公司 System for identifying textual relationships
CN105573980A (en) * 2015-12-10 2016-05-11 百度在线网络技术(北京)有限公司 Information segment generation method and device
CN106777275A (en) * 2016-12-29 2017-05-31 北京理工大学 Entity attribute and property value extracting method based on many granularity semantic chunks
CN106776535A (en) * 2016-11-16 2017-05-31 金陵科技学院 Scientific and technical literature fine granularity relation excavation method based on two-stage syntax parsing
WO2017092380A1 (en) * 2015-12-03 2017-06-08 华为技术有限公司 Method for human-computer dialogue, neural network system and user equipment
CN107291687A (en) * 2017-04-27 2017-10-24 同济大学 It is a kind of based on interdependent semantic Chinese unsupervised open entity relation extraction method
CN107451164A (en) * 2016-06-01 2017-12-08 华为技术有限公司 A kind of method and device of semantic query
CN107632979A (en) * 2017-10-13 2018-01-26 华中科技大学 The problem of one kind is used for interactive question and answer analytic method and system
CN107798136A (en) * 2017-11-23 2018-03-13 北京百度网讯科技有限公司 Entity relation extraction method, apparatus and server based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
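Ji Tieliang et al.: "An interactive method for lexicalized syntactic parsing and subcategorization frame acquisition", Journal of Chinese Information Processing (《中文信息学报》), no. 01, 25 January 2007 (2007-01-25), pages 120-126 *
Huang Chenghui et al.: "A text retrieval algorithm based on subject-predicate-object structure", Computer Science (《计算机科学》), no. 09, 15 September 2010 (2010-09-15), pages 173-176 *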

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444349A (en) * 2020-03-06 2020-07-24 深圳追一科技有限公司 Information extraction method and device, computer equipment and storage medium
CN111444349B (en) * 2020-03-06 2023-09-12 深圳追一科技有限公司 Information extraction method, information extraction device, computer equipment and storage medium
CN111709248A (en) * 2020-05-28 2020-09-25 北京百度网讯科技有限公司 Training method and device of text generation model and electronic equipment
CN111709248B (en) * 2020-05-28 2023-07-11 北京百度网讯科技有限公司 Training method and device for text generation model and electronic equipment
CN111859858A (en) * 2020-07-22 2020-10-30 智者四海(北京)技术有限公司 Method and device for extracting relationship from text
CN111859858B (en) * 2020-07-22 2024-03-01 智者四海(北京)技术有限公司 Method and device for extracting relation from text
CN111858894A (en) * 2020-07-29 2020-10-30 网易(杭州)网络有限公司 Semantic missing recognition method and device, electronic equipment and storage medium
CN111858894B (en) * 2020-07-29 2024-06-04 网易(杭州)网络有限公司 Semantic miss recognition method and device, electronic equipment and storage medium
CN112528641A (en) * 2020-12-10 2021-03-19 北京百度网讯科技有限公司 Method and device for establishing information extraction model, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN110807311B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
US11232140B2 (en) Method and apparatus for processing information
CN107491547B (en) Search method and device based on artificial intelligence
CN107220386B (en) Information pushing method and device
US11288593B2 (en) Method, apparatus and device for extracting information
CN108153901B (en) Knowledge graph-based information pushing method and device
CN107220352B (en) Method and device for constructing comment map based on artificial intelligence
CN107066449B (en) Information pushing method and device
US9286290B2 (en) Producing insight information from tables using natural language processing
US9135240B2 (en) Latent semantic analysis for application in a question answer system
US10630798B2 (en) Artificial intelligence based method and apparatus for pushing news
US9542496B2 (en) Effective ingesting data used for answering questions in a question and answer (QA) system
CN109635094B (en) Method and device for generating answer
CN110807311B (en) Method and device for generating information
US11494420B2 (en) Method and apparatus for generating information
CN109522341B (en) Method, device and equipment for realizing SQL-based streaming data processing engine
US11651015B2 (en) Method and apparatus for presenting information
CN111159220B (en) Method and apparatus for outputting structured query statement
CN110555205B (en) Negative semantic recognition method and device, electronic equipment and storage medium
CN110738056B (en) Method and device for generating information
CN110555451A (en) Information identification method and device
CN112329429A (en) Text similarity learning method, device, equipment and storage medium
CN111368036B (en) Method and device for searching information
CN114118072A (en) Document structuring method and device, electronic equipment and computer readable storage medium
JP2020035427A (en) Method and apparatus for updating information
CN111209348B (en) Method and device for outputting information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant