CN109582933B - Method and related device for determining text novelty - Google Patents


Info

Publication number
CN109582933B
Authority
CN
China
Prior art keywords
candidate
target
entity
relationship
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811348626.6A
Other languages
Chinese (zh)
Other versions
CN109582933A (en
Inventor
陈伟然
姜庭欣
杨冠梅
段博超
郭永红
何佳
王志强
王希桢
李静毅
刘乾楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Hexiang Wisdom Technology Co ltd
Original Assignee
Beijing Hexiang Wisdom Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Hexiang Wisdom Technology Co ltd
Priority to CN201811348626.6A
Publication of CN109582933A
Application granted
Publication of CN109582933B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a method for determining text novelty and a related device, wherein the method comprises the following steps: determining a target text; extracting a plurality of target entities in the target text to obtain a target entity set; acquiring a candidate entity set of each candidate text in the candidate text set; determining a first entity intersection of the target entity set and the candidate entity set, wherein the first entity intersection is a matching entity in the target entity set and the candidate entity set; determining a novelty of the target text and the candidate text according to the difference parameters of the first entity intersection and the target entity set. In the embodiment of the application, the accuracy of novelty calculation is improved.

Description

Method and related device for determining text novelty
Technical Field
The invention relates to the field of data processing, in particular to a method for determining text novelty and a related device.
Background
With the arrival of the era of technological explosion, information is becoming ever more important and data volumes keep growing, so information retrieval is particularly important.
Users often need to search a database for candidate texts similar to a target text, but most current retrieval methods are based on text search, which focuses on matching text characters. For example, the user determines keywords in the target text and inputs them, and the retrieval system then matches the keywords against the candidate texts in the database; the more keywords a candidate text matches, the lower the novelty of the target text relative to that candidate text.
In the current approach the user must determine the keywords, and the choice of keywords greatly affects the retrieval result. Because that choice is subjective and does not necessarily reflect the actual content of the target text, the accuracy of the novelty determined for the target text and the candidate text is low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and a related apparatus for determining a text novelty, where all target entities in a target text and all candidate entities in each candidate text are determined, and the novelty of the target text and the candidate text is determined according to a difference parameter between a first entity intersection and a target entity set.
In a first aspect, an embodiment of the present application provides a method for determining text novelty, including:
determining a target text;
extracting a plurality of target entities in the target text to obtain a target entity set;
acquiring a candidate entity set of each candidate text in the candidate text set;
determining a first entity intersection of the target entity set and the candidate entity set, wherein the first entity intersection is a matching entity in the target entity set and the candidate entity set;
determining a novelty of the target text and the candidate text according to the difference parameters of the first entity intersection and the target entity set.
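As a non-limiting illustration, the steps of the first aspect can be sketched in Python. This part of the description does not define the "difference parameter", so the sketch assumes it is the share of target entities that are absent from the first entity intersection; the function name `entity_novelty` and the sample data are hypothetical.

```python
def entity_novelty(target_entities, candidate_entities):
    """Return the share of target entities not matched in the candidate text.

    Assumption: the 'difference parameter' is 1 - |intersection| / |target set|.
    """
    target = set(target_entities)
    if not target:
        return 0.0
    matched = target & set(candidate_entities)  # first entity intersection
    return 1.0 - len(matched) / len(target)


# Hypothetical entity sets for one target text and two candidate texts.
target = {"battery pack", "monitoring module", "CPU processor", "display"}
candidates = {
    "doc_a": {"battery pack", "display", "housing"},
    "doc_b": {"battery pack", "monitoring module", "CPU processor", "display"},
}

# Traverse every candidate text and score the target text against it.
scores = {name: entity_novelty(target, ents) for name, ents in candidates.items()}
```

Under this assumption, a candidate that contains every target entity yields a novelty of 0, and a candidate that shares no entity yields 1.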
In one possible implementation, the method further includes:
extracting a plurality of binary relations in the target text to obtain a target binary relation set, wherein the binary relations comprise two entities and relations between the two entities;
acquiring a candidate binary relation set comprising a plurality of binary relations in the candidate text;
determining a first binary relation intersection of the target binary relation set and the candidate binary relation set, wherein the first binary relation intersection comprises matched binary relations in the target binary relation set and the candidate binary relation set;
the determining the novelty of the target text and the candidate text according to the difference parameters of the first entity intersection and the target entity set comprises:
determining a first entity novelty according to the difference parameters of the first entity intersection and the target entity set;
determining a first binary relation novelty according to the difference parameters of the first binary relation intersection and the target binary relation set;
determining a novelty of the target text and the candidate text according to the first entity novelty and the first binary relation novelty.
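The combination of entity novelty and binary relation novelty can be sketched as follows. This is a minimal sketch under stated assumptions: a binary relation is modeled as a `(head, relation, tail)` tuple, both novelty values use the same set-difference ratio, and the two values are merged with a weight `entity_weight`. The weight and all names are hypothetical and are not specified in this part of the description.

```python
def set_novelty(target_items, candidate_items):
    """Share of target items that have no match among the candidate items."""
    target = set(target_items)
    if not target:
        return 0.0
    return 1.0 - len(target & set(candidate_items)) / len(target)


def combined_novelty(target_entities, candidate_entities,
                     target_relations, candidate_relations,
                     entity_weight=0.5):
    """Weighted mix of entity novelty and binary relation novelty.

    Assumption: a linear combination; the weighting scheme is illustrative.
    """
    first_entity_novelty = set_novelty(target_entities, candidate_entities)
    first_binary_novelty = set_novelty(target_relations, candidate_relations)
    return (entity_weight * first_entity_novelty
            + (1.0 - entity_weight) * first_binary_novelty)


# Binary relations are (head entity, relation, tail entity) tuples.
target_rel = {("charging post", "includes", "control unit"),
              ("control unit", "connected to", "cooling fan")}
candidate_rel = {("charging post", "includes", "control unit")}
target_ent = {"charging post", "control unit", "cooling fan"}
candidate_ent = {"charging post", "control unit"}

score = combined_novelty(target_ent, candidate_ent, target_rel, candidate_rel)
```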
In one possible implementation, the method further includes:
extracting a target ternary relationship set in the target text, wherein the target ternary relationship set comprises a plurality of ternary relationships, the ternary relationship comprises two binary relationships, and the two binary relationships have the same entity;
acquiring a candidate ternary relationship set comprising a plurality of ternary relationships in the candidate text;
determining a first ternary relationship intersection of the target ternary relationship set and the candidate ternary relationship set, wherein the first ternary relationship intersection comprises matched ternary relationships in the target ternary relationship set and the candidate ternary relationship set;
the determining the novelty of the target text and the candidate text according to the first entity novelty and the first binary relation novelty comprises:
determining a first ternary relationship novelty according to the difference parameters of the first ternary relationship intersection and the target ternary relationship set;
determining a novelty of the target text and the candidate text based on the first entity novelty, the first binary relation novelty, and the first ternary relationship novelty.
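Since a ternary relationship is defined above as two binary relations that share an entity, the ternary relationship set can be derived from the binary relation set. The sketch below does so and then scores novelty with a set-difference ratio. The tuple representation, the function names, and the ratio are illustrative assumptions.

```python
from itertools import combinations


def ternary_relations(binary_relations):
    """Pair up binary relations (head, relation, tail) that share an entity."""
    triples = set()
    # Sorting gives every pair a canonical order, so equal pairs compare equal.
    for a, b in combinations(sorted(set(binary_relations)), 2):
        if {a[0], a[2]} & {b[0], b[2]}:
            triples.add((a, b))
    return triples


def ternary_novelty(target_binary, candidate_binary):
    """Share of target ternary relationships missing from the candidate."""
    target = ternary_relations(target_binary)
    if not target:
        return 0.0
    matched = target & ternary_relations(candidate_binary)
    return 1.0 - len(matched) / len(target)


target_binary = {
    ("charging post", "includes", "control unit"),
    ("control unit", "connected to", "cooling fan"),
    ("display", "includes", "panel"),
}
# Only the first two relations share an entity ("control unit").
target_ternary = ternary_relations(target_binary)
```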
In one possible implementation, the extracting a plurality of target entities in the target text includes:
inputting the target text into an entity extraction model, and identifying a plurality of target entities in the target text through the entity extraction model.
In one possible implementation manner, the extracting a plurality of binary relations in the target text includes:
inputting the target text in which the target entities have been identified into a relationship extraction model, and extracting the binary relations between the target entities through the relationship extraction model.
In one possible implementation, the method further includes:
performing structured representation on the target text according to the relationships between the target entities to generate a target structure.
In one possible implementation, the target structure includes nodes and edges, where the nodes represent the target entities and the edges represent the relationships between the target entities.
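A target structure of this kind can be held in a simple graph data structure. The sketch below builds a node set and a labeled edge map from binary relations; the representation (a set of nodes plus a nested dictionary of edge labels) and the name `build_structure` are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_structure(binary_relations):
    """Build (nodes, edges) from (head, relation, tail) tuples.

    nodes: set of entity names
    edges: {head: {tail: relation}}, one labeled edge per related pair
    """
    nodes = set()
    edges = defaultdict(dict)
    for head, relation, tail in binary_relations:
        nodes.update((head, tail))
        edges[head][tail] = relation
    return nodes, dict(edges)


relations = [
    ("charging post", "includes", "control unit"),
    ("control unit", "connected to", "cooling fan"),
]
nodes, edges = build_structure(relations)
```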
In one possible implementation, each candidate text in the candidate text set is a structured candidate structure, and the target text is a target structure, the method further includes:
extracting a candidate entity set of a candidate atlas, wherein the candidate atlas comprises at least one candidate structure;
the determining an entity intersection of the target entity set and the candidate entity set comprises:
determining a second entity intersection of the target set of entities and a set of candidate entities of the candidate atlas;
the method further comprises the following steps:
determining the novelty of the target text and the candidate atlas according to the difference parameters of the second entity intersection and the target entity set.
In one possible implementation, when the candidate atlas includes at least two candidate structures, the at least two candidate structures being a first candidate structure and a second candidate structure, the method further includes:
determining an associated entity of the first candidate structure and the second candidate structure;
associating the first candidate structure with the second candidate structure through the associated entity to obtain the candidate atlas.
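One way to picture this association step: if each candidate structure is a set of `(head, relation, tail)` tuples, two structures can be joined wherever they name the same entity. The sketch assumes that an associated entity is simply an entity with the same name in both structures; this part of the description does not say how associated entities are matched, and the function names are hypothetical.

```python
def entities_of(structure):
    """Collect every entity named in a set of (head, relation, tail) tuples."""
    return {entity for head, _, tail in structure for entity in (head, tail)}


def build_atlas(first, second):
    """Join two candidate structures and report the entities that link them.

    Assumption: entities with identical names are treated as the same node,
    so the union of the two edge sets is connected through the shared names.
    """
    shared = entities_of(first) & entities_of(second)  # associated entities
    atlas = set(first) | set(second)
    return atlas, shared


first = {("charging post", "includes", "control unit")}
second = {("control unit", "connected to", "cooling fan")}
atlas, shared = build_atlas(first, second)
```

If `shared` is empty, the two structures remain disconnected in the resulting atlas.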
In one possible implementation, the method further includes:
extracting a plurality of binary relations in the target structure to obtain a target binary relation set;
positioning the two target entities contained in each target binary relation in the target binary relation set to the corresponding two entity positions in the candidate atlas;
calculating the distance between the two entity positions corresponding to each target binary relation;
determining a second binary relation novelty of each target binary relation relative to the candidate atlas according to the distance;
the determining the novelty of the target text and the candidate atlas according to the difference parameters of the second entity intersection and the target entity set comprises:
determining a second entity novelty according to the difference parameters of the second entity intersection and the target entity set;
determining the novelty of the target structure with the candidate atlas based on the second entity novelty and the second binary relationship novelty.
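The distance-based step can be sketched with a breadth-first search over the candidate atlas. Two points are assumptions, not statements of the claimed method: the distance is taken to be the number of edges on the shortest path between the two entity positions, and the mapping from distance to novelty (0 for directly connected entities, `1 - 1/d` otherwise, 1 when no path exists) is purely illustrative.

```python
from collections import deque


def to_adjacency(triples):
    """Undirected adjacency map built from (head, relation, tail) tuples."""
    adjacency = {}
    for head, _, tail in triples:
        adjacency.setdefault(head, set()).add(tail)
        adjacency.setdefault(tail, set()).add(head)
    return adjacency


def shortest_distance(adjacency, source, target):
    """Number of edges on the shortest path, or None if there is no path."""
    if source not in adjacency or target not in adjacency:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def distance_novelty(distance):
    """Assumed mapping: farther apart in the atlas means more novel."""
    if distance is None:
        return 1.0
    if distance <= 1:
        return 0.0
    return 1.0 - 1.0 / distance


atlas = to_adjacency({
    ("charging post", "includes", "control unit"),
    ("control unit", "connected to", "cooling fan"),
    ("display", "includes", "panel"),
})
d = shortest_distance(atlas, "charging post", "cooling fan")
```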
In one possible implementation, the method further includes:
acquiring a candidate binary relation set comprising a plurality of binary relations in the candidate atlas;
determining a second binary relation intersection of the target binary relation set and the candidate binary relation set;
determining a first binary relation novelty according to the difference parameters of the second binary relation intersection and the target binary relation set;
determining a binary relation novelty according to the first binary relation novelty, the second binary relation novelty and respective corresponding weights;
the determining the novelty of the target structure and the candidate structure according to the second entity novelty and the second binary relationship novelty comprises:
determining the novelty of the target structure and the candidate atlas according to the second entity novelty and the binary relationship novelty.
In one possible implementation, the method further includes:
extracting a plurality of ternary relations in the target structure to obtain a target ternary relation set;
positioning the three target entities contained in each target ternary relationship in the target ternary relationship set to the corresponding three entity positions in the candidate atlas;
calculating the distance between any two of the three entity positions;
determining a second ternary relationship novelty of each target ternary relationship relative to the candidate atlas according to the distance;
the determining the novelty of the target structure and the candidate structure according to the second entity novelty and the second binary relationship novelty comprises:
determining the novelty of the target structure and the candidate atlas according to the second entity novelty, the second binary relational novelty and the second ternary relational novelty.
In one possible implementation, the method further includes:
acquiring a candidate ternary relationship set comprising a plurality of ternary relationships in the candidate atlas;
determining a second ternary relationship intersection of the target ternary relationship set and the candidate ternary relationship set;
determining a first ternary relationship novelty according to the difference parameters of the second ternary relationship intersection and the target ternary relationship set;
determining a ternary relationship novelty according to the first ternary relationship novelty, the second ternary relationship novelty and respective corresponding weights;
determining the novelty of the target structure with the candidate atlas based on the second entity novelty and the binary relationship novelty, comprising:
determining the novelty of the target structure and the candidate atlas according to the second entity novelty, binary relationship novelty, and ternary relationship novelty.
In one possible implementation manner, the extracting a plurality of binary relations in the target text includes:
acquiring an entity relationship data set, wherein the entity relationship data set is obtained according to the entities in a text set and the relationships between the entities; the entity relationship data set comprises N entities and the relationships among the N entities, wherein N is greater than or equal to 2;
querying the entity relationship data set to obtain M second entities having a relationship with a first entity, wherein M is less than or equal to N;
searching for the second entities within a preset range in the target text;
if at least one target second entity among the M second entities is found, establishing a relationship between the first entity and the target second entity.
In a possible implementation manner, before searching for the second entity within a preset range in the target text, the method further includes:
creating an entity matching window;
determining the preset range in the target text according to the size of the entity matching window.
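The window-based lookup can be sketched as follows. The sketch assumes the target text has already been split into tokens in which each entity is a single token, that the entity relationship data set is a nested dictionary `{first_entity: {second_entity: relation}}`, and that the window size counts tokens on either side of the first entity. All of these, and the name `extract_relations`, are hypothetical.

```python
def extract_relations(tokens, relation_data, window=5):
    """Find known (first, relation, second) triples inside a matching window.

    tokens: the target text as a token list (entities are single tokens)
    relation_data: {first_entity: {second_entity: relation}}
    window: number of tokens searched on each side of the first entity
    """
    found = set()
    for i, token in enumerate(tokens):
        related = relation_data.get(token)
        if not related:
            continue  # token is not a known first entity
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in related:
                found.add((token, related[tokens[j]], tokens[j]))
    return found


tokens = ["charging_post", "has", "a", "control_unit", "and", "far",
          "away", "there", "is", "a", "cooling_fan"]
relation_data = {
    "charging_post": {"control_unit": "includes",
                      "cooling_fan": "connected to"},
}
near = extract_relations(tokens, relation_data, window=4)
wide = extract_relations(tokens, relation_data, window=10)
```

A small window only links entities that appear close together in the text, while a larger window trades precision for recall.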
In a second aspect, an embodiment of the present application provides an apparatus for determining text novelty, including:
a first determining module, configured to determine a target text;
an extracting module, configured to extract a plurality of target entities in the target text determined by the first determining module to obtain a target entity set;
an acquiring module, configured to acquire a candidate entity set of each candidate text in the candidate text set;
a second determining module, configured to determine a first entity intersection between the target entity set extracted by the extracting module and the candidate entity set acquired by the acquiring module, where the first entity intersection consists of the entities that match between the target entity set and the candidate entity set;
a novelty determining module, configured to determine the novelty of the target text and the candidate text according to the difference parameters of the first entity intersection determined by the second determining module and the target entity set extracted by the extracting module.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a memory and a processor;
the memory and the processor are communicatively connected to each other, the memory has stored therein computer instructions, and the processor executes the computer instructions to perform the method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer storage medium, where the computer storage medium stores computer instructions for causing the computer to execute the method according to the first aspect.
In this embodiment, a target text requiring novelty determination is determined first, where the target text may be a patent. A plurality of target entities in the target text are then extracted to obtain a target entity set, and a candidate entity set of each candidate text in the candidate text set is acquired. Each candidate text is traversed, and a first entity intersection of the target entity set and the candidate entity set of each candidate text is determined, where the first entity intersection consists of the entities that match between the target entity set and the candidate entity set. Finally, the novelty of the target text and the candidate text is determined according to the difference parameters of the first entity intersection and the target entity set. In this embodiment, all target entities in the target text and all candidate entities in each candidate text are considered, and the novelty is determined according to the difference parameters between the first entity intersection and the target entity set. In the prior art, by contrast, novelty is determined only by matching keywords that the user selects subjectively, so the result is influenced by the user's subjective understanding. This embodiment therefore improves the accuracy of the novelty calculation.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and not to be construed as limiting the invention in any way, and in which:
FIG. 1 is a flow chart illustrating steps of an embodiment of a method for training a structured model according to the present application;
FIG. 2 is a flowchart illustrating steps of an embodiment of a method for text structuring according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a target structure in an embodiment of the present application;
FIG. 4 is a diagram illustrating an image structure according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating steps of an embodiment of a method for determining text similarity according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a Word2vec model training process in the embodiment of the present application;
FIG. 7 is a flowchart illustrating steps of one embodiment of a method for determining text novelty, in an embodiment of the present application;
FIG. 8 is a schematic diagram of a candidate atlas in an embodiment of the application;
FIG. 9 is a flowchart illustrating steps of an embodiment of a method for obtaining image information according to an embodiment of the present application;
FIG. 10 is a schematic illustration of the figure description and the figures in the candidate text in an embodiment of the present application;
FIG. 11 is a schematic view of a topology of a first candidate image and a second candidate image in an embodiment of the present application;
FIG. 12 is a flowchart illustrating steps of an embodiment of a method for obtaining entity information according to an embodiment of the present application;
FIG. 13 is a block diagram illustrating an embodiment of an apparatus for determining text novelty according to an embodiment of the present application;
FIG. 14 is a block diagram of an embodiment of an apparatus for determining text novelty according to the present application;
FIG. 15 is a schematic structural diagram of an embodiment of an electronic device in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention.
The embodiment of the present application provides a text structuring method. Texts in the embodiment of the present application include, but are not limited to, technical documents, patent documents, academic papers, and the like. A text is represented in a structured way to obtain structured information (for example, a structure diagram) that helps a user understand the text content. Alternatively, the structured information may be used as a search query for information retrieval. Taking patent documents as an example, most current patent search methods are based on text search, which focuses on matching text characters, lacks understanding of user needs and patent content, and does not search on the basis of content understanding. The method provided by the embodiment of the application represents the patent text in a structured way, so that retrieval can be carried out on the basis of understanding the patent content, which improves retrieval accuracy.
The text structuring method provided by the embodiment of the application is applied to an electronic device, which may be a server or a terminal device; terminal devices include, but are not limited to, computers, mobile phones, palmtop computers, and the like. The electronic device acquires a target text to be structured (for example, a patent), inputs the target text into a trained entity extraction model, and identifies the entities in the target text through the entity extraction model. It then inputs the target text with the identified entities into a trained relationship extraction model and extracts the relationships between the entities through the relationship extraction model. According to the entities and the relationships between them, the target text is represented in a structured way to generate structured information (also referred to as a structured text representation), which may be, for example, a structure diagram or a flow chart. In the embodiment of the application, the entities in the target text are extracted by the trained entity extraction model, the relationships among the entities are extracted by the trained relationship extraction model, and the structured text representation is generated automatically from the entities and their relationships, so that the text content is easy to understand, the conversion is fast, and labor costs are saved.
To facilitate understanding of the text structuring method provided in the embodiments of the present application, the terms used in the embodiments are explained first:
Entity: a term representing a feature in a text (e.g., a patent or a paper). In technical documents such as patents and papers, an entity is a term representing a technical feature; entities include components, attributes, and attribute values.
Component: represents a component in the text, such as a charging device or a memory.
Attribute: represents an attribute of a component, such as the "voltage" of a charging device.
Attribute value: represents the value of an attribute of a component; for example, the voltage of the charging device is "240 V".
Relationships between entities: the relationships between technical features, which specifically include relationships between components, relationships between components and attributes, and relationships between attributes and attribute values.
Wherein, 1) the kinds of relationships between components include, but are not limited to:
Inclusion relationship: for example, the charging post includes a control unit.
Connection relationship: for example, the humidity adjusting device is connected with the cooling fan.
2) Relationship between a component and an attribute:
A component has certain attributes; for example, the charging device has a voltage attribute.
3) Relationship between an attribute of a component and an attribute value:
An attribute has a specific attribute value; for example, the voltage "is" 240 V.
Example 1
The text structuring method provided in the embodiments of the present application is described in detail below with reference to FIG. 1. The method mainly includes two parts: the first part is training the structured model, and the second part is the structured representation of a text.
First, training the structured model:
The structured model comprises an entity extraction model for extracting entities and a relationship extraction model for extracting the relationships between entities. The training method comprises the following steps:
Step 101, obtaining a labeled first corpus set, wherein the first corpus set is obtained by performing entity corpus labeling on each text in the first text set according to a first preset rule.
The first text set includes, but is not limited to, technical documents, patents, academic papers, etc., and the first text set in the embodiment of the present application is described by taking patents as examples. For example, the first set of text may include ten thousand patents, and it should be noted that the number of patents included in the first set of text is by way of example only and not by way of limitation.
The first corpus set is obtained by performing entity corpus labeling on each text in the first text set according to a first preset rule. The first preset rule is as follows: a first vocabulary representing the entity and a second vocabulary representing a non-entity are distinguished.
Specifically, a part of the contents of one of the patents in the first text set is taken as an example for explanation:
The text is: "An automobile high-mount brake lamp, characterized in that it comprises a rectangular mounting base plate (1), a shell frame (2) matched with the mounting base plate is arranged on the mounting base plate (1), and a plurality of partition plates (3) are arranged in the shell frame (2)". For this text, the labeled corpus has the following format:
' an automobile high-mount brake lamp ' comprises a rectangular/pre-installation/start installation/in-seat/in-plate/end (/ after 1), wherein a matching/pre outer/start shell/in-frame/end (/ after 2) is arranged on the/pre-installation/start installation/in-seat/in-plate/end (/ after 1), a plurality of/pre partition/start plates/ends (3) are arranged in the/pre outer/start shell/in-frame/end (/ after 2), and each/pre partition/start plate/end has a/pre shaft/entry '/after
Wherein, the first preset rule is specifically as follows: the first identifier (e.g., /start) marks the first word of an entity, the second identifier (e.g., /end) marks the last word of the entity, and the third identifier (e.g., /in) marks a word of the entity between the first identifier and the second identifier. The fourth identifier (e.g., /entity) marks an entity that consists of a single word. The fifth identifier (e.g., /pre) marks the word preceding the first identifier. The sixth identifier (e.g., /after) marks the word following the second identifier. All other words that are not part of an entity name are assigned a uniform seventh identifier (e.g., /w).
For example: bag/w includes/w moment/w shape/w/pre-mount/start mount/in-seat/in-plate/end (/ after 1/w)/w.
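The labeling rule can be sketched as a small function that tags a token list given the entity spans. This is a minimal sketch under stated assumptions: the example uses English word tokens for readability, whereas the labeled corpus in the examples here appears to be tagged token by token on the original text; entity tags take priority over /pre and /after when they collide; and the name `label_tokens` is hypothetical.

```python
def label_tokens(tokens, entity_spans):
    """Tag tokens with w, pre, start, in, end, after, or entity.

    entity_spans: list of (first_index, last_index) pairs, both inclusive.
    """
    tags = ["w"] * len(tokens)
    for start, end in entity_spans:
        if start == end:
            tags[start] = "entity"  # entity made of a single token
        else:
            tags[start] = "start"
            tags[end] = "end"
            for i in range(start + 1, end):
                tags[i] = "in"
    # Mark the neighbors of each entity, without overwriting entity tags.
    for start, end in entity_spans:
        if start > 0 and tags[start - 1] == "w":
            tags[start - 1] = "pre"
        if end + 1 < len(tokens) and tags[end + 1] == "w":
            tags[end + 1] = "after"
    return [f"{token}/{tag}" for token, tag in zip(tokens, tags)]


tokens = ["comprises", "a", "mounting", "base", "plate", "(", "1", ")"]
labeled = label_tokens(tokens, [(2, 4)])
```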
It should be noted that, in the embodiments of the present application, the identifiers marked on the corpus are only for illustration and are not used to limit the embodiments of the present application.
Step 102, training on the first corpus set to obtain an entity extraction model.
A Conditional Random Field (CRF) model is trained on the first corpus set to obtain model parameters, and the entity extraction model is constructed according to the model parameters.
A CRF can label Chinese text character by character, forming words (phrases) from the characters. It takes into account both the frequency with which characters and words occur and their context, and it has good learning capacity, so it performs well at recognizing ambiguous words and out-of-vocabulary words.
Step 103, taking a second text set as the input of the entity extraction model, and identifying entity information in the second text set through the entity extraction model.
The second text set is also a set of patents.
For example, part of the contents of one of the patents in this second set of texts is:
"A battery monitoring and management device, comprising a battery pack (1), a monitoring module (2), a CPU processor (3) and a display (4)". For this passage, analysis with the entity extraction model yields:
one/w type/w electricity/w pool/w monitor/w measure/w tube/w manage/w install/w set/w,/w pack/w include/w electricity/start pool/in set/end (/ w 1/w)/w,/w monitor/start measure/in module/in block/end (/ w 2/w)/w,/w CPU/start process/in processor/end (/ w 3/w)/w and/w display/start show/in processor/end (/ w 4/w)/w
Four component names are extracted from the above example: battery pack, monitoring module, CPU processor, and display.
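Turning a tagged sequence back into entity names can be sketched as follows. The input is assumed to be a list of `(token, tag)` pairs using the tags of the first preset rule; joining tokens with a space is an assumption made for the English word tokens used here, and `decode_entities` is a hypothetical name.

```python
def decode_entities(tagged):
    """Collect entity names from (token, tag) pairs.

    A 'start' ... 'end' run forms one entity; an 'entity' tag is a
    single-token entity; any other tag breaks an unfinished run.
    """
    entities, current = [], []
    for token, tag in tagged:
        if tag == "entity":
            entities.append(token)
            current = []
        elif tag == "start":
            current = [token]
        elif tag == "in" and current:
            current.append(token)
        elif tag == "end" and current:
            current.append(token)
            entities.append(" ".join(current))
            current = []
        else:
            current = []
    return entities


tagged = [
    ("comprises", "w"),
    ("battery", "start"), ("pack", "end"), ("(", "w"), ("1", "w"), (")", "w"),
    (",", "w"),
    ("monitoring", "start"), ("module", "end"),
    ("and", "w"),
    ("display", "entity"),
]
names = decode_entities(tagged)
```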
Step 104, acquiring a labeled second corpus set, wherein the second corpus set is obtained by performing relation corpus labeling and entity labeling on each text of the second text set according to a second preset rule.
After the entity extraction model completes component extraction, the relationships are labeled in the corpus, and the labeled corpus is converted into the corpus format of the CRF model for training.
The second preset rule is as follows: distinguish and label a first vocabulary representing entities, a third vocabulary representing relationships, and a fourth vocabulary representing neither entities nor relationships.
Specifically, the following examples are given:
Example: "The mounting base plate (1) is provided with a shell frame (2) matched with the mounting base plate."
Relation corpus labeling is performed on the text, and it is labeled as follows:
the mounting base plate/e (/w 1/w)/w upper/w set/r_start has/r_end and/w phase/w match/w shell frame/e (/w 2/w)/w.
Wherein the seventh identifier (e.g., /w) marks an ordinary character, the eighth identifier (e.g., /e) marks a component recognized by the entity extraction model, the ninth identifier (e.g., /r_start) marks the first word of a relationship, and the tenth identifier (e.g., /r_end) marks the last word of the relationship.
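Reading relationships off a sequence labeled in this way can be sketched as follows. The sketch assumes that a relationship links the nearest preceding entity to the next following entity and that every token from /r_start through /r_end belongs to the relationship; neither assumption is stated in the description, and `decode_relations` is a hypothetical name.

```python
def decode_relations(tagged):
    """Extract (head, relation, tail) triples from (token, tag) pairs.

    Tags: 'e' for an entity, 'r_start' and 'r_end' for the first and last
    word of a relationship, 'w' for everything else.
    """
    triples = []
    head, relation, in_relation = None, [], False
    for token, tag in tagged:
        if tag == "e":
            if head is not None and relation and not in_relation:
                triples.append((head, " ".join(relation), token))
            relation = []  # a relationship is consumed at each entity
            head = token
        elif tag == "r_start":
            relation, in_relation = [token], True
        elif tag == "r_end":
            if in_relation:
                relation.append(token)
            in_relation = False
        elif in_relation:
            relation.append(token)  # word inside the relationship span
    return triples


tagged = [
    ("mounting base plate", "e"), ("(", "w"), ("1", "w"), (")", "w"),
    ("upper", "w"), ("set", "r_start"), ("has", "r_end"),
    ("and", "w"), ("match", "w"),
    ("shell frame", "e"), ("(", "w"), ("2", "w"), (")", "w"),
]
triples = decode_relations(tagged)
```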
It should be noted that, in the embodiment of the present application, the entity extraction model identifies the relationship between the entity and the entity, in the example in the embodiment of the present application, the component identified by the entity extraction model is only an example, and the entity extraction model may also identify the attribute and the attribute value, but the example is not an example in the embodiment.
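The labeling scheme above can be illustrated with a minimal sketch. It assumes a cleaned-up `token/tag` text form in which each token contains no spaces, and it writes one token per line in a column layout of the kind CRF toolkits commonly read. The function names, the underscore-joined tokens, and the tab separator are assumptions made for illustration and are not taken from the source.

```python
# Hypothetical tag names mirroring the identifiers in the example above.
TAGS = {"w", "e", "r_start", "r_end"}

def parse_tagged(text):
    """Split space-separated 'token/tag' items into (token, tag) pairs."""
    pairs = []
    for item in text.split():
        token, _, tag = item.rpartition("/")
        if not token or tag not in TAGS:
            raise ValueError(f"malformed item: {item!r}")
        pairs.append((token, tag))
    return pairs

def to_column_format(pairs):
    """Emit one 'token<TAB>tag' line per token (assumed training layout)."""
    return "\n".join(f"{token}\t{tag}" for token, tag in pairs)

# Assumed cleaned-up form of the annotated example sentence.
sentence = ("mounting_seat_plate/e (/w 1/w )/w on/w "
            "is_provided/r_start with/r_end shell_frame/e (/w 2/w )/w")
pairs = parse_tagged(sentence)
relation_words = [token for token, tag in pairs if tag.startswith("r_")]
print(to_column_format(pairs))
```

Here `relation_words` collects the tokens between the relation start and end markers, which is the span a relation extractor would be trained to recover.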
Step 105, training the second corpus information set to obtain the relationship extraction model.
The second corpus information set is trained using a CRF (conditional random field) model to obtain model parameters, and the relationship extraction model is constructed according to the model parameters. The model parameters include a regularization term parameter a; setting a to L2 can achieve a better fit than L1. The hyperparameter c may take the value 3, so that the training data is fitted as closely as possible. The parameter f is the frequency threshold for features participating in training; f may take the value 3, in which case a word that occurs fewer than f times does not participate in training.
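The effect of the threshold f can be shown with a small sketch. It only illustrates the counting rule stated above (words occurring fewer than f times are excluded) and is not the internal logic of any particular CRF toolkit; the function name and the toy data are hypothetical.

```python
from collections import Counter

def words_kept_for_training(sentences, f=3):
    """Return the set of words whose total occurrence count is at least f."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= f}

# Toy tokenized corpus: "a" and "b" occur 3 times each, "c" occurs once.
corpus = [["a", "b"], ["a", "c"], ["a", "b"], ["b"]]
kept = words_kept_for_training(corpus, f=3)
print(sorted(kept))
```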
For example, the relationship between entities extracted from the above text is: the "mounting seat plate" is provided with a "shell frame".
In the embodiment of the application, a labeled first corpus set is obtained, wherein the first corpus set is obtained by labeling entity corpuses of each text in a first text set according to a first preset rule; then training the first corpus set to obtain an entity extraction model, wherein the entity extraction model is used for extracting entities in texts; then, taking a second text set as the input of the entity extraction model, and identifying entity information in the second text set through the entity extraction model; acquiring a labeled second corpus set; and training the second corpus information set to obtain the relationship extraction model, wherein the relationship extraction model is used for extracting the relationship between the entities, and the entities and the relationship between the entities are used for carrying out structural representation on the text.
On the basis of the foregoing embodiment, the entity extraction model in this embodiment of the present application includes at least two entity extraction submodels, where the at least two entity extraction submodels include a first entity extraction submodel and a second entity extraction submodel, and the training of the first corpus set to obtain the entity extraction model may further specifically include:
training the first corpus set to obtain the first entity extraction submodel;
taking a third text set as an input of the first entity extraction sub-model, and identifying a target entity set in the third text set through the first entity extraction sub-model;
and training the target entity set to obtain the second entity extraction submodel.
In the embodiment of the application, an entity dictionary does not need to be prepared in advance. Only a certain amount of corpora (such as the first corpus set) needs to be labeled to train the first entity extraction submodel. The target entity set in the third text set is then identified through the first entity extraction submodel, and the target entity set can be used as a new labeled corpus, which is trained to obtain the second entity extraction submodel. The second entity extraction submodel can cover more entities, and an entity dictionary is generated accordingly; through identification by a plurality of entity extraction submodels, the entity dictionary can contain more and more entities. For example, the entity vocabularies extracted from all patents are collected together to form an entity dictionary, which can comprise 2 columns: entity and frequency. The frequency is the number of patents that contain the component, for example: mounting seat plate, 3; shell frame, 4. In the embodiment of the application, a certain amount of entity corpora is labeled and the entity extraction submodels are trained iteratively, so that the plurality of entity extraction submodels cover more entities and the accuracy of recognizing entities in the text is greatly improved.
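The two-column dictionary (entity, frequency) can be sketched as follows, assuming the entities of each patent have already been extracted into a list. Because the frequency is defined as the number of patents containing the entity, repeated mentions inside one patent are counted once. The function name and sample data are illustrative only.

```python
from collections import Counter

def build_entity_dictionary(entities_per_patent):
    """Map each entity to the number of patents in which it appears."""
    frequency = Counter()
    for entities in entities_per_patent:
        frequency.update(set(entities))  # one count per patent per entity
    return dict(frequency)

# Hypothetical extraction results for three patents.
extracted = [
    ["mounting seat plate", "shell frame", "shell frame"],
    ["shell frame"],
    ["mounting seat plate", "partition plate"],
]
entity_dictionary = build_entity_dictionary(extracted)
print(entity_dictionary)
```

The same counting applies unchanged to a relationship dictionary if the lists hold relation words instead of entities.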
Similarly, the relationship extraction model in the embodiment of the present application includes at least two relationship extraction submodels, where the at least two relationship extraction submodels include a first relationship extraction submodel and a second relationship extraction submodel, and training the second corpus information set to obtain the relationship extraction model may further specifically include:
training the second corpus set to obtain the first relation extraction submodel;
taking a fourth text set as the input of the first relation extraction submodel, and identifying a target relation set in the fourth text set through the first relation extraction submodel;
and training the target relation set to obtain the second relation extraction submodel.
In the embodiment of the application, an entity relationship dictionary does not need to be prepared in advance. Only a certain amount of relationship corpora (such as the second corpus set) needs to be labeled to train the first relationship extraction submodel. The target relationship set in the fourth text set is then identified through the first relationship extraction submodel, and the target relationship set can be used as a new labeled relationship corpus, which is trained to obtain the second relationship extraction submodel. The second relationship extraction submodel can cover more relationships, and a relationship dictionary is generated accordingly; through identification by a plurality of relationship extraction submodels, the relationship dictionary can contain more and more relationships. For example, the relationship vocabularies extracted from all patents are collected together to form a relationship dictionary, which can comprise 2 columns: relationship and frequency. The frequency is the number of patents that contain the relationship, for example: includes, 10; is provided with, 20. In the embodiment of the application, a certain amount of relationship corpora is labeled and the relationship extraction submodels are trained iteratively, so that the plurality of relationship extraction submodels cover more relationships and the accuracy of recognizing relationships in the text is greatly improved.
Text structured representation is then carried out.
As shown in fig. 2, an embodiment of the present application provides a text structuring method, which includes the following steps:
Step 201, obtaining a target text to be structured.
A target text to be structured is obtained, which may be a patent, for example.
Step 202, inputting the target text into an entity extraction model, and identifying a target entity set in the target text through the entity extraction model. The entity extraction model is obtained by training the first corpus set, and the first corpus set is obtained by performing entity corpus labeling on each text in the first text set.
Firstly, the target text is input into an entity extraction model, and a target entity set in the target text is identified through the entity extraction model. For example, the target text includes the following: "A high-mounted automobile stop lamp, characterized in that it comprises a rectangular mounting seat plate (1), wherein a shell frame (2) matched with the mounting seat plate (1) is arranged on the mounting seat plate (1), and a plurality of partition plates (3) are arranged in the shell frame (2)." The target entity set output by the entity extraction model for this target text is: mounting seat plate, shell frame, and partition plate.
Step 203, inputting the target text in which the target entity set has been identified into a relationship extraction model, and extracting the relationships between the target entities through the relationship extraction model.
The target text in which the target entities have been identified is input into the relationship extraction model, and the relationship extraction model outputs the relationships between the target entities. For example, the relationships between the entities are: the mounting seat plate is provided with a shell frame; the shell frame is provided with a partition plate.
Step 204, performing structured representation on the target text according to the target entities and the relationships between the target entities to generate a target structure.
Referring to fig. 3 for understanding, fig. 3 is a schematic diagram of a target structure. The generated target structure comprises nodes and edges. The nodes represent the entities, and the entities comprise components, attributes, or attribute values; the edges represent relationships between entities, including relationships between the components, relationships between the components and the attributes, or relationships between the attributes and the attribute values.
For example, the entities and their relationships extracted from a patent are as follows:
The brake lamp comprises a mounting seat plate
The brake lamp comprises a grating plate
The brake lamp comprises an LED lamp
The mounting seat plate is provided with a shell frame
The shell frame is provided with a partition plate
The shell frame is provided with an installation cavity
The result of entity extraction in the target text is fused with the result of entity relationship extraction, so as to obtain the structure diagram (such as the target structure shown in fig. 3) of the whole target text.
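The fusion of extracted entities and relationships into a node-and-edge structure can be sketched as follows. The triple representation (head, relation, tail), the function name, and the dictionary layout are assumptions for illustration; the source does not prescribe a concrete data structure.

```python
def build_target_structure(triples):
    """Collect unique entities as nodes and labeled relations as edges."""
    nodes = []
    edges = []
    for head, relation, tail in triples:
        for entity in (head, tail):
            if entity not in nodes:
                nodes.append(entity)
        edges.append((head, relation, tail))
    return {"nodes": nodes, "edges": edges}

# The six relations listed in the example above.
triples = [
    ("brake lamp", "comprises", "mounting seat plate"),
    ("brake lamp", "comprises", "grating plate"),
    ("brake lamp", "comprises", "LED lamp"),
    ("mounting seat plate", "is provided with", "shell frame"),
    ("shell frame", "is provided with", "partition plate"),
    ("shell frame", "is provided with", "installation cavity"),
]
structure = build_target_structure(triples)
children = [tail for head, _, tail in structure["edges"] if head == "shell frame"]
print(structure["nodes"])
print(children)
```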
In the embodiment of the application, the entities in the target text are extracted through the trained entity extraction model, the relationships between the entities are extracted through the trained relationship extraction model, the structured text representation is automatically generated according to the entities and the relationships between the entities, and the target text or the candidate text is composed of the entities and the relationships between the entities, so that the relationships between the entities in the text content are extracted, the understanding of the text content is facilitated, the conversion speed is high, and the labor cost is saved.
In an application scenario, a user finds a target text (such as a patent). If the patent is long or its logic is intricate, the user would need a lot of time to understand its content by reading alone. The user can instead convert the patent into a structure diagram through an electronic device (such as a mobile phone): the mobile phone receives the patent, inputs the patent into an entity extraction model, and identifies a target entity set in the patent through the entity extraction model; then the patent in which the target entity set has been identified is input into a relationship extraction model, and the relationships between the target entities are extracted through the relationship extraction model; the target text is then structurally represented according to the target entities and the relationships between the target entities to generate a target structure, and the terminal displays the target structure. Alternatively, the user may send the patent to a server through a terminal (e.g., a mobile phone); the server converts the patent into a target structure and sends the target structure to the terminal, and the terminal displays the target structure. According to the method and the device, the target text is converted into the target structure, so that the user can understand the content of the target text more conveniently, and the labor cost is greatly saved.
On the basis of the above embodiment, the present application also provides another embodiment. When the relationships between entities are extracted by the relationship extraction model, related target entities may appear in two different sentences, in which case the relationship extraction model may fail to identify the relationship. For example, the text to be recognized is "The battery pack is connected to the monitoring module; the CPU processor and the display are also connected." The relationship between the battery pack and the monitoring module ("battery pack connected to monitoring module") can be recognized by the above relationship extraction model, whereas the relationships involving the CPU processor and the display, which appear in the other sentence, may not be recognized.
To address the problem that the relationship extraction model may fail to identify relationships between entities located in different sentences, the present application provides another embodiment:
the target text includes a first entity, and after step 203, before step 204, the following steps may be further included:
acquiring an entity relationship data set, wherein the entity relationship data set is obtained by extracting entities in a text set and the relationships between the entities; the entity relationship data set comprises N entities and the relationships among the N entities, wherein N is greater than or equal to 2;
querying the entity relationship data set to obtain M second entities having relationships with the first entity, wherein M is less than or equal to N;
searching for the second entities within a preset range in the target text;
and if at least one target second entity in the M second entities is found, establishing a relationship between the first entity and the target second entity.
Specifically, firstly, an entity relationship data set is obtained, wherein the entity relationship data set is obtained by extracting entities in a text set and the relationships among the entities; the entity relationship data set comprises N entities and the relationships among the N entities, wherein N is greater than or equal to 2.
The specific method for acquiring the entity relationship data set comprises the following steps:
inputting the text set into an entity extraction model, and identifying entity information in the text set through the entity extraction model; the collection of text may be understood to include a collection of multiple texts, for example, a collection of hundreds of thousands of patents. It should be noted that the number of texts included in the text collection is for illustration and is not a limitation to the embodiment of the present application.
The text set in which the entity information has been identified is then input into a relationship extraction model, and the relationships between entities in each text in the text set are extracted through the relationship extraction model. The entity relationship data set includes the relationships between entities in each text in the text set.
The entity relationship data set is shown as matrix A below (rows and columns are entities; a cell holds the relationship from the row entity to the column entity, or 0 if there is none):
              Brake lamp   Base               ……   LED lamp       ……   Lamp shell
Brake lamp    0            is provided with        0                   0
Base          0            0                       includes            connected to
……
LED lamp      0            0                                           connected to
……
Lamp shell    0            connected to            connected to        0
And then, inquiring in the entity relationship data set to obtain M second entities having relationships with the first entity, wherein M is less than or equal to N.
For example, in the target text the first entity is "base", and no relationship between "base" and any other component has been extracted. One likely cause is that "base" and the components related to it appear in different sentences. It is then necessary to determine which entities the first entity has relationships with in the entity relationship data set, because the first entity may also have relationships with those entities in the target text.
For example, the first entity is a "base". Searching for a second entity related to the "base" in the matrix a may specifically be:
locating the row of "base" in the matrix A and obtaining the set S_a of all components associated with "base"; S_a includes the components: LED lamp, lamp shell. Locating the column of "base" in the matrix A and obtaining the set S_b of all components associated with "base"; S_b includes the components: brake lamp, lamp shell.
The set S = S_a ∪ S_b, and the set S includes (S_0, S_1, S_2, …, S_k, …, S_n);
in the above example, the set S includes (LED lamp, lamp shell, brake lamp).
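The row and column lookup can be sketched as follows. The sketch stores matrix A sparsely as a nested dictionary holding only the non-zero cells, which is an assumed representation; the function name is hypothetical.

```python
# Sparse form of matrix A: A[row_entity][column_entity] = relationship.
A = {
    "brake lamp": {"base": "is provided with"},
    "base": {"LED lamp": "includes", "lamp shell": "connected to"},
    "LED lamp": {"lamp shell": "connected to"},
    "lamp shell": {"base": "connected to", "LED lamp": "connected to"},
}

def related_entities(matrix, first_entity):
    """Return (S_a, S_b, S): row hits, column hits, and their ordered union."""
    s_a = list(matrix.get(first_entity, {}))
    s_b = [row for row, cells in matrix.items() if first_entity in cells]
    s = []
    for entity in s_a + s_b:
        if entity not in s:
            s.append(entity)
    return s_a, s_b, s

s_a, s_b, s = related_entities(A, "base")
print(s_a, s_b, s)
```

With the cells shown in matrix A, this reproduces the sets given in the example for "base".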
Further, searching the second entity within a preset range in the target text;
the preset range in the target text may be determined according to the size of an entity matching window. The size of the entity matching window may be predetermined.
Starting from the position where the component appears, the target second entity is found within g positions forward and g positions backward. For example, the entity matching window looks for the second entity starting from the "base" position within 10 characters forward and 10 characters backward.
And finally, if at least one target second entity in the M second entities is found, establishing the relationship between the first entity and the target second entity.
For example, if, among the 3 second entities, the LED lamp and the brake lamp are found, they are used as the target second entities, and a relationship between the "base" and each target second entity is then established, wherein the type of the relationship is "related".
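The window search and the resulting "related" links can be sketched as follows. The sketch assumes the window is measured in characters, that a second entity counts as found only if it lies entirely inside the g characters before or after an occurrence of the first entity, and that g is enlarged to 20 because English entity names are longer than the 10-character example above. The function name and the sample sentence are hypothetical.

```python
def find_target_second_entities(text, first_entity, second_entities, g=10):
    """Return second entities lying within g characters of first_entity."""
    found = []
    start = text.find(first_entity)
    while start != -1:
        end = start + len(first_entity)
        before = text[max(0, start - g):start]
        after = text[end:end + g]
        for entity in second_entities:
            if entity not in found and (entity in before or entity in after):
                found.append(entity)
        start = text.find(first_entity, end)  # continue with next occurrence
    return found

text = ("LED lamp on the base of the brake lamp; "
        "elsewhere in the claim, a lamp shell")
candidates = ["LED lamp", "lamp shell", "brake lamp"]
targets = find_target_second_entities(text, "base", candidates, g=20)
relations = [("base", "related", entity) for entity in targets]
print(relations)
```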
In this embodiment, an entity relationship data set is obtained, and the entity relationship data set is queried to obtain M second entities having a relationship with the first entity, where M is less than or equal to N; then searching the second entity in a preset range in the target text; if at least one target second entity in the M second entities is found, establishing a relationship between the first entity and the target second entity so as to solve the problem that a relationship extraction model may not be capable of identifying the second entities related to the first entity in different sentences.
Optionally, the target structure in this embodiment of the application may be a text structure, and may also be an image structure, and the specific manner of generating the image structure includes:
firstly, acquiring target image information for representing the entity;
specifically, the image set can be obtained from internet data (such as various related forums, patent databases, paper databases) and local databases;
identifying text in each image in the set of images; and if the target entity matches the text in the image set, selecting image information for representing the target entity from the image set. For example, the text in each image in the set of images is identified, and if the text (e.g., engine) in the first image matches the text (e.g., engine) of the first target entity, the text (e.g., connecting rod) in the second image matches the text (e.g., connecting rod) of the second target entity, and the text (e.g., hold-down mechanism) in the third image matches the text (e.g., hold-down mechanism) of the third target entity, then the first image, the second image, and the third image are selected as the image information representing the first, second, and third target entities.
Then, a target structure represented by image information is generated according to the target entity and the relationship between the target entities.
Referring to fig. 4, fig. 4 is a schematic diagram of an image structure. For example, the relationship between "engine", "connecting rod", and "hold-down mechanism" is: the "engine" is connected to the "connecting rod" and the "engine" is connected to the "hold-down mechanism", and the image structure shown in fig. 4 is generated according to the "engine", "connecting rod", and "hold-down mechanism" and the connection relationships between them. In this example, image information used for representing the target entities is acquired, an image structure is generated according to the target entities and the relationships between the target entities, and the image structure is displayed, so that the entities in the text and the relationships between them are presented more vividly and the text content is easier for the user to understand.
The method for training the entity extraction model and the relationship extraction model is described in detail above, and the entity extraction model and the relationship extraction model are applied to structurally represent the text.
It should be noted that the executing agent for steps 101 to 105 and the executing agent for steps 201 to 204 may be the same electronic device or different electronic devices. Steps 101 to 105 are performed before step 201; once the entity extraction model and the relationship extraction model have been trained, steps 101 to 105 need not be executed again, and step 201 can be executed directly.
Example 2
Referring to fig. 5, an embodiment of the present application further provides a method for determining text similarity, where the method in this example is applied to an electronic device, where the electronic device may be a server or a terminal, and the method may include the following steps:
Step 301, acquiring a target text and a candidate data set, wherein the candidate data set comprises a plurality of arrays, and each array in the plurality of arrays represents a semantic vector of an entity; the entity is included in the candidate text.
The server may receive a target text sent by the terminal, for example, the target text may be a patent.
The specific method for acquiring the candidate data set by the server comprises at least the following two modes:
in a first possible implementation:
first, a text set is obtained, where the text set includes n candidate texts, where n is an integer greater than or equal to 2, it is understood that the text set may be all patents in one technical field in a patent library, or the text set may be a subset of all patents in one technical field in a patent library, for example, n may be one hundred thousand or million.
Then, extracting entities in each candidate text of the n candidate texts to obtain m entities, where it should be noted that, in this step, a specific method for extracting entities in each candidate text of the n candidate texts may be extracting according to the entity extraction model described in embodiment 1, inputting each candidate text into the entity extraction model, and outputting entities in each candidate text through the entity extraction model to obtain m entities, where m is an integer greater than or equal to 2, and for example, m may be ten million, twenty million, and so on.
Determining a target matrix according to the n candidate texts and the entities contained in each candidate text, for example, the target matrix B is as follows:
            Entity 1   ……   Entity j   ……   Entity m
Patent 1    1               0               0
Patent 2    0               3               4
……
Patent i    0               1               1
……          0               0
Patent n    6               1               1
Matrix B has n rows and m columns; each of the n rows represents a candidate text, and each of the m columns represents an entity, where B[i][j] = the number of times entity j appears in patent i. For example, entity j appears 3 times in patent 2, entity m appears 1 time in patent i, and so on.
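Building the count matrix can be sketched as follows, assuming each candidate text has already been reduced to a list of extracted entity mentions (with repeats). Column order follows first appearance, which is an arbitrary choice made here; the function name and sample data are hypothetical.

```python
def build_target_matrix(entities_per_text):
    """Return (entity_columns, B) where B[i][j] counts entity j in text i."""
    columns = []
    for entities in entities_per_text:
        for entity in entities:
            if entity not in columns:
                columns.append(entity)
    index = {entity: j for j, entity in enumerate(columns)}
    B = [[0] * len(columns) for _ in entities_per_text]
    for i, entities in enumerate(entities_per_text):
        for entity in entities:
            B[i][index[entity]] += 1
    return columns, B

# Hypothetical entity mentions for two candidate texts.
texts = [
    ["seat plate", "seat plate", "LED lamp"],
    ["LED lamp", "connector", "connector", "connector"],
]
columns, B = build_target_matrix(texts)
print(columns)
print(B)
```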
And finally, carrying out singular value decomposition on the target matrix B to obtain a candidate data set.
Specifically, singular value decomposition is performed on the target matrix B as follows:
$$B = U \Sigma V^{T}$$
The matrix U has n rows and k columns, and each row represents the vector of one text (e.g., a patent).
The matrix Σ is the matrix of singular values of the matrix B and has k rows and k columns, where k is a specified number; for example, k may be 300.
The matrix V^T has k rows and m columns, and each column represents the vector of one entity. In this example the candidate data set is this k-row, m-column matrix, which is referred to below as the matrix V and may also be called the "candidate matrix".
An example of this matrix V is as follows:
[Figures BDA0001864367250000141 and BDA0001864367250000151: example of the matrix V; the image content is not reproduced in this text.]
each column in the matrix V is used to represent a k-dimensional vector of a component, where each value V [ i ] [ j ] represents the projection value of entity j in the ith dimension.
In this example, the target matrix B and the matrix V are shown for convenience of description, and are not meant to be restrictive in the present application.
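As an illustrative note on dimensions, the decomposition can be written with the sizes made explicit. This is a sketch that assumes the decomposition is truncated to the k largest singular values, which is why the relation is written as an approximation; the size subscripts are annotations added here and do not appear in the source.

$$
B_{n \times m} \approx U_{n \times k}\, \Sigma_{k \times k}\, V^{T}_{k \times m}
$$

Under this reading, row i of U is the k-dimensional vector of candidate text i, and column j of the k-row, m-column factor is the k-dimensional vector of entity j.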
In a second possible implementation:
the candidate data set may be obtained by a trained Word2vec model, where the candidate data set includes vectors of multiple entities, the Word2vec model is obtained by training according to an entity corpus set, and the entity corpus set may be obtained by the method described in step 101 in embodiment 1, or the entity corpus set may also be obtained by performing entity extraction on each text in a text set by using an entity extraction model, and each Word in the entity corpus set is numbered from 1 to W in sequence, where W is an integer greater than 1. Inputting the entity corpus set into a Word2vec model, setting the maximum distance between the current Word and the predicted Word in one sentence to be l, for example, the l may be 5, 10, etc., and in this example, the l may be illustrated by taking 5 as an example. Please refer to fig. 6 for understanding, fig. 6 is a schematic diagram of the Word2vec model training process.
The Word2vec model includes an input layer, an intermediate layer, and an output layer.
The input layer has d nodes, corresponding to d entities.
And in the middle layer, 300 nodes are provided, and each input layer node has edges which are all connected with the 300 nodes.
And the output layer has d nodes in total and corresponds to d entities.
Each entity t in the entity corpus set is traversed and its sequence number i is obtained; input layer node [i] is set to 1, and the other input layer nodes are set to 0.
The other words within distance 5 of t are obtained, together with their numbers a1, a2, a3, a4 and a5; the output layer is set to 1 at positions a1, a2, a3, a4 and a5, and to 0 at the remaining positions.
A gradient descent algorithm is called to calculate the weight of each edge.
After model training is completed, the list of weights on the 300 edges from any input layer node i to the middle layer nodes is the vector representing the ith entity. The vectors of all the entities constitute the candidate data set.
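The construction of the input and output layer values for each training position can be sketched as follows. The sketch covers only the encoding step (a one-hot input and a multi-hot output over the context window), not the gradient descent training. It assumes 0-based entity numbers and counts context words on both sides of the current word; the function name is hypothetical.

```python
def training_pairs(sequence, d, l=5):
    """Build (input, output) vectors for each position in an entity sequence.

    sequence: entity numbers in 0..d-1, in order of appearance.
    d: number of distinct entities (size of input and output layers).
    l: maximum distance between the current word and a predicted word.
    """
    pairs = []
    for pos, i in enumerate(sequence):
        x = [0] * d
        x[i] = 1  # one-hot input for the current entity
        y = [0] * d
        for ctx in range(max(0, pos - l), min(len(sequence), pos + l + 1)):
            if ctx != pos:
                y[sequence[ctx]] = 1  # mark every entity inside the window
        pairs.append((x, y))
    return pairs

pairs = training_pairs([0, 1, 2, 3], d=4, l=1)
print(pairs[1])
```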
The candidate data set in this example comprises a vector of a plurality of entities. And inputting the entity extracted from each candidate text into the Word2vec model, outputting the vector of each entity through the Word2vec model, and forming the candidate data set by the vectors of all the obtained entities.
Step 302, extracting a target entity set in the target text, wherein the entity set represented by the plurality of arrays of the candidate data set comprises the target entity set.
In the embodiment of the present application, the candidate data set obtained in the first implementation manner may be taken as an example for explanation. Referring to the example of the matrix V, each column represents an array, and each array includes a plurality of elements, each element representing the projection value of an entity on one dimension.
A target entity set in a target text is extracted through the entity extraction model described in the above embodiment 1, where the target entity set includes all target entities in the target text. For example, the target text includes 2 target entities, which are entity 1 (e.g., a seat plate) and entity j (e.g., an LED lamp), respectively. The set of entities represented by the plurality of arrays of the candidate data set contains the target entity set; e.g., the set of entities represented by the vectors in matrix V (seat plate, …, LED lamp, …, connector) contains entity 1 and entity j in the target text. It should be noted that, in this example, the entities and quantities included in the target text and the entities and quantities included in the candidate data set are all examples for convenience of description, and do not limit the present application.
Step 303, determining an included angle value between each target entity in the target entity set and each entity vector in each candidate text according to the candidate data set, so as to obtain entity similarity.
And calculating the included angle value of the vector of each target entity and each entity in each candidate text according to the entity vectors in the candidate data set. For example, the target entities in the target text are: entity 1 and entity j. The entities in a candidate text c are: and the entity 2 and the entity x respectively calculate the similarity between the entity 1 and the entity 2, the similarity between the entity 1 and the entity x, the similarity between the entity j and the entity 2 and the similarity between the entity j and the entity x aiming at the candidate text c.
The description will be given by taking the calculation of the similarity between entity 1 and entity 2 as an example:
in a first possible implementation:
the entity similarity (Rela) is the cosine value of the angle between two entity vectors.
For example, Rela(entity 1, entity 2) = the cosine of the angle between the entity 1 vector (V1) and the entity 2 vector (V2).
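The cosine of the angle between two entity vectors can be computed as in the following sketch. Returning 0.0 for a zero-length vector is a convention chosen here to avoid division by zero; the source does not address that case. The function name is hypothetical.

```python
import math

def rela_cosine(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm1 * norm2)

print(rela_cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
print(rela_cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```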
In a second possible implementation: determining a target distance between an end point of the vector of each target entity and an end point of the vector of each entity in each candidate text;
and determining the entity similarity according to the candidate data set by determining the cosine value (represented by 'Distance 1') of the included angle between the semantic vector of each target entity in the target entity set and the semantic vector of each entity in each candidate text and the target Distance (represented by 'Distance 2').
Distance1 is the cosine of the angle between V1 and V2.
[Figure BDA0001864367250000161: formula image; its content is not reproduced in this text.]
The similarity Rela(entity 1, entity 2) between entity 1 and entity 2 = Distance1 × Weight1 + Distance2 × Weight2.
Wherein Weight1 represents the Weight of Distance1, and Weight2 represents the Weight of Distance 2. Weight1 and Weight2 can have default values of 0.5, or can be specified by the user according to the actual usage scenario, for example, Weight1 is 0.6 and Weight2 is 0.4.
In this example, the similarity between any two entities is obtained according to the cosine value of the included angle between the two vectors and the target distance of the end point of the two vectors, the included angle between the two vectors is considered, the end point positions of the two vectors are considered, and the user can determine the weight of the cosine value of the included angle and the target distance according to the actual application scene, so that the accuracy of calculating the similarity between the entities is improved.
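The weighted combination can be illustrated with a short worked sketch. Because the formula for Distance2 appears only as an image in the source, the sketch takes both Distance1 and Distance2 as already computed inputs; the function name and the sample values are hypothetical.

```python
def rela_weighted(distance1, distance2, weight1=0.5, weight2=0.5):
    """Rela = Distance1 * Weight1 + Distance2 * Weight2."""
    return distance1 * weight1 + distance2 * weight2

# Default weights of 0.5 each, then the user-specified 0.6 / 0.4 split.
default_score = rela_weighted(0.8, 0.6)
custom_score = rela_weighted(0.8, 0.6, weight1=0.6, weight2=0.4)
print(default_score, custom_score)
```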
Step 304, determining the target similarity of the target text and each candidate text according to the entity similarity.
In a first implementation manner, for each candidate text, accumulating the entity similarity of each target entity in the target text to obtain a first accumulated similarity;
and determining the target similarity of the target text and each candidate text according to the first accumulated similarity.
For example, in the above example, entity 1 and entity j, where the entities in one candidate text c are: and an entity 2 and an entity x, respectively calculating the similarity (marked as ' Re 1 ') between the entity 1 and the entity 2, the similarity (marked as ' Re 2 ') between the entity 1 and the entity x, the similarity (marked as ' Re 3 ') between the entity j and the entity 2, and the similarity (marked as ' Re 4 ') between the entity j and the entity x for the candidate text c, and then accumulating the calculated similarities (Re 1 ', ' Re 2 ', ' Re 3 ' and ' Re 4 ') with each entity to obtain a first accumulated similarity, wherein in the calculation process, the scores of similarity degrees smaller than 50% (not contained) can be all 0. In one implementation, the first accumulated similarity may be used as the similarity between the target text and the candidate text.
Optionally, for each candidate text, the similarity Sim1 between the entities in the target text and the candidate text is calculated:
Sim1 = first accumulated similarity / (total number of entities in the target text ∪ total number of entities in the candidate text). Sim1 may be used as the target similarity between the target text and the candidate text.
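The Sim1 calculation can be sketched in Python as follows. This is only an illustrative sketch: the function names, the toy similarity table, and the reading of the denominator as the size of the union of the two entity sets are assumptions, not part of the described method. The worked numerical examples elsewhere in this description (totals of 12 and 14 giving a union of 14) would also be consistent with using the larger of the two totals.

```python
from itertools import product


def sim1(target_entities, candidate_entities, entity_similarity, threshold=0.5):
    """First accumulated similarity divided by the size of the entity union."""
    accumulated = 0.0
    for target, candidate in product(target_entities, candidate_entities):
        score = entity_similarity(target, candidate)
        # Scores strictly below the threshold are counted as 0.
        if score >= threshold:
            accumulated += score
    # Assumption: the denominator is the number of distinct entities in both texts.
    union_size = len(set(target_entities) | set(candidate_entities))
    return accumulated / union_size if union_size else 0.0


# Hypothetical scores standing in for the vector-based entity similarity.
TOY_SCORES = {
    ("brake lamp", "brake lamp"): 1.0,
    ("brake lamp", "base"): 0.2,
    ("mounting seat plate", "brake lamp"): 0.1,
    ("mounting seat plate", "base"): 0.8,
}


def toy_similarity(a, b):
    return TOY_SCORES.get((a, b), 0.0)


result = sim1(
    ["brake lamp", "mounting seat plate"],
    ["brake lamp", "base"],
    toy_similarity,
)
```

In this toy run only the two scores at or above 0.5 contribute to the accumulated similarity, and the union contains three distinct entities.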
In this embodiment, the electronic device obtains a target text and a candidate data set. The candidate data set includes a plurality of arrays, each of which represents the semantic vector of an entity contained in a candidate text. The device then extracts a target entity set from the target text, where the entity set represented by the arrays of the candidate data set includes the target entity set. According to the candidate data set, the device determines the cosine of the angle between the semantic vector of each target entity in the target entity set and the semantic vector of each candidate entity in each candidate text to obtain the entity similarity, and determines the target similarity between the target text and each candidate text according to the entity similarity. Because the similarity between the target text and a candidate text is determined from the similarity of each entity in the two texts, the result reflects how similar the content of the target text is to the content of the candidate text.
On the basis of the above example, before step 304, the method further comprises the steps of:
extracting the relation between target entities in the target text;
acquiring a candidate relation set in each candidate text; determining the relationship similarity between each relationship in the target relationship set and each candidate relationship in the candidate relationship set according to the entity similarity;
in step 304, the target similarity between the target text and each candidate text is determined according to the entity similarity and the relationship similarity.
The relationships in the embodiments of the present application include binary relationships, or binary through X-ary relationships, where X is an integer greater than or equal to 3. A binary relationship includes two entities and the relationship between them. An X-ary relationship includes X entities and at least (X-1) binary relationships; each of these binary relationships contains an associated entity, and the binary relationships are connected to one another through the associated entities.
For example, when X equals 3, then the relationship includes a binary relationship and a ternary relationship; when X is equal to 4, the relationship includes a binary relationship, a ternary relationship, and a quaternary relationship.
The following illustrates binary and ternary relationships:
Binary relationship: two entities and the relationship between them, i.e. entity 1 + entity 2 + the relationship between entity 1 and entity 2. For example: the brake lamp (entity 1) comprises (relationship) a base (entity 2).
Ternary relationship: two binary relationships, for example binary relationship 1 and binary relationship 2, that share one entity; the shared entity is the associated entity and connects the two binary relationships. An example of a ternary relationship is (brake lamp-mounting seat plate, mounting seat plate-housing frame), in which the mounting seat plate is the associated entity.
The following further describes the method for determining the similarity of the binary relation and the similarity of the ternary relation:
optionally, the binary relationship between every two target entities in the target text is extracted, so as to obtain a target binary relationship set of the target text. For example, the set of target binary relationships is: (brake lamp-mounting base plate, mounting base plate-housing frame, housing frame-partition plate, housing frame-mounting cavity, brake lamp-grating plate, brake lamp-LED lamp).
A candidate binary relationship set is acquired from each candidate text. For example, the candidate binary relationship set is: (stop lamp-base, base-housing frame, housing frame-dustproof coating film, stop lamp-grating plate, stop lamp-LED lamp, LED lamp-lamp housing).
The binary relationship similarity between each binary relationship in the target binary relationship set and each candidate binary relationship in the candidate binary relationship set is determined according to the entity similarity. The binary relationship similarity is the sum of three terms: the similarity between the first target entity of the target binary relationship and the first candidate entity of the candidate binary relationship, the similarity between the second target entity and the second candidate entity, and the similarity between the relationship in the target binary relationship and the relationship in the candidate binary relationship. The formula is: Rela2(target binary relationship, candidate binary relationship) = Rela1(target entity 1, candidate entity 1) + Rela1(target entity 2, candidate entity 2) + R(target relationship, candidate relationship), where R(relationship 1, relationship 2) = 1 if relationship 1 equals relationship 2, and R(relationship 1, relationship 2) = 0 otherwise. For example, if the target binary relationship is brake light-mounting seat plate and the candidate binary relationship is brake light-base, then Rela2(brake light-mounting seat plate, brake light-base) = Rela1(brake light, brake light) + Rela1(mounting seat plate, base) + R(connection, connection).
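The Rela2 formula can be illustrated with a short Python sketch. The triple layout (entity 1, relationship, entity 2), the function names, and the numeric Rela1 scores are assumptions made for illustration only.

```python
def relation_match(relation_1, relation_2):
    """R(relationship 1, relationship 2): 1 if the relationships are equal, else 0."""
    return 1.0 if relation_1 == relation_2 else 0.0


def rela2(target, candidate, rela1):
    """Binary relationship similarity of two (entity 1, relationship, entity 2) triples."""
    t_first, t_relation, t_second = target
    c_first, c_relation, c_second = candidate
    return (
        rela1(t_first, c_first)
        + rela1(t_second, c_second)
        + relation_match(t_relation, c_relation)
    )


# Hypothetical entity similarities (Rela1) for the example in the text.
TOY_RELA1 = {
    ("brake light", "brake light"): 1.0,
    ("mounting seat plate", "base"): 0.8,
}


def toy_rela1(a, b):
    return TOY_RELA1.get((a, b), 0.0)


score = rela2(
    ("brake light", "connection", "mounting seat plate"),
    ("brake light", "connection", "base"),
    toy_rela1,
)
```

With these assumed scores the three terms are 1.0, 0.8, and 1, so the maximum possible Rela2 value is 3.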
Further, the binary relationship similarities of the binary relationships in the target text are accumulated to obtain a second accumulated similarity. That is, for each binary relationship in the target text, the candidate text is traversed and the similarity Rela2 with each candidate binary relationship is calculated; scores whose entity similarity is below 50% (exclusive) are recorded as 0, and all the similarities are added together.
Further, for each candidate text, the similarity Sim2 between the binary relationships in the target text and the candidate text is calculated. Specifically, the union of the total number of binary relationships in the target text and the total number of binary relationships in the candidate text is calculated; for example, if the total number of binary relationships in the target text is 12 and the total number in the candidate text is 14, the union is 14. Sim2 is the ratio of the second accumulated similarity to the union, as follows:
Sim2 = second accumulated similarity / (total number of binary relationships in the target text ∪ total number of binary relationships in the candidate text).
Further, on the basis of the above embodiment, the method may further include the following steps:
and determining a target ternary relationship set according to the target binary relationship set, wherein the target ternary relationship set comprises a plurality of ternary relationships, the ternary relationship comprises two binary relationships, and the two binary relationships have the same entity. For example, the set of target ternary relationships is: (brake lamp-mounting base plate, mounting base plate-housing frame), (mounting base plate-housing frame, housing frame-partition), (mounting base plate-housing frame, housing frame-mounting cavity).
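One way to derive the ternary relationships from the binary relationship set is sketched below in Python. It assumes, based only on the example above, that two binary relationships are chained when the second entity of the first relationship is the first entity of the second; the function name and tuple layout are illustrative.

```python
def ternary_relations(binary_relations):
    """Pair binary relationships (a, b) and (b, c) that share the entity b."""
    ternary = []
    for first in binary_relations:
        for second in binary_relations:
            # Chain only when the tail of the first is the head of the second.
            if first != second and first[1] == second[0]:
                ternary.append((first, second))
    return ternary


target_binary = [
    ("brake lamp", "mounting base plate"),
    ("mounting base plate", "housing frame"),
    ("housing frame", "partition plate"),
    ("housing frame", "mounting cavity"),
    ("brake lamp", "grating plate"),
    ("brake lamp", "LED lamp"),
]

target_ternary = ternary_relations(target_binary)
```

Applied to the example target binary relationship set, this chaining rule yields the three ternary relationships listed in the text.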
A candidate ternary relationship set is acquired from each candidate text. For example, the candidate ternary relationship set is: (brake lamp-base, base-housing frame), (base-housing frame, housing frame-dustproof coating).
The ternary relationship similarity between each ternary relationship in the target ternary relationship set and each candidate ternary relationship in the candidate ternary relationship set is determined according to the binary relationship similarity. The ternary relationship similarity is the sum of the binary relationship similarity between the first target binary relationship of the target ternary relationship and the first candidate binary relationship of the candidate ternary relationship, and the binary relationship similarity between the second target binary relationship and the second candidate binary relationship, which may be expressed as follows:
Rela3(target ternary relationship, candidate ternary relationship) = Rela2(first target binary relationship, first candidate binary relationship) + Rela2(second target binary relationship, second candidate binary relationship).
For example, the target ternary relationship is: (brake lamp-mounting seat plate, mounting seat plate-housing frame);
the candidate ternary relationship is: (brake lamp-base, base-housing frame);
Rela3[(brake lamp-mounting seat plate, mounting seat plate-housing frame), (brake lamp-base, base-housing frame)]
= Rela2(brake lamp-mounting seat plate, brake lamp-base) + Rela2(mounting seat plate-housing frame, base-housing frame)
= Rela1(brake lamp, brake lamp) + Rela1(mounting seat plate, base) + R(connection, connection) + Rela1(mounting seat plate, base) + Rela1(housing frame, housing frame) + R(connection, connection).
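The expansion of Rela3 into two Rela2 terms can be checked with a small Python sketch. The data layout, the function names, and the numeric entity similarities are hypothetical and chosen only to mirror the example.

```python
def rela2(target, candidate, rela1):
    """Binary relationship similarity of two (entity 1, relationship, entity 2) triples."""
    (t1, t_rel, t2), (c1, c_rel, c2) = target, candidate
    return rela1(t1, c1) + rela1(t2, c2) + (1.0 if t_rel == c_rel else 0.0)


def rela3(target, candidate, rela1):
    """Ternary relationship similarity: sum of the two position-wise Rela2 values."""
    return rela2(target[0], candidate[0], rela1) + rela2(target[1], candidate[1], rela1)


# Hypothetical entity similarities (Rela1).
TOY_RELA1 = {
    ("brake lamp", "brake lamp"): 1.0,
    ("mounting seat plate", "base"): 0.8,
    ("housing frame", "housing frame"): 1.0,
}


def toy_rela1(a, b):
    return TOY_RELA1.get((a, b), 0.0)


target_ternary = (
    ("brake lamp", "connection", "mounting seat plate"),
    ("mounting seat plate", "connection", "housing frame"),
)
candidate_ternary = (
    ("brake lamp", "connection", "base"),
    ("base", "connection", "housing frame"),
)

score = rela3(target_ternary, candidate_ternary, toy_rela1)
```

Note that the associated entity pair (mounting seat plate, base) contributes twice, once in each Rela2 term, exactly as in the six-term expansion.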
The ternary relationship similarities of the ternary relationships in the target text are accumulated to obtain a third accumulated similarity. That is, for each ternary relationship in the target text, the candidate text is traversed and the similarity Rela3 with each candidate ternary relationship is calculated; scores whose entity similarity is below 50% (exclusive) are recorded as 0, and all the similarities are added together.
The similarity Sim3 between the ternary relationships in the target text and the candidate text is then calculated. Specifically, the union of the total number of ternary relationships in the target text and the total number of ternary relationships in the candidate text is calculated; for example, if the total number of ternary relationships in the target text is 10 and the total number in the candidate text is 8, the union is 10. Sim3 is the ratio of the third accumulated similarity to the union, as follows:
Sim3 = third accumulated similarity / (total number of ternary relationships in the target text ∪ total number of ternary relationships in the candidate text).
Further, the target entity comprises a specific entity, and the method further comprises:
determining the entity similarity of the specific entity. The specific entity may be an entity specified by the user, and the number of specific entities is not limited. For example, the specific entity may be "brake lamp", or the specific entities may be "brake lamp" and "mounting seat plate"; a specific entity may be an entity that is important in the actual technical solution. In this example, "brake lamp" is taken as the specific entity. For example, if the candidate entities in the candidate text are "brake lamp", "base", and "lamp housing", then for that candidate text the entity similarities of the specific entity are: the similarity between the brake lamp and the brake lamp (denoted R11), the similarity between the brake lamp and the base (denoted R12), and the similarity between the brake lamp and the lamp housing (denoted R13).
For each candidate text, the entity similarities of the specific entity are accumulated to obtain a fourth accumulated similarity; in this example the fourth accumulated similarity is R11 + R12 + R13.
In step 304, the similarity SIM between the target text and the candidate text is calculated according to the first accumulated similarity, the second accumulated similarity, the third accumulated similarity, the fourth accumulated similarity and their corresponding weights.
Equation 1: SIM = Sim1 × weight1 + Sim2 × weight2 + Sim3 × weight3 + Sim4 × weight4, where weight1 is the weight of the entity similarity, weight2 is the weight of the binary relationship similarity, weight3 is the weight of the ternary relationship similarity, and weight4 is the weight of the specific entity similarity.
Weight1, weight2, weight3, and weight4 may be set according to the context of a particular application, for example, if the user considers the similarity of a specific entity and the similarity of a binary relationship to be more important, weight2 and weight4 may be set to higher values, for example, weight4 is 0.4, weight2 is 0.3, weight1 is 0.2, and weight3 is 0.1. In general, weight1, weight2, weight3, and weight4 may be set to 0.25.
As can be seen from equation 1, in a first possible implementation manner, the similarity between the target text and the candidate text may be determined according to the first accumulated similarity and the second accumulated similarity, that is, the condition that weight3 is 0 and weight4 is 0.
In a second possible implementation manner, the similarity between the target text and the candidate text may be determined according to the first accumulated similarity and the third accumulated similarity, that is, the condition that weight2 is 0 and weight4 is 0.
In a third possible implementation manner, the similarity between the target text and the candidate text may be determined according to the first accumulated similarity and the fourth accumulated similarity, that is, the condition that weight2 is 0 and weight3 is 0.
In a fourth possible implementation manner, the similarity between the target text and the candidate text may be determined according to the first accumulated similarity, the second accumulated similarity, and the third accumulated similarity, that is, the condition that weight4 is 0.
In a fifth possible implementation manner, the similarity between the target text and the candidate text may be determined according to the first accumulated similarity, the second accumulated similarity, and the fourth accumulated similarity, that is, the condition that weight3 is 0.
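Equation 1 and the implementation manners above, which differ only in which weights are set to 0, can be sketched in Python as follows. The function name and the Sim1 to Sim4 values are hypothetical; only the weight values are taken from the text.

```python
def combined_similarity(sims, weights=(0.25, 0.25, 0.25, 0.25)):
    """Equation 1: weighted sum of Sim1, Sim2, Sim3, and Sim4."""
    if len(sims) != len(weights):
        raise ValueError("sims and weights must have the same length")
    return sum(sim * weight for sim, weight in zip(sims, weights))


# Hypothetical component similarities (Sim1, Sim2, Sim3, Sim4).
sims = (0.6, 0.5, 0.4, 0.9)

# Weights from the example: weight1 = 0.2, weight2 = 0.3, weight3 = 0.1, weight4 = 0.4.
sim_custom = combined_similarity(sims, weights=(0.2, 0.3, 0.1, 0.4))

# First implementation manner: weight3 = 0 and weight4 = 0
# (the 0.5 / 0.5 split between the remaining weights is an assumption).
sim_entities_and_binary = combined_similarity(sims, weights=(0.5, 0.5, 0.0, 0.0))

# Default: all four weights set to 0.25.
sim_default = combined_similarity(sims)
```

Setting a weight to 0 simply removes the corresponding term, so a single function covers all of the listed implementation manners.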
Further, in this embodiment of the application, the candidate texts in the candidate text set may be sorted by their similarity SIM to the target text, in ascending or descending order of similarity, and a preset number of candidate texts may be displayed in that order; for example, 3 candidate texts may be displayed.
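A minimal Python sketch of this ranking step is given below. The candidate identifiers, the SIM values, and the function name are hypothetical.

```python
def top_candidates(sim_by_candidate, count=3, descending=True):
    """Sort candidate texts by SIM and keep a preset number of them."""
    ranked = sorted(
        sim_by_candidate.items(),
        key=lambda item: item[1],
        reverse=descending,
    )
    return ranked[:count]


# Hypothetical SIM values for four candidate texts.
scores = {"c1": 0.42, "c2": 0.87, "c3": 0.15, "c4": 0.63}

most_similar = top_candidates(scores)
least_similar = top_candidates(scores, count=1, descending=False)
```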
In this embodiment, the similarity between the target text and a candidate text is determined from the similarity between each target entity in the target text and the candidate entities in the candidate text, together with the relationships between the target entities in the target text and the relationships between the candidate entities in the candidate text. Both entity similarity and relationship similarity are considered because entities and the relationships between them reflect what the text actually expresses. Further, the relationships may range from binary relationships to N-ary relationships; for example, they may include binary relationships and ternary relationships, where a binary relationship includes two entities and the relationship between them, and a ternary relationship includes two binary relationships connected by an associated entity. In the embodiment of the application, a ternary relationship involves three entities and the relationships among them, so the similarity between the target and candidate binary relationships and the similarity between the target and candidate ternary relationships can reflect what the text actually expresses. Furthermore, the similarity of a specific target entity can be determined, so that the similarity between the target text and the candidate text can be determined according to the user's specific application scenario and better meets the user's actual needs.
Optionally, the novelty of the target text relative to each candidate text is determined according to the target similarity, where the novelty is inversely related to the target similarity: the higher the similarity between the target text and a candidate text, the lower the novelty of the target text relative to that candidate text. For example, if the target similarity is 70%, the novelty may be 1 - 70% = 30%, or the novelty may be 1 - k × 70%, where k is a correction coefficient. This embodiment does not limit the specific method for determining the novelty, provided that the novelty is inversely related to the target similarity.
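The two example conversions from similarity to novelty can be written as one small Python function. The function name and the value k = 0.9 are assumptions for illustration; the text does not fix a value for the correction coefficient.

```python
def novelty_from_similarity(similarity, k=1.0):
    """Novelty = 1 - k * similarity; k = 1 gives the plain 1 - similarity form."""
    return 1.0 - k * similarity


plain = novelty_from_similarity(0.70)             # 1 - 70%
corrected = novelty_from_similarity(0.70, k=0.9)  # 1 - k * 70% with an assumed k
```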
Optionally, on the basis of the foregoing embodiment, the target text in this embodiment may be a target structure, and the candidate text may be a candidate structure, that is, the target text is converted into the target structure through the entity extraction model and the relationship extraction model by the method described in embodiment 1, and the candidate text is converted into the candidate structure through the entity extraction model and the relationship extraction model.
Specifically, the target text is a structured text, and in step 201, the step of obtaining the target text may further include the following steps: acquiring a target text;
inputting the target text into an entity extraction model, and identifying an entity in the target text through the entity extraction model;
inputting the target text of the identified entities into a relation extraction model, and extracting the relation between the entities through the relation extraction model;
and according to the entity and the relation between the entities, performing structured representation on the target text to generate a structured text.
Optionally, in the step 202, the extracting the target entity set in the target text may specifically include the following steps:
and taking the target text as the input of an entity extraction model, and extracting a target entity set in the target text through the entity extraction model, wherein the entity extraction model is obtained by training the first corpus set, and the first corpus set is obtained by performing entity corpus labeling on each text in the first text set.
Optionally, the step of extracting the binary relationship between each two entities in the target text may further specifically include the following steps:
inputting the target text in which the target entity set has been identified into a relationship extraction model, and extracting the relationships between the target entities through the relationship extraction model; the relationship extraction model is obtained by training on the second corpus set, and the second corpus set is obtained by performing relationship corpus labeling and entity labeling on each text of the second text set.
Example 3
Referring to fig. 7, an embodiment of the present application further provides a method for determining text novelty, where the method is applied to an electronic device, where the electronic device may be a server or a terminal, and in this embodiment, the electronic device may be described by taking a terminal as an example, and the method specifically includes the following steps:
step 401, determining a target text.
For example, the target text may be a patent, a paper, and in this embodiment, the target text is described by taking a patent as an example.
Step 402, extracting a plurality of target entities in the target text to obtain a target entity set.
In this example, a plurality of target entities in the target text are extracted through the entity extraction model in embodiment 1, specifically, the target text is input into the entity extraction model, and the plurality of target entities in the target text are identified through the entity extraction model, and the plurality of target entities form the target entity set.
Step 403, acquiring a candidate entity set of each candidate text in the candidate text set.
The candidate text set can be a patent set, the candidate text set comprises a plurality of candidate texts (such as patents), the server acquires the candidate text set from a patent database, and extracts candidate entities of each candidate text in the candidate text set in an off-line manner in advance to obtain a candidate entity set. Or, the server may also extract the candidate entity of each candidate text in the candidate text set on line to obtain the candidate entity set, and specifically, may extract the candidate entity of each candidate text in the candidate text set by using the entity extraction model described in embodiment 1 to obtain the candidate entity set.
Step 404, determining a first entity intersection of the target entity set and the candidate entity set, where the first entity intersection is a matching entity in the target entity set and the candidate entity set.
For example, the target entity set is (brake light, base, lamp housing), and the candidate entity set is (brake light, mounting base plate, housing frame). The first entity intersection is (brake light).
Step 405, determining the novelty of the target text and the candidate text according to the difference parameter of the first entity intersection and the target entity set.
A first entity novelty is determined according to the difference parameter between the first entity intersection and the target entity set, namely:
first entity novelty = [target entity set - intersection(target entity set, candidate entity set)] / target entity set = 1 - first entity intersection / target entity set.
In this embodiment, the difference parameter between the first entity intersection and the target entity set is a ratio of the first entity intersection to the target entity set, or the difference parameter between the first entity intersection and the target entity set may also be a ratio of the first entity intersection to the target entity set multiplied by a coefficient, and the difference parameter has other variations, which are not described herein.
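Using the example sets from step 404, the first entity novelty can be sketched in Python as follows. The sketch assumes that entities match by exact string equality and that the sets in the formula are measured by their number of elements; both are assumptions, as is the function name.

```python
def set_novelty(target_set, candidate_set):
    """1 - |target ∩ candidate| / |target|, using exact matching of set elements."""
    target = set(target_set)
    if not target:
        raise ValueError("target set must not be empty")
    matched = target & set(candidate_set)
    return 1.0 - len(matched) / len(target)


target_entities = {"brake light", "base", "lamp housing"}
candidate_entities = {"brake light", "mounting base plate", "housing frame"}

first_entity_novelty = set_novelty(target_entities, candidate_entities)
```

Only "brake light" appears in both sets, so one of the three target entities is matched.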
In this embodiment, a target text whose novelty is to be determined is first determined, where the target text may be a patent. A plurality of target entities in the target text are then extracted to obtain a target entity set, and a candidate entity set is acquired for each candidate text in the candidate text set. Each candidate text is traversed to determine a first entity intersection of the target entity set and the candidate entity set of that candidate text, where the first entity intersection consists of the entities that match between the two sets. Finally, the novelty of the target text relative to the candidate text is determined according to the difference parameter between the first entity intersection and the target entity set. This embodiment takes into account all target entities in the target text and all candidate entities in each candidate text. By contrast, the prior art determines novelty only by matching keywords that the user chooses subjectively, so the result is influenced by the user's subjective understanding.
Optionally, on the basis of the above embodiment, before step 405, the embodiment of the present application may further include the following steps:
extracting a plurality of binary relations in the target text to obtain a target binary relation set, wherein the binary relations comprise two entities and relations between the two entities;
acquiring a candidate binary relation set comprising a plurality of binary relations in the candidate text;
determining a first binary relation intersection of the target binary relation set and the candidate binary relation set, wherein the first binary relation intersection comprises matched binary relations in the target binary relation set and the candidate binary relation set;
then, in step 405, determining the novelty of the target text and the candidate text according to the difference parameter between the first entity intersection and the target entity set may specifically include:
determining a first entity novelty according to the difference parameter between the first entity intersection and the target entity set, namely: first entity novelty R1_1 = [target entity set - intersection(target entity set, candidate entity set)] / target entity set = 1 - first entity intersection / target entity set; and determining a first binary relationship novelty according to the difference parameter between the first binary relationship intersection and the target binary relationship set, namely:
R2_1 = [target binary relationship set - intersection(target binary relationship set, candidate binary relationship set)] / target binary relationship set = 1 - first binary relationship intersection / target binary relationship set.
The difference parameter between the first binary relation intersection and the target binary relation set may be a ratio of the first binary relation intersection to the target binary relation set, or may be other deformations such as multiplying the ratio by a coefficient, and the details are not limited.
Optionally, in another implementation manner, the novelty degrees of the target text and the candidate text may be determined according to the first entity novelty degree and the first binary relation novelty degree and their respective weights. In the implementation mode, the novelty of the target binary relation in the target text and the candidate binary relation in the candidate text is further calculated, and when the novelty of the target text and the candidate text is determined, the novelty between entities is considered, the novelty between the binary relations is further combined, and the accuracy of the novelty is improved.
On the basis of the above embodiment, the method may further include the following steps:
extracting a target ternary relationship set in the target text, wherein the target ternary relationship set comprises a plurality of ternary relationships, the ternary relationship comprises two binary relationships, and the two binary relationships have the same entity;
acquiring a candidate ternary relationship set comprising a plurality of ternary relationships in the candidate text;
determining a first ternary relationship intersection of the target ternary relationship set and the candidate ternary relationship set, wherein the first ternary relationship intersection comprises matched ternary relationships in the target ternary relationship set and the candidate ternary relationship set;
wherein, the determining the novelty of the target text and the candidate text according to the first entity novelty and the first binary relation novelty may further include:
and determining a first ternary relationship novelty according to the difference parameter between the first ternary relationship intersection and the target ternary relationship set. That is, R3_1 = [target ternary relationship set - intersection(target ternary relationship set, candidate ternary relationship set)] / target ternary relationship set = 1 - first ternary relationship intersection / target ternary relationship set. The difference parameter between the first ternary relationship intersection and the target ternary relationship set may be the ratio of the first ternary relationship intersection to the target ternary relationship set, or another variation such as that ratio multiplied by a coefficient, and is not specifically limited.
The novelty of the target text and the candidate text is then determined according to the first entity novelty, the first binary relationship novelty, the first ternary relationship novelty, and their respective weights:
novelty = R1_1 × weight1 + R2_1 × weight2 + R3_1 × weight3, where in this example weight1 is the weight of the first entity novelty, weight2 is the weight of the first binary relationship novelty, and weight3 is the weight of the first ternary relationship novelty. In this implementation, the novelty between the target ternary relationships in the target text and the candidate ternary relationships in the candidate text is also calculated, so that the novelty of the target text and the candidate text takes into account the novelty between entities, between binary relationships, and between ternary relationships, which improves the accuracy of the novelty.
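The weighted combination of R1_1, R2_1, and R3_1 can be sketched in Python as follows. The relationship sets, the weights (0.5, 0.3, 0.2), the exact-match comparison, and all function names are assumptions made for illustration.

```python
def set_novelty(target_items, candidate_items):
    """1 - |target ∩ candidate| / |target|, using exact matching of items."""
    target = set(target_items)
    if not target:
        raise ValueError("target set must not be empty")
    return 1.0 - len(target & set(candidate_items)) / len(target)


def text_novelty(levels, weights):
    """Weighted sum of per-level novelties (entities, binary, ternary)."""
    return sum(set_novelty(t, c) * w for (t, c), w in zip(levels, weights))


# Hypothetical target and candidate sets at each level.
entities = (
    {"brake light", "base", "lamp housing"},
    {"brake light", "mounting base plate", "housing frame"},
)
binary = (
    {("brake light", "base"), ("base", "lamp housing")},
    {("brake light", "base")},
)
ternary = (
    {(("brake light", "base"), ("base", "lamp housing"))},
    set(),
)

novelty = text_novelty([entities, binary, ternary], weights=(0.5, 0.3, 0.2))
```

Here R1_1 is 2/3, R2_1 is 0.5, and R3_1 is 1.0, because the candidate text shares one entity, one binary relationship, and no ternary relationship with the target text.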
It should be noted that, in the embodiment of the present application, the relationship may also include a 4-element relationship, a 5-element relationship, and the like, and in the embodiment, only the binary relationship and the ternary relationship are taken as examples for illustration, and are not used to limit the present application.
Optionally, in this embodiment, the target text is a structured text, that is, a target structure, and each candidate text in the candidate text set is a structured candidate structure. In this example, a candidate graph may be obtained from the candidate structures. The candidate graph includes at least one candidate structure; when it includes exactly one candidate structure, the candidate graph is identical to that candidate structure. When the candidate graph includes two or more candidate structures, refer to fig. 8, which is a schematic diagram of the structure of the candidate graph. The method for determining the candidate graph may further include the following steps:
determining an associated entity of the first candidate structure and the second candidate structure. For example, the first candidate structure includes the entities base, lamp housing, and lamp shade, and the relationships between these entities include: base-lamp shade, base-lamp housing. The second candidate structure includes the entities lamp housing, wick, and electric door, and the relationships between these entities include: lamp housing-wick, lamp housing-electric door. The associated entity of the first candidate structure and the second candidate structure is "lamp housing".
associating the first candidate structure with the second candidate structure through the associated entity to obtain the candidate graph. As can be seen from fig. 8, the first candidate structure and the second candidate structure are connected through the associated entity.
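The construction of the candidate graph from two candidate structures can be sketched in Python as follows. The adjacency-set representation, the function names, and the exact relationship pairs are an assumed reading of the lamp example, not a definitive implementation; fig. 8 is not reproduced here.

```python
from collections import defaultdict


def entities_of(relations):
    """All entities that appear in a list of (entity, entity) relationship pairs."""
    return {entity for pair in relations for entity in pair}


def associated_entities(first_structure, second_structure):
    """Entities shared by two candidate structures."""
    return entities_of(first_structure) & entities_of(second_structure)


def build_candidate_graph(structures):
    """Merge candidate structures into one undirected graph (adjacency sets)."""
    graph = defaultdict(set)
    for relations in structures:
        for a, b in relations:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Assumed relationship pairs for the two candidate structures.
first_structure = [("base", "lamp shade"), ("base", "lamp housing")]
second_structure = [("lamp housing", "wick"), ("lamp housing", "electric door")]

shared = associated_entities(first_structure, second_structure)
candidate_graph = build_candidate_graph([first_structure, second_structure])
```

Because both structures use the same node name for the associated entity, merging their edges connects them automatically through that node.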
Based on the foregoing embodiment, optionally, in this embodiment, when the target text and the candidate text are both structured texts, the novelty of the target structure and the candidate graph may be calculated, in this embodiment, the number of candidate structures included in the candidate graph is not limited, for example, the candidate graph may include 3 candidate structures, 4 candidate structures, or all candidate structures in a candidate entity set, each candidate structure has an associated entity, and each candidate structure may be connected by the associated entity. The method in this embodiment may further comprise the steps of:
extracting a candidate entity set of the candidate map; in the candidate graph, each node represents an entity and each edge represents a set of relationships. The relationship is also exemplified by a binary relationship and a ternary relationship, and the binary relationship set is a relationship set of all adjacent two nodes in the candidate map. The ternary relationship set is a relationship set of all three adjacent nodes in the candidate graph.
Determining a second entity intersection of the target entity set and the candidate entity set of the candidate atlas, which may be understood by combining with step 404 in this embodiment;
determining a second entity novelty according to the difference parameter of the second entity intersection and the target entity set, namely: second entity novelty R1_2 = [target entity set − intersection(target entity set, candidate entity set)] / target entity set = 1 − second entity intersection / target entity set. This step can be understood in conjunction with step 405 in this embodiment.
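The entity novelty formula can be illustrated with a short sketch. It assumes that each set name in the formula stands for the number of elements in that set; the function name `set_novelty` and the entity "sensor" are hypothetical and only serve the example.

```python
def set_novelty(target, candidate):
    """Return 1 - |target ∩ candidate| / |target| (set sizes are an assumption)."""
    target, candidate = set(target), set(candidate)
    if not target:
        raise ValueError("target set must not be empty")
    return 1 - len(target & candidate) / len(target)

# Two of the three target entities also occur in the candidate map.
target_entities = {"lamp housing", "lamp wick", "sensor"}
candidate_entities = {"base", "lamp housing", "lamp shade", "lamp wick", "electric gate"}
r1_2 = set_novelty(target_entities, candidate_entities)
```

The same function applies unchanged to sets of binary or ternary relations, provided each relation is represented by a hashable value such as a tuple.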
Optionally, the method may further include the steps of:
extracting a plurality of binary relations in the target structure to obtain a target binary relation set; for example, one target binary relation included in the target binary relation set is "lamp housing-lamp wick".
Positioning the two target entities contained in each target binary relation in the target binary relation set at the two corresponding entity positions in the candidate map; for example, the target binary relation "lamp housing-lamp wick" is positioned in the candidate map by finding the two nodes "lamp housing" and "lamp wick" in the candidate map.
Calculating the distance between the two entity positions corresponding to each target binary relation; for example, calculating the distance from the lamp housing to the lamp wick in the candidate map. It should be noted that the interval between any two adjacent nodes in the candidate map is equal (for example, denoted as a), and the distance between two nodes can be understood as the length of the path from a first entity position (for example, "lamp housing") to a second entity position (for example, "lamp wick"). Taking fig. 8 as an example, the distance from the lamp housing to the lamp wick is a. The path from the base to the lamp wick runs from the base to the lamp housing and then from the lamp housing to the lamp wick; each of these two segments has length a, so the distance L from the base to the lamp wick is 2a.
Determining a second binary relation novelty of each target binary relation relative to the candidate atlas according to the distance; the second binary relation novelty R2_2 is proportional to L: the shorter L is, the lower the novelty, and the longer L is, the higher the novelty.
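The distance calculation can be sketched with a breadth-first search over the candidate map. This is only an illustration under stated assumptions: the adjacency map below is one reading of fig. 8, the function name `hop_distance` is hypothetical, distances are counted in units of the interval a, and returning `None` for a missing entity or an unreachable pair is a choice made here, since the embodiment does not say how such cases are handled. The constant that maps L to R2_2 is likewise left open by the text and is not modeled.

```python
from collections import deque

def hop_distance(graph, start, goal):
    """Breadth-first search: number of edges on a shortest path, or None."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Assumed adjacency map for the lamp example.
candidate_map = {
    "base": {"lamp shade", "lamp housing"},
    "lamp shade": {"base"},
    "lamp housing": {"base", "lamp wick"},
    "lamp wick": {"lamp housing", "electric gate"},
    "electric gate": {"lamp wick"},
}

housing_to_wick = hop_distance(candidate_map, "lamp housing", "lamp wick")  # in units of a
base_to_wick = hop_distance(candidate_map, "base", "lamp wick")             # in units of a
```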
In a first possible implementation, the novelty of the target structure relative to the candidate atlas may be determined according to the second entity novelty R1_2, the second binary relation novelty R2_2, and their respective weights. In this implementation, the second entity novelty is determined and the second binary relation novelty of the target binary relations relative to the candidate map is additionally calculated. The novelty of the target structure relative to the candidate structures therefore reflects both the novelty between entities and the novelty between binary relations, which improves the accuracy of the novelty.
In a second implementation manner, a candidate binary relation set including a plurality of binary relations in the candidate map is first obtained; a second binary relation intersection of the target binary relation set and the candidate binary relation set is determined; and a first binary relation novelty is determined according to the difference parameter of the second binary relation intersection and the target binary relation set.
Then, the binary relation novelty is calculated according to the first binary relation novelty, the second binary relation novelty, and their respective weights, namely: binary relation novelty R2 = R2_1 × weight1 + R2_2 × weight2, where, in this example, weight1 is the weight of R2_1 and weight2 is the weight of R2_2; the weights can be set differently for different application scenarios.
Then, the novelty of the target structure relative to the candidate atlas is determined according to the second entity novelty R1_2, the binary relation novelty R2, and their respective weights. In this implementation, the binary relation novelty is determined by the first binary relation novelty, the second binary relation novelty, and their corresponding weights, which increases the scenarios to which determining the binary relation novelty is applicable.
On the basis of the above embodiment, optionally, the method may further include the following steps:
extracting a plurality of ternary relations in the target structure to obtain a target ternary relationship set; for example, the target ternary relationship set includes a target ternary relationship "lamp housing-lamp wick-electric gate".
Locating the three target entities contained in each target ternary relationship in the target ternary relationship set at the three corresponding entity positions in the candidate map; for example, the lamp housing, the lamp wick, and the electric gate are respectively located at their corresponding positions in the candidate atlas.
Calculating the shortest distance between each two adjacent positions among the three entity positions, namely the shortest distance L1 between the lamp housing and the lamp wick in the candidate map and the shortest distance L2 between the lamp wick and the electric gate in the candidate map, and then calculating the sum of the two shortest distances. The second ternary relationship novelty R3_2 is proportional to L1 + L2: the shorter L1 + L2 is, the lower the novelty, and the longer L1 + L2 is, the higher the novelty.
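The sum L1 + L2 for a ternary relationship can be sketched as below. The sketch assumes a distance function is already available; here a hypothetical lookup table `hops` stands in for shortest distances measured in units of the interval a, and the names `ternary_path_length` and `lookup` are illustrative only.

```python
def ternary_path_length(distance, triple):
    """Return L1 + L2 for a ternary relationship (first, middle, last)."""
    first, middle, last = triple
    l1 = distance(first, middle)  # shortest distance between the first two entities
    l2 = distance(middle, last)   # shortest distance between the last two entities
    return l1 + l2

# Hypothetical precomputed shortest distances in the candidate map.
hops = {("lamp housing", "lamp wick"): 1, ("lamp wick", "electric gate"): 1}

def lookup(a, b):
    """Symmetric lookup in the hypothetical distance table."""
    return hops.get((a, b), hops.get((b, a)))

total = ternary_path_length(lookup, ("lamp housing", "lamp wick", "electric gate"))
```

A triple whose entities are not present in the table would need separate handling, which the text does not specify.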
In a third possible implementation, the novelty of the target structure and the candidate graph may be determined according to the second entity novelty R1_2, the second binary relational novelty R2_2, and the second ternary relational novelty R3_2, and their respective corresponding weights.
For example, novelty = R1_2 × weight1 + R2_2 × weight2 + R3_2 × weight3, where, in this implementation, weight1 is the weight of the second entity novelty, weight2 is the weight of the second binary relation novelty, and weight3 is the weight of the second ternary relationship novelty.
Further, in this embodiment of the application, the novelty values of the target text relative to the candidate texts in the candidate text set may be sorted by magnitude, either from largest to smallest or from smallest to largest, and a preset number of candidate texts may be displayed in that order; for example, 3 candidate texts may be displayed.
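The sorting and display step can be sketched as follows, assuming the novelty of the target text relative to each candidate text has already been computed. The function name `top_candidates`, the candidate labels, and the scores are hypothetical.

```python
def top_candidates(novelty_by_candidate, count=3, descending=True):
    """Sort candidates by novelty and return the first `count` names."""
    ranked = sorted(novelty_by_candidate.items(),
                    key=lambda item: item[1],
                    reverse=descending)
    return [name for name, _ in ranked[:count]]

# Hypothetical novelty scores of the target text relative to four candidate texts.
scores = {"candidate A": 0.82, "candidate B": 0.35,
          "candidate C": 0.67, "candidate D": 0.10}

most_novel_first = top_candidates(scores, count=3, descending=True)
least_novel_first = top_candidates(scores, count=3, descending=False)
```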
In this implementation, the second ternary relationship novelty of the target ternary relationships in the target structure relative to the candidate map is additionally calculated. The novelty of the target structure relative to the candidate structures therefore reflects the novelty between entities together with the novelty between binary relationships and between ternary relationships, which improves the accuracy of the novelty.
Further, on the basis of the third implementation manner, a fourth possible implementation manner is also provided, and the method may further include the following steps:
determining a second ternary relationship intersection of a target ternary relationship set of the target structure and the candidate ternary relationship set of a candidate atlas;
determining a first ternary relationship novelty according to the difference parameter of the second ternary relationship intersection and the target ternary relationship set, namely: R3_1 = [target ternary relationship set − intersection(target ternary relationship set, candidate ternary relationship set)] / target ternary relationship set.
In the fourth possible implementation manner, a ternary relationship novelty is first determined according to the first ternary relationship novelty, the second ternary relationship novelty, and their respective weights, namely: ternary relationship novelty R3 = R3_1 × weight1 + R3_2 × weight2, where, in this implementation, weight1 is the weight of R3_1 and weight2 is the weight of R3_2.
Then, the novelty of the target structure relative to the candidate atlas is determined according to the second entity novelty R1_2, the binary relation novelty R2, the ternary relation novelty R3, and their respective weights. In this implementation, the ternary relationship novelty is determined by the first ternary relationship novelty, the second ternary relationship novelty, and their corresponding weights, which increases the scenarios to which determining the ternary relationship novelty is applicable.
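The weighted combination used in the fourth implementation can be sketched as a two-level weighted sum. All numeric values and weights below are hypothetical, and the sketch assumes that the distance-based scores R2_2 and R3_2 have already been scaled to a range comparable with the set-based scores, a step the text leaves open. The helper name `weighted_sum` is illustrative.

```python
def weighted_sum(scores, weights):
    """Return sum(score_i * weight_i); both sequences must have equal length."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(score * weight for score, weight in zip(scores, weights))

# Hypothetical component scores.
r1_2 = 0.4              # second entity novelty
r2_1, r2_2 = 0.5, 1.0   # first and second binary relation novelty
r3_1, r3_2 = 1.0, 0.5   # first and second ternary relationship novelty

r2 = weighted_sum([r2_1, r2_2], [0.5, 0.5])  # binary relation novelty R2
r3 = weighted_sum([r3_1, r3_2], [0.5, 0.5])  # ternary relationship novelty R3
novelty = weighted_sum([r1_2, r2, r3], [0.5, 0.3, 0.2])
```

Each level has its own weight1, weight2, and so on, as in the text, so the weights are passed per call rather than shared globally.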
It should be noted that, in the embodiments of the present application, related content in embodiment 1, embodiment 2, and embodiment 3 may be cross-referenced. As described above, the step of extracting a plurality of binary relations in the target text may further include the following steps:
acquiring an entity relationship data set, wherein the entity relationship data set is obtained according to entities in a text set and the relationship between the entities; the entity relationship matrix comprises N entities and the relationship among the N entities, wherein N is greater than or equal to 2;
querying the entity relationship data set to obtain M second entities having a relationship with the first entity, wherein M is less than or equal to N;
searching the second entity in a preset range in the target text;
and if at least one target second entity in the M second entities is found, establishing a relationship between the first entity and the target second entity.
Before the step of searching for the second entity within the preset range in the target text, the method may further include the following steps:
creating an entity matching window;
and determining a preset range in the target text according to the size of the entity matching window.
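The window-based search can be sketched as follows. This is a simplified illustration: the target text is assumed to be already split into tokens, each entity is assumed to be a single token, the entity relationship data set is modeled as a dictionary from a first entity to its related second entities, and the window size is counted in tokens on each side. The function name `relations_in_window` and the sample sentence are hypothetical.

```python
def relations_in_window(tokens, first_entity, relation_data, window=5):
    """Pair first_entity with each related entity found within `window` tokens of it."""
    related = relation_data.get(first_entity, set())  # the M second entities
    found = set()
    for index, token in enumerate(tokens):
        if token != first_entity:
            continue
        low = max(0, index - window)
        high = min(len(tokens), index + window + 1)
        for other in tokens[low:high]:  # the preset range set by the window
            if other in related:
                found.add((first_entity, other))
    return found

# Hypothetical entity relationship data and tokenized target text.
relation_data = {"base": {"lamp_housing", "lamp_shade", "lamp_wick"}}
tokens = ["the", "base", "supports", "the", "lamp_housing", "and", "the", "lamp_shade"]

narrow = relations_in_window(tokens, "base", relation_data, window=3)
wide = relations_in_window(tokens, "base", relation_data, window=6)
```

A larger window finds more relations but also pairs entities that are further apart in the text, so the window size trades recall against precision.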
In the step of extracting a plurality of target entities from the target text, the method may further include the following steps:
inputting the target text into an entity extraction model, and identifying a plurality of target entities in the target text through the entity extraction model.
In the step of extracting the multiple binary relationships in the target text, the method may further include the following steps:
inputting the target text in which the target entities have been identified into a relationship extraction model, and extracting the binary relationships between the target entities through the relationship extraction model.
And according to the relation between the target entities, performing structured representation on the target text to generate a target structure. The target structure includes nodes and edges, the nodes are used for representing the target entities, and the edges are used for representing the relations between the target entities.
Example 4
Referring to fig. 9, an embodiment of the present application provides a method for acquiring image information, where the method is applied to an electronic device, where the electronic device may be a server or a terminal, and an execution subject in the embodiment of the present application is not particularly limited, and the method may include the following steps:
Step 501, receiving target text information to be matched, wherein the target text information comprises a target entity.
If the execution subject is a terminal, the terminal receives the target text information to be matched that is input by a user. If the execution subject is a server, the server receives the target text information to be matched that is sent by a terminal; for example, the target text information is "engine". In one application scenario, taking a server as the execution subject, a user wants to search for image information corresponding to "engine": the terminal receives "engine" input by the user and sends this target text information to the server, and the server receives it. It should be noted that the number of target entities in the embodiment of the present application is not limited, and the target entity "engine" is only an example and does not limit the present application.
Step 502, matching the target entity with candidate entities associated with each candidate image in the image dataset.
The server matches the target entity with the candidate entity associated with each candidate image in an image data set. The image data set may be stored in the server or acquired from another device, which is not specifically limited. The image data set comprises a large number of candidate images, and each candidate image has an associated candidate entity. For example, candidate image 1 is associated with "link", candidate image 2 is associated with "engine", and so on.
Step 503, if the target entity matches the candidate entity associated with the first candidate image in the image data set, determining the first candidate image as the candidate image matching the target entity.
For example, if a target entity (e.g., "engine") matches a candidate entity (e.g., "engine") associated with a first candidate image in the image dataset, the first candidate image is determined to be a candidate image matching the target entity.
Specifically, the specific way of matching the target entity with the candidate entity associated with the first candidate image in the image data set may be:
firstly, obtaining a semantic vector of the target entity and a semantic vector of the candidate entity associated with each candidate image. In a first possible implementation manner, these semantic vectors may be obtained through the "candidate matrix" in step 301 in embodiment 2. In a second possible implementation manner, they may be obtained through the trained Word2vec model in step 301 in embodiment 2. For the specific implementation of either manner, refer to step 301 in embodiment 2; details are not repeated here.
And then, calculating the cosine value of the included angle between the semantic vector of the target entity and the semantic vector of the candidate entity.
And obtaining the similarity between the target entity and the candidate entity according to the cosine value of the included angle between the semantic vector of the target entity and the semantic vector of the candidate entity, wherein the higher the similarity is, the higher the matching degree between the target entity and the candidate entity is.
Determining U candidate entities associated with the target entity in descending order of matching degree, where U is an integer greater than or equal to 1, and determining the candidate images associated with the U candidate entities as first candidate images; the number of first candidate images is not limited.
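The similarity calculation and the selection of the U best-matching candidate entities can be sketched as below. The two-dimensional vectors are toy values chosen only for illustration; in the embodiment the semantic vectors come from the candidate matrix or the trained Word2vec model of embodiment 2. The function names `cosine` and `top_u` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_u(target_vector, candidate_vectors, u=1):
    """Return the names of the u candidate entities most similar to the target."""
    ranked = sorted(candidate_vectors.items(),
                    key=lambda item: cosine(target_vector, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:u]]

# Toy semantic vectors (hypothetical values).
target_vector = [1.0, 0.0]
candidate_vectors = {"engine": [0.9, 0.1], "link": [0.1, 0.9]}

best = top_u(target_vector, candidate_vectors, u=1)
```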
Step 504, outputting the first candidate image.
And if the execution subject is the terminal, the terminal displays the first candidate image. If the execution subject is a server, the server sends the first candidate image to a terminal, so that the terminal displays the first candidate image.
In an application scenario, a user inputs "engine"; the terminal receives "engine" and sends it to the server, and the server matches "engine" against each candidate entity in the image data set. The server finds that the similarity between the target entity "engine" and a candidate entity "engine" is higher than a threshold, and that the similarity between the target entity and another candidate entity "engine" is also higher than the threshold, so the candidate image Aa associated with the first of these candidate entities and the candidate image Ab associated with the second are determined as first candidate images. The server sends the candidate image Aa and the candidate image Ab to the terminal, which displays them.
In the embodiment of the application, target text information to be matched, which comprises a target entity, is first received; the target entity is then matched with the candidate entity associated with each candidate image in the image data set; if the target entity matches a candidate entity associated with a first candidate image in the image data set, the first candidate image is determined as a candidate image matching the target entity; and the first candidate image is output. Because the output first candidate image matches the target entity in the target text information, it represents the target entity more vividly. Unlike the prior art, the method for acquiring image information in the embodiment of the application does not require manually looking through the drawings in texts one by one to select an image matching the target entity, which greatly reduces labor cost.
On the basis of the above-described embodiment, the image data set may be established in advance, and how to establish the image data set will be described in detail below. In step 503, the image data set includes a first image data set, and before the target entity is matched with the text information associated with each candidate image in the image data set, the method may further include the following steps:
in a first possible implementation, the image dataset comprises a first image dataset.
Acquiring a candidate text set; the candidate text set can be a patent text set, the candidate text set comprises a plurality of candidate texts, and each candidate text comprises a candidate entity; if the execution subject is a terminal, the terminal may obtain the candidate text set from a server, and if the execution subject is a server, the candidate text set may be stored in the server, or may also be obtained by the server from another device, which is not limited in particular.
Counting the frequency of occurrence of each candidate entity in the candidate text set; for example, in the candidate text set, "engine" occurs 10000 times, "connecting rod" occurs 9900 times, "pressing mechanism" occurs 9800 times, and so on. The candidate entities and frequencies in this example are only illustrative and do not limit the embodiments of the present application.
Determining high-frequency entities according to the frequency. The high-frequency entities include entities whose frequency of occurrence in the candidate text set is higher than a threshold; for example, the high-frequency entities are entities with a frequency higher than 9000. Alternatively, the high-frequency entities include entities ranked before a preset position when sorted by frequency; for example, all entities appearing in the candidate texts are sorted by frequency from high to low, and the entities ranked in the top 10000 are selected as high-frequency entities.
Associating each high frequency entity with at least one corresponding candidate image to obtain a first image dataset. The high frequency entities in the first image data set are entities with a higher frequency of occurrence.
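Both ways of selecting high-frequency entities can be sketched with a frequency counter. The counts for "engine", "connecting rod", and "pressing mechanism" follow the example above; the entity "gasket" and its count, as well as the function names `by_threshold` and `by_rank`, are hypothetical.

```python
from collections import Counter

def by_threshold(frequencies, threshold):
    """Entities whose frequency is strictly higher than the threshold."""
    return {entity for entity, count in frequencies.items() if count > threshold}

def by_rank(frequencies, top_n):
    """The top_n entities when sorted by frequency from high to low."""
    return [entity for entity, _ in frequencies.most_common(top_n)]

frequencies = Counter({
    "engine": 10000,
    "connecting rod": 9900,
    "pressing mechanism": 9800,
    "gasket": 120,  # hypothetical low-frequency entity
})

above_9000 = by_threshold(frequencies, 9000)
top_two = by_rank(frequencies, 2)
```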
Optionally, the image data set further includes a second image data set, and before the target entity is matched with the text information associated with each candidate image in the image data set, the method may further include the following steps:
acquiring a candidate text set, where each candidate text in the candidate text set comprises a drawing description and a drawing, the drawing description comprises candidate entities and candidate entity identifiers, and the drawing comprises candidate images and candidate image identifiers. Taking a patent as an example of a candidate text, each patent includes a drawing description and drawings; refer to fig. 10, which is a schematic diagram of a drawing description and a drawing. In fig. 10, the drawing description lists a plurality of candidate entities together with the number corresponding to each candidate entity in the drawing. For example, the number "1" corresponds to "soymilk maker body", so the candidate image labeled "1" in the drawing is the candidate image of "soymilk maker body"; the number "2" corresponds to "head", so the candidate image labeled "2" in the drawing is the candidate image of "head".
Establishing an association relationship between the candidate entity and the candidate image according to the identifiers to obtain a second image data set. The second image data set is obtained by recognizing the identifiers (such as numbers) in the drawing, matching the numbers in the drawing description with the numbers in the drawing, and then associating the candidate entity and the candidate image that correspond to the same number.
Optionally, the image data set further includes a third image data set, and before the target entity is matched with the text information associated with each candidate image in the image data set, the method may further include the following steps:
acquiring a candidate text set, where each candidate text in the candidate text set comprises a title and an abstract drawing. Again taking patents as an example of candidate texts, each patent includes a title and an abstract drawing, the abstract drawing being a main drawing that can represent the patent. For example, the patent is titled "a soymilk maker".
Extracting the abstract drawing from the candidate text.
Identifying a candidate entity in the title; for example, the candidate entity "soymilk maker" is extracted from the title "a soymilk maker" through the entity extraction model.
Establishing an association relationship between the candidate entity and the abstract drawing to obtain a third image data set; for example, the association relationship between "soymilk maker" and the abstract drawing is established.
It is noted that the image dataset may comprise at least one of a first image dataset, a second image dataset and a third image dataset. In the embodiment of the present application, the image data set includes a first image data set, a second image data set, and a third image data set.
Optionally, in step 502, the step of matching the target entity with the candidate entity associated with each candidate image in the image data set may specifically include the following steps:
firstly, matching a target entity with candidate entities associated with candidate images in a first image data set; the candidate entities included in the first image data set are entities with a higher frequency of occurrence, and the target entity may be first matched with the high frequency entity to increase the matching rate.
If the target entity does not match any candidate entity in the first image data set, the target entity is matched with the candidate entity associated with each candidate image in the image data sets other than the first image data set, that is, in the second image data set and/or the third image data set. If the target entity matches a candidate entity in the first image data set, the candidate image associated with that candidate entity is sent directly to the terminal so that the terminal displays the candidate image. In the embodiment of the application, the target entity is matched with the first image data set first, which improves the matching rate.
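The order of matching across the image data sets can be sketched as a simple fallback loop. Each data set is modeled here as a dictionary from candidate entity to a list of image identifiers, and exact string equality stands in for the similarity-based matching described earlier; the function name `match_tiered`, the data, and the image identifiers are all hypothetical.

```python
def match_tiered(target, first_dataset, other_datasets, is_match):
    """Try the first image data set, then the others in order; return matched images."""
    for dataset in (first_dataset, *other_datasets):
        hits = [image
                for entity, images in dataset.items() if is_match(target, entity)
                for image in images]
        if hits:
            return hits  # stop at the first data set that yields a match
    return []

def exact(target, entity):
    """Placeholder matcher; the embodiment uses semantic similarity instead."""
    return target == entity

first_dataset = {"engine": ["Aa"]}                 # high-frequency entities
second_dataset = {"soymilk maker body": ["B1"]}    # built from drawing descriptions
third_dataset = {"soymilk maker": ["C1"]}          # built from titles and abstract drawings

from_first = match_tiered("engine", first_dataset, [second_dataset, third_dataset], exact)
from_later = match_tiered("soymilk maker", first_dataset, [second_dataset, third_dataset], exact)
```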
Optionally, on the basis of the above embodiment, the image data set further includes candidate image relationships, and each candidate image relationship includes at least two candidate images and the relationship between them. For example, a candidate image relationship is (candidate image 1 is connected to candidate image 2), such as (the soymilk maker body image is connected to the head image). The candidate image relationship is obtained from the relationship between candidate entities: if the relationship extraction model identifies that the relationship between the candidate entities is "the soymilk maker body is connected to the head", the relationship between the images associated with those candidate entities is determined accordingly, yielding the candidate image relationship.
Optionally, on the basis of the foregoing embodiment, when the first candidate image is included in a target candidate image relationship, for example, when the target candidate image relationship in the image data set is (the soymilk maker body image is connected to the head image) and the first candidate image (such as the soymilk maker body image) is included in that target candidate image relationship, the method may further include the following steps:
firstly, determining a second candidate image contained in the target candidate image relationship, wherein the second candidate image and the first candidate image have a relationship; a second candidate image (e.g., a handpiece image) included in the target candidate image relationship is determined.
The first candidate image and the second candidate image are then output.
In an application scenario, the target entity input by a user is "soymilk maker", and the user wants to understand the structure of the soymilk maker more vividly through image information. The terminal sends the target entity to the server, and the server matches the target entity (soymilk maker) with the candidate entity associated with each candidate image in the image data set; the matched candidate entity is "soymilk maker body". Further, the first candidate image associated with the soymilk maker body (namely the soymilk maker body image) is connected to a second candidate image (namely the head image), so the first candidate image and the second candidate image are both output. It should be noted that the embodiment of the present application does not limit the number of first candidate images or the number of second candidate images. For example, the number of first candidate images is 2, and each first candidate image may have second candidate images associated with it; if each first candidate image has two associated second candidate images, the number of images finally output is 4. The output first candidate images and second candidate images may form a topological structure, as shown in fig. 11, which is a schematic topological diagram of the first candidate image and the second candidate image. The terminal thus displays not only the image information of the "soymilk maker" but also other image information related to it.
In the embodiment, the second candidate image related to the first candidate image can be output according to the relation of the candidate images, and other images related to the first candidate image do not need to be manually analyzed and retrieved, so that the labor cost is saved, and the application scenes are increased.
On the basis of the above embodiment, optionally, the target entities at least include a first target entity and a second target entity, and the target text information further includes a first relationship between the first target entity and the second target entity; the method may further comprise the steps of:
if the first target entity matches a first candidate entity associated with a first candidate image in the image data set and the second target entity matches a second candidate entity associated with a second candidate image in the image data set, matching the first relationship between the first target entity and the second target entity with a second relationship between the first candidate entity and the second candidate entity;
if the first relationship matches the second relationship, the method further comprises:
and outputting the second candidate image.
In one application scenario, the user inputs a first target entity "soymilk maker", a second target entity "head", and a first relationship "connection" between them. If the first target entity (soymilk maker) matches a first candidate entity (soymilk maker body) associated with a first candidate image in the image data set, and the second target entity (head) matches a second candidate entity associated with a second candidate image (head image) in the image data set, the relationships are then matched: the first relationship is "connection", and the second relationship between the first candidate entity and the second candidate entity is also "connection". Because the first relationship matches the second relationship, the second candidate image is output.
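The relationship matching in this scenario can be sketched with entity-relationship-entity triples. The sketch assumes relationships are compared by exact label and uses a substring test as a stand-in for the similarity-based entity matching; the function names `related_image` and `loose_match` and the image labels are hypothetical.

```python
def related_image(target_triple, candidate_triples, image_of, entity_match):
    """Return the second candidate image if both entities and the relationship match."""
    target_first, target_relation, target_second = target_triple
    for cand_first, cand_relation, cand_second in candidate_triples:
        if (entity_match(target_first, cand_first)
                and entity_match(target_second, cand_second)
                and target_relation == cand_relation):
            return image_of[cand_second]
    return None

def loose_match(target, candidate):
    """Placeholder for similarity matching: the target text occurs in the candidate."""
    return target in candidate

candidate_triples = [("soymilk maker body", "connection", "head")]
image_of = {"soymilk maker body": "body image", "head": "head image"}

result = related_image(("soymilk maker", "connection", "head"),
                       candidate_triples, image_of, loose_match)
no_result = related_image(("soymilk maker", "inclusion", "head"),
                          candidate_triples, image_of, loose_match)
```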
Optionally, the establishing of the relationship between the candidate images may specifically be:
extracting candidate entities in the candidate text and the relation between the candidate entities;
and establishing the relationship between the candidate images associated with the candidate entities according to the relationship between the candidate entities. For example, if the relationship extracted between the candidate entity "soymilk maker body" and the candidate entity "head" is "connection", the relationship between the soymilk maker body image and the head image is established as a connection relationship.
Optionally, the extracting of the candidate entities in the candidate text and the relationship between the candidate entities may specifically include the following steps:
inputting the candidate text into an entity extraction model, and identifying candidate entities in the candidate text through the entity extraction model;
and inputting the candidate text in which the candidate entities have been identified into the relationship extraction model, and outputting the relationships between the candidate entities through the relationship extraction model. Specifically, for extracting candidate entities through the entity extraction model and extracting relationships between candidate entities through the relationship extraction model, refer to step 202 and step 203 in embodiment 1; details are not repeated here.
Optionally, the target text information is a target structure of the structured representation.
Example 5
Referring to fig. 12, an embodiment of the present application further provides a method for acquiring entity information. The method is applied to an electronic device, which may be a server or a terminal; the execution subject in the embodiment of the present application is not specifically limited. For better understanding of the present embodiment, the terms used in it are explained first:
it should be noted that the "association relationship" between entities in this embodiment has the same meaning as the "relationship" between entities in the above embodiments 1 to 4. The explanation of the association relationship in the embodiment of the present application is also applicable to the explanation of the "relationship" in the above-described embodiments 1 to 4.
The attributes of the association relationship include a relationship type. Relationship types include, but are not limited to, conceptual relationships, belonging relationships, positional relationships, sequential relationships, and logical relationships.
The conceptual relationship refers to the relationship between general and specific concepts, i.e., the upper-lower relationship. For example, "vehicle" is a generic (upper) concept relative to specific (lower) concepts such as "car" and "bus".
Optionally, the relationship extraction model in this embodiment is obtained by further training on the claims of a large number of patent texts. Claims contain a large number of upper and lower concepts. For example, in "the connecting assembly includes a screw and a nut", the connecting assembly is the upper concept, and the screw and the nut are lower concepts. By learning from a large number of claims, the relationship extraction model can identify the upper-lower relationships between entities in a text.
The belonging relationship includes, but is not limited to, the inclusion relationship, the connection relationship, and the parallel relationship.
1) Inclusion relationship: by definition, an upper entity includes lower entities. For example, an upper module includes lower modules, or an automobile includes wheels, so the automobile and the wheels form an upper-lower relationship.
2) Connection relationship: the entities are connected to each other. For example, if a base is connected with an LED lamp, the relationship between the base and the LED lamp is a connection relationship.
3) Parallel relationship: the entities are parallel to each other. For example, the soybean milk machine comprises an upper cover and a lower cover; the upper cover and the lower cover have neither an inclusion relationship nor a connection relationship, so they are in a parallel relationship.
Sequential relationship: the entities have an order of precedence. For example, step 1: receiving a first signal; step 2: processing the first signal to obtain a second signal. The steps are ordered, that is, the first signal comes first and the second signal comes after, so the "first signal" and the "second signal" have a sequential (time-order) relationship.
Positional relationship: refers to a spatial relationship, such as inside, outside, left, right, front, back, etc. For example, if the "LED lamp" is disposed on the "base", the "LED lamp" and the "base" have a positional relationship.
Logical relationship: in a natural-language expression, one entity is taken as a reference position, at least one other entity is searched for within a preset range around it, and the entity at the reference position has a logical relationship with each entity found within the preset range. For example, consider the following natural-language expression: "A soybean milk machine with a double-layer lower cover comprises a cup body and a machine head. The machine head is arranged on the cup body and comprises an upper cover and a lower cover that closes against the upper cover. A motor and a control circuit are fixedly mounted on the machine head, the motor shaft extends downward into the cup body below the motor chamber, and a crushing cutter is mounted at the end of the motor shaft." Taking the "motor" in the text as the reference position, the search extends g characters forward and backward. For example, with g = 10, searching 10 characters forward from the motor finds another entity, "machine head", and searching 10 characters backward finds "control circuit" and "motor shaft". The "machine head", "control circuit", and "motor shaft" are therefore in a logical relationship with the "motor".
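The window-based search described above can be sketched as follows. This is a minimal illustration, not the implementation of the present application: the function name `logically_related`, the sample sentence, and the use of plain substring search in place of an entity extraction model are all assumptions made for the example.

```python
def logically_related(text, entities, reference, g):
    """Return entities lying entirely within g characters of any occurrence of `reference`.

    Hypothetical helper: entity spans are found by substring search, which stands in
    for the output of an entity extraction model.
    """
    related = set()
    start = text.find(reference)
    while start != -1:
        end = start + len(reference)
        # Window of g characters before and after the reference entity.
        lo, hi = max(0, start - g), min(len(text), end + g)
        for entity in entities:
            if entity == reference:
                continue
            pos = text.find(entity, lo)
            # Keep the entity only if its whole span fits inside the window.
            if pos != -1 and pos + len(entity) <= hi:
                related.add(entity)
        start = text.find(reference, end)
    return related


sample = "the machine head holds a motor and a control circuit; a blade sits in the cup body"
known = ["machine head", "motor", "control circuit", "blade", "cup body"]
print(logically_related(sample, known, "motor", 25))
```

The window size g is counted in characters, so a suitable value depends on the language and tokenization of the candidate text.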
Referring to fig. 12, a method for acquiring entity information provided in an embodiment of the present application may include the following steps:
Step 601, receiving target text information, wherein the target text information comprises a first target entity.
If the execution subject is the terminal, the terminal receives the target text information input by the user. If the execution subject is the server, the server receives the target text information sent by the terminal. For example, the target text information is "engine". This embodiment is described by taking the server as the execution subject. In one application scenario, the terminal receives the user input "engine" and sends this target text information to the server, which receives it. It should be noted that the number of first target entities in the embodiment of the present application is not limited, and the target entity "engine" in this example is merely exemplary and does not limit the present application.
Step 602, retrieving a first candidate entity matching a first target entity in a dataset; the data set comprises candidate entities and relations among the candidate entities, wherein the candidate entities at least comprise a first candidate entity and a second candidate entity which is in an association relation with the first candidate entity.
The data set may be pre-established and then stored, or may be obtained from another device. How this data set is built is explained below:
A candidate text set is acquired. The candidate text set may be a patent text set; it comprises a plurality of candidate texts, and each candidate text comprises candidate entities. The candidate entities in each candidate text are extracted through the entity extraction model, and the relationships in the candidate text are extracted through the relationship extraction model, to obtain the candidate entities and the relationships between them. The data set is then obtained according to the candidate entities and the association relationships between them.
For example, if the first target entity is "soymilk maker", the first candidate entity in the data set matching the first target entity may be "soymilk maker body", and the data set contains a second candidate entity "upper cover" having an association relationship with the first candidate entity "soymilk maker body". It should be noted that the association relationship in the embodiment of the present application includes the above-mentioned belonging relationship, conceptual relationship, sequential relationship and logical relationship.
For example, the second candidate entity may be "upper cover", in which case the first candidate entity and the second candidate entity are in a belonging relationship (inclusion relationship). The second candidate entity may also be a candidate entity having a conceptual relationship, a sequential relationship or a logical relationship with the first candidate entity, which is not illustrated one by one here.
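As a rough sketch of how such a data set might be assembled, the following assumes that the extraction models have already produced (entity, relationship, entity) triples for each candidate text. The function name `build_dataset`, the triple layout, and the text identifiers are hypothetical and are not taken from the present application.

```python
from collections import defaultdict


def build_dataset(extracted_triples):
    """Collect candidate entities and their relationships from per-text triples.

    `extracted_triples` maps a text identifier to a list of
    (head entity, relationship type, tail entity) tuples, standing in for
    the output of the entity and relationship extraction models.
    """
    entities = set()
    relations = defaultdict(list)
    for text_id, triples in extracted_triples.items():
        for head, relation, tail in triples:
            entities.update((head, tail))
            # Keep the source text so an entity can be traced back to its candidate text.
            relations[head].append((relation, tail, text_id))
    return {"entities": entities, "relations": dict(relations)}


dataset = build_dataset({
    "T1": [("soymilk maker body", "inclusion", "upper cover")],
    "T2": [("base", "connection", "LED lamp")],
})
print(sorted(dataset["entities"]))
```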
It should be noted that, in this step, the specific matching method between the first target entity and the first candidate entity may be understood by combining step 503 in embodiment 4, which is not described herein again.
Step 603, selecting a second candidate entity having an association relation with the first candidate entity in the data set.
A second candidate entity having an association relationship with the first candidate entity is selected in the data set. For example, the "upper cover" and the first candidate entity are in an inclusion relationship, the "motor" and the first candidate entity are in a logical relationship, the "cover component" and the first candidate entity are in a conceptual relationship, and so on, which are not limited herein.
Step 604, outputting the second candidate entity.
The server sends the second candidate entity to the terminal, and the terminal displays the second candidate entity. In this embodiment, neither the number of second candidate entities nor the type of association relationship between the second candidate entities and the first candidate entity is limited.
In an application scenario, when a user needs to improve a structure related to a soybean milk machine, the user can input "soybean milk machine". The terminal receives this input and sends it to the server. The server matches "soybean milk machine" against the candidate entities in the data set, finds the matching candidate entity "soybean milk machine body", and determines the second candidate entities that have an association relationship with the "soybean milk machine body". The server then sends the second candidate entities to the terminal, and the terminal displays them, for example in the form of a list.
In the embodiment of the application, target text information comprising a first target entity is received; a first candidate entity matching the first target entity is retrieved in the data set, where the candidate entities at least comprise the first candidate entity and a second candidate entity having an association relationship with the first candidate entity; a second candidate entity having an association relationship with the first candidate entity is then selected in the data set and output. In this embodiment, second candidate entities related to the first target entity can be recommended automatically according to the first target entity, so that the user does not have to retrieve documents and analyze them one by one in order to select the second candidate entities manually, which greatly saves labor cost.
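Steps 601 to 604 can be illustrated with the following sketch. The substring comparison used for matching is only a stand-in (the actual matching method is that of step 503 in embodiment 4, which is not reproduced here), and the names `match_first_candidates` and `recommend`, as well as the toy data set, are assumptions.

```python
# Toy data set: first candidate entity -> list of (relationship type, second candidate entity).
DATASET = {
    "soymilk maker body": [
        ("inclusion", "upper cover"),
        ("inclusion", "lower cover"),
        ("logical", "motor"),
    ],
    "base": [("connection", "LED lamp")],
}


def match_first_candidates(target_entity, dataset):
    """Step 602 (stand-in): treat a candidate as matching if either string contains the other."""
    return [c for c in dataset if target_entity in c or c in target_entity]


def recommend(target_entity, dataset):
    """Steps 603 and 604: gather the second candidate entities associated with each match."""
    result = []
    for first in match_first_candidates(target_entity, dataset):
        for _relation, second in dataset[first]:
            result.append(second)
    return result


print(recommend("soymilk maker", DATASET))
```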
Optionally, on the basis of the above embodiment, the attribute of the association relationship includes a relationship type, and the target text information further includes a target relationship condition, which indicates the type of relationship between the target entity and the candidate entity to be obtained. The target relationship condition may be a specific word, such as "include", "connection", or "lower". "Include" indicates that the type of relationship between the target entity and the candidate entity to be acquired is a belonging relationship; "connection" likewise indicates a belonging relationship; and "lower" indicates a conceptual relationship. Optionally, the target relationship condition may also be represented by a flag, for example, "bh" represents "include", "lj" represents "connection", and so on.
In step 603, the specific step of selecting a second candidate entity having an association relationship with the first candidate entity in the data set may further be:
and selecting a second candidate entity of a type meeting the target relation condition in the data set according to the first candidate entity.
For example, the target text information includes the first target entity "soymilk maker" and the target relationship condition "include". A second candidate entity satisfying the "include" relationship is then selected in the data set according to the first candidate entity "soymilk maker body"; for example, the second candidate entity may be "motor", "upper cover", or "lower cover".
In this embodiment, the target text information may further include a target relationship condition, and further, a second candidate entity of a type that meets the target relationship condition may be selected in the data set according to the first candidate entity, thereby increasing applicable scenarios.
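Filtering by a target relationship condition might look like the sketch below. The mapping from condition words and flags to relationship types follows the examples given above ("include"/"bh", "connection"/"lj", "lower"); the dictionary layout, the type labels, and the function name `select_by_condition` are assumptions made for illustration.

```python
# Condition word or flag -> relationship type label stored in the data set (labels are hypothetical).
CONDITION_TO_TYPE = {
    "include": "inclusion",
    "bh": "inclusion",
    "connection": "connection",
    "lj": "connection",
    "lower": "conceptual",
}


def select_by_condition(first_candidate, condition, relations):
    """Return second candidate entities whose relationship type satisfies the condition."""
    wanted = CONDITION_TO_TYPE[condition]
    return [
        tail
        for rel_type, tail in relations.get(first_candidate, [])
        if rel_type == wanted
    ]


relations = {
    "soymilk maker body": [
        ("inclusion", "motor"),
        ("inclusion", "upper cover"),
        ("inclusion", "lower cover"),
        ("connection", "base"),
    ]
}
print(select_by_condition("soymilk maker body", "include", relations))
```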
Optionally, the selecting, in the data set, a second candidate entity having an association relationship with the first candidate entity may further include:
selecting a plurality of second candidate entities having an association relation with the first candidate entity in the data set according to the first candidate entity;
and selecting a target second candidate entity from the plurality of second candidate entities according to a preset rule, and taking the target second candidate entity as the second candidate entity.
In one implementation, the frequency of occurrence of each of the plurality of second candidate entities in the data set is determined. For example, the plurality of second candidate entities are "motor", "upper cover", "lower cover", and the like, where the frequency of occurrence of "motor" in the data set is greater than a threshold, or ranks first among all the second candidate entities.
A target second candidate entity is then selected from the plurality of second candidate entities according to the frequency and taken as the second candidate entity. For example, "motor" may be selected as the target second candidate entity.
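A minimal sketch of the frequency-based rule, assuming the data set can be flattened into a list of entity mentions. The function name `select_by_frequency` and the sample counts are hypothetical.

```python
from collections import Counter


def select_by_frequency(candidates, mentions, threshold=None):
    """Pick target second candidate entities by how often they occur in the data set.

    With a threshold, keep every candidate occurring more often than the threshold;
    otherwise keep only the most frequent candidate.
    """
    counts = Counter(m for m in mentions if m in candidates)
    if threshold is not None:
        return [c for c in candidates if counts[c] > threshold]
    return [max(candidates, key=lambda c: counts[c])]


candidates = ["motor", "upper cover", "lower cover"]
mentions = ["motor"] * 5 + ["upper cover"] * 2 + ["lower cover"]
print(select_by_frequency(candidates, mentions))
print(select_by_frequency(candidates, mentions, threshold=1))
```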
In another implementation, a related date of the candidate text to which each of the plurality of second candidate entities belongs may be determined, where the related date includes but is not limited to an application date, a filing date, and a publication date, and the plurality of second candidate entities belong to different texts;
a target second candidate entity is then selected from the plurality of second candidate entities according to the related dates and taken as the second candidate entity. Taking the publication date as the related date, the target second candidate entity is selected in order of publication date, from nearest to farthest from the current date. For example, if the publication date of the patent document to which "motor" belongs is 2018.6.3, that of the patent document to which "upper cover" belongs is 2017.5.4, and that of the patent document to which "lower cover" belongs is 2017.1.4, the second candidate entity whose publication date is closest to the current date ("motor" in this example) may be selected as the target second candidate entity. It should be noted that the plurality of second candidate entities in this embodiment are only examples for convenience of description and do not limit the present application.
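The date-based rule can be sketched in the same way, using the publication dates from the example above. The function name `rank_by_recency` and the fixed reference date are assumptions; in practice the current date would be used.

```python
from datetime import date


def rank_by_recency(publication_dates, today):
    """Order second candidate entities from nearest to farthest publication date."""
    return sorted(publication_dates, key=lambda entity: today - publication_dates[entity])


publication_dates = {
    "motor": date(2018, 6, 3),
    "upper cover": date(2017, 5, 4),
    "lower cover": date(2017, 1, 4),
}
# Arbitrary reference date chosen for the illustration.
ranked = rank_by_recency(publication_dates, today=date(2019, 1, 1))
print(ranked[0])
```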
Optionally, on the basis of the foregoing embodiment, the attribute of the association relationship further includes a relationship dimension. The relationship dimension includes a binary relationship, or ranges from a binary relationship to an X-ary relationship, where X is an integer greater than or equal to 3. A binary relationship includes two entities and the relationship between them. An X-ary relationship includes X entities and at least (X-1) binary relationships, and the (X-1) binary relationships are connected to one another through the entities they share.
Optionally, on the basis of the foregoing embodiment, the number of the second candidate entities is multiple, the target text information further includes a second target entity and a target relationship condition, and selecting, in the data set, the second candidate entity having an association relationship with the first candidate entity may further specifically include:
retrieving a plurality of second candidate entities in the dataset that match the second target entity;
selecting a target second candidate entity meeting the target relation condition from a plurality of second candidate entities;
outputting an R-element relationship group, wherein R is an integer greater than or equal to 2 and less than or equal to N, the R-element relationship group comprises a plurality of R-element relationships, and each R-element relationship comprises a first candidate entity, a target second candidate entity, and the relationship between the first candidate entity and the target second candidate entity.
For example, the first target entity is "engine", the second target entity is "link", and the target relationship condition is "connection". The first candidate entity is "engine" or the like. A plurality of second candidate entities matching the second target entity are retrieved from the data set; these may be "upper link", "lower link", "link assembly", and so on. The R-element relationship group may be a binary relationship group and/or a ternary relationship group; this embodiment is described by taking a binary relationship group as an example. The binary relationship group includes, for example: binary relationship 1 (the engine is connected to the upper link), binary relationship 2 (the engine is connected to the lower link), binary relationship 3 (the engine is connected to the link assembly), and so on. In this embodiment, the R-element relationship group can be automatically retrieved and output according to the first target entity, the second target entity, and the relationship between the first target entity and the second target entity.
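For the binary case, the retrieval of a relationship group could be sketched as follows. The triple representation, the substring match against the second target entity, and the function name `binary_relation_group` are illustrative assumptions rather than the method of the present application.

```python
def binary_relation_group(first_candidates, second_target, condition, triples):
    """Collect binary relationships (first candidate, relationship, target second candidate).

    A triple is kept when its head is one of the first candidate entities, its
    relationship type equals the target relationship condition, and its tail
    matches the second target entity (substring match as a stand-in).
    """
    return [
        (head, relation, tail)
        for head, relation, tail in triples
        if head in first_candidates and relation == condition and second_target in tail
    ]


triples = [
    ("engine", "connection", "upper link"),
    ("engine", "connection", "lower link"),
    ("engine", "connection", "link assembly"),
    ("engine", "inclusion", "piston"),
]
for relationship in binary_relation_group({"engine"}, "link", "connection", triples):
    print(relationship)
```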
Optionally, the entity includes a component, an attribute, and/or an attribute value.
The target entity comprises a target component, a target attribute, and/or a target attribute value; the candidate entities include candidate components, candidate attributes, and/or candidate attribute values, and the candidate entities are associated with the candidate texts to which the candidate entities belong, for example, the candidate texts are patent texts, each of which has a patent number, and the candidate entities can be associated with the candidate texts to which the candidate entities belong through the patent numbers. The method may further comprise:
respectively matching the target component with each candidate component, the target attribute with each candidate attribute, and/or the target attribute value with each candidate attribute value; for example, the target component is "motor", the target attribute is "voltage", and the target attribute value is "220V".
determining a target candidate component, a target candidate attribute, and/or a target candidate attribute value that match the target component, the target attribute, and/or the target attribute value, respectively;
acquiring a first candidate text associated with the target candidate component, a second candidate text associated with the target candidate attribute, and/or a third candidate text associated with the target candidate attribute value. The numbers of first, second, and third candidate texts are not limited. For example, there are 100 first candidate texts including "motor", 80 second candidate texts including "voltage", and 80 third candidate texts including "220V". These sets may overlap: for example, a candidate text XX may include "motor", "voltage" and "220V" at the same time, that is, the first candidate text, the second candidate text and the third candidate text may be the same or different. The numbers given here are only examples for convenience of description and do not limit the present application.
And outputting the first candidate text, the second candidate text and/or the third candidate text.
Specifically, the first candidate text, the second candidate text, and/or the third candidate text may be output in the form of a list, so that the user can view the candidate texts containing "motor", "voltage", and "220V", and read the detailed description of the target component, the target attribute, and/or the target attribute value in those texts.
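One way to picture the association between entities and candidate texts is an inverted index keyed by entity kind, as in the sketch below. The index layout, the identifiers "TEXT-1" to "TEXT-3" (standing in for patent numbers), and the function name `texts_for` are hypothetical.

```python
# Entity kind -> entity -> identifiers of the candidate texts containing it (all values hypothetical).
INDEX = {
    "component": {"motor": {"TEXT-1", "TEXT-2"}},
    "attribute": {"voltage": {"TEXT-1"}},
    "attribute_value": {"220V": {"TEXT-1", "TEXT-3"}},
}


def texts_for(target, index):
    """Return, per entity kind, the sorted identifiers of texts associated with the target entity."""
    return {
        kind: sorted(index.get(kind, {}).get(value, set()))
        for kind, value in target.items()
    }


target = {"component": "motor", "attribute": "voltage", "attribute_value": "220V"}
for kind, text_ids in texts_for(target, INDEX).items():
    print(kind, text_ids)
```

In this toy index, "TEXT-1" appears in all three lists, mirroring the case where the first, second, and third candidate texts are the same text.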
Optionally, on the basis of the foregoing embodiment, the data set includes candidate relationships, where the candidate relationships include at least two candidate entities and relationships between the at least two candidate entities, the target text information includes a target relationship, the target relationship at least includes two target entities and relationships between the target entities, and the two target entities include the first target entity and the second target entity;
the step of selecting a second candidate entity having an association relationship with the first candidate entity in the data set may further specifically include:
retrieving a target candidate entity in the data set that matches the second target entity, the target candidate entity having an association relationship with the first candidate entity. For example, the target relationship includes a first target entity "cover body", a second target entity "upper cover", and the relationship between the first target entity ("cover body") and the second target entity ("upper cover"), namely an inclusion relationship. Target candidate entities matching the second target entity ("upper cover"), such as "upper cover" or "upper end cover" (the specific number is not limited), are retrieved in the data set, and each target candidate entity has an association relationship (such as an inclusion relationship) with the first candidate entity matching the first target entity (such as the cover body).
A first candidate relationship containing the target candidate entity is searched for according to the candidate relationships, where the first candidate relationship further comprises a third candidate entity and the relationship between the target candidate entity and the third candidate entity. The data set comprises a plurality of candidate relationships, and each candidate relationship comprises at least two candidate entities and the relationship between them. From the large number of candidate relationships in the data set, a first candidate relationship including the target candidate entity (e.g., "upper cover" or "upper end cover") is found. For brevity, the target candidate entity is exemplified by "upper end cover". The first candidate relationship includes the target candidate entity and a third candidate entity (e.g., a button, a display screen, etc.); for example, the first candidate relationship may be (the upper end cover is provided with a button) or (the upper end cover is provided with a display screen). It should be noted that the type of association relationship between the target candidate entity and the third candidate entity in the first candidate relationship is not limited, and may be, for example, arrangement, connection, inclusion, and the like.
Further, in a first implementation manner, the first candidate relationship is output as the second candidate entity, for example (the upper end cover is provided with a button). The server sends the first candidate relationship to the terminal, and the terminal displays it, that is, displays (the upper end cover is provided with a button). In an application scenario, if a technician inputs (the cover body includes an upper cover), the server can automatically recommend components associated with the target relationship, namely that a "button" or a "display screen" can be arranged on the "upper cover", which has great reference value for the technician's technical improvement. In a second possible implementation manner, the third candidate entity itself (i.e., the button or the display screen) may be output directly.
In a third possible implementation manner, a candidate entity similar to the third candidate entity may also be searched for. The similarity between two entities is determined through their semantic vectors, as in step 303 of embodiment 1, which is not described again here. A candidate entity whose similarity with the third candidate entity is greater than a threshold is selected and output directly; for example, the candidate entity similar to the third candidate entity "button" is "key", and "key" is output.
Optionally, in a fourth possible implementation manner, the third candidate entity may be matched against the candidate entities included in each candidate relationship, and a fourth candidate entity matching the third candidate entity is determined. For example, the third candidate entity is "button", and the fourth candidate entity matching it is "key".
The second candidate relationship containing the fourth candidate entity is then taken as the second candidate entity and output; for example, the second candidate relationship may be (the key is arranged on the operation panel). The content displayed on the terminal may be as follows: the cover body comprises an upper cover, the upper cover is provided with a key, and the key is arranged on the operation panel. Optionally, the displayed content may be structured text or a structured image. In an application scenario, if a technician inputs (the cover body comprises an upper cover), the server can automatically recommend components associated with the target relationship, namely that a button can be arranged on the upper cover, or that a key can be arranged on the upper cover with the key arranged on an operation panel. Such entity recommendations have great reference value for the technician's technical improvement.
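The threshold test on semantic vectors might be sketched as below. Cosine similarity is used here as an assumed similarity measure (the exact measure of step 303 in embodiment 1 is not reproduced in this section), and the three-dimensional vectors are made-up values for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_entities(entity, vectors, threshold):
    """Return candidate entities whose similarity with `entity` exceeds the threshold."""
    reference = vectors[entity]
    return [
        name
        for name, vec in vectors.items()
        if name != entity and cosine(reference, vec) > threshold
    ]


# Hypothetical semantic vectors; real ones would come from a trained embedding model.
vectors = {
    "button": [0.9, 0.1, 0.0],
    "key": [0.8, 0.2, 0.1],
    "base": [0.0, 0.1, 0.9],
}
print(similar_entities("button", vectors, threshold=0.9))
```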
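The chain from the target candidate entity to the first candidate relationship, the third candidate entity, the fourth candidate entity, and finally the second candidate relationship can be traced with a small sketch. The triple list, the `MATCHES` table standing in for entity matching, and both function names are assumptions made for the example.

```python
RELATIONS = [
    ("cover body", "includes", "upper end cover"),
    ("upper end cover", "is provided with", "button"),
    ("upper end cover", "is provided with", "display screen"),
    ("key", "is arranged on", "operation panel"),
]
# Third candidate entity -> matching fourth candidate entities (hypothetical matching result).
MATCHES = {"button": ["key"]}


def first_candidate_relations(target_candidate, relations):
    """First candidate relationships: those whose head is the target candidate entity."""
    return [r for r in relations if r[0] == target_candidate]


def second_candidate_relations(third_entity, relations, matches):
    """Second candidate relationships: those containing a fourth candidate entity."""
    fourth_entities = matches.get(third_entity, [])
    return [r for r in relations if r[0] in fourth_entities or r[2] in fourth_entities]


firsts = first_candidate_relations("upper end cover", RELATIONS)
third_entities = [tail for _head, _rel, tail in firsts]
print(third_entities)
print(second_candidate_relations("button", RELATIONS, MATCHES))
```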
Optionally, in a fifth possible implementation manner, the target text information includes a target relationship, where the target relationship at least includes two target entities and a relationship between the target entities, and the two target entities include the first target entity and the second target entity; the selecting, in the data set, a second candidate entity having an association relationship with the first candidate entity may further specifically include:
retrieving a target candidate entity in the data set that matches the second target entity, the target candidate entity having an association relationship with the first candidate entity. For example, the target relationship includes a first target entity "cover body", a second target entity "upper cover", and the relationship between the first target entity ("cover body") and the second target entity ("upper cover"), namely an inclusion relationship. Target candidate entities matching the second target entity ("upper cover"), such as "upper cover" or "upper end cover" (the specific number is not limited), are retrieved in the data set, and each target candidate entity has an association relationship (such as an inclusion relationship) with the first candidate entity matching the first target entity (such as the cover body).
Further, in the fifth possible implementation manner, a fifth candidate entity having an association relationship with the target candidate entity is searched for according to the candidate relationships. The fifth candidate entity is included in a third candidate relationship, which comprises the fifth candidate entity, a sixth candidate entity, and the relationship between them. For example, a fifth candidate entity ("soymilk maker body") having an association relationship with the target candidate entity ("upper end cover") is found according to the candidate relationships; the third candidate relationship containing it may be (the upper end cover is connected to the soymilk maker body) or (the soymilk maker body includes a lower end cover). The sixth candidate entity may be the same as or different from the target candidate entity.
Further, the third candidate relationship is output as the second candidate entity. In an application scenario, if a technician inputs (the cover body includes an upper cover), the server may automatically recommend candidate relationships associated with the target relationship. For example, the terminal may display the following content: the cover body comprises an upper cover, the upper end cover is connected with the soymilk maker body, and the soymilk maker body comprises a lower end cover; or: the cover body comprises an upper cover, the soymilk maker body is connected with the base, and the upper cover is connected with the soymilk maker body. In this example, the server can recommend relationships associated with the target relationship, which broadens the applicable scenarios and provides great reference value for technical improvement.
Optionally, in a sixth possible implementation manner, a fourth candidate relationship including the third candidate relationship is determined according to the candidate relationships. For example, the fourth candidate relationship is (the upper end cover is connected with the soymilk maker body, and the soymilk maker body is connected with the base). The fourth candidate relationship is then output as the second candidate entity. In an application scenario, if a technician inputs (the cover body includes an upper cover), the server may automatically recommend candidate relationships associated with the target relationship. For example, the terminal may display the following content: the cover body comprises an upper cover, the upper end cover is connected with the soymilk maker body, and the soymilk maker body comprises a lower end cover. In this example, the server can recommend relationships associated with the target relationship, which broadens the applicable scenarios and provides great reference value for technical improvement.
It should be noted that, in this embodiment, the candidate relationship, the target relationship, and the candidate entity are all exemplary illustrations, and do not constitute a limiting illustration of the present application.
Optionally, on the basis of the above embodiment, the data set further comprises an image data set, the image data set comprises a plurality of candidate images, each candidate image in the plurality of candidate images has an associated candidate entity, and after selecting a second candidate entity in the data set having an association with the first candidate entity, the method further comprises:
searching the image data set according to the second candidate entity, determining a candidate image associated with the second candidate entity, and taking the candidate image of the second candidate entity as the second candidate entity to be output.
For example, in one application scenario, the second candidate entities are "upper link" and "lower link", the image dataset is searched for according to the second candidate entities, candidate images associated with the "upper link" and the "lower link" are determined, and the image of the "upper link" and the image of the "lower link" are output as the second candidate entities.
In this embodiment, the candidate image of the second candidate entity may be acquired and output directly, which presents the second candidate entity more vividly, since image information makes the second candidate entity easier for the user to understand.
Optionally, on the basis of the above embodiment, how to create the image data set is explained as follows:
In a first implementation, the image data set comprises a first image data set. Before the image data set is searched according to the second candidate entity and the candidate image associated with the second candidate entity is determined, the method further comprises:
acquiring a candidate text set, wherein the candidate text set comprises a plurality of candidate texts, and each candidate text comprises a candidate entity;
counting the occurrence frequency of each candidate entity in the candidate text set;
determining high-frequency entities according to the frequency, wherein a high-frequency entity is an entity whose frequency of occurrence is above a threshold, or an entity ranked before a preset position after sorting by frequency;
associating each high frequency entity with at least one corresponding candidate image to obtain a first image dataset.
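The steps of the first implementation can be sketched as follows. This is a minimal illustration only: the function names, the representation of each candidate text as a list of entity strings, and the `image_lookup` mapping from entity to available images are assumptions made for the example and are not defined in the application.

```python
from collections import Counter


def high_frequency_entities(candidate_texts, threshold=None, top_k=None):
    """Select high-frequency entities (hypothetical helper).

    candidate_texts: list of lists of entity strings, one list per candidate text.
    threshold: keep entities whose count is strictly above this value, or
    top_k: keep the first top_k entities after sorting by frequency.
    """
    counts = Counter(entity for entities in candidate_texts for entity in entities)
    if threshold is not None:
        return [entity for entity, n in counts.most_common() if n > threshold]
    return [entity for entity, _ in counts.most_common(top_k)]


def build_first_image_dataset(candidate_texts, image_lookup, threshold=None, top_k=None):
    """Associate each high-frequency entity with at least one candidate image.

    image_lookup is an assumed mapping: entity -> list of image references.
    Entities without any available image are skipped.
    """
    dataset = {}
    for entity in high_frequency_entities(candidate_texts, threshold, top_k):
        images = image_lookup.get(entity, [])
        if images:
            dataset[entity] = list(images)
    return dataset
```

Either selection rule from the text can be used: pass `threshold` for the frequency-above-threshold rule, or `top_k` for the preset-position rule.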
In a second implementation, the image dataset comprises a second image dataset, and before searching the image dataset according to the second candidate entity and determining a candidate image associated with the second candidate entity, the method further comprises:
acquiring a candidate text set, wherein each candidate text in the candidate text set comprises a drawing description and a drawing, the drawing description comprises a candidate entity and a candidate entity identifier, and the drawing comprises a candidate image and a candidate image identifier;
establishing an association between the candidate entity and the candidate image according to the identifiers to obtain a second image dataset.
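The identifier-based association of the second implementation can be sketched as follows. The dictionary layout of a candidate text (`"drawing_description"` as pairs of entity and identifier, `"drawing"` as pairs of identifier and image) is an assumption made for illustration; the application does not specify a data format.

```python
def build_second_image_dataset(candidate_texts):
    """Link candidate entities to candidate images that share an identifier.

    Each candidate text is assumed (hypothetically) to be a dict with:
      "drawing_description": list of (entity, identifier) pairs
      "drawing": list of (identifier, image) pairs
    """
    dataset = {}
    for text in candidate_texts:
        # Index the images of this text by their identifier.
        images_by_id = {}
        for identifier, image in text["drawing"]:
            images_by_id.setdefault(identifier, []).append(image)
        # Attach to each entity the images carrying the same identifier.
        for entity, identifier in text["drawing_description"]:
            for image in images_by_id.get(identifier, []):
                dataset.setdefault(entity, []).append(image)
    return dataset
```

Entities whose identifier has no matching image in the drawing are simply left out of the dataset.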
In a third implementation, the image dataset comprises a third image dataset, and before searching the image dataset according to the second candidate entity and determining a candidate image associated with the second candidate entity, the method further comprises:
acquiring a candidate text set, wherein each candidate text in the candidate text set comprises a title and an abstract drawing;
extracting the abstract drawing in each candidate text;
identifying a candidate entity in the title;
establishing an association between the candidate entity and the abstract drawing to obtain a third image dataset.
In this embodiment, the image data set includes a first image data set, a second image data set, and/or a third image data set, and the specific method for creating the first image data set, the second image data set, and the third image data set can be understood by referring to the specific method for creating image data in embodiment 4.
Optionally, how the image dataset is searched is explained below:
the image dataset comprises a first image dataset, the first image dataset comprises candidate images of high-frequency entities, and the high-frequency entities are candidate entities whose frequency of use is higher than a threshold;
the first image dataset is searched first according to the second candidate entity;
if no candidate image associated with the second candidate entity is found in the first image dataset, the image datasets other than the first image dataset (e.g., the second image dataset and/or the third image dataset) are searched according to the second candidate entity.
First, the target entity is matched against the candidate entities associated with the candidate images in the first image dataset. Since the candidate entities contained in the first image dataset are entities that occur with higher frequency, matching the target entity against the high-frequency entities first increases the matching rate.
If the target entity matches a candidate entity in the first image dataset, the candidate image associated with that candidate entity is sent directly to the terminal so that the terminal displays the candidate image. If the target entity does not match any candidate entity in the first image dataset, the target entity is matched against the candidate entities associated with the candidate images in the other image datasets, namely the second image dataset and/or the third image dataset. In the embodiment of the application, matching the target entity against the first image dataset first improves the matching rate.
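The search order described above (first image dataset first, other image datasets only on a miss) can be sketched as follows. The function name and the representation of each dataset as a mapping from entity to a list of images are assumptions made for this example, and exact string equality stands in for whatever matching the embodiment actually uses.

```python
def find_candidate_images(entity, first_dataset, other_datasets):
    """Look up images for an entity, trying the high-frequency dataset first.

    first_dataset: dict entity -> list of images (high-frequency entities).
    other_datasets: iterable of dicts of the same shape, e.g. the second
    and/or third image datasets, consulted only if the first lookup misses.
    Returns an empty list when no dataset contains the entity.
    """
    images = first_dataset.get(entity)
    if images:
        return images
    for dataset in other_datasets:
        images = dataset.get(entity)
        if images:
            return images
    return []
```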
Embodiment 6: Referring to fig. 13, an embodiment of an apparatus for determining text novelty is provided in the present application. The apparatus is configured to perform the method steps performed by the electronic device in embodiment 3, and the apparatus 1300 includes:
a text determination module 1301, configured to determine a target text;
an entity extracting module 1302, configured to extract multiple target entities in the target text determined by the text determining module 1301 to obtain a target entity set;
an entity obtaining module 1303, configured to obtain a candidate entity set of each candidate text in the candidate text set;
an entity intersection determining module 1304, configured to determine a first entity intersection between the target entity set extracted by the entity extracting module 1302 and the candidate entity set acquired by the entity obtaining module 1303, where the first entity intersection consists of the entities that match between the target entity set and the candidate entity set;
a novelty determining module 1305, configured to determine the novelty of the target text and the candidate text according to the difference parameter between the first entity intersection determined by the entity intersection determining module 1304 and the target entity set extracted by the entity extracting module 1302.
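The cooperation of modules 1304 and 1305 can be illustrated with a short sketch. This passage does not define the difference parameter, so the example assumes, purely for illustration, that it is the fraction of target entities not found in the intersection; exact string equality is likewise an assumed matching rule, and the function names are hypothetical.

```python
def entity_intersection(target_entities, candidate_entities):
    """Entities that match between the target set and the candidate set."""
    return set(target_entities) & set(candidate_entities)


def novelty_from_difference(intersection, target_set):
    """Assumed difference parameter: share of target items outside the intersection.

    Returns 0.0 for an empty target set, since nothing in it can be novel.
    """
    target_set = set(target_set)
    if not target_set:
        return 0.0
    matched = set(intersection) & target_set
    return 1.0 - len(matched) / len(target_set)
```

Under this assumption a value of 1.0 means that none of the target entities appear in the candidate text, and 0.0 means that all of them do.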
Referring to fig. 14, on the basis of the embodiment corresponding to fig. 13, the embodiment of the present application provides another embodiment of an apparatus 1400 for determining text novelty, the apparatus further includes a relationship extraction module 1306, a relationship obtaining module 1307, and a relationship intersection determining module 1308;
a relationship extraction module 1306, configured to extract a plurality of binary relationships in the target text to obtain a target binary relationship set, where the binary relationships include two entities and a relationship therebetween;
a relation obtaining module 1307, configured to obtain a candidate binary relation set including a plurality of binary relations in the candidate text;
a relationship intersection determining module 1308, configured to determine a first binary relationship intersection between the target binary relationship set extracted by the relationship extraction module 1306 and the candidate binary relationship set obtained by the relationship obtaining module 1307, where the first binary relationship intersection includes the binary relationships that match between the target binary relationship set and the candidate binary relationship set;
the novelty determination module 1305 is further specifically configured to:
determining a first entity novelty according to the difference parameters of the first entity intersection and the target entity set;
determining a first binary relationship novelty according to the difference parameters of the first binary relationship intersection and the target binary relationship set;
determining the novelty of the target text and the candidate text according to the first entity novelty and the first binary relationship novelty.
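A minimal sketch of combining the first entity novelty with the first binary relationship novelty follows. Several points are assumptions rather than statements of the application: a binary relationship is represented as a (head, relation, tail) tuple, the difference parameter is taken to be the unmatched fraction of the target set, and the two novelty values are combined by a weighted sum whose weights are hypothetical.

```python
def set_novelty(target_items, candidate_items):
    """Assumed difference parameter: fraction of target items with no match."""
    target_items = set(target_items)
    if not target_items:
        return 0.0
    return 1.0 - len(target_items & set(candidate_items)) / len(target_items)


def text_novelty(target_entities, candidate_entities,
                 target_relations, candidate_relations,
                 entity_weight=0.5, relation_weight=0.5):
    """Weighted sum of entity novelty and binary relationship novelty.

    Relations are (head, relation, tail) tuples; the weights are hypothetical.
    """
    entity_novelty = set_novelty(target_entities, candidate_entities)
    relation_novelty = set_novelty(target_relations, candidate_relations)
    return entity_weight * entity_novelty + relation_weight * relation_novelty
```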
Optionally, the relationship extracting module 1306 is further configured to extract a target ternary relationship set in the target text, where the target ternary relationship set includes a plurality of ternary relationships, the ternary relationship includes two binary relationships, and the two binary relationships have the same entity;
the relationship obtaining module 1307 is further configured to obtain a candidate ternary relationship set including a plurality of ternary relationships in the candidate text;
a relationship intersection determining module 1308, further configured to determine a first ternary relationship intersection between the target ternary relationship set extracted by the relationship extracting module 1306 and the candidate ternary relationship set obtained by the relationship obtaining module 1307, where the first ternary relationship intersection includes a ternary relationship that matches the target ternary relationship set and the candidate ternary relationship set;
the novelty determination module 1305 is further specifically configured to:
determining a first ternary relationship novelty according to the difference parameters of the first ternary relationship intersection and the target ternary relationship set;
determining the novelty of the target text and the candidate text according to the first entity novelty, the first binary relationship novelty, and the first ternary relationship novelty.
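Since a ternary relationship is described as two binary relationships having the same entity, a target ternary relationship set can be derived from a binary relationship set roughly as follows. The tuple representation (head, relation, tail) and the function name are assumptions for illustration; the sketch treats any shared entity, in either position, as sufficient.

```python
from itertools import combinations


def ternary_relationships(binary_relations):
    """Pair up binary relationships that share at least one entity.

    binary_relations: iterable of (head, relation, tail) tuples (assumed format).
    Returns a set of (first, second) pairs; each pair is ordered by sorting so
    that the same two relationships always yield the same ternary relationship.
    """
    triples = set()
    for first, second in combinations(sorted(set(binary_relations)), 2):
        shared = {first[0], first[2]} & {second[0], second[2]}
        if shared:
            triples.add((first, second))
    return triples
```

The resulting sets for the target text and the candidate text can then be intersected in the same way as the entity sets and binary relationship sets.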
Optionally, the entity extraction module 1302 is further configured to input the target text into an entity extraction model, and identify a plurality of target entities in the target text through the entity extraction model.
Optionally, the relationship extraction module 1306 is further configured to input the target text in which the target entities have been identified into a relationship extraction model, and extract the binary relationships between the target entities through the relationship extraction model.
Optionally, a generating module 1309 is further included;
a generating module 1309, configured to perform structured representation on the target text according to the target entity extracted by the entity extracting module 1302 and the relationship between the target entities extracted by the relationship extracting module 1306, so as to generate a target structure.
Optionally, the target structure includes nodes and edges, where the nodes are used for representing the target entities, and the edges are used for representing the relationships between the target entities.
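A target structure of this kind can be sketched as a small graph. The dictionary layout with `"nodes"` and `"edges"` keys, and the (head, relation, tail) tuple format, are assumptions chosen for the example rather than a format given in the application.

```python
def build_target_structure(entities, binary_relations):
    """Build a node/edge representation of a text (hypothetical layout).

    Nodes are entity names; each edge maps a (head, tail) pair to the name of
    the relationship between the two entities.
    """
    nodes = set(entities)
    edges = {}
    for head, relation, tail in binary_relations:
        # Make sure both endpoints of every edge are present as nodes.
        nodes.update((head, tail))
        edges[(head, tail)] = relation
    return {"nodes": nodes, "edges": edges}
```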
Optionally, each candidate text in the candidate text set is a structured candidate structure, and the target text is a target structure; the entity extraction module 1302 is further configured to extract a candidate entity set of a candidate atlas, where the candidate atlas includes at least one candidate structure;
the entity intersection determining module 1304 is further configured to determine a second entity intersection between the target entity set extracted by the entity extracting module 1302 and the candidate entity set of the candidate atlas acquired by the entity acquiring module 1303;
the novelty determination module 1305 is further configured to determine the novelty of the target text and the candidate atlas according to the difference parameters of the second entity intersection and the target entity set.
Optionally, when the candidate atlas includes at least two candidate structures, the at least two candidate structures are a first candidate structure and a second candidate structure; the apparatus also includes an association entity determination module 1310 and an association module 1311;
an associated entity determining module 1310 for determining an associated entity of the first candidate structure and the second candidate structure;
an associating module 1311, configured to associate the first candidate structure and the second candidate structure through the associated entity determined by the associated entity determining module 1310, so as to obtain the candidate atlas.
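The roles of modules 1310 and 1311 can be sketched as follows, assuming for illustration that each candidate structure is a dictionary with a `"nodes"` set and an `"edges"` mapping, and that an associated entity is simply an entity name present in both structures. Both assumptions, and the function names, are hypothetical.

```python
def associated_entities(first_structure, second_structure):
    """Entities that appear in both candidate structures (assumed criterion)."""
    return set(first_structure["nodes"]) & set(second_structure["nodes"])


def build_candidate_atlas(first_structure, second_structure):
    """Join two candidate structures through their shared entities.

    Because shared entities are represented by the same node name, taking the
    union of nodes and edges connects the two structures at those nodes.
    Returns the atlas together with the set of associated entities.
    """
    shared = associated_entities(first_structure, second_structure)
    atlas = {
        "nodes": set(first_structure["nodes"]) | set(second_structure["nodes"]),
        "edges": {**first_structure["edges"], **second_structure["edges"]},
    }
    return atlas, shared
```

If the two structures share no entity, the union still forms an atlas, but its two parts remain disconnected.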
Optionally, the relationship extracting module 1306 is further configured to extract a plurality of binary relationships in the target structure, so as to obtain a target binary relationship set;
the novelty determination module 1305 is further specifically configured to:
positioning the two target entities contained in each target binary relationship in the target binary relationship set to two corresponding entity positions in the candidate atlas;
calculating the distance between the two entity positions corresponding to each target binary relation;
determining a second binary relation novelty of each target binary relation relative to the candidate atlas according to the distance;
determining a second entity novelty according to the difference parameters of the second entity intersection and the target entity set;
determining the novelty of the target structure with the candidate atlas based on the second entity novelty and the second binary relationship novelty.
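The distance-based steps above can be sketched with a breadth-first search over the candidate atlas. This passage does not state how the distance is measured or how it maps to a novelty value, so the sketch assumes the distance is the number of edges on the shortest path and uses a hypothetical mapping: directly connected entities give novelty 0, unreachable or missing entities give novelty 1, and otherwise novelty grows with distance as 1 - 1/d.

```python
from collections import deque


def shortest_distance(adjacency, start, goal):
    """Number of edges on the shortest path, or None if there is no path.

    adjacency: dict entity -> set of neighbouring entities (assumed undirected).
    """
    if start not in adjacency or goal not in adjacency:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def distance_novelty(distance):
    """Hypothetical mapping from atlas distance to a novelty value in [0, 1]."""
    if distance is None:
        return 1.0  # entities not connected (or not present) in the atlas
    if distance <= 1:
        return 0.0  # entities already directly related in the atlas
    return 1.0 - 1.0 / distance
```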
Optionally, the relationship obtaining module 1307 is further configured to obtain a candidate binary relationship set including a plurality of binary relationships in the candidate atlas;
a relationship intersection determining module 1308, configured to determine a second binary relationship intersection between the target binary relationship set and the candidate binary relationship set;
the novelty determination module 1305 is further specifically configured to:
determining a first binary relation novelty according to the difference parameters of the second binary relation intersection and the target binary relation set;
determining a binary relation novelty according to the first binary relation novelty, the second binary relation novelty and respective corresponding weights;
determining the novelty of the target structure and the candidate atlas according to the second entity novelty and the binary relationship novelty.
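The weighting step can be sketched as a normalized weighted average. The normalization by the sum of the weights and the default weight values are assumptions for this example; the application only states that each novelty has a corresponding weight.

```python
def combined_relation_novelty(first_novelty, second_novelty,
                              first_weight=0.5, second_weight=0.5):
    """Weighted average of the intersection-based and distance-based novelty.

    first_novelty: novelty derived from the relationship intersection.
    second_novelty: novelty derived from distances in the candidate atlas.
    The weights are hypothetical and must sum to a positive value.
    """
    total = first_weight + second_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (first_weight * first_novelty + second_weight * second_novelty) / total
```

The same combination could be applied to the first and second ternary relationship novelty values.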
Optionally, the relationship extraction module 1306 is further configured to extract a plurality of ternary relationships in the target structure, so as to obtain a target ternary relationship set;
the novelty determination module 1305 is further specifically configured to:
positioning the target entities contained in each target ternary relationship in the target ternary relationship set to three corresponding entity positions in the candidate atlas;
calculating the distance between any two of the three entity positions;
determining a second ternary relationship novelty of each target ternary relationship relative to the candidate atlas according to the distance;
determining the novelty of the target structure and the candidate atlas according to the second entity novelty, the second binary relationship novelty and the second ternary relationship novelty.
Optionally, the relationship obtaining module 1307 is further configured to obtain a candidate ternary relationship set including a plurality of ternary relationships in the candidate atlas;
a relationship intersection determining module 1308, further configured to determine a second ternary relationship intersection between the target ternary relationship set and the candidate ternary relationship set;
the novelty determination module 1305 is further specifically configured to:
determining a first ternary relationship novelty according to the difference parameters of the second ternary relationship intersection and the target ternary relationship set;
determining a ternary relationship novelty according to the first ternary relationship novelty, the second ternary relationship novelty and respective corresponding weights;
determining the novelty of the target structure and the candidate atlas according to the second entity novelty, binary relationship novelty, and ternary relationship novelty.
Referring to fig. 15, an embodiment of the present application further provides an electronic device 70, where the electronic device 70 includes: memory 710, transceiver 720, and processor 730. Those skilled in the art will appreciate that the electronic device may also include other components, such as various components commonly found in a computer. The memory 710, the transceiver 720 and the processor 730 are in communication with each other, the memory 710 is used for storing computer instructions, the transceiver 720 is used for communicating with other devices, and the computer instructions, when executed by the processor 730, cause the electronic device 70 to perform the method described in the above-mentioned method embodiments.
Embodiments of the present application further provide a computer storage medium for storing computer software instructions, which include instructions for executing the method performed by the electronic device in the method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (14)

1. A method for determining text novelty, comprising:
determining a target text;
extracting a plurality of target entities in the target text to obtain a target entity set;
acquiring a respective candidate entity set of each candidate text in a candidate text set, wherein the candidate text set comprises a plurality of candidate texts;
determining a first entity intersection of the target entity set and the candidate entity set, wherein the first entity intersection is a matching entity in the target entity set and the candidate entity set;
determining a novelty degree of the target text and the candidate text according to the difference parameters of the first entity intersection and the target entity set;
the method further comprises the following steps:
extracting a plurality of binary relations in the target text to obtain a target binary relation set, wherein the binary relations comprise two entities and relations between the two entities;
acquiring a candidate binary relation set comprising a plurality of binary relations in the candidate text;
determining a first binary relation intersection of the target binary relation set and the candidate binary relation set, wherein the first binary relation intersection comprises matched binary relations in the target binary relation set and the candidate binary relation set;
the determining the novelty degree of the target text and the candidate text according to the difference parameters of the first entity intersection and the target entity set comprises:
determining a first entity novelty according to the difference parameters of the first entity intersection and the target entity set;
determining a first binary relation novelty according to the difference parameters of the first binary relation intersection and the target binary relation set;
determining a novelty of the target text and the candidate text according to the first entity novelty and the first binary relation novelty.
2. The method of claim 1, further comprising:
extracting a target ternary relationship set in the target text, wherein the target ternary relationship set comprises a plurality of ternary relationships, the ternary relationship comprises two binary relationships, and the two binary relationships have the same entity;
acquiring a candidate ternary relationship set comprising a plurality of ternary relationships in the candidate text;
determining a first ternary relationship intersection of the target ternary relationship set and the candidate ternary relationship set, wherein the first ternary relationship intersection comprises matched ternary relationships in the target ternary relationship set and the candidate ternary relationship set;
the determining the novelty of the target text and the candidate text according to the first entity novelty and the first binary relation novelty comprises:
determining a first ternary relationship novelty according to the difference parameters of the first ternary relationship intersection and the target ternary relationship set;
determining a novelty of the target text and the candidate text based on the first entity novelty, the first binary relation novelty, and the first ternary relationship novelty.
3. The method of claim 1, wherein the extracting the plurality of target entities from the target text comprises:
inputting the target text into an entity extraction model, and identifying a plurality of target entities in the target text through the entity extraction model.
4. The method of claim 1, wherein the extracting the plurality of binary relationships in the target text comprises:
inputting the target text in which the target entities have been identified into a relationship extraction model, and extracting the binary relationships between the target entities through the relationship extraction model.
5. The method of claim 4, comprising:
performing structured representation on the target text according to the relationships between the target entities, to generate a target structure.
6. The method of claim 5, wherein the target structure comprises nodes and edges, wherein the nodes are used for representing the target entities, and wherein the edges are used for representing relationships between the target entities.
7. A method for determining text novelty, comprising:
determining a target text;
extracting a plurality of target entities in the target text to obtain a target entity set;
acquiring a respective candidate entity set of each candidate text in a candidate text set, wherein the candidate text set comprises a plurality of candidate texts;
determining a first entity intersection of the target entity set and the candidate entity set, wherein the first entity intersection is a matching entity in the target entity set and the candidate entity set;
determining a novelty degree of the target text and the candidate text according to the difference parameters of the first entity intersection and the target entity set;
each candidate text in the candidate text set is a structured candidate structure, and the target text is a target structure, the method further comprising:
extracting a candidate entity set of a candidate atlas, wherein the candidate atlas comprises at least one candidate structure;
the determining an entity intersection of the target entity set and the candidate entity set comprises:
determining a second entity intersection of the target set of entities and a set of candidate entities of the candidate atlas;
the method further comprises the following steps:
determining a novelty degree of the target text and the candidate atlas according to the difference parameters of the second entity intersection and the target entity set;
the method further comprises the following steps:
extracting a plurality of binary relations in the target structure to obtain a target binary relation set;
positioning the two target entities contained in each target binary relation in the target binary relation set to two corresponding entity positions in the candidate atlas;
calculating the distance between the two entity positions corresponding to each target binary relation;
determining a second binary relation novelty of each target binary relation relative to the candidate atlas according to the distance;
the determining the novelty degree of the target text and the candidate atlas according to the difference parameters of the second entity intersection and the target entity set comprises:
determining a second entity novelty according to the difference parameters of the second entity intersection and the target entity set;
determining the novelty of the target structure with the candidate atlas based on the second entity novelty and the second binary relationship novelty.
8. The method of claim 7, wherein when the candidate atlas includes at least two candidate structures, the at least two candidate structures are a first candidate structure and a second candidate structure;
determining an associated entity of the first candidate structure and the second candidate structure;
associating the first candidate structure with the second candidate structure through the associated entity to obtain the candidate atlas.
9. The method of claim 7, further comprising:
acquiring a candidate binary relation set comprising a plurality of binary relations in the candidate atlas;
determining a second binary relation intersection of the target binary relation set and the candidate binary relation set;
determining a first binary relation novelty according to the difference parameters of the second binary relation intersection and the target binary relation set;
determining a binary relation novelty according to the first binary relation novelty, the second binary relation novelty and respective corresponding weights;
the determining the novelty of the target structure with the candidate atlas based on the second entity novelty and the second binary relationship novelty comprises:
determining the novelty of the target structure and the candidate atlas according to the second entity novelty and the binary relationship novelty.
10. The method of claim 9, further comprising:
extracting a plurality of ternary relations in the target structure to obtain a target ternary relation set;
positioning the target entities contained in each target ternary relationship in the target ternary relationship set to three corresponding entity positions in the candidate atlas;
calculating the distance between any two of the three entity positions;
determining a second ternary relationship novelty of each target ternary relationship relative to the candidate atlas according to the distance;
the determining the novelty of the target structure with the candidate atlas based on the second entity novelty and the second binary relationship novelty comprises:
determining the novelty of the target structure and the candidate atlas according to the second entity novelty, the second binary relationship novelty and the second ternary relationship novelty.
11. The method of claim 10, further comprising:
acquiring a candidate ternary relationship set comprising a plurality of ternary relationships in the candidate atlas;
determining a second ternary relationship intersection of the target ternary relationship set and the candidate ternary relationship set;
determining a first ternary relationship novelty according to the difference parameters of the second ternary relationship intersection and the target ternary relationship set;
determining a ternary relationship novelty according to the first ternary relationship novelty, the second ternary relationship novelty and respective corresponding weights;
determining the novelty of the target structure with the candidate atlas based on the second entity novelty and the binary relationship novelty, comprising:
determining the novelty of the target structure and the candidate atlas according to the second entity novelty, binary relationship novelty, and ternary relationship novelty.
12. An apparatus for determining text novelty, comprising:
the first determining module is used for determining a target text;
the extracting module is used for extracting a plurality of target entities in the target text determined by the first determining module to obtain a target entity set;
the acquisition module is used for acquiring a respective candidate entity set of each candidate text in a candidate text set, wherein the candidate text set comprises a plurality of candidate texts;
a second determining module, configured to determine a first entity intersection between the target entity set identified by the extracting module and the candidate entity set acquired by the acquiring module, where the first entity intersection is a matched entity in the target entity set and the candidate entity set;
a novelty determination module, configured to determine a degree of novelty between the target text and the candidate text according to the difference parameter between the first entity intersection determined by the second determination module and the target entity set extracted by the extraction module;
the relation extraction module is used for extracting a plurality of binary relations in the target text to obtain a target binary relation set, wherein the binary relations comprise two entities and relations between the two entities;
the relation acquisition module is used for acquiring a candidate binary relation set comprising a plurality of binary relations in the candidate text;
a relationship intersection determining module, configured to determine a first binary relationship intersection between the target binary relationship set extracted by the relationship extracting module and the candidate binary relationship set acquired by the relationship acquiring module, where the first binary relationship intersection includes a binary relationship that matches in the target binary relationship set and the candidate binary relationship set;
the novelty determination module is further specifically configured to:
determining a first entity novelty according to the difference parameters of the first entity intersection and the target entity set;
determining a first binary relation novelty according to the difference parameters of the first binary relation intersection and the target binary relation set;
determining a novelty of the target text and the candidate text according to the first entity novelty and the first binary relation novelty.
13. An electronic device, comprising:
a memory and a processor;
the memory and the processor are communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the method of any one of claims 1-11.
14. A computer storage medium having computer instructions stored thereon for causing a computer to perform the method of any one of claims 1-11.
CN201811348626.6A 2018-11-13 2018-11-13 Method and related device for determining text novelty Active CN109582933B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811348626.6A CN109582933B (en) 2018-11-13 2018-11-13 Method and related device for determining text novelty

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811348626.6A CN109582933B (en) 2018-11-13 2018-11-13 Method and related device for determining text novelty

Publications (2)

Publication Number Publication Date
CN109582933A CN109582933A (en) 2019-04-05
CN109582933B true CN109582933B (en) 2021-09-03

Family

ID=65922365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811348626.6A Active CN109582933B (en) 2018-11-13 2018-11-13 Method and related device for determining text novelty

Country Status (1)

Country Link
CN (1) CN109582933B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144709B (en) * 2019-12-06 2023-04-18 北京邮电大学 Method and device for determining novelty of machine-generated text
CN111708873B (en) * 2020-06-15 2023-11-24 腾讯科技(深圳)有限公司 Intelligent question-answering method, intelligent question-answering device, computer equipment and storage medium
CN111930898B (en) * 2020-09-18 2021-01-05 北京合享智慧科技有限公司 Text evaluation method and device, electronic equipment and storage medium
CN112052835B (en) 2020-09-29 2022-10-11 北京百度网讯科技有限公司 Information processing method, information processing apparatus, electronic device, and storage medium
CN113743087B (en) * 2021-09-07 2024-04-26 珍岛信息技术(上海)股份有限公司 Text generation method and system based on neural network vocabulary extension paragraph
CN115879441B (en) * 2022-11-10 2024-04-12 中国科学技术信息研究所 Text novelty detection method and device, electronic equipment and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017123168A (en) * 2016-01-05 2017-07-13 富士通株式会社 Method for making entity mention in short text associated with entity in semantic knowledge base, and device
WO2018153295A1 (en) * 2017-02-27 2018-08-30 腾讯科技(深圳)有限公司 Text entity extraction method, device, apparatus, and storage media

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110202545A1 (en) * 2008-01-07 2011-08-18 Takao Kawai Information extraction device and information extraction system
US10585975B2 (en) * 2012-03-02 2020-03-10 Github Software Uk Ltd. Finding duplicate passages of text in a collection of text
CN104636325B (en) * 2015-02-06 2015-09-30 中南大学 A kind of method based on Maximum-likelihood estimation determination Documents Similarity
CN105653706B (en) * 2015-12-31 2018-04-06 北京理工大学 A kind of multilayer quotation based on literature content knowledge mapping recommends method
CN107015961B (en) * 2016-01-27 2021-06-25 中文在线数字出版集团股份有限公司 Text similarity comparison method
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis
CN107665252B (en) * 2017-09-27 2020-08-25 深圳证券信息有限公司 Method and device for creating knowledge graph
CN108763566A (en) * 2018-06-05 2018-11-06 北京玄科技有限公司 Text similarity computing method and device, intelligent robot
CN108763569A (en) * 2018-06-05 2018-11-06 北京玄科技有限公司 Text similarity computing method and device, intelligent robot

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A new text representation model enriched with semantic relations; Aliya Nugumanova et al.; 《2015 15th International Conference on Control, Automation and Systems (ICCAS)》; 20151228; pp. 619-622 *
Research on the application of linked data to similar-literature discovery in academic resource networks; 赵夷平 (Zhao Yiping) et al.; 《现代图书情报技术》; 20160331; Vol. 2016 (No. 3); pp. 41-49 *

Also Published As

Publication number Publication date
CN109582933A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
CN109582800B (en) Method for training structured model and text structuring and related device
CN109597878B (en) Method for determining text similarity and related device
CN109582933B (en) Method and related device for determining text novelty
CN110188168B (en) Semantic relation recognition method and device
US20220261427A1 (en) Methods and system for semantic search in large databases
Caicedo et al. Multimodal representation, indexing, automated annotation and retrieval of image collections via non-negative matrix factorization
CN107818815B (en) Electronic medical record retrieval method and system
US8001139B2 (en) Using a bipartite graph to model and derive image and text associations
WO2018005203A1 (en) Leveraging information available in a corpus for data parsing and predicting
CN109635277B (en) Method and related device for acquiring entity information
CN110110800B (en) Automatic image annotation method, device, equipment and computer readable storage medium
Kelm et al. Multi-modal, multi-resource methods for placing flickr videos on the map
CN103927339B (en) Knowledge Reorganizing system and method for knowledge realignment
CN103440262A (en) Image searching system and image searching method basing on relevance feedback and Bag-of-Features
CN112749272A (en) Intelligent new energy planning text recommendation method for unstructured data
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
CN111104437A (en) Test data unified retrieval method and system based on object model
KR20120047622A (en) System and method for managing digital contents
CN114764566A (en) Knowledge element extraction method for aviation field
CN109635139B (en) Method and related device for acquiring image information
CN117057349A (en) News text keyword extraction method, device, computer equipment and storage medium
CN114281942A (en) Question and answer processing method, related equipment and readable storage medium
CN113449094A (en) Corpus obtaining method and device, electronic equipment and storage medium
Thornton et al. Feedback-based social media filtering tool for improved situational awareness
Banerjee et al. Word image based latent semantic indexing for conceptual querying in document image databases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant