CN115270751A - Method and device for determining information similarity - Google Patents

Publication number
CN115270751A
Authority
CN
China
Prior art keywords
entities
information
processed
entity
knowledge base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210831617.2A
Other languages
Chinese (zh)
Inventor
钱佳佳
刘伟棠
陈立力
周明伟
范鹏召
郑燕玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202210831617.2A
Publication of CN115270751A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The application discloses a method and a device for determining information similarity that take into account the influence of certain elements of the information when determining similarity, thereby improving the accuracy of the determined similarity. The method comprises: acquiring a plurality of entities contained in information to be processed and the relationships among the entities; updating a knowledge base based on a result of matching the plurality of entities and the relationships among them against the entities and the relationships between entities contained in the knowledge base, the knowledge base being constructed based on entities contained in a plurality of pieces of past processed information; and determining the similarity between the information to be processed and each of the plurality of pieces of past processed information according to the distances between the characterization vector corresponding to the information to be processed in the updated knowledge base and the characterization vectors corresponding to the plurality of pieces of past processed information.

Description

Method and device for determining information similarity
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for determining information similarity.
Background
In many industries, to improve the efficiency of information processing, it is increasingly common to fuse and apply information and to retrieve similar past information so that its handling can serve as a reference. For example, when an operator in the telecommunications industry receives a complaint, the operator can retrieve similar past complaint cases so that the current complaint can be resolved quickly.
Therefore, how to acquire information similar to the information to be processed becomes important. A common current approach is to convert the information texts into vectors, determine the similarity between the texts based on the distance between the vectors, and take the text similarity as the similarity of the information. However, in many cases text similarity does not represent information similarity. For example, the text contents of two cases may be relatively similar while the executor and the executed person differ, or the location and time of occurrence differ, so that the two cases are in fact quite different. Considering text similarity alone is therefore not sufficient to determine the similarity of information.
Disclosure of Invention
The application provides a method and a device for determining information similarity, which are used for improving the accuracy of the determined information similarity.
In a first aspect, the present application provides a method for determining information similarity, including:
acquiring a plurality of entities contained in information to be processed and relations among the entities;
updating a knowledge base based on a result of matching the plurality of entities and the relationships between them against the entities and the relationships between entities contained in the knowledge base; the knowledge base is constructed based on entities contained in a plurality of pieces of past processed information;
and determining the similarity between the information to be processed and each of the plurality of pieces of past processed information according to the distances between the characterization vector corresponding to the information to be processed in the updated knowledge base and the characterization vectors corresponding to the plurality of pieces of past processed information.
Based on this scheme, the entities in the information to be processed and the relationships between them are compared with the entities and relationships in the past processed information, and the commonality between the entities contained in the information to be processed and those contained in the past processed information is represented in the form of a knowledge graph. The characterization vectors are then computed and the similarity is calculated. Compared with prior-art schemes that determine similarity directly from the distance between the text vectors of the information texts, the method provided by the application fully considers the influence of the entities in the information on its similarity, so that the determined similarity is more accurate.
In some embodiments, updating the knowledge base based on the result of matching the plurality of entities and the relationships between them against the entities and relationships contained in the knowledge base comprises:
when a second entity successfully matched with a first entity exists in the knowledge base and the relation related to the first entity is not equal to the relation related to the second entity, adding the relation related to the first entity into the second entity;
when the entity which is successfully matched with the first entity does not exist in the knowledge base, adding the first entity and the relation related to the first entity into the knowledge base;
wherein the first entity is any one of the plurality of entities.
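The two update rules above can be sketched as follows. This is a minimal illustration, not the application's implementation: the knowledge base is assumed to be a plain dictionary mapping an entity name to a set of (relation, target) pairs, entity matching is assumed to be exact name equality, and the names `update_knowledge_base`, "Li Si" and the `address` relation are hypothetical.

```python
def update_knowledge_base(kb, extracted):
    """Merge extracted entities and relations into the knowledge base.

    kb and extracted both map an entity name to a set of
    (relation, target) pairs. Exact name equality stands in for
    entity matching (an assumption made for this sketch).
    """
    for first_entity, relations in extracted.items():
        if first_entity in kb:
            # A matching "second entity" exists: add the relations only
            # when they differ from those already stored.
            if relations != kb[first_entity]:
                kb[first_entity] |= relations
        else:
            # No matching entity: add the entity together with its relations.
            kb[first_entity] = set(relations)
    return kb


kb = {"Zhang San": {("mobile_phone_number", "123XXXXXXXX")}}
extracted = {
    "Zhang San": {("address", "Hangzhou")},            # hypothetical relation
    "Li Si": {("mobile_phone_number", "456XXXXXXXX")},  # hypothetical entity
}
update_knowledge_base(kb, extracted)
```

After the call, the existing entity carries both relations and the unmatched entity has been added as a new node.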
In some embodiments, the plurality of entities and the relationships between them are matched with the entities and relationships contained in the knowledge base as follows:
matching each of the plurality of entities against the entities contained in the knowledge base;
and, if the knowledge base contains an entity that successfully matches a given one of the plurality of entities (the first entity), matching the relation related to the successfully matched entity against the relation related to the first entity.
In some embodiments, the obtaining of the plurality of entities included in the information to be processed and the relationship between the plurality of entities includes:
converting the information to be processed into a text vector;
determining the positions of the plurality of entities and the relations among the plurality of entities contained in the text vector by adopting a pre-trained neural network model;
determining the positions of the entities and the relations among the entities in the information to be processed according to the positions of the entities and the relations among the entities in the text vector;
and extracting the entities and the relations among the entities from the information to be processed according to the positions of the entities and the relations among the entities in the information to be processed.
In some embodiments, the method further comprises:
and respectively converting the information to be processed in the updated knowledge base and the plurality of past processed information into characterization vectors by adopting a graph embedding algorithm.
In some embodiments, the method further comprises:
and outputting a set number of pieces of past processed information in descending order of their similarity to the information to be processed.
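The output step can be sketched as a simple sort and truncation. The function name `top_similar`, the case identifiers and the similarity values below are illustrative assumptions, not data from the application.

```python
def top_similar(similarities, set_number):
    """Return the set_number most similar past items, highest first.

    similarities maps an identifier of a piece of past processed
    information to its similarity with the information to be processed.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:set_number]


# Illustrative similarity values only.
scores = {"case_001": 0.42, "case_002": 0.91, "case_003": 0.77}
top_two = top_similar(scores, 2)
```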
In a second aspect, an embodiment of the present application provides an apparatus for determining information similarity, including:
an obtaining unit, configured to acquire a plurality of entities contained in information to be processed and the relationships among the entities;
a processing unit, configured to update a knowledge base based on a result of matching the plurality of entities and the relationships between them against the entities and the relationships between entities contained in the knowledge base; the knowledge base is constructed based on entities contained in a plurality of pieces of past processed information;
the processing unit is further configured to determine the similarity between the information to be processed and each of the plurality of pieces of past processed information according to the distances between the characterization vector corresponding to the information to be processed in the updated knowledge base and the characterization vectors corresponding to the plurality of pieces of past processed information.
In some embodiments, the processing unit is specifically configured to:
when a second entity successfully matched with a first entity exists in the knowledge base and the relation related to the first entity is not equal to the relation related to the second entity, adding the relation related to the first entity into the second entity;
when an entity which is successfully matched with the first entity does not exist in the knowledge base, adding the first entity and the relation related to the first entity into the knowledge base;
wherein the first entity is any one of the plurality of entities.
In some embodiments, the processing unit is further configured to match the plurality of entities and the relationships between them with the entities and relationships contained in the knowledge base, and is specifically configured to:
match each of the plurality of entities against the entities contained in the knowledge base;
and, when the knowledge base contains an entity that successfully matches a given one of the plurality of entities (the first entity), match the relation related to the successfully matched entity against the relation related to the first entity.
In some embodiments, the obtaining unit is specifically configured to:
converting the information to be processed into a text vector through the processing unit; determining the positions of the plurality of entities and the relations among the plurality of entities contained in the text vector by adopting a pre-trained neural network model; determining the positions of the entities and the relations among the entities in the information to be processed according to the positions of the entities and the relations among the entities in the text vector;
and extracting the entities and the relations among the entities from the information to be processed according to the positions of the entities and the relations among the entities in the information to be processed.
In some embodiments, the processing unit is further configured to:
and respectively converting the information to be processed in the updated knowledge base and the plurality of past processed information into characterization vectors by adopting a graph embedding algorithm.
In some embodiments, the processing unit is further configured to:
and outputting a set number of pieces of past processed information in descending order of their similarity to the information to be processed.
In a third aspect, an electronic device is provided that includes a controller and a memory. The memory stores computer-executable instructions, and the controller executes those instructions, using hardware resources in the controller, to perform the operational steps of any possible implementation of the method according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided having stored therein instructions which, when executed on a computer, cause the computer to perform the method of the above aspects.
For the beneficial effects of the second to fourth aspects, refer to the beneficial effects of the first aspect; they are not repeated here.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an information similarity determining method according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for obtaining entities and relationships between the entities included in information to be processed according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for updating a knowledge base according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for determining information similarity based on an updated knowledge base according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an information similarity determination apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
It should be noted that the terms "first," "second," and the like in this application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Deep learning is part of a broader family of machine learning methods based on artificial neural networks. Learning may be supervised, semi-supervised or unsupervised. Deep learning models, such as deep neural networks, deep belief networks, recurrent neural networks, and convolutional neural networks, have been applied in fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection, and board game programs.
Natural language processing is the discipline that studies language problems in human-to-human and human-to-computer interaction. In short, the computer accepts user input in natural language and processes it internally through human-defined algorithms, so as to simulate human understanding of natural language and return the result the user expects. Natural language processing aims to use computers instead of humans to process natural language information at scale. It mainly covers automatic word segmentation, part-of-speech tagging, syntactic analysis, text classification, information extraction, and the like, and is an interdisciplinary field spanning artificial intelligence, computer science, information engineering, statistics, linguistics, and others.
With the wide application of natural language processing and deep learning, the calculation of information similarity is increasingly used across industries. When processing information, a user can retrieve similar past processed information based on the calculated similarity, which improves both the efficiency and the accuracy of processing. In the related art, however, information similarity is generally calculated by applying a pre-trained model to convert the information to be processed and the past information into text vectors and determining similarity from the distance between those vectors. In some scenarios, the similarity of information is not simply a matter of text similarity but depends on certain elements in the information. For example, for information about cases, even if the text contents of two cases are relatively close, a difference in the executor or the executed person may make the two cases dissimilar.
To solve this problem, an embodiment of the present application provides an information similarity determining method that extracts the elements in the information to be processed and the relationships between the elements, matches them against the past processed information in the knowledge base, and updates the knowledge base based on the matching result. The information to be processed and the past information in the updated knowledge base are then converted into characterization vectors, the distances between the characterization vectors are calculated, and the similarity between the information to be processed and the past processed information is determined according to those distances. The similarity is thus not determined from text similarity alone: the information to be processed is first associated with the past processed information based on the similarity of their elements and relationships, and only then are the characterization vectors computed and the similarity calculated. Compared with prior-art methods, this method fully considers the influence of the elements in the information on its similarity, so that the determined similarity is more accurate.
The following describes the method and apparatus for determining information similarity provided by the present application. In the following embodiments, "and/or" describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate: A exists alone, A and B exist simultaneously, or B exists alone, where A and B may each be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple. The singular forms "a", "an", and "the" are intended to include the plural forms as well, such as "one or more", unless the context clearly indicates otherwise. Unless stated to the contrary, ordinal numbers such as "first" and "second" in the embodiments of the present application are used to distinguish a plurality of objects and do not limit their sequence, timing, priority, or importance. For example, a first task execution device and a second task execution device are so named only to distinguish different task execution devices, and the names do not indicate a difference in priority, importance, or the like between the two.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless otherwise specifically stated.
In order to facilitate understanding of the solutions provided in the embodiments of the present application, first, technical terms related to the present application will be briefly described:
(1) Named Entity Recognition (NER): also called proper name recognition, this refers to identifying entities with specific meaning in text, mainly names of people, places, and organizations, proper nouns, and the like. It generally comprises two parts: identifying entity boundaries, and determining the entity category (person name, place name, organization name, or other).
(2) Relation Extraction (RE): a natural language processing task that aims to extract relationships between entities. Relation extraction is a key technology in automatic knowledge graph construction: new relational facts can be extracted and accumulated through it, thereby expanding the knowledge graph. Optionally, OpenNRE may be used for relation extraction. OpenNRE is an open-source, extensible toolkit that provides a unified framework for implementing relation extraction models.
(3) Knowledge graph: a knowledge graph is essentially a semantic network whose nodes represent entities or concepts and whose edges represent the semantic relationships among them. It can describe cognitive knowledge at various levels, such as concepts, facts, and rules. With its rich semantic expressiveness and flexible structure, the knowledge graph provides an effective carrier in the computer world for expressing information and knowledge about the cognitive and physical worlds, and has become an important piece of infrastructure for artificial intelligence applications.
(4) BERT model: a language representation model whose use involves two steps, pre-training and fine-tuning. In pre-training, the model is trained on different pre-training tasks using unlabeled data, so the BERT model contains a large amount of common-sense knowledge. In fine-tuning, the model is first initialized with the pre-trained parameters, and then all parameters are fine-tuned using labeled data from the specific downstream task.
(5) Sigmoid function: often used for the output of hidden-layer neurons. It maps any real number to the interval (0, 1) and can be used for binary classification. It works well when the features are relatively complex or the differences between them are not particularly large. The function is given by formula (1):
$$\sigma(z) = \frac{1}{1 + e^{-z}} \tag{1}$$
where z is the input of the function and σ(z) is its output; the domain of z is (-∞, +∞) and the range of σ(z) is (0, 1).
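As a small illustration, the standard sigmoid σ(z) = 1 / (1 + e^(-z)) and its use for binary classification can be sketched as below. The sign-based branching is a common numerical-stability device added here, and the 0.5 decision threshold and the name `classify` are assumptions for the sketch rather than details from the application.

```python
import math


def sigmoid(z: float) -> float:
    """Standard sigmoid, 1 / (1 + exp(-z)).

    Branching on the sign keeps math.exp from receiving a large
    positive argument, which would raise OverflowError.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(z: float, threshold: float = 0.5) -> int:
    """Binary decision on the sigmoid output (threshold is an assumption)."""
    return 1 if sigmoid(z) >= threshold else 0
```

In floating-point arithmetic the output can round to exactly 0.0 or 1.0 for inputs of very large magnitude, even though the mathematical range is the open interval (0, 1).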
(6) Graph embedding: a process that maps graph data (usually a high-dimensional dense matrix) into low-dimensional dense vectors, which addresses the difficulty of feeding graph data efficiently into machine learning algorithms. Common graph embedding techniques include the Graph Convolutional Network (GCN) and the Graph Attention Network (GAT).
The scheme provided by the present application is described in detail below. Refer to FIG. 1, which is a flowchart of an information similarity determining method provided in an embodiment of the present application. It should be noted that the present application does not limit the entity that executes the method; for example, the method may be executed by a terminal such as a computer or a mobile phone, or by an electronic device with computing capability such as a server or a processing chip. The method flow shown in FIG. 1 specifically includes:
101, acquiring a plurality of entities contained in the information to be processed and the relationship among the entities.
Optionally, a named entity recognition technique may be used to extract the plurality of entities from the information to be processed, and a relation extraction technique may be used to extract the relationships between the entities.
For example, if the information to be processed states that the executor is named Zhang San and has the mobile phone number 123XXXXXXXX, the entities that can be identified from the information to be processed include entity A: Zhang San, and entity B: 123XXXXXXXX. The relationship between entity A and entity B is that entity B is the mobile phone number of entity A.
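One convenient way to hold such an extraction result is a list of (head, relation, tail) triples. The `Triple` class and the relation label `mobile_phone_number` are hypothetical names chosen for this sketch; the application does not prescribe a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One extracted fact: head entity, relation, tail entity."""
    head: str
    relation: str
    tail: str


# Entity A, entity B and their relationship from the example above.
extracted = [Triple("Zhang San", "mobile_phone_number", "123XXXXXXXX")]

# The set of entities is every head or tail that appears in a triple.
entities = {t.head for t in extracted} | {t.tail for t in extracted}
```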
102, updating the knowledge base based on a result of matching the plurality of entities and the relationships between them against the entities and the relationships between entities contained in the knowledge base.
The knowledge base is constructed based on entities contained in a plurality of pieces of past processed information, and includes those entities and the relationships among them. For example, after two pieces of information, each containing three entities, have been processed, the knowledge base may include six entities and the relationships among them.
Optionally, the entities contained in the information to be processed and the relationships between them may be added to the knowledge base according to the result of matching them against the entities and relationships in the knowledge base, thereby completing the update of the knowledge base.
103, determining the similarity between the information to be processed and each of the plurality of pieces of past processed information according to the distances between the characterization vector corresponding to the information to be processed in the updated knowledge base and the characterization vectors corresponding to the plurality of pieces of past processed information.
Optionally, after the update of the knowledge base is completed, a graph embedding algorithm may be adopted to convert each piece of information (including the information to be processed and a plurality of pieces of information processed in the past) included in the updated knowledge base into a characterization vector respectively. For example, each piece of information in the knowledge base may be considered a sub-knowledge graph that is converted into a characterization vector using graph-embedding representation techniques.
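As a rough sketch of how a sub-knowledge graph might be turned into a single vector, the code below runs one graph-convolution layer in the commonly used normalized form ReLU(D^(-1/2) (A + I) D^(-1/2) H W) and then averages the node vectors. Several things here are assumptions rather than details from the application: the single layer, the mean pooling, the one-hot node features, and the random untrained weight matrix, which only demonstrates the forward pass.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, features, weight):
    """One graph-convolution layer with self-loops and symmetric normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # at least 1 because of the self-loop
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    hidden = matmul(matmul(norm, features), weight)
    return [[max(v, 0.0) for v in row] for row in hidden]  # ReLU


def embed_subgraph(adj, features, weight):
    """Mean-pool node vectors into one vector for the whole sub-graph."""
    hidden = gcn_layer(adj, features, weight)
    return [sum(col) / len(hidden) for col in zip(*hidden)]


# A three-node sub-graph: node 0 is linked to nodes 1 and 2.
adj = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rng = random.Random(0)
weight = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(3)]
vector = embed_subgraph(adj, features, weight)
```

In practice the weights would be learned, and a trained GCN or GAT would replace this hand-rolled layer.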
In a possible implementation, the distances between the characterization vector corresponding to the information to be processed and the characterization vectors corresponding to the pieces of past processed information can be calculated respectively. The similarity between the information to be processed and each piece of past processed information may then be determined based on these distances.
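The application does not fix a particular distance or a particular mapping from distance to similarity. The sketch below assumes Euclidean distance and the mapping 1 / (1 + d), which gives 1 for identical vectors and decreases as the distance grows; the vectors and case identifiers are made-up illustrative values.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity_from_distance(distance):
    """Assumed mapping: 1 at distance 0, approaching 0 as distance grows."""
    return 1.0 / (1.0 + distance)


# Illustrative vectors only.
pending = [0.2, 0.8, 0.1]
past = {
    "case_001": [0.2, 0.8, 0.1],
    "case_002": [0.9, 0.1, 0.4],
}
similarities = {
    case_id: similarity_from_distance(euclidean_distance(pending, vec))
    for case_id, vec in past.items()
}
```

Cosine similarity would be an equally reasonable choice when only the direction of the vectors matters.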
Based on this scheme, the entities in the information to be processed and the relationships between them are compared with the entities and relationships in the past processed information, and the commonality between the entities contained in the information to be processed and those contained in the past processed information is represented in the form of a knowledge graph. The characterization vectors are then computed and the similarity is calculated. Compared with prior-art schemes that determine similarity directly from the distance between the text vectors of the information texts, the method provided by the application fully considers the influence of the entities in the information on its similarity, so that the determined similarity is more accurate.
In some embodiments, to obtain the plurality of entities contained in the information to be processed and the relationships between them, the information to be processed may first be converted into a text vector. The present application does not particularly limit the conversion method; for example, a word2vec model may be used. Further, a pre-trained neural network model may be used to determine the positions of the plurality of entities in the text vector, and the positions of the entities in the information to be processed can then be determined from their positions in the text vector. Still further, the plurality of entities may be extracted based on their positions in the information to be processed.
In one possible implementation, an entity may be determined based on a description of the entity and a relationship predicate related to the entity. For example, an entity and the relationship related to that entity can be extracted by slot filling. Optionally, a slot template may be defined first; the slot template format may be R(e, a), where R is a relationship predicate, e is a description of an entity, and a is an entity. For example, if the executor included in the information to be processed is Zhang San, the generated slot template is: name of executor (executor, Zhang San). Further, a corresponding natural text question may be generated based on the slot template, such as: Question: What is the name of the executor? Answer: Zhang San. Optionally, the question and the answer can be divided by a separator, so that recognition errors are avoided. Furthermore, the natural text question and the answer can be converted into a text vector, and the positions of the plurality of entities and of the relationships between the entities in the text vector are determined using a pre-trained neural network model. The positions of the plurality of entities and of the relationships between the entities in the information to be processed can thereby be determined, and the entities and the relationships are extracted based on the determined positions.
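The slot template R(e, a) and the question-answer text derived from it can be sketched as follows. This is a minimal illustration only: the class name `Slot`, the function `to_question_answer`, the question wording, and the `[SEP]` separator string are assumptions made for the example and are not specified by the present application.

```python
from dataclasses import dataclass

# Assumed separator placed between question and answer; the text only says
# that "separators" may be used, not which one.
SEPARATOR = " [SEP] "


@dataclass(frozen=True)
class Slot:
    """One filled slot R(e, a)."""
    predicate: str    # R: relationship predicate, e.g. "name of the executor"
    description: str  # e: description of the entity, e.g. "executor"
    entity: str       # a: the entity itself, e.g. "Zhang San"


def to_question_answer(slot: Slot) -> str:
    """Render a filled slot as a single question-answer text."""
    question = f"Question: What is the {slot.predicate}?"
    answer = f"Answer: {slot.entity}"
    return question + SEPARATOR + answer


slot = Slot(predicate="name of the executor", description="executor", entity="Zhang San")
qa_text = to_question_answer(slot)
```

The resulting `qa_text` is the kind of string that would subsequently be encoded into a text vector.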
Optionally, before the entities and relationships are extracted from the information to be processed, the information may be preprocessed to avoid recognition errors. For example, synonym replacement may be performed on some entities in the information to be processed: terms such as "execution subject", "implementer", or "case processor" may be replaced with "executor".
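The synonym replacement described above could look like the following sketch. The synonym table, the function name `normalize_terms`, and the case-insensitive matching are assumptions for illustration; the present application does not prescribe how the replacement is implemented.

```python
import re

# Hypothetical synonym table: variant term -> canonical term.
SYNONYMS = {
    "execution subject": "executor",
    "implementer": "executor",
    "case processor": "executor",
}


def normalize_terms(text: str, synonyms: dict = SYNONYMS) -> str:
    """Replace every known variant term in `text` with its canonical term."""
    # Try longer variants first so a long phrase is not pre-empted by a shorter one.
    variants = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in variants), re.IGNORECASE)
    # Look up the canonical term by the lower-cased matched text.
    return pattern.sub(lambda m: synonyms[m.group(0).lower()], text)


normalized = normalize_terms("The implementer is Zhang San.")
```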
To facilitate understanding of the process, proposed by the present application, of acquiring the entities and the relationships between entities in the information to be processed, a specific embodiment is described below. Referring to fig. 2, a flowchart of a method for acquiring the entities and the relationships between entities contained in the information to be processed is provided, which includes:
201, preprocessing the information to be processed.
Optionally, the preprocessing may replace ambiguous words in the information to be processed, so as to avoid machine recognition errors. For the specific preprocessing, reference may be made to the description in the foregoing embodiments, and details are not repeated here.
202, determining answers to predefined questions according to the information to be processed, to obtain a plurality of question-answer texts.
Optionally, the entities in the information to be processed and the relationship related to each entity may first be obtained using a predefined slot template, and a plurality of question-answer texts may then be generated according to the slot template.
For example, a slot template may be: name of executor (executor, Zhang San). The predefined question is: What is the name of the executor? The answer to the question can be derived from the slot template, thereby generating the question-answer text: "Question: What is the name of the executor? Answer: Zhang San".
203, respectively converting the plurality of question-answer texts into text vectors.
Optionally, a BERT model may be used to deeply encode each word in the question and answer text, so as to obtain deep semantic information of the question and answer text. Further, each word may be encoded based on the context semantic information of each word, so as to obtain a word vector corresponding to each word. Still further, the resulting combination of multiple word vectors may be referred to as a text vector.
204, determining the positions of the entities and of the entity-related relationships in the information to be processed according to each text vector.
As a possible implementation manner, the text vector may first be linearly transformed by a pre-trained first linear layer, and the output of the first linear layer is then passed through a Sigmoid function. The score of each word belonging to an entity start position is thereby obtained; for convenience of judgment, the score can be normalized and then thresholded. For example, if the normalized value is greater than 0.5, it is set to 1, and otherwise to 0; a word position with value 1 may be used as the start position of an entity. Similarly, the text vector may be linearly transformed by a pre-trained second linear layer, and the output of the second linear layer is then passed through a Sigmoid function, so that the end position of the entity can be determined.
Further, the position of the entity in the information to be processed is determined based on the determined start position and end position of the entity. Optionally, the same method may also be used to determine the position of the relationship related to the entity, and details are not repeated here.
205, extracting the plurality of entities and the entity-related relationships from the information to be processed according to the determined positions.
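Steps 204 and 205 can be illustrated with a small pure-Python sketch of the start/end position decoding. The token vectors and layer weights below are made-up numbers standing in for the BERT encoding and the pre-trained linear layers, and the rule that pairs each start position with the nearest end position at or after it is an assumption; the present application does not state how starts and ends are paired.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def linear_scores(token_vectors, weights, bias):
    """One linear layer (dot product plus bias) followed by Sigmoid, per token."""
    return [sigmoid(sum(w * v for w, v in zip(weights, vec)) + bias)
            for vec in token_vectors]


def binarize(scores, threshold=0.5):
    """Set a score to 1 if it exceeds the threshold, otherwise to 0."""
    return [1 if s > threshold else 0 for s in scores]


def decode_spans(start_flags, end_flags):
    """Pair each start with the nearest end at or after it (assumed pairing rule)."""
    spans = []
    for i, is_start in enumerate(start_flags):
        if not is_start:
            continue
        for j in range(i, len(end_flags)):
            if end_flags[j]:
                spans.append((i, j))
                break
    return spans


# Toy inputs: 2-dimensional "word vectors" and hand-picked layer parameters.
tokens = ["Answer", ":", "Zhang", "San"]
token_vectors = [[0.1, 0.0], [0.0, 0.1], [2.0, -1.0], [-1.0, 2.0]]
start_w, start_b = [3.0, -3.0], -1.0   # stands in for the first linear layer
end_w, end_b = [-3.0, 3.0], -1.0       # stands in for the second linear layer

start_flags = binarize(linear_scores(token_vectors, start_w, start_b))
end_flags = binarize(linear_scores(token_vectors, end_w, end_b))
spans = decode_spans(start_flags, end_flags)
entities = [" ".join(tokens[i:j + 1]) for i, j in spans]
```

With these toy parameters only the third token scores above 0.5 as a start and only the fourth as an end, so the single extracted span covers "Zhang San".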
In some embodiments, after the entities and relationships contained in the information to be processed are obtained, they may be matched with the entities and relationships in the knowledge base. Optionally, each entity in the information to be processed may be matched with the entities included in the knowledge base; if the knowledge base contains an entity identical to an entity in the information to be processed, the matching is determined to be successful. For convenience of description, a successfully matched entity in the information to be processed is referred to as a first entity, and the entity in the knowledge base successfully matched with the first entity is referred to as a second entity. Further, after the first entity and the second entity are successfully matched, the relationships related to the first entity may be matched with the relationships related to the second entity; if the second entity has a second relationship that is the same as a first relationship related to the first entity, it is determined that the second relationship of the second entity and the first relationship of the first entity are successfully matched.
As a possible implementation manner, after the plurality of entities contained in the information to be processed and the relationships between them are matched with the entities and the relationships between entities contained in the knowledge base, the knowledge base may be updated according to the matching result.
In some embodiments, when the first entity and the second entity are successfully matched but the first relationship related to the first entity differs from every relationship related to the second entity, the first relationship may be added to the second entity. For example, the first entity is Zhang San, and the related first relationship is: the sex of Zhang San is male. The second entity is Zhang San, and the relationships related to the second entity do not include the sex of Zhang San. Therefore, the relationship "the sex of Zhang San is male" can be added to the knowledge base as a relationship related to the second entity.
In other embodiments, when there is no entity in the knowledge base that matches the first entity successfully, the first entity and the relationship associated with the first entity may be added directly to the knowledge base.
In still other embodiments, when the first entity and the second entity are successfully matched and the first relationship associated with the first entity and the second relationship associated with the second entity are successfully matched, the knowledge base is indicated to have the first entity and the first relationship, and the first entity and the first relationship do not need to be added to the knowledge base.
To further understand the process of updating the knowledge base provided by the present application, the following description is made with reference to a specific embodiment, and with reference to fig. 3, a flowchart of a method for updating the knowledge base is provided for the embodiment of the present application, which specifically includes:
301, a first entity and a plurality of entities included in a knowledge base are obtained.
The first entity is any entity included in the information to be processed.
302, determine whether there is an entity in the knowledge base that matches the first entity successfully.
If yes, the entity successfully matched with the first entity is simply referred to as the second entity, and the process continues to step 303.
If not, continue with step 305.
303, determining whether the second entity has a relationship successfully matched with the first relationship related to the first entity.
Wherein the first relationship is any one of the relationships associated with the first entity.
If so, the relationship successfully matched with the first relationship is simply referred to as the second relationship, and the process continues to step 306.
If not, continue with step 304.
304, adding the first relationship to the knowledge base for the second entity.
305, adding the first entity and the relationships related to the first entity to the knowledge base.
306, not updating the knowledge base.
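The flow of steps 301 to 306 can be sketched as below. The knowledge base is modelled here as a plain dictionary from entity name to a set of (predicate, value) relationships, and matching is taken to be exact equality; both are simplifying assumptions for illustration, as are the function name `update_knowledge_base`, the returned action log, and the example entity "Li Si".

```python
from typing import Dict, List, Set, Tuple

Relation = Tuple[str, str]                 # (predicate, value), e.g. ("sex", "male")
KnowledgeBase = Dict[str, Set[Relation]]   # entity name -> related relationships


def update_knowledge_base(kb: KnowledgeBase, entity: str,
                          relations: List[Relation]) -> list:
    """Apply steps 302-306 for one first entity; return a log of the actions taken."""
    actions = []
    if entity not in kb:                        # step 302: no matching second entity
        kb[entity] = set(relations)             # step 305: add entity and its relationships
        actions.append(("add_entity", entity))
        return actions
    for rel in relations:                       # step 303: match each first relationship
        if rel in kb[entity]:
            actions.append(("no_update", rel))  # step 306: already present
        else:
            kb[entity].add(rel)                 # step 304: add to the second entity
            actions.append(("add_relation", rel))
    return actions


kb: KnowledgeBase = {"Zhang San": {("role", "executor")}}
first = update_knowledge_base(kb, "Zhang San", [("sex", "male")])
second = update_knowledge_base(kb, "Zhang San", [("sex", "male")])
third = update_knowledge_base(kb, "Li Si", [("role", "witness")])
```

The three calls exercise, in order, the branches ending in steps 304, 306, and 305.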
Optionally, after the knowledge base has been updated, the information included in the knowledge base may be converted into characterization vectors based on the updated knowledge base. For example, a graph embedding algorithm may be used to convert each piece of information included in the updated knowledge base (including the information to be processed and the plurality of pieces of past processed information) into a characterization vector.
Further, the distance between the characterization vector corresponding to the information to be processed and each of the other characterization vectors can be calculated. Optionally, the Euclidean distance, the cosine distance, or the like may be calculated. Illustratively, the distance may be calculated using the following equation (2):
[Equation (2): formula image BDA0003745741460000141, not reproduced in this text]
where s is the distance between the characterization vector corresponding to the information to be processed and the characterization vector corresponding to any one of the plurality of past processed information, x_i is the i-th component of the characterization vector corresponding to the information to be processed, y_i is the i-th component of the characterization vector corresponding to that past processed information, and t is the vector dimension.
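For reference, the two distance measures named above can be written in the symbols s, x_i, y_i and t used for equation (2). Which of them equation (2) actually shows cannot be determined from this text, so the following is only a sketch of the standard definitions; the cosine form uses the common convention of one minus the cosine similarity, which is itself an assumption.

```latex
% Euclidean distance between the two t-dimensional characterization vectors
s_{\mathrm{euclid}} = \sqrt{\sum_{i=1}^{t} \left( x_i - y_i \right)^2}

% Cosine distance, taken here as one minus the cosine similarity
s_{\mathrm{cos}} = 1 - \frac{\sum_{i=1}^{t} x_i \, y_i}
                          {\sqrt{\sum_{i=1}^{t} x_i^{2}} \; \sqrt{\sum_{i=1}^{t} y_i^{2}}}
```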
Optionally, after the distances between the characterization vector of the information to be processed and the characterization vectors corresponding to the plurality of pieces of past processed information are calculated, the similarity between the information to be processed and each piece of past processed information may be determined based on the calculated distances. As an example, the similarity may have a linear relationship with the closeness of the vectors, decreasing as the distance increases; that is, the closer a characterization vector is to the characterization vector of the information to be processed, the higher the similarity between the information corresponding to that characterization vector and the information to be processed.
To further understand the solution proposed in the present application for determining similarity based on updated knowledge base, the following description is made with reference to specific embodiments. Referring to fig. 4, a flowchart of a method for determining information similarity based on an updated knowledge base provided in an embodiment of the present application specifically includes:
401, respectively converting a plurality of pieces of information included in the updated knowledge base into the characterization vectors.
The updated knowledge base comprises information to be processed and a plurality of past processed information.
Alternatively, each piece of information can be regarded as a sub-knowledge graph in the knowledge base, and each sub-knowledge graph can be converted into a characterization vector by adopting graph embedding representation technology.
402, respectively calculating the distances between the characterization vector corresponding to the information to be processed and the characterization vectors corresponding to the past processed information.
The algorithm for calculating the distance is not limited, and for example, the Euclidean distance or the cosine included angle distance can be calculated.
403, determining the similarity of the information according to the plurality of calculated distances.
Specifically, the closer a characterization vector is to the characterization vector corresponding to the information to be processed, the higher the similarity between the information corresponding to that characterization vector and the information to be processed.
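Steps 402 and 403 can be sketched as follows, taking the characterization vectors as already computed (the graph embedding of step 401 is not shown). The mapping 1 / (1 + d) from distance to similarity is an assumption chosen only because it decreases as the distance grows; the present application does not fix a particular mapping, and the vector values and names such as `info_a` are invented for the example.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x, y):
    """One minus the cosine similarity (assumes neither vector is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm


def similarity_from_distance(d: float) -> float:
    """Assumed monotonically decreasing mapping: smaller distance, higher similarity."""
    return 1.0 / (1.0 + d)


# Characterization vector of the information to be processed (made-up values).
query = [1.0, 0.0]
# Characterization vectors of past processed information (made-up values).
past = {"info_a": [1.0, 1.0], "info_b": [4.0, 4.0]}

distances = {name: euclidean_distance(query, vec) for name, vec in past.items()}
similarities = {name: similarity_from_distance(d) for name, d in distances.items()}
```

Here `info_a` lies closer to the query vector than `info_b` and therefore receives the higher similarity.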
In some embodiments, after the similarity between the information to be processed and each of the plurality of past processed information is determined, a set number of pieces of past processed information may be output in descending order of similarity. For example, assume that the set number is three and the knowledge base includes four pieces of past processed information, where the similarity between the first information and the information to be processed is 71, the similarity between the second information and the information to be processed is 23, the similarity between the third information and the information to be processed is 80, and the similarity between the fourth information and the information to be processed is 51. Three pieces of information may then be output in the order "third information → first information → fourth information". Optionally, the output information may be sent to the device of the worker handling the information to be processed, for reference by the worker.
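The output step can be sketched with the numbers from the example above; the function name `top_k_similar` and the dictionary layout are assumptions for illustration.

```python
def top_k_similar(similarities: dict, k: int) -> list:
    """Return the names of the k most similar pieces of information, most similar first."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Similarity values taken from the example in the text.
similarities = {"first": 71, "second": 23, "third": 80, "fourth": 51}
result = top_k_similar(similarities, 3)
```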
Based on the same concept as the method described above, referring to fig. 5, there is provided an apparatus 500 for determining similarity of information according to an embodiment of the present application. The apparatus 500 is used for implementing the steps of the above method, and therefore, in order to avoid repetition, the detailed description is omitted here. The apparatus 500 comprises: an acquisition unit 501 and a processing unit 502.
An obtaining unit 501, configured to obtain multiple entities included in information to be processed and relationships among the multiple entities;
a processing unit 502, configured to update a knowledge base based on a matching result of the plurality of entities and the relationships among the plurality of entities with the entities and the relationships among the entities included in the knowledge base; the knowledge base is constructed based on entities contained in a plurality of pieces of information processed in the past;
the processing unit 502 is further configured to determine similarity between the information to be processed and the plurality of past processed information according to a distance between a characterization vector corresponding to the information to be processed in the updated knowledge base and a characterization vector corresponding to the plurality of past processed information.
In some embodiments, the processing unit 502 is specifically configured to:
when a second entity successfully matched with a first entity exists in the knowledge base and the relation related to the first entity is not equal to the relation related to the second entity, adding the relation related to the first entity into the second entity;
when an entity which is successfully matched with the first entity does not exist in the knowledge base, adding the first entity and the relation related to the first entity into the knowledge base;
wherein the first entity is any one of the plurality of entities.
In some embodiments, the processing unit 502 is further configured to match the plurality of entities and the relationships between the plurality of entities with the entities and the relationships between the entities included in the knowledge base, specifically to:
any one of the plurality of entities is matched with the entities contained in the knowledge base respectively;
and when the entity successfully matched with any entity exists in the knowledge base, matching the relation related to the successfully matched entity with the relation related to the first entity.
In some embodiments, the obtaining unit 501 is specifically configured to:
converting the information to be processed into a text vector by the processing unit 502; determining the positions of the plurality of entities and the relations among the plurality of entities contained in the text vector by adopting a pre-trained neural network model; determining the positions of the entities and the relations among the entities in the information to be processed according to the positions of the entities and the relations among the entities in the text vector;
and extracting the entities and the relations among the entities from the information to be processed according to the positions of the entities and the relations among the entities in the information to be processed.
In some embodiments, the processing unit 502 is further configured to:
and respectively converting the information to be processed in the updated knowledge base and the plurality of past processed information into characterization vectors by adopting a graph embedding algorithm.
In some embodiments, the processing unit 502 is further configured to:
and outputting a set number of past processed information according to the similarity between the information to be processed and the plurality of past processed information respectively and the descending order of the similarity.
Fig. 6 shows a schematic structural diagram of an electronic device 600 provided in an embodiment of the present application. The electronic device 600 in this embodiment may further include a communication interface 603, where the communication interface 603 is, for example, a network interface, and the electronic device may transmit data through the communication interface 603.
In the embodiment of the present application, the memory 602 stores instructions that can be executed by the at least one controller 601, and the at least one controller 601 can be configured to execute the steps in the foregoing method by executing the instructions stored in the memory 602, for example, the controller 601 can implement part of the functions of the obtaining unit 501 and the functions of the processing unit 502 in fig. 5.
The controller 601 is the control center of the electronic device, and may connect the various parts of the whole electronic device using various interfaces and lines, by running or executing instructions stored in the memory 602 and calling data stored in the memory 602. Optionally, the controller 601 may include one or more processing units, and the controller 601 may integrate an application controller and a modem controller, wherein the application controller mainly handles the operating system, application programs, and the like, and the modem controller mainly handles wireless communication. It will be appreciated that the modem controller may also not be integrated into the controller 601. In some embodiments, the controller 601 and the memory 602 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
The controller 601 may be a general-purpose controller, such as a Central Processing Unit (CPU), a digital signal controller, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general controller may be a microcontroller or any conventional controller or the like. The steps executed by the data statistics platform disclosed in the embodiments of the present application may be directly executed by a hardware controller, or may be executed by a combination of hardware and software modules in the controller.
The memory 602, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 602 may include at least one type of storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and so on. The memory 602 may also be, without limitation, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 602 in the embodiments of the present application may also be circuitry or any other device capable of performing a storage function, for storing program instructions and/or data.
By programming the controller 601, for example, the code corresponding to the training method of the neural network model described in the foregoing embodiment may be fixed in the chip, so that the chip can execute the steps of the training method of the neural network model when running.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a controller of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the controller of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the present application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (14)

1. A method for determining information similarity is characterized by comprising the following steps:
acquiring a plurality of entities contained in information to be processed and relations among the entities;
updating a knowledge base based on a matching result of the plurality of entities and the relationships between the plurality of entities with the entities and the relationships between the entities contained in the knowledge base; the knowledge base is constructed based on entities contained in a plurality of pieces of information processed in the past;
and determining the similarity between the information to be processed and the past processed information respectively according to the distance between the characterization vector corresponding to the information to be processed in the updated knowledge base and the characterization vectors corresponding to the past processed information.
2. The method of claim 1, wherein updating the knowledge base based on a matching of the plurality of entities and the relationships to entities and relationships between entities contained in the knowledge base comprises:
when a second entity successfully matched with a first entity exists in the knowledge base and the relation related to the first entity is not equal to the relation related to the second entity, adding the relation related to the first entity into the second entity;
when an entity which is successfully matched with the first entity does not exist in the knowledge base, adding the first entity and the relation related to the first entity into the knowledge base;
wherein the first entity is any one of the plurality of entities.
3. The method according to claim 1 or 2, wherein the matching of the plurality of entities and the relationships between the plurality of entities with the entities and the relationships between the entities contained in the knowledge base is performed by:
any one of the plurality of entities is matched with the entities contained in the knowledge base respectively;
and if the entity successfully matched with any entity exists in the knowledge base, matching the relation related to the successfully matched entity with the relation related to the first entity.
4. The method according to claim 1 or 2, wherein the obtaining of the plurality of entities contained in the information to be processed and the relationship between the plurality of entities comprises:
converting the information to be processed into a text vector;
determining the positions of the plurality of entities and the relations among the plurality of entities contained in the text vector by adopting a pre-trained neural network model;
determining the positions of the entities and the relations among the entities in the information to be processed according to the positions of the entities and the relations among the entities in the text vector;
and extracting the entities and the relations among the entities from the information to be processed according to the positions of the entities and the relations among the entities in the information to be processed.
5. The method according to claim 1 or 2, characterized in that the method further comprises:
and respectively converting the information to be processed in the updated knowledge base and the plurality of past processed information into characterization vectors by adopting a graph embedding algorithm.
6. The method according to claim 1 or 2, characterized in that the method further comprises:
and outputting a set number of past processed information according to the similarity between the information to be processed and the plurality of past processed information respectively and the descending order of the similarity.
7. An apparatus for determining similarity between information, comprising:
an acquisition unit, configured to acquire a plurality of entities contained in information to be processed and relationships among the plurality of entities;
a processing unit configured to update a knowledge base based on a matching result of the plurality of entities and the relationships between the plurality of entities with the entities and the relationships between the entities included in the knowledge base; the knowledge base is constructed based on entities contained in a plurality of pieces of information processed in the past;
the processing unit is further configured to determine similarity between the information to be processed and the plurality of past processed information according to distances between the characterization vectors corresponding to the information to be processed in the updated knowledge base and the characterization vectors corresponding to the plurality of past processed information.
8. The apparatus according to claim 7, wherein the processing unit is specifically configured to:
when a second entity successfully matched with a first entity exists in the knowledge base and the relation related to the first entity is not equal to the relation related to the second entity, adding the relation related to the first entity into the second entity;
when an entity which is successfully matched with the first entity does not exist in the knowledge base, adding the first entity and the relation related to the first entity into the knowledge base;
wherein the first entity is any one of the plurality of entities.
9. The apparatus according to claim 7 or 8, wherein the processing unit is further configured to match the plurality of entities and the relationships between the plurality of entities with entities and relationships between entities contained in a knowledge base, and in particular to:
match any one of the plurality of entities against each of the entities contained in the knowledge base;
and, when an entity that successfully matches that entity exists in the knowledge base, match the relationships involving the successfully matched entity against the relationships involving that entity.
10. The apparatus according to claim 7 or 8, wherein the acquisition unit is specifically configured to:
convert the information to be processed into a text vector by means of the processing unit;
determine, using a pre-trained neural network model, the positions in the text vector of the plurality of entities and of the relationships between the plurality of entities;
determine the positions of the entities and of the relationships between the entities in the information to be processed according to their positions in the text vector;
and extract the entities and the relationships between the entities from the information to be processed according to their positions in the information to be processed.
11. The apparatus according to claim 7 or 8, wherein the processing unit is further configured to:
convert, using a graph embedding algorithm, the information to be processed and each of the plurality of past processed information in the updated knowledge base into characterization vectors.
12. The apparatus according to claim 7 or 8, wherein the processing unit is further configured to:
output a set number of the past processed information in descending order of similarity, according to the similarity between the information to be processed and each of the plurality of past processed information.
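The distance-based similarity of claim 7 and the ranked output of claim 12 can be sketched as follows. This is only an illustration under stated assumptions: the claims do not name a distance metric or a distance-to-similarity mapping, so Euclidean distance and `1 / (1 + distance)` are choices made here, and the function name `top_k_similar` and the plain-list vector format are hypothetical. The characterization vectors are taken as given (for instance, as produced by a graph embedding step).

```python
import math
from typing import Dict, List, Sequence, Tuple


def top_k_similar(
    query_vec: Sequence[float],
    past_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k past items most similar to the query, most similar first."""
    scored = []
    for info_id, vec in past_vecs.items():
        # Euclidean distance between characterization vectors (assumed metric).
        distance = math.dist(query_vec, vec)
        # Map distance to a similarity in (0, 1]; smaller distance, higher score.
        similarity = 1.0 / (1.0 + distance)
        scored.append((info_id, similarity))
    # Descending order of similarity, then keep the set number of results.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With this mapping, an identical vector scores 1.0 and the score decreases monotonically as the distance grows, so sorting by similarity is equivalent to sorting by ascending distance.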
13. An electronic device, comprising: a memory and a controller;
a memory for storing program instructions;
a controller configured to invoke the program instructions stored in the memory and to execute the method of any one of claims 1-6 according to the obtained program.
14. A computer storage medium storing computer-executable instructions for performing the method of any one of claims 1-6.
CN202210831617.2A 2022-07-14 2022-07-14 Method and device for determining information similarity Pending CN115270751A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210831617.2A CN115270751A (en) 2022-07-14 2022-07-14 Method and device for determining information similarity

Publications (1)

Publication Number Publication Date
CN115270751A true CN115270751A (en) 2022-11-01

Family

ID=83765727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210831617.2A Pending CN115270751A (en) 2022-07-14 2022-07-14 Method and device for determining information similarity

Country Status (1)

Country Link
CN (1) CN115270751A (en)

Similar Documents

Publication Publication Date Title
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
CN110737758A (en) Method and apparatus for generating a model
Arshad et al. Aiding intra-text representations with visual context for multimodal named entity recognition
CN113821605B (en) Event extraction method
CN112819023A (en) Sample set acquisition method and device, computer equipment and storage medium
CN115599901B (en) Machine question-answering method, device, equipment and storage medium based on semantic prompt
CN110598869B (en) Classification method and device based on sequence model and electronic equipment
CN113239169A (en) Artificial intelligence-based answer generation method, device, equipment and storage medium
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
CN112132238A (en) Method, device, equipment and readable medium for identifying private data
CN117033571A (en) Knowledge question-answering system construction method and system
CN112101042A (en) Text emotion recognition method and device, terminal device and storage medium
CN112307048A (en) Semantic matching model training method, matching device, equipment and storage medium
CN113806493A (en) Entity relationship joint extraction method and device for Internet text data
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN117271724A (en) Intelligent question-answering implementation method and system based on large model and semantic graph
CN115859302A (en) Source code vulnerability detection method, device, equipment and storage medium
CN114417785A (en) Knowledge point annotation method, model training method, computer device, and storage medium
CN113705207A (en) Grammar error recognition method and device
CN117520503A (en) Financial customer service dialogue generation method, device, equipment and medium based on LLM model
CN113869049B (en) Fact extraction method and device with legal attribute based on legal consultation problem
CN115713082A (en) Named entity identification method, device, equipment and storage medium
CN112541357B (en) Entity identification method and device and intelligent equipment
CN115600595A (en) Entity relationship extraction method, system, equipment and readable storage medium
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination