CN111553163A - Text relevance determining method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN111553163A
Authority
CN
China
Prior art keywords
text
word
entity
determining
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010350443.9A
Other languages
Chinese (zh)
Inventor
徐也
常景冬
邵一峰
邹鹏飞
刘艾婷
荆宁
张红林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Wuhan Co Ltd
Original Assignee
Tencent Technology Wuhan Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Wuhan Co Ltd filed Critical Tencent Technology Wuhan Co Ltd
Priority to CN202010350443.9A priority Critical patent/CN111553163A/en
Publication of CN111553163A publication Critical patent/CN111553163A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition

Abstract

The application discloses a text relevance determination method and device, a storage medium, and an electronic device. The method comprises the following steps: determining a first set of entities associated with a first text and a second set of entities associated with a second text based on a knowledge base, the knowledge base including knowledge representations composed of entities, relationships between entities, and entity attributes; determining an entity relevance between the first set of entities and the second set of entities based on the knowledge representations; determining an attention value of each word in the first text and the second text with respect to the other words according to the association relationships among the words in the first text, among the words in the second text, and between the words of the first text and the words of the second text; and determining the text relevance of the first text and the second text based at least on the attention values and the entity relevance. In this scheme, the relationships among the words within each text and between the two texts are attended to during the text relevance calculation, so that useful information is focused on and useless information is ignored, improving the accuracy of the text relevance calculation result.

Description

Text relevance determining method and device, storage medium and electronic equipment
Technical Field
The application relates to the technical field of information processing, in particular to a method and a device for determining text relevancy, a storage medium and electronic equipment.
Background
Text relevance, also called the matching degree between texts, needs to be determined in many scenarios. For example, in a term-search scenario, performing a search generally requires determining the relevance of the text of each document to the terms in the search query, and the relevant documents are then ranked in the search results page based on that relevance. Determining text relevance depends on understanding the texts: it is decided not only by the semantic similarity of the two texts but also by the degree of match between them. Especially for long texts, the problem of information diffusion leads to low accuracy of the calculated text relevance.
Disclosure of Invention
The embodiments of the application provide a method and a device for determining text relevance, a storage medium, and an electronic device, which can improve the accuracy of text relevance calculation results.
The embodiment of the application provides a method for determining text relevancy, which comprises the following steps:
determining a first group of entities associated with the first text and a second group of entities associated with the second text based on a preset knowledge base, wherein the preset knowledge base comprises knowledge representations consisting of entities, relationships among the entities and entity attributes;
determining entity relevance between the first set of entities and the second set of entities from the knowledge representation;
determining attention values of each word in the first text and the second text with respect to the other words according to the association relationships among the words in the first text, the association relationships among the words in the second text, and the association relationships between the words in the first text and the words in the second text, wherein the attention values are used to reflect the degree of attention each word in the first text and the second text pays to the other words;
and determining the text relevance of the first text and the second text at least according to the attention value and the entity relevance.
Correspondingly, the embodiment of the present application further provides a device for determining text relevancy, including:
an entity determining unit, configured to determine a first group of entities associated with the first text and a second group of entities associated with the second text based on a preset knowledge base, where the preset knowledge base includes knowledge representations composed of entities, relationships between the entities, and entity attributes;
a first relevance determining unit for determining entity relevance between the first set of entities and the second set of entities based on the knowledge representation;
an attention determining unit, configured to determine an attention value of each word in the first text and the second text with respect to other words according to an association relationship between each word in the first text, an association relationship between each word in the second text, and an association relationship between a word in the first text and a word in the second text, where the attention value is used to reflect attention of each word in the first text and the second text to other words;
and the second relevance determining unit is used for determining the text relevance of the first text and the second text according to at least the attention value and the entity relevance.
Accordingly, the present application further provides a computer-readable storage medium, where the storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to perform the method for determining text relevancy as described above.
Accordingly, an embodiment of the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method for determining text relevancy as described above.
In the embodiment of the application, a first group of entities associated with a first text and a second group of entities associated with a second text are determined based on a preset knowledge base, where the preset knowledge base includes knowledge representations composed of entities, relationships among entities, and entity attributes; an entity relevance between the first set of entities and the second set of entities is determined based on the knowledge representations; the attention value of each word in the first text and the second text with respect to the other words is determined according to the association relationships among the words in the first text, among the words in the second text, and between the words of the first text and the words of the second text; and the text relevance of the first text and the second text is determined based at least on the attention values and the entity relevance. In this scheme, the relationships among the words within each text and between the texts are attended to during the text relevance calculation, so that useful information is focused on and useless information is ignored, improving the accuracy of the text relevance calculation result.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a schematic flowchart of a text relevance determination method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a model architecture provided in an embodiment of the present application.
Fig. 3 is a schematic structural diagram of an application scenario provided in an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a device for determining text relevancy according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any inventive work, are within the scope of protection of the present application.
The embodiments of the application provide a method and a device for determining text relevance, a storage medium, and an electronic device. The device for determining text relevance may be integrated in an electronic device that has a storage unit, is equipped with a microprocessor, and has computing capability, such as a tablet PC (personal computer), a mobile phone, and the like.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, acquire knowledge, and use that knowledge to obtain the best results, so that the machine has the functions of perception, reasoning, and decision-making.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it has a close relation with the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, and the like.
In this scheme, a self-attention mechanism (self-attention) is introduced. It imitates the internal process of biological observation behavior, that is, a mechanism that aligns internal experience with external sensation so as to increase the fineness of observation in regions of interest. Through the self-attention mechanism, important features of sparse data in the text can be extracted rapidly, and the text information is analyzed and processed by capturing the internal relevance of the data or features, thereby achieving the purpose of processing the text intelligently.
The following are detailed below. The numbers in the following examples are not intended to limit the order of preference of the examples. Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a method for determining text relevancy according to an embodiment of the present disclosure. The specific process of the text relevancy determination method may be as follows:
101. a first set of entities associated with the first text and a second set of entities associated with the second text are determined based on a preset knowledge base, the preset knowledge base including knowledge representations of the entities, relationships between the entities, and attributes of the entities.
The first text and the second text may be texts with longer text lengths, respectively. In this embodiment, the first text may be a text to be retrieved, and the second text may be a candidate text that needs to be matched with the text to be retrieved. In specific implementation, taking a question-answer scenario as an example, the first text may be a question text input by a user through an electronic device, and the second text may be an answer text preset in an answer library for the question text.
For example, the question text may be "who is Zhang San's wife", and the answer texts may be a series of answers such as "Zhang San's wife is Li Si", "Li Si's husband is Zhang San", "She was born in February 1990, occupation teacher, once …", "Zhang San's wife is Wang, occupation singer, born in May 1880 in S city …", and so on. It can be seen that for one question text there are many candidate answer texts, and in order to save the user time in accurately finding the most relevant answer, how to rank the answer texts according to their relevance to the question becomes especially important.
A preset knowledge base is a knowledge graph or knowledge map. An entity refers to the various objects and concepts present in the real world, such as people, geographic locations, organizations, brands, professions, dates, and so forth. A relationship between entities refers to an association between two entities; entity attributes refer to the properties of an entity itself. Taking a person as an example, the attributes may include occupation, birthday, representative work, age, height, weight, gender, and the like. An entity's attributes may also sometimes be regarded as a special kind of relationship of the entity, so the knowledge base describes one or more relationships of various entities. Taking the above question text and answer text as an example, the entities may include the persons "Zhang San" and "Li Si", the entity attributes may include the occupation "teacher" and the date "February 1990", and the relationships between entities may include the couple relationship between Zhang San and Li Si.
For convenience of processing and understanding of the computer, knowledge in the predetermined knowledge base may be represented in the form of a triple "Subject-predicate-Object (SPO)", such as: (first entity, relationship/attribute, second entity). For example, the knowledge that "a wife of Zhang III is Li Si" can be represented by a triple as (Zhang III, wife, Li Si). Herein, a relationship or attribute (such as a wife) is also referred to as a "predicate", and two entities having a respective relationship or attribute may be referred to as a "subject" or an "object". If an entity is regarded as a node and the relationship and attribute between the entities are regarded as an edge, the knowledge base containing a large number of triples forms a huge knowledge graph. By associating knowledge elements such as entities, relationships/attributes, etc., corresponding knowledge can be easily obtained from the knowledge base.
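As a concrete illustration of the SPO-triple representation described above, the following minimal Python sketch stores triples and looks up knowledge by subject and predicate. The class, method names, and sample triples are illustrative assumptions, not part of the patent.

```python
# Minimal sketch of SPO-triple storage and lookup. Names and data are
# illustrative; the patent does not prescribe an implementation.
from collections import defaultdict

class KnowledgeBase:
    def __init__(self):
        # Index: subject -> list of (predicate, object) pairs.
        self.spo = defaultdict(list)

    def add(self, subject, predicate, obj):
        """Store one (subject, predicate, object) triple."""
        self.spo[subject].append((predicate, obj))

    def query(self, subject, predicate):
        """Return all objects linked to `subject` via `predicate`."""
        return [o for p, o in self.spo[subject] if p == predicate]

kb = KnowledgeBase()
kb.add("Zhang San", "wife", "Li Si")      # relation: (Zhang San, wife, Li Si)
kb.add("Li Si", "occupation", "teacher")  # attribute stored the same way

print(kb.query("Zhang San", "wife"))  # ['Li Si']
```

Because attributes and relationships share the triple form, a single index answers both kinds of question, which mirrors how the knowledge base described above treats an attribute as a special kind of relationship.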
102. An entity relevance between the first set of entities and the second set of entities is determined from the knowledge representation.
Entity relevance is a quantitative representation of the degree of matching between the first group of entities and the second group of entities, and can be expressed as the degree of similarity between the entities in one group and the entities in the other group. In this embodiment, the degree of similarity may specifically be determined by the word co-occurrence degree of the entities and the essential co-occurrence degree of the entities. The word co-occurrence degree of the entities may represent the word coincidence rate of the first group of entities and the second group of entities, while the essential co-occurrence degree represents the coincidence rate of the entity identifiers corresponding to the same entity words in the first and second groups of entities. In practical applications, both co-occurrence degrees can be computed directly and used as shallow features for calculating the relevance of the first text and the second text.
In this embodiment, entities in the first text and the second text may be identified using a text-entity linking technique and connected to the corresponding nodes of the knowledge graph. Considering that the entities included in the knowledge graph cannot be guaranteed to be completely covered, the relevance of the question and the answer can be described by simultaneously using the entity words recognized from the text (i.e., the entity mentions) and the entity identifiers of the entities in the preset knowledge base (i.e., the entity IDs). That is, when determining the entity relevance between the first set of entities and the second set of entities based on the knowledge representation, the following process may be included:
(11) determining a first number of entities of the first and second sets of entities having the same designation;
(12) determining a second number of entities in the first group of entities and the second group of entities having the same identification in the knowledge base, wherein the identification of the entity uniquely identifies the entity in the preset knowledge base;
(13) and determining entity relevance according to the first number and the second number.
Specifically, the second number of entities in the first and second groups having the same identifier in the knowledge base may be determined based on the knowledge representations in the preset knowledge base; that is, the entities in the first and second groups that are identical in terms of entity relationships, entity attributes, and so on are determined through the knowledge representation.
Taking a question-answer scenario as an example, for entity mentions, the mention similarity of the text to be retrieved and the candidate text can be calculated using pre-trained word vectors; for entity IDs, the entity words in the text to be retrieved can be directly matched with those in the candidate text, and the matching result is used as the similarity at entity granularity. That is, in some embodiments, the first text is a text to be retrieved and the second text is a candidate text. The step "determining the entity relevance according to the first number and the second number" may include the following steps:
determining the entity word similarity of the text to be retrieved and the candidate text based on the first number and the entity number of the first group of entities;
determining entity identification similarity of the text to be retrieved and the candidate text based on the second number and the entity number of the first group of entities;
and determining the entity correlation degree according to the entity word similarity degree and the entity identification similarity degree.
When determining the entity relevance from the entity word similarity and the entity identifier similarity, the two similarities may be used jointly as the entity relevance; alternatively, the entity word similarity and the entity identifier similarity may be weighted based on corresponding weight information to obtain the entity relevance.
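The weighted-combination option described above can be sketched as follows; the default equal weights are an illustrative assumption, since the patent does not specify weight values.

```python
def entity_relevance(mention_sim, id_sim, w_mention=0.5, w_id=0.5):
    """Weighted combination of entity word (mention) similarity and
    entity identifier similarity. The equal default weights are an
    assumption, not values given in the patent."""
    return w_mention * mention_sim + w_id * id_sim

# e.g. a candidate whose mentions match well but whose IDs do not:
print(entity_relevance(1.0, 0.0))  # 0.5
```

The two similarities could equally well be passed on unweighted as separate shallow features, which is the other option the text describes.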
The entity mention similarity is calculated by the following formula:

sim_mention(q, d) = (1/n) · Σ_{i=1}^{n} max_{1 ≤ j ≤ m} sim(mention_q_i, mention_d_j)

where mention_q_i is the vector representation of the i-th entity mention in the first text, mention_d_j is the vector representation of the j-th entity mention in the second text, n is the number of entity mentions in the first text, m is the number of entity mentions in the second text, and n and m are both integers greater than or equal to 1. The formula indicates that for the vector representation of each entity mention in the first text, its similarity to the vector representation of every entity mention in the second text is determined and the maximum value is selected. The selected maximum values for all entity mentions in the first text are summed and averaged over the number of entity mentions in the first text, and the resulting average is taken as the entity mention similarity of the first text and the second text.
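The max-then-average computation above can be sketched in Python as follows. The patent does not fix the similarity function for mention vectors, so cosine similarity here is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (an assumed choice of sim)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mention_similarity(q_vecs, d_vecs):
    """(1/n) * sum_i max_j sim(mention_q_i, mention_d_j)."""
    if not q_vecs or not d_vecs:
        return 0.0
    return sum(max(cosine(q, d) for d in d_vecs) for q in q_vecs) / len(q_vecs)

# One query mention that exactly matches one of two candidate mentions:
print(mention_similarity([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))  # 1.0
```

Note the asymmetry: the average runs over the first text's mentions only, matching the formula's normalization by n.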
The entity ID similarity is calculated by the following formula:

sim_id(q, d) = (1/n) · Σ_{i=1}^{n} max_{1 ≤ j ≤ m} 1[id_q_i = id_d_j]

where id_q_i is the i-th entity in the first group of entities, id_d_j is the j-th entity in the second group of entities, 1[·] is an indicator that equals 1 when the two entity identifiers are the same and 0 otherwise, n is the number of entities in the first group, m is the number of entities in the second group, and n and m are both integers greater than or equal to 1. The formula indicates that for each entity in the first group, it is determined whether the second group contains an entity with the same identifier. The ratio of the number of entities with a matching identifier to the total number of entities n in the first group is then used as the entity ID similarity. It can be understood as the similarity of the two groups of entities at the identifier level.
For example, suppose the entity mentions of the text to be retrieved are: Xiao A (ID1), E University (ID2); the mentions of candidate text 1 are: Xiao A (ID3), Xiao B (ID4), Xiao C (ID5), Xiao D (ID6); and the mentions of candidate text 2 are: Xiao A (ID1), Professor (ID7), E University (ID8).

Then, for the mention similarity of the text to be retrieved and candidate text 1: the number of mentions in the text to be retrieved, 2, is taken as the denominator, and the mentions of the text to be retrieved and candidate text 1 have 1 element in common (namely Xiao A), which is taken as the numerator, giving a mention similarity of 1/2.

For the ID similarity of the text to be retrieved and candidate text 1: the number of IDs in the text to be retrieved, 2, is taken as the denominator, and the IDs of the text to be retrieved and candidate text 1 have no intersection, so the numerator is 0, giving an entity ID similarity of 0/2.

For the mention similarity of the text to be retrieved and candidate text 2: the number of mentions in the text to be retrieved, 2, is again the denominator; the mentions of the text to be retrieved and candidate text 2 have 2 elements in common (namely Xiao A and E University), giving a mention similarity of 2/2.

For the ID similarity of the text to be retrieved and candidate text 2: the number of IDs in the text to be retrieved, 2, is taken as the denominator, and the IDs of the text to be retrieved and candidate text 2 have 1 element in common (namely ID1), which is taken as the numerator, giving an entity ID similarity of 1/2. This illustrates why both granularities are needed: "E University" matches at the mention level but is linked to different identifiers (ID2 versus ID8) in the two texts.
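The ID-similarity part of the worked example above can be reproduced with a short sketch (function and variable names are illustrative):

```python
def id_similarity(q_ids, d_ids):
    """Fraction of the query's entity IDs that also appear among the
    candidate's entity IDs (the identifier-level overlap from the text)."""
    if not q_ids:
        return 0.0
    d = set(d_ids)
    return sum(1 for i in q_ids if i in d) / len(q_ids)

query = ["ID1", "ID2"]                 # Xiao A, E University
cand1 = ["ID3", "ID4", "ID5", "ID6"]   # candidate text 1
cand2 = ["ID1", "ID7", "ID8"]          # candidate text 2

print(id_similarity(query, cand1))  # 0.0  (no shared IDs)
print(id_similarity(query, cand2))  # 0.5  (ID1 shared)
```

The denominator is always the query's ID count, matching the normalization by n in the formula.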
103. And determining the attention value of each word in the first text and the second text relative to other words according to the incidence relation among each word in the first text, the incidence relation among each word in the second text and the incidence relation between the word in the first text and the word in the second text, wherein the attention value is used for reflecting the attention degree of each word in the first text and the second text to other words.
In some embodiments, when the text is deeply represented, an RNN is introduced into the representation layer to process the text. However, since the RNN attends mainly to each word itself, there is an information-diffusion problem for texts consisting of long sentences.
For example, taking a question-answer scenario about a celebrity's wife, the question is: "who is the celebrity's wife", and the longer answer text is: "In 1984, she and her sister took part in a contest in a certain country and won third place, and then went to a certain country for beauty treatment; between 1985 and 1987 she worked as a print model; on 23 June 2008 she registered her marriage somewhere with a certain person." It can be seen that the words describing the marriage are far away from the words identifying the wife, and such long-distance relationships between entities cannot be modeled by an RNN. Therefore, the self-attention mechanism can be introduced in this scenario: it attends to the dependency relationships among all words in the text and learns the features of the internal structure of the sentence, so the problem of long-distance word dependency can be effectively solved. That is, in this embodiment, the dependency relationships between words can be determined by learning the association relationships between words inside each text and between the texts, and the degree of importance of each word to the text (i.e., its degree of attention) is determined. When outputting the deep representation of the text, the attention paid to words with high attention values is increased and the attention paid to words with low attention values is decreased, so that "useful information" with high relevance to the text itself is retained and "useless information" with low relevance is removed.
In some embodiments, when the spliced matrix is processed according to the association relationships among the words in the first text, among the words in the second text, and between the words of the first text and the words of the second text to obtain a processed matrix, the processing is specifically as follows: the correlation degree between each word in the first text and the other words is calculated according to the association relationships among the words in the first text and between the words of the first text and the words of the second text; and the correlation degree between each word in the second text and the other words is calculated according to the association relationships among the words in the second text and between the words of the first text and the words of the second text. Finally, the attention value of each word in the first text and each word in the second text with respect to the other words is determined according to these correlation degrees.
In this embodiment, the self-attention mechanism may be introduced to find the most relevant context for the different words or entities in the text, and weighting is performed to obtain the final hidden-layer representation. In the self-attention mechanism, each word has 3 different vectors: a Query vector (Q), a Key vector (K), and a Value vector (V), each of length 64. They are obtained by multiplying the word's embedding vector X by three different weight matrices W_Q, W_K, and W_V. The dimensions of the three weight matrices W_Q, W_K, and W_V are all the same; for example, the dimensions may be 512×64.
Specifically, each input word or entity can be converted into an embedding vector, and the three vectors Q, K, and V are then obtained from the embedding vector. For each word or entity, a correlation score with the other words or entities (representing the degree of correlation) is calculated, i.e., score = Q · K^T. For gradient stability, the scores may be numerically normalized using the softmax activation function. The normalized values are multiplied element-wise with the Value vector V of each word or entity to obtain a weighted score for each input vector, and these are summed to obtain the final output result Z = Σ softmax(score) · V, which serves as the attention vector of the input word or entity; the attention value of each word or entity can be obtained by processing this attention vector.
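The Q/K/V computation just described can be sketched with numpy as follows. Random matrices stand in for learned weights, and the division by sqrt(d_k) is the standard scaling used alongside the softmax normalization step; the exact scaling is not specified in the text.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one token sequence.

    X:          (seq_len, d_model) embedding matrix
    Wq, Wk, Wv: (d_model, d_k) projection matrices (shared across tokens)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # score = Q . K^T, scaled
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over each row
    return w @ V                                     # Z = softmax(score) . V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 512))                        # 5 tokens, model dim 512
Wq, Wk, Wv = (rng.normal(size=(512, 64)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)
print(Z.shape)  # (5, 64)
```

Each output row is a weighted mix of all tokens' Value vectors, which is how a word far from its related entity can still attend to it directly, unlike in an RNN.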
Therefore, by calculating the correlation degree between different words, the attention between the words can be represented, and then the useful information is reserved and the useless information is removed, so that the long text can be well represented by introducing the self-attention mechanism.
Referring to fig. 2, fig. 2 is a schematic diagram of a model architecture provided in the embodiment of the present application. In some embodiments, a feature matrix corresponding to the first text and a feature matrix corresponding to the second text need to be constructed in advance, yielding a first feature matrix and a second feature matrix, and the first feature matrix and the second feature matrix are spliced to obtain a spliced matrix. Then, when the attention value of each word in the first text and in the second text with respect to the other words is determined from the correlation degrees between each word and the other words, the correlation degrees may specifically be normalized, the spliced matrix may be weighted according to the normalized correlation degrees to obtain a weighted matrix, and the attention value of each word in the first text and the second text with respect to the other words is determined based on the weighted matrix.
In some embodiments, when the feature matrix corresponding to the first text and the feature matrix corresponding to the second text are respectively constructed to obtain the first feature matrix and the second feature matrix, the following process may be specifically included:
(21) performing word segmentation processing on the first text and the second text to obtain a first group of words related to the first text and a second group of words related to the second text;
(22) constructing a first vector representation of each word in the first set of words based on each word in the first set of words and a position of each word in the first text;
(23) constructing a second vector representation of each word in the second set of words based on each word in the second set of words and a position of each word in the second text;
(24) a first feature matrix is determined from at least the constructed first vector representation and a second feature matrix is determined from at least the constructed second vector representation.
Specifically, for the input first text and second text, word segmentation may be performed using word segmentation techniques (such as Yaha segmentation, Jieba segmentation, and the like), together with processing operations such as stop-word removal, punctuation removal, and emoticon conversion, so as to split the first text and the second text into separate words, thereby obtaining a first group of words associated with the first text and a second group of words associated with the second text.
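A minimal sketch of this preprocessing step. A real pipeline would call a Chinese segmenter such as Jieba; here a simple regex tokenizer and an illustrative stop-word list stand in so the example is self-contained, and the sample texts are assumptions.

```python
import re

STOPWORDS = {"the", "a", "of", "is"}   # illustrative stop-word list, not the patent's

def preprocess(text):
    """Sketch of segmentation plus cleanup: split into separate words,
    lowercase, drop punctuation, and remove stop words.
    (A real Chinese pipeline would use e.g. jieba.lcut(text) instead.)"""
    tokens = re.findall(r"\w+", text.lower())          # strips punctuation marks
    return [t for t in tokens if t not in STOPWORDS]   # stop-word removal

first_words = preprocess("Who is the wife of Zhang San?")
second_words = preprocess("The wife of Zhang San is Li Si.")
print(first_words)   # ['who', 'wife', 'zhang', 'san']
```

The two resulting word lists correspond to the first group of words and the second group of words used in the subsequent steps.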
In this embodiment, the vector representation can be constructed by combining each word with the position of the word in the text, so as to better express the actual semantics of the word in the text. That is, the constructed first vector representation and second vector representation each include: a word embedding vector (word embedding) and a position embedding vector (position embedding). The word embedding vector is a vector representation that maps each word in the text onto the real number domain; the position embedding vector is a vector representation that maps the position of each word in the text onto the real number domain.
Specifically, for word embedding, the entity SPO (Subject-Predicate-Object) triple information stored in a knowledge graph is used to train word-level embeddings in CBOW (continuous bag-of-words) mode, so that entities that share a relation obtain similar embeddings. In this way, the scheme uses the SPO triple information of entities to draw entities with similar relationships closer together. Since the input SPO information is short, the word window may be set to 1 in implementations.
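As a sketch of this training setup, each SPO triple can be treated as a short three-token "sentence" for CBOW training, which is why a context window of 1 already covers the whole input. The triples below are illustrative, and the commented-out gensim call is one possible (assumed) way to run the actual training.

```python
# Each SPO (Subject-Predicate-Object) triple becomes one short training sentence.
triples = [
    ("Zhang San", "wife", "Li Si"),
    ("Zhang San", "occupation", "teacher"),
]
sentences = [list(t) for t in triples]

# One possible training call (gensim), assuming its Word2Vec API:
# from gensim.models import Word2Vec
# model = Word2Vec(sentences, sg=0, window=1, min_count=1)  # sg=0 -> CBOW mode

print(sentences[0])   # ['Zhang San', 'wife', 'Li Si']
```

With a window of 1, the predicate mediates between subject and object during training, which is what pulls related entities toward similar embeddings.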
For position embedding, the purpose of its introduction is to model word order. For example, taking the vector PE that maps a word at position p into d_pos dimensions as an example, the value of the i-th element of PE, denoted PE_i(p), is calculated as follows:

PE_2i(p) = sin(p / 10000^(2i / d_pos))
PE_2i+1(p) = cos(p / 10000^(2i / d_pos))

wherein PE_2i(p) gives the element values at even indices and PE_2i+1(p) gives the element values at odd indices. In practical application, since sin(α + β) = sin α cos β + cos α sin β and cos(α + β) = cos α cos β − sin α sin β, the vector at position p + k can be expressed as a linear transformation of the vector at position p, which provides the possibility of expressing relative position information.
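The sinusoidal position-embedding computation described above can be sketched directly; the dimension and position values here are illustrative.

```python
import numpy as np

def position_embedding(p, d_pos):
    """Sinusoidal position embedding for a word at position p.

    Even-indexed elements use sin, odd-indexed elements use cos,
    matching the formula above.
    """
    pe = np.zeros(d_pos)
    for i in range(0, d_pos, 2):
        angle = p / (10000 ** (i / d_pos))
        pe[i] = np.sin(angle)
        if i + 1 < d_pos:
            pe[i + 1] = np.cos(angle)
    return pe

pe0 = position_embedding(0, 8)
print(pe0)   # at position 0, the sin terms are 0 and the cos terms are 1
```

Because of the trigonometric addition identities, `position_embedding(p + k, d)` is a fixed linear transformation of `position_embedding(p, d)`, which is what lets the model express relative positions.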
Subsequently, the segmented words can be subjected to operations of vector representation construction, feature matrix construction, matrix transformation, matrix extraction and the like by using the depth representation layer in the preset model (refer to fig. 2), and for different words or entities in the text, the most relevant context is found, and the final hidden layer representation is output in a weighted manner.
Referring to fig. 2 and 3, in some embodiments, when determining the first feature matrix according to at least the constructed first vector representation, the following process may be specifically included:
splicing the constructed first vector representations to obtain a first sub-matrix;
the method comprises the steps of identifying a first entity word from a first group of words based on a preset knowledge base, determining a first knowledge element related to the first entity word from the preset knowledge base, and splicing vector representations of the first knowledge element according to the position of the first entity word in a first text and knowledge representation formed by the first knowledge element to obtain a second sub-matrix, wherein the first knowledge element comprises: the first related entity corresponding to the first entity word and having a relationship with the first target entity in the preset knowledge base, the relationship between the first target entity and the first related entity and/or the entity attribute of the first target entity;
a first feature matrix is determined based on the first sub-matrix and the second sub-matrix.
Specifically, in the process of entity word recognition, entity words can be recognized from the first group of words by using an entity linking technique. When the first feature matrix is determined based on the first sub-matrix and the second sub-matrix, the first sub-matrix and the second sub-matrix can be directly spliced to obtain the first feature matrix. The first feature matrix thus fuses the word embedding of each word in the first text, the position of each word in the first text, and the entity embedding of each entity word in the first text. It should be noted that the entity embedding vector (entity embedding) is a vector representation that maps the relevant knowledge elements of each entity word in the first text onto the real number domain.
In this scheme, the entity SPO triple is modeled as an additive relationship (namely S + P = O) for model training, the aim being to make the sum of the embedding of S and the embedding of P as close as possible to the embedding of O. After training, an entity can thus be characterized by the vector representations of its related knowledge elements: the combined vector representation of the related knowledge elements is as close as possible to the vector representation of the entity itself.
For example, taking the SPO triple knowledge representation (Zhang San, wife, Li Si) as an example, if the entity word in the text is "Zhang San", the vector representations of the related knowledge elements of the entity word "Zhang San" (i.e., the related entity "Li Si" and the inter-entity relationship "wife") can be obtained as the entity embedding of the entity word "Zhang San".
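A toy sketch of the additive S + P = O relationship with this example triple. The two-dimensional embeddings are hand-picked assumptions so that the identity holds exactly; real trained vectors would only approximate it.

```python
import numpy as np

# Toy embeddings chosen so that S + P = O holds exactly (illustrative only).
embeddings = {
    "Zhang San": np.array([0.2, 0.5]),
    "wife":      np.array([0.3, -0.1]),
    "Li Si":     np.array([0.5, 0.4]),   # = embeddings["Zhang San"] + embeddings["wife"]
}

def entity_embedding(entity_word, triple):
    """Entity embedding of `entity_word` built from its related knowledge
    elements: here the relation and the related entity are spliced together."""
    s, p, o = triple
    related = [p, o] if entity_word == s else [p, s]
    return np.concatenate([embeddings[w] for w in related])

residual = embeddings["Zhang San"] + embeddings["wife"] - embeddings["Li Si"]
print(np.allclose(residual, 0))   # S + P equals O in this toy example
```

`entity_embedding("Zhang San", ("Zhang San", "wife", "Li Si"))` splices the vectors of "wife" and "Li Si", mirroring how the related knowledge elements form the entity embedding of "Zhang San".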
With continuing reference to fig. 2 and 3, when determining the second feature matrix according to at least the constructed second vector representation, the following process may be specifically included:
splicing the constructed second vector representations to obtain a third sub-matrix;
recognizing a second entity word from a second group of words based on the preset knowledge base, determining a second knowledge element related to the second entity word from the preset knowledge base, and splicing vector representations of the second knowledge element according to the position of the second entity word in a second text and knowledge representation formed by the second knowledge element to obtain a fourth sub-matrix, wherein the second knowledge element comprises: a second related entity that corresponds to the second entity word and has a relationship with a second target entity in the preset knowledge base, the relationship between the second target entity and the second related entity and/or the entity attribute of the second target entity;
a second feature matrix is determined based on the third sub-matrix and the fourth sub-matrix.
Similarly, in performing entity word recognition, entity linking techniques may be used to recognize entity words from the second group of words. When the second feature matrix is determined based on the third sub-matrix and the fourth sub-matrix, the third sub-matrix and the fourth sub-matrix can be directly spliced to obtain the second feature matrix. The second feature matrix thus fuses the word embedding of each word in the second text, the position of each word in the second text, and the entity embedding of each entity word in the second text. It should be noted that the entity embedding vector is a vector representation that maps the related knowledge elements of each entity word in the second text onto the real number domain.
Therefore, the self-attention value in this scheme integrates the vector representations of depth features such as the word embedding, position embedding and entity embedding of each word in the first text and the second text; the self-attention mechanism introduced in the depth representation layer performs transformation calculations on the feature matrix formed by these vector representations, and the final hidden-layer representation output is the attention vector. The attention vector may be a one-dimensional vector of size 1 × n, which includes the attention value of each word in the first text and the second text with respect to other words.
104. And determining the text relevance of the first text and the second text according to at least the attention value and the entity relevance.
In practical applications, the more words in the question text appear in the answer text, the more relevant the answer is to the question. Since the question is relatively short compared with the answer, the co-occurrence degree feature of the question and the answer word can be constructed by utilizing the proportion of the question word covered in the answer text, and the text relevance can be calculated by combining the word co-occurrence degree between the first text and the second text.
That is, before determining the text relevance of the first text and the second text, a third number of identical words in the first text and the second text may also be determined, and the word relevance of the first text and the second text may be determined according to the third number and the number of words in the first text. Then, when determining the text relevance of the first text and the second text according to the attention value and the entity relevance, the text relevance may specifically be determined according to the attention value, the entity relevance and the word relevance.
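A sketch of this word co-occurrence feature, computing the proportion of question words covered by the answer text. The token lists are illustrative, and counting unique shared words is one possible reading of the "third number".

```python
def word_relevance(first_words, second_words):
    """Word relevance sketch: third number (shared words) divided by the
    number of words in the first text (the question)."""
    shared = set(first_words) & set(second_words)   # the "third number"
    return len(shared) / len(first_words) if first_words else 0.0

q = ["who", "wife", "zhang", "san"]                 # question words
a = ["wife", "zhang", "san", "li", "si"]            # answer words
print(word_relevance(q, a))                         # 3 of 4 question words covered
```

The resulting ratio can then be combined with the attention value and the entity relevance when computing the final text relevance.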
In addition, in other embodiments, the feature that the first text is related to the statistical information of the second text can be introduced when the text relevance is calculated. For example, the relevant features may include a first text character length and word length, a second text character length and word length, an answer text (i.e., second text) source confidence, a question text (i.e., first text) classification and answer text (i.e., second text) classification similarity, and the like, which may be self-defined according to actual needs.
Specifically, with continued reference to fig. 3, when calculating the text relevance, shallow features of the first text and the second text (i.e., word co-occurrence of the question text and the answer text, entity co-occurrence of the question text and the answer text, word length of the question text and the answer text, character length of the question text and the answer text, confidence of the answer text, classification similarity of the question text and the answer text, etc.) may be extracted, and feature vectors (typically, one-dimensional feature vectors) may be constructed for the shallow features. And then, splicing the constructed feature vector with a feature vector output by a depth representation layer of the model, wherein the feature vector represents the attention value of each word in the first text and the second text relative to other words, and obtaining a spliced vector. And finally, carrying out normalization processing on the splicing vector by using a softmax activation function to obtain the text correlation degree of the question text and the answer text.
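The splicing-and-normalization step above can be sketched as follows. The shallow-feature values, the attention values, and the direct softmax over the spliced vector are illustrative simplifications of the model's final layer, not the exact trained computation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax used to normalize the spliced vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative shallow features: e.g. word co-occurrence, entity co-occurrence,
# answer-source confidence (values assumed for the example).
shallow = np.array([0.75, 0.5, 0.9])
# Illustrative attention values output by the depth representation layer.
deep = np.array([0.1, 0.4, 0.3, 0.2])

spliced = np.concatenate([shallow, deep])   # the "spliced vector"
probs = softmax(spliced)                    # softmax normalization
print(spliced.shape)                        # 3 shallow features + 4 attention values
```

In the described scheme, the normalized result is then used to score how relevant the answer text is to the question text.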
In practical applications, taking a scenario in which relevant answer search is performed for a question in a search library as an example, after a text relevance between a question text and each answer text is obtained, the answer texts can be ranked and displayed based on the text relevance, the answer text with a larger text relevance is displayed in the front, and the answer text with a smaller relevance is displayed in the back, so that the exposure of an accurate answer is improved.
The method for determining the text relevance determines a first group of entities associated with a first text and a second group of entities associated with a second text based on a preset knowledge base, wherein the preset knowledge base comprises knowledge representations composed of entities, relationships among the entities and entity attributes; determines the entity relevance between the first group of entities and the second group of entities based on the knowledge representation; determines the attention value of each word in the first text and the second text with respect to other words according to the association relationship among the words in the first text, the association relationship among the words in the second text, and the association relationship between the words in the first text and the words in the second text; and determines the text relevance of the first text and the second text at least according to the attention value and the entity relevance. According to the scheme, the relationships among the words within each text and between the two texts are taken into account during the text relevance calculation; based on these relationships, the weight of useful information is increased and the weight of useless information is reduced, which improves the accuracy of the text relevance calculation result.
In order to better implement the method for determining the text relevance provided by the embodiment of the present application, the embodiment of the present application further provides a device based on the method for determining the text relevance. The meaning of the noun is the same as that in the text relevancy determination method, and specific implementation details can refer to the description in the method embodiment.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a text relevancy determination apparatus according to an embodiment of the present application, where the processing apparatus may include: entity determining unit 301, first relevance determining unit 302, attention determining unit 303, and second relevance determining unit 304. Specifically, the following may be mentioned:
an entity determining unit 301, configured to determine a first group of entities associated with the first text and a second group of entities associated with the second text based on a preset knowledge base, where the preset knowledge base includes knowledge representations composed of entities, relationships between the entities, and entity attributes;
a first relevance determining unit 302 for determining entity relevance between the first set of entities and the second set of entities based on the knowledge representation;
an attention determining unit 303, configured to determine an attention value of each word in the first text and the second text with respect to other words according to an association relationship between each word in the first text, an association relationship between each word in the second text, and an association relationship between a word in the first text and a word in the second text, where the attention value is used to reflect a degree of attention of each word in the first text and the second text to other words;
a second relevance determining unit 304, configured to determine a text relevance of the first text and the second text according to at least the attention value and the entity relevance.
In some embodiments, the attention determining unit 303 may be configured to:
calculating the correlation degree between each word in the first text and other words according to the correlation relationship between each word in the first text and the correlation relationship between the words in the first text and the words in the second text;
calculating the correlation degree between each word in the second text and other words according to the correlation relationship between each word in the second text and the correlation relationship between the words in the first text and the words in the second text;
determining the attention value of each word in the first text and the second text with respect to other words according to the degree of correlation between each word in the first text and other words and the degree of correlation between each word in the second text and other words.
In some embodiments, the apparatus may further comprise:
the constructing unit is used for respectively constructing a feature matrix corresponding to the first text and a feature matrix corresponding to the second text to obtain a first feature matrix and a second feature matrix;
the splicing unit is used for splicing the first characteristic matrix and the second characteristic matrix to obtain a spliced matrix;
the attention unit 303 may also be used to:
normalizing the correlation degree between each word and other words in the first text and the correlation degree between each word and other words in the second text;
weighting the splicing matrix according to the correlation degree after normalization processing to obtain a weighted matrix;
determining an attention value of each word in the first text and the second text with respect to other words based on the weighted matrix.
In some embodiments, the building unit may be specifically configured to:
performing word segmentation processing on a first text and a second text to obtain a first group of words associated with the first text and a second group of words associated with the second text;
constructing a first vector representation of each word in the first set of words based on each word in the first set of words and a position of each word in a first text;
constructing a second vector representation of each word in the second set of words based on each word in the second set of words and a position of each word in a second text;
a first feature matrix is determined from at least the constructed first vector representation and a second feature matrix is determined from at least the constructed second vector representation.
In some embodiments, the building unit may be further operable to:
splicing the constructed first vector representations to obtain a first sub-matrix;
recognizing a first entity word from a first group of words based on the preset knowledge base, determining a first knowledge element related to the first entity word from the preset knowledge base, and splicing vector representations of the first knowledge element according to the position of the first entity word in a first text and a knowledge representation formed by the first knowledge element to obtain a second sub-matrix, wherein the first knowledge element comprises: the first related entity corresponding to the first entity word and having a relationship with a first target entity in a preset knowledge base, the relationship between the first target entity and the first related entity and/or the entity attribute of the first target entity; and
splicing the constructed second vector representations to obtain a third sub-matrix;
recognizing a second entity word from a second group of words based on the preset knowledge base, determining a second knowledge element related to the second entity word from the preset knowledge base, and splicing vector representations of the second knowledge element according to the position of the second entity word in a second text and knowledge representation formed by the second knowledge element to obtain a fourth sub-matrix, wherein the second knowledge element comprises: a second related entity that corresponds to the second entity word and has a relationship with a second target entity in the preset knowledge base, the relationship between the second target entity and the second related entity and/or the entity attribute of the second target entity;
determining the second feature matrix based on the third sub-matrix and the fourth sub-matrix.
In some embodiments, the first correlation determination unit 302 may be configured to:
determining a first number of entities of the first and second sets of entities having the same designation;
determining a second number of entities in the first and second sets of entities having the same identity in the knowledge base, wherein the identity of an entity uniquely identifies the entity in the predetermined knowledge base;
and determining the entity relevance according to the first number and the second number.
In some embodiments, the first text is a text to be retrieved, and the second text is a candidate text; the first correlation determination unit 302 may be further configured to:
determining entity word similarity of the text to be retrieved and the candidate text based on the first number and the number of the entities of the first group of entities;
determining entity identification similarity of the text to be retrieved and the candidate text based on the second number and the entity number of the first group of entities;
and determining the entity correlation degree according to the entity word similarity degree and the entity identification similarity degree.
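The entity relevance described by these units can be sketched as follows. Entity-word similarity uses entities with the same name (the first number) and entity-identification similarity uses entities with the same knowledge-base identification (the second number); the combination weight `alpha` is an illustrative assumption.

```python
def entity_relevance(first_entities, second_entities, alpha=0.5):
    """Sketch of entity relevance between a text to be retrieved and a
    candidate text; entities are (name, kb_id) pairs (format assumed)."""
    if not first_entities:
        return 0.0
    names1 = {n for n, _ in first_entities}
    names2 = {n for n, _ in second_entities}
    ids1 = {i for _, i in first_entities}
    ids2 = {i for _, i in second_entities}
    first_number = len(names1 & names2)     # entities with the same designation
    second_number = len(ids1 & ids2)        # entities with the same KB identification
    word_sim = first_number / len(first_entities)
    id_sim = second_number / len(first_entities)
    return alpha * word_sim + (1 - alpha) * id_sim   # illustrative combination

q_entities = [("Zhang San", "Q1")]
a_entities = [("Zhang San", "Q1"), ("Li Si", "Q2")]
print(entity_relevance(q_entities, a_entities))   # full overlap of the single question entity
```

Comparing knowledge-base identifications as well as surface names lets the score distinguish two mentions that share a name but refer to different entities.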
In some embodiments, the apparatus may further comprise a third correlation determination unit to:
determining a third number of identical words in the first text and the second text;
determining word relevancy of the first text and the second text according to the third number and the number of words in the first text;
the second correlation unit may specifically be configured to:
and determining the text relevance of the first text and the second text according to the attention value, the entity relevance and the word relevance.
The text relevance determination apparatus provided by this scheme determines a first group of entities associated with a first text and a second group of entities associated with a second text based on a preset knowledge base, wherein the preset knowledge base comprises knowledge representations composed of entities, relationships among the entities and entity attributes; determines the entity relevance between the first group of entities and the second group of entities according to the knowledge representation; determines the attention value of each word in the first text and the second text with respect to other words according to the association relationship among the words in the first text, the association relationship among the words in the second text, and the association relationship between the words in the first text and the words in the second text; and determines the text relevance of the first text and the second text at least according to the attention value and the entity relevance. According to the scheme, the relationships among the words within each text and between the two texts are taken into account during the text relevance calculation; based on these relationships, the weight of useful information is increased and the weight of useless information is reduced, which improves the accuracy of the text relevance calculation result.
The embodiment of the application further provides electronic equipment which can be terminal equipment such as a smart phone and a tablet computer. As shown in fig. 5, the electronic device may include Radio Frequency (RF) circuitry 601, memory 602 including one or more computer-readable storage media, input unit 603, display unit 604, sensor 605, audio circuitry 606, Wireless Fidelity (WiFi) module 607, processor 608 including one or more processing cores, and power supply 609. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 5 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the RF circuit 601 may be used for receiving and transmitting signals during a message transmission or communication process, and in particular, for receiving downlink messages from a base station and then processing the received downlink messages by one or more processors 608; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuit 601 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 601 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 602 may be used to store software programs and modules, and the processor 608 executes various functional applications and data processing by executing the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the stored data area may store data (such as audio data, a phonebook, etc.) created according to the use of the electronic device, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 608 and the input unit 603 access to the memory 602.
The input unit 603 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in one particular embodiment, input unit 603 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 608, and can receive and execute commands sent by the processor 608. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 603 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 604 may be used to display information input by or provided to a user and various graphical user interfaces of the electronic device, which may be made up of graphics, text, icons, video, and any combination thereof. The display unit 604 may include a display panel, and optionally, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 608 to determine the type of touch event, and the processor 608 then provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 5 the touch-sensitive surface and the display panel are implemented as two separate components for input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel for input and output functions.
The electronic device may also include at least one sensor 605, such as a light sensor, motion sensor, and other sensors. In particular, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or the backlight when the electronic device is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping) and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which may be further configured to the electronic device, detailed descriptions thereof are omitted.
Audio circuitry 606, a speaker, and a microphone may provide an audio interface between a user and the electronic device. The audio circuit 606 may transmit the electrical signal converted from received audio data to the speaker, which converts it into a sound signal for output; conversely, the microphone converts a collected sound signal into an electrical signal, which is received by the audio circuit 606 and converted into audio data; the audio data is then output to the processor 608 for processing and sent via the RF circuit 601 to, for example, another electronic device, or output to the memory 602 for further processing. The audio circuitry 606 may also include an earbud jack to provide communication of a peripheral headset with the electronic device.
WiFi belongs to short-distance wireless transmission technology, and the electronic device can help the user send and receive e-mail, browse web pages, access streaming media, etc. through the WiFi module 607, and it provides the user with wireless broadband internet access. Although fig. 5 shows the WiFi module 607, it is understood that it does not belong to the essential constitution of the electronic device, and it can be omitted entirely within the scope not changing the essence of the invention as needed.
The processor 608 is a control center of the electronic device, connects various parts of the entire cellular phone using various interfaces and lines, performs various functions of the electronic device and processes data by operating or executing software programs and/or modules stored in the memory 602, and calling data stored in the memory 602, thereby monitoring the cellular phone as a whole. Optionally, processor 608 may include one or more processing cores; preferably, the processor 608 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 608.
The electronic device also includes a power supply 609 (e.g., a battery) for powering the various components. Preferably, the power supply is logically coupled to the processor 608 via a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 609 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components.
Although not shown, the electronic device may further include a camera, a Bluetooth module, and the like, which are not described in detail here. Specifically, in this embodiment, the processor 608 in the electronic device loads the executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and runs the application programs stored in the memory 602 to implement the following functions:
determining a first group of entities associated with a first text and a second group of entities associated with a second text based on a preset knowledge base, wherein the preset knowledge base comprises knowledge representations composed of entities, relationships between the entities, and entity attributes; determining an entity relevance between the first group of entities and the second group of entities based on the knowledge representations; determining an attention value of each word in the first text and the second text with respect to the other words according to the association relationships among the words in the first text, the association relationships among the words in the second text, and the association relationships between the words in the first text and the words in the second text, wherein the attention values reflect the degree of attention each word in the first text and the second text pays to the other words; and determining the text relevance of the first text and the second text according to at least the attention values and the entity relevance.
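By way of illustration only — the following sketch is not part of the application's disclosure — the entity-determination step can be pictured with a toy knowledge base. The entity ids, names, relations, and attributes below are all hypothetical:

```python
# Hypothetical toy knowledge base: each entity id maps to its surface names,
# its relationships to other entities, and its attributes.
KNOWLEDGE_BASE = {
    "E1": {"names": {"apple", "apple inc"},
           "relations": {("founded_by", "E2")},
           "attributes": {"type": "company"}},
    "E2": {"names": {"steve jobs"},
           "relations": {("founded", "E1")},
           "attributes": {"type": "person"}},
}

def entities_for(words, kb=KNOWLEDGE_BASE):
    # Return the ids of all entities whose names occur in the word list.
    # For simplicity, a multi-word name matches only as a single token here.
    word_set = {w.lower() for w in words}
    return {eid for eid, entry in kb.items() if entry["names"] & word_set}
```

The first and second groups of entities would then be obtained as `entities_for(words_of_first_text)` and `entities_for(words_of_second_text)`.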
When calculating text relevance, the electronic device provided by the embodiment of the present application can attend to the relationships among the words within each text and between the words of the two texts, and on that basis increase the weight of useful information and reduce the weight of useless information, thereby improving the accuracy of the text relevance calculation result.
The embodiment of the present application also provides a server, which may specifically be an application server. As shown in fig. 6, the server may include radio frequency (RF) circuitry 701, a memory 702 including one or more computer-readable storage media, a processor 704 including one or more processing cores, and a power supply 703. Those skilled in the art will appreciate that the server structure shown in fig. 6 is not limiting; the server may include more or fewer components than shown, combine some components, or arrange the components differently. Wherein:
the RF circuitry 701 may be used for receiving and sending signals during the sending or receiving of information or during a call. In particular, it receives downlink information from a base station and delivers it to the one or more processors 704 for processing, and it sends uplink data to the base station. In general, the RF circuitry 701 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, a low noise amplifier (LNA), and a duplexer. In addition, the RF circuitry 701 may communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, and Short Message Service (SMS).
The memory 702 may be used to store software programs and modules; the processor 704 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 702. The memory 702 may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the server (such as audio data or a phonebook). Further, the memory 702 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 702 may also include a memory controller to provide the processor 704 with access to the memory 702.
The processor 704 is the control center of the server. It connects the various parts of the entire server using various interfaces and lines, and performs the various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 702 and calling data stored in the memory 702, thereby monitoring the server as a whole. Optionally, the processor 704 may include one or more processing cores; preferably, the processor 704 may integrate an application processor, which primarily handles the operating system, user interfaces, and applications, and a modem processor, which primarily handles wireless communication. It will be appreciated that the modem processor need not be integrated into the processor 704.
The server also includes a power supply 703 (e.g., a battery) for powering the various components. Preferably, the power supply is logically connected to the processor 704 via a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 703 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components.
Specifically, in this embodiment, the processor 704 in the server loads the executable files corresponding to the processes of one or more application programs into the memory 702 according to the following instructions, and runs the application programs stored in the memory 702 to implement the following functions:
determining a first group of entities associated with a first text and a second group of entities associated with a second text based on a preset knowledge base, wherein the preset knowledge base comprises knowledge representations composed of entities, relationships between the entities, and entity attributes; determining an entity relevance between the first group of entities and the second group of entities based on the knowledge representations; determining an attention value of each word in the first text and the second text with respect to the other words according to the association relationships among the words in the first text, the association relationships among the words in the second text, and the association relationships between the words in the first text and the words in the second text, wherein the attention values reflect the degree of attention each word in the first text and the second text pays to the other words; and determining the text relevance of the first text and the second text according to at least the attention values and the entity relevance.
When calculating text relevance, the server provided by the embodiment of the present application can attend to the relationships among the words within each text and between the words of the two texts, and on that basis increase the weight of useful information and reduce the weight of useless information, thereby improving the accuracy of the text relevance calculation result.
It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be completed by instructions, or by instructions controlling relevant hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer-readable storage medium storing a plurality of instructions that can be loaded by a processor to execute the steps of any text relevance determination method provided in the embodiments of the present application. For example, the instructions may perform the following steps:
determining a first group of entities associated with a first text and a second group of entities associated with a second text based on a preset knowledge base, wherein the preset knowledge base comprises knowledge representations composed of entities, relationships between the entities, and entity attributes; determining an entity relevance between the first group of entities and the second group of entities based on the knowledge representations; determining an attention value of each word in the first text and the second text with respect to the other words according to the association relationships among the words in the first text, the association relationships among the words in the second text, and the association relationships between the words in the first text and the words in the second text, wherein the attention values reflect the degree of attention each word in the first text and the second text pays to the other words; and determining the text relevance of the first text and the second text according to at least the attention values and the entity relevance.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
The storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Since the instructions stored in the storage medium can execute the steps of any text relevance determination method provided in the embodiments of the present application, they can achieve the beneficial effects of any such method. For details, see the foregoing embodiments, which are not repeated here.
The method, apparatus, storage medium, and electronic device for determining text relevance provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present application, and the description of the embodiments is intended only to help in understanding the method and its core idea. Meanwhile, those skilled in the art may, following the idea of the present application, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (15)

1. A method for determining text relevancy, comprising:
determining a first group of entities associated with the first text and a second group of entities associated with the second text based on a preset knowledge base, wherein the preset knowledge base comprises knowledge representations consisting of entities, relationships among the entities and entity attributes;
determining entity relatedness between the first set of entities and the second set of entities based on the knowledge representation;
determining an attention value of each word in the first text and the second text with respect to the other words according to the association relationships among the words in the first text, the association relationships among the words in the second text, and the association relationships between the words in the first text and the words in the second text, wherein the attention values reflect the degree of attention each word in the first text and the second text pays to the other words;
and determining the text relevance of the first text and the second text at least according to the attention value and the entity relevance.
2. The method for determining the relevancy of the text according to claim 1, wherein the determining the attention value of each word in the first text and the second text with respect to other words according to the association relationship between each word in the first text, the association relationship between each word in the second text, and the association relationship between the word in the first text and the word in the second text comprises:
calculating a degree of correlation between each word in the first text and the other words according to the association relationships among the words in the first text and the association relationships between the words in the first text and the words in the second text;
calculating a degree of correlation between each word in the second text and the other words according to the association relationships among the words in the second text and the association relationships between the words in the first text and the words in the second text;
and determining the attention value of each word in the first text and the second text relative to other words according to the correlation degree between each word in the first text and other words and the correlation degree between each word in the second text and other words.
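The attention computation recited in claim 2 can be sketched as follows. This is an assumed, minimal rendering, not the patented implementation: the association relationships are modeled as dot products of hypothetical word vectors, and the attention values as a row-wise softmax over all words of both texts:

```python
import math

def correlations(vecs1, vecs2):
    # vecs1 / vecs2: lists of equal-length word vectors for the two texts.
    # Each word's correlation with every word (within and across texts)
    # is modeled here as a plain dot product — an illustrative choice.
    all_vecs = vecs1 + vecs2
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return [[dot(u, v) for v in all_vecs] for u in all_vecs]

def attention_values(corr):
    # Softmax each row so one word's attention over all words sums to 1.
    out = []
    for row in corr:
        m = max(row)  # subtract the max for numerical stability
        exp = [math.exp(c - m) for c in row]
        s = sum(exp)
        out.append([e / s for e in exp])
    return out
```

A word then pays more attention to words whose vectors correlate strongly with its own, and less to the rest, which is the weighting effect described in the specification.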
3. The text relevance determination method of claim 2, further comprising:
respectively constructing a feature matrix corresponding to the first text and a feature matrix corresponding to the second text to obtain a first feature matrix and a second feature matrix;
splicing the first characteristic matrix and the second characteristic matrix to obtain a spliced matrix;
the determining the attention value of each word in the first text and the second text with respect to other words according to the correlation between each word in the first text and other words and the correlation between each word in the second text and other words comprises:
normalizing the correlation degree between each word and other words in the first text and the correlation degree between each word and other words in the second text;
weighting the splicing matrix according to the correlation degree after normalization processing to obtain a weighted matrix; determining an attention value of each word in the first text and the second text with respect to other words based on the weighted matrix.
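The splicing, normalization, and weighting steps of claim 3 admit a minimal sketch of the following form; the matrix shapes and the row-wise weighting rule are assumptions, not taken from the claim text:

```python
def splice(m1, m2):
    # Stack the rows of the first and second feature matrices.
    return m1 + m2

def normalize(degrees):
    # Normalize the per-word correlation degrees so they sum to 1.
    s = sum(degrees)
    return [d / s for d in degrees] if s else degrees

def weight_rows(matrix, degrees):
    # Weight each row of the spliced matrix by its normalized degree.
    w = normalize(degrees)
    return [[w[i] * x for x in row] for i, row in enumerate(matrix)]
```

The weighted matrix then serves as the basis for the per-word attention values.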
4. The method according to claim 3, wherein the step of respectively constructing the feature matrix corresponding to the first text and the feature matrix corresponding to the second text to obtain the first feature matrix and the second feature matrix comprises:
performing word segmentation processing on a first text and a second text to obtain a first group of words associated with the first text and a second group of words associated with the second text;
constructing a first vector representation of each word in the first set of words based on each word in the first set of words and a position of each word in a first text;
constructing a second vector representation of each word in the second set of words based on each word in the second set of words and a position of each word in a second text;
a first feature matrix is determined from at least the constructed first vector representation and a second feature matrix is determined from at least the constructed second vector representation.
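A minimal sketch of claim 4's construction, assuming a toy embedding table and a one-hot position feature; the segmenter, table, and dimensions are all hypothetical:

```python
def segment(text):
    # Placeholder word segmentation: whitespace split stands in for a
    # real segmenter (the claim does not fix a segmentation method).
    return text.split()

def word_position_vectors(words, embed, max_len):
    # Build each word's vector from its embedding concatenated with a
    # one-hot encoding of its position in the text.
    vectors = []
    for pos, w in enumerate(words):
        pos_onehot = [1.0 if i == pos else 0.0 for i in range(max_len)]
        vectors.append(embed.get(w, [0.0, 0.0]) + pos_onehot)
    return vectors
```

Stacking these vectors row by row yields the first (or second) feature matrix of the claim.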
5. The method for determining text relevance according to claim 4, wherein the determining a first feature matrix from at least the constructed first vector representation comprises:
splicing the constructed first vector representations to obtain a first sub-matrix;
recognizing a first entity word from the first group of words based on the preset knowledge base, determining a first knowledge element related to the first entity word from the preset knowledge base, and splicing vector representations of the first knowledge element according to the position of the first entity word in the first text and the knowledge representation formed by the first knowledge element to obtain a second sub-matrix, wherein the first knowledge element comprises: a first related entity having a relationship with a first target entity, the first target entity being the entity in the preset knowledge base corresponding to the first entity word; the relationship between the first target entity and the first related entity; and/or an entity attribute of the first target entity;
determining the first feature matrix based on the first sub-matrix and the second sub-matrix;
the determining a second feature matrix from at least the constructed second vector representation comprises:
splicing the constructed second vector representations to obtain a third sub-matrix;
recognizing a second entity word from the second group of words based on the preset knowledge base, determining a second knowledge element related to the second entity word from the preset knowledge base, and splicing vector representations of the second knowledge element according to the position of the second entity word in the second text and the knowledge representation formed by the second knowledge element to obtain a fourth sub-matrix, wherein the second knowledge element comprises: a second related entity having a relationship with a second target entity, the second target entity being the entity in the preset knowledge base corresponding to the second entity word; the relationship between the second target entity and the second related entity; and/or an entity attribute of the second target entity;
determining the second feature matrix based on the third sub-matrix and the fourth sub-matrix.
6. The method of claim 1, wherein determining entity relevance between the first set of entities and the second set of entities based on the knowledge representation comprises:
determining a first number of entities in the first group of entities and the second group of entities that have the same name;
determining a second number of entities in the first group of entities and the second group of entities that have the same identifier in the preset knowledge base, wherein the identifier of an entity uniquely identifies the entity in the preset knowledge base;
and determining the entity relevance according to the first number and the second number.
7. The method for determining the relevancy of a text according to claim 6, wherein the first text is a text to be retrieved, and the second text is a candidate text;
the determining the entity relevance according to the first number and the second number comprises:
determining entity word similarity of the text to be retrieved and the candidate text based on the first number and the number of the entities of the first group of entities;
determining entity identification similarity of the text to be retrieved and the candidate text based on the second number and the entity number of the first group of entities;
and determining the entity correlation degree according to the entity word similarity degree and the entity identification similarity degree.
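Claims 6 and 7 together can be sketched as follows. The equal weighting of the two similarities is an assumption; the claims do not fix a combination rule:

```python
def entity_relatedness(group1, group2, name_of, beta=0.5):
    # group1 / group2: knowledge-base identifiers of the two entity groups;
    # name_of: identifier -> surface name (entity word).
    names1 = {name_of[e] for e in group1}
    names2 = {name_of[e] for e in group2}
    first_number = len(names1 & names2)             # entities with the same name
    second_number = len(set(group1) & set(group2))  # same KB identifier
    n = len(group1) or 1                            # size of the first group
    entity_word_similarity = first_number / n
    entity_id_similarity = second_number / n
    return beta * entity_word_similarity + (1 - beta) * entity_id_similarity
```

Distinguishing the name count from the identifier count lets the score separate entities that merely share a surface form from entities that resolve to the same knowledge-base entry.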
8. The method for determining text relevance of claim 7, further comprising:
determining a third number of identical words in the first text and the second text;
determining word relevancy of the first text and the second text according to the third number and the number of words in the first text;
the determining the text relevance of the first text and the second text according to at least the attention value and the entity relevance comprises:
and determining the text relevance of the first text and the second text according to the attention value, the entity relevance and the word relevance.
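Claim 8's word relevance and the final combination can be sketched as follows; the combination weights are illustrative only, since the claim leaves the combination rule open:

```python
def word_relatedness(words1, words2):
    # Ratio of identical words to the number of words in the first text.
    third_number = len(set(words1) & set(words2))
    return third_number / len(words1) if words1 else 0.0

def text_relevance(attention_score, entity_score, word_score,
                   w_att=0.5, w_ent=0.3, w_word=0.2):
    # Hypothetical weighted combination of the three component scores.
    return w_att * attention_score + w_ent * entity_score + w_word * word_score
```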
9. A device for determining a degree of text relevance, comprising:
an entity determining unit, configured to determine a first group of entities associated with the first text and a second group of entities associated with the second text based on a preset knowledge base, where the preset knowledge base includes knowledge representations composed of entities, relationships between the entities, and entity attributes;
a first relevance determining unit for determining entity relevance between the first set of entities and the second set of entities based on the knowledge representation;
an attention determining unit, configured to determine an attention value of each word in the first text and the second text with respect to other words according to an association relationship between each word in the first text, an association relationship between each word in the second text, and an association relationship between a word in the first text and a word in the second text, where the attention value is used to reflect a degree of attention of each word in the first text and the second text to other words;
and the second relevance determining unit is used for determining the text relevance of the first text and the second text according to at least the attention value and the entity relevance.
10. The apparatus for determining a degree of text relevance according to claim 9, wherein the attention determining unit is configured to:
calculating a degree of correlation between each word in the first text and the other words according to the association relationships among the words in the first text and the association relationships between the words in the first text and the words in the second text;
calculating a degree of correlation between each word in the second text and the other words according to the association relationships among the words in the second text and the association relationships between the words in the first text and the words in the second text;
and determining the attention value of each word in the first text and the second text relative to other words according to the correlation degree between each word in the first text and other words and the correlation degree between each word in the second text and other words.
11. The apparatus for determining a degree of text relevance according to claim 10, further comprising:
the constructing unit is used for respectively constructing a feature matrix corresponding to the first text and a feature matrix corresponding to the second text to obtain a first feature matrix and a second feature matrix;
the splicing unit is used for splicing the first characteristic matrix and the second characteristic matrix to obtain a spliced matrix;
the attention determining unit is further configured to:
normalizing the correlation degree between each word and other words in the first text and the correlation degree between each word and other words in the second text;
weighting the splicing matrix according to the correlation degree after normalization processing to obtain a weighted matrix;
determining an attention value of each word in the first text and the second text with respect to other words based on the weighted matrix.
12. The apparatus for determining a degree of text relevance according to claim 9, wherein the first relevance determining unit is configured to:
determining a first number of entities in the first group of entities and the second group of entities that have the same name;
determining a second number of entities in the first group of entities and the second group of entities that have the same identifier in the preset knowledge base, wherein the identifier of an entity uniquely identifies the entity in the preset knowledge base;
and determining the entity relevance according to the first number and the second number.
13. The apparatus for determining a degree of text relevance according to claim 12, wherein the first text is a text to be retrieved, and the second text is a candidate text; the first relevance determining unit is further configured to:
determining entity word similarity of the text to be retrieved and the candidate text based on the first number and the number of the entities of the first group of entities;
determining entity identification similarity of the text to be retrieved and the candidate text based on the second number and the entity number of the first group of entities;
and determining the entity correlation degree according to the entity word similarity degree and the entity identification similarity degree.
14. A computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor to perform the method for determining text relevance according to any of claims 1-8.
15. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for determining text relevance as claimed in any one of claims 1 to 8 when executing the program.
CN202010350443.9A 2020-04-28 2020-04-28 Text relevance determining method and device, storage medium and electronic equipment Pending CN111553163A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010350443.9A CN111553163A (en) 2020-04-28 2020-04-28 Text relevance determining method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010350443.9A CN111553163A (en) 2020-04-28 2020-04-28 Text relevance determining method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN111553163A true CN111553163A (en) 2020-08-18

Family

ID=71998248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010350443.9A Pending CN111553163A (en) 2020-04-28 2020-04-28 Text relevance determining method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111553163A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560466A (en) * 2020-12-24 2021-03-26 北京百度网讯科技有限公司 Link entity association method and device, electronic equipment and storage medium
CN112560466B (en) * 2020-12-24 2023-07-25 北京百度网讯科技有限公司 Link entity association method, device, electronic equipment and storage medium
CN113032580A (en) * 2021-03-29 2021-06-25 浙江星汉信息技术股份有限公司 Associated file recommendation method and system and electronic equipment

Similar Documents

Publication Publication Date Title
CN107943860B (en) Model training method, text intention recognition method and text intention recognition device
CN110852100B (en) Keyword extraction method and device, electronic equipment and medium
CN109918669B (en) Entity determining method, device and storage medium
CN109033156B (en) Information processing method and device and terminal
CN111985240B (en) Named entity recognition model training method, named entity recognition method and named entity recognition device
CN111177371B (en) Classification method and related device
CN110334344A (en) A kind of semanteme intension recognizing method, device, equipment and storage medium
CN110162600B (en) Information processing method, session response method and session response device
CN111368525A (en) Information searching method, device, equipment and storage medium
CN112131401B (en) Concept knowledge graph construction method and device
CN111177180A (en) Data query method and device and electronic equipment
CN110209810A (en) Similar Text recognition methods and device
CN110852109A (en) Corpus generating method, corpus generating device, and storage medium
CN111339737B (en) Entity linking method, device, equipment and storage medium
CN111428522B (en) Translation corpus generation method, device, computer equipment and storage medium
CN110597957B (en) Text information retrieval method and related device
CN111553163A (en) Text relevance determining method and device, storage medium and electronic equipment
CN112818080B (en) Searching method, searching device, searching equipment and storage medium
CN114357278B (en) Topic recommendation method, device and equipment
CN113822038A (en) Abstract generation method and related device
CN112488157A (en) Dialog state tracking method and device, electronic equipment and storage medium
WO2021147421A1 (en) Automatic question answering method and apparatus for man-machine interaction, and intelligent device
CN111428523B (en) Translation corpus generation method, device, computer equipment and storage medium
CN112070586B (en) Item recommendation method and device based on semantic recognition, computer equipment and medium
CN113569043A (en) Text category determination method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40029143

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination