CN113378018B - Header list entity relationship matching method based on deep learning multi-head selection model - Google Patents


Info

Publication number
CN113378018B
CN113378018B (application CN202110936805.7A)
Authority
CN
China
Prior art keywords
head
header
deep learning
selection model
model
Prior art date
Legal status
Active
Application number
CN202110936805.7A
Other languages
Chinese (zh)
Other versions
CN113378018A (en)
Inventor
高永伟
李曙光
宋万军
姜广栋
杨万刚
李峰
蔡晨
陈玉冰
皮乾东
黄昌彬
杜俊杰
张鑫涛
Current Assignee
Nanjing Fiberhome Telecommunication Technologies Co ltd
Original Assignee
Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Fiberhome Telecommunication Technologies Co ltd filed Critical Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority to CN202110936805.7A
Publication of CN113378018A
Application granted
Publication of CN113378018B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent


Abstract

The invention discloses a header column entity relationship matching method based on a deep learning multi-head selection model, which comprises the following steps: defining data entity attribute categories such as time, person name and company name for the data items of a table, and constructing a regular-expression recognition method; constructing artificial features for any two-column combination of the header; passing the header character sequence and its corresponding data attribute sequence through respective embedding layers; adopting a bi-lstm structure for the coding layer; splicing the context encodings of any two positions pairwise based on a multi-head selection mechanism; computing a binary loss value for every position pair of the header sequence against each relation category; and retaining the model whose loss value converges to the optimum as the model used for prediction. Matching header column entity relationships through the header and the model in this way is convenient, accurate, and low in start-up cost.

Description

Header list entity relationship matching method based on deep learning multi-head selection model
Technical Field
The invention relates to the technical field of table head list entities, in particular to a table head list entity relation matching method based on a deep learning multi-head selection model.
Background
Header column entity relationship matching judges the correspondence between two columns of entities in a table and plays an important role in table information mining. We therefore improve on existing methods and propose a header list entity relationship matching method based on a deep learning multi-head selection model.
Disclosure of Invention
In order to solve the technical problems, the invention provides the following technical scheme:
the invention relates to a table head list entity relationship matching method based on a deep learning multi-head selection model, which comprises the following steps:
the method comprises the following steps: defining data entity attribute categories including time, name and company name for data items of a table, and constructing a regular identification method;
step two: constructing artificial features of any two columns of combinations of the header, wherein the construction mode of the artificial features can be selected according to the actual scene requirements, and meanwhile, recording the problem of relationship matching of any two columns of entities of the header;
step three: after the header character sequence and the data attribute sequence corresponding to the header pass through respective embedding layers, merging the vectors as the input of the next coding layer;
step four: the coding layer adopts a bi-lstm model structure, and outputs each position code which is a header sequence and fuses context coding information;
step five: carrying out pairwise combination splicing on the context coding information at any position based on a multi-head selection model, and then carrying out relation classification on splicing vectors;
step six: calculating the binary loss value of any pairwise position of the header sequence to each relation category, and then performing back propagation by using the loss value to update the model parameters;
step seven: the loss values are converged to the optimal model retention and used as a model for prediction.
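Steps three through seven can be sketched end to end. The following is a minimal runnable sketch, not the patented implementation: the vocabularies, sizes, and labels are toy assumptions, and a cumulative-mean context encoder stands in for the bi-lstm coding layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies (assumed): header characters and data-attribute tags.
chars = {c: i for i, c in enumerate("abcdefgh")}
attrs = {a: i for i, a in enumerate(["time", "name", "company", "none"])}
D, R = 8, 3                                   # embedding size, relation categories

char_emb = rng.normal(size=(len(chars), D))
attr_emb = rng.normal(size=(len(attrs), D))

def embed(word, attr_seq):
    """Step three: concatenate character and attribute embeddings per position."""
    w = char_emb[[chars[c] for c in word]]
    a = attr_emb[[attrs[t] for t in attr_seq]]
    return np.concatenate([w, a], axis=-1)    # (seq_len, 2D)

def encode(e):
    """Stand-in for the bi-lstm coding layer (step four): left and right
    cumulative means, so each position fuses context from both directions."""
    left = np.cumsum(e, 0) / np.arange(1, len(e) + 1)[:, None]
    right = np.cumsum(e[::-1], 0)[::-1] / np.arange(len(e), 0, -1)[:, None]
    return np.concatenate([left, right], axis=-1)

def pair_logits(u, W):
    """Step five: splice u_i and u_j pairwise and score every relation."""
    n = len(u)
    pairs = np.concatenate([np.repeat(u, n, 0), np.tile(u, (n, 1))], axis=-1)
    return pairs @ W                          # (n*n, R)

e = embed("abcd", ["time", "name", "none", "company"])
u = encode(e)
W = rng.normal(size=(2 * u.shape[-1], R)) * 0.1
logits = pair_logits(u, W)
probs = 1.0 / (1.0 + np.exp(-logits))         # sigmoid per relation category
# Step six: binary loss for every position pair against each relation category
# (labels are random here purely to make the sketch executable).
y = rng.integers(0, 2, size=probs.shape)
loss = -np.mean(y * np.log(probs) + (1 - y) * np.log(1 - probs))
```

In the real method the gradient of this loss would be back-propagated through the bi-lstm and the embedding layers (step six), and the best-converging checkpoint kept for prediction (step seven).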
As a preferred technical solution of the present invention, the header in step one is text composed of names of television-episode actors, names of the companies to which they belong, company telephones, company addresses, shooting time, shooting addresses, work mailboxes, director names, names of the companies to which they belong, episode lengths and first showing time, and a header column refers to the text of each individual column.
As a preferred technical solution of the present invention, in step three the header character sequence words and the attribute sequence attrs are mapped through two embedding matrices respectively; assuming the character at position i is converted to w_{i} and the attribute at position i is converted to a_{i}, the merged embedding vector at position i is e_{i} = [w_{i}; a_{i}].
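The merged embedding e_{i} = [w_{i}; a_{i}] can be illustrated directly with NumPy; the matrix sizes and index values below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
word_emb = rng.normal(size=(10, 5))   # character embedding matrix (toy sizes)
attr_emb = rng.normal(size=(4, 5))    # attribute embedding matrix

word_ids = [3, 7, 1]                  # header character ids at positions 0..2
attr_ids = [0, 0, 2]                  # attribute tag ids at the same positions

# e_i = [w_i ; a_i]: concatenate the two lookups at each position i.
e = np.concatenate([word_emb[word_ids], attr_emb[attr_ids]], axis=-1)
```

Each row of e then feeds the coding layer as a single position vector.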
As a preferred technical solution of the present invention, in step four bi-lstm is used as the coding layer for the header character sequence; during encoding, the attributes of the table data items are introduced, and these attributes can be obtained by regular-expression matching.
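The regular matching of data-item attributes might look as follows; the patterns and the `attribute_of` helper are hypothetical, since the patent does not give concrete regular expressions.

```python
import re

# Hypothetical patterns for the attribute categories named in the text
# (time, person name, company name); real patterns depend on the data.
PATTERNS = [
    ("time", re.compile(r"^\d{4}[-/]\d{1,2}[-/]\d{1,2}$")),
    ("company", re.compile(r".+(Co\.?,? ?Ltd|Inc\.|Corp\.)$", re.IGNORECASE)),
    ("name", re.compile(r"^[A-Z][a-z]+(?: [A-Z][a-z]+)+$")),
]

def attribute_of(cell: str) -> str:
    """Return the first matching attribute category for a data item, or 'none'."""
    for label, pattern in PATTERNS:
        if pattern.match(cell):
            return label
    return "none"
```

The resulting attribute tags form the sequence attrs that is embedded alongside the header characters in step three.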
As a preferred technical solution of the present invention, in step five the context encoding information is taken at any two positions i and j of the input header character sequence, which after bi-lstm encoding are finally represented as u_{i} and u_{j}, and a pairwise position feature m_{ij} is artificially constructed for each pair.
As a preferred technical solution of the present invention, for the input header character sequence, the score that the relation label between position i and position j is r_{k} is computed as:
score(u_{i}, u_{j}, m_{ij}, r_{k}) = V · f(U·u_{i} + W·u_{j} + m_{ij} + b), where V, U, W, b are weight parameters and f(·) is an activation function, which may be a relu or tanh activation function; the probability that the relation between position i and position j is r_{k} is then obtained from the score:
sigmoid(score(u_{i}, u_{j}, m_{ij}, r_{k})), where sigmoid is the sigmoid function.
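The score and probability formulas can be transcribed directly; the dimensions below are assumptions, m_{ij} is assumed to share the hidden dimension since it enters the sum additively, and a separate V (and in practice U, W, b) would be learned per relation label r_{k}, so this sketch covers a single label.

```python
import numpy as np

rng = np.random.default_rng(2)
d, h = 6, 4                        # assumed encoding and hidden dimensions

U = rng.normal(size=(h, d))        # weight parameters as named in the text
W = rng.normal(size=(h, d))
V = rng.normal(size=(h,))
b = rng.normal(size=(h,))

def score(u_i, u_j, m_ij):
    """score(u_i, u_j, m_ij, r_k) = V . f(U u_i + W u_j + m_ij + b), f = tanh."""
    return V @ np.tanh(U @ u_i + W @ u_j + m_ij + b)

def prob(u_i, u_j, m_ij):
    """Probability that positions i and j stand in relation r_k."""
    return 1.0 / (1.0 + np.exp(-score(u_i, u_j, m_ij)))
```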
As a preferred technical solution of the present invention, in the fifth step, when the multi-head selection model is used for determining, fusion vectors of different head rows need to be constructed, and a relationship between any two rows is determined according to the fusion vectors.
As a preferred technical solution of the present invention, in step two the problem of matching the relationship between any two columns of header entities is converted into a deep learning relation extraction problem, and the multi-head selection model is used as the solution to this problem.
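The conversion from column-pair matching to relation extraction amounts to turning every two-column combination of the header into a labeled instance; a sketch with hypothetical column names and relation labels:

```python
from itertools import combinations

# Hypothetical header and gold relations between its columns.
header = ["actor name", "company name", "shooting time"]
gold = {("actor name", "company name"): "employed_by"}  # assumed label

def make_pairs(columns, gold_relations):
    """Turn every two-column combination into a relation-extraction instance."""
    instances = []
    for a, b in combinations(columns, 2):
        rel = gold_relations.get((a, b)) or gold_relations.get((b, a)) or "no_relation"
        instances.append((a, b, rel))
    return instances
```

These (column, column, relation) triples are what the multi-head selection model is trained to recover from the encoded header sequence.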
The invention has the beneficial effects that:
The method converts the problem of matching the relationship between any two columns of header entities into a deep learning relation extraction problem and uses the multi-head selection model as its solution. bi-lstm serves as the coding layer for the header character sequence, and during encoding the attributes of the table data items, obtainable by regular-expression matching, are introduced. When the multi-head selection model is used for judgment, fusion vectors of the different header columns are constructed, and the relationship between any two columns is judged from these fusion vectors, achieving more convenient and accurate matching of header column entity relationships at a lower start-up cost.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a system diagram of a table head list entity relationship matching method based on a deep learning multi-head selection model according to the present invention;
FIG. 2 is a method step diagram of the table head list entity relationship matching method based on the deep learning multi-head selection model.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Embodiment: as shown in figs. 1-2, the header list entity relationship matching method based on the deep learning multi-head selection model of the present invention comprises the following steps:
Step one: defining data entity attribute categories, including time, person name and company name, for the data items of a table, and constructing a regular-expression recognition method;
Step two: constructing artificial features for any two-column combination of the header, where the construction of the artificial features can be chosen according to the needs of the actual scene, and at the same time recording the relationship-matching problem for any two columns of header entities;
Step three: passing the header character sequence and the data attribute sequence corresponding to the header through their respective embedding layers, then merging the vectors as the input of the next coding layer;
Step four: adopting a bi-lstm structure for the coding layer, whose output at each position of the header sequence is an encoding fused with context information;
Step five: splicing the context encodings of any two positions pairwise based on the multi-head selection model, and then performing relation classification on the spliced vectors;
Step six: computing a binary loss value for every position pair of the header sequence against each relation category, and then back-propagating the loss to update the model parameters;
Step seven: retaining the model whose loss value converges to the optimum and using it as the model for prediction.
In step one, the header is text composed of names of television-episode actors, names of the companies to which they belong, company telephones, company addresses, shooting time, shooting addresses, work mailboxes, director names, names of the companies to which they belong, episode lengths and first showing time, and a header column refers to the text of each individual column.
In step three, the header character sequence words and the attribute sequence attrs are mapped through two embedding matrices respectively; assuming the character at position i is converted to w_{i} and the attribute at position i to a_{i}, the merged embedding vector at position i is e_{i} = [w_{i}; a_{i}].
In step four, bi-lstm is used as the coding layer for the header character sequence, and during encoding the attributes of the table data items are introduced; these attributes can be obtained by regular-expression matching.
In step five, the context encoding information is taken at any two positions i and j of the input header character sequence, finally represented after bi-lstm encoding as u_{i} and u_{j}, and a pairwise position feature m_{ij} is artificially constructed.
For the input header character sequence, the score that the relation label between position i and position j is r_{k} is computed as:
score(u_{i}, u_{j}, m_{ij}, r_{k}) = V · f(U·u_{i} + W·u_{j} + m_{ij} + b), where V, U, W, b are weight parameters and f(·) is an activation function, which may be a relu or tanh activation function; the probability that the relation between position i and position j is r_{k} is then obtained from the score:
sigmoid(score(u_{i}, u_{j}, m_{ij}, r_{k})), where sigmoid is the sigmoid function.
In step five, when the multi-head selection model is used for judgment, fusion vectors of the different header columns need to be constructed, and the relationship between any two columns is judged according to these fusion vectors.
In the second step, the relation matching problem of any two columns of entities at the head of the table is converted into a deep learning relation extraction problem, and a multi-head selection model is used as a solution for the problem.
The working principle is as follows. In use, step one defines data entity attribute categories, including time, person name and company name, for the data items of a table and constructs a regular-expression recognition method; the header is text composed of names of television-episode actors, names of the companies to which they belong, company telephones, company addresses, shooting time, shooting addresses, work mailboxes, director names, names of the companies to which they belong, episode lengths and first showing time, and a header column refers to the text of each individual column. Step two constructs artificial features for any two-column combination of the header, with the construction chosen according to the needs of the actual scene; the matching of the relationship between any two columns of header entities is converted into a deep learning relation extraction problem, with the multi-head selection model as its solution. Step three passes the header character sequence and the corresponding data attribute sequence through their respective embedding layers and merges the vectors as the input of the next coding layer: the sequences words and attrs are mapped through two embedding matrices, and assuming the character at position i is converted to w_{i} and the attribute at position i to a_{i}, the merged embedding vector at position i is e_{i} = [w_{i}; a_{i}]. Step four adopts a bi-lstm structure for the coding layer, whose output at each position of the header sequence is an encoding fused with context information; the attributes of the table data items, obtainable by regular-expression matching, are introduced during encoding. Step five splices the context encodings of any two positions i and j pairwise based on the multi-head selection model and classifies the spliced vectors: the positions are finally represented after bi-lstm encoding as u_{i} and u_{j}, a pairwise position feature m_{ij} is artificially constructed, and the score that the relation label between position i and position j is r_{k} is score(u_{i}, u_{j}, m_{ij}, r_{k}) = V · f(U·u_{i} + W·u_{j} + m_{ij} + b), where V, U, W, b are weight parameters and f(·) is an activation function, which may be relu or tanh; the probability that the relation between positions i and j is r_{k} is sigmoid(score(u_{i}, u_{j}, m_{ij}, r_{k})), where sigmoid is the sigmoid function; when the multi-head selection model is used for judgment, fusion vectors of the different header columns are constructed and the relationship between any two columns is judged from them. Step six computes a binary loss value for every position pair of the header sequence against each relation category and back-propagates the loss to update the model parameters. Step seven retains the model whose loss value converges to the optimum and uses it as the model for prediction.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A header list entity relationship matching method based on a deep learning multi-head selection model, characterized by comprising the following steps:
step one: defining data entity attribute categories, including time, person name and company name, for the data items of a table, and constructing a regular-expression recognition method;
step two: constructing artificial features for any two-column combination of the header, wherein the construction of the artificial features can be chosen according to the needs of the actual scene, and at the same time recording the relationship-matching problem for any two columns of header entities;
step three: passing the header character sequence and the data attribute sequence corresponding to the header through their respective embedding layers, then merging the vectors as the input of the next coding layer;
step four: adopting a bi-lstm structure for the coding layer, whose output at each position of the header sequence is an encoding fused with context information;
step five: splicing the context encodings of any two positions pairwise based on the multi-head selection model, and then performing relation classification on the spliced vectors;
step six: computing a binary loss value for every position pair of the header sequence against each relation category, and then back-propagating the loss to update the model parameters;
step seven: retaining the model whose loss value converges to the optimum and using it as the model for prediction.
2. The header list entity relationship matching method based on the deep learning multi-head selection model as claimed in claim 1, wherein the header in step one is text composed of names of television-episode actors, names of the companies to which they belong, company telephone numbers, company addresses, shooting time, shooting addresses, work mailboxes, director names, names of the companies to which they belong, episode length and first showing time, and a header column refers to the text of each individual column.
3. The header list entity relationship matching method based on the deep learning multi-head selection model as claimed in claim 1, wherein in step three the header character sequence words and the attribute sequence attrs are mapped through two embedding matrices respectively; assuming the character at position i is converted to w_{i} and the attribute at position i is converted to a_{i}, the merged embedding vector at position i is e_{i} = [w_{i}; a_{i}].
4. The header list entity relationship matching method based on the deep learning multi-head selection model as claimed in claim 1, wherein in step four bi-lstm is used as the coding layer for the header character sequence, and during encoding the attributes of the table data items are introduced, these attributes being obtained by regular-expression matching.
5. The header list entity relationship matching method based on the deep learning multi-head selection model as claimed in claim 1, wherein in step five the context encoding information is taken at any two positions i and j of the input header character sequence, finally represented after bi-lstm encoding as u_{i} and u_{j}, and a pairwise position feature m_{ij} is artificially constructed.
6. The header list entity relationship matching method based on the deep learning multi-head selection model as claimed in claim 5, wherein for the input header character sequence the score that the relation label between position i and position j is r_{k} is computed as:
score(u_{i}, u_{j}, m_{ij}, r_{k}) = V · f(U·u_{i} + W·u_{j} + m_{ij} + b), where V, U, W, b are weight parameters and f(·) is a relu or tanh activation function, and the probability that the relation between position i and position j is r_{k} is obtained from the score:
sigmoid(score(u_{i}, u_{j}, m_{ij}, r_{k})), where sigmoid is the sigmoid function.
7. The header list entity relationship matching method based on the deep learning multi-head selection model as claimed in claim 1, wherein in step five, when the multi-head selection model is used for judgment, fusion vectors of the different header columns need to be constructed, and the relationship between any two columns is judged according to these fusion vectors.
8. The header list entity relationship matching method based on the deep learning multi-head selection model as claimed in claim 1, wherein in step two the problem of matching the relationship between any two columns of header entities is converted into a deep learning relation extraction problem, and the multi-head selection model is used as the solution to this problem.
CN202110936805.7A 2021-08-16 2021-08-16 Header list entity relationship matching method based on deep learning multi-head selection model Active CN113378018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110936805.7A CN113378018B (en) 2021-08-16 2021-08-16 Header list entity relationship matching method based on deep learning multi-head selection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110936805.7A CN113378018B (en) 2021-08-16 2021-08-16 Header list entity relationship matching method based on deep learning multi-head selection model

Publications (2)

Publication Number Publication Date
CN113378018A CN113378018A (en) 2021-09-10
CN113378018B true CN113378018B (en) 2021-11-16

Family

ID=77577279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110936805.7A Active CN113378018B (en) 2021-08-16 2021-08-16 Header list entity relationship matching method based on deep learning multi-head selection model

Country Status (1)

Country Link
CN (1) CN113378018B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10198774B1 (en) * 2015-10-26 2019-02-05 Intuit Inc. Systems, methods and articles for associating tax data with a tax entity
CN108268643A (en) * 2018-01-22 2018-07-10 北京邮电大学 A kind of Deep Semantics matching entities link method based on more granularity LSTM networks
CN109614615B (en) * 2018-12-04 2022-04-22 联想(北京)有限公司 Entity matching method and device and electronic equipment
CN111428443B (en) * 2020-04-15 2022-09-13 中国电子科技网络信息安全有限公司 Entity linking method based on entity context semantic interaction

Also Published As

Publication number Publication date
CN113378018A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN112818906B (en) Intelligent cataloging method of all-media news based on multi-mode information fusion understanding
CN110321419B (en) Question-answer matching method integrating depth representation and interaction model
CN112100404B (en) Knowledge graph pre-training method based on structured context information
CN108920720B (en) Large-scale image retrieval method based on depth hash and GPU acceleration
CN106503106B (en) A kind of image hash index construction method based on deep learning
CN110765281A (en) Multi-semantic depth supervision cross-modal Hash retrieval method
CN110751224A (en) Training method of video classification model, video classification method, device and equipment
CN111125457A (en) Deep cross-modal Hash retrieval method and device
CN112818157B (en) Combined query image retrieval method based on multi-order confrontation characteristic learning
CN112463956B (en) Text abstract generation system and method based on antagonistic learning and hierarchical neural network
CN111753207A (en) Collaborative filtering model of neural map based on comments
CN112070114A (en) Scene character recognition method and system based on Gaussian constraint attention mechanism network
CN115526236A (en) Text network graph classification method based on multi-modal comparative learning
CN116049450A (en) Multi-mode-supported image-text retrieval method and device based on distance clustering
CN115687638A (en) Entity relation combined extraction method and system based on triple forest
CN115659279A (en) Multi-mode data fusion method based on image-text interaction
CN113378018B (en) Header list entity relationship matching method based on deep learning multi-head selection model
CN111259197B (en) Video description generation method based on pre-coding semantic features
CN110717068A (en) Video retrieval method based on deep learning
CN115617975B (en) Intention recognition method and device for few-sample multi-turn conversation
CN114911930A (en) Global and local complementary bidirectional attention video question-answering method and system
CN113342982B (en) Enterprise industry classification method integrating Roberta and external knowledge base
CN110377591A (en) Training data cleaning method, device, computer equipment and storage medium
CN114529908A (en) Offline handwritten chemical reaction type image recognition technology
CN114329005A (en) Information processing method, information processing device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant