CN111859967A - Entity identification method and device and electronic equipment - Google Patents

Entity identification method and device and electronic equipment

Info

Publication number
CN111859967A
CN111859967A
Authority
CN
China
Prior art keywords
entity
text
vector
input
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010538406.0A
Other languages
Chinese (zh)
Other versions
CN111859967B (en)
Inventor
马璐
温丽红
罗星池
李超
仙云森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202010538406.0A priority Critical patent/CN111859967B/en
Publication of CN111859967A publication Critical patent/CN111859967A/en
Application granted granted Critical
Publication of CN111859967B publication Critical patent/CN111859967B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/251 Fusion techniques of input or preprocessed data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an entity recognition method. It belongs to the technical field of computers and helps improve recognition performance for query entities with non-traditional meanings. The method comprises the following steps: determining a semantic feature vector and an entity knowledge feature vector matched with the text to be recognized, where the entity knowledge feature vector indicates matching information between the text substrings included in the text to be recognized and a preset search log; performing fusion calculation on the semantic feature vector and the entity knowledge feature vector through a pre-trained entity recognition model, and outputting an entity labeling result for the text to be recognized according to the fusion calculation result; and determining the entities included in the text to be recognized according to the entity labeling result. The embodiment of the application provides a new-word mining method that fuses search log features; it optimizes the new-word mining effect by using features from massive user search logs and can effectively improve the recognition accuracy of new entities in query input.

Description

Entity identification method and device and electronic equipment
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to an entity identification method, an entity identification device, electronic equipment and a computer-readable storage medium.
Background
Entity recognition is a basic technical module of a search system. The entity recognition module performs entity recognition on the input natural language and outputs segmented phrases and their phrase types; the output phrases and phrase types represent the query entities in the input natural language. The search system then generates a recall grammar from the phrases and phrase types output by the entity recognition module and retrieves relevant records from a database table. It can be seen that the recognition accuracy of the entity recognition module directly affects the recall accuracy of the search system.
In the prior art, in a technique that performs entity recognition by training a Long Short-Term Memory network (LSTM), single character vectors are used as input, and word vectors serve as bridges at the hidden layers corresponding to the first and last characters of each word. The LSTM network is pre-trained on large-scale corpora to learn semantic relevance, the network parameters are then fine-tuned on labeled named entity recognition (NER) data to train the model, and the model obtained from this final training performs entity recognition on input phrases.
The named entity recognition methods in the prior art use word granularity as input and depend heavily on the training corpus. For training corpora with rich context information and larger training data volumes, model performance is better. For example, entities in the traditional sense, such as person names and place names, follow statistical regularities, and the model learns their characteristics easily. In vertical-domain search scenarios, however, many entities show no obvious statistical regularity, such as merchant names and group-deal titles, and in such scenarios the performance of prior-art entity recognition methods is not high.
It can be seen that there is still a need for improvements in the prior art methods of entity identification.
Disclosure of Invention
The embodiment of the application provides an entity identification method, which can improve the identification performance of the query entity with non-traditional meaning.
In order to solve the above problem, in a first aspect, an embodiment of the present application provides an entity identification method, including:
determining semantic feature vectors and entity knowledge feature vectors matched with texts to be recognized; the entity knowledge characteristic vector is used for indicating matching information of text substrings included in the text to be recognized and a preset search log;
performing fusion calculation on the semantic feature vectors and the entity knowledge feature vectors through a pre-trained entity recognition model, and outputting an entity labeling result of the text to be recognized according to a fusion calculation result;
and determining the entity included in the text to be recognized according to the entity labeling result.
In a second aspect, an embodiment of the present application provides an entity identification apparatus, including:
the semantic and entity knowledge characteristic vector determining module is used for determining a semantic characteristic vector and an entity knowledge characteristic vector which are matched with the text to be recognized; the entity knowledge characteristic vector is used for indicating matching information of text substrings included in the text to be recognized and a preset search log;
The entity identification and labeling module is used for performing fusion calculation on the semantic feature vectors and the entity knowledge feature vectors through a pre-trained entity identification model and outputting an entity labeling result of the text to be identified according to a fusion calculation result;
and the entity determining module is used for determining the entity included in the text to be recognized according to the entity labeling result.
In a third aspect, an embodiment of the present application further discloses an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the entity identification method according to the embodiment of the present application when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the entity identification method disclosed in the present application.
The entity identification method disclosed by the embodiment of the application determines semantic feature vectors and entity knowledge feature vectors matched with texts to be identified; the entity knowledge characteristic vector is used for indicating matching information of text substrings included in the text to be recognized and a preset search log; performing fusion calculation on the semantic feature vectors and the entity knowledge feature vectors through a pre-trained entity recognition model, and outputting an entity labeling result of the text to be recognized according to a fusion calculation result; and determining the entity included in the text to be recognized according to the entity labeling result, so that the recognition performance of the query entity with non-traditional meaning can be improved. The embodiment of the application provides a new word mining method fusing search log features, the new word mining effect is optimized by using massive user search log features, and the identification accuracy of new entities in query input can be effectively improved.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a flowchart of an entity identification method according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of an entity recognition model structure and an operation principle according to a first embodiment of the present application;
fig. 3 is a schematic structural diagram of an entity identification apparatus according to a second embodiment of the present application;
fig. 4 is a second schematic structural diagram of an entity identification apparatus according to a second embodiment of the present application;
FIG. 5 schematically shows a block diagram of an electronic device for performing a method according to the present application; and
fig. 6 schematically shows a storage unit for holding or carrying program code implementing a method according to the present application.
Detailed Description
Example one
As shown in fig. 1, an entity identification method disclosed in an embodiment of the present application includes: step 110 to step 130.
And step 110, determining semantic feature vectors and entity knowledge feature vectors matched with the text to be recognized.
The text to be recognized in the embodiment of the present application may include one word or a plurality of words. For example, the text to be recognized "hand-knit" includes four characters, "hand", "work", "knit", and "weave", and includes two words, "handcraft" and "knit". Each word included in the text to be recognized is composed of at least two consecutive characters of the text to be recognized.
The text to be recognized further includes text substrings, each composed of at least two consecutive characters of the text to be recognized. For example, the text to be recognized "hand-knit" also includes three text substrings: "handcraft", "knit", and "hand-knit".
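The enumeration of candidate words and text substrings above can be sketched as follows; the function name and the toy vocabulary are hypothetical stand-ins, with the four characters of the running example abstracted as single letters:

```python
def candidate_substrings(chars, vocab):
    """Return (start, end, text) for every span of >= 2 consecutive
    characters whose joined text occurs in the vocabulary."""
    spans = []
    for b in range(len(chars)):
        for e in range(b + 1, len(chars)):
            s = "".join(chars[b:e + 1])
            if s in vocab:
                spans.append((b, e, s))
    return spans

# Toy stand-in: four characters, a vocabulary holding the two words
# and the full text, mirroring "handcraft" (0-1), "knit" (2-3), "hand-knit" (0-3).
chars = ["a", "b", "c", "d"]
vocab = {"ab", "cd", "abcd"}
print(candidate_substrings(chars, vocab))
```

With this vocabulary the function recovers exactly the three spans that correspond to "handcraft", "knit", and "hand-knit" in the example.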
In the embodiment of the present application, a word segmentation method in the prior art may be adopted to determine words and text substrings included in the text to be recognized, which is not described herein again.
The semantic feature vector described in the embodiment of the present application includes the character vector of each character included in the text to be recognized and the word vector of each word included in the text to be recognized. The character vectors and word vectors in the embodiment of the application may be obtained by prior-art semantic feature extraction for single characters and single words. For example, the Word2vec method may be used to obtain the character vectors of the four characters "hand", "work", "knit", and "weave" in the text to be recognized "hand-knit", and the word vectors of the two words "handcraft" and "knit".
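A minimal sketch of looking up character-level and word-level semantic vectors: a real system would use pre-trained Word2vec embeddings, while here a deterministic random table (a hypothetical stand-in) plays that role:

```python
import numpy as np

class EmbeddingTable:
    """Toy stand-in for pre-trained Word2vec vectors:
    one fixed random vector per token, created on first lookup."""
    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.table = {}

    def __getitem__(self, token):
        if token not in self.table:
            self.table[token] = self.rng.standard_normal(self.dim)
        return self.table[token]

emb = EmbeddingTable()
# Character-level and word-level tokens of the running example (hypothetical spellings).
char_vecs = [emb[c] for c in ["hand", "work", "knit", "weave"]]
word_vecs = [emb[w] for w in ["handcraft", "knit"]]
```

Each lookup returns the same vector for the same token, so the character vectors and word vectors are consistent across the text.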
In the embodiment of the application, the entity knowledge characteristic vector is used for indicating matching information of a text sub-string included in the text to be recognized and a preset search log.
In some embodiments of the present application, the entity knowledge feature vector comprises the entity knowledge feature vector of each text substring in the text to be recognized, each of which indicates the matching information between the corresponding text substring and the preset search log.
In some embodiments of the present application, determining the entity knowledge feature vector matched with the text to be recognized includes: determining the entity knowledge feature vector of each text substring included in the text to be recognized according to the matching information between that text substring and the preset search log; the entity knowledge feature vectors of all the text substrings together form the entity knowledge feature vector matched with the text to be recognized.
In some embodiments of the present application, further, where the entity knowledge feature vector includes the entity knowledge feature vector of each text substring in the text to be recognized, determining the entity knowledge feature vector matched with the text to be recognized includes performing the following operations for each text substring included in the text to be recognized: matching the text substring against the document fields of the query documents included in the preset search log, and determining the query documents matched with the text substring on each document field; then, for each document field, determining the vector value of the dimension corresponding to that document field in the entity knowledge feature vector of the text substring according to the click information of the query documents matched with the text substring on that document field.
Typically, a query document includes a plurality of document fields. For example, the document fields include: merchant name, address, category, group-deal information, and the like.
For example, for the text to be recognized "hand-knit", the preset search logs (such as search logs stored in a log system) that take the text "hand-knit" as query input are first determined, and the query documents included in each search log are further determined. In some embodiments of the present application, the entity knowledge feature vector of a text substring comprises a plurality of dimensions, and the feature vector value of each dimension matches one document field. For example, if the document fields include the three fields merchant name, address, and group-deal, the entity knowledge feature vector comprises three dimensions. When a search is conducted, at least one document field of a returned query document generally matches the query input. Therefore, by text-matching each text substring included in the text to be recognized (i.e., "handcraft", "knit", and "hand-knit") against the document fields of the query documents in the search logs, the document fields on which each text substring matches each query document can be determined.
Take the following two query documents as an example to illustrate how matching documents are determined. For query document d1, the text content of the "merchant name" field is "nice cake room", the text content of the "group-deal" field is "hand-made", and the text content of the "address" field is "near the plastics factory". For query document d2, the text content of the "merchant name" field is "real hand-knit", the text content of the "group-deal" field is "hand-knit", and the text content of the "address" field is "Chaoyang District". For query document d1, the text substring "handcraft" appears in the document field "group-deal", so the text substring "handcraft" is determined to match the "group-deal" field of query document d1; the text substring "knit" appears in the document field "address", so the text substring "knit" is determined to match the "address" field of query document d1. For query document d2, "hand-knit" appears in both the "merchant name" and "group-deal" fields, so the text substring "hand-knit" may be determined to match the "merchant name" and "group-deal" fields of query document d2. Thus it can be determined that, for the text to be recognized "hand-knit", the query documents matched on the "merchant name" field include document d2, the query documents matched on the "group-deal" field include documents d1 and d2, and the query documents matched on the "address" field include document d1. Similarly, the query documents that each text substring matches on the various document fields can be determined.
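The per-field matching walked through above can be sketched as follows; the document contents and field names are simplified, hypothetical stand-ins for the d1/d2 example:

```python
def field_matches(substring, docs):
    """Return (doc_id, field) pairs whose field text contains the substring."""
    return [(doc_id, field)
            for doc_id, fields in docs.items()
            for field, text in fields.items()
            if substring in text]

# Toy query documents mirroring d1 and d2 from the example.
docs = {
    "d1": {"merchant_name": "nice cake room",
           "group_deal": "handcraft cake class",
           "address": "knit lane, near the plastics factory"},
    "d2": {"merchant_name": "real hand-knit studio",
           "group_deal": "hand-knit scarf deal",
           "address": "some district"},
}
print(field_matches("handcraft", docs))
print(field_matches("hand-knit", docs))
```

As in the walkthrough, "handcraft" matches only d1's group-deal field, while "hand-knit" matches both the merchant-name and group-deal fields of d2.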
In an embodiment of the present application, the search log further includes information on whether each query document was clicked. After determining, by the above method, the query documents that each text substring (such as "handcraft", "knit", and "hand-knit") included in the text to be recognized matches on each document field (such as "merchant name", "group-deal", and "address"), the number of times the query documents matched by each text substring on each document field were clicked can be further determined from the click information of the query documents.
Then, for any text substring, the entity knowledge feature vector of that text substring is determined according to the number of times the query documents matched with it on each document field were clicked. In some embodiments of the present application, taking the case where a query document has M document fields and the search log includes N query documents, it can be determined by the above method whether each text substring (such as "handcraft", "knit", and "hand-knit") included in the text to be recognized matches the M document fields of each of the N query documents, and whether each query document was clicked, where N and M are integers greater than 1. For each text substring (e.g., any one of "handcraft", "knit", and "hand-knit"), the vector value of the dimension corresponding to a document field (e.g., "merchant name") can be determined from the click information of the query documents matched with that text substring on that document field.
The click information includes whether a user clicked and the distribution of how many times users clicked. In some embodiments of the present application, determining the vector value of the dimension corresponding to a document field in the entity knowledge feature vector of a text substring according to the click information of the query documents matched with the text substring on that document field includes: determining the vector value according to whether any query document matched with the text substring on that document field was clicked by a user.
Take $N_{dM_i}$ as an example, which denotes whether document field $i$ of query document $d$ matched by the text substring was clicked: if query document $d$ was clicked by the user under the current query, $N_{dM_i}$ takes the value 1, and otherwise 0. In some embodiments of the present application, the $i$-th dimension $\mathrm{phrase}_i$ of the entity knowledge feature vector $e_{\mathrm{phrase}}$ of a text substring can be calculated by the following formula 1:

$$\mathrm{phrase}_i = \max_{d}\left(N_{dM_i}\right), \quad i = 1, 2, \dots, M \tag{1}$$

Formula 1 only considers whether any click occurred on the $i$-th document field (such as the "merchant name", "address", or "group-deal" field) across all query documents matched with the current text substring (such as "handcraft"): as long as document field $i$ of one of the N query documents was clicked, the $i$-th dimension of the entity knowledge feature vector takes the value 1. By this method, the M-dimensional feature vector values of the entity knowledge feature vector of each text substring can be obtained.
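Formula 1 can be sketched as follows: given a click matrix whose entry [d][i] is 1 when field i of matched document d was clicked (the function name is hypothetical), each feature dimension is the maximum over the matched documents:

```python
def knowledge_vector_binary(click_matrix):
    """Formula 1: phrase_i = max over documents d of N_dMi.
    click_matrix[d][i] is 1 iff field i of matched document d was clicked."""
    m = len(click_matrix[0])  # number of document fields M
    return [max(row[i] for row in click_matrix) for i in range(m)]

# Two matched documents, three fields (merchant name, address, group-deal):
# only field 1 of document 0 was ever clicked.
print(knowledge_vector_binary([[0, 1, 0],
                               [0, 0, 0]]))  # prints [0, 1, 0]
```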
In other embodiments of the present application, determining the vector value of the dimension corresponding to a document field in the entity knowledge feature vector of a text substring according to the click information of the query documents matched with the text substring on that document field includes: determining the vector value according to the click distribution of users over the query documents matched with the text substring on each document field. The click distribution information is calculated as follows: for each document field, the total number of times the query documents matched with the text substring on that document field were clicked by users is determined as the click information corresponding to that document field; the click information corresponding to each document field is then normalized, and the normalization result is taken as the vector value of the dimension of the entity knowledge feature vector of the text substring corresponding to that document field. In some embodiments of the present application, the normalization of the click information corresponding to each document field uses the softmax normalization method.
Again take $N_{dM_i}$, which denotes whether document field $i$ of query document $d$ matched by the text substring was clicked: if query document $d$ was clicked by the user under the current query, $N_{dM_i}$ takes the value 1, and otherwise 0. In some embodiments of the present application, the $i$-th dimension $\mathrm{phrase}_i$ of the entity knowledge feature vector $e_{\mathrm{phrase}}$ of a text substring can be calculated by the following formula 2:

$$\mathrm{phrase}_i = \frac{\exp\left(\sum_{d=1}^{N} N_{dM_i}\right)}{\sum_{j=1}^{M} \exp\left(\sum_{d=1}^{N} N_{dM_j}\right)} \tag{2}$$

In formula 2, the inner sum $\sum_{d=1}^{N} N_{dM_i}$ counts the total number of clicks (i.e., the click information) over all query documents matched with the current text substring (e.g., "handcraft") on the $i$-th document field (e.g., the "merchant name", "address", or "group-deal" field). The softmax normalization then proceeds as follows: the total click count of each document field $i$ is exponentiated with base $e$ to obtain an exponential value for that field; the exponential values of the M document fields are accumulated; finally, for each document field $i$, the exponential value of that field is divided by the accumulated sum to obtain the $i$-th dimension feature vector value of the entity knowledge feature vector $e_{\mathrm{phrase}}$ of the text substring (such as "handcraft"). By this method, the M-dimensional feature vector values of the entity knowledge feature vector of each text substring can be obtained.
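Formula 2 can likewise be sketched: per-field click totals are accumulated over the matched documents and then softmax-normalized (the function name is hypothetical):

```python
import math

def knowledge_vector_softmax(click_matrix):
    """Formula 2: softmax over the per-field total click counts.
    click_matrix[d][i] is 1 iff field i of matched document d was clicked."""
    m = len(click_matrix[0])
    totals = [sum(row[i] for row in click_matrix) for i in range(m)]  # sum over d of N_dMi
    exps = [math.exp(t) for t in totals]  # base-e exponentiation per field
    z = sum(exps)                         # accumulate over the M fields
    return [x / z for x in exps]          # divide each field's value by the sum

# Three fields; field 0 clicked twice, field 1 once, field 2 never.
vec = knowledge_vector_softmax([[1, 1, 0],
                                [1, 0, 0]])
print(vec)
```

The resulting vector sums to 1, with the most-clicked field carrying the largest weight.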
If the text to be recognized is "hand-knit", then by the method of this step the M-dimensional entity knowledge feature vectors of the text substrings "handcraft", "knit", and "hand-knit" can be determined respectively.
And 120, performing fusion calculation on the semantic feature vectors and the entity knowledge feature vectors through a pre-trained entity recognition model, and outputting an entity labeling result of the text to be recognized according to a fusion calculation result.
In some embodiments of the present application, before performing fusion calculation on the semantic feature vector and the entity knowledge feature vector through a pre-trained entity recognition model and outputting the entity labeling result of the text to be recognized according to the fusion calculation result, the method further includes: training the entity recognition model. In the embodiment of the present application, the entity recognition model is an LSTM (Long Short-Term Memory) model.
As shown in FIG. 2, the entity recognition model includes a character vector learning network 210, a word vector learning network 220, and an entity knowledge feature vector learning network 230. The word vector learning network 220 controls the input of word vectors and learns the semantic associations between word vectors and between word vectors and the related character vectors; the entity knowledge feature vector learning network 230 controls the input of entity knowledge feature vectors and learns the entity knowledge associations between entity knowledge feature vectors and the related character vectors; the character vector learning network 210 learns the associations between the input character vectors, and between the character vectors and the inputs from the word vector learning network 220 and the entity knowledge feature vector learning network 230. When the word vector learning network 220 and the entity knowledge feature vector learning network 230 provide no input to the character vector learning network 210, the character vector learning network 210 can only learn semantic associations within the input character vector sequence.
In some embodiments of the present application, the training samples used for training the entity recognition model are samples labeled by a prior-art named entity labeling method (e.g., the BIOES method). The sample data of each training sample comprises the character vector sequence of a query text and a sample label produced by an entity sequence labeling method (such as a BIOES label, where B identifies the first character of an entity in the query text, I identifies a middle character of an entity name, E identifies the last character of an entity name, S identifies a single-character entity, and O identifies other characters in the query text).
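The BIOES labeling convention just described can be sketched as a span-to-tag conversion, assuming 0-based inclusive character spans (the helper name is hypothetical):

```python
def bioes_tags(text_len, entity_spans):
    """Tag each character: B = first, I = middle, E = last character of an
    entity, S = single-character entity, O = any other character."""
    tags = ["O"] * text_len
    for b, e in entity_spans:
        if b == e:
            tags[b] = "S"
        else:
            tags[b] = "B"
            for i in range(b + 1, e):
                tags[i] = "I"
            tags[e] = "E"
    return tags

# A four-character query whose whole text is one entity, e.g. "hand-knit".
print(bioes_tags(4, [(0, 3)]))  # prints ['B', 'I', 'I', 'E']
```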
In an embodiment of the present application, the training data further includes word vectors and entity knowledge feature vectors, which may be input to the word vector learning network 220 and the entity knowledge feature vector learning network 230 in matrix form. The word vector matrix input to the word vector learning network 220 indicates the correspondence between position intervals, each composed of a start position and an end position in the query text, and the word vectors; its entries may be denoted $x^{w}_{b,e}$. The entity knowledge feature vector matrix input to the entity knowledge feature vector learning network 230 indicates the correspondence between such position intervals and the entity knowledge feature vectors; its entries may be denoted $x^{k}_{b,e}$. Here $b$ denotes the character offset, in the query text, of the first character of the current word or text substring, and $e$ denotes the character offset of its last character. For example, in the query text "hand-knit", $b$ and $e$ take values 0, 1, 2, and 3; $x^{w}_{0,1}$ denotes the word "handcraft", and $x^{k}_{0,1}$ denotes the text substring "handcraft".
The character vector learning network 210 is constructed based on an LSTM (Long Short-Term Memory) model, and it may be trained alone or together with the word vector learning network 220 and the entity knowledge feature vector learning network 230.
In the model training process, each sub-network of the entity recognition model predicts the labeling result corresponding to the input according to the character vectors, word vectors, and entity knowledge feature vectors input at different times; the error between the predicted labeling result and the sample label is calculated; and the model parameters are then adjusted by backpropagation and gradient descent until the error converges.
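The predict / error / backpropagate loop can be sketched on a toy logistic tagger; the data and model here are hypothetical stand-ins, and a real implementation would update the full LSTM's parameters rather than a single weight vector:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))      # fused feature vectors (toy stand-in)
y = (X[:, 0] > 0).astype(float)       # toy per-position labels
w = np.zeros(4)
prev_loss = float("inf")
for step in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))                                # predict labels
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    grad = X.T @ (p - y) / len(y)                                   # error gradient
    w -= 0.5 * grad                                                 # gradient-descent update
    if abs(prev_loss - loss) < 1e-8:                                # stop when the error converges
        break
    prev_loss = loss

accuracy = np.mean((p > 0.5) == (y > 0.5))
```

The loop mirrors the description above: forward prediction, error against the label, parameter update, repeated until the error converges.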
The model processes input data in the same way during training as it does when used for entity labeling; therefore, the training procedure is not repeated in this embodiment of the application, and for a specific implementation, reference may be made to the description below of how input data is processed when the model performs entity labeling.
The following describes in detail, in conjunction with the model structure, a specific embodiment of processing input data using the entity recognition model to output a labeling result. As described in the foregoing steps, when entity recognition is performed, the input data includes the character vectors, the word vectors, and the entity knowledge feature vectors derived from the text to be recognized.
In some embodiments of the present application, performing fusion calculation on the semantic feature vectors and the entity knowledge feature vectors through a pre-trained entity recognition model, and outputting an entity labeling result of the text to be recognized according to the fusion calculation result, includes: sequentially performing, through a memory unit of the pre-trained entity recognition model, weighted fusion processing on the character vector input at the current time, the word vectors input at the current time, and the entity knowledge feature vectors input at the current time, until the character vectors of all characters in the text to be recognized have been processed, obtaining the final-time memory state of the memory unit; and outputting the entity labeling result of the text to be recognized according to the final-time memory state, the character vector input at the final time, and the output of the entity recognition model at the time before the final time.
The output $h_t$ of the LSTM model at the current time $t$ depends on the memory state $c_t$ at time $t$, the input $x_t$ at time $t$, and the output $h_{t-1}$ at time $t-1$. For a text to be recognized such as "hand weaving", the labeling result can therefore be obtained from the memory state when the character vector of the last character ("weave") is input, the character vector input at that time, and the output of the entity recognition model at the previous time (i.e., when the preceding character is input).
In some embodiments of the present application, a first memory, between each word vector semantically associated with the character vector input at the current time and that character vector, is learned through the word vector learning network 220; a second memory, between each entity-knowledge-associated feature vector and the character vector input at the current time, is learned through the entity knowledge feature vector learning network 230; and the first and second memories learned at the current time are fed back to the word vector learning network 210 to update the cell state at the current time. The resulting cell state thus fuses the information of the character vectors, the word vectors and the entity knowledge feature vectors.
In some embodiments of the present application, sequentially performing, through the memory unit of the pre-trained entity recognition model, weighted fusion processing on the character vector input at the current time, the word vectors input at the current time and the entity knowledge feature vectors input at the current time includes: calculating the semantic features of the currently input character vector through the word vector learning network 210; learning, through the word vector learning network 220, the semantic features between each currently input word vector and the hidden-layer output vector at that word's starting character; and learning, through the entity knowledge feature vector learning network 230, the entity knowledge features between each currently input entity knowledge feature vector and the hidden-layer output vector at the corresponding sub-string's starting character; then, according to the input weights of the current input vectors of the three networks, performing a weighted summation of the current memory cell states of the corresponding networks, and updating the current memory cell state of the word vector learning network 210 according to the result of the weighted summation.
In some embodiments of the present application, the current memory cell state of the word vector learning network 220 comprises semantic fusion information of all currently input word vectors associated with the currently input character vector, where the associated word vectors are those of the words ending with the character corresponding to the currently input character vector; and the current memory cell state of the entity knowledge feature vector learning network 230 comprises entity knowledge fusion information of all currently input entity knowledge feature vectors associated with the currently input character vector, where the associated entity knowledge feature vectors are those of the text sub-strings ending with the character corresponding to the currently input character vector.
The following describes in detail the fusion calculation over the character vectors, word vectors and entity knowledge feature vectors, with reference to the internal structures of the word vector learning network 210, the word vector learning network 220 and the entity knowledge feature vector learning network 230.
Taking $e^c$ to denote the encoding at character granularity as an example, for the j-th input character $c_j$, the character vector is written as $x^c_j = e^c(c_j)$. Denoting the sigmoid function in the forward pass by $\sigma$, the components of the model in the word vector learning network 210 can be expressed by the following formulas:

input coefficient: $i^c_j = \sigma(W^c_i x^c_j + U^c_i h^c_{j-1} + b^c_i)$

output coefficient: $o^c_j = \sigma(W^c_o x^c_j + U^c_o h^c_{j-1} + b^c_o)$

forgetting coefficient: $f^c_j = \sigma(W^c_f x^c_j + U^c_f h^c_{j-1} + b^c_f)$

currently input cell state: $\tilde{c}^c_j = \tanh(W^c_u x^c_j + U^c_u h^c_{j-1} + b^c_u)$

memory cell state at the current time: $c^c_j = f^c_j \odot c^c_{j-1} + i^c_j \odot \tilde{c}^c_j$

output at the current time: $h^c_j = o^c_j \odot \tanh(c^c_j)$

where the $W$, $U$ and $b$ terms are model coefficients determined by training; $j$ denotes the input time of the character in the text to be recognized and corresponds to the index into the character vector sequence of the text to be recognized; $x^c_j$ is the character vector input at time $j$ (e.g., for the text to be recognized "hand weaving", $x^c_0, \ldots, x^c_3$ are the character vectors of the four characters glossed "hand", "work", "compose" and "weave", respectively); and $h^c_{j-1}$ is the model output at the previous time.
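The character-level recurrences above are those of a standard LSTM cell. The following NumPy sketch is illustrative only: the dimensions, the random weights and the gate ordering are assumptions, not the patent's trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def char_lstm_step(x, h_prev, c_prev, W, U, b):
    """One character-granularity LSTM step mirroring the formulas above:
    gates i/o/f and candidate from x_j and h_{j-1}, then the cell and
    hidden updates. W, U, b stack the four gate blocks row-wise."""
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # (4d,) pre-activations
    i = sigmoid(z[0:d])                   # input coefficient  i_j
    o = sigmoid(z[d:2*d])                 # output coefficient o_j
    f = sigmoid(z[2*d:3*d])               # forgetting coefficient f_j
    c_tilde = np.tanh(z[3*d:4*d])         # currently input cell state
    c = f * c_prev + i * c_tilde          # c_j = f ⊙ c_{j-1} + i ⊙ c~_j
    h = o * np.tanh(c)                    # h_j = o ⊙ tanh(c_j)
    return h, c

# Toy dimensions; in practice the weights come from training.
rng = np.random.default_rng(0)
dx, dh = 3, 2
W = rng.normal(size=(4 * dh, dx))
U = rng.normal(size=(4 * dh, dh))
b = np.zeros(4 * dh)
h, c = char_lstm_step(rng.normal(size=dx), np.zeros(dh), np.zeros(dh), W, U, b)
print(h.shape, c.shape)  # (2,) (2,)
```

Because $h_j = o_j \odot \tanh(c_j)$ with both factors strictly inside $(-1, 1)$ and $(0, 1)$, every component of the hidden output is bounded by 1 in magnitude.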
The word vector learning network 220 is constructed based on an LSTM (long short-term memory) model. Taking $e^w$ to denote the encoding at word granularity as an example, for a word $c_{b,e}$ consisting of the b-th through the e-th characters of the text to be recognized, the word vector is written as $x^w_{b,e} = e^w(c_{b,e})$. Denoting the sigmoid function in the forward pass by $\sigma$, the components of the model in the word vector learning network 220 can be expressed by the following formulas:

input coefficient: $i^w_{b,e} = \sigma(W^w_i x^w_{b,e} + U^w_i h^c_b + b^w_i)$

forgetting coefficient: $f^w_{b,e} = \sigma(W^w_f x^w_{b,e} + U^w_f h^c_b + b^w_f)$

currently input cell state: $\tilde{c}^w_{b,e} = \tanh(W^w_u x^w_{b,e} + U^w_u h^c_b + b^w_u)$

memory cell state at the current time: $c^w_{b,e} = f^w_{b,e} \odot c^c_b + i^w_{b,e} \odot \tilde{c}^w_{b,e}$

where the $W$, $U$ and $b$ terms are model coefficients determined by training; $b$ and $e$ denote the positions of the characters in the text to be recognized and correspond respectively to the input times $b$ and $e$ of the character vectors; $x^w_{b,e}$ is the word vector of the word starting with the character at position $b$ and ending with the character at position $e$ (e.g., for the text to be recognized "hand weaving", $x^w_{0,1}$ and $x^w_{2,3}$ are the word vectors of "manual" and "weaving", respectively); and $h^c_b$ is the hidden-layer output vector at the starting character of the current word.
The word vector learning network 220 learns semantic fusion information between the word vector input at the current time and the hidden-layer output vector of the starting character of the word corresponding to that word vector, so that the learned semantic fusion information is carried in the memory cell state at the current time. For example, the network 220 learns semantic fusion information between the word "manual" and the character "hand", and between the word "weaving" and the character "compose".
The entity knowledge feature vector learning network 230 is constructed based on an LSTM (long short-term memory) model. Taking $e^{phrase}$ to denote the entity knowledge feature encoding as an example, for a text sub-string $c_{b,e}$ consisting of the b-th through the e-th characters of the text to be recognized, the entity knowledge feature vector is written as $x^{p}_{b,e} = e^{phrase}(c_{b,e})$. Denoting the sigmoid function in the forward pass by $\sigma$, the components of the model in the entity knowledge feature vector learning network 230 can be expressed by the following formulas:

input gate coefficient: $i^{p}_{b,e} = \sigma(W^{p}_i x^{p}_{b,e} + U^{p}_i h^c_b + b^{p}_i)$

forgetting coefficient: $f^{p}_{b,e} = \sigma(W^{p}_f x^{p}_{b,e} + U^{p}_f h^c_b + b^{p}_f)$

currently input cell state: $\tilde{c}^{p}_{b,e} = \tanh(W^{p}_u x^{p}_{b,e} + U^{p}_u h^c_b + b^{p}_u)$

memory cell state at the current time: $c^{p}_{b,e} = f^{p}_{b,e} \odot c^c_b + i^{p}_{b,e} \odot \tilde{c}^{p}_{b,e}$

where the $W$, $U$ and $b$ terms are model coefficients determined by training; $x^{p}_{b,e}$ is the entity knowledge feature vector of the text sub-string with the character at position $b$ as its starting character and the character at position $e$ as its ending character (e.g., for the text to be recognized "hand weaving", $x^{p}_{0,1}$, $x^{p}_{2,3}$ and $x^{p}_{0,3}$ are the entity knowledge feature vectors of "manual", "weaving" and "hand weaving", respectively); and $h^c_b$ is the hidden-layer output vector at the starting character of the current text sub-string.
The entity knowledge feature vector learning network 230 learns entity knowledge fusion information between the current entity knowledge feature vector and the hidden-layer output vector of the starting character of the corresponding text sub-string, so that the learned entity knowledge fusion information is carried in the current memory cell state. For example, the network 230 learns entity knowledge fusion information between the text sub-string "manual" and the character "hand", and between the text sub-string "weaving" and the character "compose".
Finally, the input character vectors, word vectors and entity knowledge feature vectors are fused through the memory unit of the hidden layer of the entity recognition model, and the memory state at the current time is updated.
At the current time $j$, the input coefficient of the word vector learning network 210 (i.e., the input coefficient of the character vector) is $i^c_j$; the input coefficient of the word vector learning network 220 (i.e., the input coefficient of the word vector) is $i^w_{b,j}$; and the input coefficient of the entity knowledge feature vector learning network 230 (i.e., the input coefficient of the entity knowledge feature vector) is $i^{p}_{b,j}$. The fusion weights of the character vector, word vectors and entity knowledge feature vectors input at the current time are then calculated from these input coefficients of the three networks.
The updated memory cell state at the current time is calculated by the following formula:

$c^c_j = \sum_{b' \in D_j} \alpha^w_{b',j} \odot c^w_{b',j} + \sum_{b'' \in P_j} \alpha^{p}_{b'',j} \odot c^{p}_{b'',j} + \alpha^c_j \odot \tilde{c}^c_j$

where $D$ denotes the set of words contained in the text to be recognized, and $D_j$ the set of start-character position offsets $b'$ of words in $D$ ending at position offset $j$; $b'$ and $b''$ denote position offsets of characters in the text to be recognized; $\alpha^w_{b',j}$ is the fusion weight of the memory cell state of the word vector learning network 220 for the word matched between position offsets $b'$ and $j$; $c^w_{b',j}$ is the memory cell state of the word vector learning network 220 when the word vector of the word starting at position offset $b'$ is input; $P$ denotes the set of text sub-strings contained in the text to be recognized, and $P_j$ the set of start-character position offsets $b''$ of sub-strings in $P$ ending at position offset $j$; $\alpha^{p}_{b'',j}$ is the fusion weight of the memory cell state of the entity knowledge feature vector learning network 230 for the text sub-string matched between position offsets $b''$ and $j$; $c^{p}_{b'',j}$ is the memory cell state of the entity knowledge feature vector learning network 230 when the entity knowledge feature vector of the text sub-string starting at position offset $b''$ is input; $\alpha^c_j$ is the fusion weight of the memory cell state of the word vector learning network 210 for the character vector at position $j$; and $\tilde{c}^c_j$ is the currently input cell state of the word vector learning network 210 for the character at position $j$.
For the text to be recognized "hand weaving", at the final time the input vector of the word vector learning network 210 is the character vector of "weave", the input vector of the word vector learning network 220 is the word vector of "weaving", and the input vectors of the entity knowledge feature vector learning network 230 are the entity knowledge feature vectors of "weaving" and "hand weaving". As the formula for the updated memory cell state shows, the current memory cell state of the word vector learning network 210, the memory cell state of the word vector learning network 220 when the word vector of "weaving" is input, and the memory cell states of the entity knowledge feature vector learning network 230 when the entity knowledge feature vectors of "weaving" and "hand weaving" are input, are weighted and summed, so that the result fuses the semantic features (carried by the character and word vectors) with the entity knowledge features (carried by the entity knowledge feature vectors). The current memory cell state of the word vector learning network 210 is updated with this weighted sum, and the updated state therefore integrates both semantic features and entity knowledge features. The output of the word vector learning network 210 is then calculated from its current memory cell state, from which the entity labeling result of the text to be recognized is obtained.
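Assuming the fusion weights have already been normalized, the weighted-sum update of the character cell state can be sketched as follows; all cell values and weights are invented for illustration.

```python
import numpy as np

def fuse_cell_state(alpha_c, c_tilde, lattice_cells):
    """Updated character cell: c_j = sum_k alpha_k ⊙ c_k + alpha_c ⊙ c~_j,
    where lattice_cells holds (alpha, cell) pairs for every word and
    entity-knowledge sub-string ending at the current step j."""
    c = alpha_c * c_tilde
    for alpha, cell in lattice_cells:
        c = c + alpha * cell
    return c

c_tilde = np.array([1.0, -1.0])          # character candidate at step j
cells = [
    (np.array([0.5, 0.25]), np.array([2.0, 2.0])),   # word cell ("weaving")
    (np.array([0.25, 0.25]), np.array([4.0, 0.0])),  # entity-knowledge cell
]
alpha_c = np.array([0.25, 0.5])          # weights sum to 1 per element
print(fuse_cell_state(alpha_c, c_tilde, cells).tolist())  # [2.25, 0.0]
```

Note that the weights are applied element-wise, matching the $\odot$ products in the formula.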
In some embodiments of the present application, the input weights of the current input vectors of the word vector learning network 210, the word vector learning network 220 and the entity knowledge feature vector learning network 230 are determined as follows: determine the input coefficient of the word vector learning network 210 for the currently input character vector, the input coefficient of the word vector learning network 220 for each currently input word vector, and the input coefficient of the entity knowledge feature vector learning network 230 for each currently input entity knowledge feature vector; apply a preset operation to each of these input coefficients to obtain an initial input weight for the currently input character vector, for each currently input word vector, and for each currently input entity knowledge feature vector; and divide each initial input weight by the sum of the initial input weights of all vectors input at the current time, obtaining the weights used in the weighted fusion processing of the currently input character vector, word vectors and entity knowledge feature vectors. All vectors input at the current time comprise the character vector input at the current time, the word vectors input at the current time and the entity knowledge feature vectors input at the current time. In some embodiments of the present application, the preset operation is exponentiation with base e.
For example, the fusion weight of the character vector matched at position offset $j$ can be calculated by the following formula:

$\alpha^c_j = \dfrac{\exp(i^c_j)}{\exp(i^c_j) + \sum_{b' \in D_j} \exp(i^w_{b',j}) + \sum_{b'' \in P_j} \exp(i^{p}_{b'',j})}$

the fusion weight of the word vector of the word matched between position offsets $b$ and $j$ by:

$\alpha^w_{b,j} = \dfrac{\exp(i^w_{b,j})}{\exp(i^c_j) + \sum_{b' \in D_j} \exp(i^w_{b',j}) + \sum_{b'' \in P_j} \exp(i^{p}_{b'',j})}$

and the fusion weight of the entity knowledge feature vector of the text sub-string matched between position offsets $b$ and $j$ by:

$\alpha^{p}_{b,j} = \dfrac{\exp(i^{p}_{b,j})}{\exp(i^c_j) + \sum_{b' \in D_j} \exp(i^w_{b',j}) + \sum_{b'' \in P_j} \exp(i^{p}_{b'',j})}$

In the above formulas, $i^c_j$ is the input coefficient of the word vector learning network 210 when the character vector at position offset $j$ is input; $i^w_{b,j}$ is the input coefficient of the word vector learning network 220 when the word vector of the word matched between position offsets $b$ and $j$ is input; $i^{p}_{b,j}$ is the input coefficient of the entity knowledge feature vector learning network 230 when the entity knowledge feature vector of the text sub-string matched between position offsets $b$ and $j$ is input; $D$ denotes the set of words contained in the text to be recognized, $P$ denotes the set of text sub-strings contained in the text to be recognized, and $b'$ and $b''$ denote position offsets of characters in the text to be recognized; $D_j$ denotes the set of start-character position offsets of all words ending with the character at position offset $j$; and $P_j$ denotes the set of start-character position offsets of all text sub-strings ending with the character at position offset $j$.
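This normalization is an element-wise softmax over the character gate and all lattice gates ending at the current step, so the resulting fusion weights sum to one in every vector component. A minimal sketch with invented gate values:

```python
import numpy as np

def fusion_weights(i_char, i_words, i_entities):
    """Element-wise softmax over the character input coefficient and the
    word / entity-knowledge input coefficients ending at the current step;
    returns the weights in the same order as the inputs."""
    gates = [i_char] + list(i_words) + list(i_entities)
    exps = [np.exp(g) for g in gates]
    denom = sum(exps)                 # element-wise sum over all gates
    return [e / denom for e in exps]

i_c = np.array([0.2, -0.1])                          # character gate at j
i_w = [np.array([1.0, 0.5])]                         # word gate ("weaving")
i_p = [np.array([0.7, 0.3]), np.array([0.1, 0.9])]   # entity gates
weights = fusion_weights(i_c, i_w, i_p)
print(bool(np.allclose(sum(weights), 1.0)))  # True
```

The design mirrors an ordinary softmax; only the set of competitors changes per character step, since different steps close different words and sub-strings.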
In step 130, the entities contained in the text to be recognized are determined according to the entity labeling result.
Taking as an example an entity recognition model trained on samples labeled with the BIOES named-entity labeling scheme, for a text to be recognized that is input to the entity recognition model, the model outputs a sequence labeling result expressed in BIOES tags. The entity name described at the corresponding position of the text can then be determined from each B-I-E (or S) tag run in the sequence labeling result. The specific implementation of determining, from the entity labeling result, the entities contained in the text input to the entity recognition model may refer to the prior art and is not described in detail in this embodiment of the application.
The entity recognition method disclosed in the embodiments of the present application determines the semantic feature vectors and entity knowledge feature vectors matched with the text to be recognized, where the entity knowledge feature vectors indicate matching information between text sub-strings of the text to be recognized and a preset search log; performs fusion calculation on the semantic feature vectors and the entity knowledge feature vectors through a pre-trained entity recognition model and outputs an entity labeling result of the text to be recognized according to the fusion calculation result; and determines the entities contained in the text to be recognized according to the entity labeling result, thereby improving recognition performance for query entities with non-conventional meanings. The embodiments of the present application provide a new-word mining method that fuses search log features: massive user search-log features are used to optimize the new-word mining effect, which can effectively improve the recognition accuracy of new entities in query input.
On the other hand, according to the embodiment that the user behavior characteristics (such as user click information) reflected in the search log are used as entity knowledge characteristics, the relevance between the identified entity and the user behavior can be further improved, and the reference value of the recalled search result is improved.
Example two
An embodiment of the present application discloses an entity identification apparatus; as shown in fig. 3, the apparatus includes:
a semantic and entity knowledge feature vector determination module 310, configured to determine a semantic feature vector and an entity knowledge feature vector that are matched with a text to be recognized; the entity knowledge characteristic vector is used for indicating matching information of text substrings included in the text to be recognized and a preset search log;
the entity identification labeling module 320 is configured to perform fusion calculation on the semantic feature vectors and the entity knowledge feature vectors through a pre-trained entity identification model, and output an entity labeling result of the text to be identified according to a fusion calculation result;
and the entity determining module 330 is configured to determine, according to the entity labeling result, an entity included in the text to be recognized.
In some embodiments of the present application, the entity knowledge feature vectors comprise an entity knowledge feature vector for each text sub-string of the text to be recognized. As shown in fig. 4, the semantic and entity knowledge feature vector determination module 310 further comprises:
A search log matching sub-module 3101, configured to, for each text sub-string included in the text to be recognized, match the text sub-string with each document field of a query document included in a preset search log, respectively, and determine the query document based on matching between each document field and the text sub-string;
the entity knowledge feature vector determination sub-module 3102 is configured to, for each text sub-string included in the text to be recognized, determine, for each document field, a vector value of a dimension corresponding to the document field in the entity knowledge feature vector of the text sub-string according to click information of the query document based on matching of the document field and the text sub-string.
In some embodiments of the present application, the entity knowledge feature vector determination sub-module 3102 is further configured to:
determining the vector value of the dimension corresponding to the document field in the entity knowledge feature vector of the text sub-string according to whether a query document matched with the text sub-string on that document field has been clicked by a user; alternatively,
and determining vector values of corresponding dimensions of the document fields in the entity knowledge characteristic vectors of the text sub-strings according to click distribution information of the query documents matched with the text sub-strings by the user based on the document fields.
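The click-based construction described by sub-modules 3101 and 3102 can be sketched as follows. The document fields, the log records and the clicked-or-not variant of the feature are all invented for illustration; a click-distribution variant would replace the boolean with normalized click counts.

```python
# Hypothetical sketch of deriving an entity-knowledge feature vector for a
# text sub-string from a search log: one dimension per document field, set
# to 1.0 if any clicked query document matches the sub-string on that
# field, else 0.0. Field names and log records are invented.

FIELDS = ["title", "category", "brand"]

search_log = [
    {"title": "手工编织围巾", "category": "编织", "brand": "x", "clicked": True},
    {"title": "毛线",        "category": "编织", "brand": "编织", "clicked": False},
]

def entity_feature_vector(substring, log, fields=FIELDS):
    vec = []
    for field in fields:
        clicked = any(doc["clicked"] and substring in doc[field] for doc in log)
        vec.append(1.0 if clicked else 0.0)
    return vec

print(entity_feature_vector("编织", search_log))  # [1.0, 1.0, 0.0]
```

The sub-string "编织" ("weaving") matches a clicked document on the title and category fields but only an unclicked one on the brand field, hence the zero in the last dimension.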
In some embodiments of the present application, the semantic feature vectors comprise: the character vector of each character in the text to be recognized and the word vectors of the words in the text to be recognized; and the entity knowledge feature vectors comprise: the entity knowledge feature vectors of the text sub-strings in the text to be recognized;
the entity identification tagging module 320 is further configured to:
sequentially performing, through a memory unit of the pre-trained entity recognition model, weighted fusion processing on the character vector input at the current time, the word vectors input at the current time and the entity knowledge feature vectors input at the current time, until the character vectors of all characters in the text to be recognized have been processed, obtaining the final-time memory state of the memory unit;
and outputting the entity labeling result of the text to be recognized according to the final-time memory state, the character vector input at the final time, and the output of the entity recognition model at the time before the final time.
In some embodiments of the present application, the entity recognition model comprises a character vector learning network, a word vector learning network and an entity knowledge feature vector learning network, and sequentially performing, through the memory unit of the pre-trained entity recognition model, weighted fusion processing on the character vector input at the current time, the word vectors input at the current time and the entity knowledge feature vectors input at the current time includes:
calculating the semantic features of the currently input character vector through the character vector learning network; learning, through the word vector learning network, the semantic features between each currently input word vector and the hidden-layer output vector of the character vector of its starting character; and learning, through the entity knowledge feature vector learning network, the entity knowledge features between each currently input entity knowledge feature vector and the hidden-layer output vector of the character vector of its starting character;
according to the input weights of the current input vectors of the character vector learning network, the word vector learning network and the entity knowledge feature vector learning network, performing a weighted summation of the current memory cell states of the corresponding networks, and updating the current memory cell state of the character vector learning network according to the result of the weighted summation;
wherein the current memory cell state of the word vector learning network comprises semantic fusion information of all currently input word vectors associated with the currently input character vector, these being the word vectors of the words ending with the character corresponding to the currently input character vector; and the current memory cell state of the entity knowledge feature vector learning network comprises entity knowledge fusion information of all currently input entity knowledge feature vectors associated with the currently input character vector, these being the entity knowledge feature vectors of the text sub-strings ending with that character.
In some embodiments of the present application, the input weights of the current input vectors of the character vector learning network, the word vector learning network and the entity knowledge feature vector learning network are determined as follows:
determining the input coefficient of the character vector learning network for the currently input character vector, the input coefficient of the word vector learning network for each currently input word vector, and the input coefficient of the entity knowledge feature vector learning network for each currently input entity knowledge feature vector;
applying a preset operation to each of these input coefficients to obtain an initial input weight for the currently input character vector, for each currently input word vector, and for each currently input entity knowledge feature vector;
dividing each initial input weight by the sum of the initial input weights of all vectors input at the current time to obtain the weights used in the weighted fusion processing of the currently input character vector, word vectors and entity knowledge feature vectors;
wherein all vectors input at the current time comprise the character vector input at the current time, the word vectors input at the current time and the entity knowledge feature vectors input at the current time.
The entity identification device disclosed in this embodiment is used to implement the entity identification method described in the first embodiment of the present application, and specific implementation of each module of the device is not described again, and reference may be made to specific implementation of corresponding steps in the method embodiments.
The entity recognition device disclosed in the embodiments of the present application determines the semantic feature vectors and entity knowledge feature vectors matched with the text to be recognized, where the entity knowledge feature vectors indicate matching information between text sub-strings of the text to be recognized and a preset search log; performs fusion calculation on the semantic feature vectors and the entity knowledge feature vectors through a pre-trained entity recognition model and outputs an entity labeling result of the text to be recognized according to the fusion calculation result; and determines the entities contained in the text to be recognized according to the entity labeling result, thereby improving recognition performance for query entities with non-conventional meanings. The embodiments of the present application provide a new-word mining method that fuses search log features: massive user search-log features are used to optimize the new-word mining effect, which can effectively improve the recognition accuracy of new entities in query input.
On the other hand, according to the embodiment that the user behavior characteristics (such as user click information) reflected in the search log are used as entity knowledge characteristics, the relevance between the identified entity and the user behavior can be further improved, and the reference value of the recalled search result is improved.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The entity recognition method and device provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.
The apparatus embodiments described above are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without inventive effort.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in an electronic device according to embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
For example, fig. 5 shows an electronic device that may implement a method according to the present application. The electronic device may be a PC, a mobile terminal, a personal digital assistant, a tablet computer, or the like. The electronic device conventionally comprises a processor 510, a memory 520, and program code 530 stored in the memory 520 and executable on the processor 510; when executing the program code 530, the processor 510 implements the method described in the above embodiments. The memory 520 may be a computer program product or a computer readable medium, for example an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. The memory 520 has a storage space 5201 for the program code 530 of the computer program for performing any of the method steps described above. For example, the storage space 5201 for the program code 530 may include respective computer programs for implementing the respective steps of the above methods. The program code 530 is computer readable code. The computer programs may be read from or written to one or more computer program products, which comprise a program code carrier such as a hard disk, a compact disc (CD), a memory card, or a floppy disk. The computer program comprises computer readable code which, when run on an electronic device, causes the electronic device to perform the method according to the above embodiments.
The embodiment of the present application also discloses a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the entity identification method according to the first embodiment of the present application.
Such a computer program product may be a computer-readable storage medium that may have memory segments, memory spaces, etc. arranged similarly to the memory 520 in the electronic device shown in fig. 5. The program code may be stored in the computer readable storage medium, for example, compressed in a suitable form. The computer readable storage medium is typically a portable or fixed storage unit as described with reference to fig. 6. Typically, the storage unit comprises computer readable code 530′, which is code readable by a processor; when executed by the processor, this code performs the steps of the method described above.
Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Moreover, it is noted that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. An entity identification method, comprising:
determining a semantic feature vector and an entity knowledge feature vector matched with a text to be recognized; wherein the entity knowledge feature vector is used for indicating matching information between text substrings included in the text to be recognized and a preset search log;
performing fusion calculation on the semantic feature vectors and the entity knowledge feature vectors through a pre-trained entity recognition model, and outputting an entity labeling result of the text to be recognized according to a fusion calculation result;
and determining the entities included in the text to be recognized according to the entity labeling result.
2. The method of claim 1, wherein the entity knowledge feature vector comprises: an entity knowledge feature vector of each text substring in the text to be recognized; and the step of determining the entity knowledge feature vector matched with the text to be recognized comprises:
and respectively executing the following operations aiming at each text sub-string included in the text to be recognized:
matching the text substring against each document field of the query documents included in the preset search log, and determining the query documents that match the text substring on the basis of each document field; and
for each document field, determining the vector value of the dimension corresponding to that document field in the entity knowledge feature vector of the text substring, according to click information of the query documents that match the text substring on the basis of that document field.
3. The method of claim 2, wherein the step of determining the vector value of the dimension corresponding to the document field in the entity knowledge feature vector of the text substring according to click information of the query documents that match the text substring on the basis of the document field comprises:
determining the vector value of the dimension corresponding to the document field in the entity knowledge feature vector of the text substring according to whether a query document that matches the text substring on the basis of the document field has been clicked by a user; or,
determining the vector value of the dimension corresponding to the document field in the entity knowledge feature vector of the text substring according to click distribution information of the query documents that match the text substring on the basis of the document field.
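The two alternatives of claim 3 — a binary clicked/not-clicked value versus a value derived from the click distribution — can be sketched for a single vector dimension as follows. The log schema (a `clicked` flag per document) and the use of a click ratio as the "distribution" statistic are illustrative assumptions, not details fixed by the claims.

```python
def click_dimension(substring, field, search_log, use_distribution=True):
    """Compute one dimension of the entity knowledge feature vector.

    - distribution variant: fraction of query documents matching
      `substring` on `field` that were clicked;
    - binary variant: 1.0 if any matching query document was clicked.
    """
    matched = [d for d in search_log if substring in d.get(field, "")]
    if not matched:
        return 0.0
    clicked = sum(1 for d in matched if d.get("clicked", False))
    if use_distribution:
        return clicked / len(matched)
    return 1.0 if clicked > 0 else 0.0

log = [
    {"title": "haidilao hot pot", "clicked": True},
    {"title": "haidilao downtown", "clicked": False},
]
print(click_dimension("haidilao", "title", log))         # 0.5
print(click_dimension("haidilao", "title", log, False))  # 1.0
```

The distribution variant preserves more information (how strongly users engage with documents matched by this substring), while the binary variant is cheaper and more robust to sparse logs.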
4. The method according to any one of claims 1 to 3, wherein the semantic feature vector comprises: a character vector of each character in the text to be recognized and a word vector of each word in the text to be recognized; and the entity knowledge feature vector comprises: an entity knowledge feature vector of each text substring in the text to be recognized;
wherein the step of performing fusion calculation on the semantic feature vector and the entity knowledge feature vector through a pre-trained entity recognition model, and outputting an entity labeling result of the text to be recognized according to the fusion calculation result, comprises:
sequentially performing weighted fusion processing on the character vector input at the current moment, the word vectors input at the current moment, and the entity knowledge feature vectors input at the current moment through a memory unit of the pre-trained entity recognition model, until the character vectors of all characters in the text to be recognized have been processed, to obtain the final-moment memory state of the memory unit; and
outputting the entity labeling result of the text to be recognized according to the final-moment memory state, the character vector input at the final moment, and the output of the entity recognition model at the moment before the final moment.
5. The method of claim 4, wherein the entity recognition model comprises: a character vector learning network, a word vector learning network, and an entity knowledge feature vector learning network; and the step of sequentially performing weighted fusion processing on the character vector input at the current moment, the word vectors input at the current moment, and the entity knowledge feature vectors input at the current moment through the memory unit of the pre-trained entity recognition model comprises:
calculating semantic features of the currently input character vector through the character vector learning network, learning semantic features between each currently input word vector and the hidden-layer output vector of the currently input character vector through the word vector learning network, and learning entity knowledge features between each currently input entity knowledge feature vector and the hidden-layer output vector of the currently input character vector through the entity knowledge feature vector learning network; and
performing weighted summation on the current memory unit states of the corresponding networks according to the input weights of the current input vectors of the character vector learning network, the word vector learning network, and the entity knowledge feature vector learning network, and updating the current memory unit state of the character vector learning network according to the result of the weighted summation;
wherein the current memory unit state of the word vector learning network comprises: semantic fusion information of all currently input word vectors associated with the currently input character vector, a currently input word vector associated with the currently input character vector being: a word vector of a word whose last character corresponds to the currently input character vector; and the current memory unit state of the entity knowledge feature vector learning network comprises: entity knowledge fusion information of all current entity knowledge feature vectors associated with the currently input character vector, a current entity knowledge feature vector associated with the currently input character vector being: an entity knowledge feature vector of a text substring whose last character corresponds to the currently input character vector.
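The weighted fusion of memory unit states in claim 5 can be sketched, in a lattice-LSTM-like spirit, as a convex combination of the candidate cell states from the three channels: the current character, the words ending at this character, and the entity-knowledge vectors of substrings ending here. The vector shapes and the flat list of weights are illustrative assumptions; the claims do not specify the network internals.

```python
import numpy as np

def fuse_memory_states(char_state, word_states, entity_states, weights):
    """Weighted sum of candidate cell states from the character, word,
    and entity-knowledge channels; `weights` holds one weight per state,
    in the same order, and is assumed to sum to 1."""
    states = [char_state] + list(word_states) + list(entity_states)
    assert len(weights) == len(states)
    fused = np.zeros_like(char_state)
    for w, s in zip(weights, states):
        fused = fused + w * s
    return fused

c = fuse_memory_states(np.ones(4), [np.zeros(4)], [2 * np.ones(4)],
                       [0.5, 0.25, 0.25])
# per element: 0.5*1 + 0.25*0 + 0.25*2 = 1.0
```

At characters where no word or entity substring ends, only the character channel contributes, and the update degenerates to an ordinary character-level recurrence.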
6. The method of claim 5, wherein the input weights of the current input vectors of the character vector learning network, the word vector learning network, and the entity knowledge feature vector learning network are determined by:
respectively determining an input coefficient of the character vector learning network corresponding to the currently input character vector, input coefficients of the word vector learning network corresponding to each currently input word vector, and input coefficients of the entity knowledge feature vector learning network corresponding to each currently input entity knowledge feature vector;
respectively performing a preset operation on the determined input coefficient corresponding to the currently input character vector, the input coefficients corresponding to each currently input word vector, and the input coefficients corresponding to each currently input entity knowledge feature vector, to obtain an initial input weight corresponding to the currently input character vector, an initial input weight corresponding to each currently input word vector, and an initial input weight corresponding to each currently input entity knowledge feature vector; and
dividing the initial input weight corresponding to the currently input character vector, the initial input weight corresponding to each currently input word vector, and the initial input weight corresponding to each currently input entity knowledge feature vector respectively by the sum of the initial input weights of all vectors input at the current moment, to obtain the weights used when the currently input character vector, each currently input word vector, and each currently input entity knowledge feature vector are subjected to the weighted fusion processing;
wherein all the vectors input at the current moment include: the character vector input at the current moment, the word vectors input at the current moment, and the entity knowledge feature vectors input at the current moment.
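The normalization in claim 6 — apply a preset operation to each input coefficient, then divide by the sum over all vectors input at the current moment — matches a softmax when the preset operation is exponentiation. The choice of `exp` here is an assumption for illustration; the claim leaves the preset operation unspecified.

```python
import math

def input_weights(char_coef, word_coefs, entity_coefs):
    """Normalize input coefficients into fusion weights: exponentiate
    each coefficient (assumed preset operation) and divide by the sum
    over all vectors input at the current moment, i.e. a softmax."""
    coefs = [char_coef] + list(word_coefs) + list(entity_coefs)
    exps = [math.exp(c) for c in coefs]
    total = sum(exps)
    return [e / total for e in exps]

w = input_weights(0.3, [1.2], [-0.4, 0.0])
# w sums to 1, so the fused cell state is a convex combination
```

This guarantees the weights form a probability distribution over the character, word, and entity-knowledge channels, so the fused memory state stays on the same scale as its inputs.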
7. An entity identification apparatus, comprising:
the semantic and entity knowledge characteristic vector determining module is used for determining a semantic characteristic vector and an entity knowledge characteristic vector which are matched with the text to be recognized; the entity knowledge characteristic vector is used for indicating matching information of text substrings included in the text to be recognized and a preset search log;
the entity identification and labeling module is used for performing fusion calculation on the semantic feature vectors and the entity knowledge feature vectors through a pre-trained entity identification model and outputting an entity labeling result of the text to be identified according to a fusion calculation result;
and an entity determining module, configured to determine the entities included in the text to be recognized according to the entity labeling result.
8. The apparatus of claim 7, wherein the entity knowledge feature vector comprises: an entity knowledge feature vector of each text substring in the text to be recognized; and the semantic and entity knowledge feature vector determination module comprises:
a search log matching submodule, configured to match each text substring included in the text to be recognized against each document field of the query documents included in the preset search log, and determine the query documents that match the text substring on the basis of each document field; and
an entity knowledge feature vector determination submodule, configured to, for each document field, determine the vector value of the dimension corresponding to the document field in the entity knowledge feature vector of the text substring according to click information of the query documents that match the text substring on the basis of the document field.
9. An electronic device comprising a memory, a processor, and program code stored on the memory and executable on the processor, wherein the processor implements the entity identification method of any one of claims 1 to 6 when executing the program code.
10. A computer-readable storage medium, on which a program code is stored, characterized in that the program code realizes the steps of the entity identification method of any of claims 1 to 6 when executed by a processor.
CN202010538406.0A 2020-06-12 2020-06-12 Entity identification method and device and electronic equipment Active CN111859967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010538406.0A CN111859967B (en) 2020-06-12 2020-06-12 Entity identification method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN111859967A true CN111859967A (en) 2020-10-30
CN111859967B CN111859967B (en) 2024-04-09

Family

ID=72986830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010538406.0A Active CN111859967B (en) 2020-06-12 2020-06-12 Entity identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111859967B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103870489A (en) * 2012-12-13 2014-06-18 北京信息科技大学 Chinese name self-extension recognition method based on search logs
EP3557504A1 (en) * 2018-04-20 2019-10-23 Facebook, Inc. Intent identification for agent matching by assistant systems
CN108875809A (en) * 2018-06-01 2018-11-23 大连理工大学 The biomedical entity relationship classification method of joint attention mechanism and neural network
CN110298042A (en) * 2019-06-26 2019-10-01 四川长虹电器股份有限公司 Based on Bilstm-crf and knowledge mapping video display entity recognition method
CN111126068A (en) * 2019-12-25 2020-05-08 中电云脑(天津)科技有限公司 Chinese named entity recognition method and device and electronic equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950732A (en) * 2021-02-23 2021-06-11 北京三快在线科技有限公司 Image generation method and device, storage medium and electronic equipment
CN112950732B (en) * 2021-02-23 2022-04-01 北京三快在线科技有限公司 Image generation method and device, storage medium and electronic equipment
CN113011187A (en) * 2021-03-12 2021-06-22 平安科技(深圳)有限公司 Named entity processing method, system and equipment
CN113851190A (en) * 2021-11-01 2021-12-28 四川大学华西医院 Heterogeneous mRNA sequence optimization method
CN115098647A (en) * 2022-08-24 2022-09-23 中关村科学城城市大脑股份有限公司 Feature vector generation method and device for text representation and electronic equipment
CN115098647B (en) * 2022-08-24 2022-11-01 中关村科学城城市大脑股份有限公司 Feature vector generation method and device for text representation and electronic equipment

Also Published As

Publication number Publication date
CN111859967B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
CN110162749B (en) Information extraction method, information extraction device, computer equipment and computer readable storage medium
CN112528672B (en) Aspect-level emotion analysis method and device based on graph convolution neural network
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
CN106649818B (en) Application search intention identification method and device, application search method and server
CN110222188B (en) Company notice processing method for multi-task learning and server
CN111859967B (en) Entity identification method and device and electronic equipment
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
CN110489523B (en) Fine-grained emotion analysis method based on online shopping evaluation
CN111324771B (en) Video tag determination method and device, electronic equipment and storage medium
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN109388743B (en) Language model determining method and device
CN114580382A (en) Text error correction method and device
CN112667782A (en) Text classification method, device, equipment and storage medium
US20210004602A1 (en) Method and apparatus for determining (raw) video materials for news
CN112016294B (en) Text-based news importance evaluation method and device and electronic equipment
CN112487827A (en) Question answering method, electronic equipment and storage device
CN112784602A (en) News emotion entity extraction method based on remote supervision
CN110569355B (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN113836896A (en) Patent text abstract generation method and device based on deep learning
Huang et al. Text classification with document embeddings
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
CN111881264B (en) Method and electronic equipment for searching long text in question-answering task in open field
CN112989803A (en) Entity link model based on topic vector learning
CN116932736A (en) Patent recommendation method based on combination of user requirements and inverted list
Achilles et al. Using Surface and Semantic Features for Detecting Early Signs of Self-Harm in Social Media Postings.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant