CN111967262A

CN111967262A - Method and device for determining entity tag

Info

Publication number: CN111967262A
Application number: CN202010617196.4A
Authority: CN
Inventors: 程鸣权; 杨浩; 刘昊; 刘欢; 陈坤斌; 刘准; 何伯磊; 和为
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-06-30
Filing date: 2020-06-30
Publication date: 2020-11-20
Anticipated expiration: 2040-06-30
Also published as: CN111967262B

Abstract

The application discloses a method and a device for determining an entity tag, relating to the technical fields of natural language processing, big data processing and deep learning. The specific implementation scheme is as follows: acquiring an entity tag library corresponding to a document type of a target document, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type; matching the target document with the entity tag library to obtain a plurality of successfully matched candidate entity tags; acquiring attribute features of the target document, and acquiring tag features corresponding to each candidate entity tag according to the target document; inputting the attribute features and the tag features into a pre-trained tag recognition model to obtain a first confidence corresponding to each candidate entity tag; and determining a target entity tag of the target document from the plurality of candidate entity tags according to the first confidence. The entity tag is thereby determined in a semi-automatic manner, which improves the accuracy and recall of entity tag determination and reduces labor cost.

Description

Method and device for determining entity tag

Technical Field

The application relates to the technical field of natural language processing, big data processing and deep learning, in particular to a method and a device for determining an entity label.

Background

With the development of internet technology, knowledge management scenarios are increasingly implemented online, for example the management of enterprise knowledge documents or the search of technical documents. Whatever the scenario, operations on the related documents rely on the entity tags labeled on those documents.

In the related art, entity tags are produced as follows: a business expert manually compiles a tag system, and the keywords of a document are then matched against that tag system using keyword matching to determine the entity tags of the document.

However, this way of determining entity tags incurs a high labor cost, and the accuracy of the entity tags depends on how comprehensive and accurate the manually compiled tag system is, so the accuracy and recall of the entity tags are low.

Disclosure of Invention

The application provides a method and a device for determining an entity label, so that the entity label is determined in a semi-automatic mode, the accuracy and the recall rate of the determination of the entity label are improved, and the labor cost is reduced.

According to an aspect of the present application, there is provided a method for determining an entity tag, including: acquiring an entity tag library corresponding to a document type of a target document, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type; matching the target document with the entity tag library to obtain a plurality of successfully matched candidate entity tags; acquiring attribute features of the target document, and acquiring tag features corresponding to each candidate entity tag according to the target document; inputting the attribute features and the label features into a label recognition model trained in advance, and acquiring a first confidence corresponding to each candidate entity label; determining a target entity tag of the target document from the plurality of candidate entity tags according to the first confidence.

According to another aspect of the present application, there is provided an entity tag determination apparatus, including: a first acquisition module, configured to acquire an entity tag library corresponding to a document type of a target document, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type; a second acquisition module, configured to match the target document with the entity tag library and acquire a plurality of successfully matched candidate entity tags; a third acquisition module, configured to acquire attribute features of the target document, and acquire, according to the target document, tag features corresponding to each candidate entity tag; a fourth acquisition module, configured to input the attribute features and the tag features into a pre-trained tag recognition model, and acquire a first confidence corresponding to each candidate entity tag; and a first determining module, configured to determine a target entity tag of the target document from the plurality of candidate entity tags according to the first confidence.

According to still another aspect of the present application, there is provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of entity tag determination as previously described.

According to yet another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to execute the method of determining an entity tag as described above.

The technical scheme disclosed by the application at least comprises the following additional technical characteristics:

An entity tag library corresponding to a document type of a target document is obtained, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type. The target document is then matched with the entity tag library to obtain a plurality of successfully matched candidate entity tags. Attribute features of the target document are obtained, and tag features corresponding to each candidate entity tag are obtained according to the target document. Finally, the attribute features and the tag features are input into a pre-trained tag recognition model to obtain a first confidence corresponding to each candidate entity tag, and the target entity tag of the target document is determined from the candidate entity tags according to the first confidence. The entity tag is thereby determined in a semi-automatic manner, which improves the accuracy and recall of entity tag determination and reduces labor cost.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

Fig. 1 is a schematic flowchart of a method for determining an entity tag according to a first embodiment of the present application;

Fig. 2 is a schematic diagram of a recognition scenario of a tag recognition model according to a second embodiment of the present application;

Fig. 3 is a flowchart illustrating a method for determining an entity tag according to a third embodiment of the present application;

Fig. 4 is a diagram illustrating a second inverted index table according to a fourth embodiment of the present application;

Fig. 5 is a diagram illustrating a matching scenario of a second inverted index table according to a fifth embodiment of the present application;

Fig. 6 is a flowchart illustrating a method for determining an entity tag according to a sixth embodiment of the present application;

Fig. 7 is a flowchart illustrating a method for determining an entity tag according to a seventh embodiment of the present application;

Fig. 8 is a diagram illustrating a structure of a first inverted index table according to an eighth embodiment of the present application;

Fig. 9 is a flowchart illustrating a method for determining an entity tag according to a ninth embodiment of the present application;

Fig. 10 is a schematic diagram of an ARC-I model training process according to a tenth embodiment of the present application;

Fig. 11 is a schematic flow chart illustrating an implementation of an entity tag determination system according to an eleventh embodiment of the present application;

Fig. 12 is a schematic structural diagram of an entity tag determination apparatus according to a twelfth embodiment of the present application;

Fig. 13 is a schematic structural diagram of an entity tag determination apparatus according to a thirteenth embodiment of the present application;

Fig. 14 is a schematic structural diagram of an entity tag determination apparatus according to a fourteenth embodiment of the present application; and

Fig. 15 is a block diagram of an electronic device for implementing a method of entity tag determination of an embodiment of the present application.

Detailed Description

The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

To solve the technical problems mentioned in the background, namely the high labor cost and low accuracy of entity tag determination, the application provides a complete semi-automatic solution for calculating the entity tags of documents.

Specifically, fig. 1 is a flowchart of a method for determining an entity tag according to an embodiment of the present application, and as shown in fig. 1, the method includes:

Step 101, an entity tag library corresponding to a document type of a target document is obtained, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type.

An entity tag can be understood as an entity of interest in the enterprise knowledge field. The entity tag library in this embodiment corresponds to a document type: for example, when the document type is insurance, the corresponding entity tags may be underwriting, claims, and the like; when the document type is sports, the corresponding entity tags may be sports equipment, sports actions, and the like.

Entity tags may differ across document types, and even the same entity tag may carry different meanings. For example, entity tags for the insurance document type focus on underwriting, claims, and the like, whereas entity tags for the internet document type focus on machine learning, natural language processing, and the like.

Therefore, in order to enable accurate recall, in the present application, an entity tag library corresponding to a document type of a target document is obtained, where the entity tag library includes a plurality of entity tags corresponding to the document type.

In this embodiment, a preset correspondence may be queried according to the document type to obtain the entity tag library corresponding to that document type. The preset correspondence stores the mapping between document types and entity tag libraries; for example, the correspondence may record that document type A corresponds to entity tag library 1.
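The lookup described above can be sketched as a plain dictionary query. This is only an illustration: the name `TAG_LIBRARIES`, the function `get_tag_library`, and the sample tags (taken from the insurance and sports examples earlier in this description) are assumptions, not part of the application.

```python
from typing import Dict, List

# Hypothetical correspondence between document types and entity tag libraries.
TAG_LIBRARIES: Dict[str, List[str]] = {
    "insurance": ["underwriting", "claims"],
    "sports": ["sports equipment", "sports actions"],
}


def get_tag_library(doc_type: str) -> List[str]:
    """Return the entity tag library registered for doc_type, or an empty list."""
    return TAG_LIBRARIES.get(doc_type, [])


print(get_tag_library("insurance"))  # ['underwriting', 'claims']
```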

Step 102, matching the target document with the entity tag library to obtain a plurality of successfully matched candidate entity tags.

In this embodiment, the target document is matched with the entity tag library, for example by keyword matching or semantic matching, and a plurality of successfully matched candidate entity tags are obtained.

Step 103, acquiring the attribute features of the target document, and acquiring the tag features corresponding to each candidate entity tag according to the target document.

It can be understood that the candidate entity tags are obtained by relatively coarse-grained matching; therefore, to further ensure the accuracy of the recalled entity tags, the candidate entity tags are further filtered.

In this embodiment, the attribute features of the target document are obtained, and the tag features corresponding to each candidate entity tag are obtained according to the target document. The attribute features may include title features and content features of the target document. The title features may include at least one of the length of the title, the noun attributes contained in the title, the words contained in the title, the parts of speech of those words, and the like. The content features may include the length of the content, the words contained in the content, the parts of speech of those words, and the like. The tag features may include at least one of the frequency and position of occurrence of the candidate entity tag in the title of the target document, at least one of the frequency and position of occurrence of the candidate entity tag in the content of the target document, the length of the entity tag, the words contained in the entity tag, the parts of speech of those words, and the like. In summary, the attribute features and the tag features in this embodiment are multidimensional features that comprehensively reflect the characteristics of the candidate entity tags and of the target document.
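As a rough illustration of a few of the features listed above, the sketch below computes lengths, occurrence counts and first positions with plain string operations. The function names and the feature keys are assumptions chosen for this example; part-of-speech and word-level features are omitted because they would need a tokenizer that the description does not specify.

```python
from typing import Dict


def attribute_features(title: str, content: str) -> Dict[str, int]:
    """Document-level features: here only the title and content lengths."""
    return {"title_length": len(title), "content_length": len(content)}


def tag_features(tag: str, title: str, content: str) -> Dict[str, int]:
    """Per-tag features: frequency and first position in title and content."""
    return {
        "title_count": title.count(tag),
        "title_first_pos": title.find(tag),      # -1 when the tag is absent
        "content_count": content.count(tag),
        "content_first_pos": content.find(tag),  # -1 when the tag is absent
        "tag_length": len(tag),
    }


title = "claims handling guide"
content = "how to file claims and track claims"
print(attribute_features(title, content))
print(tag_features("claims", title, content))
```

The two dictionaries would then be concatenated into one feature vector per candidate entity tag before being passed to the recognition model.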

Step 104, inputting the attribute features and the tag features into a pre-trained tag recognition model, and acquiring a first confidence corresponding to each candidate entity tag.

It should be noted that, in this embodiment, a tag recognition model is obtained through pre-training. The model determines a first confidence for each candidate entity tag from the input attribute features and tag features. The first confidence may be expressed as a probability value, as shown in fig. 2, or as a likelihood level; the higher the first confidence, the better the candidate entity tag fits the target document.

Step 105, determining a target entity tag of the target document from the plurality of candidate entity tags according to the first confidence.

In this embodiment, the plurality of candidate entity tags may be sorted from high to low by first confidence, and the top-ranked few candidate entity tags may be used as the target entity tags.

Alternatively, a matching degree threshold may be preset, and the candidate entity tags whose first confidence is greater than the threshold are taken as the target entity tags.

Therefore, on the one hand, the target entity tags are screened based on multidimensional features, which ensures that the entity tags match the target document; on the other hand, the entity tags are searched for in the entity tag library corresponding to the document type of the target document, which avoids noise caused by differences between document types and further improves the match between the entity tags and the target document.
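The two selection strategies of step 105 can be sketched as follows, assuming the first confidences are already available as a mapping from candidate entity tag to score. The function names and the sample scores are illustrative only.

```python
from typing import Dict, List


def select_top_k(confidences: Dict[str, float], k: int) -> List[str]:
    """Sort candidates by first confidence, high to low, and keep the first k."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return [tag for tag, _ in ranked[:k]]


def select_above_threshold(confidences: Dict[str, float], threshold: float) -> List[str]:
    """Keep every candidate whose first confidence exceeds the threshold."""
    return [tag for tag, score in confidences.items() if score > threshold]


scores = {"underwriting": 0.91, "claims": 0.40, "premium": 0.75}
print(select_top_k(scores, 2))              # ['underwriting', 'premium']
print(select_above_threshold(scores, 0.7))  # ['underwriting', 'premium']
```

Top-k selection always returns a fixed number of tags, while thresholding lets the number of tags vary with how confident the model is.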

To sum up, the method for determining entity tags according to the embodiment of the present application obtains an entity tag library corresponding to a document type of a target document, where the entity tag library includes a plurality of entity tags corresponding to the document type, further matches the target document with the entity tag library, obtains a plurality of candidate entity tags successfully matched, obtains an attribute characteristic of the target document, obtains a tag characteristic corresponding to each candidate entity tag according to the target document, and finally inputs the attribute characteristic and the tag characteristic into a pre-trained tag identification model, obtains a first confidence corresponding to each candidate entity tag, and determines the target entity tag of the target document from the plurality of candidate entity tags according to the first confidence. Therefore, the entity label is determined in a semi-automatic mode, the accuracy and the recall rate of the entity label determination are improved, and the labor cost is reduced.

It should be noted that the manner of matching the target document with the entity tag library to obtain a plurality of successfully matched candidate entity tags differs across application scenarios, as illustrated below:

Example one:

In this example, as shown in fig. 3, step 102 includes:

Step 201, performing word segmentation processing on the document title and the document content of the target document to obtain a plurality of document word segments.

In this embodiment, word segmentation processing is performed on the document title and the document content of the target document according to the word segmentation attributes, so as to obtain a plurality of document word segments.

Step 202, performing word segmentation processing on the entity tags to obtain tag segmentation words, and constructing a second inverted index table corresponding to the entity tag library according to the tag segmentation words.

In this embodiment, word segmentation is performed on the entity tags according to the word segmentation attributes included in the entity tags to obtain tag segmentation, and a second inverted index table corresponding to the entity tag library is constructed according to the tag segmentation.

When the second inverted index table is constructed, the tag word segments of each entity tag are processed in order from front to back: the leading tag word segment is used as the key, and the tag word segments that follow it are used as values.

For example, as shown in fig. 4, the entity tag "code learning guide" is segmented into the tag word segments "code" and "learning guide", and the entity tag "code learning" is segmented into "code" and "learning". In the constructed second inverted index table, the key is therefore "code", and the values are "learning guide" and "learning".
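One possible in-memory form of such an index is a nested dictionary keyed by the leading word segment. The sketch below assumes the entity tags have already been segmented; the nested-dict layout, the `"<END>"` marker and the function name `build_index` are choices made for this example rather than details given in the application.

```python
from typing import Dict, List

END = "<END>"  # hypothetical marker stored where a complete entity tag ends


def build_index(segmented_tags: List[List[str]]) -> Dict:
    """Build a nested-dict index: first segment as key, later segments below it."""
    index: Dict = {}
    for segments in segmented_tags:
        node = index
        for segment in segments:
            node = node.setdefault(segment, {})
        node[END] = " ".join(segments)  # remember which entity tag this path spells
    return index


index = build_index([["code", "learning guide"], ["code", "learning"]])
print(sorted(index["code"]))  # ['learning', 'learning guide']
```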

Step 203, matching each document word segment of the plurality of document word segments with the nodes in the second inverted index table, and judging whether a second node path corresponding to the document word segments is included.

In this embodiment, the document word segments are matched against the key nodes of the second inverted index table one by one, in the order in which they appear in the document. After a key node is matched, the subsequent document word segments are matched one by one against the value nodes under that key node. If a plurality of consecutive document word segments match a plurality of consecutive nodes in the second inverted index table, a second node path is considered to have been found. The number of nodes required for a second node path can be calibrated according to the needs of the scenario.

For example, suppose a second node path must contain at least 3 nodes. As shown in fig. 5, when the document word segments A, B, C and D are consecutive and the second inverted index table includes the key node A, the segments B, C and D are matched in turn against the value nodes at the successive levels below that key node. If the path "A-B-C" is matched, its node count is 3, so "A-B-C" is considered a second node path (indicated by gray filling in the figure).
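The A-B-C example can be traced with a small matcher over a nested-dict index. This is a sketch under the assumption that the index is stored as nested dictionaries (key node at the top level, value nodes below); `find_paths` and `min_nodes` are hypothetical names.

```python
from typing import Dict, List


def find_paths(doc_segments: List[str], index: Dict, min_nodes: int = 3) -> List[List[str]]:
    """Return paths of at least min_nodes consecutive document segments found in the index."""
    paths = []
    for start, segment in enumerate(doc_segments):
        if segment not in index:  # only key nodes can start a path
            continue
        node, path = index[segment], [segment]
        for following in doc_segments[start + 1:]:
            if following not in node:  # the chain of consecutive matches ends here
                break
            node = node[following]
            path.append(following)
        if len(path) >= min_nodes:
            paths.append(path)
    return paths


index = {"A": {"B": {"C": {}}}}
print(find_paths(["A", "B", "C", "D"], index))  # [['A', 'B', 'C']]
```

Segment D is not a value node below C, so matching stops after three nodes, which still satisfies the minimum path length.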

Step 204, if the second node path is included, determining that the entity label corresponding to the second node path is a candidate entity label.

In this embodiment, if a second node path is included, the entity tag corresponding to the second node path is determined to be a candidate entity tag. The node path segment corresponding to each entity tag may be preset for every path in the index; the node path segment in which the second node path lies is then located, and the entity tag of that node path segment is determined to be the candidate entity tag.

In practice, even when a second node path exists, the corresponding document word segments may appear only a few times in the document, which makes the resulting candidate entity tag noisy. Therefore, in an embodiment of the present application, after the second node path is determined, the number of times the entity tag corresponding to the second node path appears in the target document is counted, and that entity tag is determined to be a candidate entity tag only if the count is greater than a preset count threshold.
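The occurrence-count filter can be sketched with a simple substring count. Counting raw substrings is an assumption made here for brevity; an implementation could equally count matched node paths. The function name and sample text are illustrative.

```python
from typing import List


def filter_by_count(tags: List[str], document_text: str, min_count: int) -> List[str]:
    """Keep only the tags that occur more than min_count times in the document."""
    return [tag for tag in tags if document_text.count(tag) > min_count]


text = "code learning is fun; code learning takes time; a learning guide helps"
print(filter_by_count(["code learning", "learning guide"], text, 1))  # ['code learning']
```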

Example two:

In this example, as shown in fig. 6, step 102 includes:

Step 301, a title semantic vector of a document title of a target document is calculated.

The title semantic vector may be calculated according to semantic analysis algorithms and the like in the prior art.

Step 302, calculate the tag semantic vector of each entity tag.

The tag semantic vector may be calculated according to a semantic analysis algorithm or the like in the prior art.

Step 303, calculating semantic similarity between the title semantic vector and the tag semantic vector of each entity tag, and determining the entity tags with semantic similarity greater than a preset similarity threshold as candidate entity tags.

In this embodiment, semantic similarity between the title semantic vector and the tag semantic vector of each entity tag is calculated, and the entity tags with semantic similarity greater than a preset similarity threshold are determined as candidate entity tags.
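A minimal sketch of this recall step is shown below. It assumes cosine similarity as the similarity measure, which the description does not specify, and takes the semantic vectors as given; the tiny two-dimensional vectors and the function names are purely illustrative.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_by_similarity(title_vec: List[float],
                         tag_vecs: Dict[str, List[float]],
                         threshold: float) -> List[str]:
    """Return the tags whose vector is more similar to the title than the threshold."""
    return [tag for tag, vec in tag_vecs.items()
            if cosine_similarity(title_vec, vec) > threshold]


tag_vecs = {"machine learning": [1.0, 0.0], "underwriting": [0.0, 1.0]}
print(recall_by_similarity([1.0, 0.0], tag_vecs, 0.5))  # ['machine learning']
```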

Of course, in this embodiment, literal matching may also be combined with semantic recall to filter the candidate entity tags. That is, candidate entity tags are recalled using both literal and semantic matching, where the literal recall may follow the approach of example one above, or may be direct literal matching, and the like.

In summary, the method for determining the entity tag according to the embodiment of the present application determines the candidate entity tag in the entity tag library corresponding to the document type of the target document, so that the accuracy of entity tag recall is improved.

The above embodiments describe the online part, in which entity tags are labeled for documents. In the present application, the offline part and the online part cooperate to form an entity tag determination system, which increases the degree of automation of entity tag determination, reduces manual participation, and improves the accuracy and recall of entity tag calculation.

The entity tag library is determined off-line.

In this embodiment, as shown in fig. 7, before step 101, the present application further includes:

Step 401, obtaining a document search log, a professional document, a knowledge graph and an associated vertical document corresponding to the document type.

In this embodiment, the entity tag library is determined from various knowledge documents corresponding to the document type, including document search logs, professional documents, knowledge graphs, and associated vertical documents.

Step 402, extracting search words in the document search log, performing word segmentation processing on the search words to obtain search participles, and obtaining a first reference entity tag corresponding to the document type according to the search participles.

For the document search log, the search terms are segmented to obtain search word segments, and the first reference entity tag corresponding to the document type is obtained according to the search word segments.

In some possible embodiments, search terms whose repetition rate is higher than a certain value may be used directly as first reference entity tags.

In other possible embodiments, a first inverted index table is constructed with the search word segments of the document search log as nodes, arranged by node priority from top to bottom. That is, the first search word segment of each search term is used as the key of the inverted index, and the search word segments of all search terms in the document search log that begin with that segment are used as the values of the inverted index, thereby generating the first inverted index table.

In the generated first inverted index table, one key may correspond to multiple levels of values. To keep the entity tags focused, target nodes whose node priority is higher than a preset level are determined in the first inverted index table, the first node path of each target node in the first inverted index table is determined, and the first reference entity tag is determined from the search word segments covered by the first node path. In this way, this embodiment retains first reference entity tags of coarser granularity.

For example, as shown in fig. 8, suppose the values corresponding to the key "banana" in the first inverted index table include "type" at the first level and "growing method" and "distinguishing method" at the second level, and the preset level is the first level. The first node path running from the key to the first-level node is then used as the first reference entity tag (the gray filling in the figure represents the first node path); in this embodiment, "banana type" is used as the first reference entity tag.
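The banana example can be reproduced by truncating each segmented search term at the preset level, which yields the same paths as cutting the index at that depth. The sketch assumes the search terms are already segmented; `reference_tags_from_queries` and `max_level` are hypothetical names, and joining segments with a space is a convention for English text only.

```python
from typing import List


def reference_tags_from_queries(segmented_queries: List[List[str]],
                                max_level: int = 1) -> List[str]:
    """Keep the key plus the first max_level value segments of every search term."""
    tags = set()
    for segments in segmented_queries:
        if len(segments) > max_level:  # the term must reach the preset level
            tags.add(" ".join(segments[: max_level + 1]))
    return sorted(tags)


queries = [
    ["banana", "type", "growing method"],
    ["banana", "type", "distinguishing method"],
]
print(reference_tags_from_queries(queries))  # ['banana type']
```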

Step 403, extracting a plurality of keywords in the professional document, and calculating an importance value of each keyword in the plurality of keywords with respect to the topic of the professional document according to a preset algorithm.

Step 404, determining a preset number of target keywords as second reference entity labels in the plurality of keywords according to the importance values.

In this embodiment, the preset algorithm may be the tf-idf algorithm: keywords are extracted by tf-idf, which takes into account, among other things, the frequency with which each keyword occurs in the professional document; the tf-idf score is used as the importance value, and a preset number of target keywords with the largest tf-idf scores are selected as the mined second reference entity tags.
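A small pure-Python version of this ranking is sketched below. It assumes already tokenized documents and uses one common tf-idf variant (term frequency normalized by document length, idf = log(N / df)); the application does not state which variant is used, and the function name and sample corpus are illustrative.

```python
import math
from collections import Counter
from typing import List


def top_keywords(documents: List[List[str]], doc_index: int, k: int) -> List[str]:
    """Rank the words of one tokenized document by tf-idf and return the top k."""
    doc = documents[doc_index]
    n_docs = len(documents)
    scores = {}
    for word, count in Counter(doc).items():
        df = sum(1 for d in documents if word in d)  # documents containing the word
        scores[word] = (count / len(doc)) * math.log(n_docs / df)
    # Highest score first; ties are broken alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


corpus = [
    ["underwriting", "policy", "underwriting", "risk"],
    ["policy", "claims"],
    ["policy", "premium"],
]
print(top_keywords(corpus, 0, 2))  # ['underwriting', 'risk']
```

The word "policy" appears in every document, so its idf is zero and it is ranked below the more distinctive keywords.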

Step 405, identify proper nouns in the knowledge-graph and associated vertical documents, and determine a third reference entity tag according to the proper nouns.

In this embodiment, proper nouns in the knowledge graph and the associated vertical documents are identified using algorithms such as deep learning, and the third reference entity tag is determined according to the proper nouns.

Step 406, determining an entity tag library according to the first reference entity tag, the second reference entity tag and the third reference entity tag.

In this embodiment, the entity tag library is determined by using the first reference entity tag, the second reference entity tag and the third reference entity tag mined in multiple dimensions.

Of course, to further ensure the purity of the entity tag library, in an embodiment of the present application a neural network model, which may be a TextCNN model, is obtained through pre-training. Each of the first, second and third reference entity tags is input into the pre-trained neural network model to obtain a second confidence for each reference entity tag, and the entity tag library is determined from the reference entity tags whose second confidence is greater than a preset confidence value.

In this embodiment, when the neural network model is pre-trained, a large number of positive and negative samples may be sampled for training, and the model parameters are adjusted according to the training results; for example, training stops when the recognition accuracy on the positive and negative samples exceeds a certain ratio, and otherwise training continues iteratively.

In addition, the label recognition model in this embodiment is also obtained by offline training.

In one embodiment of the present application, as shown in fig. 9, the step of training the label recognition model includes:

Step 501, a candidate entity tag library is obtained, wherein the candidate entity tag library includes a plurality of standard entity tags corresponding to a plurality of document types.

The samples contained in the candidate entity tag library in this embodiment may be obtained by unsupervised rule mining and may also include manually labeled samples. After the samples are obtained, incomplete samples are cleaned out; for example, if a document type has no corresponding standard entity tag, the corresponding samples are removed. This process can be executed by a machine, which reduces labor cost.

Step 502, performing word segmentation processing on a plurality of standard entity labels to obtain standard label word segments.

In this embodiment, the standard tag segmentation may be obtained by performing word segmentation processing on a plurality of standard entity tags in a manner such as attribute analysis.

Step 503, constructing a third inverted index table according to the standard label participles corresponding to the plurality of standard entity labels.

The third inverted index table is constructed by using the first standard tag word segment of each standard entity tag as a key of the third inverted index table, and using the standard tag word segments of all standard entity tags that begin with that segment as the values of the inverted index.

Step 504, a plurality of first training documents corresponding to the document types are obtained, and a training document word segmentation is obtained by performing word segmentation processing on each first training document in the plurality of first training documents.

In this embodiment, a plurality of first training documents corresponding to the document types are obtained, and word segmentation processing is performed on each of the plurality of first training documents to obtain the training document word segmentation.

Step 505, matching the training document word segmentation of each first training document with the nodes in the third inverted index table, and judging whether a third node path corresponding to the training document word segmentation is included.

In this embodiment, the training document word segmentation of each first training document is matched with the nodes in the third inverted index table, and whether a third node path corresponding to the training document word segmentation is included is determined, where the first training document may be understood as a machine-mined training document.

Step 506, if the third node path is included, determining that the standard entity label corresponding to the third node path is the entity label corresponding to the first training document.

In this embodiment, if the third node path is included, it is determined that the standard entity label corresponding to the third node path is the entity label corresponding to the first training document; that is, the entity labels of the mined training documents are labeled automatically by machine.
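Steps 505 and 506 can be sketched as below. The sketch assumes that a "node path" corresponds to a label whose participles appear consecutively in the document word segmentation; this interpretation, the index layout, and the name `label_document` are assumptions for illustration only.

```python
from typing import Dict, List


def label_document(doc_tokens: List[str],
                   index: Dict[str, List[List[str]]]) -> List[str]:
    """Return the labels whose participle sequence occurs in doc_tokens.

    `index` maps a first participle to the participle lists of all labels
    starting with it (an assumed inverted index layout).
    """
    matched: List[str] = []
    for i, token in enumerate(doc_tokens):
        # Only labels starting with this token can match at position i.
        for tag_tokens in index.get(token, []):
            if doc_tokens[i:i + len(tag_tokens)] == tag_tokens:
                label = " ".join(tag_tokens)
                if label not in matched:
                    matched.append(label)
    return matched
```

A document for which the returned list is empty contains no matching path and receives no machine-assigned label.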

Step 507, obtaining a second training document corresponding to the document type and a pre-labeled entity label of the second training document.

The second training document can be understood as a manually labeled training document, that is, a training document that is labeled with an entity label in advance.

Step 508, training and generating a label recognition model according to the first training document and the entity label corresponding to the first training document, and the second training document and the entity label corresponding to the second training document.

In this embodiment, as shown in fig. 10, when the label recognition model is an improved ARC-I model, N-dimensional features of each training document may be extracted and input into the improved ARC-I model for training. The N-dimensional features include the document title and document content of the corresponding training document, a word segmentation feature obtained by segmenting the entity label, the number of occurrences of the entity label in the document title, the number of occurrences of the entity label in the document content, a length feature of the entity label, and the like (corresponding to features 1 to N in the drawing). A label recognition model that can be used for entity label recall is obtained according to the training result.
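The statistical part of the features listed above can be sketched as follows. The dictionary layout, the lowercase normalization, the whitespace segmentation, and the name `extract_features` are assumptions for illustration; how the improved ARC-I model encodes these features is not shown here.

```python
from typing import Any, Dict


def extract_features(title: str, content: str, tag: str) -> Dict[str, Any]:
    """Collect title, content, tag participles, occurrence counts, and tag length."""
    if not tag:
        raise ValueError("entity tag must be non-empty")
    # Case-insensitive counting is an assumption of this sketch.
    title_l, content_l, tag_l = title.lower(), content.lower(), tag.lower()
    return {
        "title": title,
        "content": content,
        "tag_tokens": tag_l.split(),              # stand-in for word segmentation
        "title_count": title_l.count(tag_l),      # occurrences in the document title
        "content_count": content_l.count(tag_l),  # occurrences in the document content
        "tag_length": len(tag),                   # length feature of the entity tag
    }
```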

In order to make the entity tag determination system of the embodiment of the present application more clearly understood by those skilled in the art, the following description is given with reference to specific application scenarios:

As shown in fig. 11, the entity tag determination system according to the embodiment of the present application includes an online part and an offline part. The offline part constructs an entity tag candidate set according to knowledge documents corresponding to document types (including internal knowledge documents such as content part search logs, internal knowledge documents, internal trial maps, and the like, and external knowledge documents such as vertical site data and vertical lexicon data), and then filters the tags in the entity tag candidate set based on a text-cnn model to determine the entity tag library corresponding to the document types.

In the offline part, a training data set can also be constructed from training knowledge documents mined by an unsupervised machine process and manually labeled training knowledge documents. The training data set comprises the training knowledge documents and their corresponding entity labels. Multi-dimensional features are extracted from the training knowledge documents and entity labels in the training data set and used to train an improved ARC-I model, so that the trained improved ARC-I model can obtain the corresponding entity labels according to the multi-dimensional features of knowledge documents and entity labels.

The online part performs literal recall and semantic recall against the entity tag library constructed by the offline part to determine the candidate entity tags matching the target document, extracts multi-dimensional features of the candidate entity tags and the target document, inputs the multi-dimensional features into the improved ARC-I model, and determines the target entity tags of the target document from the candidate entity tags.
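The final selection in the online part can be sketched as follows. The scoring function is passed in as a parameter and stands in for the improved ARC-I model; the fixed `threshold` and the name `select_target_tags` are assumptions, since the text only states that target entity tags are determined from the candidates according to the confidence.

```python
from typing import Callable, List, Tuple


def select_target_tags(candidates: List[str],
                       score_fn: Callable[[str], float],
                       threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Keep candidates whose confidence exceeds threshold, highest first.

    `score_fn` is a hypothetical stand-in for the trained recognition model.
    """
    scored = [(tag, score_fn(tag)) for tag in candidates]
    kept = [(tag, conf) for tag, conf in scored if conf > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept
```

Keeping the model behind a callable makes the selection rule independent of how the confidence is computed.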

In summary, the method for determining the entity tag on the one hand greatly reduces the cost of manually constructing and maintaining the entity tag system, and on the other hand improves the ARC-I model: the improved ARC-I model can introduce document title information and related statistical features, thereby greatly improving the accuracy and recall rate of entity tag calculation.

In order to implement the foregoing embodiment, the present application further provides an apparatus for determining an entity tag. Fig. 12 is a schematic structural diagram of the apparatus for determining an entity tag according to an embodiment of the present application. As shown in fig. 12, the apparatus for determining an entity tag includes: a first obtaining module 10, a second obtaining module 20, a third obtaining module 30, a fourth obtaining module 40, and a first determining module 50, wherein,

a first obtaining module 10, configured to obtain an entity tag library corresponding to a document type of a target document, where the entity tag library includes a plurality of entity tags corresponding to the document type;

a second obtaining module 20, configured to match the target document with the entity tag library, and obtain a plurality of candidate entity tags that are successfully matched;

a third obtaining module 30, configured to obtain attribute features of a target document, and obtain, according to the target document, a tag feature corresponding to each candidate entity tag;

a fourth obtaining module 40, configured to input the attribute features and the tag features into a pre-trained tag identification model, and obtain a first confidence corresponding to each candidate entity tag;

a first determining module 50, configured to determine a target entity tag of the target document from the plurality of candidate entity tags according to the first confidence.

In a possible implementation form of the present application, the second obtaining module 20 is specifically configured to:

performing word segmentation processing on a document title and document contents of a target document to obtain a plurality of document word segments;

performing word segmentation processing on the entity labels to obtain label word segments, and constructing a second inverted index table corresponding to the entity label library according to the label word segments;

matching each document word in the document words with a node in the second inverted index table, and judging whether a second node path corresponding to each document word is included;

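and if the second node path is included, determining the entity label corresponding to the second node path as a candidate entity label.

In another possible implementation form of the present application, the second obtaining module 20 is specifically configured to: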

calculating a title semantic vector of a document title of a target document;

calculating a tag semantic vector of each entity tag;

and calculating the semantic similarity of the title semantic vector and the label semantic vector of each entity label, and determining the entity labels with the semantic similarity larger than a preset similarity threshold as candidate entity labels.
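The semantic recall performed by the second obtaining module 20 can be sketched as follows. Cosine similarity is assumed as the similarity measure because the text does not name one, the vectors are taken as already computed, and the names `cosine_similarity` and `semantic_recall` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_recall(title_vec: Sequence[float],
                    tag_vecs: Dict[str, Sequence[float]],
                    threshold: float) -> List[str]:
    """Return the tags whose similarity to the title vector exceeds threshold."""
    return [tag for tag, vec in tag_vecs.items()
            if cosine_similarity(title_vec, vec) > threshold]
```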

In summary, the apparatus for determining an entity tag according to the embodiment of the present application determines a candidate entity tag in an entity tag library corresponding to a document type of a target document, so that accuracy of entity tag recall is improved.

In one embodiment of the present application, as shown in fig. 13, on the basis of the apparatus shown in fig. 12, the apparatus further includes: a fifth obtaining module 60, a sixth obtaining module 70, a calculation module 80, a second determining module 90, a third determining module 100 and a fourth determining module 110, wherein,

a fifth obtaining module 60, configured to obtain a document search log, a professional document, a knowledge graph, and an associated vertical document corresponding to a document type;

a sixth obtaining module 70, configured to extract a search word in the document search log, perform word segmentation processing on the search word to obtain a search segmentation word, and obtain a first reference entity tag corresponding to the document type according to the search segmentation word;

the calculation module 80 is configured to extract a plurality of keywords in the professional document, and calculate an importance value of each keyword in the plurality of keywords in the professional document according to a preset algorithm;

a second determining module 90, configured to determine, according to the importance value, a preset number of target keywords in the plurality of keywords as second reference entity tags;

a third determining module 100, configured to identify proper nouns in the knowledge-graph and the associated vertical documents, and determine a third reference entity tag according to the proper nouns;

a fourth determining module 110, configured to determine the entity tag library according to the first reference entity tag, the second reference entity tag, and the third reference entity tag.
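The work of the calculation module 80 and the second determining module 90 can be sketched as follows. The text only refers to "a preset algorithm"; TF-IDF with smoothed inverse document frequency is used here purely as an assumed example of such an algorithm, and `top_keywords` is a hypothetical name.

```python
import math
from collections import Counter
from typing import List


def top_keywords(doc_tokens: List[str], corpus: List[List[str]], n: int) -> List[str]:
    """Rank the keywords of one document by TF-IDF and return the top n.

    TF-IDF is an assumed stand-in for the unspecified preset algorithm.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    num_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        df = sum(1 for doc in corpus if word in doc)  # document frequency
        idf = math.log((1 + num_docs) / (1 + df)) + 1.0  # smoothed IDF
        scores[word] = (count / total) * idf
    # Sort by descending importance; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n]
```

The returned keywords play the role of the preset number of target keywords that serve as second reference entity tags.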

In one embodiment of the present application, as shown in fig. 14, on the basis of the apparatus shown in fig. 12, the apparatus further includes: a seventh obtaining module 120, an eighth obtaining module 130, a building module 140, a ninth obtaining module 150, an interpretation module 160, a fifth determining module 170, a tenth obtaining module 180, and a training module 190, wherein,

a seventh obtaining module 120, configured to obtain a candidate entity tag library, where the candidate entity tag library includes a plurality of standard entity tags corresponding to a plurality of document types;

an eighth obtaining module 130, configured to perform word segmentation processing on multiple standard entity labels to obtain standard label word segments;

the building module 140 is configured to build a third inverted index table according to the standard tag segmentation corresponding to the plurality of standard entity tags;

a ninth obtaining module 150, configured to obtain a plurality of first training documents corresponding to the document types, and perform word segmentation processing on each of the plurality of first training documents to obtain a training document word segmentation;

the interpretation module 160 is configured to match the training document word segmentation of each first training document with a node in the third inverted index table, and determine whether a third node path corresponding to the training document word segmentation is included;

a fifth determining module 170, configured to determine, when a third node path is included, that a standard entity tag corresponding to the third node path is an entity tag corresponding to the first training document;

a tenth obtaining module 180, configured to obtain a second training document corresponding to the document type and a pre-labeled entity label of the second training document;

the training module 190 is configured to generate a label recognition model according to the first training document and the entity label corresponding to the first training document, and the second training document and the entity label corresponding to the second training document.

In summary, the apparatus for determining the entity tag of the embodiment of the present application on the one hand greatly reduces the cost of manually constructing and maintaining the entity tag system, and on the other hand improves the ARC-I model: the improved ARC-I model can introduce document title information and related statistical features, thereby greatly improving the accuracy and recall rate of entity tag calculation.

According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.

Fig. 15 is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 15, the electronic device includes: one or more processors 1501, memory 1502, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 15 takes one processor 1501 as an example.

The memory 1502 is a non-transitory computer readable storage medium provided herein. The memory stores instructions executable by at least one processor, so that the at least one processor performs the method of entity tag determination provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of entity tag determination provided herein.

The memory 1502, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method for entity tag determination in the embodiments of the present application (for example, the first obtaining module 10, the second obtaining module 20, the third obtaining module 30, the fourth obtaining module 40, and the first determining module 50 shown in fig. 12). The processor 1501 executes various functional applications and data processing of the server, that is, implements the method of entity tag determination in the above-described method embodiments, by running the non-transitory software programs, instructions, and modules stored in the memory 1502.

The memory 1502 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created according to the use of the electronic device for entity tag determination, and the like. Further, the memory 1502 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 1502 may optionally include memory located remotely from the processor 1501, which may be connected to the electronic device for entity tag determination via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device of the method for entity tag determination may further include: an input device 1503 and an output device 1504. The processor 1501, the memory 1502, the input device 1503, and the output device 1504 may be connected by a bus or other means, such as the bus connection shown in fig. 15.

The input device 1503 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic device for entity tag determination. Examples of the input device include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output device 1504 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A method for determining an entity tag, comprising:

acquiring an entity tag library corresponding to a document type of a target document, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type;

matching the target document with the entity tag library to obtain a plurality of successfully matched candidate entity tags;

acquiring attribute features of the target document, and acquiring tag features corresponding to each candidate entity tag according to the target document;

inputting the attribute features and the label features into a label recognition model trained in advance, and acquiring a first confidence corresponding to each candidate entity label;

determining a target entity tag of the target document from the plurality of candidate entity tags according to the first confidence.

2. The method of claim 1, prior to said obtaining an entity tag library corresponding to a document type of a target document, further comprising:

acquiring a document search log, a professional document, a knowledge graph and an associated vertical document corresponding to the document type;

extracting search words in the document search log, carrying out word segmentation processing on the search words to obtain search participles, and obtaining a first reference entity tag corresponding to the document type according to the search participles;

extracting a plurality of keywords in the professional document, and calculating the importance value of each keyword in the plurality of keywords in the professional document according to a preset algorithm;

determining a preset number of target keywords in the plurality of keywords as second reference entity labels according to the importance values;

identifying proper nouns in the knowledge-graph and the associated vertical documents, and determining a third reference entity tag according to the proper nouns;

determining the entity tag library from the first, second, and third reference entity tags.

3. The method of claim 2, wherein said determining the library of entity tags from the first, second, and third reference entity tags comprises:

inputting each reference entity label in the first reference entity label, the second reference entity label and the third reference entity label into a pre-trained neural network model, and acquiring a second confidence corresponding to each reference entity label;

and determining the entity tag library according to the reference entity tag with the second confidence degree larger than a preset confidence value.

4. The method of claim 2, wherein the obtaining a first reference entity tag corresponding to the document type according to the search segmentation comprises:

constructing a first inverted index table of the search terms according to the search segmentation of the document search log;

determining a target node of which the node priority in the first inverted index table is greater than a preset level;

determining a first node path of the target node in the first inverted index table, and determining the first reference entity label according to the search segmentation covered by the first node path.

5. The method of claim 1, wherein matching the target document with the entity tag library to obtain a plurality of candidate entity tags successfully matched comprises:

word segmentation processing is carried out on the document title and the document content of the target document, and a plurality of document word segments are obtained;

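performing word segmentation processing on the entity labels to obtain label word segments, and constructing a second inverted index table corresponding to the entity label library according to the label word segments;

matching each document word segment in the plurality of document word segments with the nodes in the second inverted index table, and judging whether a second node path corresponding to each document word segment is included;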

and if the second node path is included, determining that the entity label corresponding to the second node path is a candidate entity label.

6. The method of claim 5, further comprising, before the determining that the entity label corresponding to the second node path is the candidate entity label:

counting the occurrence times of entity labels corresponding to the second node paths in the target document;

and determining that the occurrence times are greater than a preset time threshold.

7. The method of claim 1, wherein matching the target document with the entity tag library to obtain a plurality of candidate entity tags successfully matched comprises:

calculating a title semantic vector of a document title of the target document;

calculating a tag semantic vector for each of the entity tags;

and calculating the semantic similarity between the title semantic vector and the label semantic vector of each entity label, and determining the entity labels with the semantic similarity larger than a preset similarity threshold value as candidate entity labels.

8. The method of claim 1, prior to said inputting said attribute features and said tag features into a pre-trained tag recognition model, comprising:

obtaining a candidate entity tag library, wherein the candidate entity tag library comprises a plurality of standard entity tags corresponding to a plurality of document types;

performing word segmentation processing on the plurality of standard entity labels to obtain standard label word segmentation;

constructing a third inverted index table according to the standard label word segmentation corresponding to the plurality of standard entity labels;

obtaining a plurality of first training documents corresponding to the document types, and performing word segmentation processing on each first training document in the plurality of first training documents to obtain training document word segmentation;

matching the training document word segmentation of each first training document with the nodes in the third inverted index table, and judging whether a third node path corresponding to the training document word segmentation is included;

if the third node path is included, determining that a standard entity label corresponding to the third node path is an entity label corresponding to a first training document;

acquiring a second training document corresponding to the document type and a pre-labeled entity label of the second training document;

and training and generating the label recognition model according to the first training document and the entity label corresponding to the first training document, and the second training document and the entity label corresponding to the second training document.

9. An apparatus for determining an entity tag, comprising:

a first obtaining module, configured to obtain an entity tag library corresponding to a document type of a target document, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type;

a second obtaining module, configured to match the target document with the entity tag library and obtain a plurality of successfully matched candidate entity tags;

a third obtaining module, configured to obtain attribute features of the target document, and obtain, according to the target document, a tag feature corresponding to each candidate entity tag;

a fourth obtaining module, configured to input the attribute features and the tag features into a pre-trained tag identification model, and obtain a first confidence corresponding to each candidate entity tag;

a first determining module, configured to determine a target entity tag of the target document from the plurality of candidate entity tags according to the first confidence.

10. The apparatus of claim 9, further comprising:

a fifth obtaining module, configured to obtain a document search log, a professional document, a knowledge graph and an associated vertical document corresponding to the document type;

a sixth obtaining module, configured to extract search words in the document search log, perform word segmentation processing on the search words to obtain search participles, and obtain a first reference entity tag corresponding to the document type according to the search participles;

a calculation module, configured to extract a plurality of keywords in the professional document and calculate the importance value of each keyword in the plurality of keywords in the theme of the professional document according to a preset algorithm;

a second determining module, configured to determine, according to the importance value, a preset number of target keywords in the plurality of keywords as second reference entity tags;

a third determining module, configured to identify proper nouns in the knowledge-graph and the associated vertical documents, and determine a third reference entity tag according to the proper nouns;

a fourth determining module to determine the entity tag library according to the first reference entity tag, the second reference entity tag, and the third reference entity tag.

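11. The apparatus of claim 9, wherein the second obtaining module is specifically configured to:

performing word segmentation processing on the document title and the document content of the target document to obtain a plurality of document word segments;

performing word segmentation processing on the entity labels to obtain label word segments, and constructing a second inverted index table corresponding to the entity label library according to the label word segments;

matching each document word segment in the plurality of document word segments with the nodes in the second inverted index table, and judging whether a second node path corresponding to each document word segment is included;

and if the second node path is included, determining that the entity label corresponding to the second node path is a candidate entity label.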

12. The apparatus of claim 9, wherein the second obtaining module is specifically configured to:

calculating a title semantic vector of a document title of the target document;

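calculating a tag semantic vector for each of the entity tags;

and calculating the semantic similarity between the title semantic vector and the tag semantic vector of each entity tag, and determining the entity tags with the semantic similarity larger than a preset similarity threshold value as candidate entity tags.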

13. The apparatus of claim 9, further comprising:

a seventh obtaining module, configured to obtain a candidate entity tag library, where the candidate entity tag library includes a plurality of standard entity tags corresponding to a plurality of document types;

the eighth obtaining module is used for performing word segmentation processing on the plurality of standard entity labels to obtain standard label word segmentation;

the building module is used for building a third inverted index table according to the standard label participles corresponding to the plurality of standard entity labels;

a ninth obtaining module, configured to obtain a plurality of first training documents corresponding to the document types, and perform word segmentation processing on each of the plurality of first training documents to obtain a training document word segmentation;

the interpretation module is used for matching the training document word segmentation of each first training document with the nodes in the third inverted index table and judging whether a third node path corresponding to the training document word segmentation is included;

a fifth determining module, configured to determine, when the third node path is included, that a standard entity tag corresponding to the third node path is an entity tag corresponding to a first training document;

a tenth obtaining module, configured to obtain a second training document corresponding to the document type and a pre-labeled entity tag of the second training document;

and a training module, configured to train and generate the label recognition model according to the first training document and the entity label corresponding to the first training document, and the second training document and the entity label corresponding to the second training document.

14. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of entity tag determination of any one of claims 1-8.

15. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method for determining an entity tag of any one of claims 1-8.