CN111967262B - Determination method and device for entity tag

Info

Publication number: CN111967262B
Application number: CN202010617196.4A
Authority: CN
Inventors: 程鸣权; 杨浩; 刘昊; 刘欢; 陈坤斌; 刘准; 何伯磊; 和为
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-06-30
Filing date: 2020-06-30
Publication date: 2024-01-12
Anticipated expiration: 2040-06-30
Also published as: CN111967262A

Abstract

The application discloses a method and a device for determining an entity tag, relating to the technical fields of natural language processing, big data processing and deep learning. The method comprises the following steps: acquiring an entity tag library corresponding to the document type of a target document, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type; matching the target document with the entity tag library to obtain a plurality of successfully matched candidate entity tags; acquiring attribute features of the target document, and acquiring tag features corresponding to each candidate entity tag according to the target document; inputting the attribute features and the tag features into a pre-trained tag identification model, and acquiring a first confidence coefficient corresponding to each candidate entity tag; and determining the target entity tag of the target document from the plurality of candidate entity tags according to the first confidence coefficient. The entity tag is thus determined in a semi-automatic manner, which improves the accuracy and recall rate of entity tag determination and reduces labor cost.

Description

Determination method and device for entity tag

Technical Field

The application relates to the technical field of natural language processing, big data processing and deep learning, in particular to a method and a device for determining an entity tag.

Background

With the development of internet technology, various knowledge management scenarios are now implemented online, for example the management of enterprise knowledge documents or the searching of technical documents. The use of documents in any such scenario depends on labeling those documents with entity tags.

In the related art, entity tags are labeled as follows: a business expert manually compiles a tag system, and the keywords of a document are then matched against that tag system using keyword matching, thereby determining the entity tags of the document.

However, this way of determining entity tags incurs high labor cost, and the accuracy of the entity tags depends on how comprehensive and accurate the manually compiled tag system is, so both the accuracy and the recall rate of the entity tags are low.

Disclosure of Invention

The application provides a method and a device for determining an entity tag, which determine the entity tag in a semi-automatic manner, improve the accuracy and recall rate of entity tag determination, and reduce labor cost.

According to an aspect of the present application, there is provided a method for determining an entity tag, including: acquiring an entity tag library corresponding to a document type of a target document, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type; matching the target document with the entity tag library to obtain a plurality of successfully matched candidate entity tags; acquiring attribute characteristics of the target document, and acquiring label characteristics corresponding to each candidate entity label according to the target document; inputting the attribute features and the tag features into a pre-trained tag identification model, and acquiring a first confidence coefficient corresponding to each candidate entity tag; and determining a target entity tag of the target document from the plurality of candidate entity tags according to the first confidence coefficient.

According to another aspect of the present application, there is provided an apparatus for determining an entity tag, including: the first acquisition module is used for acquiring an entity tag library corresponding to the document type of the target document, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type; the second acquisition module is used for matching the target document with the entity tag library to acquire a plurality of successfully matched candidate entity tags; the third acquisition module is used for acquiring attribute characteristics of the target document and acquiring label characteristics corresponding to each candidate entity label according to the target document; a fourth obtaining module, configured to input the attribute feature and the tag feature into a pre-trained tag identification model, and obtain a first confidence coefficient corresponding to each candidate entity tag; and the first determining module is used for determining the target entity label of the target document from the plurality of candidate entity labels according to the first confidence coefficient.

According to still another aspect of the present application, there is provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of determining an entity tag as described above.

According to yet another aspect of the present application, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of determining an entity tag as described above.

The technical scheme disclosed by the application at least comprises the following additional technical characteristics:

An entity tag library corresponding to the document type of a target document is obtained, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type. The target document is then matched with the entity tag library to obtain a plurality of successfully matched candidate entity tags, attribute features of the target document are obtained, and tag features corresponding to each candidate entity tag are obtained according to the target document. Finally, the attribute features and the tag features are input into a pre-trained tag identification model to obtain a first confidence coefficient corresponding to each candidate entity tag, and the target entity tag of the target document is determined from the plurality of candidate entity tags according to the first confidence coefficient. The entity tag is thus determined in a semi-automatic manner, which improves the accuracy and recall rate of entity tag determination and reduces labor cost.

It should be understood that the description of this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.

Drawings

The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:

FIG. 1 is a flow chart of a method for determining an entity tag according to a first embodiment of the present application;

FIG. 2 is a schematic diagram of a recognition scenario of a tag recognition model according to a second embodiment of the present application;

FIG. 3 is a flow chart of a method for determining an entity tag according to a third embodiment of the present application;

FIG. 4 is a schematic diagram of a second inverted index table according to a fourth embodiment of the present application;

FIG. 5 is a schematic diagram of an alignment scenario of a second inverted index table according to a fifth embodiment of the present application;

FIG. 6 is a flowchart of a method for determining an entity tag according to a sixth embodiment of the present application;

FIG. 7 is a flowchart of a method for determining an entity tag according to a seventh embodiment of the present application;

FIG. 8 is a schematic structural view of a first inverted index table according to an eighth embodiment of the present application;

FIG. 9 is a flowchart of a method for determining an entity tag according to a ninth embodiment of the present application;

FIG. 10 is a schematic illustration of an ARC-I model training process according to a tenth embodiment of the present application;

FIG. 11 is a schematic flow chart illustrating the execution of a determination system of an entity tag according to an eleventh embodiment of the present application;

FIG. 12 is a schematic structural view of a determining device for an entity tag according to a twelfth embodiment of the present application;

FIG. 13 is a schematic structural view of a determination device of an entity tag according to a thirteenth embodiment of the present application;

FIG. 14 is a schematic structural view of a determination device of an entity tag according to a fourteenth embodiment of the present application; and

FIG. 15 is a block diagram of an electronic device for implementing a method of determination of an entity tag of an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In order to solve the technical problems of high labor cost and low accuracy in determining entity tags described in the background, the application provides a complete, semi-automatic solution for calculating entity tags in the document domain. For the knowledge domain of the current document, the method uses deep learning and other technologies to build, by targeted mining, a tag system covering that knowledge domain.

Specifically, fig. 1 is a flowchart of a method for determining an entity tag according to one embodiment of the present application, as shown in fig. 1, the method includes:

Step 101, obtaining an entity tag library corresponding to the document type of the target document, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type.

The entity tag may be understood as an entity related to the field of enterprise knowledge. In this embodiment, the entity tag library corresponds to a document type. For example, when the document type is an insurance type, the corresponding entity tags may be core, core claim, or the like; when the document type is a sports type, the corresponding entity tags may be sports equipment, sports action, or the like.

However, entity tags focus on different directions in different document types, and even the same entity tag may carry different meanings. For example, when the document type is insurance, the entity tags may focus on core, core claim, or the like; when the document type is the internet, the entity tags focus on machine learning, natural language processing, or the like.

Therefore, in order to enable accurate recall, in the present application, an entity tag library corresponding to a document type of a target document is acquired, wherein the entity tag library includes a plurality of entity tags corresponding to the document type.

In this embodiment, a preset correspondence between document types and entity tag libraries may be queried according to the document type to obtain the entity tag library corresponding to that document type. For example, the correspondence may record that document type a corresponds to entity tag library 1.

Step 102, matching the target document with the entity tag library to obtain a plurality of successfully matched candidate entity tags.

In this embodiment, the target document is matched with the entity tag library, for example by keyword matching or semantic matching, to obtain a plurality of successfully matched candidate entity tags.

Step 103, acquiring attribute features of the target document, and acquiring tag features corresponding to each candidate entity tag according to the target document.

It will be appreciated that the candidate entity tags are matched based on a relatively coarse granularity, and thus, to further ensure recall accuracy of the entity tags, the candidate entity tags are further screened.

In this embodiment, attribute features of the target document are obtained, and tag features corresponding to each candidate entity tag are obtained according to the target document. The attribute features may include at least one of a title feature and a content feature of the target document. The title feature may include at least one of the length of the title, the noun attributes contained in the title, the words contained in the title, the parts of speech of those words, and the like. The content feature may include at least one of the length of the content, the words contained in the content, the parts of speech of those words, and the like. The tag features may include at least one of the frequency of occurrence, the position of occurrence, and the like of the corresponding candidate entity tag in the title of the target document, and at least one of the length of the candidate entity tag, the words contained in the entity tag, the parts of speech of those words, and the like. In summary, the attribute features and the tag features in this embodiment are multidimensional features that comprehensively reflect the characteristics of the candidate entity tag and the target document.
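As an illustration of the features described above, the following is a minimal sketch that computes a small subset of them with plain string operations. The names `TagFeatures`, `document_attributes` and `tag_features` are hypothetical, the whitespace tokenization is an assumption, and part-of-speech features are omitted; the application does not prescribe any particular implementation.

```python
from dataclasses import dataclass


@dataclass
class TagFeatures:
    tag_length: int         # length of the candidate tag in characters
    title_frequency: int    # number of occurrences of the tag in the title
    title_position: int     # index of the first occurrence in the title, -1 if absent
    content_frequency: int  # number of occurrences of the tag in the content


def document_attributes(title: str, content: str) -> dict:
    """Attribute features of the document itself (an assumed, reduced subset)."""
    return {
        "title_length": len(title),
        "content_length": len(content),
        "title_words": title.lower().split(),  # assumption: whitespace tokenization
    }


def tag_features(tag: str, title: str, content: str) -> TagFeatures:
    """Tag features of one candidate tag relative to the document."""
    tag_l, title_l, content_l = tag.lower(), title.lower(), content.lower()
    return TagFeatures(
        tag_length=len(tag),
        title_frequency=title_l.count(tag_l),
        title_position=title_l.find(tag_l),
        content_frequency=content_l.count(tag_l),
    )


title = "Machine learning study guide"
content = "This guide introduces machine learning. Machine learning needs data."
attrs = document_attributes(title, content)
feats = tag_features("machine learning", title, content)
```

In a real system these values would be combined into the input of the tag identification model together with word and part-of-speech features.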

Step 104, inputting the attribute features and the tag features into a pre-trained tag identification model, and acquiring a first confidence coefficient corresponding to each candidate entity tag.

In this embodiment, a tag recognition model is trained in advance. The tag recognition model determines a first confidence coefficient corresponding to each candidate entity tag according to the input attribute features and tag features. The first confidence coefficient may be expressed as a probability value (as shown in fig. 2), a likelihood level, or the like; the higher the first confidence coefficient, the better the candidate entity tag matches the target document.

Step 105, determining a target entity tag of the target document from the plurality of candidate entity tags according to the first confidence level.

In this embodiment, the plurality of candidate entity tags may be ranked from high to low according to the first confidence, and the top-ranked candidate entity tags may be used as the target entity tags.

Alternatively, a matching degree threshold may be preset, and the candidate entity tags whose first confidence is greater than the matching degree threshold may be used as the target entity tags.
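The two selection strategies for step 105 can be sketched as follows. The function names, the example scores and the values of `top_k` and `threshold` are illustrative assumptions, not values taken from the application.

```python
def select_by_rank(confidences: dict, top_k: int) -> list:
    """Rank candidate tags by first confidence, high to low, and keep the top_k."""
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    return ranked[:top_k]


def select_by_threshold(confidences: dict, threshold: float) -> list:
    """Keep every candidate tag whose first confidence exceeds the threshold."""
    return [tag for tag, score in confidences.items() if score > threshold]


# Hypothetical first confidences returned by a tag identification model.
scores = {"machine learning": 0.92, "study guide": 0.40, "deep learning": 0.75}
top_two = select_by_rank(scores, top_k=2)
above = select_by_threshold(scores, threshold=0.5)
```

Ranking always returns a fixed number of tags, whereas thresholding may return none or many; which behavior is preferable depends on the scenario.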

Therefore, on the one hand, the target entity tags are screened based on multidimensional features, which ensures that the entity tags match the target document. On the other hand, the entity tags are searched only in the entity tag library corresponding to the document type of the target document, which avoids noise from other document types and further improves the match between the entity tags and the target document.

In summary, according to the method for determining the entity tag of the embodiment of the present application, an entity tag library corresponding to a document type of a target document is obtained, where the entity tag library includes a plurality of entity tags corresponding to the document type, and further, the target document is matched with the entity tag library to obtain a plurality of candidate entity tags successfully matched, attribute features of the target document are obtained, tag features corresponding to each candidate entity tag are obtained according to the target document, finally, the attribute features and the tag features are input into a pre-trained tag identification model to obtain a first confidence coefficient corresponding to each candidate entity tag, and the target entity tag of the target document is determined from the plurality of candidate entity tags according to the first confidence coefficient. Therefore, the determination of the entity tag is realized in a semi-automatic mode, the accuracy and recall rate of the determination of the entity tag are improved, and the labor cost is reduced.

It should be noted that, under different application scenarios, the ways of matching the target document with the entity tag library and obtaining multiple candidate entity tags successfully matched are different, and the examples are as follows:

Example one:

In this example, as shown in fig. 3, step 102 includes:

Step 201, word segmentation processing is performed on the document title and the document content of the target document, and a plurality of document word segments are obtained.

In this embodiment, the document title and the document content of the target document are subjected to word segmentation processing according to the word segmentation attribute, so as to obtain a plurality of document word segments.

Step 202, performing word segmentation processing on the entity tag to obtain tag word segmentation, and constructing a second inverted index table corresponding to the entity tag library according to the tag word segmentation.

In this embodiment, according to the word segmentation attribute included in the entity tag, the entity tag word segmentation process is performed to obtain a tag word segmentation, and a second inverted index table corresponding to the entity tag library is constructed according to the tag word segmentation.

When the second inverted index table is constructed, construction follows the front-to-back order of the tag word segments in the entity tag: the preceding tag word segment is used as the key, and the tag word segment that follows it is used as the value.

For example, as shown in fig. 4, if the tag word segments obtained for one entity tag are "code" and "study guide", and those obtained for another entity tag are "code" and "study", then in the constructed second inverted index table the key is "code" and the values are "study guide" and "study", respectively.
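One way to realize such a key/value structure is a nested dictionary in which each tag word segment is a key whose value holds the segments that may follow it. The sketch below is an assumption about the data layout, not the layout used in the application; the `END` marker and the full tag strings "code study guide" and "code study" are hypothetical.

```python
END = "__tag__"  # hypothetical marker storing the full entity tag at the end of a path


def build_tag_index(segmented_tags: dict) -> dict:
    """Build a nested index: earlier segments are keys, later segments are values."""
    index: dict = {}
    for tag, segments in segmented_tags.items():
        node = index
        for segment in segments:
            node = node.setdefault(segment, {})
        node[END] = tag  # remember which entity tag this path spells out
    return index


# Hypothetical entity tags together with their tag word segments.
segmented = {
    "code study guide": ["code", "study guide"],
    "code study": ["code", "study"],
}
index = build_tag_index(segmented)
```

Both tags share the key "code", so the index holds a single "code" entry with two value nodes, "study guide" and "study".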

Step 203, matching each document word in the plurality of document words with a node in the second inverted index table, and judging whether a second node path corresponding to each document word is included.

In this embodiment, each document word segment in the plurality of document word segments is matched with the nodes in the second inverted index table to judge whether a second node path corresponding to the document word segments is included. That is, after a document word segment successfully matches a key node, the subsequent document word segments are matched one by one, in their order of appearance, with the value nodes under that key node. If a plurality of consecutive document word segments match a plurality of consecutive nodes in the second inverted index table, a second node path is considered to have been obtained. The required number of nodes in a second node path can be set according to scenario requirements.

For example, as shown in fig. 5, suppose the number of nodes required in a second node path is 3 or more, the document word segments A, B, C and D are consecutive, and the second inverted index table includes the key node A. B, C and D are then matched in sequence with the value nodes at the corresponding levels after the key node. If the path "A-B-C" is obtained by matching, its number of nodes is 3, so "A-B-C" is considered a second node path (shown with gray filling in the figure).

Step 204, if the second node path is included, determining that the entity tag corresponding to the second node path is a candidate entity tag.

In this embodiment, if the second node path is included, the entity tag corresponding to the second node path is determined to be a candidate entity tag. The node path segment corresponding to each entity tag may be preset for every path in the table; the segment in which the second node path lies is identified, and the entity tag of that segment is taken as the corresponding candidate entity tag.

In actual implementation, even when a second node path exists, the corresponding document word segments may appear only a few times in the document, so determining a candidate entity tag from it may introduce considerable noise. Therefore, in one embodiment of the application, after the second node path is determined, the number of times the entity tag corresponding to the second node path appears in the target document is counted, and that entity tag is used as a candidate entity tag only after the number of occurrences is determined to be greater than a preset count threshold.
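Steps 203 and 204, together with the occurrence filter, can be sketched as a walk over a nested-dictionary index. This is a minimal sketch under assumptions: the index layout, the `END` marker, and the parameter names `min_nodes` and `count_threshold` are hypothetical, and lowercase letters stand in for the document word segments.

```python
END = "__tag__"  # hypothetical marker storing the entity tag at the end of a path


def match_candidates(doc_tokens, index, min_nodes, count_threshold):
    """Return tags whose node path is matched by consecutive document tokens."""
    counts: dict = {}
    for start in range(len(doc_tokens)):
        node, depth = index, 0
        for token in doc_tokens[start:]:
            if token not in node:
                break  # the run of consecutive matches ends here
            node = node[token]
            depth += 1
            # A path counts only if it is long enough and ends at a full tag.
            if END in node and depth >= min_nodes:
                counts[node[END]] = counts.get(node[END], 0) + 1
    # Keep tags that occur more often than the preset count threshold.
    return [tag for tag, n in counts.items() if n > count_threshold]


# Hypothetical index: the path a-b-c spells tag "a b c", the path x-y spells "x y".
index = {"a": {"b": {"c": {END: "a b c"}}}, "x": {"y": {END: "x y"}}}
tokens = ["a", "b", "c", "d", "x", "y", "a", "b", "c"]
candidates = match_candidates(tokens, index, min_nodes=3, count_threshold=1)
```

Here "a b c" is matched twice with a three-node path and is kept, while "x y" has only two nodes and is rejected by `min_nodes=3`.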

Example two:

In this example, as shown in fig. 6, step 102 includes:

Step 301, calculating a title semantic vector of the document title of the target document.

The title semantic vector may be calculated with an existing semantic analysis algorithm or the like.

Step 302, calculating a tag semantic vector of each entity tag.

The tag semantic vector may likewise be calculated with an existing semantic analysis algorithm or the like.

Step 303, calculating the semantic similarity between the title semantic vector and the tag semantic vector of each entity tag, and determining the entity tags whose semantic similarity is greater than a preset similarity threshold as candidate entity tags.

In this embodiment, the semantic similarity between the title semantic vector and the tag semantic vector of each entity tag is calculated; an entity tag whose semantic similarity exceeds the preset similarity threshold is semantically close to the document title and is therefore determined to be a candidate entity tag.
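The application does not name a particular similarity measure. The sketch below assumes cosine similarity, a common choice for comparing semantic vectors; the three-dimensional vectors and the threshold value are made up for illustration.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_by_semantics(title_vector, tag_vectors: dict, threshold: float) -> list:
    """Keep the entity tags whose vector is similar enough to the title vector."""
    return [
        tag
        for tag, vector in tag_vectors.items()
        if cosine_similarity(title_vector, vector) > threshold
    ]


# Hypothetical semantic vectors; a real system would obtain them from a model.
title_vec = [1.0, 0.0, 1.0]
tag_vecs = {
    "machine learning": [1.0, 0.0, 0.9],
    "sports equipment": [0.0, 1.0, 0.0],
}
candidates = recall_by_semantics(title_vec, tag_vecs, threshold=0.8)
```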

Of course, in this embodiment, candidate entity tags according to semantic recall may also be screened in combination with a literal matching approach. I.e. recall candidate entity tags in combination with literal and semantic, wherein the way of literal recall may be referred to as mentioned in example one above, or may be a direct literal match way, etc.

In summary, according to the method for determining the entity tag, the candidate entity tag is determined in the entity tag library corresponding to the document type of the target document, so that the accuracy of entity tag recall is improved.

The above embodiments describe the online part of entity tag labeling. In the application, an offline part cooperates with this online part, so that the resulting entity tag determination system increases the degree of automation of entity tag determination, reduces manual participation, and greatly improves the accuracy and recall rate of entity tag calculation.

The entity tag library is determined offline.

In this embodiment, as shown in fig. 7, before step 101, the present application further includes:

Step 401, obtaining a document search log, a professional document, a knowledge graph and an associated vertical document corresponding to the document type.

In this embodiment, a corresponding entity tag library is determined among various knowledge documents corresponding to the document type, where the various knowledge documents in this embodiment include a document search log, a professional document, a knowledge graph, and an associated vertical document.

Step 402, extracting search words in a document search log, performing word segmentation processing on the search words to obtain search word segments, and obtaining a first reference entity tag corresponding to the document type according to the search word segments.

For the document search log, the search words are segmented to obtain search word segments, and a first reference entity tag corresponding to the document type is obtained from the search word segments.

In some possible embodiments, the search term with the repetition rate higher than a certain value may be directly used as the first reference entity tag.

In other possible embodiments, a first inverted index table is constructed with the search word segments of the document search log as nodes, ordered by node priority from top to bottom. That is, the first search word segment of each search word is used as a key of the inverted index, and the search word segments of all search words in the document search log that begin with that segment are used as the values of the inverted index, thereby generating the first inverted index table.

After the first inverted index table is generated, one key may correspond to values on multiple levels. To keep the entity tags focused, a target node whose node priority is higher than a preset level is determined in the first inverted index table, the first node path of the target node in the first inverted index table is determined, and the first reference entity tag is determined from the search word segments covered by the first node path. Thus, in this embodiment, first reference entity tags of relatively coarse granularity are retained.

For example, as shown in fig. 8, when the value corresponding to the key "banana" in the first inverted index table includes the "category" of the first level, the "cultivation method" of the second level, and the "distinguishing method" of the second level, and the preset level is the first level, the first node path covered by the node from the key to the first level is used as the first reference entity tag (the first node path is filled with gray in the figure), and in this embodiment, the "banana category" is used as the first reference entity tag.
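The banana example can be reproduced with a small sketch. It assumes that the two second-level values sit under "category", that the key counts as level 0, and that a reference tag is formed by joining the segments on the path from the key down to the preset level; the function names are hypothetical.

```python
def build_search_index(segmented_queries) -> dict:
    """Nested index: the first segment is the key, later segments are deeper values."""
    index: dict = {}
    for segments in segmented_queries:
        node = index
        for segment in segments:
            node = node.setdefault(segment, {})
    return index


def reference_tags(index: dict, preset_level: int) -> list:
    """Join the segments on every path from a key (level 0) down to preset_level."""
    tags = []

    def walk(node, path):
        if len(path) == preset_level + 1:
            tags.append(" ".join(path))
            return  # deeper, finer-grained values are deliberately ignored
        for segment, child in node.items():
            walk(child, path + [segment])

    walk(index, [])
    return tags


# Hypothetical segmented search words from a document search log.
queries = [
    ["banana", "category", "cultivation method"],
    ["banana", "category", "distinguishing method"],
]
index = build_search_index(queries)
first_reference = reference_tags(index, preset_level=1)
```

Paths that end before the preset level are dropped in this sketch; whether such short search words should instead be kept is a design choice the text leaves open.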

Step 403, extracting a plurality of keywords in the professional document, and calculating the importance value of each keyword in the keywords in the theme of the professional document according to a preset algorithm.

Step 404, determining a preset number of target keywords as second reference entity labels from the plurality of keywords according to the importance values.

In this embodiment, the preset algorithm may be the tf-idf algorithm: keywords are extracted from the professional document, a tf-idf value is calculated for each keyword from its frequency of occurrence and serves as its importance value, and the preset number of target keywords with the largest tf-idf values are selected as the mined second reference entity tags.
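A minimal tf-idf sketch in plain Python is shown below. It assumes the common definition tf = count / document length and idf = ln(N / document frequency); the application does not state which tf-idf variant is used, and the tokenized documents are made up.

```python
import math
from collections import Counter


def top_keywords(documents, doc_index: int, k: int) -> list:
    """Return the k words of one document with the largest tf-idf values."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document

    tokens = documents[doc_index]
    term_freq = Counter(tokens)
    scores = {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in term_freq.items()
    }
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical, already segmented professional documents.
docs = [
    ["insurance", "claim", "claim", "policy"],
    ["insurance", "premium", "policy"],
    ["insurance", "sports", "equipment"],
]
keywords = top_keywords(docs, doc_index=0, k=2)
```

The word "insurance" appears in every document, so its idf is zero and it is ranked below "claim" and "policy".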

Step 405, identifying proper nouns in the knowledge graph and the associated vertical documents, and determining a third reference entity tag according to the proper nouns.

In this embodiment, proper nouns in the knowledge graph and the related vertical documents are identified according to algorithms such as deep learning, and a third reference entity tag is determined according to the proper nouns.

Step 406, determining an entity tag library according to the first reference entity tag, the second reference entity tag and the third reference entity tag.

In this embodiment, the entity tag library is determined from the first reference entity tag, the second reference entity tag and the third reference entity tag mined in these multiple dimensions.

Of course, in order to further ensure the purity of the entity tag library, in an embodiment of the present application, a neural network model may be obtained by pre-training, where the neural network model may be a text-cnn model, and each of the first reference entity tag, the second reference entity tag, and the third reference entity tag is input into the pre-trained neural network model, a second confidence coefficient corresponding to each reference entity tag is obtained, and the entity tag library is determined according to the reference entity tag with the second confidence coefficient being greater than a preset confidence value.
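The merging and filtering described above can be sketched as follows. The scoring function is passed in as a parameter; `toy_score` is a hypothetical stand-in for the pre-trained neural network model and does not reflect how a text-cnn model behaves.

```python
def build_tag_library(first, second, third, score, min_confidence: float) -> list:
    """Merge the three groups of reference tags and keep the confident ones."""
    merged = dict.fromkeys([*first, *second, *third])  # de-duplicate, keep order
    return [tag for tag in merged if score(tag) > min_confidence]


def toy_score(tag: str) -> float:
    """Hypothetical second confidence: favors multi-word tags, for illustration only."""
    return 0.9 if " " in tag else 0.2


library = build_tag_library(
    first=["banana category"],
    second=["claim", "banana category"],
    third=["natural language processing"],
    score=toy_score,
    min_confidence=0.5,
)
```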

When the neural network model in the embodiment is trained in advance, a large number of positive samples and a large number of negative samples can be sampled for training, and model parameters are adjusted according to training results, for example, when the identification accuracy of the positive and negative samples is greater than a certain ratio, training is stopped, and otherwise, the neural network model is trained in a circulating mode.

In addition, the tag recognition model in this embodiment is also obtained through offline training.

In one embodiment of the present application, as shown in fig. 9, the step of training the tag recognition model includes:

Step 501, obtaining a candidate entity tag library, wherein the candidate entity tag library contains a plurality of standard entity tags corresponding to a plurality of document types.

The samples in the candidate entity tag library in this embodiment may be obtained by unsupervised rule mining and may also include manually labeled samples. After the samples are obtained, incomplete samples are cleaned; for example, if a document type has no corresponding standard entity tag, the corresponding sample is removed. This process can be executed by a machine, which reduces labor cost.

Step 502, performing word segmentation processing on a plurality of standard entity tags to obtain standard tag word segmentation.

In this embodiment, the standard tag word segmentation may be obtained by performing word segmentation processing on a plurality of standard entity tags in a manner of attribute analysis or the like.

And step 503, constructing a third inverted index table according to standard tag word segmentation corresponding to the standard entity tags.

The third inverted index table is constructed by using the first standard tag word of each standard entity tag as the key of the third inverted index table and using the standard tag words of all standard entity tags beginning with the standard tag word as the value of the inverted index.

Step 504, obtaining a plurality of first training documents corresponding to the document type, and performing word segmentation processing on each of the plurality of first training documents to obtain training document word segments.

In this embodiment, a plurality of first training documents corresponding to the document type are acquired, and word segmentation is performed on each of them to obtain training document word segments.

Step 505, matching the training document word segment of each first training document with the node in the third inverted index table, and judging whether a third node path corresponding to the training document word segment is included.

In this embodiment, the training document word segment of each first training document is matched with a node in the third inverted index table, and whether a third node path corresponding to the training document word segment is included is determined, where the first training document may be understood as a training document mined by a machine.

Step 506, if the third node path is included, it is determined that the standard entity tag corresponding to the third node path is the entity tag corresponding to the first training document.

In this embodiment, if the third node path is included, it is determined that the standard entity tag corresponding to the third node path is the entity tag corresponding to the first training document; that is, the entity tags of the mined training documents are labeled by the machine.
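Steps 505 and 506 can be illustrated with a short sketch. It assumes that a "node path" corresponds to a contiguous run of document words equal to the full word list of a standard entity tag; this interpretation, the index layout, and the function name `label_document` are assumptions rather than details given in the embodiment.

```python
def label_document(doc_words, index):
    """Return the standard entity tags whose word lists occur in the document.

    doc_words: list of words from one first training document.
    index: dict mapping a first word to a list of (tag, tuple of tag words).
    """
    matched = []
    for i, word in enumerate(doc_words):
        # Only tags starting with the current document word are candidates.
        for tag, tag_words in index.get(word, []):
            window = tuple(doc_words[i:i + len(tag_words)])
            if window == tag_words and tag not in matched:
                matched.append(tag)
    return matched


index = {
    "deep": [
        ("deep learning", ("deep", "learning")),
        ("deep neural network", ("deep", "neural", "network")),
    ],
    "big": [("big data", ("big", "data"))],
}
doc_words = ["we", "use", "deep", "learning", "on", "big", "data"]
machine_labels = label_document(doc_words, index)
```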

Step 507, obtaining a second training document corresponding to the document type and entity labels of the second training document marked in advance.

The second training document can be understood as a manually labeled training document, that is, a training document that is labeled with an entity tag in advance.

Step 508, training and generating a label recognition model according to the first training document and the corresponding entity label thereof, and the second training document and the corresponding entity label thereof.

In this embodiment, as shown in fig. 10, when the tag recognition model is an improved ARC-I model, N-dimensional features of each training document may be extracted and input into the improved ARC-I model for training. The N-dimensional features include word segmentation features obtained by respectively segmenting the document title, the document content, and the entity tag of the training document, a feature of the number of occurrences of the entity tag in the document title, a feature of the number of occurrences of the entity tag in the document content, a length feature of the entity tag, and the like (corresponding to feature 1 to feature N in the figure). A tag recognition model that can be used for entity tag recall is obtained according to the training result.
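The statistical features listed above can be sketched as follows. The feature names, the dictionary layout, and the choice to measure tag length in words are assumptions for illustration; the embodiment does not fix how the length feature is computed or how the features are encoded for the improved ARC-I model.

```python
def count_occurrences(words, pattern):
    """Count contiguous occurrences of the word list `pattern` inside `words`."""
    n = len(pattern)
    if n == 0:
        return 0
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == pattern)


def extract_features(title_words, content_words, tag_words):
    """Assemble word features and statistical features for one (document, tag) pair."""
    return {
        "title_words": list(title_words),
        "content_words": list(content_words),
        "tag_words": list(tag_words),
        "tag_count_in_title": count_occurrences(title_words, tag_words),
        "tag_count_in_content": count_occurrences(content_words, tag_words),
        "tag_length": len(tag_words),  # assumption: length measured in words
    }


features = extract_features(
    ["deep", "learning", "guide"],
    ["deep", "learning", "is", "fun", "deep", "learning"],
    ["deep", "learning"],
)
```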

In order to enable those skilled in the art to more clearly understand the entity tag determination system of the embodiments of the present application, the following is illustrated in connection with a specific application scenario:

As shown in fig. 11, the entity tag determining system in the embodiment of the present application includes an online part and an offline part. The offline part constructs an entity tag candidate set according to knowledge documents corresponding to a document type (including internal knowledge documents such as a content search log, internal knowledge documents, and an internal trial graph, and external knowledge documents such as vertical site data and vertical word stock data), and further screens the tags in the entity tag candidate set based on a text-cnn model to determine an entity tag library corresponding to the document type.

In the offline part, a training data set can be constructed based on training knowledge documents mined by an unsupervised machine and manually marked training knowledge documents, the training data set comprises corresponding training knowledge documents and corresponding entity labels, and the training knowledge documents and the entity labels in the training data set are subjected to multidimensional feature extraction to train an improved ARC-I model, so that after the improved ARC-I model is trained, the corresponding entity labels can be obtained according to multidimensional features of the knowledge documents and the entity labels.

The online part determines candidate entity tags matched with the target document through literal recall and semantic recall according to the entity tag library constructed by the offline part, extracts the multi-dimensional features of the candidate entity tags and the target document, inputs the multi-dimensional features into the improved ARC-I model, and determines the target entity tag of the target document from the candidate entity tags.

In summary, the method for determining an entity tag according to the embodiments of the present application greatly reduces the cost of manually constructing and maintaining the entity tag system. In addition, the ARC-I model is improved so that document title information and related statistical features can be introduced, which greatly improves the accuracy and recall rate of entity tag calculation.

In order to achieve the above embodiment, the present application further provides an entity tag determining apparatus, and fig. 12 is a schematic structural diagram of an entity tag determining apparatus according to an embodiment of the present application, as shown in fig. 12, where the entity tag determining apparatus includes: a first acquisition module 10, a second acquisition module 20, a third acquisition module 30, a fourth acquisition module 40, and a first determination module 50, wherein,

a first obtaining module 10, configured to obtain an entity tag library corresponding to a document type of a target document, where the entity tag library includes a plurality of entity tags corresponding to the document type;

the second obtaining module 20 is configured to match the target document with the entity tag library, and obtain a plurality of candidate entity tags that are successfully matched;

a third obtaining module 30, configured to obtain attribute features of the target document, and obtain tag features corresponding to each candidate entity tag according to the target document;

a fourth obtaining module 40, configured to input the attribute feature and the tag feature into a pre-trained tag identification model, and obtain a first confidence coefficient corresponding to each candidate entity tag;

the first determining module 50 is configured to determine a target entity tag of the target document from the plurality of candidate entity tags according to the first confidence.
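The final determination step can be sketched as follows. The embodiment only states that the target entity tag is determined according to the first confidence, so the threshold `min_confidence`, the cap `top_k`, and the function name `select_target_tags` are hypothetical choices made for illustration.

```python
def select_target_tags(confidences, min_confidence=0.5, top_k=3):
    """Pick target entity tags from candidate tags by their first confidence.

    confidences: dict mapping a candidate entity tag to its first confidence.
    Keeps tags scoring above min_confidence, highest first, at most top_k tags.
    """
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return [tag for tag, score in ranked if score > min_confidence][:top_k]


first_confidence = {"deep learning": 0.9, "big data": 0.4, "tag recall": 0.7}
target_tags = select_target_tags(first_confidence)
```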

In one possible implementation form of the present application, the second obtaining module 20 is specifically configured to:

Word segmentation processing is carried out on the document title and the document content of the target document to obtain a plurality of document word segments;

performing word segmentation processing on the entity tag to obtain tag word segmentation, and constructing a second inverted index table corresponding to the entity tag library according to the tag word segmentation;

matching each document word in the plurality of document words with a node in a second inverted index table, and judging whether a second node path corresponding to each document word is contained or not;

and if the second node path is included, determining the entity label corresponding to the second node path as a candidate entity label.

calculating a title semantic vector of a document title of the target document;

calculating a label semantic vector of each entity label;

calculating semantic similarity of the title semantic vector and the label semantic vector of each entity label, and determining the entity label with the semantic similarity larger than a preset similarity threshold as a candidate entity label.
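The semantic matching described above can be sketched as follows. The embodiment does not name the similarity measure or how the semantic vectors are produced, so cosine similarity, the hand-written vectors, and the threshold value are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_recall(title_vector, tag_vectors, threshold=0.8):
    """Return entity tags whose semantic vector is close enough to the title vector."""
    return [
        tag
        for tag, vector in tag_vectors.items()
        if cosine_similarity(title_vector, vector) > threshold
    ]


# Hypothetical vectors; a real system would obtain them from a semantic model.
title_vector = [1.0, 0.0]
tag_vectors = {"deep learning": [1.0, 0.1], "contract law": [0.0, 1.0]}
candidates = semantic_recall(title_vector, tag_vectors)
```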

In summary, the entity tag determining device in the embodiment of the application determines the candidate entity tag in the entity tag library corresponding to the document type of the target document, thereby improving the accuracy of entity tag recall.

In one embodiment of the present application, as shown in fig. 13, on the basis of the apparatus shown in fig. 12, the apparatus further includes: a fifth acquisition module 60, a sixth acquisition module 70, a calculation module 80, a second determination module 90, a third determination module 100, and a fourth determination module 110, wherein,

a fifth obtaining module 60, configured to obtain a document search log, a professional document, a knowledge graph, and an associated vertical document corresponding to the document type;

a sixth obtaining module 70, configured to extract search words in the document search log, perform word segmentation processing on the search words to obtain search word segmentation, and obtain a first reference entity tag corresponding to the document type according to the search word segmentation;

a calculating module 80, configured to extract a plurality of keywords in the professional document, and calculate an importance value of each keyword in the plurality of keywords in the professional document according to a preset algorithm;

A second determining module 90, configured to determine, according to the importance value, a preset number of target keywords as second reference entity tags from the plurality of keywords;

a third determining module 100, configured to identify proper nouns in the knowledge graph and the related vertical documents, and determine a third reference entity tag according to the proper nouns;

the fourth determining module 110 is configured to determine an entity tag library according to the first reference entity tag, the second reference entity tag, and the third reference entity tag.
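The work of the calculation module 80 and the second determination module 90 can be illustrated with a short sketch. The "preset algorithm" is not specified in this embodiment; a smoothed TF-IDF score is used here purely as an assumed example, and the function name `top_keywords` and the parameter `preset_number` are hypothetical.

```python
import math
from collections import Counter


def top_keywords(doc_words, corpus, preset_number=2):
    """Rank the keywords of one professional document by an importance value.

    doc_words: list of keywords extracted from the professional document.
    corpus: list of word lists, used only to compute document frequencies.
    The importance value is a smoothed TF-IDF score (an assumed choice).
    """
    term_counts = Counter(doc_words)
    total = len(doc_words)
    scores = {}
    for word, count in term_counts.items():
        doc_freq = sum(1 for other in corpus if word in other)
        idf = math.log((1 + len(corpus)) / (1 + doc_freq)) + 1.0
        scores[word] = (count / total) * idf
    # Sort by descending importance value, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:preset_number]]


professional_doc = ["tag", "model", "tag", "the"]
corpus = [professional_doc, ["the", "cat"], ["the", "model"]]
second_reference_tags = top_keywords(professional_doc, corpus)
```

Words that appear in every document of the corpus, such as "the" in this toy corpus, receive the lowest importance value and are not selected.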

In one embodiment of the present application, as shown in fig. 14, on the basis of the apparatus shown in fig. 12, the apparatus further includes: a seventh acquisition module 120, an eighth acquisition module 130, a construction module 140, a ninth acquisition module 150, a judging module 160, a fifth determination module 170, a tenth acquisition module 180, and a training module 190, wherein,

a seventh obtaining module 120, configured to obtain a candidate entity tag library, where the candidate entity tag library includes a plurality of standard entity tags corresponding to a plurality of document types;

an eighth obtaining module 130, configured to perform word segmentation processing on a plurality of standard entity tags to obtain standard tag word segmentation;

a construction module 140, configured to construct a third inverted index table according to standard tag word segmentation corresponding to the plurality of standard entity tags;

a ninth obtaining module 150, configured to obtain a plurality of first training documents corresponding to a document type, and perform word segmentation processing on each first training document in the plurality of first training documents to obtain training document word segmentation;

the judging module 160 is configured to match the training document word segmentation of each first training document with a node in the third inverted index table, and determine whether a third node path corresponding to the training document word segmentation is included;

a fifth determining module 170, configured to determine, when the third node path is included, that a standard entity tag corresponding to the third node path is an entity tag corresponding to the first training document;

a tenth obtaining module 180, configured to obtain a second training document corresponding to the document type, and an entity tag of the second training document labeled in advance;

the training module 190 is configured to train to generate a tag recognition model according to the first training document and the corresponding entity tag thereof, and the second training document and the corresponding entity tag thereof.

In summary, the device for determining an entity tag according to the embodiments of the present application greatly reduces the cost of manually constructing and maintaining the entity tag system. In addition, the ARC-I model is improved so that document title information and related statistical features can be introduced, which greatly improves the accuracy and recall rate of entity tag calculation.

According to embodiments of the present application, an electronic device and a readable storage medium are also provided.

Fig. 15 is a block diagram of an electronic device for the method of determining an entity tag according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.

As shown in fig. 15, the electronic device includes: one or more processors 1501, a memory 1502, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories and multiple types of memory. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). In fig. 15, one processor 1501 is taken as an example.

Memory 1502 is a non-transitory computer-readable storage medium provided herein. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of entity tag determination provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of entity tag determination provided herein.

The memory 1502 serves as a non-transitory computer readable storage medium, and may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., the first acquisition module 10, the second acquisition module 20, the third acquisition module 30, the fourth acquisition module 40, and the first determination module 50 shown in fig. 12) corresponding to the method of determining an entity tag in the embodiments of the present application. The processor 1501 executes various functional applications of the server and data processing, i.e., implements the method of entity tag determination in the above-described method embodiments by running non-transitory software programs, instructions, and modules stored in the memory 1502.

The memory 1502 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created from the use of the electronic device for entity tag determination, and the like. In addition, the memory 1502 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 1502 may optionally include memory located remotely from the processor 1501, and such remote memory may be connected to the electronic device for entity tag determination via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device for the method of determining an entity tag may further include: an input device 1503 and an output device 1504. The processor 1501, memory 1502, input device 1503, and output device 1504 may be connected by a bus or in other manners; connection by a bus is taken as an example in fig. 15.

The input device 1503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for entity tag determination. Examples of the input device include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, a joystick, and the like. The output device 1504 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibration motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, which is not limited herein.

The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims

1. A method of determining an entity tag, comprising:

acquiring an entity tag library corresponding to a document type of a target document, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type;

matching the target document with the entity tag library to obtain a plurality of successfully matched candidate entity tags;

acquiring attribute characteristics of the target document, and acquiring label characteristics corresponding to each candidate entity label according to the target document, wherein the attribute characteristics comprise title characteristics and content characteristics, and the label characteristics comprise the frequency and the position of occurrence of the corresponding candidate entity label in the title of the target document;

obtaining a candidate entity tag library, wherein the candidate entity tag library comprises a plurality of standard entity tags corresponding to a plurality of document types;

Performing word segmentation processing on the plurality of standard entity tags to obtain standard tag word segmentation;

constructing a third inverted index table according to standard tag word segmentation corresponding to the standard entity tags;

acquiring a plurality of first training documents corresponding to the document type, and performing word segmentation processing on each first training document in the plurality of first training documents to obtain training document word segmentation;

matching the training document word segmentation of each first training document with the nodes in the third inverted index table, and judging whether a third node path corresponding to the training document word segmentation is included or not;

if the third node path is included, determining that the standard entity label corresponding to the third node path is the entity label corresponding to the first training document;

acquiring a second training document corresponding to the document type and an entity tag of the second training document marked in advance;

training according to the first training document and the corresponding entity label thereof and the second training document and the corresponding entity label thereof to generate a label identification model;

inputting the attribute features and the tag features into a pre-trained tag identification model, and acquiring a first confidence coefficient corresponding to each candidate entity tag;

And determining a target entity tag of the target document from the plurality of candidate entity tags according to the first confidence coefficient.

2. The method of claim 1, further comprising, prior to said obtaining an entity tag library corresponding to a document type of a target document:

acquiring a document search log, a professional document, a knowledge graph and an associated vertical document corresponding to the document type;

extracting search words in the document search log, performing word segmentation processing on the search words to obtain search word segmentation, and obtaining a first reference entity tag corresponding to the document type according to the search word segmentation;

extracting a plurality of keywords in the professional document, and calculating the importance value of each keyword in the plurality of keywords in the professional document according to a preset algorithm;

determining a preset number of target keywords as second reference entity tags in the plurality of keywords according to the importance value;

identifying proper nouns in the knowledge graph and the associated vertical documents, and determining a third reference entity label according to the proper nouns;

and determining the entity tag library according to the first reference entity tag, the second reference entity tag and the third reference entity tag.

3. The method of claim 2, wherein the determining the entity tag library from the first reference entity tag, the second reference entity tag, and the third reference entity tag comprises:

inputting each of the first reference entity tag, the second reference entity tag and the third reference entity tag into a pre-trained neural network model, and obtaining a second confidence coefficient corresponding to each reference entity tag;

and determining the entity tag library according to the reference entity tag with the second confidence coefficient larger than a preset confidence value.

4. The method of claim 2, wherein the obtaining, according to the search term, the first reference entity tag corresponding to the document type includes:

constructing a first inverted index table of the search words according to the search word segmentation of the document search log;

determining a target node with node priority greater than a preset level in the first inverted index table;

and determining a first node path of the target node in the first inverted index table, and determining the first reference entity tag according to search segmentation covered by the first node path.

5. The method of claim 1, wherein the matching the target document with the entity tag library to obtain a plurality of candidate entity tags that match successfully comprises:

6. The method of claim 5, further comprising, prior to said determining that the entity tag corresponding to the second node path is the candidate entity tag:

counting the occurrence times of the entity labels corresponding to the second node paths in the target document;

and determining that the occurrence number is larger than a preset number threshold.

7. The method of claim 1, wherein the matching the target document with the entity tag library to obtain a plurality of candidate entity tags that match successfully comprises:

Calculating a title semantic vector of a document title of the target document;

calculating a label semantic vector of each entity label;

calculating the semantic similarity of the title semantic vector and the label semantic vector of each entity label, and determining the entity label with the semantic similarity larger than a preset similarity threshold as a candidate entity label.

8. An entity tag determination apparatus, comprising:

the first acquisition module is used for acquiring an entity tag library corresponding to the document type of the target document, wherein the entity tag library comprises a plurality of entity tags corresponding to the document type;

the second acquisition module is used for matching the target document with the entity tag library to acquire a plurality of successfully matched candidate entity tags;

the third acquisition module is used for acquiring attribute characteristics of the target document and acquiring label characteristics corresponding to each candidate entity label according to the target document, wherein the attribute characteristics comprise title characteristics and content characteristics of the target document, and the label characteristics comprise frequency and position of occurrence of the corresponding candidate entity label in the title of the target document;

a fourth obtaining module, configured to input the attribute feature and the tag feature into a pre-trained tag identification model, and obtain a first confidence coefficient corresponding to each candidate entity tag;

The first determining module is configured to determine, according to the first confidence coefficient, a target entity tag of the target document from the plurality of candidate entity tags, and further includes:

a seventh obtaining module, configured to obtain a candidate entity tag library, where the candidate entity tag library includes a plurality of standard entity tags corresponding to a plurality of document types;

an eighth obtaining module, configured to perform word segmentation processing on the plurality of standard entity tags, to obtain standard tag word segmentation;

the construction module is used for constructing a third inverted index table according to standard tag word segmentation corresponding to the standard entity tags;

a ninth obtaining module, configured to obtain a plurality of first training documents corresponding to the document type, and perform word segmentation processing on each first training document in the plurality of first training documents to obtain training document word segmentation;

the judging module is used for matching the training document word segmentation of each first training document with the nodes in the third inverted index table and judging whether a third node path corresponding to the training document word segmentation is included or not;

a fifth determining module, configured to determine, when the third node path is included, that a standard entity tag corresponding to the third node path is an entity tag corresponding to the first training document;

A tenth acquisition module, configured to acquire a second training document corresponding to the document type, and an entity tag of the second training document that is labeled in advance;

and the training module is used for training and generating the tag identification model according to the first training document and the corresponding entity tag thereof and the second training document and the corresponding entity tag thereof.

9. The apparatus of claim 8, further comprising:

a fifth acquisition module, configured to acquire a document search log, a professional document, a knowledge graph, and an associated vertical document corresponding to the document type;

the sixth acquisition module is used for extracting search words in the document search log, performing word segmentation processing on the search words to acquire search word segmentation, and acquiring a first reference entity tag corresponding to the document type according to the search word segmentation;

the calculating module is used for extracting a plurality of keywords in the professional document and calculating the importance value of each keyword in the plurality of keywords in the theme of the professional document according to a preset algorithm;

the second determining module is used for determining a preset number of target keywords as second reference entity tags in the plurality of keywords according to the importance value;

The third determining module is used for identifying proper nouns in the knowledge graph and the related vertical documents and determining a third reference entity label according to the proper nouns;

and a fourth determining module, configured to determine the entity tag library according to the first reference entity tag, the second reference entity tag, and the third reference entity tag.

10. The apparatus of claim 8, wherein the second acquisition module is specifically configured to:

11. The apparatus of claim 8, wherein the second acquisition module is specifically configured to:

calculating a title semantic vector of a document title of the target document;

Calculating a label semantic vector of each entity label;

12. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of determining an entity tag of any one of claims 1-7.

13. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of determining an entity tag of any one of claims 1-7.