CN111274404B - Small sample entity multi-field classification method based on man-machine cooperation

Info

Publication number: CN111274404B
Application number: CN202010088532.0A
Authority: CN
Inventors: 高汕; 李健; 宗畅; 吴海燕
Original assignee: Hangzhou Liangzhi Data Technology Co ltd
Current assignee: Hangzhou Liangzhi Data Technology Co ltd
Priority date: 2020-02-12
Filing date: 2020-02-12
Publication date: 2023-07-14
Anticipated expiration: 2040-02-12
Also published as: CN111274404A

Abstract

The invention discloses a method for classifying entities into multiple fields. Attribute semantic words for the entities in each field are first obtained by crowdsourcing. These semantic words are then matched against the attribute texts of each entity, a score is calculated from the matching result using a scoring formula, and the score is compared with a threshold to obtain a classification result. Experts verify the correctness of the results to produce a small batch of training samples, and the formula coefficients are then adjusted automatically by grid search on this small sample to improve recall and accuracy. By continually optimizing the automated classification, the method avoids the need to check a large number of texts by hand when classifying entities. The invention makes full use of crowdsourcing, man-machine cooperation and semi-supervised learning, and can quickly carry out multi-field classification of entities when labeled data is lacking.

Description

Small sample entity multi-field classification method based on man-machine cooperation

Technical Field

The invention relates to the fields of computer technology, artificial intelligence, natural language processing and label classification, in particular to a human-computer collaborative multi-source text content cognition method in a classification scene of the industrial chain field.

Background

Industry chain analysis plays an important role in the development of regional economies and industries. However, there is currently no good method for assigning the various entities on an industrial chain to their fields. At present, the field of an entity can only be judged and labeled manually from the entity's attribute descriptions.

During manual labeling, the same field is described with different words in different attribute texts. For example, the computer vision field is described as a 'vision algorithm' in a patent, as 'face recognition' in a product, and as a 'CV algorithm engineer' in a recruitment post. Manually enumerating all such words that carry field semantics would require a huge effort.

Automatic classification methods that rely on keywords specified by simple rules cannot achieve high accuracy and high recall at the same time: recall tends to be low if the selected keywords cover too little, and precision tends to be low if they cover too much. The features that help judge the field of an entity appear in the text data of every attribute dimension, and the closeness of the association between a keyword and a field can be reasonably quantified by statistical probability analysis.

Classifying entity fields purely with deep learning or machine learning algorithms has three main drawbacks. First, a large annotated corpus is needed for training. Second, the text must be specially preprocessed and quantized into computable data before use. Third, the black-box nature of deep learning models makes the final result hard to interpret, and the basis for a classification is difficult to trace.

Therefore, how to provide a semi-supervised entity field classification method that uses collective intelligence to collect semantics and achieves high classification accuracy with only a small training corpus is a problem that those skilled in the art need to solve.

Disclosure of Invention

In view of the above, the invention provides a statistical probability text matching algorithm based on a man-machine cooperation mode, and the method solves the problem of multi-field classification of entities by combining crowd-sourced collection, expert verification and other modes, has high classification accuracy, and can be used in the fields of various different types of entities and different industries.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

A small sample entity multi-field classification method based on man-machine cooperation comprises the following steps:

S1: acquiring semantic words related to the entity by crowdsourcing, wherein each crowdsourced result comprises the semantic word together with three dimensions: the field to which it belongs, the attribute to which it belongs, and its degree of semantic association with that field;

S2: initializing the parameters required for entity field classification, wherein the initialized parameters comprise the attribute score $A_i$, the weight coefficient $B_{ni}$ of the semantic association degree, and a classification threshold;

S3: acquiring the multi-attribute texts of the entities, matching each attribute text of an entity with the semantic words of the different fields obtained in step S1, and calculating the score of each entity in the different fields from the matching results;

S4: comparing the score obtained in step S3 with the classification threshold to obtain a classification result, and generating training data after the classification result is verified;

S5: determining the optimal parameters by grid search based on the training data;

S6: predicting the field of the unknown entity to be classified based on the optimal parameters.

Based on the technical scheme, the steps can be realized in the following preferable mode:

preferably, the specific method of step S1 is as follows:

s11: in a crowdsourcing solving platform, semantic words in multi-attribute texts of the entities are obtained in a crowdsourcing mode, wherein the crowdsourcing mode adopts the steps of drawing the semantic words from each attribute text of the entities, or directly providing the semantic words and marking the sources; the crowd-sourced return result comprises three dimensions of semantic words, the belonging field of the semantic words, the belonging attribute and the semantic association degree with the belonging field; a semantic vocabulary belongs to one or more attribute dimensions;

S12: checking the crowdsourced results, and writing the results that pass the check into a database; all semantic words in the database that belong to the j-th field form a dictionary $D_j$, j = 1, 2, …, M, where M is the total number of field classification categories for an entity.
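As an illustration of S11 and S12, the verified crowdsourced records could be grouped into one dictionary per field. This is a minimal sketch: the `SemanticWord` record, the `build_dictionaries` helper, the field names and the sample association levels are assumptions made for illustration and are not specified by the method.

```python
from collections import defaultdict
from typing import NamedTuple


class SemanticWord(NamedTuple):
    """One verified crowdsourced result (hypothetical record layout)."""
    word: str
    domain: str      # field the word belongs to
    attribute: str   # attribute dimension, e.g. "patent"
    level: str       # semantic association degree: "high", "medium" or "low"


def build_dictionaries(records):
    """Group verified records by field: the list stored under field j plays the role of D_j."""
    dictionaries = defaultdict(list)
    for rec in records:
        dictionaries[rec.domain].append(rec)
    return dict(dictionaries)


# Illustrative records only; the levels are made up for the example.
records = [
    SemanticWord("vision algorithm", "computer vision", "patent", "high"),
    SemanticWord("face recognition", "computer vision", "product", "high"),
    SemanticWord("CV algorithm engineer", "computer vision", "recruitment", "medium"),
    SemanticWord("speech synthesis", "speech", "patent", "high"),
]
D = build_dictionaries(records)
print({domain: [w.word for w in words] for domain, words in D.items()})
```

Keeping the attribute and level on every record means the later counting step can group matches by attribute and association level without another lookup.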

Preferably, the specific method of step S2 is as follows:

S21: initializing the total score of each field to 100 and dividing it equally among the attribute dimensions, so that the attribute score of the i-th attribute is $A_i = 100/I$, where I is the number of attributes;

s22: initializing the weight coefficient of the association degree of the semantic vocabulary under each attribute, wherein the higher the association degree of the semantic vocabulary and the belonging field is, the higher the weight coefficient is.

S23: initializing the classification threshold to be equal to $A_i$.

Preferably, in step S2, the degree of association between a semantic word and its field is divided into three levels: high, medium and low; when the degree of association is high, the weight coefficient $B_{1i} = 1.0$; when it is medium, $B_{2i} = 0.8$; when it is low, $B_{3i} = 0.4$.
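The initialization in S21 to S23 can be sketched as follows. The function name `init_parameters`, the attribute names and the dictionary layout are assumptions chosen for illustration; only the values (total 100, equal split, weights 1.0 / 0.8 / 0.4, threshold equal to the attribute score) come from the text.

```python
def init_parameters(attributes, level_weights=None, total=100.0):
    """Return (A, B, threshold) initialized as described in S21 to S23."""
    if level_weights is None:
        # Weights for the three association levels given in the text.
        level_weights = {"high": 1.0, "medium": 0.8, "low": 0.4}
    share = total / len(attributes)                # A_i = 100 / I
    A = {attr: share for attr in attributes}       # attribute scores
    # B_ni: every attribute starts with the same level weights.
    B = {attr: dict(level_weights) for attr in attributes}
    threshold = share                              # S23: threshold starts equal to A_i
    return A, B, threshold


# Hypothetical attribute names for a business entity.
attributes = ["name", "profile", "patent", "software_copyright", "recruitment"]
A, B, threshold = init_parameters(attributes)
print(A["patent"], B["patent"]["medium"], threshold)
```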

Preferably, the specific method of step S3 is as follows:

for each field in turn, based on the semantic word dictionary $D_j$ of that field obtained in S1, the score of each entity in the j-th field is calculated, j = 1, 2, …, M, as follows:

S31: acquiring the multi-attribute text of an entity, then matching each attribute text against every semantic word in dictionary $D_j$, and outputting the number of occurrences of each semantic word of $D_j$ in the attribute text; within one attribute text, a semantic word that appears several times is counted only once;

S32: from the matching result obtained in S31, counting, for each attribute text of the entity, the total number of occurrences of all semantic words of $D_j$ at each semantic association degree;

S33: from the statistics obtained in S32, calculating the score of the entity in the j-th field by the formula:

$$\mathrm{score} = \sum_{i=1}^{I} A_i \sum_{n=1}^{N} B_{ni}\, C_{ni}$$

wherein $A_i$ represents the attribute score of the i-th attribute, $B_{ni}$ represents the weight of the n-th semantic association degree for the i-th attribute, and $C_{ni}$ represents the total number of occurrences, in the i-th attribute text of the entity, of all semantic words of the n-th semantic association degree; I is the number of attributes and N is the number of semantic association levels; if the value of $\sum_{n=1}^{N} B_{ni} C_{ni}$ is greater than 1, it is set equal to 1, so that the accumulated score over all attribute dimensions remains bounded by the same total score.
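A minimal sketch of the S33 calculation follows. It assumes the score is the sum over attributes of $A_i$ multiplied by the weighted match count $\sum_n B_{ni} C_{ni}$, with that weighted count capped at 1 as the text describes. The function name `domain_score`, the nested-dictionary layout and the example counts are illustrative assumptions.

```python
def domain_score(A, B, C):
    """Score of one entity in one field.

    A: attribute -> attribute score A_i
    B: attribute -> {level -> weight B_ni}
    C: attribute -> {level -> count C_ni} (attributes without matches may be absent)
    """
    score = 0.0
    for attr, a_i in A.items():
        weighted = sum(B[attr][level] * count
                       for level, count in C.get(attr, {}).items())
        # Cap the weighted count at 1 so one attribute never contributes more than A_i.
        score += a_i * min(1.0, weighted)
    return score


weights = {"high": 1.0, "medium": 0.8, "low": 0.4}
attrs = ["name", "profile", "patent", "software_copyright", "recruitment"]
A = {attr: 20.0 for attr in attrs}
B = {attr: dict(weights) for attr in attrs}
# Hypothetical counts: two high-level words in the patent text, one low-level word in the profile.
C = {"patent": {"high": 2}, "profile": {"low": 1}}
print(domain_score(A, B, C))
```

In this example the patent attribute would reach a weighted count of 2.0 and is capped to 1, so it contributes its full 20 points, while the profile contributes 20 × 0.4.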

Preferably, the specific method of step S4 is as follows:

S41: comparing the score of each entity in each field with the classification threshold, and judging that the entity belongs to a field if its score in that field is higher than the classification threshold;

S42: verifying the judgments based on expert knowledge, and taking the entities correctly assigned to each field, according to the verified results, as training data.

Preferably, the specific method of step S5 is as follows:

determining the optimal parameters by grid search based on the training data obtained in step S4, wherein the parameters of the grid search comprise the attribute score $A_i$, the weight coefficient $B_{ni}$ of the semantic association degree, and the classification threshold; the Jaccard coefficient is used as the evaluation index for the optimal parameters, and is calculated as:

$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$$

wherein x represents the set of field labels predicted for an entity; y represents the set of real field labels of the entity; $|x \cap y|$ is the number of labels in the intersection of the predicted and real labels; $|x \cup y|$ is the number of labels in their union; finally, the grid search selects as the optimal parameters those that maximize the average Jaccard coefficient over all samples.
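The Jaccard evaluation index can be sketched as below. The function names and the label strings are illustrative, and the treatment of two empty label sets as a perfect match is an assumption, since the text does not define that case.

```python
def jaccard(predicted, actual):
    """Jaccard coefficient |x ∩ y| / |x ∪ y| between predicted and real label sets."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    if not union:
        return 1.0  # assumption: no predicted labels and no real labels counts as agreement
    return len(predicted & actual) / len(union)


def mean_jaccard(pairs):
    """Average Jaccard coefficient over (predicted, actual) pairs, one pair per sample."""
    pairs = list(pairs)
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)


samples = [
    ({"computer vision", "speech"}, {"computer vision"}),   # one extra predicted label
    ({"geographic information"}, {"geographic information"}),  # exact match
]
print(mean_jaccard(samples))
```

Because the coefficient is computed on label sets, it penalizes both missing and superfluous fields, which suits a setting where an entity may belong to several fields.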

Preferably, the semantic vocabulary library is expanded through multiple rounds, the training sample is expanded through expert knowledge verification, and the grid search in the step S5 is repeated after each expansion to determine new optimal parameters.

Preferably, the specific method of step S6 is as follows:

s61: according to the method of the step S3, acquiring multi-attribute texts of the unknown entity to be classified, matching each attribute text of the unknown entity with semantic vocabularies of different fields obtained in the step S1, and calculating scores of the unknown entity in different fields according to matching results;

s62: and comparing the score of the unknown entity belonging to each field with the classification threshold value in the optimal parameter, and judging the entity belonging to the field if the score of the entity belonging to the field is higher than the classification threshold value in the optimal parameter.

Preferably, when acquiring the multi-attribute text of the entity, if a plurality of texts exist under the same attribute, the plurality of texts are spliced to obtain the attribute text.

Compared with the prior art, the invention discloses a method for acquiring a semantic library by utilizing a crowdsourcing mode, quantifying semantic classification, counting the score of an entity in a certain field according to whether the attribute of the entity comprises semantic lexicon in the field, and finally setting a threshold value to judge a classification result. When the method is used for classifying the entities, only a semantic vocabulary library and a database of various parameters are required to be maintained, and the entity attribute text to be classified is transmitted into the system, so that the classification result can be obtained.

Business entities in the database were classified with this method, and recall and accuracy were calculated on random samples; after parameter adjustment, the recall finally exceeded 80% and the accuracy exceeded 90%. The method can be applied to classifying enterprise entities and expert entities in the artificial intelligence and geographic information industry chains, with good results.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flow chart of an embodiment of an entity multi-domain classification algorithm.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The main innovation of the invention is to replace classification by direct, hard keyword matching with keyword matching softened by statistical probability. Crowdsourcing improves the efficiency of accumulating semantic words, expert verification of classification results yields training data, and grid search finds the optimized parameters, so that the advantages of man-machine cooperation are fully used to improve the classification effect. The method makes full use of accumulated knowledge and reduces the dependence on labeled data.

The following details a specific implementation manner of the small sample entity multi-field classification method based on man-machine cooperation, which comprises the following steps:

S1: acquiring semantic words related to the entity by crowdsourcing, wherein each crowdsourced result comprises the semantic word together with three dimensions: the field to which it belongs, the attribute to which it belongs, and its degree of semantic association with that field.

In this implementation, the specific method of step S1 is as follows:

s11: in a crowdsourcing solving platform, semantic words in multi-attribute texts (containing multiple attribute texts) of the entity are obtained in a crowdsourcing mode, wherein the crowdsourcing mode adopts the steps of dividing the semantic words from each attribute text of the entity, or directly providing the semantic words and marking the source; the crowd-sourced return result comprises three dimensions of semantic words, the belonging field of the semantic words, the belonging attribute and the semantic association degree with the belonging field; a semantic vocabulary belongs to one or more attribute dimensions. For example, with the semantic vocabulary "visual algorithm" in the patent text, the domain to which the semantic vocabulary belongs may be labeled as "computer visual domain" in the crowd-sourced results, the attribute is "patent", the semantic association degree is "high", and these crowd-sourced results may be returned for subsequent verification. The crowdsourcing solving platform can comprise an open source tool and a specific scene tool which is developed independently, and when a crowdsourcing task is issued, a plurality of fixed fields, attribute dimensions and semantic association degrees can be preset, so that the returned crowdsourcing result meets the requirements.

S2: initializing the parameters required for entity field classification, wherein the initialized parameters comprise the attribute score $A_i$, the weight coefficient $B_{ni}$ of the semantic association degree, and a classification threshold.

In this implementation, the specific method of step S2 is as follows:

S21: initializing the total score of each field to 100 and dividing it equally among the attribute dimensions, so that the attribute score of the i-th attribute is $A_i = 100/I$, where I is the number of attributes.

In the present invention, the specific attributes differ from one entity type to another. For example, a business entity may include attributes such as business profile, business name, patents, software copyrights and recruitment posts; an expert entity may include papers, patents, personal profile, research areas, works, and the like.

S22: initializing the weight coefficient of the association degree of the semantic words under each attribute, wherein the higher the degree of association between a semantic word and its field, the higher the weight coefficient. The number of association levels can be adjusted as needed; 2 to 5 levels is appropriate. For example, in this implementation the degree of association is divided into three levels: high, medium and low; when the degree of association is high, the weight coefficient $B_{1i} = 1.0$; when it is medium, $B_{2i} = 0.8$; when it is low, $B_{3i} = 0.4$.

S3: acquiring the multi-attribute texts of the entities, matching each attribute text of an entity with the semantic words of the different fields obtained in step S1, and calculating the score of each entity in the different fields from the matching results.

In this implementation, the specific method of step S3 is as follows:

for each field in turn, based on the semantic word dictionary $D_j$ of that field obtained in S1, the score of each entity in the j-th field is calculated (j takes the values 1, 2, …, M in turn), as follows:

S31: first, acquiring the multi-attribute texts of the entities, wherein the attribute texts differ according to the type of entity. For example, when the entity to be classified is a business entity, the attribute texts may include the business profile, business name, patents, software copyrights and recruitment posts; when the entity to be classified is an expert entity, the attribute texts may include papers, patents, personal profile, research fields and works. If several texts exist under the same attribute, they are spliced together to form the attribute text. The attribute texts may be crawled from the web or obtained in other ways.

Each attribute text is then matched against every semantic word in dictionary $D_j$. Regular-expression matching is used to determine whether the text contains the semantic word, and the number of occurrences of each semantic word of $D_j$ in the attribute text is output. Within one attribute text, a semantic word that appears several times is counted only once.

From the matching results, the number of semantic words at each semantic association degree under each attribute is counted and denoted $C_{ni}$, where the subscript i denotes the i-th attribute and n denotes the n-th semantic association degree, i = 1, 2, …, I; n = 1, 2, …, N. N is the total number of association levels between semantic words and the field, generally 2 to 5. In this implementation, since the degree of association is divided into three levels (high, medium and low), N = 3.

The score of the entity in the j-th field is then calculated by the formula:

$$\mathrm{score} = \sum_{i=1}^{I} A_i \sum_{n=1}^{N} B_{ni}\, C_{ni}$$

If the value of $\sum_{n=1}^{N} B_{ni} C_{ni}$ is greater than 1, it is set equal to 1.

It should be noted that, when calculating the score of the entity in the j-th field, the count $C_{ni}$ must be the total number of occurrences of the semantic words in the dictionary $D_j$ of that field. That is, the invention scores an entity in a field according to whether the entity's attributes contain semantic words of that field.
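The matching and counting in S31 and S32 might be sketched as follows. Several choices here are assumptions rather than statements of the method: each semantic word is matched only against the text of the attribute it was labeled with, matching is case-insensitive, and the function name `count_matches` and the tuple layout are hypothetical.

```python
import re


def count_matches(attribute_texts, domain_words):
    """Count matched semantic words per attribute and association level.

    attribute_texts: attribute -> spliced attribute text of one entity
    domain_words:    iterable of (word, attribute, level) taken from one dictionary D_j
    Returns attribute -> {level -> count}, i.e. the C_ni values for field j.
    """
    counts = {}
    for word, attr, level in domain_words:
        text = attribute_texts.get(attr, "")
        # re.search only tests presence, so repeated occurrences count once.
        if re.search(re.escape(word), text, flags=re.IGNORECASE):
            per_level = counts.setdefault(attr, {})
            per_level[level] = per_level.get(level, 0) + 1
    return counts


texts = {
    "patent": "A vision algorithm for cameras. The vision algorithm uses face recognition.",
    "recruitment": "Hiring a backend developer.",
}
words = [
    ("vision algorithm", "patent", "high"),
    ("face recognition", "patent", "medium"),
    ("CV algorithm engineer", "recruitment", "high"),
]
C = count_matches(texts, words)
print(C)
```

Escaping each word with `re.escape` keeps punctuation in crowdsourced words from being interpreted as regular-expression syntax.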

S4: comparing the score obtained in step S3 with the classification threshold to obtain a classification result, and verifying the classification result to generate training data.

In this implementation, the specific method of step S4 is as follows:

S42: verifying the judgments based on expert knowledge, removing the data that fail verification, and taking the entities correctly assigned to each field, according to the verified results, as small sample training data.

S5: based on the training data in S42 described above, the optimal parameters are determined by the grid search.

In this implementation, the specific method of step S5 is as follows:

determining the optimal parameters by grid search based on the training data obtained in S4, wherein the parameters of the grid search comprise the attribute score $A_i$, the weight coefficient $B_{ni}$ of the semantic association degree, and the classification threshold; the Jaccard coefficient is used as the evaluation index for the optimal parameters, and is calculated as:

$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$$

wherein x represents the set of field labels predicted for an entity; y represents the set of real field labels of the entity; $|x \cap y|$ is the number of labels in the intersection of the predicted and real labels; and $|x \cup y|$ is the number of labels in their union. The parameter ranges are generally set as follows: the attribute scores $A_i$ must total 100 over all attributes and are adjusted in steps of 5 during the grid search; the weight coefficient $B_{ni}$ of the semantic association degree ranges from 0 to 1.5 and is adjusted in steps of 0.1; the classification threshold ranges from 100/N to 100 (where N here is the number of attributes) and is adjusted in steps of 5. Finally, the grid search selects as the optimal parameters those that maximize the average Jaccard coefficient over all samples.
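A compact sketch of such a grid search is given below. It is deliberately simplified and rests on assumptions: the level weights are shared across attributes rather than searched per attribute, candidates are restricted to non-increasing weights from high to low, the grids in the demo are far coarser than the step sizes stated above, and all function names and sample data are hypothetical.

```python
from itertools import product


def score(A, B, C):
    """Sum over attributes of A_i times the weighted match count, capped at 1."""
    return sum(a * min(1.0, sum(B[lvl] * n for lvl, n in C.get(attr, {}).items()))
               for attr, a in A.items())


def jaccard(x, y):
    union = x | y
    return len(x & y) / len(union) if union else 1.0  # assumption for two empty sets


def attribute_splits(attrs, step, total=100):
    """Yield every assignment of attribute scores (multiples of step) that sums to total."""
    def rec(rest, units):
        if len(rest) == 1:
            yield {rest[0]: units * step}
            return
        for u in range(units + 1):
            for tail in rec(rest[1:], units - u):
                yield {rest[0]: u * step, **tail}
    yield from rec(list(attrs), total // step)


def grid_search(samples, attrs, a_step, weight_grid, thresholds):
    """samples: list of (counts_by_domain, true_labels); returns (best mean Jaccard, parameters)."""
    best_score, best_params = -1.0, None
    for A in attribute_splits(attrs, a_step):
        for hi, mid, lo in product(weight_grid, repeat=3):
            if not hi >= mid >= lo:
                continue  # assumption: keep higher association levels at higher weights
            B = {"high": hi, "medium": mid, "low": lo}
            for t in thresholds:
                total = 0.0
                for counts_by_domain, truth in samples:
                    pred = {d for d, C in counts_by_domain.items() if score(A, B, C) > t}
                    total += jaccard(pred, truth)
                mean = total / len(samples)
                if mean > best_score:
                    best_score, best_params = mean, {"A": A, "B": B, "threshold": t}
    return best_score, best_params


# Two hypothetical verified samples: per-field match counts plus the expert-confirmed labels.
samples = [
    ({"cv": {"patent": {"high": 1}}, "nlp": {"profile": {"low": 1}}}, {"cv"}),
    ({"cv": {"profile": {"low": 1}}, "nlp": {"patent": {"high": 1}}}, {"nlp"}),
]
best_score, best_params = grid_search(samples, ["profile", "patent"], a_step=50,
                                      weight_grid=[0.4, 1.0], thresholds=[50, 75])
print(best_score, best_params)
```

With the step sizes given in the text the full grid grows quickly with the number of attributes, so a practical run would likely need a coarser grid or some pruning; the exhaustive loop here is only meant to show the selection criterion.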

In practical use, the semantic vocabulary library should be expanded through multiple rounds, and the training samples should be checked and expanded through expert knowledge, and after each time the semantic vocabulary library is expanded or the training samples are expanded, the grid search in step S5 needs to be repeated to determine new optimal parameters.

S6: based on the determined optimal parameters, predicting the field of the unknown entity to be classified.

In this implementation, the specific method of step S6 is as follows:

s61: according to the method of the step S3, multi-attribute texts of the unknown entity to be classified are obtained, each attribute text of the unknown entity is matched with semantic vocabularies of different fields obtained in the step S1, and scores of the unknown entity in the different fields are calculated according to matching results, specifically, the steps S31-S33 are seen.

S62: comparing the score of the unknown entity in each field with the classification threshold in the latest optimal parameters, and judging that the entity belongs to a field if its score in that field is higher than that threshold. This yields the predicted field of the unknown entity, which may be one field, several fields, or no field at all.
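The threshold comparison in S62 can be sketched in a few lines. The function name `predict_domains` and the example scores are hypothetical; the strict "higher than" comparison follows the wording of the step.

```python
def predict_domains(scores, threshold):
    """Return every field whose score is strictly higher than the threshold.

    scores: field -> score of one entity in that field
    The result may contain one field, several fields, or none.
    """
    return sorted(domain for domain, s in scores.items() if s > threshold)


print(predict_domains({"cv": 60, "nlp": 20, "gis": 45}, threshold=40))
print(predict_domains({"cv": 20}, threshold=20))
```

A score exactly equal to the threshold does not qualify, which is why the second call yields an empty result.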

The following shows a specific implementation thereof by way of example based on the above method. Specific steps in this embodiment are as described above, and detailed description thereof will be omitted, mainly to show specific parameter settings and technical effects.

Examples

Referring to fig. 1, the embodiment specifically provides a method for classifying multiple fields of entities, where the method includes steps S1 to S6, and the specific implementation process of each step is as follows:

step 1: crowd-sourced method for obtaining semantic vocabulary

In the embodiment, semantic vocabularies belonging to different fields in texts with different attributes are obtained through a crowdsourcing platform, and the association importance of the vocabularies is distinguished. And writing the checked semantic vocabulary into a database.

Step 2: initializing individual parameters in a calculation formula

In this embodiment, a business entity is taken as the example for the attribute dimensions: the enterprise's name, profile, patents, software copyrights and recruitment data are collected from the internet, giving 5 dimensions in total. The total score is set to 100 points, so each attribute is assigned 20 points, and the weight coefficients of each attribute dimension are initialized to 1.0 for high, 0.8 for medium and 0.4 for low.

Step 3: Acquiring the multi-attribute text of the entity, matching it with the semantic vocabulary, and calculating the field score according to the formula

In this embodiment, the attribute texts of the entity are spliced first: for patents, the patent name and patent abstract are spliced; for software copyrights, the software copyright names are spliced; for recruitment, the recruitment post and post details are spliced. After each attribute text is matched against the corresponding semantic words, the number of words at each of the three levels (high, medium and low) under each attribute is counted. The matching results are stored in a database to make querying, statistics and result analysis convenient.

The calculation formula in this embodiment is:

$$\mathrm{score} = \sum_{i=1}^{I} A_i \sum_{n=1}^{N} B_{ni}\, C_{ni}$$

wherein $A_i$ represents the attribute score of the i-th attribute, $B_{ni}$ represents the weight of the n-th semantic association degree for the i-th attribute, and $C_{ni}$ represents the total number of occurrences, in the i-th attribute text of the entity, of all semantic words of the n-th semantic association degree. In particular, if the value of $\sum_{n=1}^{N} B_{ni} C_{ni}$ is greater than 1, it is set equal to 1.

Step 4: Comparing with the threshold to obtain a classification result, and generating training data from the expert-verified results

In this embodiment, with the initial threshold of 20 points, an entity is classified into every field in which its score exceeds 20 points; the fields assigned to each entity are collected and checked by an expert. The checked data are organized into training data for the subsequent grid search parameter optimization.
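As a worked illustration of the threshold check, assume the score is the sum over attributes of $A_i$ times the weighted match count capped at 1, with the embodiment's initial values $A_i = 20$ and weights 1.0 / 0.8 / 0.4. The match counts below are hypothetical and chosen only to show the arithmetic.

```latex
\begin{aligned}
\text{patent: } & 1.0 \cdot 1 + 0.8 \cdot 1 = 1.8 \;\to\; \text{capped to } 1 && \Rightarrow\ 20 \cdot 1 = 20 \\
\text{profile: } & 0.8 \cdot 1 = 0.8 && \Rightarrow\ 20 \cdot 0.8 = 16 \\
\text{recruitment: } & 0.4 \cdot 1 = 0.4 && \Rightarrow\ 20 \cdot 0.4 = 8 \\
\text{remaining two attributes: } & 0 && \Rightarrow\ 0 \\
\text{score} &= 20 + 16 + 8 = 44 > 20
\end{aligned}
```

Since 44 exceeds the initial threshold of 20, this hypothetical entity would be assigned to the field. An entity whose only match is a single low-level word in one attribute would score 8 and would not be assigned.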

Step 5: using training data for grid search of optimal parameters

The parameters of the grid search in this embodiment include the attribute score $A_i$, the weight coefficient $B_{ni}$ of the semantic association degree, and the classification threshold. The Jaccard coefficient is used as the evaluation index. For the parameter ranges, each attribute score ranges from 0 to 100 and is adjusted in steps of 5, subject to the total score being 100; the weight coefficient of the semantic association degree ranges from 0 to 1.5 and is adjusted in steps of 0.1; the classification threshold ranges from 100/N to 100 (where N here is the number of attributes) and is adjusted in steps of 5. Finally, the grid search selects the parameters that maximize the average Jaccard coefficient over all samples as the final optimization result.

In this embodiment, the semantic library is expanded and the training samples are verified and expanded by experts over multiple rounds; the grid search of step 5 is repeated each time to optimize the parameters until the final parameters are determined, and the parameters from each adjustment are stored in the database together with their version.

Step 6: predicting unknown entities using finalized parameters

In this embodiment, the final parameters are read from the database by version number and all semantic words are loaded; the attribute texts of an entity are then input and the field or fields to which the entity belongs are output. The output may be a single value, multiple values, or empty.
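Storing each tuned parameter set under a version number and reading it back could look like the sketch below. The text does not say which database or schema is used; SQLite, the `parameters` table, the JSON payload and both helper functions are assumptions made purely for illustration.

```python
import json
import sqlite3


def save_parameters(conn, params):
    """Store one parameter set and return its automatically assigned version number."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parameters ("
        "version INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
    )
    cur = conn.execute("INSERT INTO parameters (payload) VALUES (?)",
                       (json.dumps(params),))
    conn.commit()
    return cur.lastrowid


def load_parameters(conn, version=None):
    """Load the parameter set for a version, or the latest one if no version is given."""
    if version is None:
        row = conn.execute(
            "SELECT payload FROM parameters ORDER BY version DESC LIMIT 1").fetchone()
    else:
        row = conn.execute(
            "SELECT payload FROM parameters WHERE version = ?", (version,)).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
v1 = save_parameters(conn, {"threshold": 20, "B": {"high": 1.0, "medium": 0.8, "low": 0.4}})
v2 = save_parameters(conn, {"threshold": 35, "B": {"high": 1.2, "medium": 0.7, "low": 0.3}})
print(v1, v2, load_parameters(conn, v1)["threshold"], load_parameters(conn)["threshold"])
```

Keeping every version rather than overwriting makes it possible to reproduce an earlier classification run after the semantic library or the training samples have been expanded.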

It should be noted that if the attribute is missing in the entity, the entity with the missing data should be processed separately.

To ensure the reliability of parameter adjustment, the accuracy of the training data should be ensured as far as possible, and entities that are well known in their field can be selected. For example, SenseTime, a well-known enterprise in the computer vision field of the artificial intelligence industry, can be used as training data for business entity classification.

Business entities in the database were classified with this method, and recall and accuracy were calculated on random samples; after parameter adjustment, the recall finally exceeded 80% and the accuracy exceeded 90%.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. The small sample entity multi-field classification method based on man-machine cooperation is characterized by comprising the following steps of:

s1: acquiring semantic words related to the entity in a crowdsourcing mode, wherein a crowdsourcing return result comprises the semantic words, the belonging field of the semantic words, the belonging attribute and the semantic association degree with the belonging field;

S12: checking the crowdsourced results, and writing the results that pass the check into a database; all semantic words in the database that belong to the j-th field form a dictionary $D_j$, j = 1, 2, …, M, where M is the total number of field classification categories for an entity;

s6: predicting the domain of the unknown entity to be classified based on the optimal parameters;

the specific method of step S3 is as follows:

for each field in turn, based on the semantic word dictionary $D_j$ of that field obtained in S12, the score of each entity in the j-th field is calculated, j = 1, 2, …, M, as follows:

if the value of $\sum_{n=1}^{N} B_{ni} C_{ni}$ is greater than 1, it is set equal to 1.

2. The method according to claim 1, wherein the specific method of step S1 is as follows:

s11: in a crowdsourcing solving platform, semantic words in multi-attribute texts of the entities are obtained in a crowdsourcing mode, wherein the crowdsourcing mode adopts the steps of drawing the semantic words from each attribute text of the entities, or directly providing the semantic words and marking the sources; the crowd-sourced return result comprises semantic words, the belonging field of the semantic words, the belonging attribute and the semantic association degree with the belonging field; a semantic vocabulary belongs to one or more attribute dimensions.

3. The method according to claim 1, characterized in that the specific method of step S2 is as follows:

s22: initializing a weight coefficient of the association degree of the semantic vocabulary under each attribute, wherein the higher the association degree of the semantic vocabulary and the belonging field is, the higher the weight coefficient is;

4. A method according to claim 3, wherein in step S2, the degree of association between a semantic word and its field is divided into three levels, namely high, medium and low; when the degree of association is high, the weight coefficient $B_{1i} = 1.0$; when it is medium, $B_{2i} = 0.8$; when it is low, $B_{3i} = 0.4$.

5. The method according to claim 1, wherein the specific method of step S4 is as follows:

s41: comparing the score of each entity belonging to each field with the classification threshold, and judging the entity belonging to the field if the score of the entity belonging to a certain field is higher than the classification threshold;

6. The method according to claim 1, wherein the specific method of step S5 is as follows:

7. The method of claim 1, wherein the new optimal parameters are determined by repeating the grid search in step S5 after each expansion by expanding the semantic vocabulary library in multiple rounds and by expanding the training samples by expert knowledge verification.

8. The method according to claim 1, characterized in that the specific method of step S6 is as follows:

s62: and comparing the score of the unknown entity belonging to each field with the classification threshold value in the optimal parameter, and judging the entity belonging to the field if the score of the entity belonging to a certain field is higher than the classification threshold value in the optimal parameter.

9. The method of claim 1, wherein when obtaining the multi-attribute text of the entity, if there are multiple texts under the same attribute, the multiple texts are spliced to obtain the attribute text.