CN111274404B - Small sample entity multi-field classification method based on man-machine cooperation - Google Patents

Info

Publication number
CN111274404B
Authority
CN
China
Prior art keywords
attribute
entity
semantic
field
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010088532.0A
Other languages
Chinese (zh)
Other versions
CN111274404A (en)
Inventor
高汕
李健
宗畅
吴海燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Liangzhi Data Technology Co ltd
Original Assignee
Hangzhou Liangzhi Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Liangzhi Data Technology Co ltd filed Critical Hangzhou Liangzhi Data Technology Co ltd
Priority to CN202010088532.0A priority Critical patent/CN111274404B/en
Publication of CN111274404A publication Critical patent/CN111274404A/en
Application granted granted Critical
Publication of CN111274404B publication Critical patent/CN111274404B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification

Abstract

The invention discloses a method for classifying entities into multiple domains. Semantic words for the entity attributes of each domain are first obtained through crowdsourcing and then matched against the attribute texts of the entities. A score is calculated from the matching results with a scoring formula and compared with a threshold to obtain a classification result. Experts then verify the correctness of the results, which produces a small batch of training samples, and the formula coefficients are adjusted automatically by grid search on these samples to improve recall and accuracy. By continuously optimizing the automated classification, the method removes the need to review large amounts of text in manual entity classification. The invention makes full use of crowdsourcing, man-machine cooperation and semi-supervised learning, and can quickly implement multi-domain classification of entities when labeled data is lacking.

Description

Small sample entity multi-field classification method based on man-machine cooperation
Technical Field
The invention relates to the fields of computer technology, artificial intelligence, natural language processing and label classification, in particular to a human-computer collaborative multi-source text content cognition method in a classification scene of the industrial chain field.
Background
Industry chain analysis plays an important role in the development of regional economies and industries. However, there is currently no good method for assigning the various entities on an industry chain to their categories. At present, an entity can only be labeled manually, by judging its category from its attribute descriptions.
During manual labeling, the same domain is described with different words in different attribute texts. For example, the computer vision domain is described as 'vision algorithm' in a patent, as 'face recognition' in a product, and as 'CV algorithm engineer' in a recruitment post. Exhaustively listing all such domain-bearing words by hand would require a huge effort.
Automatic classification methods that specify keywords with simple rules cannot achieve high accuracy and high recall at the same time: recall tends to be low if the selected keywords cover too little, and precision tends to be low if they cover too much. Feature descriptions that help determine the domain of an entity can appear in the text data of every attribute dimension, and the closeness of the association between a keyword and a domain can be reasonably quantified by statistical probability analysis.
Classifying entity domains purely with deep learning or machine learning algorithms has three main drawbacks: first, a large annotated corpus is needed for training; second, the text must be specially preprocessed and converted into computable numerical data before use; third, the black-box nature of deep learning models makes the final result hard to interpret, and the basis for a classification is difficult to trace back.
Therefore, how to provide a semi-supervised entity domain classification method that collects semantics through collective intelligence and achieves high classification accuracy with a small training corpus is a problem that those skilled in the art need to solve.
Disclosure of Invention
In view of the above, the invention provides a statistical-probability text matching algorithm based on man-machine cooperation. By combining crowdsourced collection, expert verification and similar means, the method solves the problem of multi-domain classification of entities, achieves high classification accuracy, and can be applied to many different entity types and industries.
To achieve the above purpose, the invention adopts the following technical scheme:
A small sample entity multi-domain classification method based on man-machine cooperation comprises the following steps:
S1: acquiring semantic words related to the entities through crowdsourcing, wherein each semantic word returned by crowdsourcing carries three dimensions: the domain to which it belongs, the attribute to which it belongs, and its degree of semantic association with that domain;
S2: initializing the parameters required for entity domain classification, wherein the initialized parameters comprise the attribute score A_i, the weight coefficient B_ni of the semantic association degree, and a classification threshold;
S3: acquiring the multi-attribute texts of the entities, matching each attribute text of an entity against the semantic words of the different domains obtained in step S1, and calculating the score of each entity in each domain from the matching results;
S4: comparing the scores obtained in step S3 with the classification threshold to obtain a classification result, and generating training data after the classification result has been verified;
S5: determining the optimal parameters by grid search based on the training data;
S6: predicting the domains of unknown entities to be classified based on the optimal parameters.
Based on the technical scheme, the steps can be realized in the following preferable mode:
Preferably, the specific method of step S1 is as follows:
S11: on a crowdsourcing platform, obtaining the semantic words in the multi-attribute texts of the entities through crowdsourcing, wherein contributors either extract the semantic words from each attribute text of an entity or directly provide semantic words and mark their source; each crowdsourced result comprises the semantic word together with three dimensions: the domain to which it belongs, the attribute to which it belongs, and its degree of semantic association with that domain; a semantic word belongs to one or more attribute dimensions;
S12: verifying the crowdsourced results and writing the verified results into a database; all semantic words in the database belonging to the j-th domain form the dictionary D_j, j = 1, 2, …, M, where M is the total number of domain categories for the entities.
Preferably, the specific method of step S2 is as follows:
S21: initializing the total score of each domain to 100 and dividing it equally among the attribute dimensions, so that the attribute score of the i-th attribute is A_i = 100/I, where I is the number of attributes;
S22: initializing the weight coefficients of the association degrees of the semantic words under each attribute, wherein the higher the degree of association between a semantic word and its domain, the higher the weight coefficient;
S23: initializing the classification threshold to be equal to A_i.
Preferably, in step S2, the degree of association between a semantic word and its domain is divided into three levels: high, medium and low; when the degree of association is high, the weight coefficient B_1i = 1.0; when it is medium, the weight coefficient B_2i = 0.8; when it is low, the weight coefficient B_3i = 0.4.
Preferably, the specific method of step S3 is as follows:
For each domain in turn, the score of each entity in the j-th domain (j = 1, 2, …, M) is calculated based on the semantic word dictionary D_j of that domain obtained in S1, as follows:
S31: acquiring the multi-attribute text of the entity, then matching each attribute text against each semantic word in dictionary D_j, and outputting the number of occurrences of each semantic word of D_j in the attribute text; within one attribute text, a semantic word that appears several times is counted only once;
S32: from the matching result obtained in S31, counting, for each attribute text of the entity, the total number of occurrences of all semantic words of dictionary D_j at each semantic association degree;
S33: calculating the score of the entity in the j-th domain from the statistics obtained in S32, using the formula:
$$\mathrm{score}_j = \sum_{i=1}^{I} A_i \sum_{n=1}^{N} B_{ni} C_{ni}$$
wherein A_i represents the attribute score of the i-th attribute, B_ni represents the weight of the n-th semantic association degree under the i-th attribute, C_ni represents the total number of occurrences, in the i-th attribute text of the entity, of all semantic words with the n-th semantic association degree, I is the number of attributes, and N is the number of association degree levels; if the value of $\sum_{n=1}^{N} B_{ni} C_{ni}$ is greater than 1, it is set equal to 1, so that each attribute contributes at most A_i and the accumulated score over all attribute dimensions stays on the same scale.
Preferably, the specific method of step S4 is as follows:
S41: comparing the score of each entity in each domain with the classification threshold, and judging that the entity belongs to a domain if its score in that domain is higher than the classification threshold;
S42: verifying the judgment results based on expert knowledge, and taking the entities correctly assigned to each domain, according to the verified results, as training data.
Preferably, the specific method of step S5 is as follows:
Determining the optimal parameters by grid search based on the training data obtained in step S4, wherein the parameters searched comprise the attribute score A_i, the weight coefficient B_ni of the semantic association degree, and the classification threshold; the Jaccard coefficient is used as the evaluation index for selecting the optimal parameters, and is calculated as:
$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
wherein x represents the set of domain labels predicted for an entity; y represents the set of real domain labels of the entity; |x ∩ y| represents the number of labels in the intersection of the predicted and real labels; |x ∪ y| represents the number of labels in their union; finally, the grid search selects the parameters that maximize the average Jaccard coefficient over all samples as the optimal parameters.
Preferably, the semantic vocabulary library is expanded through multiple rounds, the training sample is expanded through expert knowledge verification, and the grid search in the step S5 is repeated after each expansion to determine new optimal parameters.
Preferably, the specific method of step S6 is as follows:
S61: following the method of step S3, acquiring the multi-attribute texts of the unknown entity to be classified, matching each attribute text of the unknown entity against the semantic words of the different domains obtained in step S1, and calculating the score of the unknown entity in each domain from the matching results;
S62: comparing the score of the unknown entity in each domain with the classification threshold in the optimal parameters, and judging that the entity belongs to a domain if its score in that domain is higher than that threshold.
Preferably, when acquiring the multi-attribute text of the entity, if a plurality of texts exist under the same attribute, the plurality of texts are spliced to obtain the attribute text.
Compared with the prior art, the invention discloses a method that builds a semantic library through crowdsourcing, quantifies the semantic grading, scores an entity in a given domain according to whether the attribute texts of the entity contain semantic words of that domain, and finally applies a threshold to determine the classification result. To classify entities with this method, only a semantic word library and a database of the parameters need to be maintained; the attribute texts of the entity to be classified are passed into the system, and the classification result is returned.
The enterprise entities in the database were classified with this method, and recall and accuracy were computed on random samples; after parameter adjustment, a recall above 80% and an accuracy above 90% were obtained. The method can be applied to the classification of enterprise entities and expert entities in the artificial intelligence and geographic information industry chains, with good results.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of an embodiment of an entity multi-domain classification algorithm.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The main innovation of the invention is to soften classification by direct, hard keyword matching using statistical probability, to improve the efficiency of accumulating semantic words through crowdsourcing, to obtain training data by having experts verify the classification results, and to find optimized parameters by grid search, thereby making full use of the advantages of man-machine cooperation to improve the classification effect. The method makes full use of accumulated knowledge and reduces the dependence on labeled data.
The following details a specific implementation manner of the small sample entity multi-field classification method based on man-machine cooperation, which comprises the following steps:
S1: Semantic words related to the entities are acquired through crowdsourcing, wherein each semantic word returned by crowdsourcing carries three dimensions: the domain to which it belongs, the attribute to which it belongs, and its degree of semantic association with that domain.
In this implementation, the specific method of step S1 is as follows:
S11: On a crowdsourcing platform, the semantic words in the multi-attribute texts of the entities (that is, texts covering several attributes) are obtained through crowdsourcing. Contributors either extract the semantic words from each attribute text of an entity or directly provide semantic words and mark their source. Each crowdsourced result comprises the semantic word together with three dimensions: the domain to which it belongs, the attribute to which it belongs, and its degree of semantic association with that domain. A semantic word belongs to one or more attribute dimensions. For example, for the semantic word "vision algorithm" in a patent text, the crowdsourced result may label the domain as "computer vision", the attribute as "patent", and the semantic association degree as "high"; these results are then returned for subsequent verification. The crowdsourcing platform may be an open source tool or a self-developed tool for a specific scenario. When a crowdsourcing task is issued, a fixed set of domains, attribute dimensions and semantic association degrees can be preset so that the returned results meet the requirements.
S12: The crowdsourced results are verified, and the verified results are written into a database. All semantic words in the database belonging to the j-th domain form the dictionary D_j, j = 1, 2, …, M, where M is the total number of domain categories for the entities.
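The record layout and dictionary grouping described in S11 and S12 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `SemanticWord` class, the `build_dictionaries` helper, the in-memory storage (instead of a database), and all sample records other than the "vision algorithm" example are assumptions made for the sketch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticWord:
    """One verified crowdsourced record (hypothetical layout)."""
    word: str         # the semantic word itself
    domain: str       # domain the word belongs to
    attribute: str    # attribute text the word is associated with
    association: str  # association degree: "high", "medium" or "low"


def build_dictionaries(records):
    """Group verified records by domain, giving one dictionary D_j per domain."""
    dictionaries = defaultdict(list)
    for record in records:
        dictionaries[record.domain].append(record)
    return dict(dictionaries)


# Sample records; association levels and the "speech" domain are invented for illustration.
records = [
    SemanticWord("vision algorithm", "computer vision", "patent", "high"),
    SemanticWord("face recognition", "computer vision", "product", "high"),
    SemanticWord("CV algorithm engineer", "computer vision", "recruitment", "medium"),
    SemanticWord("speech synthesis", "speech", "patent", "high"),
]
D = build_dictionaries(records)
print({domain: len(words) for domain, words in D.items()})
```

A word that belongs to several attribute dimensions would simply appear as several records, one per attribute.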
S2: The parameters required for entity domain classification are initialized. The initialized parameters comprise the attribute score A_i, the weight coefficient B_ni of the semantic association degree, and a classification threshold.
In this implementation, the specific method of step S2 is as follows:
S21: The total score of each domain is initialized to 100 and divided equally among the attribute dimensions, so that the attribute score of the i-th attribute is A_i = 100/I, where I is the number of attributes.
In the invention, the specific attributes differ from one type of entity to another. For example, an enterprise entity may include attributes such as enterprise profile, enterprise name, patents, software copyrights and recruitment posts; an expert entity may include papers, patents, personal profile, research areas, works, and the like.
S22: The weight coefficients of the association degrees of the semantic words under each attribute are initialized; the higher the degree of association between a semantic word and its domain, the higher the weight coefficient. The number of levels of association between a semantic word and its domain can be modified as needed; 2 to 5 levels is appropriate. For example, in this implementation the association degree is divided into three levels: high, medium and low. When the degree of association is high, the weight coefficient B_1i = 1.0; when it is medium, the weight coefficient B_2i = 0.8; when it is low, the weight coefficient B_3i = 0.4.
S23: The classification threshold is initialized to be equal to A_i.
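The initialization in S21 to S23 can be sketched in a few lines. The function name `init_parameters` and the attribute identifiers are hypothetical; the numeric defaults (total of 100, weights 1.0 / 0.8 / 0.4, threshold equal to A_i) are the ones stated above.

```python
def init_parameters(attributes,
                    levels=("high", "medium", "low"),
                    level_weights=(1.0, 0.8, 0.4),
                    total=100.0):
    """Return initial (A, B, threshold) for one domain.

    A[attr]        -> attribute score A_i = total / I
    B[attr][level] -> weight B_ni of each association level under each attribute
    threshold      -> initial classification threshold, equal to A_i
    """
    per_attribute = total / len(attributes)
    A = {attr: per_attribute for attr in attributes}
    B = {attr: dict(zip(levels, level_weights)) for attr in attributes}
    return A, B, per_attribute


# Hypothetical attribute names for an enterprise entity.
attributes = ["name", "profile", "patent", "software_copyright", "recruitment"]
A, B, threshold = init_parameters(attributes)
print(A["patent"], B["patent"], threshold)
```

Keeping B per attribute, even though every attribute starts from the same weights, leaves room for the later grid search to tune each attribute separately.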
S3: The multi-attribute texts of the entities are acquired, each attribute text of an entity is matched against the semantic words of the different domains obtained in step S1, and the score of each entity in each domain is calculated from the matching results.
In this implementation, the specific method of step S3 is as follows:
For each domain in turn, the score of each entity in the j-th domain (j takes the values 1, 2, …, M in sequence) is calculated based on the semantic word dictionary D_j of that domain obtained in S1, as follows:
S31: First, the multi-attribute texts of the entity are acquired; the attribute texts differ with the type of entity. For example, when the entity to be classified is an enterprise entity, the attribute texts may include the enterprise profile, enterprise name, patents, software copyrights and recruitment posts; when it is an expert entity, the attribute texts may include papers, patents, personal profile, research areas and works. If several texts exist under the same attribute, they are spliced together to form the attribute text. The attribute texts may be crawled from the web or obtained in other ways.
Each attribute text is then matched against each semantic word in dictionary D_j. Regular expression matching is used to determine whether the text contains the semantic word, which gives the number of occurrences of each semantic word of D_j in the attribute text. Within one attribute text, a semantic word that appears several times is counted only once.
From the matching results, the number of words at each semantic association degree under each attribute is counted and denoted C_ni, where subscript i denotes the i-th attribute and n denotes the n-th semantic association degree, i = 1, 2, …, I; n = 1, 2, …, N. N represents the total number of association degree levels between semantic words and the domain, and is generally 2 to 5. In this implementation, since the association degree is divided into three levels (high, medium and low), N = 3.
S32: From the matching result obtained in S31, the total number of occurrences of all semantic words of dictionary D_j at each semantic association degree is counted for each attribute text of the entity.
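The matching and counting of S31 and S32 can be sketched as below. This is an illustration under stated assumptions: the dictionary is represented as `(word, attribute, level)` tuples, each word is matched only against the attribute text it was recorded for, matching is case-insensitive, and each `(word, attribute)` pair is assumed to occur once in the dictionary. The sample texts and the low-level word "image" are invented.

```python
import re


def count_matches(attribute_texts, dictionary, levels=("high", "medium", "low")):
    """Compute C[attribute][level] for one entity and one domain dictionary D_j.

    attribute_texts: {attribute: spliced text}
    dictionary:      iterable of (word, attribute, level) tuples
    A word found several times in the same attribute text is counted once.
    """
    counts = {attr: {lvl: 0 for lvl in levels} for attr in attribute_texts}
    for word, attr, level in dictionary:
        text = attribute_texts.get(attr)
        if text is None:
            continue  # the entity has no text for this attribute
        # re.escape treats the word literally; re.search only asks "present or not".
        if re.search(re.escape(word), text, flags=re.IGNORECASE):
            counts[attr][level] += 1
    return counts


texts = {
    "patent": "A vision algorithm for face recognition. The vision algorithm runs on edge devices.",
    "recruitment": "Hiring a CV algorithm engineer",
}
D_cv = [
    ("vision algorithm", "patent", "high"),
    ("face recognition", "patent", "high"),
    ("image", "patent", "low"),
    ("CV algorithm engineer", "recruitment", "medium"),
]
counts = count_matches(texts, D_cv)
print(counts)
```

Because `re.search` returns at most one match per call, the "count once per attribute text" rule falls out of the presence test without extra bookkeeping.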
S33: The score of the entity in the j-th domain is calculated from the statistics obtained in S32, using the formula:
$$\mathrm{score}_j = \sum_{i=1}^{I} A_i \sum_{n=1}^{N} B_{ni} C_{ni}$$
wherein A_i represents the attribute score of the i-th attribute, B_ni represents the weight of the n-th semantic association degree under the i-th attribute, and C_ni represents the total number of occurrences, in the i-th attribute text of the entity, of all semantic words with the n-th semantic association degree. If the value of $\sum_{n=1}^{N} B_{ni} C_{ni}$ is greater than 1, it is set equal to 1, so that each attribute contributes at most A_i and the accumulated score over all attribute dimensions stays on the same scale.
It should be noted that, when calculating the score of the entity in the j-th domain, the count C_ni must be taken over the semantic words of the dictionary D_j corresponding to the j-th domain. That is, the invention scores an entity in a domain according to whether the attribute texts of the entity contain semantic words of that domain.
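The scoring rule of S33, including the cap of each attribute's weighted count at 1, can be sketched as follows. The function name `domain_score` and the nested-dictionary layout of A, B and C are assumptions for illustration; the counts are hypothetical.

```python
def domain_score(A, B, C):
    """Score of one entity in one domain.

    A[attr]        -> attribute score A_i
    B[attr][level] -> weight B_ni
    C[attr][level] -> count C_ni for this domain's dictionary
    Each attribute's weighted count is capped at 1, so an attribute
    contributes at most A_i to the total.
    """
    score = 0.0
    for attr, a_i in A.items():
        weighted = sum(B[attr][lvl] * c for lvl, c in C.get(attr, {}).items())
        score += a_i * min(1.0, weighted)
    return score


attrs = ["name", "profile", "patent", "software_copyright", "recruitment"]
A = {a: 20.0 for a in attrs}
B = {a: {"high": 1.0, "medium": 0.8, "low": 0.4} for a in attrs}
# Hypothetical counts: two high-level words in patents, one medium-level word in recruitment.
C = {
    "patent": {"high": 2, "medium": 0, "low": 0},
    "recruitment": {"high": 0, "medium": 1, "low": 0},
}
print(domain_score(A, B, C))  # patent is capped at 20, recruitment adds 20 * 0.8
```

Attributes with no counts (here name, profile and software_copyright) contribute zero, which is one simple way to treat attributes without matches.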
S4: The scores obtained in step S3 are compared with the classification threshold to obtain a classification result, and the classification result is verified to generate training data.
In this implementation, the specific method of step S4 is as follows:
S41: The score of each entity in each domain is compared with the classification threshold; if the score of an entity in a domain is higher than the classification threshold, the entity is judged to belong to that domain.
S42: The judgment results are verified based on expert knowledge, data that fail verification are removed, and the entities correctly assigned to each domain, according to the verified results, are taken as small-sample training data.
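Steps S41 and S42 can be sketched as a threshold comparison followed by a filter on expert-verified results. The helper names, the entity identifiers and the scores are hypothetical, and modelling expert verification as a set of approved entity ids is an assumption; the patent does not specify how verification results are recorded.

```python
def classify(scores, threshold):
    """Return every domain whose score is strictly higher than the threshold."""
    return {domain for domain, score in scores.items() if score > threshold}


def build_training_data(predictions, verified_ids):
    """Keep only the predictions that passed expert verification."""
    return {eid: labels for eid, labels in predictions.items() if eid in verified_ids}


# Hypothetical per-domain scores for three entities.
scores_by_entity = {
    "entity_a": {"computer vision": 36.0, "speech": 20.0},
    "entity_b": {"computer vision": 8.0, "speech": 60.0},
    "entity_c": {"computer vision": 24.0, "speech": 28.0},
}
predictions = {eid: classify(s, threshold=20.0) for eid, s in scores_by_entity.items()}
training = build_training_data(predictions, verified_ids={"entity_a", "entity_b"})
print(predictions)
print(training)
```

Note that a score exactly equal to the threshold (entity_a in "speech") is not assigned, since the text requires the score to be higher than the threshold, and that an entity may receive several domains (entity_c) or none.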
S5: Based on the training data from S42 above, the optimal parameters are determined by grid search.
In this implementation, the specific method of step S5 is as follows:
The optimal parameters are determined by grid search based on the training data obtained in S4, wherein the parameters searched comprise the attribute score A_i, the weight coefficient B_ni of the semantic association degree, and the classification threshold. The Jaccard coefficient is used as the evaluation index for selecting the optimal parameters, and is calculated as:
$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
wherein x represents the set of domain labels predicted for an entity; y represents the set of real domain labels of the entity; |x ∩ y| represents the number of labels in the intersection of the predicted and real labels; and |x ∪ y| represents the number of labels in their union. The parameter ranges are generally set as follows: the attribute scores A_i must sum to 100 over all attributes and are adjusted in steps of 5 during grid search; the weight coefficient B_ni of the semantic association degree ranges from 0 to 1.5 and is adjusted in steps of 0.1; the classification threshold ranges from 100/I to 100 (I is the number of attributes) and is adjusted in steps of 5. Finally, the grid search selects the parameters that maximize the average Jaccard coefficient over all samples as the optimal parameters.
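The evaluation index can be sketched as a per-entity Jaccard coefficient averaged over the training samples. The function names are hypothetical, and returning 1.0 when both label sets are empty is an assumption of this sketch; the text above does not say how that case is handled.

```python
def jaccard(predicted, actual):
    """Jaccard coefficient |x ∩ y| / |x ∪ y| between two label sets."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    if not union:
        # Assumption: two empty label sets count as full agreement.
        return 1.0
    return len(predicted & actual) / len(union)


def mean_jaccard(pairs):
    """Average Jaccard coefficient over (predicted, actual) pairs, one pair per entity."""
    return sum(jaccard(p, a) for p, a in pairs) / len(pairs)


pairs = [
    ({"computer vision"}, {"computer vision"}),            # exact match
    ({"computer vision", "speech"}, {"computer vision"}),  # one extra predicted label
]
print(mean_jaccard(pairs))
```

Because the coefficient is computed on label sets, it penalizes both missing and spurious domains for a multi-domain entity, which suits a setting where an entity may belong to several domains at once.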
In practical use, the semantic word library should be expanded over multiple rounds and the training samples should be expanded through expert verification. Each time the semantic word library or the training samples are expanded, the grid search in step S5 needs to be repeated to determine new optimal parameters.
S6: Based on the determined optimal parameters, the domains of the unknown entities to be classified are predicted.
In this implementation, the specific method of step S6 is as follows:
S61: Following the method of step S3, the multi-attribute texts of the unknown entity to be classified are acquired, each attribute text of the unknown entity is matched against the semantic words of the different domains obtained in step S1, and the score of the unknown entity in each domain is calculated from the matching results; see steps S31 to S33 for details.
S62: The score of the unknown entity in each domain is compared with the classification threshold in the latest optimal parameters; if the score of the entity in a domain is higher than that threshold, the entity is judged to belong to that domain. This gives the predicted domains of the unknown entity, which may be one domain, several domains, or no domain at all.
A specific embodiment based on the above method is shown below. The steps of this embodiment are as described above and are not repeated in detail; the embodiment mainly shows the specific parameter settings and the technical effect.
Examples
Referring to fig. 1, the embodiment specifically provides a method for classifying multiple fields of entities, where the method includes steps S1 to S6, and the specific implementation process of each step is as follows:
Step 1: Obtaining semantic words by crowdsourcing
In this embodiment, semantic words belonging to different domains are obtained from texts of different attributes through a crowdsourcing platform, and the words are graded by the importance of their association with the domain. The verified semantic words are written into a database.
Step 2: Initializing the parameters in the calculation formula
In this embodiment, enterprise entities are taken as an example: the name, profile, patents, software copyrights and recruitment data of each enterprise are collected from the Internet, giving 5 attribute dimensions in total. The total score is set to 100 points, so each attribute is assigned 20 points, and the weight coefficients of each attribute dimension are initialized to 1.0 (high), 0.8 (medium) and 0.4 (low).
Step 3: Acquiring the multi-attribute text of the entity, matching it against the semantic words, and calculating the domain score according to the formula
In this embodiment, the attribute texts of the entity are spliced first: a patent is spliced from the patent name and patent abstract, a software copyright uses the software copyright name, and recruitment is spliced from the recruitment post and post details. After each attribute text is matched against the corresponding semantic words, the number of words at each of the three levels (high, medium and low) under each attribute is counted. The matching results are stored in a database to facilitate querying, statistics and result analysis.
The calculation formula in this embodiment is:
$$\mathrm{score}_j = \sum_{i=1}^{I} A_i \sum_{n=1}^{N} B_{ni} C_{ni}$$
wherein A_i represents the attribute score of the i-th attribute, B_ni represents the weight of the n-th semantic association degree under the i-th attribute, and C_ni represents the total number of occurrences, in the i-th attribute text of the entity, of all semantic words with the n-th semantic association degree. In particular, if the value of $\sum_{n=1}^{N} B_{ni} C_{ni}$ is greater than 1, it is set equal to 1, so that each attribute contributes at most A_i and the accumulated score over all attribute dimensions stays on the same scale.
Step 4: Applying the threshold to obtain a classification result, and generating training data from the expert-verified results
In this embodiment, with the initial threshold of 20 points, an entity is assigned to every domain in which its score exceeds 20 points; the domains assigned to each entity are then collected and verified by an expert. The verified data are organized into training data for the subsequent grid search for optimized parameters.
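As a worked illustration of the scoring and threshold step, suppose (hypothetically; these counts are not taken from the patent) that for one domain an enterprise matches two high-association words in its patent text and one medium-association word in its recruitment text, and nothing in its name, profile or software copyright texts. With the initial parameters of this embodiment (20 points per attribute, weights 1.0 / 0.8 / 0.4) and each attribute's weighted count capped at 1, the score works out as:

```latex
\begin{aligned}
\mathrm{score}
  &= \underbrace{20 \cdot \min(1,\; 1.0 \cdot 2)}_{\text{patent}}
   + \underbrace{20 \cdot \min(1,\; 0.8 \cdot 1)}_{\text{recruitment}}
   + \underbrace{3 \cdot 20 \cdot 0}_{\text{other attributes}} \\
  &= 20 + 16 + 0 = 36 > 20
\end{aligned}
```

Under these assumed counts the entity would be assigned to the domain. The cap matters here: without it, the patent attribute alone would contribute 40 points and a single attribute could dominate the total.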
Step 5: Using the training data in a grid search for the optimal parameters
The parameters of the grid search in this embodiment include the attribute score A_i, the weight coefficient B_ni of the semantic association degree, and the classification threshold. The Jaccard coefficient is used as the evaluation index. For the parameter ranges, each attribute score generally ranges from 0 to 100 and is adjusted in steps of 5, subject to the constraint that the total score is 100; the weight coefficient of the semantic association degree ranges from 0 to 1.5 and is adjusted in steps of 0.1; the classification threshold ranges from 100/I to 100 (I is the number of attributes) and is adjusted in steps of 5. Finally, the grid search selects the parameters that maximize the average Jaccard coefficient over all samples as the final optimization result.
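The search over these ranges can be sketched as an exhaustive loop. This is a simplified illustration, not the patent's implementation: all function names are hypothetical, the weights are shared across attributes (B_n instead of B_ni) to keep the grid small, the toy run uses a much coarser grid than the steps quoted above, and the training samples are invented. With the full steps the grid is very large (for 5 attributes there are already 10,626 attribute-score combinations before weights and thresholds are varied), so a practical run would likely need a coarser grid or some pruning.

```python
from itertools import product


def attribute_score_grid(n_attrs, step=5, total=100):
    """Yield every tuple of attribute scores on the step grid that sums to total."""
    def rec(remaining_units, slots):
        if slots == 1:
            yield (remaining_units * step,)
            return
        for units in range(remaining_units + 1):
            for rest in rec(remaining_units - units, slots - 1):
                yield (units * step,) + rest
    yield from rec(total // step, n_attrs)


def score(a, b, counts):
    """counts[i][n] = C_ni; each attribute's weighted count is capped at 1."""
    return sum(a_i * min(1.0, sum(b_n * c for b_n, c in zip(b, row)))
               for a_i, row in zip(a, counts))


def jaccard(x, y):
    union = x | y
    return 1.0 if not union else len(x & y) / len(union)  # empty vs empty: assumed 1.0


def grid_search(samples, n_attrs, weight_grid, thresholds, step=5, n_levels=3):
    """samples: list of ({domain: counts}, true_label_set). Returns (params, mean Jaccard)."""
    best_params, best_j = None, -1.0
    for a in attribute_score_grid(n_attrs, step):
        for b in product(weight_grid, repeat=n_levels):
            for t in thresholds:
                total = 0.0
                for counts_by_domain, truth in samples:
                    predicted = {d for d, c in counts_by_domain.items() if score(a, b, c) > t}
                    total += jaccard(predicted, truth)
                mean_j = total / len(samples)
                if mean_j > best_j:
                    best_params, best_j = (a, b, t), mean_j
    return best_params, best_j


# Invented toy data: 2 attributes, 3 association levels (high, medium, low).
samples = [
    ({"cv": [[1, 0, 0], [1, 0, 0]], "speech": [[0, 0, 0], [0, 0, 0]]}, {"cv"}),
    ({"cv": [[0, 0, 1], [0, 0, 0]], "speech": [[1, 0, 0], [0, 1, 0]]}, {"speech"}),
]
best_params, best_j = grid_search(samples, n_attrs=2, weight_grid=(0.4, 0.8, 1.0),
                                  thresholds=(50, 75, 100), step=50)
print(best_params, best_j)
```

When several parameter combinations tie on the mean Jaccard coefficient, this sketch keeps the first one found; the source does not say how ties should be broken.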
In this embodiment, the semantic library is expanded and the training samples are checked and expanded by experts over multiple rounds; after each round, the grid search of step 5 is repeated until the final parameters are determined, and the parameters from each adjustment are stored in the database together with their version.
Step 6: Predict unknown entities using the finalized parameters.
In this embodiment, the final parameters are read from the database by version number, all semantic vocabularies are loaded, and the attribute texts of an entity are taken as input; the output is the field or fields to which the entity belongs, which may be a single value, multiple values, or null.
It should be noted that if an attribute is missing from an entity, the entity with the missing data should be handled separately.
To ensure reliable parameter adjustment, the training data should be as accurate as possible, so entities that are well known in a field can be selected. For example, SenseTime, a well-known enterprise in the computer vision field of the artificial intelligence industry, can be used as training data for business entity classification.
The business entities in the database were classified with this classification method, and recall and accuracy were calculated on random samples. After parameter adjustment, the final recall was above 80% and the accuracy above 90%.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A small sample entity multi-field classification method based on man-machine cooperation, characterized by comprising the following steps:
S1: acquiring semantic words related to the entity by crowdsourcing, wherein a crowdsourcing return result comprises the semantic words, the field to which each semantic word belongs, the attribute to which it belongs, and its semantic association degree with that field;
S12: checking the crowdsourcing return result, and writing the checked crowdsourcing return result into a database; all semantic words belonging to the j-th field in the database form a dictionary $D_j$, j = 1, 2, …, M, where M is the total number of field classification categories of the entity;
S2: initializing the parameters required for entity field classification, wherein the initialized parameters comprise the attribute scores $A_i$, the semantic association degree weight coefficients $B_{ni}$, and a classification threshold;
S3: acquiring the multi-attribute texts of the entities, matching each attribute text of each entity with the semantic words of the different fields obtained in step S1, and calculating the score of each entity in the different fields according to the matching results;
S4: comparing the scores obtained in step S3 with the classification threshold to obtain a classification result, and generating training data after the classification result is checked;
S5: determining optimal parameters through grid search based on the training data;
S6: predicting the field of an unknown entity to be classified based on the optimal parameters;
the specific method of step S3 is as follows:
for each field in turn, based on the semantic vocabulary dictionary $D_j$ of that field obtained in S12, the score of each entity in the j-th field is calculated, j = 1, 2, …, M, by the following method:
S31: acquiring the multi-attribute text of an entity, then matching each attribute text against each semantic word in dictionary $D_j$, and outputting the number of occurrences of each semantic word of $D_j$ in the attribute text; within one attribute text, if the same semantic word appears multiple times, its number of occurrences is recorded as 1;
S32: from the matching result obtained in S31, counting, for each semantic association degree in dictionary $D_j$, the total number of occurrences of all semantic words of that association degree in each attribute text of the entity;
S33: according to the statistical result obtained in S32, calculating the score of the entity in the j-th field by the formula:
$$\mathrm{Score}_j = \sum_{i} A_i \cdot \sum_{n} B_{ni}\, C_{ni}$$
wherein: $\mathrm{Score}_j$ is the score of the entity in the j-th field, $A_i$ is the attribute score of the i-th attribute, $B_{ni}$ is the weight of the n-th semantic association degree for the i-th attribute, and $C_{ni}$ is the total number of occurrences, in the i-th attribute text of the entity, of all semantic words with the n-th semantic association degree; if the value of $\sum_{n} B_{ni} C_{ni}$ is greater than 1, $\sum_{n} B_{ni} C_{ni}$ is set equal to 1, so that the accumulated score over all attribute dimensions keeps the same maximum.
2. The method according to claim 1, wherein the specific method of step S1 is as follows:
S11: on a crowdsourcing solving platform, obtaining the semantic words in the multi-attribute texts of the entities by crowdsourcing, wherein the crowdsourcing task is either to extract semantic words from each attribute text of the entities or to provide semantic words directly and mark their source; the crowdsourcing return result comprises the semantic words, the field to which each semantic word belongs, the attribute to which it belongs, and its semantic association degree with that field; a semantic word belongs to one or more attribute dimensions.
3. The method according to claim 1, characterized in that the specific method of step S2 is as follows:
S21: initializing the total score of each field to 100 and distributing it evenly over the attribute dimensions, so that the attribute score of the i-th attribute is $A_i = 100/I$, where I is the number of attributes;
S22: initializing the weight coefficient of each association degree of the semantic words under each attribute, wherein the higher the association degree between a semantic word and its field, the higher the weight coefficient;
S23: initializing the classification threshold to be equal to $A_i$.
4. The method according to claim 3, wherein in step S2 the association degree between a semantic word and its field is divided into three levels, namely high, medium, and low; when the association degree is high, the weight coefficient $B_{1i} = 1.0$; when the association degree is medium, the weight coefficient $B_{2i} = 0.8$; when the association degree is low, the weight coefficient $B_{3i} = 0.4$.
5. The method according to claim 1, wherein the specific method of step S4 is as follows:
S41: comparing the score of each entity in each field with the classification threshold, and judging that the entity belongs to a field if its score in that field is higher than the classification threshold;
S42: verifying the judgment result based on expert knowledge, and taking the entities confirmed as correct in each field, according to the verified result data, as training data.
6. The method according to claim 1, wherein the specific method of step S5 is as follows:
determining optimal parameters through grid search based on the training data obtained in step S4, wherein the parameters of the grid search comprise the attribute scores $A_i$, the semantic association degree weight coefficients $B_{ni}$, and the classification threshold; the Jaccard coefficient is used as the evaluation index of the optimal parameters, and is calculated as:
$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
wherein x represents the set of field labels predicted for an entity; y represents the set of true field labels of the entity; $|x \cap y|$ is the number of labels in the intersection of the predicted labels and the true labels; $|x \cup y|$ is the number of labels in their union; finally, the grid search selects the parameters corresponding to the maximum average Jaccard coefficient over all samples as the optimal parameters.
7. The method according to claim 1, wherein the semantic vocabulary library is expanded over multiple rounds and the training samples are expanded through expert knowledge verification, and after each expansion the grid search of step S5 is repeated to determine new optimal parameters.
8. The method according to claim 1, characterized in that the specific method of step S6 is as follows:
S61: according to the method of step S3, acquiring the multi-attribute texts of the unknown entity to be classified, matching each attribute text of the unknown entity with the semantic words of the different fields obtained in step S1, and calculating the score of the unknown entity in the different fields according to the matching results;
S62: comparing the score of the unknown entity in each field with the classification threshold in the optimal parameters, and judging that the entity belongs to a field if its score in that field is higher than the classification threshold in the optimal parameters.
9. The method of claim 1, wherein when obtaining the multi-attribute text of the entity, if there are multiple texts under the same attribute, the multiple texts are spliced to obtain the attribute text.
CN202010088532.0A 2020-02-12 2020-02-12 Small sample entity multi-field classification method based on man-machine cooperation Active CN111274404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010088532.0A CN111274404B (en) 2020-02-12 2020-02-12 Small sample entity multi-field classification method based on man-machine cooperation

Publications (2)

Publication Number Publication Date
CN111274404A CN111274404A (en) 2020-06-12
CN111274404B true CN111274404B (en) 2023-07-14

Family

ID=70997015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010088532.0A Active CN111274404B (en) 2020-02-12 2020-02-12 Small sample entity multi-field classification method based on man-machine cooperation

Country Status (1)

Country Link
CN (1) CN111274404B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506671B (en) * 2020-03-17 2021-02-12 北京捷通华声科技股份有限公司 Method, device, equipment and storage medium for processing attribute of entity object

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10254883A (en) * 1997-03-10 1998-09-25 Mitsubishi Electric Corp Automatic document sorting method
CN103324692A (en) * 2013-06-04 2013-09-25 北京大学 Classified knowledge acquiring method and device
CN106682128A (en) * 2016-12-13 2017-05-17 成都数联铭品科技有限公司 Method for automatic establishment of multi-field dictionaries
CN106897371A (en) * 2017-01-18 2017-06-27 南京云思创智信息科技有限公司 Chinese text classification system and method
CN106934020A (en) * 2017-03-10 2017-07-07 东南大学 A kind of entity link method based on multiple domain entity index

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4189968B2 (en) * 2004-03-04 2008-12-03 株式会社エネルギア・コミュニケーションズ How to match professionals and prospects
US20060136467A1 (en) * 2004-12-17 2006-06-22 General Electric Company Domain-specific data entity mapping method and system
TW201126430A (en) * 2010-01-26 2011-08-01 Univ Nat Taiwan Science Tech Expert list recommendation methods and systems
CN105260482A (en) * 2015-11-16 2016-01-20 金陵科技学院 Network new word discovery device and method based on crowdsourcing technology
CN106339806A (en) * 2016-08-24 2017-01-18 北京创业公社征信服务有限公司 Industry holographic image constructing method and industry holographic image constructing system for enterprise information
CN109101477B (en) * 2018-06-04 2023-01-31 东南大学 Enterprise field classification and enterprise keyword screening method
CN109783818B (en) * 2019-01-17 2023-04-07 上海三零卫士信息安全有限公司 Enterprise industry classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《小规模知识库指导下的细分领域实体关系发现研究》;陈果,许天祥;《情报学报》;第38卷(第11期);全文 *

Also Published As

Publication number Publication date
CN111274404A (en) 2020-06-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant