WO2023178903A1

WO2023178903A1 - Industry professional text automatic labeling method and apparatus, terminal, and storage medium

Info

Publication number: WO2023178903A1
Application number: PCT/CN2022/109617
Authority: WO
Inventors: 沈浩; 吴优
Original assignee: 上海帜讯信息技术股份有限公司
Priority date: 2022-03-24
Filing date: 2022-08-02
Publication date: 2023-09-28
Also published as: CN114386424A; CN114386424B

Abstract

Provided are an industry professional text automatic labeling method and apparatus, a terminal, and a storage medium. A semi-supervised entity recognition algorithm is used, and an external professional knowledge text library is combined, such that the labor cost of text entity labeling is reduced to the maximum extent, and the quality and efficiency of modeling in a text entity recognition process are improved. In addition, a universal entity recognition algorithm is also used, and a specific bag of words high-dimensional vectorization similarity calculation technique is combined, such that the early stage of the text entity recognition process can be carried out automatically, and high-quality entity information extraction can be achieved in a plurality of different professional fields. In addition, on the basis of a data augmentation technique, by means of noise entity feature interpolation and noise statement feature extrapolation techniques, the problem of an insufficient generalization capability of a traditional unsupervised automatic labeling algorithm is mitigated, such that a labeling model can implement semi-supervised automatic labeling on various different industry professional texts.

Description

Industry professional text automatic annotation methods, devices, terminals and storage media

Technical field

The invention belongs to a text annotation solution, specifically an industry professional text automatic annotation method, device, terminal and storage medium based on data enhancement, and relates to the technical field of text entity recognition.

Background technique

In the field of text entity recognition, text annotation is an important technology, which can semantically annotate text and construct a mapping from words to semantic concepts. In the subsequent text processing process, even if the operator only conducts a shallow analysis of the text, he can still judge the distribution of the text in the semantic concept space based on the mapping, thereby providing a practical basis for text management, search and recommendation.

At present, artificial intelligence algorithms such as deep learning and machine learning have gradually become mainstream technologies in the field of text entity recognition and are widely used. However, when this type of technology is applied to the annotation of professional text entities in specialized fields and industries, the final effect of the algorithm often fails to meet expectations. This is because, for these highly specialized texts, factors such as the quantity, quality and generalization ability of the annotated corpus directly determine the effect of text annotation, and the automated technical means available at the current stage are not as good as manual annotation in terms of professionalism. Because of this, the annotation of professional texts in various industries is still dominated by manual annotation.

Predictably, the manual annotation method has several obvious shortcomings in the actual operation process:

First of all, the manual annotation method requires too much professionalism from the annotators. Manual annotation not only requires the annotator to have a certain degree of professional knowledge in the professional field, but also requires the annotator to perform a large amount of manual professional information extraction, judgment and annotation, thus greatly increasing the cost of text annotation in the professional field.

Secondly, the efficiency and quality of manual annotation methods are difficult to guarantee. Manual annotation requires the annotator to perform full-text information retrieval in the text, find and locate specific entity locations, and then annotate entity types. The entire operation process is inefficient and prone to problems such as entity omissions and type errors.

Finally, manual annotation methods have poor generalization capabilities. In the process of classifying professional texts in professional fields and industries, the boundaries of fields and industries are often dynamic, ambiguous and uncertain. Once manual annotation has been carried out, it is difficult to adjust the number, type, definition and so on of the annotated entities, which ultimately leads to poor generalization ability of the entire labeled sample.

To sum up, if we can build on the various existing automated text annotation solutions, make use of open knowledge and knowledge base information on the Internet, and combine it with technical means such as data enhancement and semi-supervised entity recognition to achieve automatic annotation of industry professional texts in professional fields, the annotation efficiency and quality of professional texts in the industry will be greatly improved.

Contents of the invention

In view of the above-mentioned deficiencies in the existing technology, the purpose of the present invention is to propose a method, device, terminal and storage medium for automatic annotation of industry professional texts based on data enhancement, specifically as follows.

An automatic annotation method for industry professional texts, including:

Perform a keyword search in a professional text database based on the initial keyword bag to obtain expanded text information, and perform entity recognition on the expanded text information to obtain the external think tank initial expanded word bag;

Compare one by one whether the word vectors of the expanded words in the initial expanded word bag of the external think tank and the entity words in the initial keyword bag are similar, and obtain the external think tank expanded word bag and the external think tank expanded noise word bag based on the comparison results;

Use the external think tank expanded noise word bag to perform noise entity feature interpolation on the external think tank expanded word bag to obtain interpolated entity generalization samples, and, based on the initial keyword bag, perform noise sentence feature extrapolation on the external think tank expanded word bag to obtain extrapolated sentence generalization samples;

Perform reverse text annotation and positioning on the interpolated entity generalization sample and the extrapolated sentence generalization sample to obtain professional entity annotation samples.

Preferably, performing the keyword search in a professional text database based on the initial keyword bag to obtain expanded text information, and performing entity recognition on the expanded text information to obtain the external think tank initial expanded word bag, includes:

Obtain vocabulary entities and sort out initial keyword bags, which are divided into multiple types of small bags according to part-of-speech classification requirements;

Using the initial keyword bag, perform keyword retrieval in the professional text library to obtain expanded text information. The number of expanded samples in the expanded text information corresponding to the small bag of words of each type is similar;

Use entity recognition technology to perform entity information recognition on the expanded text information to obtain the initial expanded word bag of the external think tank.

Preferably, comparing one by one whether the word vectors of the expanded words in the external think tank initial expanded word bag and the entity words in the initial keyword bag are similar, and obtaining the external think tank expanded word bag and the external think tank expanded noise word bag respectively based on the comparison results, includes:

A convolutional neural network is used to perform vector calculations in a high-dimensional vector space on the expanded words in the external think tank initial expanded word bag and the entity words in the initial keyword bag, and real entity annotation samples and noise entity annotation samples are identified based on the calculation results;

All the real entity labeled samples are summarized and sorted to obtain the external think tank expanded word bag, and all the noise entity labeled samples are summarized and sorted to obtain the external think tank expanded noise word bag.

Preferably, using the external think tank expanded noise word bag to perform noise entity feature interpolation on the external think tank expanded word bag to obtain interpolated entity generalization samples, and performing noise sentence feature extrapolation on the external think tank expanded word bag based on the initial keyword bag to obtain extrapolated sentence generalization samples, includes:

Randomly select noise entity samples from the external think tank-expanded noise word bag and insert them into the real entity annotation samples in the external think tank expanded word bag to obtain interpolated entity generalization samples;

Select noise sentences containing lexical entities in the initial keyword bag and insert them into the real entity labeled samples in the external think tank expanded word bag to obtain extrapolated sentence generalization samples.

An industry professional text automatic annotation device, including:

The initial expanded word bag generation module is configured to perform keyword retrieval in the professional text database based on the initial keyword bag to obtain expanded text information, and to perform entity recognition on the expanded text information to obtain the external think tank initial expanded word bag;

The expanded word bag and expanded noise word bag generation module is configured to compare one by one whether the word vectors of the expanded words in the external think tank initial expanded word bag and the entity words in the initial keyword bag are similar, and to obtain the external think tank expanded word bag and the external think tank expanded noise word bag respectively based on the comparison results;

The generalization sample generation module is configured to use the external think tank expanded noise word bag to perform noise entity feature interpolation on the external think tank expanded word bag to obtain interpolated entity generalization samples, and, based on the initial keyword bag, to perform noise sentence feature extrapolation on the external think tank expanded word bag to obtain extrapolated sentence generalization samples;

The professional entity labeling module is configured to perform reverse text labeling and positioning on the interpolated entity generalization sample and the extrapolated sentence generalization sample to obtain professional entity labeling samples.

Preferably, the initial expanded word bag generation module includes:

The initial keyword bag acquisition unit is configured to acquire vocabulary entities and organize to obtain initial keyword bags. The initial keyword bags are divided into multiple types of small bags according to part-of-speech classification requirements;

The expanded text information acquisition unit is configured to use the initial keyword bag to perform keyword retrieval in the professional text library to obtain expanded text information, in which the number of expanded samples corresponding to each type of small word bag is similar;

The external think tank initial expanded word bag generation unit is configured to use entity recognition technology to perform entity information recognition on the expanded text information to obtain the external think tank initial expanded word bag.

Preferably, the expanded word bag and expanded noise word bag generation module includes:

The expanded word bag generation unit is configured to use a convolutional neural network to perform vector calculation in a high-dimensional vector space on the expanded words in the external think tank initial expanded word bag and the entity words in the initial keyword bag, and to identify real entity labeled samples and noise entity labeled samples based on the calculation results;

The expanded noise word bag generation unit is configured to summarize and organize all the real entity labeled samples to obtain an external think tank expanded word bag, and summarize and organize all the noise entity labeled samples to obtain an external think tank expanded noise word bag.

Preferably, the generalization sample generation module includes:

The interpolated entity generalization sample generation unit is configured to randomly select noise entity samples from the external think tank expanded noise word bag and insert them into the real entity labeled samples in the external think tank expanded word bag to obtain interpolated entity generalization samples;

The extrapolated sentence generalization sample generation unit is configured to select noise sentences containing lexical entities in the initial keyword bag and insert them into the real entity labeled samples in the external think tank expanded word bag to obtain extrapolated sentence generalization samples.

A terminal includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the steps of the automatic annotation method for industry professional texts as described above.

A computer-readable storage medium stores a computer program. When the computer program is executed by a processor, the steps in the automatic annotation method for industry professional texts are implemented as described above.

The advantages of the present invention are mainly reflected in the following aspects:

The invention proposes an automatic labeling method for industry professional texts based on data enhancement, which utilizes a semi-supervised entity recognition algorithm and combines an external professional knowledge text library to minimize the labor cost of text entity labeling and improve the quality and efficiency of modeling in the text entity recognition process. At the same time, the method of the present invention also uses a universal entity recognition algorithm, combined with a specific bag-of-words high-dimensional vectorization similarity calculation technique, so that the early stage of the text entity recognition process can be carried out in an automated manner and high-quality entity information extraction can be achieved in multiple different professional fields. In addition, the method of the present invention is based on data enhancement technology and, through noise entity feature interpolation and noise sentence feature extrapolation, greatly alleviates the problem of insufficient generalization ability of traditional unsupervised automatic labeling algorithms, so that the labeling model can implement semi-supervised automatic annotation on professional texts from multiple different industries.

Corresponding to the above method, the data-enhancement-based industry professional text automatic annotation device, terminal and storage medium proposed by the present invention can efficiently and accurately complete the annotation of industry professional texts with a systematic and standardized processing flow. The hardware has high adaptability and compatibility, and can be effectively used in technical implementation in the field of text annotation.

The present invention also provides a reference for other solutions related to text annotation technology, which can be used as a basis for expansion, extension and in-depth research, and has very broad application prospects.

The specific implementation modes of the present invention will be further described in detail below with reference to the examples and drawings, so as to make the technical solution of the present invention easier to understand and master.

Description of the drawings

The accompanying drawings, which constitute a part of this application, are included to provide a further understanding of the application so that other features, objects and advantages of the application will become apparent. The drawings and descriptions of the schematic embodiments of the present application are used to explain the present application and do not constitute an improper limitation of the present application. In the drawings:

Figure 1 is a schematic flow chart of an automatic annotation method for industry professional texts provided by an embodiment of the present invention;

Figure 2 is a schematic diagram of the results of initial keyword bag sorting according to the method in the embodiment of the present invention;

Figure 3 is a schematic structural diagram of the named entity recognition algorithm model selected in the embodiment of the present invention;

Figure 4 is a schematic diagram of the network structure of the noise recognition model established in the embodiment of the present invention;

Figure 5 is a schematic structural diagram of an automatic annotation device for industry professional texts provided by an embodiment of the present invention.

Detailed description

The present invention discloses a method, device, terminal and storage medium for automatic annotation of industry professional texts based on data enhancement. The specific scheme is as follows.

On the one hand, the present invention relates to a method for automatic annotation of industry professional texts. The specific process is shown in Figure 1, including the following steps:

S1. Perform keyword retrieval in the professional text database based on the initial keyword bag to obtain expanded text information, perform entity recognition on the expanded text information, and obtain the initial expanded word bag of the external think tank. Furthermore, this step specifically includes the following operations.

S11. Obtain vocabulary entities and, drawing on the manual experience of professionals and following preset standards, organize them into a small-scale initial keyword bag X.

The initial keyword bag X is divided into multiple types of small word bags x_i according to part-of-speech classification requirements. In order to ensure the diversity and robustness of subsequent bag-of-word vector representations, the lexical entities in each type of small word bag cannot be repeated, and the number of entities should be no less than 50.
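The two constraints on the keyword bag can be checked mechanically before retrieval begins. The following is a minimal sketch, assuming the bag is held as a dictionary from a small-bag type name to its entity list and assuming the lower bound of 50 applies to each small bag separately; the function name `validate_keyword_bag` and the toy data are hypothetical and not part of the disclosed method.

```python
from collections import Counter

MIN_ENTITIES_PER_BAG = 50  # lower bound stated in the text (assumed to be per small bag)


def validate_keyword_bag(bag):
    """Return a list of problems found in an initial keyword bag.

    `bag` maps a small-bag type name (hypothetical, e.g. "product") to its
    list of lexical entities.
    """
    problems = []
    for bag_type, entities in bag.items():
        duplicates = [w for w, c in Counter(entities).items() if c > 1]
        if duplicates:
            problems.append(f"{bag_type}: repeated entities {sorted(duplicates)}")
        if len(set(entities)) < MIN_ENTITIES_PER_BAG:
            problems.append(
                f"{bag_type}: only {len(set(entities))} distinct entities, "
                f"need at least {MIN_ENTITIES_PER_BAG}"
            )
    return problems


# Toy data: one valid small bag and one that is too small and has a repeat.
toy_bag = {
    "product": [f"product_{i}" for i in range(50)],
    "technology": ["bluetooth", "bluetooth", "heart-rate sensing"],
}
issues = validate_keyword_bag(toy_bag)
```

Here `issues` lists two problems for the "technology" bag and none for the "product" bag.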

For example, suppose it is necessary to develop entity recognition algorithms for texts in the field of "smart wear". The traditional method is to recruit annotators who are familiar with the field of "smart wear" to manually annotate the text. This solution only needs professionals in the field of "smart wear" to sort out, based on their manual experience, a small-scale keyword bag composed of different entity types. The sorting results are shown in Figure 2.

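S12. Use the initial keyword bag X to perform keyword retrieval in the professional text library to obtain expanded text information, in which the number of expanded samples corresponding to each type of small word bag x_i is approximately equal.

Professional texts often contain multiple different kinds of professional entity information. For example, patent documents containing "wearable devices" often also contain other extended entities, such as technologies, fields, products and raw materials related to "wearable devices". Therefore, it is necessary to use the initial keyword bag X to perform keyword retrieval in the professional text library and expand the text samples.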

In order to ensure that the subsequent reinforcement learning model has enough training samples, the number of texts from the professional think tank selected in this step for keyword search expansion should be no less than 10,000. At the same time, in order to ensure that the expanded sample covers different entity types as far as possible, the number of samples matching each type of small word bag x_i in the expanded text information retrieved here should be as even as possible, and each type should preferably have more than 1,000 samples.
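The balancing requirement can be illustrated with a simple capped retrieval loop. This is a minimal sketch under stated assumptions: retrieval is approximated by plain substring matching, and the function name `retrieve_balanced`, its parameters and the toy documents are hypothetical. A real system would query the professional text library and use a per-type target above 1,000.

```python
def retrieve_balanced(documents, keyword_bag, per_type_target):
    """Keyword retrieval that keeps up to `per_type_target` texts per small bag.

    documents: iterable of text strings from the professional text library.
    keyword_bag: dict mapping a small-bag type to its keyword list.
    Returns a dict mapping each type to the texts retrieved for it.
    """
    expanded = {bag_type: [] for bag_type in keyword_bag}
    for text in documents:
        for bag_type, keywords in keyword_bag.items():
            if len(expanded[bag_type]) >= per_type_target:
                continue  # this type already has enough samples
            if any(keyword in text for keyword in keywords):
                expanded[bag_type].append(text)
    return expanded


# Toy corpus; the real thresholds in the text are far larger.
docs = [
    "A wearable device with a heart-rate sensor.",
    "The smart watch uses a flexible display.",
    "A smart watch strap made of silicone.",
    "A smart watch with a heart-rate sensor and silicone strap.",
    "Unrelated text about cooking.",
]
bag = {"product": ["smart watch", "wearable device"], "material": ["silicone"]}
result = retrieve_balanced(docs, bag, per_type_target=2)
```

Capping each type at the same target is one simple way to keep the per-type sample counts even; other sampling strategies would serve the same purpose.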

S13. Use entity recognition technology to perform entity information recognition on the expanded text information, and obtain the external think tank initial expanded word bag Y_i.

Considering that the professional text contains a large amount of entity information that is similar to the target entity, in this step general Named Entity Recognition (NER) technology is used to identify the entity information in the expanded text information. In this embodiment, the three-layer model structure BERT+BiLSTM+CRF is selected, which performs relatively stably among named entity recognition algorithms. The overall framework of the model is shown in Figure 3.

The first layer, BERT, uses the Transformer mechanism to encode the input data and uses the pre-trained model to obtain the semantic representation of each word. Unlike traditional sequence transduction models based on sequence-aligned recurrent neural networks or convolutional neural networks, the Transformer relies entirely on the self-attention mechanism (Attention) to compute the representations of its input and output. The calculation formula of the self-attention layer is as follows, where Q is the matrix of Query vectors, K is the matrix of Key vectors, V is the matrix of Value vectors, and d is the dimension of the Query vector.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
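The self-attention computation described above is the standard scaled dot-product attention, softmax(QK^T/sqrt(d))V. The following is a minimal pure-Python sketch of that computation on toy matrices; it illustrates the formula only and is not the BERT implementation used in the embodiment. The helper names and the toy values are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V are lists of row vectors; d is the Query/Key dimension.
    """
    d = len(Q[0])
    output = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        output.append(
            [sum(w * v[col] for w, v in zip(weights, V)) for col in range(len(V[0]))]
        )
    return output


# Toy example: two tokens with 2-dimensional Query/Key/Value vectors.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
context = attention(Q, K, V)
```

Each output row is a weighted average of the rows of V, with the weights determined by how well the corresponding Query matches each Key.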

The second BiLSTM layer further extracts high-level features of the data based on the BERT output results. The reason why the BiLSTM algorithm is used here is to better capture the contextual information of feature entities in professional texts.

The third layer, CRF, is a globally normalized conditional state transition probability matrix. It imposes state transition constraints on the output results of the BiLSTM layer, so that the underlying deep neural network, trained under the new loss function introduced by the CRF, learns a more reasonable nonlinear transformation space.
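The effect of state transition constraints on per-token scores can be seen in a small Viterbi decoding example. This is a sketch under stated assumptions: the BIO tag set, the hand-written transition scores and the emission scores are all hypothetical, whereas in a trained BERT+BiLSTM+CRF model the transition scores are learned and the emissions come from the BiLSTM layer.

```python
NEG_INF = float("-inf")
TAGS = ["O", "B-ENT", "I-ENT"]  # hypothetical BIO tag set

# TRANSITION[i][j]: score of moving from TAGS[i] to TAGS[j].
# "O" -> "I-ENT" is forbidden, mirroring a constraint a CRF layer can learn.
TRANSITION = [
    [0.0, 0.0, NEG_INF],  # from O
    [0.0, 0.0, 0.0],      # from B-ENT
    [0.0, 0.0, 0.0],      # from I-ENT
]


def viterbi(emissions, transition):
    """Return the highest-scoring tag index path for per-token emission scores."""
    n_tags = len(transition)
    best = list(emissions[0])
    backpointers = []
    for scores in emissions[1:]:
        step_best, step_back = [], []
        for j in range(n_tags):
            candidates = [best[i] + transition[i][j] for i in range(n_tags)]
            i_star = max(range(n_tags), key=candidates.__getitem__)
            step_best.append(candidates[i_star] + scores[j])
            step_back.append(i_star)
        best = step_best
        backpointers.append(step_back)
    path = [max(range(n_tags), key=best.__getitem__)]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    return path[::-1]


# Per-token scores for which the token-by-token choice would be "O, I-ENT".
emissions = [
    [2.0, 1.5, 0.0],  # token 1: "O" scores highest on its own
    [0.0, 0.5, 3.0],  # token 2: "I-ENT" scores highest on its own
]
decoded = [TAGS[i] for i in viterbi(emissions, TRANSITION)]
```

With the constraint in place the decoder returns "B-ENT, I-ENT" instead of the invalid "O, I-ENT" sequence that independent per-token decisions would produce.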

S2. Compare one by one whether the word vectors of the expanded words in the initial expanded word bag of the external think tank and the entity words in the initial keyword word bag are similar, and obtain the external think tank expanded word bag and the external think tank expanded noise word bag respectively based on the comparison results. Furthermore, this step specifically includes the following operations.

S21. Use the convolutional neural network to perform vector calculation in the high-dimensional vector space on the expanded words y_i in the external think tank initial expanded word bag Y_i and the entity words in the initial keyword bag X, and classify each expanded word based on the calculation results. If they are similar, the expanded word y_i is determined to be a real entity annotation sample.

If they are not similar, the expanded word y_i is determined to be a noise entity annotation sample.

Specifically, in order to ensure the noise recognition effect while taking into account the calculation speed in actual operation, this embodiment designs a single-layer CNN neural network model as the noise recognition model. The specific network structure design is shown in Figure 4.

The word vector conversion tool used in this embodiment is the Tencent word vector library, which maps each word to a 200-dimensional vector. Compared with other existing Chinese word vector data, the Tencent word vector library offers improved coverage, freshness and accuracy. After the word vector conversion is completed, the individual word vectors are assembled into an input word bag vector. The length of the word bag vector is fixed at n, that is, a word bag is an input vector consisting of n words (if it exceeds n, it is truncated; if it is shorter than n, it is padded with blank characters). The feature of each word is the concatenation of its own word vector (a d_word-dimensional vector) and its position vectors relative to the two entities (each a d_pos-dimensional vector). The convolution kernel performs a convolution operation on the word vectors and position vectors of three consecutive words at word granularity, so the convolution kernel size is 3*(d_word+2*d_pos).
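The input assembly described above (fixed length n, truncation or padding, and concatenation of one word vector with two position vectors) can be sketched as follows. The values of `D_POS` and `N`, the padding token, and the stand-in lookup functions are assumptions for illustration; only the 200-dimensional word vector and the kernel size formula come from the text.

```python
D_WORD = 200   # word-vector size stated in the text (Tencent word vectors)
D_POS = 5      # position-vector size: an assumed value, not from the source
N = 8          # fixed bag length n: an assumed value, not from the source
PAD = "<pad>"  # hypothetical padding token


def pad_or_truncate(words, n=N):
    """Cut the bag to n words, or right-pad it with PAD up to n words."""
    return (words + [PAD] * n)[:n]


def build_features(words, word_vec, pos_vec, entity_a, entity_b):
    """Concatenate each word's vector with two relative-position vectors.

    word_vec(word) -> list of D_WORD floats (hypothetical lookup function)
    pos_vec(offset) -> list of D_POS floats (hypothetical lookup function)
    entity_a, entity_b: indices of the two reference entities in the bag.
    """
    rows = []
    for index, word in enumerate(pad_or_truncate(words)):
        rows.append(
            word_vec(word) + pos_vec(index - entity_a) + pos_vec(index - entity_b)
        )
    return rows


# Stand-in lookups so the sketch runs without the real vector library.
def toy_word_vec(word):
    return [0.0] * D_WORD if word == PAD else [float(len(word))] * D_WORD


def toy_pos_vec(offset):
    return [float(offset)] * D_POS


features = build_features(["smart", "watch", "strap"], toy_word_vec, toy_pos_vec, 0, 1)
kernel_shape = (3, D_WORD + 2 * D_POS)  # 3 consecutive words per convolution window
```

Each row of `features` has d_word+2*d_pos entries, so a window of three consecutive rows matches the kernel size 3*(d_word+2*d_pos) given in the text.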

The parameters of the above classification model based on the convolutional neural network include the weights and offset terms of the convolutional layer and the fully connected layer. The model parameter set is therefore Φ={Y_c, b_c, Y_f, b_f}, and the probability that a given sentence is output as belonging to a given relationship type is shown in the following formula, where Y represents the bag of words divided by the model, b represents an entity in the bag of words, and the subscripts c and f represent real entities and noise entities respectively.

$$p(r \mid y; \Phi) = \mathrm{softmax}\left(Y_f * \tanh\left(\max\left(Y_c * y + b_c\right)\right) + b_f\right)$$
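The formula above can be traced step by step in a small forward pass: convolution over windows of three words, max pooling over positions, tanh, a fully connected layer, and softmax. This is a minimal sketch that treats Y_c, b_c as the convolutional weights and offsets and Y_f, b_f as the fully connected weights and offsets, following the first sentence of the parameter description. The toy sizes, random weights and the output order [real, noise] are assumptions.

```python
import math
import random


def conv_max_tanh(features, Y_c, b_c, window=3):
    """Convolve each filter over `window` consecutive words, max-pool, then tanh."""
    pooled = []
    for weights, bias in zip(Y_c, b_c):
        responses = []
        for start in range(len(features) - window + 1):
            flat = [x for row in features[start:start + window] for x in row]
            responses.append(sum(w * x for w, x in zip(weights, flat)) + bias)
        pooled.append(math.tanh(max(responses)))
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, Y_c, b_c, Y_f, b_f):
    """p(r | y; Phi) = softmax(Y_f * tanh(max(Y_c * y + b_c)) + b_f)."""
    hidden = conv_max_tanh(features, Y_c, b_c)
    logits = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(Y_f, b_f)]
    return softmax(logits)


# Toy sizes (assumptions): 6 words, 4 features per word, 5 filters, 2 classes.
rng = random.Random(0)
n_words, row_dim, n_filters, n_classes = 6, 4, 5, 2
features = [[rng.uniform(-1, 1) for _ in range(row_dim)] for _ in range(n_words)]
Y_c = [[rng.uniform(-0.5, 0.5) for _ in range(3 * row_dim)] for _ in range(n_filters)]
b_c = [0.0] * n_filters
Y_f = [[rng.uniform(-0.5, 0.5) for _ in range(n_filters)] for _ in range(n_classes)]
b_f = [0.0] * n_classes
probabilities = predict(features, Y_c, b_c, Y_f, b_f)  # assumed order: [real, noise]
```

With untrained random weights the two probabilities carry no meaning; the sketch only shows how the shapes and operations in the formula fit together.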

Given the training set {T} and the model parameters Φ, the loss function of the model is designed as shown in the following formula.

S22. Summarize and sort all the real entity annotation samples to obtain the external think tank expanded word bag, and summarize and sort all the noise entity annotation samples to obtain the external think tank expanded noise word bag.

S3. Use the external think tank expanded noise word bag to perform noise entity feature interpolation on the external think tank expanded word bag to obtain interpolated entity generalization samples, and, based on the initial keyword bag X, perform noise sentence feature extrapolation on the external think tank expanded word bag to obtain extrapolated sentence generalization samples.

Since an algorithm in an unsupervised environment easily captures the specific characteristics of one kind of professional text (such as patent texts), the final algorithm annotates well only on that specific kind of text, and the overall generalization ability of the model is poor. Therefore, in order to solve the problem of insufficient generalization ability of the annotation model, it is necessary to apply specific data augmentation techniques in the feature space to the external think tank expanded word bag.

Different from simple vector noise addition methods (such as adding Gaussian noise), this solution uses a specific data enhancement method for professional texts. Furthermore, this step specifically includes the following operations.

S31. Among entities of the same category, randomly select 1-2 noise entities with a high degree of similarity in the feature space from the external think tank expanded noise word bag and insert them between real entities in the external think tank expanded word bag, thereby reducing the spatial correlation between related entities and obtaining the interpolated entity generalization sample T₁.

The main consideration for the above operation is that in professional texts, the density between entities is much higher than that in ordinary news texts, which also makes it very easy for the model to learn the spatial characteristics between entities when identifying annotated entities.

Specifically, in the selection of the interpolation entity, the inserted noise entity is chosen from the external think tank expanded noise word bag as the noise entity whose vector representation is closest to the space between the two real entities. Suppose there are two real entities k and j, whose representations in n-dimensional vector space are ξ_k and ξ_j respectively.

Obtain a random n-dimensional weight matrix N and select λ_n ∈ (0,1) with ∑λ_n = 1. The n-dimensional vector representation of the interpolated noise entity τ should then satisfy the following expression; that is, the representation of the selected noise entity in the high-dimensional space should be as close as possible to the real entities k and j, so as to achieve the best generalization effect.

$$\tau \approx \sum \lambda_n \xi_k + (1 - \lambda_n)\,\xi_j$$
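One way to read the selection rule above is to form a target point between ξ_k and ξ_j and pick the noise entity whose vector lies nearest to it. The following minimal sketch does this under stated assumptions: the weights λ_n are applied per dimension, closeness is measured by Euclidean distance, and all names and toy vectors are hypothetical. The source does not fix any of these choices.

```python
import math
import random


def interpolation_target(xi_k, xi_j, lambdas):
    """Per-dimension mix: lambda_n * xi_k[n] + (1 - lambda_n) * xi_j[n]."""
    return [lam * a + (1.0 - lam) * b for lam, a, b in zip(lambdas, xi_k, xi_j)]


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_noise_entity(target, noise_vectors):
    """Return the name of the noise entity whose vector is nearest to `target`."""
    return min(noise_vectors, key=lambda name: distance(noise_vectors[name], target))


# Random weights in (0, 1), normalized so that they sum to 1.
rng = random.Random(0)
raw = [rng.random() for _ in range(3)]
lambdas = [r / sum(raw) for r in raw]

xi_k = [0.0, 0.0, 0.0]  # toy vector of real entity k
xi_j = [3.0, 3.0, 3.0]  # toy vector of real entity j
noise_vectors = {
    "noise_near": [2.0, 2.0, 2.0],
    "noise_far": [10.0, -10.0, 10.0],
}
target = interpolation_target(xi_k, xi_j, lambdas)
chosen = closest_noise_entity(target, noise_vectors)
```

In this toy setting the noise entity lying between the two real entities is chosen over the distant one, which matches the intent that the inserted entity should be close to both k and j.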

In the selection of the insertion position, the inserted noise entity should be placed as close as possible to the middle position between the real entities k and j, so that the generalization influence of the noise entity on the two real entities is as similar as possible. The mathematical expression is as follows.

$$\xi_k + \tau \approx \xi_j + \tau$$
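At the text level, the positional requirement amounts to placing the noise entity roughly midway between the two real entities. The sketch below assumes the sample is already split into tokens in which each entity is a single token; the function name `insert_between` and the example tokens are hypothetical, and only one noise entity is inserted although the method allows one or two.

```python
def insert_between(tokens, entity_k, entity_j, noise_entity):
    """Insert `noise_entity` midway between the first occurrences of two entities."""
    pos_k, pos_j = tokens.index(entity_k), tokens.index(entity_j)
    left, right = sorted((pos_k, pos_j))
    middle = left + (right - left + 1) // 2  # gap midpoint, rounded up
    return tokens[:middle] + [noise_entity] + tokens[middle:]


tokens = ["the", "smart watch", "uses", "a", "heart-rate sensor", "module"]
augmented = insert_between(tokens, "smart watch", "heart-rate sensor", "fitness band")
```

After insertion the noise entity sits the same number of tokens away from each of the two real entities whenever the gap allows it, and at most one token off otherwise.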

S32. Select noise sentences containing lexical entities in the initial keyword bag X and insert them into the real entity annotation samples in the external think tank expanded word bag to obtain the extrapolated sentence generalization sample T₂.

Since the writing style of professional texts is significantly different from other texts, the overall characteristics of professional texts will also be captured by deep learning algorithms, which is not conducive to annotation recognition in other types of professional texts or non-professional texts. Therefore, in order to improve the annotation effect of the annotation algorithm in different types of texts, it is also necessary to insert noise sentences outside the entities to reduce the impact of the overall characteristics of the professional text on the generalization ability of the algorithm.

In the selection of noise sentences, this solution selects sentences from other types of text that contain words from the manually compiled word bag X.

For example, suppose the original sentence contains the two entities "laser beam scanning display projection element" and "holographic projection lens". In order to enhance the algorithm's ability to identify these two entities through contextual relationships, a sentence from another type of text is inserted before the original sentence. The requirement is that the inserted sentence should contain one or more words from the manually compiled word bag X.

There is also an implementation tip here. If you want the model to gain stronger generalization capabilities in a specific type of text, extrapolated sentences can be obtained from that type of text. For example, if you want the model to generalize better in news-type texts, extrapolated sentences can be extracted from news texts.
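The sentence extrapolation step can be sketched as a filter over a pool of sentences from another text type followed by a prepend. The function name `extrapolate_sentence`, the noise pool and the keyword list below are hypothetical; the only requirement taken from the text is that the inserted sentence contains at least one word from the word bag X and is placed before the original sentence.

```python
import random


def extrapolate_sentence(sample, noise_pool, keyword_bag, rng):
    """Prepend one noise sentence that mentions at least one bag keyword.

    sample: original professional sentence (string).
    noise_pool: sentences from another text type, e.g. news (hypothetical data).
    keyword_bag: flat list of words from the initial keyword bag X.
    Returns the sample unchanged when no noise sentence qualifies.
    """
    candidates = [s for s in noise_pool if any(word in s for word in keyword_bag)]
    if not candidates:
        return sample
    return rng.choice(candidates) + " " + sample


sample = (
    "The laser beam scanning display projection element is aligned with "
    "the holographic projection lens."
)
news_pool = [
    "Sales of the smart watch rose sharply last quarter.",
    "The weather was mild over the weekend.",
]
keywords = ["smart watch", "wearable device"]
augmented = extrapolate_sentence(sample, news_pool, keywords, random.Random(0))
```

Drawing `news_pool` from the text type in which better generalization is wanted follows the implementation tip given above.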

S4. Perform reverse text annotation and positioning on the interpolated entity generalization sample and the extrapolated sentence generalization sample to obtain professional entity annotation samples.
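The source does not detail how reverse annotation and positioning are carried out. One plausible reading is that the known entities are located back in the generalized sample text and recorded as labeled character spans. The sketch below follows that reading; the function name `locate_entities`, the labels and the example sentence are hypothetical.

```python
def locate_entities(text, entities):
    """Return sorted (start, end, label, surface) spans for every entity mention.

    entities: dict mapping an entity string to its label (hypothetical labels).
    Longer entities are matched first so nested shorter ones do not overlap them.
    """
    spans, taken = [], [False] * len(text)
    for surface in sorted(entities, key=len, reverse=True):
        start = text.find(surface)
        while start != -1:
            end = start + len(surface)
            if not any(taken[start:end]):
                spans.append((start, end, entities[surface], surface))
                taken[start:end] = [True] * (end - start)
            start = text.find(surface, start + 1)
    return sorted(spans)


text = "The holographic projection lens sits behind the lens cover."
entities = {"holographic projection lens": "PRODUCT", "lens": "PART"}
spans = locate_entities(text, entities)
```

The resulting spans give each entity's position and type, which is the form of output a professional entity annotation sample needs for training a recognition model.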

To sum up, the automatic annotation method for industry professional texts proposed by the present invention has the following advantages compared with the manual annotation and traditional supervised text annotation methods in the prior art:

1. The cost of technology implementation is lower. The traditional annotation method for industry professional texts requires manual annotation by annotators, which not only consumes a lot of time and costs, but the annotation results may also have certain errors and omissions, and cannot meet the requirements of large-scale professional entity recognition in actual production. The method of the present invention uses a semi-supervised entity recognition algorithm and combines it with an external professional knowledge text library to minimize the labor cost of text entity annotation and improve the quality and efficiency of model construction in the actual text entity recognition process.

2. The quality of text annotation is higher. Traditional text entity annotation for professional fields often requires annotators to have professional knowledge in specific fields, which places high demands on annotation work. The method of the present invention uses a universal entity recognition algorithm (NER), combined with a specific bag-of-words high-dimensional vectorized similarity calculation technique, so that the early stage of the text entity recognition process does not require the intervention of annotators, and high-quality entity information extraction can theoretically be achieved in multiple different professional fields.

3. Stronger generalization ability. Traditional automatic text annotation technology can often annotate only a single specific type of text. When the text type or text quality changes, the annotation algorithm often cannot achieve good annotation results on the new annotation set. The method of the present invention is based on data enhancement technology and, through noise entity feature interpolation and noise sentence feature extrapolation, greatly mitigates the insufficient generalization ability of traditional unsupervised automatic labeling algorithms, so that the labeling model can implement semi-supervised automatic annotation on a variety of different industry professional texts.

On the other hand, the present invention also relates to an automatic annotation device for industry professional texts. Its architecture is shown in Figure 5, including:

The expanded word bag and expanded noise word bag generation module is configured to compare whether the word vectors of the expanded words in the initial expanded word bag of the external think tank and the entity words in the initial keyword word bag are similar, and to obtain the external think tank expanded word bag and the external think tank expanded noise word bag, respectively, based on the comparison results;
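The comparison performed by this module can be sketched as follows. This is a simplified stand-in, not the described implementation: it assumes word vectors have already been computed, uses plain cosine similarity instead of the convolutional neural network mentioned in the claims, and applies a hypothetical cut-off `threshold=0.8`. The names `cosine` and `split_bags` are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_bags(expanded_vectors, keyword_vectors, threshold=0.8):
    """Split expanded words into (expanded word bag, expanded noise word bag).

    A word is kept as a real entity if it is similar enough to at least one
    entity word of the initial keyword bag; otherwise it is treated as noise.
    """
    expanded_bag, noise_bag = [], []
    for word, vec in expanded_vectors.items():
        best = max(
            (cosine(vec, kv) for kv in keyword_vectors.values()),
            default=0.0,
        )
        (expanded_bag if best >= threshold else noise_bag).append(word)
    return expanded_bag, noise_bag
```

Words falling below the threshold are not discarded; they form the noise word bag that is later used for interpolation.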

The initial expanded word bag generation module includes:

The expanded word bag and expanded noise word bag generation module includes:

The generalization sample generation module includes:

In another aspect, the present invention also relates to a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the steps in the above-mentioned automatic annotation method for industry professional texts are implemented, for example the steps shown in Figure 1. Alternatively, when the processor executes the computer program, the functions of each module/unit in each of the above device embodiments are implemented, such as the functions of each module/unit shown in Figure 5.

In yet another aspect, the present invention also relates to a computer-readable storage medium that stores a computer program. When the computer program is executed by a processor, the steps in the automatic annotation method for industry professional texts described above are implemented.

The readable storage medium may be a computer storage medium or a communication medium. Communication media includes any medium that facilitates transfer of a computer program from one place to another. Computer storage media can be any available media that can be accessed by a general purpose or special purpose computer. For example, a readable storage medium is coupled to a processor such that the processor can read information from the readable storage medium and write information to the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and readable storage medium may be located in Application Specific Integrated Circuits (ASICs). Additionally, the ASIC can be located in the user equipment. Of course, the processor and the readable storage medium may also exist as discrete components in the communication device. Readable storage media can be read-only memory (ROM), random-access memory (RAM), CD-ROM, tapes, floppy disks, optical data storage devices, etc.

Corresponding to the content of the above method, the invention proposes an automatic annotation device, terminal and storage medium for industry professional texts, which can efficiently and accurately complete the annotation of industry professional texts through a systematic and standardized processing flow. The hardware has high adaptability and compatibility and can be effectively used in technical implementation in the field of text annotation.

It is obvious to those skilled in the art that the present invention is not limited to the details of the above-described exemplary embodiments, and that the present invention can be implemented in other specific forms without departing from the spirit and essential characteristics of the present invention. Therefore, the embodiments should be regarded as illustrative and non-restrictive from any point of view, and the scope of the present invention is defined by the appended claims rather than the above description. It is therefore intended that all changes falling within the meaning and scope of equivalents of the claims be included in the present invention.

Finally, it should be understood that although this specification is described in terms of implementations, not every implementation contains only one independent technical solution. The specification is written this way only for the sake of clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments can also be appropriately combined to form other implementations that can be understood by those skilled in the art.

Claims

An automatic annotation method for industry professional texts, which is characterized by including:

Perform a keyword search in a professional text database based on the initial keyword bag to obtain expanded text information, perform entity recognition on the expanded text information, and obtain an external think tank initial expanded word bag;

Compare one by one whether the word vectors of the expanded words in the initial expanded word bag of the external think tank and the entity words in the initial keyword bag are similar, and obtain the external think tank expanded word bag and the external think tank expanded noise word bag based on the comparison results;

Use the external think tank expanded noise word bag to perform noise entity feature interpolation on the external think tank expanded word bag to obtain interpolated entity generalization samples, and perform noise sentence feature extrapolation on the external think tank expanded word bag based on the initial keyword bag to obtain extrapolated sentence generalization samples;

Perform reverse text annotation and positioning on the interpolated entity generalization sample and the extrapolated sentence generalization sample to obtain professional entity annotation samples.
The automatic annotation method for industry professional texts according to claim 1, characterized in that the performing of a keyword search in the professional text library based on the initial keyword bag to obtain expanded text information, and the performing of entity recognition on the expanded text information to obtain the initial expanded word bag of the external think tank, includes:

Obtain vocabulary entities and sort out initial keyword bags, which are divided into multiple types of small bags according to part-of-speech classification requirements;

Using the initial keyword bag, perform keyword retrieval in a professional text library to obtain expanded text information. The number of expanded samples in the expanded text information corresponding to the small bag of words of each type is similar;

Use entity recognition technology to perform entity information recognition on the expanded text information to obtain the initial expanded word bag of the external think tank.
The automatic annotation method of industry professional texts according to claim 1, characterized in that the comparing, one by one, of whether the word vectors of the expanded words in the initial expanded word bag of the external think tank and the entity words in the initial keyword bag are similar, and the obtaining of the external think tank expanded word bag and the external think tank expanded noise word bag based on the comparison results, includes:

A convolutional neural network is used to perform vector calculations in a high-dimensional vector space on the expanded words in the initial expanded word bag of the external think tank and the entity words in the initial keyword word bag, and real entity labeled samples and noise entity labeled samples are identified based on the calculation results;

All the real entity labeled samples are summarized and sorted to obtain the external think tank expanded word bag, and all the noise entity labeled samples are summarized and sorted to obtain the external think tank expanded noise word bag.
The automatic annotation method of industry professional texts according to claim 1, characterized in that the using of the external think tank expanded noise word bag to perform noise entity feature interpolation on the external think tank expanded word bag to obtain interpolated entity generalization samples, and the performing of noise sentence feature extrapolation on the external think tank expanded word bag based on the initial keyword bag to obtain extrapolated sentence generalization samples, includes:

Randomly select noise entity samples from the external think tank-expanded noise word bag and insert them into the real entity annotation samples in the external think tank expanded word bag to obtain interpolated entity generalization samples;

Select noise sentences containing lexical entities in the initial keyword bag and insert them into the real entity labeled samples in the external think tank expanded word bag to obtain extrapolated sentence generalization samples.
An automatic annotation device for industry professional texts, which is characterized by including:

The initial expanded word bag generation module is configured to perform keyword retrieval in the professional text database based on the initial keyword bag to obtain expanded text information, perform entity recognition on the expanded text information, and obtain an external think tank initial expanded word bag;

The expanded word bag and expanded noise word bag generation module is configured to compare one by one whether the word vectors of the expanded words in the initial expanded word bag of the external think tank and the entity words in the initial keyword bag are similar, and to obtain the external think tank expanded word bag and the external think tank expanded noise word bag based on the comparison results;

A generalization sample generation module configured to use the external think tank expanded noise word bag to perform noise entity feature interpolation on the external think tank expanded word bag to obtain interpolated entity generalization samples, and to perform noise sentence feature extrapolation on the external think tank expanded word bag based on the initial keyword bag to obtain extrapolated sentence generalization samples;

The professional entity labeling module is configured to perform reverse text labeling and positioning on the interpolated entity generalization sample and the extrapolated sentence generalization sample to obtain professional entity labeling samples.
The automatic annotation device for industry professional texts according to claim 5, characterized in that the initial expanded word bag generation module includes:

The initial keyword bag acquisition unit is configured to acquire vocabulary entities and organize to obtain initial keyword bags. The initial keyword bags are divided into multiple types of small bags according to part-of-speech classification requirements;

The expanded text information acquisition unit is configured to use the initial keyword bag to perform keyword retrieval in the professional text library to obtain expanded text information, wherein the numbers of expanded samples in the expanded text information corresponding to each type of the small bags of words are similar;

The external think tank initial expanded word bag generation unit is configured to use entity recognition technology to perform entity information recognition on the expanded text information to obtain the external think tank initial expanded word bag.
The automatic annotation device for industry professional texts according to claim 5, characterized in that the expanded word bag and expanded noise word bag generation module includes:

The expanded word bag generation unit is configured to use a convolutional neural network to perform vector calculation in a high-dimensional vector space on the expanded words in the initial expanded word bag of the external think tank and the entity words in the initial keyword bag, and to identify real entity labeled samples and noise entity labeled samples based on the calculation results;

The expanded noise word bag generation unit is configured to summarize and organize all the real entity labeled samples to obtain an external think tank expanded word bag, and summarize and organize all the noise entity labeled samples to obtain an external think tank expanded noise word bag.
The automatic annotation device for industry professional texts according to claim 5, characterized in that the generalization sample generation module includes:

The interpolated entity generalization sample generation unit is configured to randomly select noise entity samples from the external think tank expanded noise word bag and insert them into the real entity labeled samples in the external think tank expanded word bag to obtain interpolated entity generalization samples;

The extrapolated sentence generalization sample generating unit is configured to select noise sentences containing the lexical entities in the initial keyword bag and insert them into the real entity labeled samples in the external think tank expanded word bag to obtain extrapolated sentence generalization samples.
A terminal including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, it implements the steps in the automatic annotation method for industry professional texts according to any one of claims 1 to 4.
A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps in the automatic annotation method for industry professional texts according to any one of claims 1 to 4 are implemented.