WO2023178903A1 - Industry professional text automatic labeling method and apparatus, terminal, and storage medium - Google Patents

Info

Publication number
WO2023178903A1
Authority
WO
WIPO (PCT)
Prior art keywords
expanded
bag
entity
external
noise
Prior art date
Application number
PCT/CN2022/109617
Other languages
French (fr)
Chinese (zh)
Inventor
沈浩
吴优
Original Assignee
上海帜讯信息技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海帜讯信息技术股份有限公司
Publication of WO2023178903A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • The invention is a text annotation solution, specifically a method, device, terminal and storage medium for automatic annotation of industry professional texts based on data augmentation, and relates to the technical field of text entity recognition.
  • In the field of text entity recognition, text annotation is an important technology: it annotates text semantically and constructs a mapping from words to semantic concepts.
  • In subsequent text processing, even if the operator performs only a shallow analysis of the text, the distribution of the text in the semantic concept space can still be judged from this mapping, which provides a practical basis for text management, search and recommendation.
  • the manual annotation method has several obvious shortcomings in the actual operation process:
  • Manual annotation not only requires the annotator to have a certain degree of professional knowledge in the professional field, but also requires the annotator to perform a large amount of manual professional information extraction, judgment and annotation, thus greatly increasing the cost of text annotation in the professional field.
  • Manual annotation requires the annotator to perform full-text information retrieval in the text, find and locate specific entity locations, and then annotate entity types. The entire operation process is inefficient and prone to problems such as entity omissions and type errors.
  • the purpose of the present invention is to propose a method, device, terminal and storage medium for automatic annotation of industry professional texts based on data enhancement, specifically as follows.
  • An automatic annotation method for industry professional texts, in which:
  • a keyword search is performed in a professional text database based on the initial keyword bag to obtain expanded text information, and entity recognition is performed on the expanded text information to obtain the external think tank initial expanded word bag;
  • the word vectors of the expanded words in the external think tank initial expanded word bag and of the entity words in the initial keyword bag are compared one by one, and the external think tank expanded word bag and the external think tank expanded noise word bag are obtained from the comparison results;
  • a convolutional neural network performs vector calculations in a high-dimensional vector space on the expanded words in the external think tank initial expanded word bag and the entity words in the initial keyword bag, and real entity labeled samples and noise entity labeled samples are identified from the calculation results;
  • the external think tank expanded noise word bag is used to perform noise entity feature interpolation on the external think tank expanded word bag to obtain interpolated entity generalization samples, and noise sentence feature extrapolation is performed on the external think tank expanded word bag based on the initial keyword bag to obtain extrapolated sentence generalization samples.
  • An automatic annotation device for industry professional texts, including:
  • an initial expanded word bag generation module, configured to perform a keyword search in the professional text database based on the initial keyword bag to obtain expanded text information, and to perform entity recognition on the expanded text information to obtain the external think tank initial expanded word bag;
  • an expanded word bag and expanded noise word bag generation module, configured to compare one by one whether the word vectors of the expanded words in the external think tank initial expanded word bag and of the entity words in the initial keyword bag are similar, and to obtain the external think tank expanded word bag and the external think tank expanded noise word bag from the comparison results;
  • a generalization sample generation module, configured to use the external think tank expanded noise word bag to perform noise entity feature interpolation on the external think tank expanded word bag to obtain interpolated entity generalization samples, and to perform noise sentence feature extrapolation on the external think tank expanded word bag based on the initial keyword bag to obtain extrapolated sentence generalization samples;
  • a professional entity labeling module, configured to perform reverse text labeling and positioning on the interpolated entity generalization samples and the extrapolated sentence generalization samples to obtain professional entity labeling samples.
  • The initial expanded word bag generation module includes:
  • an initial keyword bag acquisition unit, configured to acquire lexical entities and organize them into the initial keyword bag, which is divided into multiple types of small word bags according to part-of-speech classification requirements;
  • an expanded text information acquisition unit, configured to use the initial keyword bag to perform a keyword search in the professional text database to obtain expanded text information, such that the numbers of expanded samples corresponding to each type of small word bag are similar;
  • an external think tank initial expanded word bag generation unit, configured to use entity recognition technology to recognize entity information in the expanded text information and obtain the external think tank initial expanded word bag.
  • The expanded word bag and expanded noise word bag generation module includes:
  • an expanded word bag generation unit, configured to use a convolutional neural network to perform vector calculations in a high-dimensional vector space on the expanded words in the external think tank initial expanded word bag and the entity words in the initial keyword bag, and to identify real entity labeled samples and noise entity labeled samples from the calculation results;
  • an expanded noise word bag generation unit, configured to collect all the real entity labeled samples into the external think tank expanded word bag, and to collect all the noise entity labeled samples into the external think tank expanded noise word bag.
  • The generalization sample generation module includes:
  • an interpolated entity generalization sample generation unit, configured to randomly select noise entity samples from the external think tank expanded noise word bag and insert them into the real entity labeled samples in the external think tank expanded word bag to obtain interpolated entity generalization samples;
  • an extrapolated sentence generalization sample generation unit, configured to select noise sentences containing lexical entities from the initial keyword bag and insert them into the real entity labeled samples in the external think tank expanded word bag to obtain extrapolated sentence generalization samples.
  • A terminal includes a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, it implements the steps of the automatic annotation method for industry professional texts described above.
  • A computer-readable storage medium stores a computer program.
  • When the computer program is executed by a processor, the steps of the automatic annotation method for industry professional texts described above are implemented.
  • The invention proposes an automatic labeling method for industry professional texts based on data augmentation, which uses a semi-supervised entity recognition algorithm combined with an external professional knowledge text library to minimize the labor cost of text entity labeling and to improve the quality and efficiency of modeling in the text entity recognition process.
  • The method of the present invention also uses a universal entity recognition algorithm, combined with a specific bag-of-words high-dimensional vectorized similarity calculation technique, so that the early stage of text entity recognition can be carried out automatically and high-quality entity information extraction can be achieved in multiple different professional fields.
  • The method of the present invention is based on data augmentation: through noise entity feature interpolation and noise sentence feature extrapolation, it greatly alleviates the insufficient generalization ability of traditional unsupervised automatic labeling algorithms, so that the labeling model can implement semi-supervised automatic annotation on professional texts from multiple different industries.
  • The data-augmentation-based automatic annotation device, terminal and storage medium for industry professional texts proposed by the present invention can complete the annotation of industry professional texts efficiently and accurately through a systematic and standardized processing flow.
  • The hardware is highly adaptable and compatible, and can be used effectively in technical implementations in the field of text annotation.
  • the present invention also provides a reference for other solutions related to text annotation technology, which can be used as a basis for expansion, extension and in-depth research, and has very broad application prospects.
  • Figure 1 is a schematic flow chart of an automatic annotation method for industry professional texts provided by an embodiment of the present invention
  • Figure 2 is a schematic diagram of the results of initial keyword bag sorting according to the method in the embodiment of the present invention.
  • Figure 3 is a schematic structural diagram of the named entity recognition algorithm model selected in the embodiment of the present invention.
  • Figure 4 is a schematic diagram of the network structure of the noise recognition model established in the embodiment of the present invention.
  • Figure 5 is a schematic structural diagram of an automatic annotation device for industry professional texts provided by an embodiment of the present invention.
  • the present invention discloses a method, device, terminal and storage medium for automatic annotation of industry professional texts based on data enhancement.
  • the specific scheme is as follows.
  • the present invention relates to a method for automatic annotation of industry professional texts.
  • the specific process is shown in Figure 1, including the following steps:
  • In the initial keyword bag, to ensure the diversity and robustness of the subsequent bag-of-words vector representations, the lexical entities in each type of small word bag must not be repeated, and the number of entities should be no less than 50.
  • the number of texts from the professional think tank selected in this step for keyword search expansion should be no less than 10,000.
  • The expanded text information retrieved here should ensure that the numbers of expanded samples matching each type of small word bag x_i are as even as possible, and that each type has, as far as possible, more than 1000 samples.
  • The first layer, BERT, uses the Transformer mechanism to encode the input data and uses the pre-trained model to obtain the semantic representation of each word.
  • Unlike traditional sequence-aligned recurrent or convolutional neural network models, the Transformer is a transduction model that relies entirely on the self-attention mechanism (Attention) to compute representations of its input and output.
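  • The calculation formula of the self-attention layer is as follows, where Q is the matrix of stacked Query vectors, K is the matrix of stacked Key vectors, V is the matrix of stacked Value vectors, and d is the dimension of the Query vectors; in the standard scaled dot-product form, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$.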
  • The second layer, BiLSTM, further extracts high-level features of the data from the BERT output.
  • The BiLSTM algorithm is used here to better capture the contextual information of feature entities in professional texts.
  • The third layer, CRF, is a globally normalized matrix of conditional state-transition probabilities. It imposes state-transition constraints on the output of the BiLSTM layer, so that under the new loss function introduced by the CRF the underlying deep neural network learns a more reasonable set of nonlinear transformations.
  • this step specifically includes the following operations.
  • this embodiment designs a single-layer CNN neural network model as the noise recognition model.
  • the specific network structure design is shown in Figure 4.
  • The word vector conversion tool used in this embodiment is the Tencent word vector library, which maps each word to a 200-dimensional vector. Compared with other existing Chinese word vector data, the Tencent word vector library offers improved coverage, freshness and accuracy.
  • The single word vectors can then be assembled into an input word bag vector.
  • The length of the word bag vector is fixed at n; that is, the input vector for a word bag consists of n words (bags longer than n are truncated, and bags shorter than n are padded with blank characters).
  • The features of each word are the concatenation of its own word vector (a d_word-dimensional vector) and its position vectors relative to the two entities (each a d_pos-dimensional vector).
  • the convolution kernel performs a convolution operation on the word vectors and position vectors of three consecutive words at the word granularity.
  • the convolution kernel size is 3*(d_word+2*d_pos).
  • Y represents the bag of words divided by the model
  • b represents an entity in the bag of words
  • the subscripts c and f represent real entities and noise entities respectively.
  • the loss function of the model is designed as shown in the following formula.
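
The input construction and width-3 convolution described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the embodiment's implementation: the values `D_POS = 5` and `N = 8`, the zero-vector padding, the random stand-in vectors, and the function names `build_bag_matrix` and `conv_width3` are all assumptions introduced here. Only the 200-dimensional word vectors and the kernel size 3*(d_word+2*d_pos) come from the text.

```python
import numpy as np

D_WORD = 200  # word-vector size stated in the text
D_POS = 5     # assumed position-vector size (not given in the text)
N = 8         # assumed fixed word-bag length n

def build_bag_matrix(word_vecs, pos_vecs, n=N):
    """Concatenate word and position features, then truncate or zero-pad to n rows."""
    feats = [np.concatenate([w, p]) for w, p in zip(word_vecs, pos_vecs)][:n]
    bag = np.zeros((n, D_WORD + 2 * D_POS))  # zero rows stand in for blank padding
    if feats:
        bag[:len(feats)] = np.stack(feats)
    return bag

def conv_width3(bag, kernel):
    """Apply one kernel of shape (3, d_word + 2*d_pos) to each run of three consecutive words."""
    return np.array([np.sum(bag[i:i + 3] * kernel) for i in range(bag.shape[0] - 2)])

rng = np.random.default_rng(0)
words = [rng.normal(size=D_WORD) for _ in range(5)]         # stand-in word vectors
positions = [rng.normal(size=2 * D_POS) for _ in range(5)]  # two position vectors per word
bag = build_bag_matrix(words, positions)
kernel = rng.normal(size=(3, D_WORD + 2 * D_POS))
features = conv_width3(bag, kernel)
print(bag.shape, features.shape)  # (8, 210) (6,)
```

A real noise recognition model would use many kernels, a nonlinearity and pooling before the classification layer; those parts are omitted from this sketch.
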
  • this solution uses a specific data enhancement method for professional texts. Furthermore, this step specifically includes the following operations.
  • Specifically, when selecting the entity to interpolate, the noise entity whose vector-space representation is closest to those of the two real entities is chosen from the noise entity word bag.
  • The vector representation of the interpolated noise entity should satisfy the following expression; that is, the high-dimensional representation of the selected noise entity should be as close as possible to those of the real entities k and j, so as to achieve the best generalization effect.
  • the inserted noise entity position should be as close as possible to the middle position of the real entities k and j, so that the generalized influence of the noise entity on the two real entities is as similar as possible.
  • the mathematical expression is as follows.
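
The two interpolation criteria above (closeness in vector space, and a position near the middle of the two real entities) can be illustrated with a small sketch. The exact expressions are not reproduced in this text, so the summed Euclidean distance used in `pick_noise_entity`, the token-index midpoint used in `insert_midway`, and all names and toy values are assumptions made for illustration only.

```python
import numpy as np

def pick_noise_entity(vec_k, vec_j, noise_vecs):
    """Index of the noise vector with the smallest summed distance to k and j."""
    dists = [np.linalg.norm(v - vec_k) + np.linalg.norm(v - vec_j) for v in noise_vecs]
    return int(np.argmin(dists))

def insert_midway(tokens, pos_k, pos_j, noise_token):
    """Insert noise_token at the midpoint between the two real-entity positions."""
    lo, hi = sorted((pos_k, pos_j))
    mid = (lo + hi + 1) // 2
    return tokens[:mid] + [noise_token] + tokens[mid:]

# Toy 2-D vectors for the real entities k and j and for three noise entities.
vec_k, vec_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])
noise_vecs = [np.array([5.0, 5.0]), np.array([0.5, 0.5]), np.array([-1.0, -1.0])]
noise_words = ["far", "near", "opposite"]  # hypothetical noise entities

best = pick_noise_entity(vec_k, vec_j, noise_vecs)
tokens = ["ENTITY_K", "is", "aligned", "with", "ENTITY_J"]
augmented = insert_midway(tokens, 0, 4, noise_words[best])
print(augmented)
```

Here each real entity is treated as a single token; multi-token entities would need span boundaries rather than single indices.
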
  • This solution selects sentences from other types of text that contain words from the artificial word bag X.
  • the original sentence contains the two entities "laser beam scanning display projection element” and "holographic projection lens".
  • the requirement is that the inserted sentence should contain one or more words from the artificial word bag X.
  • Extrapolated sentences can be obtained from the type of text to which the model should generalize. For example, if you want the model to generalize better on news-type texts, extrapolated sentences can be extracted from news texts.
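
The sentence extrapolation described above can be sketched as a simple filter followed by concatenation. This is an illustrative assumption, not the embodiment's code: substring matching, appending the chosen sentence after the sample, the contents of `keyword_bag`, the example news sentences, and the function names are all hypothetical.

```python
def select_extrapolation_sentences(candidates, keyword_bag, limit=1):
    """Keep up to `limit` candidate sentences that contain at least one bag word."""
    hits = [s for s in candidates if any(word in s for word in keyword_bag)]
    return hits[:limit]

def extrapolate(sample_sentence, noise_sentences):
    """Append the selected sentences to the labeled sample to form a generalization sample."""
    return " ".join([sample_sentence, *noise_sentences])

# Assumed contents of the artificial word bag X.
keyword_bag = ["holographic projection lens",
               "laser beam scanning display projection element"]
news_sentences = [  # hypothetical sentences from another text type
    "The city council approved a new budget on Monday.",
    "A startup demonstrated a holographic projection lens at a trade fair.",
]
sample = ("The laser beam scanning display projection element is coupled "
          "to the holographic projection lens.")

chosen = select_extrapolation_sentences(news_sentences, keyword_bag)
print(extrapolate(sample, chosen))
```

Substring matching is used because it also works for unsegmented Chinese text; a tokenized match would be stricter.
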
  • the automatic annotation method for industry professional texts proposed by the present invention has the following advantages compared with the manual annotation and traditional supervised text annotation methods in the prior art:
  • the cost of technology implementation is lower.
  • the traditional annotation method for industry professional texts requires manual annotation by annotators, which not only consumes a lot of time and costs, but the annotation results may also have certain errors and omissions, and cannot meet the requirements of large-scale professional entity recognition in actual production.
  • the method of the present invention uses a semi-supervised entity recognition algorithm and combines it with an external professional knowledge text library to minimize the labor cost of text entity annotation and improve the quality and efficiency of model construction in the actual text entity recognition process.
  • the quality of text annotation is higher.
  • Traditional text entity annotation for professional fields often requires annotators to have professional knowledge in specific fields, which places high demands on annotation work.
  • The method of the present invention uses a universal entity recognition algorithm (NER), combined with a specific bag-of-words high-dimensional vectorized similarity calculation technique, so that the early stage of text entity recognition requires no intervention from annotators and can, in theory, achieve high-quality entity information extraction in multiple different professional fields.
  • the present invention also relates to an automatic annotation device for industry professional texts. Its architecture is shown in Figure 5, including:
  • an initial expanded word bag generation module, configured to perform a keyword search in the professional text database based on the initial keyword bag to obtain expanded text information, and to perform entity recognition on the expanded text information to obtain the external think tank initial expanded word bag;
  • an expanded word bag and expanded noise word bag generation module, configured to compare whether the word vectors of the expanded words in the external think tank initial expanded word bag and of the entity words in the initial keyword bag are similar, and to obtain the external think tank expanded word bag and the external think tank expanded noise word bag from the comparison results;
  • a generalization sample generation module, configured to use the external think tank expanded noise word bag to perform noise entity feature interpolation on the external think tank expanded word bag to obtain interpolated entity generalization samples, and to perform noise sentence feature extrapolation on the external think tank expanded word bag based on the initial keyword bag to obtain extrapolated sentence generalization samples;
  • a professional entity labeling module, configured to perform reverse text labeling and positioning on the interpolated entity generalization samples and the extrapolated sentence generalization samples to obtain professional entity labeling samples.
  • The initial expanded word bag generation module includes:
  • an initial keyword bag acquisition unit, configured to acquire lexical entities and organize them into the initial keyword bag, which is divided into multiple types of small word bags according to part-of-speech classification requirements;
  • an expanded text information acquisition unit, configured to use the initial keyword bag to perform a keyword search in the professional text database to obtain expanded text information, such that the numbers of expanded samples corresponding to each type of small word bag are similar;
  • an external think tank initial expanded word bag generation unit, configured to use entity recognition technology to recognize entity information in the expanded text information and obtain the external think tank initial expanded word bag.
  • The expanded word bag and expanded noise word bag generation module includes:
  • an expanded word bag generation unit, configured to use a convolutional neural network to perform vector calculations in a high-dimensional vector space on the expanded words in the external think tank initial expanded word bag and the entity words in the initial keyword bag, and to identify real entity labeled samples and noise entity labeled samples from the calculation results;
  • an expanded noise word bag generation unit, configured to collect all the real entity labeled samples into the external think tank expanded word bag, and to collect all the noise entity labeled samples into the external think tank expanded noise word bag.
  • The generalization sample generation module includes:
  • an interpolated entity generalization sample generation unit, configured to randomly select noise entity samples from the external think tank expanded noise word bag and insert them into the real entity labeled samples in the external think tank expanded word bag to obtain interpolated entity generalization samples;
  • an extrapolated sentence generalization sample generation unit, configured to select noise sentences containing lexical entities from the initial keyword bag and insert them into the real entity labeled samples in the external think tank expanded word bag to obtain extrapolated sentence generalization samples.
  • The present invention also relates to a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, the steps of the above automatic annotation method for industry professional texts are implemented, for example the steps shown in Figure 1.
  • Alternatively, when the processor executes the computer program, the functions of each module/unit in the above device embodiments are implemented, for example the functions of each module/unit shown in Figure 5.
  • The present invention also relates to a computer-readable storage medium that stores a computer program.
  • When executed by a processor, the computer program implements the steps of the automatic annotation method for industry professional texts described above.
  • the readable storage medium may be a computer storage medium or a communication medium.
  • Communication media includes any medium that facilitates transfer of a computer program from one place to another.
  • Computer storage media can be any available media that can be accessed by a general purpose or special purpose computer.
  • a readable storage medium is coupled to a processor such that the processor can read information from the readable storage medium and write information to the readable storage medium.
  • the readable storage medium may also be an integral part of the processor.
  • The processor and readable storage medium may be located in an application-specific integrated circuit (ASIC). Additionally, the ASIC can be located in the user equipment.
  • the processor and the readable storage medium may also exist as discrete components in the communication device.
  • Readable storage media can be read-only memory (ROM), random-access memory (RAM), CD-ROM, tapes, floppy disks, optical data storage devices, etc.
  • the invention proposes an automatic annotation device, terminal and storage medium for industry professional texts, which can efficiently and accurately complete the annotation of industry professional texts through a systematic and standardized processing flow.
  • the hardware has high adaptability and compatibility and can be effectively used in technical implementation in the field of text annotation.
  • the present invention also provides a reference for other solutions related to text annotation technology, which can be used as a basis for expansion, extension and in-depth research, and has very broad application prospects.

Abstract

Provided are an industry professional text automatic labeling method and apparatus, a terminal, and a storage medium. A semi-supervised entity recognition algorithm is used, and an external professional knowledge text library is combined, such that the labor cost of text entity labeling is reduced to the maximum extent, and the quality and efficiency of modeling in a text entity recognition process are improved. In addition, a universal entity recognition algorithm is also used, and a specific bag of words high-dimensional vectorization similarity calculation technique is combined, such that the early stage of the text entity recognition process can be carried out automatically, and high-quality entity information extraction can be achieved in a plurality of different professional fields. In addition, on the basis of a data augmentation technique, by means of noise entity feature interpolation and noise statement feature extrapolation techniques, the problem of an insufficient generalization capability of a traditional unsupervised automatic labeling algorithm is mitigated, such that a labeling model can implement semi-supervised automatic labeling on various different industry professional texts.

Description

Method, device, terminal and storage medium for automatic annotation of industry professional texts
Technical field
The invention is a text annotation solution, specifically a method, device, terminal and storage medium for automatic annotation of industry professional texts based on data augmentation, and relates to the technical field of text entity recognition.
Background
In the field of text entity recognition, text annotation is an important technology: it annotates text semantically and constructs a mapping from words to semantic concepts. In subsequent text processing, even if the operator performs only a shallow analysis of the text, the distribution of the text in the semantic concept space can still be judged from this mapping, which provides a practical basis for text management, search and recommendation.
At present, artificial intelligence algorithms such as deep learning and machine learning have gradually become the mainstream technology in the field of text entity recognition and are widely used there. However, when such techniques are applied to annotating professional text entities in specialized fields and industries, their final results often fall short of expectations. This is because, for these highly specialized texts, factors such as the quantity, quality and generalization ability of the annotated corpus directly determine the annotation results, and the automated techniques currently available do not match manual annotation in terms of domain expertise. For this reason, the annotation of professional texts in various industries is still done mainly by hand.
Predictably, the manual annotation method has several obvious shortcomings in practice:
First, the manual annotation method places excessive demands on the expertise of annotators. Manual annotation not only requires annotators to have a certain degree of professional knowledge in the field, but also requires them to perform a large amount of manual extraction, judgment and annotation of professional information, which greatly increases the cost of text annotation in specialized fields.
Second, the efficiency and quality of manual annotation are difficult to guarantee. Manual annotation requires annotators to search the full text, find and locate the specific entity positions, and then annotate the entity types. The whole process is inefficient and highly prone to problems such as missed entities and type errors.
Finally, manual annotation generalizes poorly. When professional texts are classified by field and industry, the boundaries between fields and industries are often dynamic, fuzzy and uncertain. Once manual annotation work has begun, it is difficult to adjust the number, types and definitions of the annotated entities, which ultimately leads to poor generalization of the labeled samples as a whole.
In summary, if open knowledge and knowledge-base information on the Internet could be used, on the basis of existing automated text annotation solutions and in combination with techniques such as data augmentation and semi-supervised entity recognition, to automatically annotate industry professional texts in specialized fields, the efficiency and quality of annotating such texts would be greatly improved.
Summary of the Invention
In view of the above defects in the prior art, the object of the present invention is to provide a data-augmentation-based method, apparatus, terminal, and storage medium for automatically annotating industry professional texts, as follows.
A method for automatically annotating industry professional texts, comprising:
performing keyword retrieval in a professional text library according to an initial keyword bag to obtain expanded text information, and performing entity recognition on the expanded text information to obtain an external think tank initial expanded bag of words;
comparing, one by one, whether the word vectors of the expanded words in the external think tank initial expanded bag of words are similar to those of the entity words in the initial keyword bag, and obtaining an external think tank expanded bag of words and an external think tank expanded noise bag of words according to the comparison results;
performing noise entity feature interpolation on the external think tank expanded bag of words using the external think tank expanded noise bag of words to obtain interpolated entity generalization samples, and performing noise sentence feature extrapolation on the external think tank expanded bag of words based on the initial keyword bag to obtain extrapolated sentence generalization samples;
performing reverse text annotation positioning on the interpolated entity generalization samples and the extrapolated sentence generalization samples to obtain professional entity annotation samples.
Preferably, performing keyword retrieval in the professional text library according to the initial keyword bag to obtain expanded text information, and performing entity recognition on the expanded text information to obtain the external think tank initial expanded bag of words, comprises:
obtaining lexical entities and organizing them into an initial keyword bag, the initial keyword bag being divided into several types of small bags of words according to part-of-speech classification requirements;
using the initial keyword bag to perform keyword retrieval in the professional text library to obtain expanded text information, in which the numbers of expanded samples corresponding to each type of small bag of words are approximately equal;
using entity recognition technology to recognize entity information in the expanded text information to obtain the external think tank initial expanded bag of words.
Preferably, comparing, one by one, whether the word vectors of the expanded words in the external think tank initial expanded bag of words are similar to those of the entity words in the initial keyword bag, and obtaining the external think tank expanded bag of words and the external think tank expanded noise bag of words according to the comparison results, comprises:
using a convolutional neural network to perform vector computation in a high-dimensional vector space between the expanded words in the external think tank initial expanded bag of words and the entity words in the initial keyword bag, and identifying real entity annotation samples and noise entity annotation samples according to the computation results;
collecting all the real entity annotation samples into the external think tank expanded bag of words, and collecting all the noise entity annotation samples into the external think tank expanded noise bag of words.
Preferably, performing noise entity feature interpolation on the external think tank expanded bag of words using the external think tank expanded noise bag of words to obtain interpolated entity generalization samples, and performing noise sentence feature extrapolation on the external think tank expanded bag of words based on the initial keyword bag to obtain extrapolated sentence generalization samples, comprises:
randomly selecting noise entity samples from the external think tank expanded noise bag of words and inserting them into the real entity annotation samples in the external think tank expanded bag of words to obtain interpolated entity generalization samples;
selecting noise sentences that contain lexical entities from the initial keyword bag and inserting them into the real entity annotation samples in the external think tank expanded bag of words to obtain extrapolated sentence generalization samples.
An apparatus for automatically annotating industry professional texts, comprising:
an initial expanded bag-of-words generation module, configured to perform keyword retrieval in a professional text library according to an initial keyword bag to obtain expanded text information, and to perform entity recognition on the expanded text information to obtain an external think tank initial expanded bag of words;
an expanded bag-of-words and expanded noise bag-of-words generation module, configured to compare, one by one, whether the word vectors of the expanded words in the external think tank initial expanded bag of words are similar to those of the entity words in the initial keyword bag, and to obtain an external think tank expanded bag of words and an external think tank expanded noise bag of words according to the comparison results;
a generalization sample generation module, configured to perform noise entity feature interpolation on the external think tank expanded bag of words using the external think tank expanded noise bag of words to obtain interpolated entity generalization samples, and to perform noise sentence feature extrapolation on the external think tank expanded bag of words based on the initial keyword bag to obtain extrapolated sentence generalization samples;
a professional entity annotation module, configured to perform reverse text annotation positioning on the interpolated entity generalization samples and the extrapolated sentence generalization samples to obtain professional entity annotation samples.
Preferably, the initial expanded bag-of-words generation module comprises:
an initial keyword bag acquisition unit, configured to obtain lexical entities and organize them into an initial keyword bag, the initial keyword bag being divided into several types of small bags of words according to part-of-speech classification requirements;
an expanded text information acquisition unit, configured to use the initial keyword bag to perform keyword retrieval in the professional text library to obtain expanded text information, in which the numbers of expanded samples corresponding to each type of small bag of words are approximately equal;
an external think tank initial expanded bag-of-words generation unit, configured to use entity recognition technology to recognize entity information in the expanded text information to obtain the external think tank initial expanded bag of words.
Preferably, the expanded bag-of-words and expanded noise bag-of-words generation module comprises:
an expanded bag-of-words generation unit, configured to use a convolutional neural network to perform vector computation in a high-dimensional vector space between the expanded words in the external think tank initial expanded bag of words and the entity words in the initial keyword bag, and to identify real entity annotation samples and noise entity annotation samples according to the computation results;
an expanded noise bag-of-words generation unit, configured to collect all the real entity annotation samples into the external think tank expanded bag of words, and to collect all the noise entity annotation samples into the external think tank expanded noise bag of words.
Preferably, the generalization sample generation module comprises:
an interpolated entity generalization sample generation unit, configured to randomly select noise entity samples from the external think tank expanded noise bag of words and insert them into the real entity annotation samples in the external think tank expanded bag of words to obtain interpolated entity generalization samples;
an extrapolated sentence generalization sample generation unit, configured to select noise sentences that contain lexical entities from the initial keyword bag and insert them into the real entity annotation samples in the external think tank expanded bag of words to obtain extrapolated sentence generalization samples.
A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for automatically annotating industry professional texts described above.
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method for automatically annotating industry professional texts described above.
The advantages of the present invention are mainly reflected in the following aspects:
The data-augmentation-based method for automatically annotating industry professional texts proposed by the present invention uses a semi-supervised entity recognition algorithm together with an external professional knowledge text library, which minimizes the labor cost of text entity annotation and improves the quality and efficiency of modeling during text entity recognition. The method also uses a general-purpose entity recognition algorithm combined with a bag-of-words high-dimensional vectorized similarity computation, so that the early-stage text entity recognition can run automatically and high-quality entity information extraction can be achieved in multiple different professional fields. In addition, building on data augmentation, the method uses noise entity feature interpolation and noise sentence feature extrapolation to substantially mitigate the insufficient generalization of traditional unsupervised automatic annotation algorithms, so that the annotation model can perform semi-supervised automatic annotation on professional texts from many different industries.
Corresponding to the above method, the data-augmentation-based apparatus, terminal, and storage medium proposed by the present invention can annotate industry professional texts efficiently and accurately through a systematic, standardized processing flow. They adapt to and are compatible with a wide range of hardware and can be applied in practice to technical implementations in the field of text annotation.
The present invention also provides a reference for other schemes related to text annotation technology, which can be extended and studied further on this basis, and it has broad application prospects.
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings, so that the technical solution is easier to understand and grasp.
Brief Description of the Drawings
The accompanying drawings, which form a part of this application, provide a further understanding of the application and make its other features, objects, and advantages more apparent. The drawings of the illustrative embodiments and their descriptions explain the application and do not unduly limit it. In the drawings:
Figure 1 is a schematic flowchart of a method for automatically annotating industry professional texts provided by an embodiment of the present invention;
Figure 2 is a schematic diagram of the result of organizing the initial keyword bag according to the method in an embodiment of the present invention;
Figure 3 is a schematic structural diagram of the named entity recognition model selected in an embodiment of the present invention;
Figure 4 is a schematic diagram of the network structure of the noise recognition model established in an embodiment of the present invention;
Figure 5 is a schematic structural diagram of an apparatus for automatically annotating industry professional texts provided by an embodiment of the present invention.
Detailed Description
The present invention discloses a data-augmentation-based method, apparatus, terminal, and storage medium for automatically annotating industry professional texts. The specific scheme is as follows.
In one aspect, the present invention relates to a method for automatically annotating industry professional texts. The flow is shown in Figure 1 and includes the following steps:
S1. Perform keyword retrieval in a professional text library according to an initial keyword bag to obtain expanded text information, and perform entity recognition on the expanded text information to obtain an external think tank initial expanded bag of words. This step specifically includes the following operations.
S11. Obtain lexical entities and, according to preset criteria derived from the practical experience of domain professionals, organize them into a small initial keyword bag X.
According to the specific part-of-speech classification requirements, the initial keyword bag X can be divided into several types of small bags of words $x_i$, for example a product entity bag, a technology entity bag, and a field entity bag. To ensure the diversity and robustness of the subsequent bag-of-words vector representations, the lexical entities within each type of small bag must not repeat, and each small bag should contain no fewer than 50 entities.
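The two constraints on the small bags (no repeated entities, at least 50 entities per type) are easy to check mechanically. The following is a minimal sketch, assuming the keyword bag is held as a dictionary from bag type to a list of entity strings; the function name `validate_keyword_bag` and the toy entries are hypothetical and not part of the described method.

```python
from collections import Counter

MIN_ENTITIES = 50  # lower bound per small bag stated in the text


def validate_keyword_bag(bag, min_entities=MIN_ENTITIES):
    """Return a list of problems found; an empty list means the bag passes."""
    problems = []
    for bag_type, entities in bag.items():
        # Entities inside one small bag must not repeat.
        duplicates = sorted(e for e, c in Counter(entities).items() if c > 1)
        if duplicates:
            problems.append(f"{bag_type}: duplicate entities {duplicates}")
        # Each small bag needs enough distinct entities.
        unique_count = len(set(entities))
        if unique_count < min_entities:
            problems.append(
                f"{bag_type}: {unique_count} unique entities, need {min_entities}"
            )
    return problems


# Hypothetical toy bag; a real bag would hold 50 or more entities per type.
toy_bag = {
    "product": ["smart watch", "smart band", "smart watch"],
    "technology": ["holographic projection", "laser beam scanning"],
}
issues = validate_keyword_bag(toy_bag, min_entities=2)
```

Running the check before retrieval keeps a malformed bag from skewing the later vector comparisons.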
For example, suppose an entity recognition algorithm must be developed for texts in the "smart wearables" field. The traditional approach is to recruit annotators familiar with that field to annotate the texts manually. This scheme only requires organizing, from the practical experience of professionals in the field, a small keyword bag composed of different entity types; the result is shown in Figure 2.
S12. Using the initial keyword bag X, perform keyword retrieval in a professional text library to obtain expanded text information, in which the numbers of expanded samples corresponding to each type of small bag $x_i$ are approximately equal.
Professional texts often contain information about several different professional entities. For example, a patent document that mentions "wearable device" often also contains other extended entities such as the related technologies, fields, products, and raw materials. The initial keyword bag X is therefore used to run keyword searches in professional think tanks such as search engines and patent retrieval databases, which yields a large amount of expanded text information containing lexical entities.
To ensure that the subsequent reinforcement learning model has enough training samples, the professional think tank selected in this step for keyword retrieval expansion should contain no fewer than 10,000 texts. In addition, so that the expanded samples cover the different entity types as far as possible, the retrieved expanded text information should keep the number of samples matching each type of small bag $x_i$ as even as possible, with preferably more than 1,000 samples per type.
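The size and balance targets above can be expressed as a simple check on the retrieval results. This is a sketch under assumptions: the text gives the thresholds of 10,000 texts and 1,000 samples per type, but it does not define "as even as possible", so the `max_ratio` parameter and the function name are hypothetical.

```python
def check_retrieval_balance(library_size, per_type_counts,
                            min_library=10_000, min_per_type=1_000,
                            max_ratio=1.5):
    """Check the retrieval targets; max_ratio is an assumed evenness bound."""
    if not per_type_counts:
        raise ValueError("per_type_counts must not be empty")
    smallest = min(per_type_counts.values())
    largest = max(per_type_counts.values())
    return {
        # The think tank used for retrieval should hold at least 10,000 texts.
        "library_large_enough": library_size >= min_library,
        # Every small-bag type should match more than 1,000 samples.
        "enough_per_type": smallest > min_per_type,
        # Assumed reading of "as even as possible": bounded largest/smallest ratio.
        "roughly_even": largest <= max_ratio * smallest,
    }


report = check_retrieval_balance(
    library_size=25_000,
    per_type_counts={"product": 1500, "technology": 1350, "field": 1200},
)
```

If any flag is false, more retrieval for the under-represented type is the likely remedy.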
S13. Use entity recognition technology to recognize entity information in the expanded text information, obtaining the external think tank initial expanded bag of words $Y_i$.
Because professional texts contain a large amount of entity information similar to the target entities, this step uses general-purpose named entity recognition (NER) to recognize the entity information in the expanded text information. This embodiment selects the three-layer BERT+BiLSTM+CRF model structure, which performs relatively stably among named entity recognition algorithms; the overall framework of the model is shown in Figure 3.
The first layer, BERT, encodes the input data with the Transformer mechanism and uses a pre-trained model to obtain semantic representations of characters. Unlike transduction models based on sequence-aligned recurrent or convolutional neural networks, the Transformer relies entirely on the self-attention mechanism to compute representations of its input and output. The self-attention layer is computed as follows, where Q is the matrix of Query vectors, K is the matrix of Key vectors, V is the matrix of Value vectors, and d is the dimension of the Query vectors.
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
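To make the self-attention computation concrete, the following sketch evaluates softmax(QK^T / sqrt(d)) V on nested Python lists. It illustrates the standard scaled dot-product formula only; it is not BERT itself, and the toy matrices are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d = len(Q[0])  # dimension of the Query vectors
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return output


# Two identical keys receive equal weight, so the output is the mean of the values.
result = attention(Q=[[1.0, 0.0]], K=[[1.0, 0.0], [1.0, 0.0]], V=[[1.0, 2.0], [3.0, 4.0]])
```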
The second layer, BiLSTM, extracts higher-level features from the BERT output. BiLSTM is used here to better capture the contextual information of feature entities in professional texts.
The third layer, CRF, is a globally normalized matrix of conditional state-transition probabilities. It imposes state-transition constraints on the output of the BiLSTM layer, so that the underlying deep neural network, constrained by the CRF features and trained with the new loss function, learns a more reasonable nonlinear transformation space.
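The effect of the state-transition constraints can be seen in a small Viterbi decoding sketch. It assumes a BIO tagging scheme, per-token emission scores standing in for BiLSTM outputs, and a transition table in which a forbidden move has score negative infinity; the tag set and all scores are hypothetical, and a trained CRF would learn the transition scores rather than have them set by hand.

```python
NEG_INF = float("-inf")


def viterbi(emissions, transitions, tags):
    """Return the best tag path.

    emissions: one dict per token mapping tag -> score.
    transitions: dict mapping (previous_tag, current_tag) -> score; missing pairs score 0.
    """
    best = {t: emissions[0][t] for t in tags}
    backpointers = []
    for emission in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in tags:
            # Best previous tag for reaching `cur`, including the transition score.
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = best[prev] + transitions.get((prev, cur), 0.0) + emission[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    # Walk back from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B", "I"]
emissions = [{"O": 2.0, "B": 1.0, "I": 0.0}, {"O": 0.0, "B": 0.5, "I": 2.0}]
unconstrained = viterbi(emissions, {}, tags)
constrained = viterbi(emissions, {("O", "I"): NEG_INF}, tags)
```

Without the constraint the decoder picks the invalid sequence O, I; forbidding the O to I transition makes it choose B, I instead.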
S2. Compare, one by one, whether the word vectors of the expanded words in the external think tank initial expanded bag of words are similar to those of the entity words in the initial keyword bag, and obtain the external think tank expanded bag of words and the external think tank expanded noise bag of words according to the comparison results. This step specifically includes the following operations.
S21. Use a convolutional neural network to perform vector computation in a high-dimensional vector space between each expanded word $y_i$ in the external think tank initial expanded bag of words $Y_i$ and the entity words in the initial keyword bag, and classify each word according to the result. If the vectors are similar, the expanded word $y_i$ is deemed a real entity annotation sample (symbol shown in Figure PCTCN2022109617-appb-000002); if they are not similar, it is deemed a noise entity annotation sample (symbol shown in Figure PCTCN2022109617-appb-000003).
Specifically, to ensure the quality of noise recognition while keeping computation fast in actual operation, this embodiment designs a single-layer CNN model as the noise recognition model; the network structure is shown in Figure 4.
The word vectors in this embodiment come from the Tencent word vector library, which maps each word to a 200-dimensional vector. Compared with other existing Chinese word vector data, it improves the coverage, freshness, and accuracy of the word vectors overall. After conversion, the individual word vectors are assembled into an input bag-of-words vector of fixed length n, that is, a bag of n words (truncated if longer than n, padded with blank characters if shorter). Each word's feature is the concatenation of its own word vector (a d_word-dimensional vector) and its position vectors relative to each of the two entities (each a d_pos-dimensional vector). The convolution kernel operates at word granularity on the word vectors and position vectors of 3 consecutive words, so the kernel size is 3*(d_word+2*d_pos).
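The input assembly described above (pad or truncate to n words, then concatenate each word vector with two position vectors) can be sketched as follows. The tiny dimensions and the two toy embedding functions are hypothetical stand-ins; a real pipeline would look up 200-dimensional vectors from the word vector library and learn the position embeddings.

```python
D_WORD, D_POS = 4, 2  # toy sizes; the text uses 200-dimensional word vectors
PAD = ""              # blank padding token


def toy_word_vector(token):
    """Hypothetical lookup: zeros for padding, otherwise a constant row."""
    return [0.0] * D_WORD if token == PAD else [float(len(token))] * D_WORD


def toy_position_vector(offset):
    """Hypothetical position embedding built from the signed and absolute offset."""
    return [float(offset), float(abs(offset))]


def build_input(tokens, entity1_idx, entity2_idx, n):
    """Return an n x (D_WORD + 2*D_POS) matrix as nested lists."""
    padded = (list(tokens) + [PAD] * n)[:n]  # pad with blanks, then truncate to n
    return [
        toy_word_vector(tok)
        + toy_position_vector(i - entity1_idx)
        + toy_position_vector(i - entity2_idx)
        for i, tok in enumerate(padded)
    ]


matrix = build_input(["laser", "projection", "uses", "lens"],
                     entity1_idx=1, entity2_idx=3, n=6)
kernel_weights = 3 * (D_WORD + 2 * D_POS)  # one kernel spans 3 consecutive rows
num_windows = len(matrix) - 3 + 1          # positions the kernel can slide over
```

With these toy sizes each row has 8 features, one kernel has 24 weights, and a 6-word bag yields 4 convolution windows.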
The parameters of this CNN-based classification model comprise the weights and bias terms of the convolutional layer and the fully connected layer, so the parameter set is $\Phi=\{Y_c, b_c, Y_f, b_f\}$. The probability that an output sentence belongs to a given relation type is given by the following formula, where Y denotes a bag of words partitioned by the model, b denotes an entity in the bag, and the subscripts c and f denote real entities and noise entities, respectively.
$$p(r \mid y;\Phi)=\mathrm{softmax}\left(Y_f * \tanh\left(\max\left(Y_c * y + b_c\right)\right)+b_f\right)$$
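The formula can be read as a forward pass: convolution, max pooling over positions, tanh, a fully connected layer, and softmax. The sketch below follows that reading and assumes that Y_c and b_c are the convolution weights and bias and that Y_f and b_f are the fully connected weights and bias, in line with the first sentence of the paragraph above. All names and toy values are hypothetical.

```python
import math


def forward(x, conv_filters, conv_bias, fc_weights, fc_bias, window=3):
    """Compute softmax(fc * tanh(max(conv * x + b_c)) + b_f) on nested lists.

    x: n rows of per-word features; each filter has window * len(x[0]) weights.
    """
    pooled = []
    for filt, bias in zip(conv_filters, conv_bias):
        responses = []
        for start in range(len(x) - window + 1):
            # Flatten `window` consecutive rows and take the dot product with the filter.
            flat = [v for row in x[start:start + window] for v in row]
            responses.append(sum(w * v for w, v in zip(filt, flat)) + bias)
        # Max pooling over positions, followed by tanh.
        pooled.append(math.tanh(max(responses)))
    logits = [
        sum(w * h for w, h in zip(row, pooled)) + b
        for row, b in zip(fc_weights, fc_bias)
    ]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
probs = forward(x, conv_filters=[[1.0] * 6], conv_bias=[0.0],
                fc_weights=[[1.0], [-1.0]], fc_bias=[0.0, 0.0])
```

The two outputs can be interpreted as the probabilities of the "real entity" and "noise entity" classes.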
Given a training set {T} and model parameters Φ, the loss function of the model is designed as shown in the following formula.
Figure PCTCN2022109617-appb-000004
S22. Collect all the real entity annotation samples (Figure PCTCN2022109617-appb-000005) into the external think tank expanded bag of words (Figure PCTCN2022109617-appb-000006), and collect all the noise entity annotation samples (Figure PCTCN2022109617-appb-000007) into the external think tank expanded noise bag of words (Figure PCTCN2022109617-appb-000008).
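The split into an expanded bag and a noise bag can be illustrated with a deliberately simpler stand-in for the CNN classifier: cosine similarity against the keyword vectors with a fixed threshold. This is a substitute technique chosen for brevity, not the noise recognition model of the embodiment; the threshold value and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_bag(candidates, keywords, threshold=0.8):
    """Split candidate words into (real, noise) by best similarity to any keyword.

    candidates, keywords: dicts mapping a word to its vector.
    """
    real, noise = [], []
    for word, vec in candidates.items():
        best = max(cosine(vec, kv) for kv in keywords.values())
        (real if best >= threshold else noise).append(word)
    return real, noise


keywords = {"smart watch": [1.0, 0.0]}
candidates = {"fitness band": [0.9, 0.1], "annual report": [0.0, 1.0]}
expanded_bag, noise_bag = split_bag(candidates, keywords)
```

Note that the noise bag is kept rather than discarded, because the later interpolation step draws from it.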
S3. Perform noise entity feature interpolation on the external think tank expanded bag of words using the external think tank expanded noise bag of words to obtain interpolated entity generalization samples, and perform noise sentence feature extrapolation on the external think tank expanded bag of words (Figure PCTCN2022109617-appb-000009) based on the initial keyword bag X to obtain extrapolated sentence generalization samples.
In an unsupervised setting, the algorithm very easily latches onto features specific to one kind of professional text (such as patent text), so the final algorithm annotates well only on that kind of text and the model as a whole generalizes poorly. To address this insufficient generalization of the annotation model, specific data augmentation techniques are applied in the feature space of the external think tank expanded bag of words (Figure PCTCN2022109617-appb-000010).
Unlike simple vector noise injection (such as adding Gaussian noise), this scheme uses a data augmentation method tailored to professional texts. This step specifically includes the following operations.
S31. Among entities of the same category, randomly select from the noise entity bag (Figure PCTCN2022109617-appb-000011) one or two noise entities with high feature-space similarity and insert them between the real entities of the external think tank expanded bag of words (Figure PCTCN2022109617-appb-000012). This reduces the spatial correlation between associated entities and yields the interpolated entity generalization samples $T_1$.
This operation is motivated by the fact that entities are far denser in professional texts than in ordinary news texts, which makes it very easy for the model to learn the spatial features between entities when recognizing and annotating them.
Concretely, when choosing which entity to interpolate, the inserted noise entity is the one in the noise entity bag (Figure PCTCN2022109617-appb-000013) whose vector representation is closest to those of the two real entities. Suppose there are two real entities k and j whose representations in the n-dimensional vector space are $\xi_k$ and $\xi_j$, respectively.
Obtain a random n-dimensional weight matrix N and select $\lambda_n \in (0,1)$ with $\sum \lambda_n = 1$. The n-dimensional vector representation τ of the interpolated noise entity should then satisfy the following expression; that is, the selected noise entity should lie as close as possible to the real entities k and j in the high-dimensional representation space, so as to achieve the best generalization effect.
$$\tau \approx \sum_{n} \lambda_n \xi_k + (1-\lambda_n)\,\xi_j$$
When choosing where to insert, the noise entity should be placed as close as possible to the midpoint between the real entities k and j, so that its generalizing influence on the two real entities is as similar as possible. The mathematical expression is as follows.
$$\xi_k + \tau \approx \xi_j + \tau$$
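The selection and placement rules can be sketched as two small functions. For simplicity the sketch uses a single scalar weight `lam` in place of the per-dimension weights, measures closeness with squared Euclidean distance, and works on token lists; these are all assumptions, as are the function names and toy data.

```python
def pick_noise_entity(xi_k, xi_j, noise_vectors, lam=0.5):
    """Return the noise entity whose vector is closest to lam*xi_k + (1-lam)*xi_j."""
    target = [lam * a + (1 - lam) * b for a, b in zip(xi_k, xi_j)]

    def squared_distance(vec):
        return sum((t - v) ** 2 for t, v in zip(target, vec))

    return min(noise_vectors, key=lambda name: squared_distance(noise_vectors[name]))


def insert_midway(tokens, pos_k, pos_j, noise_entity):
    """Insert the noise entity as close as possible to the midpoint of the two entities."""
    lo, hi = sorted((pos_k, pos_j))
    mid = (lo + hi + 1) // 2
    return tokens[:mid] + [noise_entity] + tokens[mid:]


noise_vectors = {"near": [1.0, 1.1], "far": [5.0, 5.0]}
chosen = pick_noise_entity([0.0, 0.0], [2.0, 2.0], noise_vectors)
augmented = insert_midway(["A", "uses", "the", "B"], pos_k=0, pos_j=3, noise_entity=chosen)
```

In the toy sentence the chosen entity lands two tokens away from each real entity, so neither side is favored.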
S32. Select noise sentences that contain lexical entities from the initial keyword bag X and insert them into the real entity annotation samples in the external think tank expanded bag of words (Figure PCTCN2022109617-appb-000014), obtaining the extrapolated sentence generalization samples $T_2$.
Because professional texts are written very differently from other texts, a deep learning algorithm will also capture their overall stylistic features, which hinders annotation and recognition in other types of professional text or in non-professional text. To improve the annotation algorithm's performance across different types of text, noise sentences must therefore also be inserted outside the entities, reducing the influence of the professional text's overall features on the algorithm's generalization.
For the noise sentences, this scheme selects sentences from other types of text that contain words from the manually compiled bag X. This adds noise sentences with different features while keeping a degree of feature consistency between the noise sentence and the original sentence.
For example, suppose the original sentence contains the two entities "laser beam scanning display projection element" and "holographic projection lens". To strengthen the algorithm's ability to generalize when recognizing these two entities from context, a sentence from another type of text is inserted before the original sentence, with the requirement that the inserted sentence contain one or more words from the manually compiled bag X.
An implementation tip: if the model should generalize better on a particular type of text, the extrapolated sentences can be drawn from that type of text. For example, if stronger generalization on news text is desired, the extrapolated sentences can be extracted from news text.
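Sentence extrapolation as described comes down to filtering candidate sentences by keyword and prepending one to the original sample. The following sketch assumes plain substring matching and a single space as the joiner; the function names and example sentences are hypothetical.

```python
def eligible_noise_sentences(candidates, keyword_bag):
    """Keep only sentences that contain at least one word from the keyword bag."""
    return [s for s in candidates if any(word in s for word in keyword_bag)]


def extrapolate(sample_sentence, noise_sentence):
    """Insert the noise sentence before the original sentence."""
    return f"{noise_sentence} {sample_sentence}"


keyword_bag = {"wearable device", "holographic projection"}
news_sentences = [
    "Shipments of every wearable device category grew last quarter.",
    "The city council approved a new bus route.",
]
sample = ("The laser beam scanning display projection element faces "
          "the holographic projection lens.")
augmented_samples = [
    extrapolate(sample, s)
    for s in eligible_noise_sentences(news_sentences, keyword_bag)
]
```

Drawing `news_sentences` from the target text type, as the tip above suggests, steers the augmentation toward that type.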
S4. Perform reverse text annotation positioning on the interpolated entity generalization samples and the extrapolated sentence generalization samples to obtain professional entity annotation samples.
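This section does not spell out how reverse text annotation positioning is carried out. One plausible reading, sketched below, is to locate each known entity string in the generalized sample and emit character-level BIO tags. The helper names `locate_entities` and `to_bio`, the longest-match-first rule, and the BIO scheme are assumptions, not details confirmed by the source.

```python
def locate_entities(text, entity_labels):
    """Return sorted, non-overlapping (start, end, label) spans.

    entity_labels maps an entity string to its label. Longer entities are
    matched first so a short entity nested in a longer one is not re-labeled.
    """
    taken = [False] * len(text)
    spans = []
    for entity in sorted(entity_labels, key=len, reverse=True):
        if not entity:
            continue  # an empty string would match everywhere
        start = text.find(entity)
        while start != -1:
            end = start + len(entity)
            if not any(taken[start:end]):
                spans.append((start, end, entity_labels[entity]))
                taken[start:end] = [True] * (end - start)
            start = text.find(entity, start + 1)
    return sorted(spans)

def to_bio(text, spans):
    """Convert character spans to one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags
```

Character-level tags are used because they apply equally to Chinese text, where there is no whitespace tokenization; a token-level variant would work the same way.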
In summary, compared with manual annotation and traditional supervised text annotation methods in the prior art, the automatic annotation method for industry professional texts proposed by the present invention has the following advantages:
1. Lower implementation cost. Traditional annotation of industry professional texts relies on manual labeling by annotators, which consumes a great deal of time, may contain errors and omissions, and cannot meet the demand for large-scale professional entity recognition in actual production. The method of the present invention uses a semi-supervised entity recognition algorithm combined with an external professional knowledge text library, minimizing the labor cost of text entity annotation and improving the quality and efficiency of model construction in practical text entity recognition.
2. Higher annotation quality. Traditional text entity annotation in professional fields usually requires annotators to have domain-specific expertise, which places high demands on the annotation work. The method of the present invention uses a general-purpose named entity recognition (NER) algorithm combined with a specific bag-of-words high-dimensional vectorized similarity calculation technique, so that early-stage text entity recognition requires no intervention by annotators and can, in theory, achieve high-quality entity information extraction across multiple professional fields.
3. Stronger generalization ability. Traditional automatic text annotation techniques can usually annotate only a single specific type of text; when the text type or text quality changes, the annotation algorithm often fails to achieve good results on the new annotation set. Based on data augmentation, the method of the present invention uses noise entity feature interpolation and noise sentence feature extrapolation to substantially mitigate the insufficient generalization of traditional unsupervised automatic annotation algorithms, so that the annotation model can perform semi-supervised automatic annotation on a variety of industry professional texts.
In another aspect, the present invention also relates to an automatic annotation apparatus for industry professional texts, whose architecture is shown in Figure 5 and which includes:
an initial expanded bag-of-words generation module, configured to perform keyword retrieval in a professional text library according to an initial keyword bag of words to obtain expanded text information, and to perform entity recognition on the expanded text information to obtain an external think tank initial expanded bag of words;
an expanded bag-of-words and expanded noise bag-of-words generation module, configured to compare whether the word vectors of the expanded words in the external think tank initial expanded bag of words are similar to those of the entity words in the initial keyword bag of words, and to obtain, according to the comparison results, an external think tank expanded bag of words and an external think tank expanded noise bag of words respectively;
a generalization sample generation module, configured to use the external think tank expanded noise bag of words to perform noise entity feature interpolation on the external think tank expanded bag of words to obtain interpolated entity generalization samples, and to perform noise sentence feature extrapolation on the external think tank expanded bag of words based on the initial keyword bag of words to obtain extrapolated sentence generalization samples;
a professional entity annotation module, configured to perform reverse text annotation positioning on the interpolated entity generalization samples and the extrapolated sentence generalization samples to obtain professional entity annotation samples.
The initial expanded bag-of-words generation module includes:
an initial keyword bag-of-words acquisition unit, configured to acquire vocabulary entities and organize them into the initial keyword bag of words, the initial keyword bag of words being divided into multiple types of small bags of words according to part-of-speech classification requirements;
an expanded text information acquisition unit, configured to use the initial keyword bag of words to perform keyword retrieval in the professional text library to obtain expanded text information, in which the number of expanded samples corresponding to each type of small bag of words is approximately equal;
an external think tank initial expanded bag-of-words generation unit, configured to use entity recognition technology to recognize entity information in the expanded text information to obtain the external think tank initial expanded bag of words.
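The requirement that each type of small bag of words yields a similar number of expanded samples can be sketched as follows. This is only one way to balance the retrieval, assuming an in-memory list of sentences and substring matching; `balanced_retrieval`, `sub_bags` and `per_bag_limit` are hypothetical names, and truncating every bag to the smallest non-empty hit count is an illustrative choice rather than the method stated in the source.

```python
def balanced_retrieval(corpus, sub_bags, per_bag_limit=None):
    """Retrieve sentences per sub-bag and equalize the sample counts.

    corpus: list of sentences from the professional text library.
    sub_bags: dict mapping a sub-bag name (e.g. a part of speech) to keywords.
    """
    # Collect every sentence that mentions at least one keyword of the sub-bag.
    hits = {
        name: [s for s in corpus if any(w in s for w in words)]
        for name, words in sub_bags.items()
    }
    counts = [len(sents) for sents in hits.values() if sents]
    if not counts:
        return {name: [] for name in hits}
    # Use the smallest non-empty hit count as the shared quota.
    quota = min(counts)
    if per_bag_limit is not None:
        quota = min(quota, per_bag_limit)
    return {name: sents[:quota] for name, sents in hits.items()}
```

The balanced output would then be passed to an entity recognizer to build the initial expanded bag of words; that step is omitted here because the source does not name a specific recognizer.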
The expanded bag-of-words and expanded noise bag-of-words generation module includes:
an expanded bag-of-words generation unit, configured to use a convolutional neural network to perform vector calculations in a high-dimensional vector space on the expanded words in the external think tank initial expanded bag of words and the entity words in the initial keyword bag of words, and to identify real entity annotation samples and noise entity annotation samples according to the calculation results;
an expanded noise bag-of-words generation unit, configured to collect all the real entity annotation samples into the external think tank expanded bag of words, and to collect all the noise entity annotation samples into the external think tank expanded noise bag of words.
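The split into an expanded bag and a noise bag can be sketched as below. The source states that a convolutional neural network performs the vector calculation; this sketch instead assumes the word vectors are already available as plain lists and substitutes cosine similarity with a fixed threshold. The functions `cosine` and `split_bags` and the threshold value 0.8 are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_bags(expanded_vectors, seed_vectors, threshold=0.8):
    """Partition expanded words into (expanded_bag, noise_bag).

    A word counts as a real entity when its best similarity to any seed
    entity word reaches the threshold; otherwise it is treated as noise.
    """
    expanded_bag, noise_bag = [], []
    for word, vec in expanded_vectors.items():
        best = max((cosine(vec, s) for s in seed_vectors.values()), default=0.0)
        (expanded_bag if best >= threshold else noise_bag).append(word)
    return expanded_bag, noise_bag
```

In practice the threshold would need tuning per domain, since a value that is too low lets noise into the expanded bag and a value that is too high discards valid professional entities.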
The generalization sample generation module includes:
an interpolated entity generalization sample generation unit, configured to randomly select noise entity samples from the external think tank expanded noise bag of words and insert them into the real entity annotation samples in the external think tank expanded bag of words to obtain interpolated entity generalization samples;
an extrapolated sentence generalization sample generation unit, configured to select noise sentences containing vocabulary entities from the initial keyword bag of words and insert them into the real entity annotation samples in the external think tank expanded bag of words to obtain extrapolated sentence generalization samples.
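Noise entity interpolation can be sketched on a token-level annotated sample as follows. The sketch assumes BIO tags, assumes the inserted noise entity is labeled "O" (the source does not state how inserted noise is labeled), and never inserts in the middle of a real entity. The function name `interpolate_noise` is hypothetical.

```python
import random

def interpolate_noise(tokens, tags, noise_bag, rng=None):
    """Insert one randomly chosen noise entity into an annotated sample.

    tokens / tags: parallel lists, tags in BIO format.
    Returns new token and tag lists; the inputs are left unchanged.
    """
    if not noise_bag:
        return list(tokens), list(tags)
    rng = rng or random.Random(0)
    # Position i is valid when inserting before tokens[i] does not split
    # an entity, i.e. tokens[i] does not continue one (no "I-" tag).
    boundaries = [
        i for i in range(len(tokens) + 1)
        if i == len(tokens) or not tags[i].startswith("I-")
    ]
    pos = rng.choice(boundaries)
    noise = rng.choice(noise_bag).split()
    new_tokens = tokens[:pos] + noise + tokens[pos:]
    new_tags = tags[:pos] + ["O"] * len(noise) + tags[pos:]
    return new_tokens, new_tags
```

Calling the function repeatedly with different random generators produces several interpolated variants of the same real entity annotation sample.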
In yet another aspect, the present invention also relates to a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the steps of the automatic annotation method for industry professional texts described above are implemented, for example the steps shown in Figure 1. Alternatively, when the processor executes the computer program, the functions of the modules/units in the apparatus embodiments above are implemented, for example the functions of the modules/units shown in Figure 5.
In a further aspect, the present invention also relates to a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the steps of the automatic annotation method for industry professional texts described above are implemented.
The readable storage medium may be a computer storage medium or a communication medium. Communication media include any medium that facilitates transferring a computer program from one place to another. Computer storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. For example, a readable storage medium is coupled to the processor so that the processor can read information from, and write information to, the readable storage medium. The readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an application-specific integrated circuit (ASIC), and the ASIC may reside in user equipment. The processor and the readable storage medium may also exist as discrete components in a communication device. The readable storage medium may be a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Corresponding to the method described above, the automatic annotation apparatus, terminal and storage medium for industry professional texts proposed by the present invention can annotate industry professional texts efficiently and accurately through a systematic, standardized processing flow. They offer good hardware adaptability and compatibility and can be practically applied to technical implementations in the field of text annotation.
The present invention also provides a reference for other solutions related to text annotation technology, can serve as a basis for extension and further research, and has broad application prospects.
It is obvious to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments above, and that the present invention can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as exemplary and non-restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalency of the claims are therefore intended to be embraced in the present invention.
Finally, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity. Those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may be appropriately combined to form other embodiments understandable to those skilled in the art.

Claims (10)

  1. An automatic annotation method for industry professional texts, characterized by comprising:
    performing keyword retrieval in a professional text library according to an initial keyword bag of words to obtain expanded text information, and performing entity recognition on the expanded text information to obtain an external think tank initial expanded bag of words;
    comparing, one by one, whether the word vectors of the expanded words in the external think tank initial expanded bag of words are similar to those of the entity words in the initial keyword bag of words, and obtaining, according to the comparison results, an external think tank expanded bag of words and an external think tank expanded noise bag of words respectively;
    using the external think tank expanded noise bag of words to perform noise entity feature interpolation on the external think tank expanded bag of words to obtain interpolated entity generalization samples, and performing noise sentence feature extrapolation on the external think tank expanded bag of words based on the initial keyword bag of words to obtain extrapolated sentence generalization samples;
    performing reverse text annotation positioning on the interpolated entity generalization samples and the extrapolated sentence generalization samples to obtain professional entity annotation samples.
  2. The automatic annotation method for industry professional texts according to claim 1, characterized in that performing keyword retrieval in the professional text library according to the initial keyword bag of words to obtain expanded text information, and performing entity recognition on the expanded text information to obtain the external think tank initial expanded bag of words, comprises:
    acquiring vocabulary entities and organizing them into the initial keyword bag of words, the initial keyword bag of words being divided into multiple types of small bags of words according to part-of-speech classification requirements;
    using the initial keyword bag of words to perform keyword retrieval in the professional text library to obtain expanded text information, in which the number of expanded samples corresponding to each type of small bag of words is approximately equal;
    using entity recognition technology to recognize entity information in the expanded text information to obtain the external think tank initial expanded bag of words.
  3. The automatic annotation method for industry professional texts according to claim 1, characterized in that comparing, one by one, whether the word vectors of the expanded words in the external think tank initial expanded bag of words are similar to those of the entity words in the initial keyword bag of words, and obtaining, according to the comparison results, the external think tank expanded bag of words and the external think tank expanded noise bag of words respectively, comprises:
    using a convolutional neural network to perform vector calculations in a high-dimensional vector space on the expanded words in the external think tank initial expanded bag of words and the entity words in the initial keyword bag of words, and identifying real entity annotation samples and noise entity annotation samples according to the calculation results;
    collecting all the real entity annotation samples into the external think tank expanded bag of words, and collecting all the noise entity annotation samples into the external think tank expanded noise bag of words.
  4. The automatic annotation method for industry professional texts according to claim 1, characterized in that using the external think tank expanded noise bag of words to perform noise entity feature interpolation on the external think tank expanded bag of words to obtain interpolated entity generalization samples, and performing noise sentence feature extrapolation on the external think tank expanded bag of words based on the initial keyword bag of words to obtain extrapolated sentence generalization samples, comprises:
    randomly selecting noise entity samples from the external think tank expanded noise bag of words and inserting them into the real entity annotation samples in the external think tank expanded bag of words to obtain interpolated entity generalization samples;
    selecting noise sentences containing vocabulary entities from the initial keyword bag of words and inserting them into the real entity annotation samples in the external think tank expanded bag of words to obtain extrapolated sentence generalization samples.
  5. An automatic annotation apparatus for industry professional texts, characterized by comprising:
    an initial expanded bag-of-words generation module, configured to perform keyword retrieval in a professional text library according to an initial keyword bag of words to obtain expanded text information, and to perform entity recognition on the expanded text information to obtain an external think tank initial expanded bag of words;
    an expanded bag-of-words and expanded noise bag-of-words generation module, configured to compare, one by one, whether the word vectors of the expanded words in the external think tank initial expanded bag of words are similar to those of the entity words in the initial keyword bag of words, and to obtain, according to the comparison results, an external think tank expanded bag of words and an external think tank expanded noise bag of words respectively;
    a generalization sample generation module, configured to use the external think tank expanded noise bag of words to perform noise entity feature interpolation on the external think tank expanded bag of words to obtain interpolated entity generalization samples, and to perform noise sentence feature extrapolation on the external think tank expanded bag of words based on the initial keyword bag of words to obtain extrapolated sentence generalization samples;
    a professional entity annotation module, configured to perform reverse text annotation positioning on the interpolated entity generalization samples and the extrapolated sentence generalization samples to obtain professional entity annotation samples.
  6. The automatic annotation apparatus for industry professional texts according to claim 5, characterized in that the initial expanded bag-of-words generation module comprises:
    an initial keyword bag-of-words acquisition unit, configured to acquire vocabulary entities and organize them into the initial keyword bag of words, the initial keyword bag of words being divided into multiple types of small bags of words according to part-of-speech classification requirements;
    an expanded text information acquisition unit, configured to use the initial keyword bag of words to perform keyword retrieval in the professional text library to obtain expanded text information, in which the number of expanded samples corresponding to each type of small bag of words is approximately equal;
    an external think tank initial expanded bag-of-words generation unit, configured to use entity recognition technology to recognize entity information in the expanded text information to obtain the external think tank initial expanded bag of words.
  7. The automatic annotation apparatus for industry professional texts according to claim 5, characterized in that the expanded bag-of-words and expanded noise bag-of-words generation module comprises:
    an expanded bag-of-words generation unit, configured to use a convolutional neural network to perform vector calculations in a high-dimensional vector space on the expanded words in the external think tank initial expanded bag of words and the entity words in the initial keyword bag of words, and to identify real entity annotation samples and noise entity annotation samples according to the calculation results;
    an expanded noise bag-of-words generation unit, configured to collect all the real entity annotation samples into the external think tank expanded bag of words, and to collect all the noise entity annotation samples into the external think tank expanded noise bag of words.
  8. The automatic annotation apparatus for industry professional texts according to claim 5, characterized in that the generalization sample generation module comprises:
    an interpolated entity generalization sample generation unit, configured to randomly select noise entity samples from the external think tank expanded noise bag of words and insert them into the real entity annotation samples in the external think tank expanded bag of words to obtain interpolated entity generalization samples;
    an extrapolated sentence generalization sample generation unit, configured to select noise sentences containing vocabulary entities from the initial keyword bag of words and insert them into the real entity annotation samples in the external think tank expanded bag of words to obtain extrapolated sentence generalization samples.
  9. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, the steps of the automatic annotation method for industry professional texts according to any one of claims 1 to 4 are implemented.
  10. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, the steps of the automatic annotation method for industry professional texts according to any one of claims 1 to 4 are implemented.
PCT/CN2022/109617 2022-03-24 2022-08-02 Industry professional text automatic labeling method and apparatus, terminal, and storage medium WO2023178903A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210291524.5A CN114386424B (en) 2022-03-24 2022-03-24 Industry professional text automatic labeling method, industry professional text automatic labeling device, industry professional text automatic labeling terminal and industry professional text automatic labeling storage medium
CN202210291524.5 2022-03-24

Publications (1)

Publication Number Publication Date
WO2023178903A1 true WO2023178903A1 (en) 2023-09-28

Family

ID=81205813

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/109617 WO2023178903A1 (en) 2022-03-24 2022-08-02 Industry professional text automatic labeling method and apparatus, terminal, and storage medium

Country Status (2)

Country Link
CN (1) CN114386424B (en)
WO (1) WO2023178903A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114386424B (en) * 2022-03-24 2022-06-10 上海帜讯信息技术股份有限公司 Industry professional text automatic labeling method, industry professional text automatic labeling device, industry professional text automatic labeling terminal and industry professional text automatic labeling storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150205780A1 (en) * 2014-01-20 2015-07-23 Lenovo (Singapore) Pte. Ltd. Real time data tagging in text-based documents
CN112131366A (en) * 2020-09-23 2020-12-25 腾讯科技(深圳)有限公司 Method, device and storage medium for training text classification model and text classification
CN113065341A (en) * 2021-03-14 2021-07-02 北京工业大学 Automatic labeling and classifying method for environmental complaint report text
CN113343695A (en) * 2021-05-27 2021-09-03 镁佳(北京)科技有限公司 Text labeling noise detection method and device, storage medium and electronic equipment
CN114386424A (en) * 2022-03-24 2022-04-22 上海帜讯信息技术股份有限公司 Industry professional text automatic labeling method, industry professional text automatic labeling device, industry professional text automatic labeling terminal and industry professional text automatic labeling storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101075435B (en) * 2007-04-19 2011-05-18 深圳先进技术研究院 Intelligent chatting system and its realizing method
CN103927358B (en) * 2014-04-15 2017-02-15 清华大学 text search method and system
CN108280061B (en) * 2018-01-17 2021-10-26 北京百度网讯科技有限公司 Text processing method and device based on ambiguous entity words
CN110059185B (en) * 2019-04-03 2022-10-04 天津科技大学 Medical document professional vocabulary automatic labeling method
US20230032536A1 (en) * 2019-12-23 2023-02-02 Medsavana S.L. Privacy preservation in a queryable database built from unstructured texts

Also Published As

Publication number Publication date
CN114386424A (en) 2022-04-22
CN114386424B (en) 2022-06-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22932955

Country of ref document: EP

Kind code of ref document: A1