WO2021036181A1 - Data extraction method and device, storage medium and equipment - Google Patents

Data extraction method and device, storage medium and equipment

Info

Publication number
WO2021036181A1
Authority
WO
WIPO (PCT)
Prior art keywords
documents
triples
manually
model
labeled
Application number
PCT/CN2020/071879
Other languages
French (fr)
Chinese (zh)
Inventor
吴文旷
Original Assignee
北京国双科技有限公司
Application filed by 北京国双科技有限公司
Publication of WO2021036181A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/02 Agriculture; Fishing; Mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mining & Mineral Resources (AREA)
  • Economics (AREA)
  • Animal Husbandry (AREA)
  • Marine Sciences & Fisheries (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Agronomy & Crop Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data extraction method and device, a storage medium, and equipment. The method comprises: obtaining manually labeled triples based on tags manually added to characters in a first set of documents (S101); determining automatically labeled triples according to the triples that a preset model identifies in a second set of documents (S102), the preset model being a model preset to fit the type of the second set of documents and trained on training data comprising the manually labeled triples and the first set of documents; and using the manually labeled triples and the automatically labeled triples as the knowledge data extracted from the documents (S103). The method can improve the utilization rate of useful information in the documents and yields more comprehensive knowledge data.

Description

Data extraction method, device, storage medium and equipment
This application claims priority to Chinese patent application No. 201910789378.7, entitled "Data extraction method, device, storage medium and equipment", filed with the Chinese Patent Office on August 26, 2019, the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the field of electronic information, and in particular to a data extraction method, device, storage medium and equipment.
Background
During oilfield exploration, development and production, a large body of scientific and technical results has accumulated in document form, for example high-value documents such as exploration deployment plans, oil and gas reservoir descriptions, development plans, research reports and archival literature. These documents contain a great deal of useful information, such as the oilfield name, the date production started, daily oil production, reservoir traps, reservoir lithology, thickness and net-to-gross ratio. This information greatly helps researchers engaged in exploration and development to retrieve material quickly, analyze data and uncover the latent value of that material.
However, the useful information in these documents is unstructured, which makes it inconvenient for researchers to query and use; in other words, the utilization rate of the useful information in the documents is low.
Summary of the invention
In view of the above problems, the present invention provides a data extraction method, device, storage medium and equipment that overcome the above problems or at least partially solve them.
By means of the above technical solution, the present invention provides
This application provides a data extraction method, including:
obtaining manually labeled triples based on tags manually added to characters in a first set of documents;
determining automatically labeled triples according to the triples that a preset model identifies in a second set of documents, wherein the preset model is a model preset to fit the type of the second set of documents, the model is trained on training data, and the training data includes the manually labeled triples and the first set of documents;
using the manually labeled triples and the automatically labeled triples as the knowledge data extracted from the documents.
Optionally, the process of obtaining the tags manually added to the characters in the first set of documents includes:
displaying a list of candidate entity tags in response to an operation of manually selecting characters in the first set of documents, the candidate entity tags being determined according to the business requirements of the field to which the first set of documents and the second set of documents belong;
taking the tag manually selected from the list of candidate entity tags as the entity tag of the selected characters, to obtain labeled characters;
displaying a list of candidate relations between entity tags in response to an operation of manually selecting the labeled characters, the candidate relations between entity tags being determined according to the business requirements of the field to which the first set of documents and the second set of documents belong;
taking the relation manually selected from the list of candidate relations as the relation tag of the selected labeled characters.
Optionally, the training data further includes at least one of the following:
the positions in the first set of documents of the elements of the manually labeled triples, the inter-group distance in the first set of documents between the elements of the manually labeled triples, and the inter-group grammatical relationship in the first set of documents between the elements of the manually labeled triples.
Optionally, determining the automatically labeled triples according to the triples that the preset model identifies in the second set of documents includes:
marking target triples, a target triple being at least one of the following: a triple, among those the preset model identifies in the second set of documents, that contradicts another identified triple; a triple, among those the preset model identifies in the second set of documents, that contradicts a manually labeled triple; a triple with a missing element;
obtaining the manual corrections of the target triples and using the corrected results as the automatically labeled triples.
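The marking step above can be sketched as follows. This is only an illustration under stated assumptions: the text does not define what counts as a contradiction, so the sketch assumes two triples contradict each other when they share the first entity and the relation but differ in the second entity. The function name `find_target_triples` and the reason strings are hypothetical.

```python
from collections import defaultdict


def find_target_triples(predicted, manual):
    """Return (triple, reason) pairs for predicted triples that need manual review.

    Assumption: triples are (head, relation, tail) tuples, and a contradiction
    means the same head and relation paired with different tails.
    """
    manual_tails = defaultdict(set)
    for head, relation, tail in manual:
        manual_tails[(head, relation)].add(tail)

    predicted_tails = defaultdict(set)
    for head, relation, tail in predicted:
        if head and relation and tail:
            predicted_tails[(head, relation)].add(tail)

    targets = []
    for triple in predicted:
        head, relation, tail = triple
        key = (head, relation)
        if not (head and relation and tail):
            targets.append((triple, "missing element"))
        elif len(predicted_tails[key]) > 1:
            targets.append((triple, "contradicts another predicted triple"))
        elif manual_tails[key] and tail not in manual_tails[key]:
            targets.append((triple, "contradicts a manually labeled triple"))
    return targets


predicted = [
    ("oil field a", "production", "30 million tons"),
    ("oil field a", "production", "20 million tons"),
    ("oil field b", "production", None),
    ("oil field c", "production", "5 million tons"),
    ("oil field d", "production", "1 million tons"),
]
manual = [("oil field c", "production", "6 million tons")]
targets = find_target_triples(predicted, manual)
for triple, reason in targets:
    print(triple, "->", reason)
```

Only the flagged triples would be shown to a reviewer; how the remaining predictions are handled is not specified in this passage.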
Optionally, the model is retrained using the corrected target triples and the second set of documents.
Optionally, the type of the first set of documents is the same as the type of the second set of documents;
the process of determining the model that fits the type of the second set of documents includes:
taking the model that, during training, identifies triples in the first set of documents with the highest accuracy as the model that fits the type of the second set of documents.
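As a rough illustration of this selection rule, the sketch below scores several candidate models on the first set of documents and keeps the most accurate one. The accuracy measure (the fraction of predicted triples that also appear among the manual labels) is an assumption, as are the names `triple_accuracy` and `select_adapted_model`; the two candidates are stand-in functions, not trained extractors.

```python
def triple_accuracy(predicted, gold):
    """Fraction of predicted triples that also appear in the manual labels (assumed metric)."""
    if not predicted:
        return 0.0
    gold_set = set(gold)
    return sum(1 for t in predicted if t in gold_set) / len(predicted)


def select_adapted_model(candidates, documents, gold):
    """Return the name of the candidate whose predictions score highest."""
    return max(
        candidates,
        key=lambda name: triple_accuracy(candidates[name](documents), gold),
    )


gold = [("oil field a", "production", "30 million tons")]
docs = ["The production of oil field a is 30 million tons"]
# Two stand-in "models"; real candidates would be trained extractors.
candidates = {
    "model_x": lambda documents: [("oil field a", "production", "30 million tons")],
    "model_y": lambda documents: [("oil field a", "production", "3 million tons")],
}
best = select_adapted_model(candidates, docs, gold)
print(best)
```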
This application also provides a data extraction device, including:
a first obtaining module, configured to obtain manually labeled triples based on tags manually added to characters in a first set of documents;
a determining module, configured to determine automatically labeled triples according to the triples that a preset model identifies in a second set of documents, wherein the preset model is a model preset to fit the type of the second set of documents, the model is trained on training data, and the training data includes the manually labeled triples and the first set of documents;
an execution module, configured to use the manually labeled triples and the automatically labeled triples as the knowledge data extracted from the documents.
Optionally, the device further includes a second obtaining module, configured to obtain the tags manually added to the characters in the first set of documents;
the second obtaining module is specifically configured to display a list of candidate entity tags in response to an operation of manually selecting characters in the first set of documents, the candidate entity tags being determined according to the business requirements of the field to which the first set of documents and the second set of documents belong;
to take the tag manually selected from the list of candidate entity tags as the entity tag of the selected characters, obtaining labeled characters;
to display a list of candidate relations between entity tags in response to an operation of manually selecting the labeled characters, the candidate relations between entity tags being determined according to the business requirements of the field to which the first set of documents and the second set of documents belong;
and to take the relation manually selected from the list of candidate relations as the relation tag of the selected labeled characters.
Optionally, the training data further includes at least one of the following:
the positions in the first document of the elements of the manually labeled triples, the inter-group distance in the first document between the elements of the manually labeled triples, and the inter-group grammatical relationship in the first document between the elements of the manually labeled triples.
Optionally, the determining module being configured to determine the automatically labeled triples according to the triples that the preset model identifies in the second document includes:
the determining module being specifically configured to mark target triples, a target triple being at least one of the following: a triple, among those the preset model identifies in the second document, that contradicts another identified triple; a triple, among those the preset model identifies in the second document, that contradicts a manually labeled triple; a triple with a missing element;
and to obtain the manual corrections of the target triples and use the corrected results as the automatically labeled triples.
Optionally, the device further includes a training module;
the training module is configured to retrain the model using the corrected target triples and the second document.
Optionally, the device further includes an adapted-model determining module, configured to take the model that, during training, identifies triples in the first document with the highest accuracy as the model that fits the type of the second document; the type of the first document is the same as the type of the second document.
This application also provides a storage medium that includes a stored program, wherein the program, when executed, performs any one of the data extraction methods described above.
This application also provides equipment, including a processor, a memory and a bus, the processor and the memory being connected through the bus;
the memory is configured to store a program and the processor is configured to run the program, wherein the program, when run, performs any one of the data extraction methods described above.
In the data extraction scheme provided by the present invention, manually labeled triples are obtained based on tags manually added to characters in the first set of documents, automatically labeled triples are determined according to the triples that the preset model identifies in the second set of documents, and the manually labeled triples together with the automatically labeled triples are used as the knowledge data extracted from the documents. The knowledge data obtained by the invention therefore consists of triples. Because triples are structured data, they are convenient for users to query and use, so the scheme can improve the utilization rate of useful information in the documents.
In addition, the invention uses both the manually labeled triples and the automatically labeled triples as knowledge data. The automatically labeled triples are determined from the triples that the preset model identifies in the second set of documents, and the model, which fits the type of the second set of documents, is trained with the manually labeled triples and the first set of documents as training samples. The manually labeled triples are the triples in the training samples, while the automatically labeled triples are determined from the triples the preset model produces during testing. The knowledge data therefore includes both the triples in the training samples and the triples derived from the testing process, which makes the knowledge data obtained by the invention more comprehensive.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
Fig. 1 is a schematic flowchart of a data extraction method disclosed in an embodiment of this application;
Fig. 2 is a schematic flowchart of a model training method disclosed in an embodiment of this application;
Fig. 3 is a schematic flowchart of another data extraction method disclosed in an embodiment of this application;
Fig. 4 is a schematic structural diagram of a data extraction device disclosed in an embodiment of this application;
Fig. 5 is a schematic structural diagram of equipment disclosed in an embodiment of this application.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.
In the embodiments of this application, the documents used to train the model are called the first set of documents, and the documents used in the testing process are called the second set of documents. Which documents belong to the first set and which to the second can be decided according to the actual situation and is not limited by this embodiment.
Fig. 1 shows the data extraction method provided by an embodiment of this application, which includes the following steps:
S101: Obtain manually labeled triples based on tags manually added to characters in the first set of documents.
In this step, a triple generally consists of two entities and an entity relation, where the entity relation reflects the relationship between the two entities.
For example, in the field of petroleum exploration, a manually labeled triple obtained from the first set of documents may be "oil field a, 30 million tons, production", where "oil field a" is an entity, "30 million tons" is also an entity, and "production" is the relation between the entity "oil field a" and the entity "30 million tons".
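A triple of this kind maps naturally onto a small record type. The sketch below is one possible representation, not a structure given in the source; the field names `head`, `relation` and `tail` are assumptions chosen for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One extracted fact: two entities joined by a relation (field names are assumptions)."""
    head: str      # first entity, e.g. the oil field
    relation: str  # relation between the two entities
    tail: str      # second entity, e.g. the quantity


# The example from the text: "oil field a" and "30 million tons" linked by "production".
example = Triple(head="oil field a", relation="production", tail="30 million tons")
print(example.head, "|", example.relation, "|", example.tail)
```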
S102: Determine automatically labeled triples according to the triples that the preset model identifies in the second set of documents.
The preset model is a model preset to fit the type of the second set of documents. The model is trained on training data that includes the manually labeled triples and the first set of documents.
Specifically, training the model involves a forward propagation process and a back propagation process. In forward propagation, the model identifies triples in the first set of documents. In back propagation, the value of a preset loss function between the identified triples and the manually labeled triples is computed, and the model parameters are adjusted with the goal of reducing that value. The adjusted model continues to be trained in the same way until the loss value is no greater than a preset threshold, at which point training is complete.
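The stopping rule described above (repeat forward pass, loss computation and parameter update until the loss is no greater than a preset threshold) can be sketched with a toy model. Everything concrete here is an assumption: the source specifies neither the model architecture nor the loss function, so the sketch uses a one-feature logistic classifier with cross-entropy loss, where each row stands for a candidate triple and the label says whether it matches a manual label.

```python
import math


def train_until_threshold(features, labels, threshold=0.1, lr=0.5, max_steps=10_000):
    """Gradient descent on a logistic model, stopping once loss <= threshold."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    loss = float("inf")
    for _ in range(max_steps):
        # Forward pass: predicted probability that each candidate is a correct triple.
        probs = [
            1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(weights, row)) + bias)))
            for row in features
        ]
        # Cross-entropy loss between predictions and manual labels.
        loss = -sum(
            y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            for y, p in zip(labels, probs)
        ) / len(labels)
        if loss <= threshold:
            break
        # Backward pass: adjust parameters in the direction that reduces the loss.
        errors = [p - y for p, y in zip(probs, labels)]
        for j in range(len(weights)):
            weights[j] -= lr * sum(e * row[j] for e, row in zip(errors, features)) / len(labels)
        bias -= lr * sum(errors) / len(labels)
    return weights, bias, loss


# Toy data: one hypothetical numeric feature per candidate triple.
features = [[-2.0], [-1.0], [1.0], [2.0]]
labels = [0, 0, 1, 1]
weights, bias, final_loss = train_until_threshold(features, labels)
print(final_loss <= 0.1)
```

The `max_steps` cap is an addition for safety so the loop always terminates; the text itself only names the threshold condition.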
Note that the details of the loss function can be found in the prior art and are not repeated here.
The process of obtaining the manually labeled triples, and the process of fitting the model to the document type, are described in the embodiment shown in Fig. 2.
S103: Use the manually labeled triples and the automatically labeled triples as the knowledge data extracted from the documents.
Because triples are structured data, they are convenient for users to query and use. By converting the characters in the documents into triples, this embodiment can therefore improve the utilization rate of the useful information in the documents.
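To make the benefit of structured triples concrete, the sketch below merges the two sources into one knowledge collection and runs a simple lookup. It is an illustration only: the removal of duplicates and the `query` helper are assumptions, not steps stated in the source.

```python
def build_knowledge_data(manual, automatic):
    """Combine manually and automatically labeled triples, keeping first occurrences."""
    seen = set()
    merged = []
    for triple in list(manual) + list(automatic):
        if triple not in seen:  # de-duplication is an assumption of this sketch
            seen.add(triple)
            merged.append(triple)
    return merged


def query(knowledge, head=None, relation=None):
    """Return triples matching the given first entity and/or relation."""
    return [
        t for t in knowledge
        if (head is None or t[0] == head) and (relation is None or t[1] == relation)
    ]


manual = [("oil field a", "production", "30 million tons")]
automatic = [
    ("oil field b", "production", "12 million tons"),
    ("oil field a", "production", "30 million tons"),
]
knowledge = build_knowledge_data(manual, automatic)
print(query(knowledge, head="oil field b"))
```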
In addition, this embodiment uses both the manually labeled triples and the automatically labeled triples as knowledge data. The automatically labeled triples are determined from the triples that the preset model identifies in the second set of documents, and the model is trained with the manually labeled triples and the first set of documents as training samples. The manually labeled triples are the triples in the training samples, while the automatically labeled triples are determined from the triples the preset model produces during testing. The knowledge data therefore includes both the triples in the training samples and the triples derived from the testing process, which makes the knowledge data obtained by the invention more comprehensive.
Moreover, because the preset model is preset to fit the type of the second set of documents, it can identify the triples in the second set of documents more accurately.
Note that the first and second sets of documents described above can be documents from any field; that is, the data extraction method can be applied in any field that produces documents. The following embodiments use the field of petroleum exploration as an example.
Fig. 2 shows the model training method provided by an embodiment of this application, which includes the following steps:
S201. Obtain training samples.
In this embodiment, the training samples include the first set of documents and the triples labeled in the first set of documents.
Specifically, in this step, the process of obtaining the triples labeled in the first set of documents includes steps A1 to A6:
A1. Obtain the first set of documents.
In this embodiment, the first set of documents consists of documents produced during oil and gas exploration, development and production, in formats such as Word, PPT, PDF, Excel, JPG and PNG. The first set of documents can be obtained by receiving the documents and extracting the characters in them (for image documents, by running OCR on the characters), which yields recognized characters and makes it possible to label triples manually on the characters of the first set of documents.
A2. Display a list of candidate entity tags in response to an operation of manually selecting characters in the first set of documents.
In this step, when characters in the first set of documents are manually selected, a list of candidate entity tags is displayed, that is, a list of entity tags offered for manual selection.
In this embodiment, the candidate entity tags in the list are determined according to the business requirements of the petroleum exploration field to which the first and second sets of documents belong. For example, entity tags in the petroleum exploration field include, but are not limited to: oil and gas field name, date production started, daily oil production, reservoir trap, reservoir lithology, thickness, and net-to-gross ratio.
A3. Take the tag manually selected from the list of candidate entity tags as the entity tag of the selected characters, to obtain labeled characters.
In this step, for the characters already selected in the first set of documents (the selected characters), the tag of the entity to which they belong is manually chosen from the list of candidate entity tags and becomes their entity tag. For ease of description, selected characters that carry an entity tag are called labeled characters.
After this step, the first set of documents may contain multiple selected characters and therefore multiple labeled characters.
A4. Display a list of candidate relations between entity tags in response to an operation of manually selecting labeled characters.
Some relationship exists among the labeled characters. In this step, therefore, once labeled characters have been manually selected, a list of candidate relations between entity tags is displayed so that the relation between the selected labeled characters can be chosen manually from the list.
In this embodiment, the candidate relations between entity tags are determined according to the business requirements of the petroleum exploration field to which the first and second sets of documents belong. For example, the candidate relations include "production of entity 1", where entity 1 is the number assigned to characters already labeled as an entity. Suppose that in the sentence "The production of oil field a is 30 million tons" the selected characters "oil field a" are labeled as entity 1 and "30 million tons" as entity 2; the relation tag of entity 2 is then "production of entity 1".
A5. Take the relation manually selected from the list of candidate relations as the relation tag of the selected labeled characters.
Steps A1 to A5 above are the process of manually adding tags to the characters in the first set of documents.
Steps A1 to A5 yield the entity tags, the relation tags describing the relations between the entities indicated by different entity tags, and the correspondences between entity tags and relation tags.
A6. Obtain the manually labeled triples based on the manually added entity tags and relation tags.
In this step, the triple indicated by each correspondence is obtained from the entity tags, the relation tags and the correspondences. The triple is obtained in the same way for every correspondence: for any one correspondence, the entities indicated by its entity tags and the relation indicated by its relation tag form a triple, which is a manually labeled triple.
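Step A6 can be sketched as a small assembly routine. The data layout is an assumption made for illustration: entities are keyed by their number, and each relation tag is stored on one entity and points at the other, mirroring the "production of entity 1" example. The name `triples_from_labels` is hypothetical.

```python
def triples_from_labels(entities, relation_labels):
    """Assemble (entity, relation, entity) triples from manual tags.

    entities: {entity_number: {"text": ..., "tag": ...}}
    relation_labels: {tagged_entity_number: (relation, referenced_entity_number)}
    Both layouts are assumptions of this sketch.
    """
    triples = []
    for tagged_id, (relation, referenced_id) in relation_labels.items():
        triples.append(
            (entities[referenced_id]["text"], relation, entities[tagged_id]["text"])
        )
    return triples


entities = {
    1: {"text": "oil field a", "tag": "oil and gas field name"},
    2: {"text": "30 million tons", "tag": "production quantity"},  # tag name is illustrative
}
# Entity 2 carries the relation tag "production of entity 1".
relation_labels = {2: ("production", 1)}
manual_triples = triples_from_labels(entities, relation_labels)
print(manual_triples)
```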
In this embodiment, the first set of documents and the triples manually labeled in the first set of documents can be used as training samples.
Optionally, to improve the accuracy with which the trained model identifies triples from the second set of documents, the training samples in this embodiment further include: the positions of the elements of each manually labeled triple in the first set of documents, the inter-group distance of those elements in the first set of documents, and the inter-group grammatical relation of those elements in the first set of documents.
Each manually labeled triple has its own inter-group distance and inter-group grammatical relation. For any manually labeled triple, the inter-group distance of its elements in the first set of documents is the distance between the positions of those elements in the first set of documents. The distance may be the Euclidean distance or another form of distance; this embodiment does not limit the specific form. The inter-group grammatical relation of the elements is the grammatical relation of those elements in the first set of documents, where examples of grammatical roles are subject, predicate, object, attributive, adverbial, complement, copula, and predicative. Continuing with the manually labeled triple "oilfield a, 30 million tons, output": if the sentence in the first set of documents is "the output of oilfield a is 30 million tons", then after the triple is labeled, the grammatical roles of its elements in the first set of documents are labeled as well: "output" is labeled as the subject, "oilfield a" as the attributive, and "30 million tons" as the object.
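The additional training features can be sketched as follows. This is only an illustration under stated assumptions: positions are taken as character offsets within a single sentence, the distance is the one-dimensional Euclidean distance (the absolute difference of offsets), and the grammatical roles are supplied by the annotator rather than computed. The function name `element_features` is hypothetical.

```python
from itertools import combinations


def element_features(sentence, triple, roles):
    """Collect position, pairwise distance, and grammatical role for triple elements."""
    # Character offset of each element's first occurrence (-1 if absent).
    positions = {element: sentence.find(element) for element in triple}
    # In one dimension, the Euclidean distance is the absolute difference.
    distances = {
        (a, b): abs(positions[a] - positions[b])
        for a, b in combinations(triple, 2)
    }
    return {"positions": positions, "distances": distances, "roles": roles}


sentence = "The output of oilfield a is 30 million tons"
triple = ("oilfield a", "30 million tons", "output")
roles = {
    "output": "subject",
    "oilfield a": "attributive",
    "30 million tons": "object",
}
features = element_features(sentence, triple, roles)
```

Other distance definitions (for example, token offsets or distances across sentences) would fit the same structure, since the embodiment leaves the form of the distance open.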
S202. Train multiple models separately with the training samples to obtain multiple trained models.
In this step, the multiple models may include: a naive Bayes model, a support vector machine (SVM) model, a word embedding model (for example, word2vec), a recurrent neural network (RNN) model, and a long short-term memory (LSTM) network model.
Specifically, the training procedure for each of these models is prior art and is not repeated here. In this embodiment, a trained model is able to identify triples from documents in the petroleum exploration field.
During research, the applicant found that in the petroleum exploration field, because models differ in structure, different trained models achieve different test accuracy on a given type of document. Therefore, to improve the accuracy of the triples that a model identifies, first sets of documents of different types may optionally be used in this embodiment to train the multiple models separately. For any one type, each model is trained with the first set of documents of that type, and the accuracy of the models' outputs is compared (the minimum value reached by the loss function over the training iterations may serve as the accuracy score). The model with the most accurate output (the smallest loss value) is selected as the model adapted to documents of that type.
For the process of training a model with a first set of documents of any one type, see the flow shown in Figure 2.
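The per-type model selection described above can be sketched as a small lookup. The sketch assumes the minimum loss of each trained model has already been recorded per document type. The document type names and the loss values below are placeholders for illustration only, not results from this application.

```python
def select_adapted_models(min_losses):
    """Map each document type to the model with the smallest recorded loss.

    min_losses: {document_type: {model_name: minimum loss during training}}
    """
    return {
        doc_type: min(per_model, key=per_model.get)
        for doc_type, per_model in min_losses.items()
    }


# Placeholder values, chosen only to show the selection logic.
min_losses = {
    "type_1": {"naive_bayes": 0.9, "svm": 0.7, "lstm": 0.4},
    "type_2": {"naive_bayes": 0.5, "svm": 0.3, "lstm": 0.6},
}
adapted = select_adapted_models(min_losses)


def target_model_for(doc_type):
    # Selecting the target model for a second set of documents then becomes a lookup by type.
    return adapted[doc_type]
```

Keeping the selection as a plain mapping means the choice of target model in the later steps needs no retraining, only the type of the second set of documents.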
Figure 3 shows a further data extraction method provided by an embodiment of this application, including the following steps:
S301. Obtain the manually labeled triples based on the labels manually added to the characters in the first set of documents.
For the implementation principle of this step, refer to S101; it is not repeated here.
S302. According to the type of the second set of documents, select the model adapted to that type as the target model.
S303. Input the second set of documents into the target model to obtain the triples that the target model identifies from the second set of documents.
S304. Determine the automatically labeled triples according to the triples that the target model identified from the second set of documents.
In this step, automatically labeled triples are triples that can serve as knowledge data. Specifically, the ways of determining the automatically labeled triples according to the triples that the preset model identified from the second set of documents may include:
First way: use the triples identified by the target model as the automatically labeled triples.
Second way: mark the target triples among the triples identified by the target model, obtain the manual corrections of the target triples, and use the corrected triples as the automatically labeled triples.
A target triple is at least one of the following: (1) triples that contradict each other among the triples identified by the target model from the second set of documents; (2) triples, among those identified by the target model from the second set of documents, that contradict the manually labeled triples; and (3) triples with a missing item.
Specifically, contradictory triples among those identified by the target model from the second set of documents are identified triples that contradict one another. For example, suppose the triples identified by the target model from the second set of documents include "the maximum oil volume is 1000 tons" and "the maximum oil volume is 5000 tons". Because the maximum oil volume can take only one value, these two triples contradict each other.
Third way: use the identified triples other than the target triples, together with the corrected target triples, as the automatically labeled triples.
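The second and third ways can be sketched as two small functions: one flags target triples, the other combines the unflagged triples with the manual corrections. This is a minimal sketch under assumptions not stated in the application: triples are (entity, relation, value) tuples, a contradiction means two different values for the same entity and a relation declared single-valued, and corrections arrive as a mapping from a flagged triple to its corrected form. All names are hypothetical.

```python
from collections import defaultdict


def find_target_triples(predicted, manual, single_valued):
    """Flag predicted triples that need manual correction."""
    targets = set()
    # (3) Triples with a missing item.
    for t in predicted:
        if any(item is None or item == "" for item in t):
            targets.add(t)
    # Values seen per (entity, relation) among complete predicted triples.
    predicted_values = defaultdict(set)
    for s, r, v in predicted:
        if r in single_valued and s and v:
            predicted_values[(s, r)].add(v)
    manual_values = {}
    for s, r, v in manual:
        manual_values.setdefault((s, r), set()).add(v)
    for t in predicted:
        s, r, v = t
        if t in targets or r not in single_valued:
            continue
        if len(predicted_values[(s, r)]) > 1:
            targets.add(t)  # (1) contradicts another predicted triple
        elif (s, r) in manual_values and v not in manual_values[(s, r)]:
            targets.add(t)  # (2) contradicts a manually labeled triple
    return targets


def merge_auto_triples(predicted, targets, corrections):
    """Third way: non-target triples plus the corrected target triples."""
    kept = [t for t in predicted if t not in targets]
    corrected = [corrections[t] for t in predicted if t in targets and t in corrections]
    return list(dict.fromkeys(kept + corrected))  # drop duplicates, keep order
```

With the example above, the two "maximum oil volume" triples for the same entity would both be flagged, and only the value confirmed by the manual correction would remain after merging.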
S305. Use the manually labeled triples and the automatically labeled triples as the knowledge data extracted from the documents.
S306. Save the knowledge data extracted from the documents in a preset knowledge graph library.
The implementation of this step is prior art and is not repeated here.
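As a rough illustration of S305 and S306 only, the two groups of triples can be combined and written into a store. The application does not describe the knowledge graph library, so the nested dictionary used here is a stand-in assumption, and `store_knowledge` is a hypothetical name.

```python
def store_knowledge(graph, manual_triples, auto_triples):
    """Add (entity, relation, value) triples to a nested-dict stand-in for a graph store."""
    for entity, relation, value in list(manual_triples) + list(auto_triples):
        # graph[entity][relation] holds the set of values known for that relation.
        graph.setdefault(entity, {}).setdefault(relation, set()).add(value)
    return graph


graph = store_knowledge(
    {},
    manual_triples=[("oilfield a", "output", "30 million tons")],
    auto_triples=[("field b", "max oil volume", "5000 tons")],
)
```

Because the knowledge data is structured, a query such as `graph["oilfield a"]["output"]` becomes a direct lookup.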
Optionally, to improve the accuracy with which the trained model identifies triples from documents, the corrected target triples and the second set of documents can also be used as training samples to continue training the trained model, giving an updated model. When triples later need to be identified from documents, the updated model is used.
The embodiments of this application have the following beneficial effects:
Beneficial effect 1.
In this embodiment, the manually labeled triples are obtained based on the labels manually added to the characters in the first set of documents, the automatically labeled triples are determined according to the triples that the preset model identified from the second set of documents, and the manually labeled triples and the automatically labeled triples are used as the knowledge data extracted from the documents. That is, the knowledge data obtained by the present invention consists of triples, and triples are structured data, which makes the data convenient for users to query and use. The solution of the present invention can therefore improve the utilization of the knowledge data in documents.
In addition, in this embodiment both the manually labeled triples and the automatically labeled triples serve as knowledge data. The automatically labeled triples are determined according to the triples that the preset model identified from the second set of documents; the preset model is a preset model adapted to the type of the second set of documents and is trained with the manually labeled triples and the first set of documents as training samples. Because the manually labeled triples are the triples in the training samples, and the automatically labeled triples are determined from the triples that the preset model produced during testing, the knowledge data obtained in this embodiment includes both the triples in the training samples and the triples derived from the testing process. The knowledge data obtained in this embodiment is therefore more comprehensive.
Beneficial effect 2.
In the prior art, triples are identified from documents manually to serve as knowledge data. The embodiments of this application instead identify triples from documents automatically and semi-automatically, and can therefore improve the speed and efficiency of knowledge data extraction.
Figure 4 shows a data processing device provided by an embodiment of this application, including: a first obtaining module 401, a determining module 402, and an execution module 403.
The first obtaining module 401 is configured to obtain the manually labeled triples based on the labels manually added to the characters in the first set of documents. The determining module 402 is configured to determine the automatically labeled triples according to the triples that the preset model identified from the second set of documents. The preset model is a preset model adapted to the type of the second set of documents and is trained with training data that includes the manually labeled triples and the first set of documents. The execution module 403 is configured to use the manually labeled triples and the automatically labeled triples as the knowledge data extracted from the documents.
Optionally, the device further includes a second obtaining module 404, configured to obtain the labels manually added to the characters in the first set of documents.
The second obtaining module 404 is specifically configured to: display a list of candidate entity labels in response to an operation of manually selecting characters in the first set of documents, where the candidate entity labels are determined by the business requirements of the field to which the first set of documents and the second set of documents belong; take the label manually selected from the list of candidate entity labels as the entity label of the selected characters, yielding labeled characters; display a list of candidate relations between entity labels in response to an operation of manually selecting the labeled characters, where the candidate relations between entity labels are determined by the business requirements of the field to which the first set of documents and the second set of documents belong; and take the relation manually selected from the list of candidate relations as the relation label of the selected labeled characters.
Optionally, the training data further includes at least one of the following: the positions of the elements of the manually labeled triples in the first set of documents, the inter-group distance of those elements in the first set of documents, and the inter-group grammatical relation of those elements in the first set of documents.
Optionally, the determining module 402 being configured to determine the automatically labeled triples according to the triples that the preset model identified from the second set of documents includes: the determining module 402 is specifically configured to mark target triples, where a target triple is at least one of the following: triples that contradict each other among the triples identified by the preset model from the second set of documents; triples, among those identified by the preset model from the second set of documents, that contradict the manually labeled triples; and triples with a missing item. The determining module 402 then obtains the manual corrections of the target triples as the automatically labeled triples.
Optionally, the device further includes a training module 405, configured to retrain the model using the corrections of the target triples and the second set of documents.
Optionally, the device further includes an adapted-model determining module 406, configured to take the model that, during training, identifies triples from the first set of documents with the highest accuracy as the model adapted to the type of the second set of documents, where the type of the first set of documents is the same as the type of the second set of documents.
The data extraction device includes a processor and a memory. The first obtaining module, the determining module, the execution module, and the other modules described above are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the utilization of useful information in documents is improved by adjusting kernel parameters.
An embodiment of the present invention provides a storage medium on which a program is stored; the data extraction method is implemented when the program is executed by a processor.
An embodiment of the present invention provides a processor configured to run a program, where the data extraction method is executed when the program runs.
An embodiment of the present invention provides a device that includes at least one processor, at least one memory connected to the processor, and a bus. The processor and the memory communicate with each other through the bus, and the processor is configured to call program instructions in the memory to execute the data extraction method described above. The device here may be a server, a PC, a tablet (PAD), a mobile phone, or the like.
This application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
obtaining the manually labeled triples based on the labels manually added to the characters in the first set of documents;
determining the automatically labeled triples according to the triples that the preset model identified from the second set of documents, where the preset model is a preset model adapted to the type of the second set of documents and is trained with training data that includes the manually labeled triples and the first set of documents;
using the manually labeled triples and the automatically labeled triples as the knowledge data extracted from the documents.
The process of obtaining the labels manually added to the characters in the first set of documents includes:
displaying a list of candidate entity labels in response to an operation of manually selecting characters in the first set of documents, where the candidate entity labels are determined by the business requirements of the field to which the first set of documents and the second set of documents belong;
taking the label manually selected from the list of candidate entity labels as the entity label of the selected characters, yielding labeled characters;
displaying a list of candidate relations between entity labels in response to an operation of manually selecting the labeled characters, where the candidate relations between entity labels are determined by the business requirements of the field to which the first set of documents and the second set of documents belong;
taking the relation manually selected from the list of candidate relations as the relation label of the selected labeled characters.
The training data further includes at least one of the following: the positions of the elements of the manually labeled triples in the first set of documents, the inter-group distance of those elements in the first set of documents, and the inter-group grammatical relation of those elements in the first set of documents.
Determining the automatically labeled triples according to the triples that the preset model identified from the second set of documents includes:
marking target triples, where a target triple is at least one of the following: triples that contradict each other among the triples identified by the preset model from the second set of documents; triples, among those identified by the preset model from the second set of documents, that contradict the manually labeled triples; and triples with a missing item;
obtaining the manual corrections of the target triples as the automatically labeled triples.
The model is retrained using the corrections of the target triples and the second set of documents.
The type of the first set of documents is the same as the type of the second set of documents.
The process of determining the model adapted to the type of the second set of documents includes: taking the model that, during training, identifies triples from the first set of documents with the highest accuracy as the model adapted to the type of the second set of documents.
This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, the device includes one or more processors (CPUs), a memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like, as shown in Figure 5.
The memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, product, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, product, or device that includes the element.
Those skilled in the art should understand that the embodiments of this application can be provided as a method, a system, or a computer program product. This application can therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The above are only embodiments of this application and are not intended to limit it. Those skilled in the art may make various modifications and changes to this application. Any modification, equivalent replacement, or improvement made within the spirit and principle of this application shall fall within the scope of its claims.

Claims (10)

  1. A data extraction method, characterized in that it comprises:
    obtaining manually labeled triples based on labels manually added to characters in a first set of documents;
    determining automatically labeled triples according to triples identified from a second set of documents by a preset model, wherein the preset model is a preset model adapted to the type of the second set of documents, the model is trained with training data, and the training data includes the manually labeled triples and the first set of documents;
    using the manually labeled triples and the automatically labeled triples as knowledge data extracted from the documents.
  2. The method according to claim 1, characterized in that the process of obtaining the labels manually added to the characters in the first set of documents comprises:
    displaying a list of candidate entity labels in response to an operation of manually selecting characters in the first set of documents, the candidate entity labels being determined by the business requirements of the field to which the first set of documents and the second set of documents belong;
    taking the label manually selected from the list of candidate entity labels as the entity label of the selected characters, yielding labeled characters;
    displaying a list of candidate relations between entity labels in response to an operation of manually selecting the labeled characters, the candidate relations between entity labels being determined by the business requirements of the field to which the first set of documents and the second set of documents belong;
    taking the relation manually selected from the list of candidate relations as the relation label of the selected labeled characters.
  3. The method according to claim 1, characterized in that the training data further includes at least one of the following:
    the positions of the elements of the manually labeled triples in the first set of documents, the inter-group distance of the elements of the manually labeled triples in the first set of documents, and the inter-group grammatical relation of the elements of the manually labeled triples in the first set of documents.
  4. The method according to claim 1, characterized in that determining the automatically labeled triples according to the triples identified from the second set of documents by the preset model comprises:
    marking target triples, a target triple being at least one of the following: triples that contradict each other among the triples identified by the preset model from the second set of documents; triples, among those identified by the preset model from the second set of documents, that contradict the manually labeled triples; and triples with a missing item;
    obtaining manual corrections of the target triples as the automatically labeled triples.
  5. The method according to claim 4, characterized in that the model is retrained using the corrections of the target triples and the second set of documents.
  6. The method according to claim 1, characterized in that the type of the first set of documents is the same as the type of the second set of documents;
    the process of determining the model adapted to the type of the second set of documents comprises:
    taking the model that, during training, identifies triples from the first set of documents with the highest accuracy as the model adapted to the type of the second set of documents.
  7. A data extraction device, characterized by comprising:
    a first acquisition module, configured to obtain manually labeled triples based on the labels manually added to characters in a first set of documents;
    a determining module, configured to determine automatically labeled triples according to the triples identified by a preset model from a second set of documents, wherein the preset model is a preset model adapted to the type of the second set of documents, the model is obtained by training on training data, and the training data comprises the manually labeled triples and the first set of documents;
    an execution module, configured to use the manually labeled triples and the automatically labeled triples as the knowledge data extracted from the documents.
  8. The device according to claim 7, characterized by further comprising: a second acquisition module, configured to obtain the labels manually added to the characters in the first set of documents;
    the second acquisition module is specifically configured to display, in response to an operation of manually selecting characters in the first set of documents, a list of candidate entity labels, the candidate entity labels being determined according to the business requirements of the domain to which the first set of documents and the second set of documents belong;
    take the label manually selected from the list of candidate entity labels as the entity label of the selected characters, thereby obtaining labeled characters;
    display, in response to an operation of manually selecting the labeled characters, a list of candidate relationships between entity labels, the candidate relationships being determined according to the business requirements of the domain to which the first set of documents and the second set of documents belong;
    take the relationship manually selected from the list of candidate relationships as the relationship label of the selected labeled characters.
  9. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program executes the data extraction method according to any one of claims 1 to 6.
  10. A device, characterized by comprising: a processor, a memory, and a bus, the processor and the memory being connected through the bus;
    the memory is configured to store a program and the processor is configured to run the program, wherein the program, when run, executes the data extraction method according to any one of claims 1 to 6.
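
As an illustration only, the target-triple selection of claim 4 and the model selection of claim 6 might be sketched in Python as follows. The claims do not define when two triples "contradict" each other, so this sketch assumes that two triples conflict when they link the same head and tail entities with different relations, and that a triple has a missing item when any of its three slots is empty. The names Triple, is_incomplete, find_target_triples and select_model are hypothetical and do not come from the patent.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Triple(NamedTuple):
    """Hypothetical (head entity, relation, tail entity) record."""
    head: Optional[str]
    relation: Optional[str]
    tail: Optional[str]


def is_incomplete(t: Triple) -> bool:
    # Assumption: a "missing-item" triple has at least one empty slot.
    return not (t.head and t.relation and t.tail)


def find_target_triples(predicted: List[Triple],
                        manual: List[Triple]) -> List[Triple]:
    """Return the model-predicted triples to route to a human for correction."""
    # Collect the relations seen for each (head, tail) pair.
    predicted_rel = defaultdict(set)
    for t in predicted:
        if not is_incomplete(t):
            predicted_rel[(t.head, t.tail)].add(t.relation)
    manual_rel = defaultdict(set)
    for t in manual:
        manual_rel[(t.head, t.tail)].add(t.relation)

    targets = []
    for t in predicted:
        key = (t.head, t.tail)
        if is_incomplete(t):
            targets.append(t)   # missing-item triple
        elif len(predicted_rel[key]) > 1:
            targets.append(t)   # conflicts with another predicted triple
        elif key in manual_rel and t.relation not in manual_rel[key]:
            targets.append(t)   # conflicts with a manually labeled triple
    return targets


def select_model(accuracy_by_model: Dict[str, float]) -> str:
    """Pick the model with the highest triple accuracy on the first set."""
    return max(accuracy_by_model, key=accuracy_by_model.get)
```

In such a sketch, the triples returned by find_target_triples would be shown to an annotator, and the corrected results, together with the second set of documents, could then be used to retrain the model as in claim 5.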
PCT/CN2020/071879 2019-08-26 2020-01-14 Data extraction method and device, storage medium and equipment WO2021036181A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910789378.7 2019-08-26
CN201910789378.7A CN111475641B (en) 2019-08-26 2019-08-26 Data extraction method and device, storage medium and equipment

Publications (1)

Publication Number Publication Date
WO2021036181A1 (en) 2021-03-04

Family

ID=71744906

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/071879 WO2021036181A1 (en) 2019-08-26 2020-01-14 Data extraction method and device, storage medium and equipment

Country Status (2)

Country Link
CN (1) CN111475641B (en)
WO (1) WO2021036181A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117332761B (en) * 2023-11-30 2024-02-09 北京一标数字科技有限公司 PDF document intelligent identification marking system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649264A (en) * 2016-11-21 2017-05-10 中国农业大学 Text information-based Chinese fruit variety information extracting method and device
CN107908671A (en) * 2017-10-25 2018-04-13 南京擎盾信息科技有限公司 Knowledge mapping construction method and system based on law data
CN108984683A (en) * 2018-06-29 2018-12-11 北京百度网讯科技有限公司 Extracting method, system, equipment and the storage medium of structural data
CN109471948A (en) * 2018-11-08 2019-03-15 威海天鑫现代服务技术研究院有限公司 A kind of the elder's health domain knowledge question answering system construction method
US20190155898A1 (en) * 2017-11-23 2019-05-23 Beijing Baidu Netcom Science And Technology Co. Ltd. Method and device for extracting entity relation based on deep learning, and server

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101662433B1 (en) * 2015-03-09 2016-10-05 포항공과대학교 산학협력단 Method and apparatus for expanding knowledge base using open information extraction
CN107291708A (en) * 2016-03-30 2017-10-24 《中国学术期刊(光盘版)》电子杂志社有限公司 A kind of method of text based automatic identification literature research
CN108090070B (en) * 2016-11-22 2021-08-24 湖南四方天箭信息科技有限公司 Chinese entity attribute extraction method
CN108090499B (en) * 2017-11-13 2020-08-11 中国科学院自动化研究所 Data active labeling method and system based on maximum information triple screening network
US10572801B2 (en) * 2017-11-22 2020-02-25 Clinc, Inc. System and method for implementing an artificially intelligent virtual assistant using machine learning
EP3495968A1 (en) * 2017-12-11 2019-06-12 Tata Consultancy Services Limited Method and system for extraction of relevant sections from plurality of documents
CN108595460A (en) * 2018-01-05 2018-09-28 中译语通科技股份有限公司 Multichannel evaluating method and system, the computer program of keyword Automatic
CN108256063B (en) * 2018-01-15 2020-11-03 中国人民解放军国防科技大学 Knowledge base construction method for network security
CN108182295B (en) * 2018-02-09 2021-09-10 重庆电信系统集成有限公司 Enterprise knowledge graph attribute extraction method and system
CN108920465A (en) * 2018-07-13 2018-11-30 福州大学 A kind of agriculture field Relation extraction method based on syntactic-semantic
CN109492686A (en) * 2018-11-01 2019-03-19 郑州云海信息技术有限公司 A kind of picture mask method and system
CN109472033B (en) * 2018-11-19 2022-12-06 华南师范大学 Method and system for extracting entity relationship in text, storage medium and electronic equipment
CN109543047A (en) * 2018-11-21 2019-03-29 焦点科技股份有限公司 A kind of knowledge mapping construction method based on medical field website
CN109378053B (en) * 2018-11-30 2021-07-06 安徽影联云享医疗科技有限公司 Knowledge graph construction method for medical image
CN110110327B (en) * 2019-04-26 2021-06-22 网宿科技股份有限公司 Text labeling method and equipment based on counterstudy

Also Published As

Publication number Publication date
CN111475641B (en) 2021-05-14
CN111475641A (en) 2020-07-31

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 20858140; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 20858140; Country of ref document: EP; Kind code of ref document: A1