WO2021136318A1 - Method and device for generating a digital humanities-oriented email history event axis

Info

Publication number
WO2021136318A1
WO2021136318A1 PCT/CN2020/141129 CN2020141129W WO2021136318A1 WO 2021136318 A1 WO2021136318 A1 WO 2021136318A1 CN 2020141129 W CN2020141129 W CN 2020141129W WO 2021136318 A1 WO2021136318 A1 WO 2021136318A1
Authority
WO
WIPO (PCT)
Prior art keywords
email
mail
oriented
hash
humanities
Prior art date
Application number
PCT/CN2020/141129
Other languages
English (en)
French (fr)
Inventor
林延中
杨芸
朱南皓
潘文辉
彭文浩
许佳柱
Original Assignee
论客科技(广州)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 论客科技(广州)有限公司
Publication of WO2021136318A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/432 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/44 Browsing; Visualisation therefor
    • G06F16/447 Temporal browsing, e.g. timeline
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/216 Handling conversation history, e.g. grouping of messages in sessions or threads
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/42 Mailbox-related aspects, e.g. synchronisation of mailboxes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to the field of mail retrieval, in particular to a method, device, terminal equipment and readable storage medium for generating a digital humanities-oriented email history event axis.
  • as for extracting email information to build a digital humanities-oriented historical event axis, the "semantic gap" remains a difficult problem. Applying deep learning to email retrieval has produced a large body of advanced work on feature learning and representation for bridging the "media gap" between heterogeneous data.
  • however, traditional retrieval techniques cannot effectively capture, from the perspective of digital humanities, the spatial and temporal nature of the historical information contained in emails, and cannot learn powerful feature representations and cross-modal embeddings. They therefore cannot generate the high-quality, compact hash codes needed for digital humanities research, which leads to poor retrieval results.
  • the technical problem to be solved by the embodiments of the present invention is to provide a method, device, terminal device, and readable storage medium for generating a digital humanities-oriented email history event axis, which can apply deep learning to email retrieval and generate an email history event axis from the retrieved emails, thereby effectively assisting digital humanities research and the preservation and utilization of related electronic historical materials.
  • an embodiment of the present invention provides a method for generating a digital humanities-oriented email history event axis, including:
  • establishing a deep hash model and training it with a preset training data set, wherein each training data sample in the training data set includes multiple labels;
  • hash-encoding all pre-archived emails with the trained deep hash model, and storing each resulting mail hash code in the mail historical database in association with its email;
  • generating a search hash value from the obtained search sentence with the trained deep hash model, and searching the mail historical database by the search hash value to extract the target emails;
  • generating an email history event axis in chronological order according to the archiving time of each target email.
  • the method further includes:
  • All emails in the target mail system are archived in real time, and each email is time-marked and stored according to the mail receiving time or the mail sending time.
  • the establishment of a deep hash model further includes:
  • a preset loss function is added to the deep hash model to restrict the output result of the deep hash model in the training process through the loss function.
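  • the training of the deep hash model using a preset training data set specifically includes:
  • defining a similarity determination rule between data according to the co-occurrence relations among the multi-label data in the training data set, and training the deep hash model with this similarity determination rule as supervision information.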
  • the present invention also provides a digital humanities-oriented email history event axis generation device, including:
  • the model training module is used to establish a deep hash model and use a preset training data set to train the deep hash model; wherein each training data sample in the training data set includes multiple tags;
  • the mail encoding module is used to hash all pre-filed emails using the trained deep hash model, and associate the obtained mail hash codes with the corresponding mails and store them in the mail historical database;
  • the mail search module is used to generate a search hash value according to the obtained search sentence by using the trained deep hash model, and search the mail history database according to the search hash value to extract the target email;
  • the event axis generating module is used to generate an email history event axis in chronological order according to the archiving time of each target email.
  • the digital humanities-oriented e-mail history event axis generating device also includes a mail archiving module, which is used to archive all e-mails in the target mail system in real time, and to time-stamp and store each e-mail according to its receiving time or sending time.
  • the present invention also provides a digital humanities-oriented e-mail history event axis generation terminal device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the memory is coupled to the processor, and when the processor executes the computer program, any one of the digital humanities-oriented email history event axis generation methods is implemented.
  • the present invention also provides a computer-readable storage medium that stores a computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute any of the digital humanities-oriented e-mail history event axis generation methods described above.
  • the present invention has the following beneficial effects:
  • the embodiment of the present invention provides a method, device, terminal device, and readable storage medium for generating a digital humanities-oriented email history event axis.
  • the method includes: establishing a deep hash model and training it with a preset training data set, wherein each training data sample in the training data set includes multiple labels; hash-encoding all pre-archived emails with the trained deep hash model, and storing each resulting mail hash code in the mail historical database in association with its email; generating a search hash value from the obtained search sentence with the trained deep hash model, and searching the mail historical database by the search hash value to extract the target emails; and generating an email history event axis in chronological order according to the archiving time of each target email.
  • the invention can apply deep learning to e-mail retrieval and generate an e-mail history event axis from the retrieved e-mails, thereby effectively assisting digital humanities research, the preservation and utilization of related electronic historical materials, and humanities scholarship.
  • FIG. 1 is a schematic flowchart of a method for generating a digital humanities-oriented email history event axis according to an embodiment of the present invention
  • Fig. 2 is a schematic structural diagram of a digital humanities-oriented email history event axis generating device provided by an embodiment of the present invention.
  • an embodiment of the present invention provides a method for generating a digital humanities-oriented email history event axis, which includes the following steps:
  • said establishing a deep hash model further includes:
  • a preset loss function is added to the deep hash model to restrict the output result of the deep hash model in the training process through the loss function.
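  • the training of the deep hash model using a preset training data set is specifically:
  • defining a similarity determination rule between data according to the co-occurrence relations among the multi-label data in the training data set, and training the deep hash model with this similarity determination rule as supervision information.
  • before step S1, the method further includes: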
  • All emails in the target mail system are archived in real time, and each email is time-marked and stored according to the mail receiving time or the mail sending time.
  • the embodiment of the present invention provides a digital humanities-oriented email history event axis generation method that defines the similarity between data through the co-occurrence relations among multi-label data and uses this similarity as the supervision information for network training.
  • a loss function suited to digital humanities features is designed to train the network.
  • the trained model is used to extract feature vectors from the email historical materials and complete the retrieval process; the results are presented as a historical event axis about the keyword, ultimately assisting research in the digital humanities field.
  • This solution provides a digital humanities-oriented email history event axis generation method.
  • First, it overcomes the weak representational power of traditional hand-crafted features and fully considers the complex similarity relations among the cross-modal data contained in emails, so that the learned hash codes retain more semantic information; this reduces the heterogeneity in cross-modal retrieval and improves retrieval accuracy. Second, it takes into account the spatio-temporal nature of digital humanities research content to produce results of value to humanities research.
  • the purpose of the present invention is to overcome the shortcomings of the prior art: it combines feature representation based on deep learning, considers the multi-layer semantic similarity of the different modalities of email data, applies a hashing method, and learns the mapping from data to hash codes through network training. This provides a retrieval method with higher accuracy and applies cross-modal retrieval technology to the digital humanities, yielding results with temporal and spatial characteristics that benefit humanities research and supporting the standardized preservation, study, and use of e-mail as electronic historical material.
  • the solution of the present invention proposes a new perspective: starting from the digital humanities, it combines email data with history and constructs the e-mail historical event axis from retrieval results that carry digital humanities features (for example: economics, politics, history, geography, war, law, family affection, music, fine arts, architecture, etc.), thereby standardizing e-mail as a kind of electronic historical material, assisting digital humanities research and the preservation and utilization of related electronic historical materials, and supporting humanities scholarship.
  • the electronic historical database of the solution of the present invention can be understood as a historical database storing all emails about a certain organization or a certain person in a certain time dimension.
  • the concept of a historical event here is not the kind of concept we currently associate with events such as the "An Lushan Rebellion" (安史之乱) or the "Upheaval of the Five Barbarians" (五胡乱华).
  • the purpose of the present invention is to bind mail tightly to the digital humanities.
  • the historical concept referred to is this: when future historians study contemporary history (such as the history of a company's development or the biography of a notable person), all e-mails serve as historical material. All related original e-mail materials are searched as needed by keywords of the research content, and the relevant techniques then form a historical event axis about that keyword.
  • Step 1 Archive the discrete emails to form a database of original email historical data.
  • Step 2 Design a deep hash model, use transfer learning from a pre-trained model to train the deep hash model on email historical data, and design a loss function:
  • Step 3 Using the trained model, generate the corresponding binary hash code for every email through the deep hash model. In the subsequent retrieval process, generate a retrieval hash value for the specific input search terms through the deep hash model, and use that hash value to retrieve the mail codes quickly.
  • Step 4 Generate a time axis in chronological order according to the search result and the time of archiving the email.
  • the present invention also provides a digital humanities-oriented email history event axis generation device, including:
  • the model training module 1 is used to establish a deep hash model and use a preset training data set to train the deep hash model; wherein each training data sample in the training data set includes multiple tags;
  • the mail encoding module 2 is used for hash encoding all pre-filed emails using the trained deep hash model, and associate the obtained mail hash codes with the corresponding mails and store them in the mail historical database;
  • the mail search module 3 is configured to generate a search hash value according to the obtained search sentence by using the trained deep hash model, and search the mail history database according to the search hash value to extract the target email;
  • the event axis generating module 4 is used to generate an email history event axis in chronological order according to the archiving time of each target email.
  • the digital humanities-oriented e-mail history event axis generating device also includes a mail archiving module, which is used to archive all e-mails in the target mail system in real time, and to time-stamp and store each e-mail according to its receiving time or sending time.
  • the digital humanities-oriented email history event axis generation device provided by the embodiment of the present invention can implement any method of the present invention.
  • This embodiment provides a method for generating a digital humanities-oriented email history event axis.
  • the present invention also provides a digital humanities-oriented e-mail history event axis generation terminal device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the memory is coupled to the processor, and when the processor executes the computer program, any one of the digital humanities-oriented email history event axis generation methods is implemented.
  • the digital humanities-oriented email history event axis generating terminal device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor can be a microprocessor or any conventional processor.
  • the processor is the control center of the digital humanities-oriented email history event axis generation terminal device and uses various interfaces and lines to connect the parts of the entire terminal device.
  • the memory may mainly include a program storage area and a data storage area.
  • the program storage area may store an operating system, an application program required by at least one function, and the like; the data storage area may store data created according to the use of a mobile phone.
  • the memory can include high-speed random access memory, and can also include non-volatile memory, such as hard disks, memory, plug-in hard disks, smart media cards (SMC), secure digital (SD) cards, flash cards, at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
  • the present invention also provides a computer-readable storage medium that stores a computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute any of the digital humanities-oriented e-mail history event axis generation methods described above.
  • the computer program can be stored in a computer-readable storage medium, and when the computer program is executed by a processor, it can implement the steps of the foregoing method embodiments.
  • the computer program includes computer program code
  • the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signal, telecommunications signal, software distribution media, etc.
  • the content contained in the computer-readable medium can be appropriately added or deleted according to the requirements of the legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
  • the device embodiments described above are only illustrative. The units described as separate parts may or may not be physically separated, and the parts displayed as units may or may not be physical units; that is, they can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the connection relationship between the modules indicates that they have a communication connection between them, which can be specifically implemented as one or more communication buses or signal lines.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

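A method, device, terminal device, and readable storage medium for generating a digital humanities-oriented email history event axis. The method comprises: establishing a deep hash model and training it with a training data set (S1); hash-encoding pre-archived emails with the trained deep hash model, and storing each resulting email hash code in an email historical-materials database in association with its email (S2); generating a search hash value from an obtained search query with the trained deep hash model and performing a search to extract target emails (S3); and generating an email history event axis in chronological order according to the archiving time of each target email (S4). The method applies deep learning to email retrieval and generates an email history event axis from the retrieved emails, thereby effectively supporting digital humanities research, the preservation and use of related electronic historical materials, and humanities scholarship.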

Description

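Method and device for generating a digital humanities-oriented email history event axis
Technical Field
The present invention relates to the field of mail retrieval, and in particular to a method, device, terminal device, and readable storage medium for generating a digital humanities-oriented email history event axis.
Background
In the digital humanities, traditional research in the humanities and history has also been influenced by technologies such as artificial intelligence and deep learning, giving rise to a new trend of technology-assisted humanities research. Traditional historical research relies on paper historical materials. With the spread of the Internet, future historians' study of contemporary history will shift from traditional paper materials to electronic historical materials. Email, as the new form of the traditional letter in the information society, can help various groups carry out research on all aspects of contemporary history and provide a legal basis for real-world conduct.
Because different data are heterogeneous and traditional hand-crafted features have limited representational power, traditional cross-modal retrieval methods cannot effectively reduce that heterogeneity by exploiting the associations among multimodal data, and therefore cannot achieve better results.
Meanwhile, as for extracting email information to build a digital humanities-oriented historical event axis, the "semantic gap" remains a difficult problem. Applying deep learning to email retrieval has produced a large body of advanced work on feature learning and representation for bridging the "media gap" between heterogeneous data. However, traditional retrieval techniques cannot effectively capture, from the perspective of digital humanities, the spatial and temporal nature of the historical information contained in emails, and cannot learn powerful feature representations and cross-modal embeddings. They therefore cannot generate the high-quality, compact hash codes needed for digital humanities research, which leads to poor retrieval results.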
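Summary of the Invention
The technical problem to be solved by the embodiments of the present invention is to provide a method, device, terminal device, and readable storage medium for generating a digital humanities-oriented email history event axis, which can apply deep learning to email retrieval and generate an email history event axis from the retrieved emails, thereby effectively supporting digital humanities research and the preservation and use of related electronic historical materials.
To solve the above technical problem, an embodiment of the present invention provides a method for generating a digital humanities-oriented email history event axis, comprising:
establishing a deep hash model and training the deep hash model with a preset training data set, wherein each training data sample in the training data set includes multiple labels;
hash-encoding all pre-archived emails with the trained deep hash model, and storing each resulting email hash code in an email historical-materials database in association with its email;
generating a search hash value from an obtained search query with the trained deep hash model, and searching the email historical-materials database by the search hash value to extract target emails;
generating an email history event axis in chronological order according to the archiving time of each target email.
Further, before establishing the deep hash model, the method also comprises:
archiving all emails in a target mail system in real time, and time-stamping and storing each email according to its receiving time or sending time.
Further, after time-stamping each email, the method also comprises:
encrypting and storing all archived emails according to a preset encryption algorithm.
Further, establishing the deep hash model also comprises:
adding a preset loss function to the deep hash model, so that the loss function constrains the outputs of the deep hash model during training.
Further, training the deep hash model with the preset training data set specifically comprises:
defining a similarity determination rule between data according to the co-occurrence relations among the multi-label data in the training data set;
training the deep hash model with the similarity determination rule as supervision information.
To solve the same technical problem, the present invention also provides a device for generating a digital humanities-oriented email history event axis, comprising:
a model training module, configured to establish a deep hash model and train the deep hash model with a preset training data set, wherein each training data sample in the training data set includes multiple labels;
a mail encoding module, configured to hash-encode all pre-archived emails with the trained deep hash model, and to store each resulting email hash code in an email historical-materials database in association with its email;
a mail search module, configured to generate a search hash value from an obtained search query with the trained deep hash model, and to search the email historical-materials database by the search hash value to extract target emails;
an event axis generation module, configured to generate an email history event axis in chronological order according to the archiving time of each target email.
Further, the device also comprises a mail archiving module, configured to archive all emails in a target mail system in real time, and to time-stamp and store each email according to its receiving time or sending time.
Further, after time-stamping each email, the device also performs:
encrypting and storing all archived emails according to a preset encryption algorithm.
To solve the same technical problem, the present invention also provides a terminal device for generating a digital humanities-oriented email history event axis, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the memory is coupled to the processor, and the processor, when executing the computer program, implements any one of the described methods for generating a digital humanities-oriented email history event axis.
To solve the same technical problem, the present invention also provides a computer-readable storage medium storing a computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute any one of the described methods for generating a digital humanities-oriented email history event axis.
Compared with the prior art, the present invention has the following beneficial effects:
The embodiments of the present invention provide a method, device, terminal device, and readable storage medium for generating a digital humanities-oriented email history event axis. The method comprises: establishing a deep hash model and training it with a preset training data set, wherein each training data sample in the training data set includes multiple labels; hash-encoding all pre-archived emails with the trained deep hash model, and storing each resulting email hash code in an email historical-materials database in association with its email; generating a search hash value from an obtained search query with the trained deep hash model, and searching the email historical-materials database by the search hash value to extract target emails; and generating an email history event axis in chronological order according to the archiving time of each target email. The invention can apply deep learning to email retrieval and generate an email history event axis from the retrieved emails, thereby effectively supporting digital humanities research, the preservation and use of related electronic historical materials, and humanities scholarship.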
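Brief Description of the Drawings
Fig. 1 is a schematic flowchart of the method for generating a digital humanities-oriented email history event axis provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the device for generating a digital humanities-oriented email history event axis provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.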
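Referring to Fig. 1, an embodiment of the present invention provides a method for generating a digital humanities-oriented email history event axis, comprising the steps of:
S1. establishing a deep hash model and training the deep hash model with a preset training data set, wherein each training data sample in the training data set includes multiple labels;
In this embodiment, further, establishing the deep hash model also comprises:
adding a preset loss function to the deep hash model, so that the loss function constrains the outputs of the deep hash model during training.
In this embodiment, further, training the deep hash model with the preset training data set specifically comprises:
defining a similarity determination rule between data according to the co-occurrence relations among the multi-label data in the training data set;
training the deep hash model with the similarity determination rule as supervision information.
S2. hash-encoding all pre-archived emails with the trained deep hash model, and storing each resulting email hash code in an email historical-materials database in association with its email;
S3. generating a search hash value from an obtained search query with the trained deep hash model, and searching the email historical-materials database by the search hash value to extract target emails;
S4. generating an email history event axis in chronological order according to the archiving time of each target email.
Further, before step S1, the method also comprises:
archiving all emails in a target mail system in real time, and time-stamping and storing each email according to its receiving time or sending time.
Further, after time-stamping each email, the method also comprises:
encrypting and storing all archived emails according to a preset encryption algorithm.
It should be noted that this embodiment provides a method for generating a digital humanities-oriented email history event axis that defines the similarity between data through the co-occurrence relations among multi-label data and uses this similarity as the supervision information for network training. A loss function suited to digital humanities features is designed to train the network. The trained model is then used to extract feature vectors from the email historical materials and complete the retrieval process; the retrieval results are presented as a history event axis about the keyword, ultimately assisting research in the digital humanities.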
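The text does not spell out the similarity determination rule, so the following is only a minimal sketch of one plausible reading: two samples count as more similar the more labels they share. The Jaccard overlap used here, the function names, and the sample labels are all assumptions made for illustration, not the patent's rule.

```python
from itertools import combinations


def label_similarity(labels_a: set[str], labels_b: set[str]) -> float:
    """Jaccard overlap of two label sets (assumed rule, not the patent's).

    0.0 means no shared label, 1.0 means identical label sets.
    """
    if not labels_a and not labels_b:
        return 0.0
    return len(labels_a & labels_b) / len(labels_a | labels_b)


def pairwise_supervision(samples: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Build a supervision signal for every unordered pair of samples."""
    return {
        (a, b): label_similarity(samples[a], samples[b])
        for a, b in combinations(sorted(samples), 2)
    }


# Hypothetical multi-label training samples.
samples = {
    "mail_1": {"economics", "law"},
    "mail_2": {"economics", "politics"},
    "mail_3": {"music"},
}
supervision = pairwise_supervision(samples)
```

A graded score like this lets pairs that share some but not all labels supervise the model more softly than a binary similar/dissimilar flag would.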
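Compared with the prior art, the principle and advantages of this solution are as follows:
This solution provides a method for generating a digital humanities-oriented email history event axis. First, it overcomes the weak representational power of traditional hand-crafted features and fully considers the complex similarity relations among the cross-modal data contained in emails, so that the learned hash codes retain more semantic information; this reduces the heterogeneity in cross-modal retrieval and improves retrieval accuracy. Second, it takes into account the spatio-temporal nature of digital humanities research content to produce results of value to humanities research.
It should be noted that the purpose of the present invention is to overcome the shortcomings of the prior art. It combines feature representation based on deep learning, considers the multi-layer semantic similarity of the different modalities of email data, applies a hashing method, and learns the mapping from data to hash codes through network training. This provides a retrieval method with higher accuracy and applies cross-modal retrieval to the digital humanities, yielding results with spatio-temporal characteristics that benefit humanities research and supporting the standardized preservation, study, and use of email as electronic historical material.
Existing technical solutions for retrieving email data mostly target spam detection and recognition, that is, detection over text plus images. This solution breaks out of the spam-detection mindset: targeting Chinese-language email and building on a massive base of Chinese samples accumulated over many years, it forms an electronic historical-materials database whose data are complete and tamper-proof. The solution proposes a new perspective: starting from the digital humanities, it combines email data with history and constructs the email history event axis from retrieval results that carry digital humanities features (for example: economics, politics, history, geography, war, law, family affection, music, fine arts, architecture, and so on), thereby standardizing email as a kind of electronic historical material, assisting digital humanities research and the preservation and use of related electronic historical materials, and supporting humanities scholarship.
It should be noted that the electronic historical-materials database of this solution can be understood as a repository storing all emails about a given organization or person over a given span of time. A "historical event" here is not the kind of concept we currently associate with events such as the "An Lushan Rebellion" (安史之乱) or the "Upheaval of the Five Barbarians" (五胡乱华). The aim of the invention is to bind email tightly to the digital humanities. The historical concept referred to is this: when future historians study contemporary history (such as the history of a company's development or the biography of a notable person), all emails serve as historical material; all related original email materials are searched as needed by keywords of the research content, and the relevant techniques then form a history event axis about that keyword.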
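The solution of the present invention is described in detail below with a specific example:
Step 1. Archive the discrete emails to form a database of original email historical materials.
The discrete emails are archived through an archiving system: all received, sent, and internally exchanged emails are archived and classified in real time, and the archived emails are stored encrypted with an encryption algorithm, so that the archived content cannot be tampered with as electronic historical material.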
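The archiving step names neither the encryption algorithm nor the record layout. As a hedged illustration of the tamper-proofing goal only, the sketch below time-stamps each archived email and attaches an HMAC-SHA256 integrity tag. Note the substitution: this is tamper evidence rather than encryption, and `ArchivedEmail`, `archive`, `is_untampered`, and the key handling are hypothetical names and choices, not part of the source.

```python
import hashlib
import hmac
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ArchivedEmail:
    message_id: str
    body: str
    archived_at: datetime  # the mail's receiving or sending time
    tag: str               # hex HMAC-SHA256 over the fields above


def _digest(key: bytes, message_id: str, body: str, archived_at: datetime) -> str:
    payload = "\n".join([message_id, archived_at.isoformat(), body]).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def archive(key: bytes, message_id: str, body: str, timestamp: datetime) -> ArchivedEmail:
    """Time-stamp an email and seal it with an integrity tag."""
    return ArchivedEmail(message_id, body, timestamp, _digest(key, message_id, body, timestamp))


def is_untampered(key: bytes, record: ArchivedEmail) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = _digest(key, record.message_id, record.body, record.archived_at)
    return hmac.compare_digest(expected, record.tag)


record = archive(b"demo-key", "m1", "Board meeting moved to Friday.",
                 datetime(2019, 12, 30, 9, 0, tzinfo=timezone.utc))
```

In a real archive the key would live outside the mail store, and confidentiality would still need an actual cipher; the tag only makes later edits detectable.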
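Step 2. Design a deep hash model, use transfer learning from a pre-trained model to train the deep hash model on the email historical materials, and design a loss function:
Figure PCTCN2020141129-appb-000001
This loss function ensures that, during training, the outputs b1 and b2 for every pair of segmented email-body tokens stay close to ±1.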
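The loss formula itself survives only as a figure reference, so it cannot be reproduced here. The sketch below is not the patent's formula; it shows one common pairwise formulation from the deep-hashing literature that has the stated property of keeping the outputs near ±1: a contrastive term on the pair plus a quantization penalty on each output. The `margin` and `alpha` parameters and their defaults are assumptions.

```python
def pairwise_hash_loss(b1: list[float], b2: list[float], similar: bool,
                       margin: float = 4.0, alpha: float = 0.01) -> float:
    """Illustrative pairwise hashing loss (assumed form, not the patent's).

    - similar pairs are pulled together (squared distance is penalized);
    - dissimilar pairs are pushed apart until they are `margin` away;
    - every output component is penalized for straying from +1 or -1.
    """
    sq_dist = sum((x - y) ** 2 for x, y in zip(b1, b2))
    if similar:
        pair_term = 0.5 * sq_dist
    else:
        pair_term = 0.5 * max(margin - sq_dist, 0.0)
    quant_term = (sum(abs(abs(x) - 1.0) for x in b1)
                  + sum(abs(abs(y) - 1.0) for y in b2))
    return pair_term + alpha * quant_term
```

The quantization term is what makes the later sign-based binarization nearly lossless: outputs that already sit close to ±1 change little when rounded to bits.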
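Step 3. Using the trained model, generate the corresponding binary hash code for every email through the deep hash model. In the subsequent retrieval process, generate a search hash value for the specific input search terms through the deep hash model, and use that hash value to retrieve the email codes quickly.
Step 4. Generate a timeline in chronological order from the search results and the archiving time of the emails.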
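Steps 3 and 4 can be sketched as follows, under several assumptions that the text does not state: real-valued model outputs are binarized by sign, matching uses a Hamming-distance threshold, and the archive is a plain in-memory dictionary. The function names, the threshold, and the sample data are all hypothetical.

```python
from datetime import date


def binarize(outputs: list[float]) -> int:
    """Pack the signs of the model outputs into an integer hash code."""
    code = 0
    for value in outputs:
        code = (code << 1) | (1 if value >= 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash codes."""
    return bin(a ^ b).count("1")


def build_event_axis(query_code: int, archive: dict[str, tuple[int, date]],
                     max_distance: int = 1) -> list[tuple[date, str]]:
    """Return matching mails as (archiving date, mail id), oldest first."""
    hits = [(when, mail_id)
            for mail_id, (code, when) in archive.items()
            if hamming(query_code, code) <= max_distance]
    return sorted(hits)


# Hypothetical archive: mail id -> (hash code, archiving date).
archive = {
    "m1": (0b1010, date(2019, 5, 1)),
    "m2": (0b1011, date(2018, 3, 9)),
    "m3": (0b0101, date(2020, 1, 2)),
}
query = binarize([0.9, -0.8, 0.7, -0.6])  # stands in for the model's output
axis = build_event_axis(query, archive)
```

Comparing compact binary codes with XOR and a bit count is what makes the lookup fast; a production system would index the codes rather than scan them linearly.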
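It should be noted that the method and process embodiments above are each described as a series of combined actions for simplicity. Those skilled in the art will appreciate that the embodiments of the present invention are not limited by the order of actions described, because some steps may be performed in another order or simultaneously. Those skilled in the art will also appreciate that the embodiments described in the specification are all optional embodiments, and that the actions involved are not necessarily required by the embodiments of the present invention.
Referring to Fig. 2, to solve the same technical problem, the present invention also provides a device for generating a digital humanities-oriented email history event axis, comprising:
a model training module 1, configured to establish a deep hash model and train the deep hash model with a preset training data set, wherein each training data sample in the training data set includes multiple labels;
a mail encoding module 2, configured to hash-encode all pre-archived emails with the trained deep hash model, and to store each resulting email hash code in an email historical-materials database in association with its email;
a mail search module 3, configured to generate a search hash value from an obtained search query with the trained deep hash model, and to search the email historical-materials database by the search hash value to extract target emails;
an event axis generation module 4, configured to generate an email history event axis in chronological order according to the archiving time of each target email.
Further, the device also comprises a mail archiving module, configured to archive all emails in a target mail system in real time, and to time-stamp and store each email according to its receiving time or sending time.
Further, after time-stamping each email, the device also performs:
encrypting and storing all archived emails according to a preset encryption algorithm.
It can be understood that the device embodiments above correspond to the method embodiments of the present invention. The device for generating a digital humanities-oriented email history event axis provided by an embodiment of the present invention can implement the method for generating a digital humanities-oriented email history event axis provided by any method embodiment of the present invention.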
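How the modules hand data to one another can be sketched as a single class. This is only a structural illustration: the trained deep hash model (the output of module 1) is replaced by an arbitrary `encode` callable, matching is by exact code equality, and `EventAxisDevice`, `toy_encode`, and the sample mails are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EventAxisDevice:
    # Stands in for the trained deep hash model produced by module 1.
    encode: Callable[[str], int]
    # mail id -> (hash code, mail text, archiving time as an ISO date string)
    store: dict[str, tuple[int, str, str]] = field(default_factory=dict)

    def encode_mail(self, mail_id: str, text: str, archived_at: str) -> None:
        """Module 2: hash-encode a mail and store the code with the mail."""
        self.store[mail_id] = (self.encode(text), text, archived_at)

    def search(self, query: str) -> list[str]:
        """Module 3: encode the query and return ids of matching mails."""
        query_code = self.encode(query)
        return [mail_id for mail_id, (code, _, _) in self.store.items()
                if code == query_code]

    def event_axis(self, query: str) -> list[tuple[str, str]]:
        """Module 4: order the matches by archiving time, oldest first."""
        return sorted((self.store[m][2], m) for m in self.search(query))


def toy_encode(text: str) -> int:
    """Hypothetical one-bit 'model': does the text mention a contract?"""
    return 1 if "contract" in text.lower() else 0


device = EventAxisDevice(encode=toy_encode)
device.encode_mail("a", "Contract signed", "2020-02-01")
device.encode_mail("b", "Lunch?", "2019-01-01")
device.encode_mail("c", "Draft contract attached", "2019-06-01")
axis = device.event_axis("contract")
```

Passing the encoder in as a callable keeps the storage, search, and timeline logic independent of how the model is trained, which mirrors the separation between module 1 and modules 2 to 4.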
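To solve the same technical problem, the present invention also provides a terminal device for generating a digital humanities-oriented email history event axis, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the memory is coupled to the processor, and the processor, when executing the computer program, implements any one of the described methods for generating a digital humanities-oriented email history event axis.
The terminal device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the terminal device and uses various interfaces and lines to connect the parts of the entire terminal device.
The memory may mainly include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function, and the like; the data storage area may store data created through use of the mobile phone, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
To solve the same technical problem, the present invention also provides a computer-readable storage medium storing a computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute any one of the described methods for generating a digital humanities-oriented email history event axis.
The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments above. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content of the computer-readable medium may be expanded or reduced as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the device embodiments provided by the present invention, a connection between modules indicates a communication connection between them, which may be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.
The above are preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements are also regarded as falling within the scope of protection of the present invention.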

Claims (10)

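  1. A method for generating a digital humanities-oriented email history event axis, characterized by comprising:
    establishing a deep hash model and training the deep hash model with a preset training data set, wherein each training data sample in the training data set includes multiple labels;
    hash-encoding all pre-archived emails with the trained deep hash model, and storing each resulting email hash code in an email historical-materials database in association with its email;
    generating a search hash value from an obtained search query with the trained deep hash model, and searching the email historical-materials database by the search hash value to extract target emails;
    generating an email history event axis in chronological order according to the archiving time of each target email.
  2. The method for generating a digital humanities-oriented email history event axis according to claim 1, characterized in that, before establishing the deep hash model, the method further comprises:
    archiving all emails in a target mail system in real time, and time-stamping and storing each email according to its receiving time or sending time.
  3. The method for generating a digital humanities-oriented email history event axis according to claim 2, characterized in that, after time-stamping each email, the method further comprises:
    encrypting and storing all archived emails according to a preset encryption algorithm.
  4. The method for generating a digital humanities-oriented email history event axis according to claim 1, characterized in that establishing the deep hash model further comprises:
    adding a preset loss function to the deep hash model, so that the loss function constrains the outputs of the deep hash model during training.
  5. The method for generating a digital humanities-oriented email history event axis according to claim 1, characterized in that training the deep hash model with the preset training data set specifically comprises:
    defining a similarity determination rule between data according to the co-occurrence relations among the multi-label data in the training data set;
    training the deep hash model with the similarity determination rule as supervision information.
  6. A device for generating a digital humanities-oriented email history event axis, characterized by comprising:
    a model training module, configured to establish a deep hash model and train the deep hash model with a preset training data set, wherein each training data sample in the training data set includes multiple labels;
    a mail encoding module, configured to hash-encode all pre-archived emails with the trained deep hash model, and to store each resulting email hash code in an email historical-materials database in association with its email;
    a mail search module, configured to generate a search hash value from an obtained search query with the trained deep hash model, and to search the email historical-materials database by the search hash value to extract target emails;
    an event axis generation module, configured to generate an email history event axis in chronological order according to the archiving time of each target email.
  7. The device for generating a digital humanities-oriented email history event axis according to claim 6, characterized by further comprising a mail archiving module, configured to archive all emails in a target mail system in real time, and to time-stamp and store each email according to its receiving time or sending time.
  8. The device for generating a digital humanities-oriented email history event axis according to claim 7, characterized in that, after time-stamping each email, the device further performs:
    encrypting and storing all archived emails according to a preset encryption algorithm.
  9. A terminal device for generating a digital humanities-oriented email history event axis, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the memory is coupled to the processor, and the processor, when executing the computer program, implements the method for generating a digital humanities-oriented email history event axis according to any one of claims 1 to 5.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute the method for generating a digital humanities-oriented email history event axis according to any one of claims 1 to 5.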
PCT/CN2020/141129 2019-12-30 2020-12-29 Method and device for generating a digital humanities-oriented email history event axis WO2021136318A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911422430.1 2019-12-30
CN201911422430.1A CN111177421B (zh) 2019-12-30 2019-12-30 Method and device for generating a digital humanities-oriented email history event axis

Publications (1)

Publication Number Publication Date
WO2021136318A1 (zh) 2021-07-08

Family

ID=70654324

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141129 WO2021136318A1 (zh) 2019-12-30 2020-12-29 Method and device for generating a digital humanities-oriented email history event axis

Country Status (2)

Country Link
CN (1) CN111177421B (zh)
WO (1) WO2021136318A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806580A (zh) * 2021-09-28 2021-12-17 西安电子科技大学 基于层次语义结构的跨模态哈希检索方法

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177421B (zh) * 2019-12-30 2023-07-04 论客科技(广州)有限公司 一种面向数字人文的电子邮件历史事件轴生成方法及装置
CN116610805A (zh) * 2023-07-20 2023-08-18 恒辉信达技术有限公司 一种非结构化数据的应用方法、系统、设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104734943A (zh) * 2015-03-17 2015-06-24 深圳市连用科技有限公司 一种电子邮件的处理方法及系统
CN108733801A (zh) * 2018-05-17 2018-11-02 武汉大学 一种面向数字人文的移动视觉检索方法
CN109033155A (zh) * 2018-06-13 2018-12-18 中国电子科技集团公司电子科学研究院 搜索邮件内容方法、装置、终端及存储介质
CN110110122A (zh) * 2018-06-22 2019-08-09 北京交通大学 基于多层语义深度哈希算法的图像-文本跨模态检索
US20190362187A1 (en) * 2018-05-23 2019-11-28 Hitachi, Ltd. Training data creation method and training data creation apparatus
CN111177421A (zh) * 2019-12-30 2020-05-19 论客科技(广州)有限公司 一种面向数字人文的电子邮件历史事件轴生成方法及装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012116208A2 (en) * 2011-02-23 2012-08-30 New York University Apparatus, method, and computer-accessible medium for explaining classifications of documents
US10402428B2 (en) * 2013-04-29 2019-09-03 Moogsoft Inc. Event clustering system
CN109446299B (zh) * 2018-08-27 2022-08-16 中国科学院信息工程研究所 基于事件识别的搜索电子邮件内容的方法及系统

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104734943A (zh) * 2015-03-17 2015-06-24 深圳市连用科技有限公司 一种电子邮件的处理方法及系统
CN108733801A (zh) * 2018-05-17 2018-11-02 武汉大学 一种面向数字人文的移动视觉检索方法
US20190362187A1 (en) * 2018-05-23 2019-11-28 Hitachi, Ltd. Training data creation method and training data creation apparatus
CN109033155A (zh) * 2018-06-13 2018-12-18 中国电子科技集团公司电子科学研究院 搜索邮件内容方法、装置、终端及存储介质
CN110110122A (zh) * 2018-06-22 2019-08-09 北京交通大学 基于多层语义深度哈希算法的图像-文本跨模态检索
CN111177421A (zh) * 2019-12-30 2020-05-19 论客科技(广州)有限公司 一种面向数字人文的电子邮件历史事件轴生成方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI ZILIN, WANG YUJUE, LONG JIAQING: "Relationship between Archives and Digital Humanities", 17 October 2018 (2018-10-17), XP055825589, Retrieved from the Internet <URL:http://www.zgdazxw.com.cn/news/2018-10/17/content_251025.htm> *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806580A (zh) * 2021-09-28 2021-12-17 西安电子科技大学 基于层次语义结构的跨模态哈希检索方法
CN113806580B (zh) * 2021-09-28 2023-10-20 西安电子科技大学 基于层次语义结构的跨模态哈希检索方法

Also Published As

Publication number Publication date
CN111177421B (zh) 2023-07-04
CN111177421A (zh) 2020-05-19

Similar Documents

Publication Publication Date Title
WO2021136318A1 (zh) 一种面向数字人文的电子邮件历史事件轴生成方法及装置
Alam et al. Processing social media images by combining human and machine computing during crises
Goh et al. Food-image Classification Using Neural Network Model
US10796203B2 (en) Out-of-sample generating few-shot classification networks
US20190163533A1 (en) Computing task management using tree structures
WO2022088671A1 (zh) 自动问答方法、装置、设备及存储介质
CN113434716B (zh) 一种跨模态信息检索方法和装置
CN110929125A (zh) 搜索召回方法、装置、设备及其存储介质
US20240054293A1 (en) Multi-turn dialogue response generation using asymmetric adversarial machine classifiers
WO2024041479A1 (zh) 一种数据处理方法及其装置
WO2023134057A1 (zh) 事务信息查询方法、装置、计算机设备及存储介质
US20230032728A1 (en) Method and apparatus for recognizing multimedia content
WO2023174119A1 (zh) 数字内容处理方法、装置、电子设备、存储介质及产品
CN111680161A (zh) 一种文本处理方法、设备以及计算机可读存储介质
CN114077841A (zh) 基于人工智能的语义提取方法、装置、电子设备及介质
CN107766498B (zh) 用于生成信息的方法和装置
WO2019227629A1 (zh) 文本信息的生成方法、装置、计算机设备及存储介质
CN113821602B (zh) 基于图文聊天记录的自动答疑方法、装置、设备及介质
Alves et al. Leveraging BERT's Power to Classify TTP from Unstructured Text
CN111382243A (zh) 文本的类别匹配方法、类别匹配装置及终端
CN112307175A (zh) 一种文本处理方法、装置、服务器及计算机可读存储介质
WO2024040870A1 (zh) 文本图像生成、训练、文本图像处理方法以及电子设备
CN116127925A (zh) 基于对文本进行破坏处理的文本数据增强方法及装置
JP7236501B2 (ja) 文書類似度学習に基づくディープラーニングモデルの転移学習方法およびコンピュータ装置
Bhoj et al. LSTM powered identification of clickbait content on entertainment and news websites

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20909897; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20909897; Country of ref document: EP; Kind code of ref document: A1)