WO2021159814A1 - Text data error detection method and apparatus, terminal device, and storage medium - Google Patents

Text data error detection method and apparatus, terminal device, and storage medium

Info

Publication number
WO2021159814A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
feature vector
sample
text data
discriminator
Prior art date
Application number
PCT/CN2020/132478
Other languages
French (fr)
Chinese (zh)
Inventor
朱昭苇
孙行智
胡岗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021159814A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

Disclosed are a text data error detection method and apparatus, a terminal device, and a storage medium, applicable to digital healthcare. The method comprises: acquiring text data to be verified from any data source, wherein the text data to be verified comprises state description data of a target object and state determination data with regard to the target object (S101); acquiring a first feature vector corresponding to the state description data, and inputting the first feature vector into a generator in a generative adversarial network so that the generator outputs a second feature vector (S102), wherein the generator is obtained by means of adversarial training based on sample text data from at least two data sources and at least two discriminators in the generative adversarial network; and acquiring a third feature vector corresponding to the state determination data, and determining, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data (S103). The method can improve the accuracy of text data error detection and is widely applicable.

Description

Text data error detection method and apparatus, terminal device, and storage medium
This application claims priority to Chinese patent application No. 202011042326.2, filed with the China Patent Office on September 28, 2020 and entitled "Text data error detection method and apparatus, terminal device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of data processing, and in particular to a text data error detection method and apparatus, a terminal device, and a storage medium.
Background
As an enterprise develops, it typically generates many types of text data. Quality monitoring (for ease of description, "quality control") of certain important text data improves how the enterprise is built and managed and helps it grow. For a hospital, for example, diagnostic quality control of medical record data is an important part of hospital management and development. Diagnostic quality control is valuable for evaluating doctors and for tracing incidents. It generally covers misdiagnosis and missed diagnosis; from the perspective of hospitals and doctors, detecting misdiagnosis matters more for keeping a hospital running normally. However, because China has a very large population and the number of people seeking medical treatment far exceeds the world average, diagnostic quality control of the large volume of medical record data can usually be performed only by manual spot checks, which are inefficient and time-consuming. The prior art has therefore also proposed model-based diagnostic quality control. The inventors realized, however, that because such methods train the model only on a single hospital's own data, the model cannot be transferred effectively to other hospitals: it generalizes poorly and its detection accuracy is low.
Technical Problem
Embodiments of this application provide a text data error detection method and apparatus, a terminal device, and a storage medium, which can improve the accuracy of text data detection and are widely applicable.
Technical Solution
In a first aspect, an embodiment of this application provides a text data error detection method. The method includes: acquiring text data to be verified from any data source, where the text data to be verified includes state description data of a target object and state determination data for the target object; acquiring a first feature vector corresponding to the state description data, and inputting the first feature vector into a generator in a generative adversarial network so that the generator outputs a second feature vector, where the generator is obtained through adversarial training based on sample text data from at least two data sources and at least two discriminators in the generative adversarial network, and each discriminator is trained on sample text data from one of the at least two data sources; and acquiring a third feature vector corresponding to the state determination data, and determining, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data.
In a second aspect, an embodiment of this application provides a text data error detection apparatus. The apparatus includes: a data acquisition module, configured to acquire text data to be verified from any data source, where the text data to be verified includes state description data of a target object and state determination data for the target object; a data processing module, configured to acquire a first feature vector corresponding to the state description data, and input the first feature vector into a generator in a generative adversarial network so that the generator outputs a second feature vector, where the generator is obtained through adversarial training based on sample text data from at least two data sources and at least two discriminators in the generative adversarial network, and each discriminator is trained on sample text data from one of the at least two data sources; and a data detection module, configured to acquire a third feature vector corresponding to the state determination data, and determine, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data.
In a third aspect, an embodiment of this application provides a terminal device. The terminal device includes a processor and a memory that are connected to each other. The memory is configured to store a computer program that includes program instructions, and the processor is configured to invoke the program instructions to perform the following method: acquiring text data to be verified from any data source, where the text data to be verified includes state description data of a target object and state determination data for the target object; acquiring a first feature vector corresponding to the state description data, and inputting the first feature vector into a generator in a generative adversarial network so that the generator outputs a second feature vector, where the generator is obtained through adversarial training based on sample text data from at least two data sources and at least two discriminators in the generative adversarial network, and each discriminator is trained on sample text data from one of the at least two data sources; and acquiring a third feature vector corresponding to the state determination data, and determining, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program that includes program instructions which, when executed by a processor, cause the processor to perform the following method: acquiring text data to be verified from any data source, where the text data to be verified includes state description data of a target object and state determination data for the target object; acquiring a first feature vector corresponding to the state description data, and inputting the first feature vector into a generator in a generative adversarial network so that the generator outputs a second feature vector, where the generator is obtained through adversarial training based on sample text data from at least two data sources and at least two discriminators in the generative adversarial network, and each discriminator is trained on sample text data from one of the at least two data sources; and acquiring a third feature vector corresponding to the state determination data, and determining, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data.
Beneficial Effects
The embodiments of this application can improve the accuracy of text data detection and are widely applicable.
Description of the Drawings
FIG. 1 is a schematic flowchart of a text data error detection method provided by an embodiment of this application.
FIG. 2 is a schematic diagram of a medical record data scenario provided by an embodiment of this application.
FIG. 3 is another schematic flowchart of the text data error detection method provided by an embodiment of this application.
FIG. 4 is a schematic diagram of the framework of a generative adversarial network and a data pair matching model provided by an embodiment of this application.
FIG. 5 is a schematic structural diagram of a text data error detection apparatus provided by an embodiment of this application.
FIG. 6 is another schematic structural diagram of the text data error detection apparatus provided by an embodiment of this application.
FIG. 7 is a schematic structural diagram of a terminal device provided by an embodiment of this application.
Embodiments of the Invention
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
The technical solution of this application can be applied in the fields of artificial intelligence, smart cities, digital healthcare, blockchain, and/or big data to implement text detection. Optionally, the data involved in this application, such as text, vectors, and/or determination results, may be stored in a database or in a blockchain (for example, through distributed blockchain storage); this application imposes no limitation.
For example, the text data error detection method provided in the embodiments of this application (for ease of description, "the method provided in the embodiments of this application") is broadly applicable to any of several application fields, such as healthcare, investment, and insurance. In this method, text data to be verified is acquired from any data source; the text data to be verified includes state description data of a target object and state determination data for the target object. A first feature vector corresponding to the state description data is acquired and input into a generator in a generative adversarial network, and the generator outputs a second feature vector. A third feature vector corresponding to the state determination data is then acquired, and whether the state determination data is erroneous data is determined according to the second feature vector and the third feature vector. The generator is obtained through adversarial training based on sample text data from at least two data sources and at least two discriminators in the generative adversarial network, and each discriminator is trained on sample text data from one of the at least two data sources. The embodiments of this application can improve the accuracy of text data detection and are widely applicable.
The method and related apparatus provided by the embodiments of this application are described in detail below with reference to FIG. 1 to FIG. 7.
Refer to FIG. 1, which is a schematic flowchart of a text data error detection method provided by an embodiment of this application. The method may include the following steps S101 to S103.
S101. Acquire text data to be verified from any data source, where the text data to be verified includes state description data of a target object and state determination data for the target object.
In some feasible implementations, text data to be verified is acquired from any data source, and the acquired text data may include state description data of a target object and state determination data for the target object. The data source of the text data to be verified differs between application fields. In the healthcare field, for example, the text data to be verified may include medical record data, whose data source may be a hospital. When the text data to be verified is medical record data, the state description data may be the description of the patient's condition in the medical record, and the state determination data may be the doctor's diagnosis of that condition. The condition description data may include the chief complaint, the history of present illness, and so on, without limitation. As another example, in the insurance field, the text data to be verified may include insurance data, whose data source may be an insurance company. When the text data to be verified is insurance data, the state description data may be the applicant's insurance requirement data, and the state determination data may be the insurance plan that the insurance agent tailored for the applicant. For ease of description, the following embodiments all use the healthcare field as an example. Refer to FIG. 2, which is a schematic diagram of a medical record data scenario provided by an embodiment of this application. As shown in FIG. 2, medical record data may include the patient's name, gender, and age, the department visited, the date of the visit, the attending doctor, the chief complaint, the history of present illness, and the diagnosis result. The chief complaint and the history of present illness extracted from the medical record can be taken as the patient's condition description data, and the diagnosis result extracted from the medical record can be taken as the patient's condition diagnosis data.
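The extraction in S101 can be sketched as follows. This is only an illustrative sketch: the field names (`chief_complaint`, `present_illness_history`, `diagnosis`), the `TextToVerify` type, and the sample record are hypothetical and do not come from the application, which only lists the kinds of fields shown in FIG. 2.

```python
from dataclasses import dataclass

# Hypothetical field names, chosen for illustration only.
DESCRIPTION_FIELDS = ("chief_complaint", "present_illness_history")
DETERMINATION_FIELD = "diagnosis"


@dataclass
class TextToVerify:
    state_description: str    # e.g. chief complaint + history of present illness
    state_determination: str  # e.g. the doctor's diagnosis


def split_record(record: dict) -> TextToVerify:
    """Split one record into state description and state determination data."""
    parts = [record.get(name, "").strip() for name in DESCRIPTION_FIELDS]
    description = " ".join(part for part in parts if part)
    determination = record.get(DETERMINATION_FIELD, "").strip()
    return TextToVerify(description, determination)


# Made-up record used purely to exercise the function.
record = {
    "department": "Respiratory medicine",
    "chief_complaint": "Cough and fever for three days.",
    "present_illness_history": "Symptoms began after exposure to cold.",
    "diagnosis": "Upper respiratory tract infection",
}
sample = split_record(record)
```

Fields that are neither description nor determination data (name, department, and so on) are simply ignored here; a real system might keep them for traceability.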
S102. Acquire a first feature vector corresponding to the state description data, and input the first feature vector into a generator in a generative adversarial network so that the generator outputs a second feature vector.
In some feasible implementations, a first feature vector corresponding to the state description data is acquired and input into a generator in a generative adversarial network, and the generator outputs a second feature vector. The generator may be obtained through adversarial training based on sample text data from at least two data sources and at least two discriminators in the generative adversarial network. Each discriminator is trained on sample text data from one of the at least two data sources. For example, suppose the at least two data sources include a first data source and a second data source, and the at least two discriminators include a first discriminator and a second discriminator. The generator may then be obtained through adversarial training based on the sample text data of the first and second data sources and the first and second discriminators in the generative adversarial network, where the first discriminator is trained on sample text data from the first data source and the second discriminator is trained on sample text data from the second data source. In the healthcare field, the at least two data sources may be at least two hospitals in the same region or hospitals in different regions, depending on the actual application scenario, without limitation.
S103. Acquire a third feature vector corresponding to the state determination data, and determine, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data.
In some feasible implementations, a third feature vector corresponding to the state determination data is acquired, and whether the state determination data is erroneous data is determined according to the second feature vector and the third feature vector. Specifically, the second feature vector and the third feature vector may be input into a data pair matching model, and whether the state determination data is erroneous data is determined from the output of that model. The data pair matching model may be trained on at least one sample data pair and the matching label of each sample data pair. A sample data pair includes a fourth feature vector corresponding to the state description data in one piece of sample text data and a fifth feature vector corresponding to the state determination data, and the matching label of a sample data pair indicates whether its fourth feature vector and fifth feature vector match. In other words, by inputting the feature vector corresponding to the state description data and the feature vector corresponding to the state determination data into the matching model, the model can determine whether the state description data and the state determination data match. When they do not match, the state determination data can be considered erroneous data.
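The decision in S103 can be illustrated with a deliberately simple stand-in. The application uses a trained data pair matching model; the sketch below swaps in cosine similarity with a fixed threshold purely to show the shape of the decision. The function names and the threshold value are assumptions, not part of the application.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_erroneous(second_vec: Sequence[float],
                 third_vec: Sequence[float],
                 threshold: float = 0.5) -> bool:
    """Flag the state determination data when the two vectors do not match.

    Stand-in for the trained matching model: "no match" here simply means
    the cosine similarity falls below a hypothetical threshold.
    """
    if len(second_vec) != len(third_vec):
        raise ValueError("feature vectors must have the same length")
    return cosine_similarity(second_vec, third_vec) < threshold
```

A learned matching model would replace `cosine_similarity` with its own scoring function, but the final step is the same: a non-matching pair marks the state determination data as erroneous.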
In the embodiments of this application, text data to be verified is acquired from any data source; the text data to be verified includes state description data of a target object and state determination data for the target object. A first feature vector corresponding to the state description data is acquired and input into a generator in a generative adversarial network, and the generator outputs a second feature vector. A third feature vector corresponding to the state determination data is then acquired, and whether the state determination data is erroneous data is determined according to the second feature vector and the third feature vector. The generator is obtained through adversarial training based on sample text data from at least two data sources and at least two discriminators in the generative adversarial network, and each discriminator is trained on sample text data from one of the at least two data sources. The embodiments of this application can improve the accuracy of text data detection and are widely applicable.
Refer to FIG. 3, which is another schematic flowchart of the text data error detection method provided by an embodiment of this application. The method can be described through the implementations provided in the following steps S201 to S203.
S201. Acquire a training sample set, construct a first discriminator based on sample text data from a first data source in the training sample set, and construct a second discriminator based on sample text data from a second data source in the training sample set.
In some feasible implementations, a training sample set is acquired, which may include sample text data from at least two data sources. The sample text data of one data source can be used to construct one discriminator. For example, a first discriminator may be constructed from the sample text data of a first data source in the training sample set, a second discriminator from the sample text data of a second data source, a third discriminator from the sample text data of a third data source, and so on, depending on the actual application scenario, without limitation. The number of data sources in the training sample set may be greater than or equal to the number of discriminators constructed. By way of illustration, the following embodiments assume that the training sample set includes two data sources (a first data source and a second data source) and that the constructed discriminators are a first discriminator and a second discriminator.
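The one-discriminator-per-source arrangement implies that the training sample set is first partitioned by data source. A minimal sketch, assuming each sample is tagged with a hypothetical source identifier (the identifiers and records below are made up):

```python
from collections import defaultdict


def group_by_source(samples):
    """Group (source_id, sample_text) pairs by their data source."""
    grouped = defaultdict(list)
    for source_id, sample_text in samples:
        grouped[source_id].append(sample_text)
    return dict(grouped)


# Made-up training set with two hypothetical data sources.
training_set = [
    ("hospital_a", "record 1"),
    ("hospital_b", "record 2"),
    ("hospital_a", "record 3"),
]
per_source = group_by_source(training_set)
# One discriminator would then be trained on each value of `per_source`.
```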
In the healthcare field, the first discriminator and the second discriminator may each be a disease classification model (for ease of description, a first disease classification model and a second disease classification model). The sample text data from the first data source in the training sample set can be used to train or construct the model parameters of the first disease classification model, and the sample text data from the second data source can be used to train or construct the model parameters of the second disease classification model. Either disease classification model may include a convolutional neural network (CNN), a fully connected layer, and a softmax layer. The CNN includes multiple convolutional layers and multiple pooling layers; the kernel size of each convolutional layer can be set according to the actual application scenario, and each pooling layer may be a max pooling layer, an average pooling layer, or the like, without limitation. The feature vector corresponding to the condition description data in sample text data from the first or second data source is input into the disease classification model and passes through the CNN, the fully connected layer, and the softmax layer in turn; the softmax layer outputs a probability for each disease, that is, a disease probability distribution. By computing a loss function between the disease classification label corresponding to the condition diagnosis data in each piece of sample text data and the disease probability distribution, the model parameters of the disease classification model are adjusted iteratively until a disease classification model that satisfies the convergence condition is obtained.
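The last stage of such a classifier can be sketched in plain Python. The application does not name the loss function; cross-entropy is assumed here because it is the usual companion of a softmax output, and the logits below are made-up values standing in for the fully connected layer's output.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label_index):
    """Negative log-probability assigned to the labelled disease class."""
    return -math.log(max(probs[label_index], 1e-12))


# Made-up scores for three hypothetical disease classes.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)          # disease probability distribution
loss = cross_entropy(probs, 0)   # loss if class 0 is the diagnosed disease
```

The loss is small when the probability assigned to the labelled disease is high, which is what drives the parameter updates described above.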
可选的,在一些可行的实施方式中,第一判别器和第二判别器还可以为疾病分类模型中的第一分类参数和第二分类参数,其中,所述第一分类参数可根据来自所述第一数据源的样本数据对应的样本特征向量和样本数据分类结果训练得到,所述第二分类参数根据来自所述第二数据源的样本数据对应的样本特征向量和样本数据分类结果训练得到。Optionally, in some feasible implementation manners, the first discriminator and the second discriminator may also be the first classification parameter and the second classification parameter in the disease classification model, where the first classification parameter may be based on The sample feature vector corresponding to the sample data of the first data source and the sample data classification result are trained, and the second classification parameter is trained according to the sample feature vector corresponding to the sample data from the second data source and the sample data classification result. get.
S202: Construct a generator based on each sample text data in the training sample set and the first discriminator and the second discriminator in the generative adversarial network.
In some feasible implementations, the state description data in each sample text data in the training sample set is obtained, and the first state description feature vector corresponding to that state description data is input into the generator to obtain the second state description feature vector output by the generator. The second state description feature vector is then input into the first discriminator and into the second discriminator, which yields the first determination result probability distribution output by the first discriminator and the second determination result probability distribution output by the second discriminator. Further, the model parameters of the generator may be adjusted according to the first determination result probability distribution and the second determination result probability distribution to obtain a generator that satisfies the convergence condition.
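The data flow of this step can be sketched as follows. The sketch assumes, purely for illustration, that the generator is a single linear map and that each discriminator is a linear layer followed by softmax; the description does not fix these architectures, and all weights and dimensions are hypothetical.

```python
import math

def linear(vec, weights):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generator(first_vec, g_weights):
    """Hypothetical generator: maps the first state description feature
    vector to the second state description feature vector."""
    return linear(first_vec, g_weights)

def discriminator(vec, d_weights):
    """Hypothetical discriminator: outputs a determination result
    probability distribution over the classes."""
    return softmax(linear(vec, d_weights))

# Assumed toy parameters: 3-dim input, 2-dim generator output, 3 classes.
g_weights = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
d1_weights = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]   # trained on source 1
d2_weights = [[0.2, 0.1], [-0.3, 0.3], [0.4, -0.1]]   # trained on source 2

first_vec = [0.2, 0.7, 0.1]
second_vec = generator(first_vec, g_weights)
dist_1 = discriminator(second_vec, d1_weights)   # first determination result distribution
dist_2 = discriminator(second_vec, d2_weights)   # second determination result distribution
```

The two distributions are what the generator's parameter update would be driven by; how the update itself is computed is left open in the text and is not shown.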
It should be understood that the first standard deviation is obtained by computing the standard deviation of the multiple determination result probabilities included in the first determination result probability distribution, and the second standard deviation is obtained by computing the standard deviation of the multiple determination result probabilities included in the second determination result probability distribution. When the first standard deviation and the second standard deviation are both less than or equal to a preset standard deviation threshold, the generator with the adjusted model parameters can be determined to satisfy the convergence condition. In other words, when the disease probabilities that the first discriminator and the second discriminator output for the individual diseases are roughly similar, the feature vector output by the generator can be considered relatively pure: the generator has learned information from multiple data sources without being contaminated by the noise specific to any single data source.
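The convergence test above reduces to two standard deviation computations and a threshold comparison. A minimal sketch follows; the function name `generator_converged`, the threshold value, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation are assumptions, since the text does not say which variant is meant.

```python
from statistics import pstdev

def generator_converged(dist_1, dist_2, threshold):
    """Return True when both determination result distributions are flat
    enough, i.e. both standard deviations are <= the preset threshold."""
    first_std = pstdev(dist_1)    # first standard deviation
    second_std = pstdev(dist_2)   # second standard deviation
    return first_std <= threshold and second_std <= threshold

# Near-uniform outputs from both discriminators: treated as converged.
flat = generator_converged([0.34, 0.33, 0.33], [0.30, 0.35, 0.35], threshold=0.05)

# One discriminator is still confident about a single class: not converged.
peaked = generator_converged([0.90, 0.05, 0.05], [0.34, 0.33, 0.33], threshold=0.05)
```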
S203: Obtain text data to be verified from any data source, where the text data to be verified includes state description data of a target object and state determination data for the target object.
In some feasible implementations, after the generator and the discriminators in the generative adversarial network have been adversarially trained on sample text data from at least two data sources, text data to be verified may be obtained from any data source and checked for errors. It should be understood that this data source may be one of the at least two data sources included in the training sample set, or it may be a data source different from every data source included in the training sample set. When the data source is one of those included in the training sample set, the text data to be verified is new text data, that is, text data that was not used as a training sample. For example, in the medical field, the text data to be verified may include medical record data, and the data source of the medical record data may be a hospital. When the text data to be verified is medical record data, the state description data for the target object may be the patient's condition description data in the medical record, and the state determination data for the target object may be the doctor's condition diagnosis data for the patient. The condition description data may include the chief complaint, the history of present illness, and so on, which is not limited here. As another example, in the insurance field, the text data to be verified may include insurance data, and the data source of the insurance data may be an insurance company. When the text data to be verified is insurance data, the state description data for the target object may be the applicant's insurance requirement data, and the state determination data for the target object may be the insurance plan that the insurance agent has customized for the applicant. For ease of description, the following embodiments of the present application all take the medical field as an example.
Suppose the training sample set includes sample medical record data x from hospital a (for example, hospital a's medical record data for 2019) and sample medical record data y from hospital b (for example, hospital b's medical record data for 2019). After the corresponding generator and discriminators have been trained on x and y, new medical record data from hospital a may be obtained as the text data to be verified, for example the medical record data of one or more patients who visited hospital a in 2020, or of one or more patients who visited hospital a in 2018. Alternatively, medical record data from hospital c may be obtained as the text data to be verified, for example the medical record data of one or more patients who visited hospital c in 2019, or of one or more patients in 2020. This is determined according to the actual application scenario and is not limited here.
S204: Obtain a first feature vector corresponding to the state description data, and input the first feature vector into the generator in the generative adversarial network so that the generator outputs a second feature vector.
In some feasible implementations, the first feature vector corresponding to the state description data is obtained and input into the generator in the generative adversarial network, and the generator outputs the second feature vector. The generator may be obtained through adversarial training, based on sample text data from at least two data sources, against at least two discriminators in the generative adversarial network. Each discriminator is trained on the sample text data of one of the at least two data sources. For example, suppose the at least two data sources include a first data source and a second data source, and the at least two discriminators include a first discriminator and a second discriminator. The generator may then be obtained through adversarial training, based on the sample text data of the first data source and the sample text data of the second data source, against the first discriminator and the second discriminator, where the first discriminator is trained on the sample text data of the first data source and the second discriminator is trained on the sample text data of the second data source. It should be understood that in the medical field the at least two data sources may include at least two hospitals in the same region, or two hospitals in different regions; this is determined according to the actual application scenario and is not limited here.
It should be understood that word segmentation of the state description data included in the text data to be verified yields the multiple words that make up the state description data. The word vector corresponding to each of these words is obtained, and the first feature vector corresponding to the state description data is generated from these word vectors. For example, when the text data to be verified includes medical record data, the state description data for the target object may include the patient's condition description data, and the state determination data for the target object may include the condition diagnosis data for the patient. Word segmentation of the condition description data therefore yields the multiple words that make up the condition description data, and the first feature vector corresponding to the condition description data is generated from the word vector of each of these words. To determine the word vector of each word, a preset word vector lookup table may be used. The lookup table contains multiple word indexes and the word vector corresponding to each word index, and each word corresponds to one word index. The word vector of each word in the condition description data can thus be looked up in the table by the word's index. Further, summing the word vectors, or computing their weighted sum, gives the first feature vector corresponding to the state description data.
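The lookup-and-sum procedure can be sketched as below. Word segmentation is assumed to have been done already (a real system would use a segmenter suited to the language of the records), and the vocabulary, indexes, vectors, and weights are hypothetical toy values.

```python
def build_feature_vector(words, word_index, vector_table, weights=None):
    """Sum (or weighted-sum) the word vectors of the given words.

    word_index   maps each word to its word index.
    vector_table maps each word index to its word vector.
    weights      optional per-word weights; a plain sum is used when omitted.
    A word missing from word_index raises KeyError in this sketch.
    """
    dim = len(next(iter(vector_table.values())))
    total = [0.0] * dim
    for position, word in enumerate(words):
        weight = 1.0 if weights is None else weights[position]
        vector = vector_table[word_index[word]]
        for d in range(dim):
            total[d] += weight * vector[d]
    return total

# Hypothetical lookup table with 2-dimensional word vectors.
word_index = {"persistent": 0, "cough": 1, "fever": 2}
vector_table = {0: [0.1, 0.0], 1: [0.4, 0.2], 2: [0.3, 0.5]}

words = ["persistent", "cough", "fever"]          # output of word segmentation
plain_sum = build_feature_vector(words, word_index, vector_table)
weighted = build_feature_vector(words, word_index, vector_table,
                                weights=[0.5, 1.0, 1.0])
```

The same helper would apply unchanged to the state determination data, since the description uses the same segmentation and lookup steps for both.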
Optionally, in some feasible implementations, after word segmentation of the state description data has produced the multiple words that make up the state description data, the stop words among them may be removed first. The remaining words are then processed to obtain their word vectors, and the feature vector determined from the word vectors of the remaining words is used as the first feature vector corresponding to the state description data. The removed stop words may include modal particles, adverbs, prepositions, conjunctions, and so on, as determined by the actual application scenario, which is not limited here.
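Stop-word removal is a simple filter applied between segmentation and the word vector lookup. The stop-word set below is a hypothetical example; the description only names the word classes that may be removed.

```python
# Hypothetical stop-word list (adverbs, prepositions, conjunctions).
STOP_WORDS = frozenset({"and", "in", "of", "very", "with"})

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the words that are not stop words, preserving order."""
    return [word for word in words if word not in stop_words]

segmented = ["very", "persistent", "cough", "with", "fever"]
remaining = remove_stop_words(segmented)
```

Only the remaining words are then passed to the word vector lookup.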
S205: Obtain a third feature vector corresponding to the state determination data, and determine, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data.
In some feasible implementations, the third feature vector corresponding to the state determination data is obtained, and whether the state determination data is erroneous data is determined according to the second feature vector and the third feature vector. Specifically, the second feature vector and the third feature vector may be input into a data pair matching model, and whether the state determination data is erroneous data is determined from the output of the data pair matching model. It should be understood that the data pair matching model may be trained on at least one sample data pair and the matching label of each sample data pair. One sample data pair includes the fourth feature vector corresponding to the state description data in one sample text data and the fifth feature vector corresponding to the state determination data, and the matching label of a sample data pair indicates whether the fourth feature vector and the fifth feature vector in that pair match. In other words, by inputting the feature vector corresponding to the state description data and the feature vector corresponding to the state determination data into the matching model, the matching model determines whether the state description data and the state determination data match. When the state description data does not match the state determination data, the state determination data can be considered erroneous data.
It should be understood that word segmentation of the state determination data included in the text data to be verified yields the multiple words that make up the state determination data. The word vector corresponding to each of these words is obtained, and the third feature vector corresponding to the state determination data is generated from these word vectors. For example, when the text data to be verified includes medical record data, the state determination data for the target object may include the condition diagnosis data for the patient. Word segmentation of the condition diagnosis data therefore yields the multiple words that make up the condition diagnosis data, and the third feature vector corresponding to the condition diagnosis data is generated from the word vector of each of these words. To determine the word vector of each word, a preset word vector lookup table may be used. The lookup table contains multiple word indexes and the word vector corresponding to each word index, and each word corresponds to one word index. The word vector of each word in the condition diagnosis data can thus be looked up in the table by the word's index. Further, summing the word vectors, or computing their weighted sum, gives the third feature vector corresponding to the state determination data. Optionally, after word segmentation of the state determination data has produced the multiple words that make up the state determination data, the stop words among them may be removed first. The remaining words are then processed to obtain their word vectors, and the feature vector determined from the word vectors of the remaining words is used as the third feature vector corresponding to the state determination data. The removed stop words may include modal particles, adverbs, prepositions, conjunctions, and so on, as determined by the actual application scenario, which is not limited here.
The data pair matching model may be an end-to-end model. At least one sample data pair is input into the end-to-end model, and the model parameters are optimized iteratively based on the output of the end-to-end model and the matching label of each sample data pair until an end-to-end model that satisfies the convergence condition is obtained. For example, a sample data pair may include the feature vector corresponding to condition description data and the feature vector corresponding to condition diagnosis data, and the matching label is 1 or 0, where 1 indicates that the condition description data and the condition diagnosis data in the pair match and 0 indicates that they do not match. When the output of the matching model is a mismatch, the state determination data can be determined to be erroneous data. For example, when the matching model outputs that the condition description data and the condition diagnosis data do not match, the condition diagnosis data can be determined to be erroneous data, that is, misdiagnosis data.
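As an illustration of training on sample data pairs with 1/0 matching labels, the sketch below substitutes a deliberately small model: logistic regression over the element-wise product of the two feature vectors, trained by stochastic gradient descent. This is a stand-in, not the end-to-end model of the application; the feature construction, learning rate, epoch count, and toy vectors are all assumptions.

```python
import math

def pair_features(desc_vec, diag_vec):
    """Hypothetical pair representation: element-wise product of the vectors."""
    return [d * g for d, g in zip(desc_vec, diag_vec)]

def predict(params, desc_vec, diag_vec):
    """Probability that the description and the diagnosis match."""
    x = pair_features(desc_vec, diag_vec)
    z = params["bias"] + sum(w * xi for w, xi in zip(params["weights"], x))
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, labels, dim, epochs=500, lr=0.5):
    """Fit the matcher with plain SGD on the logistic (cross-entropy) loss."""
    params = {"weights": [0.0] * dim, "bias": 0.0}
    for _ in range(epochs):
        for (desc_vec, diag_vec), label in zip(pairs, labels):
            error = predict(params, desc_vec, diag_vec) - label  # dLoss/dz
            x = pair_features(desc_vec, diag_vec)
            params["weights"] = [w - lr * error * xi
                                 for w, xi in zip(params["weights"], x)]
            params["bias"] -= lr * error
    return params

def is_misdiagnosis(params, desc_vec, diag_vec, threshold=0.5):
    """A predicted mismatch marks the diagnosis as erroneous data."""
    return predict(params, desc_vec, diag_vec) < threshold

# Toy sample data pairs: label 1 = match, label 0 = mismatch.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0]),
         ([0.0, 1.0], [0.0, 1.0]), ([0.0, 1.0], [1.0, 0.0])]
labels = [1, 0, 1, 0]
params = train(pairs, labels, dim=2)
```

With these toy pairs the trained matcher flags `([1.0, 0.0], [0.0, 1.0])` as a mismatch and accepts `([1.0, 0.0], [1.0, 0.0])`.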
For example, refer to FIG. 4, which is a schematic framework diagram of the generative adversarial network and the data pair matching model provided by an embodiment of the present application. As shown in FIG. 4, the first discriminator is first constructed based on the sample text data from the first data source in the training sample set, and the second discriminator is constructed based on the sample text data from the second data source in the training sample set. The generator is then constructed through adversarial training, based on each sample text data in the training sample set (for example, the sample text data of the first data source and the sample text data of the second data source), against the first discriminator and the second discriminator in the generative adversarial network. Further, text data to be verified is obtained from any data source; it includes the state description data of the target object and the state determination data for the target object. The first feature vector corresponding to the state description data is obtained and input into the generator in the generative adversarial network, and the generator outputs the second feature vector. The third feature vector corresponding to the state determination data is obtained, the second feature vector and the third feature vector are input into the data pair matching model, and whether the state determination data is erroneous data is determined from the output of the data pair matching model.
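The inference path of FIG. 4 can be summarized in a short sketch. The three components are stand-ins chosen only so the example runs: `embed` is a bag-of-words counter over a hypothetical vocabulary, `generator` is an identity placeholder for the trained generator, and `matcher` uses cosine similarity in place of the trained data pair matching model.

```python
import math

VOCAB = ["cough", "fever", "fracture"]  # hypothetical vocabulary

def embed(text):
    """Placeholder feature extractor: word counts over VOCAB."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def generator(vec):
    """Placeholder for the trained generator (identity for illustration)."""
    return list(vec)

def matcher(vec_a, vec_b):
    """Placeholder for the data pair matching model: cosine similarity."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = (math.sqrt(sum(a * a for a in vec_a))
            * math.sqrt(sum(b * b for b in vec_b)))
    return dot / norm if norm else 0.0

def is_error(record, threshold=0.5):
    """Run S204 and S205 on one record of text data to be verified."""
    first_vec = embed(record["description"])       # first feature vector
    second_vec = generator(first_vec)              # second feature vector
    third_vec = embed(record["determination"])     # third feature vector
    return matcher(second_vec, third_vec) < threshold

consistent = {"description": "cough fever", "determination": "fever cough"}
suspicious = {"description": "cough fever", "determination": "fracture"}
```

Here `is_error(consistent)` evaluates to `False` and `is_error(suspicious)` to `True`; in a real deployment each placeholder would be replaced by the corresponding trained component.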
In the embodiments of the present application, a training sample set is obtained, the first discriminator is constructed based on the sample text data from the first data source in the training sample set, and the second discriminator is constructed based on the sample text data from the second data source in the training sample set. Further, adversarial training against these two discriminators, based on the sample text data from at least two data sources in the training sample set, yields the generator in the generative adversarial network. Text data to be verified is then obtained from any data source; it includes the state description data of the target object and the state determination data for the target object. The first feature vector corresponding to the state description data is obtained and input into the generator in the generative adversarial network, and the generator outputs the second feature vector. Further, the third feature vector corresponding to the state determination data is obtained, and whether the state determination data is erroneous data is determined according to the second feature vector and the third feature vector. The embodiments of the present application improve the accuracy of error detection in text data and are widely applicable.
Refer to FIG. 5, which is a schematic structural diagram of the text data error detection apparatus provided by an embodiment of the present application. The text data error detection apparatus provided by the embodiment of the present application includes: a data acquisition module 31, configured to obtain text data to be verified from any data source, where the text data to be verified includes state description data of a target object and state determination data for the target object; a data processing module 32, configured to obtain a first feature vector corresponding to the state description data and input the first feature vector into the generator in the generative adversarial network so that the generator outputs a second feature vector, where the generator is obtained through adversarial training, based on sample text data from at least two data sources, against at least two discriminators in the generative adversarial network, and each discriminator is trained on the sample text data of one of the at least two data sources; and a data detection module 33, configured to obtain a third feature vector corresponding to the state determination data and determine, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data.
Refer also to FIG. 6, which is another schematic structural diagram of the text data error detection apparatus provided by an embodiment of the present application.
In some feasible implementations, the data detection module 33 is specifically configured to: input the second feature vector and the third feature vector into a data pair matching model, and determine, based on the output of the data pair matching model, whether the state determination data is erroneous data.
The data pair matching model is trained on at least one sample data pair and the matching label of each sample data pair. One sample data pair includes the fourth feature vector corresponding to the state description data in one sample text data and the fifth feature vector corresponding to the state determination data, and the matching label of a sample data pair indicates whether the fourth feature vector and the fifth feature vector in that pair match.
In some feasible implementations, the at least two data sources include a first data source and a second data source, the at least two discriminators include a first discriminator and a second discriminator, and the apparatus further includes a first training module 34. The first training module 34 is configured to: obtain a training sample set, where the training sample set includes sample text data from the first data source and sample text data from the second data source, and one sample data pair includes the state description data in one sample text data and the state determination label of that state description data; construct the first discriminator based on the sample text data from the first data source in the training sample set; and construct the second discriminator based on the sample text data from the second data source in the training sample set.
In some feasible implementations, the apparatus further includes a second training module 35, and the second training module 35 includes: a training data acquisition unit 351, configured to obtain the state description data in each sample text data in the training sample set; a training data processing unit 352, configured to input the first state description feature vector corresponding to the state description data in each sample text data into the generator and obtain the second state description feature vector output by the generator; a determination result acquisition unit 353, configured to input the second state description feature vector into the first discriminator and into the second discriminator, and obtain the first determination result probability distribution output by the first discriminator and the second determination result probability distribution output by the second discriminator; and a generator adjustment unit 354, configured to adjust the model parameters of the generator according to the first determination result probability distribution and the second determination result probability distribution to obtain a generator that satisfies the convergence condition.
In some feasible implementations, the generator adjustment unit 354 is further configured to: compute the first standard deviation of the multiple determination result probabilities included in the first determination result probability distribution and the second standard deviation of the multiple determination result probabilities included in the second determination result probability distribution; and, when the first standard deviation and the second standard deviation are both less than or equal to a preset standard deviation threshold, determine that the generator with the adjusted model parameters satisfies the convergence condition.
In some feasible implementations, the text data to be verified includes medical record data, the state description data for the target object in the text data to be verified includes the patient's condition description data, and the state determination data for the target object in the text data to be verified includes the condition diagnosis data for the patient.
In some feasible implementations, the data processing module 32 includes a first feature vector acquisition unit 321 and a second feature vector acquisition unit 322. The first feature vector acquisition unit 321 is specifically configured to: perform word segmentation on the condition description data to obtain the multiple words that make up the condition description data; and obtain the word vector corresponding to each of these words and generate, from the word vectors, the first feature vector corresponding to the condition description data.
In the embodiments of the present application, the text data error detection apparatus may construct a first discriminator based on the sample text data from the first data source in the training sample set, and construct a second discriminator based on the sample text data from the second data source in the training sample set. Further, adversarial training may be performed with the two discriminators according to the sample text data of the at least two data sources in the training sample set, so as to obtain the generator of the generative adversarial network. Therefore, by acquiring text data to be verified from any data source, the state description data of the target object and the state determination data for the target object included in the text data to be verified can be obtained. By obtaining the first feature vector corresponding to the state description data and inputting the first feature vector into the generator of the generative adversarial network, a second feature vector can be output by the generator. Further, by obtaining the third feature vector corresponding to the state determination data, whether the state determination data is erroneous data can be determined according to the second feature vector and the third feature vector. The embodiments of the present application can improve the accuracy of error detection on text data and are broadly applicable.
Refer to FIG. 7, which is a schematic structural diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 7, the terminal device in this embodiment may include one or more processors 401, a memory 402, and a transceiver 403. The processor 401, the memory 402, and the transceiver 403 are connected via a bus 404. The memory 402 is configured to store a computer program that includes program instructions, and the processor 401 is configured to execute the program instructions stored in the memory 402 to perform the following operations: acquiring text data to be verified from any data source, where the text data to be verified includes state description data of a target object and state determination data for the target object; obtaining a first feature vector corresponding to the state description data, and inputting the first feature vector into a generator of a generative adversarial network so that the generator outputs a second feature vector, where the generator is obtained through adversarial training, based on sample text data from at least two data sources, with at least two discriminators of the generative adversarial network, and each discriminator is trained on the sample text data of one of the at least two data sources; and obtaining a third feature vector corresponding to the state determination data, and determining, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data.
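For illustration only, the three operations performed by the processor 401 can be sketched as follows. This is a minimal sketch rather than the implementation of the present application: the names `Record`, `detect_error`, `embed`, `generator`, and `matcher` are hypothetical, and the trained models are assumed to be supplied as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class Record:
    """Hypothetical container for one piece of text data to be verified."""
    description: str    # state description data
    determination: str  # state determination data


def detect_error(
    record: Record,
    embed: Callable[[str], Vector],
    generator: Callable[[Vector], Vector],
    matcher: Callable[[Vector, Vector], bool],
) -> bool:
    """Return True if the state determination data is judged to be erroneous."""
    first = embed(record.description)     # first feature vector
    second = generator(first)             # second feature vector (generator output)
    third = embed(record.determination)   # third feature vector
    return not matcher(second, third)     # a mismatch is treated as an error
```

Passing the models in as callables keeps the sketch independent of any particular framework; how `embed`, `generator`, and `matcher` are actually built is left to the implementations described in this application.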
In some feasible implementations, when determining, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data, the processor 401 is configured to: input the second feature vector and the third feature vector into a data-pair matching model, and determine, based on the output of the data-pair matching model, whether the state determination data is erroneous data; where the data-pair matching model is trained on at least one sample data pair and the matching label of each sample data pair, one sample data pair includes a fourth feature vector corresponding to the state description data in one piece of sample text data and a fifth feature vector corresponding to the state determination data, and the matching label of any sample data pair identifies whether the fourth feature vector and the fifth feature vector in that sample data pair match.
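The decision step can be illustrated with a simple stand-in. The sketch below does not implement the trained data-pair matching model described above; it swaps in cosine similarity with a fixed threshold purely to show how a pair of feature vectors is turned into an error decision. The function names and the threshold value of 0.5 are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_error(second_vec: Sequence[float], third_vec: Sequence[float],
             threshold: float = 0.5) -> bool:
    """Stand-in for the matching model: flag an error when similarity is below threshold."""
    return cosine_similarity(second_vec, third_vec) < threshold
```

A trained matching model would replace `is_error` with a classifier fitted on labeled sample data pairs (fourth and fifth feature vectors with their matching labels).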
In some feasible implementations, the at least two data sources include a first data source and a second data source, and the processor 401 is configured to: obtain a training sample set that includes sample text data from the first data source and sample text data from the second data source, where one sample data pair includes the state description data in one piece of sample text data and the state determination label of that state description data; and construct the first discriminator based on the sample text data from the first data source in the training sample set, and construct the second discriminator based on the sample text data from the second data source in the training sample set.
In some feasible implementations, the processor 401 is configured to: obtain the state description data in each piece of sample text data in the training sample set; input the first state description feature vector corresponding to the state description data in each piece of sample text data into the generator, and obtain the second state description feature vector output by the generator; input the second state description feature vector into the first discriminator and the second discriminator respectively, and obtain the first determination result probability distribution output by the first discriminator and the second determination result probability distribution output by the second discriminator; and adjust the model parameters of the generator according to the first determination result probability distribution and the second determination result probability distribution to obtain a generator that satisfies the convergence condition.
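The data flow of one pass through this training procedure can be sketched as follows. This is an illustrative sketch only: `discriminator_outputs` is a hypothetical name, the generator and discriminators are assumed to be plain callables, each discriminator is assumed to return a single probability per input, and the parameter update itself is deliberately omitted because the application does not fix an optimizer here.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def discriminator_outputs(
    first_vectors: Sequence[Vector],
    generator: Callable[[Vector], Vector],
    d1: Callable[[Vector], float],
    d2: Callable[[Vector], float],
) -> Tuple[List[float], List[float]]:
    """Run every first state description feature vector through the generator,
    then through both discriminators, and collect the resulting probabilities."""
    probs1: List[float] = []
    probs2: List[float] = []
    for first in first_vectors:
        second = generator(first)   # second state description feature vector
        probs1.append(d1(second))   # contributes to the first probability distribution
        probs2.append(d2(second))   # contributes to the second probability distribution
    return probs1, probs2
```

The two returned lists are what a training loop would inspect when adjusting the generator's parameters and checking the convergence condition.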
In some feasible implementations, the processor 401 is configured to: calculate a first standard deviation of the plurality of determination result probabilities included in the first determination result probability distribution and a second standard deviation of the plurality of determination result probabilities included in the second determination result probability distribution; and, when the first standard deviation and the second standard deviation are both less than or equal to a preset standard deviation threshold, determine that the generator satisfies the convergence condition after the model parameters are adjusted.
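The convergence check can be sketched as follows. The function name `generator_converged` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the application does not state whether the population or sample form is intended.

```python
from statistics import pstdev
from typing import Sequence


def generator_converged(
    probs_d1: Sequence[float],
    probs_d2: Sequence[float],
    threshold: float,
) -> bool:
    """Return True when the probabilities output by both discriminators
    each have a standard deviation at or below the preset threshold."""
    first_std = pstdev(probs_d1)    # first standard deviation
    second_std = pstdev(probs_d2)   # second standard deviation
    return first_std <= threshold and second_std <= threshold
```

For example, under this sketch `generator_converged([0.5, 0.5, 0.5], [0.49, 0.51, 0.5], 0.05)` evaluates to true, whereas a widely spread first distribution such as `[0.1, 0.9]` does not satisfy the condition at the same threshold.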
In some feasible implementations, the text data to be verified includes medical record data, the state description data of the target object in the text data to be verified includes condition description data of a patient, and the state determination data for the target object in the text data to be verified includes condition diagnosis data for the patient.
In some feasible implementations, the processor 401 is configured to: perform word segmentation on the condition description data to obtain the plurality of words that make up the condition description data; and obtain the word vector corresponding to each of the plurality of words, and generate, according to the word vector corresponding to each word, the first feature vector corresponding to the condition description data.
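A toy version of this step is sketched below. It is not the method of the application: segmentation is approximated by whitespace splitting (Chinese text would need a dedicated word segmenter), the word vectors come from a hypothetical lookup table `embeddings`, and combining the word vectors by averaging is an assumption, since the application does not specify how the first feature vector is generated from them.

```python
from typing import Dict, List

Vector = List[float]


def segment(text: str) -> List[str]:
    """Toy word segmentation: lowercase and split on whitespace."""
    return text.lower().split()


def first_feature_vector(text: str, embeddings: Dict[str, Vector], dim: int) -> Vector:
    """Average the word vectors of all known words in the text.

    Words missing from the lookup table are skipped; if no word is known,
    a zero vector of length `dim` is returned.
    """
    vectors = [embeddings[w] for w in segment(text) if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

With `embeddings = {"fever": [1.0, 0.0], "cough": [0.0, 1.0]}`, the text `"Fever cough"` maps to the averaged vector of the two known words.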
It should be understood that, in some feasible implementations, the processor 401 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The memory 402 may include a read-only memory and a random access memory, and provides instructions and data to the processor 401. A part of the memory 402 may also include a non-volatile random access memory. For example, the memory 402 may also store device type information.
In a specific implementation, the terminal device may, through its built-in functional modules, carry out the implementations provided by the steps in FIG. 1 to FIG. 3. For details, refer to the implementations provided by those steps, which are not repeated here.
In the embodiments of the present application, the terminal device may construct a first discriminator based on the sample text data from the first data source in the training sample set, and construct a second discriminator based on the sample text data from the second data source in the training sample set. Further, adversarial training may be performed with the two discriminators according to the sample text data of the at least two data sources in the training sample set, so as to obtain the generator of the generative adversarial network. Therefore, by acquiring text data to be verified from any data source, the state description data of the target object and the state determination data for the target object included in the text data to be verified can be obtained. By obtaining the first feature vector corresponding to the state description data and inputting the first feature vector into the generator of the generative adversarial network, a second feature vector can be output by the generator. Further, by obtaining the third feature vector corresponding to the state determination data, whether the state determination data is erroneous data can be determined according to the second feature vector and the third feature vector. The embodiments of the present application can improve the accuracy of error detection on text data and are broadly applicable.
An embodiment of the present application further provides a computer-readable storage medium that stores a computer program. The computer program includes program instructions which, when executed by a processor, implement the text data error detection method provided by the steps in FIG. 1 to FIG. 3. For details, refer to the implementations provided by those steps, which are not repeated here. Optionally, a storage medium involved in the present application, such as the computer-readable storage medium, may be non-volatile or volatile.
The computer-readable storage medium may be an internal storage unit of the text data error detection apparatus provided in any of the foregoing embodiments or of the terminal device, for example a hard disk or memory of an electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, for example a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the electronic device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used to store the computer program and other programs and data required by the electronic device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
The terms "first", "second", "third", "fourth", and the like in the claims, description, and drawings of the present application are used to distinguish different objects, not to describe a specific order. In addition, the terms "include" and "have", and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the description do not necessarily all refer to the same embodiment, nor do they refer to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments. The term "and/or" as used in the description and the appended claims of the present application refers to, and includes, any and all possible combinations of one or more of the associated listed items. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present application.
The methods and related apparatuses provided by the embodiments of the present application are described with reference to the method flowcharts and/or schematic structural diagrams provided by the embodiments of the present application. Specifically, each flow and/or block of the method flowcharts and/or schematic structural diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the schematic structural diagrams. These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, where the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the schematic structural diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the schematic structural diagrams.

Claims (20)

  1. A method for detecting errors in text data, the method comprising:
    acquiring text data to be verified from any data source, wherein the text data to be verified includes state description data of a target object and state determination data for the target object;
    obtaining a first feature vector corresponding to the state description data, and inputting the first feature vector into a generator of a generative adversarial network so that the generator outputs a second feature vector, wherein the generator is obtained through adversarial training, based on sample text data from at least two data sources, with at least two discriminators of the generative adversarial network, and wherein each discriminator is trained on the sample text data of one of the at least two data sources; and
    obtaining a third feature vector corresponding to the state determination data, and determining, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data.
  2. The method according to claim 1, wherein the determining, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data comprises:
    inputting the second feature vector and the third feature vector into a data-pair matching model, and determining, based on the output of the data-pair matching model, whether the state determination data is erroneous data;
    wherein the data-pair matching model is trained on at least one sample data pair and the matching label of each sample data pair, one sample data pair includes a fourth feature vector corresponding to the state description data in one piece of sample text data and a fifth feature vector corresponding to the state determination data, and the matching label of any sample data pair identifies whether the fourth feature vector and the fifth feature vector in that sample data pair match.
  3. The method according to claim 1 or 2, wherein the at least two data sources include a first data source and a second data source, the at least two discriminators include a first discriminator and a second discriminator, and before the acquiring of the text data to be verified, the method further comprises:
    obtaining a training sample set, wherein the training sample set includes sample text data from the first data source and sample text data from the second data source, and one sample data pair includes the state description data in one piece of sample text data and the state determination label of that state description data; and
    constructing the first discriminator based on the sample text data from the first data source in the training sample set, and constructing the second discriminator based on the sample text data from the second data source in the training sample set.
  4. The method according to claim 3, wherein the method further comprises:
    obtaining the state description data in each piece of sample text data in the training sample set;
    inputting the first state description feature vector corresponding to the state description data in each piece of sample text data into the generator, and obtaining the second state description feature vector output by the generator;
    inputting the second state description feature vector into the first discriminator and the second discriminator respectively, and obtaining the first determination result probability distribution output by the first discriminator and the second determination result probability distribution output by the second discriminator; and
    adjusting the model parameters of the generator according to the first determination result probability distribution and the second determination result probability distribution to obtain a generator that satisfies the convergence condition.
  5. The method according to claim 4, wherein the method further comprises:
    calculating a first standard deviation of the plurality of determination result probabilities included in the first determination result probability distribution and a second standard deviation of the plurality of determination result probabilities included in the second determination result probability distribution; and
    when the first standard deviation and the second standard deviation are both less than or equal to a preset standard deviation threshold, determining that the generator satisfies the convergence condition after the model parameters are adjusted.
  6. The method according to claim 1, wherein the text data to be verified includes medical record data, the state description data of the target object in the text data to be verified includes condition description data of a patient, and the state determination data for the target object in the text data to be verified includes condition diagnosis data for the patient.
  7. The method according to claim 6, wherein the obtaining of the first feature vector corresponding to the state description data comprises:
    performing word segmentation on the condition description data to obtain the plurality of words that make up the condition description data; and
    obtaining the word vector corresponding to each of the plurality of words, and generating, according to the word vector corresponding to each word, the first feature vector corresponding to the condition description data.
  8. An apparatus for detecting errors in text data, the apparatus comprising:
    a data acquisition module, configured to acquire text data to be verified from any data source, wherein the text data to be verified includes state description data of a target object and state determination data for the target object;
    a data processing module, configured to obtain a first feature vector corresponding to the state description data, and input the first feature vector into a generator of a generative adversarial network so that the generator outputs a second feature vector, wherein the generator is obtained through adversarial training, based on sample text data from at least two data sources, with at least two discriminators of the generative adversarial network, and wherein each discriminator is trained on the sample text data of one of the at least two data sources; and
    a data detection module, configured to obtain a third feature vector corresponding to the state determination data, and determine, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data.
  9. A terminal device, comprising a processor and a memory, the processor and the memory being connected to each other;
    wherein the memory is configured to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to perform the following method:
    acquiring text data to be verified from any data source, wherein the text data to be verified includes state description data of a target object and state determination data for the target object;
    obtaining a first feature vector corresponding to the state description data, and inputting the first feature vector into a generator of a generative adversarial network so that the generator outputs a second feature vector, wherein the generator is obtained through adversarial training, based on sample text data from at least two data sources, with at least two discriminators of the generative adversarial network, and wherein each discriminator is trained on the sample text data of one of the at least two data sources; and
    obtaining a third feature vector corresponding to the state determination data, and determining, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data.
  10. The terminal device according to claim 9, wherein, when determining, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data, the processor specifically performs:
    inputting the second feature vector and the third feature vector into a data-pair matching model, and determining, based on the output of the data-pair matching model, whether the state determination data is erroneous data;
    wherein the data-pair matching model is trained on at least one sample data pair and the matching label of each sample data pair, one sample data pair includes a fourth feature vector corresponding to the state description data in one piece of sample text data and a fifth feature vector corresponding to the state determination data, and the matching label of any sample data pair identifies whether the fourth feature vector and the fifth feature vector in that sample data pair match.
  11. The terminal device according to claim 9 or 10, wherein the at least two data sources include a first data source and a second data source, the at least two discriminators include a first discriminator and a second discriminator, and before the acquiring of the text data to be verified, the processor is further configured to perform:
    obtaining a training sample set, wherein the training sample set includes sample text data from the first data source and sample text data from the second data source, and one sample data pair includes the state description data in one piece of sample text data and the state determination label of that state description data; and
    constructing the first discriminator based on the sample text data from the first data source in the training sample set, and constructing the second discriminator based on the sample text data from the second data source in the training sample set.
  12. The terminal device according to claim 11, wherein the processor is further configured to execute:
    acquiring the state description data in each piece of sample text data in the training sample set;
    inputting a first state description feature vector, corresponding to the state description data in each piece of sample text data, into the generator, and acquiring a second state description feature vector output by the generator;
    inputting the second state description feature vector into the first discriminator and the second discriminator respectively, and acquiring a first determination result probability distribution output by the first discriminator and a second determination result probability distribution output by the second discriminator;
    adjusting model parameters of the generator according to the first determination result probability distribution and the second determination result probability distribution, to obtain a generator that satisfies a convergence condition.
  13. The terminal device according to claim 12, wherein the processor is further configured to execute:
    calculating a first standard deviation of the multiple determination result probabilities included in the first determination result probability distribution, and a second standard deviation of the multiple determination result probabilities included in the second determination result probability distribution;
    when the first standard deviation and the second standard deviation are both less than or equal to a preset standard deviation threshold, determining that the generator, after adjustment of its model parameters, satisfies the convergence condition.
  14. The terminal device according to claim 9, wherein the text data to be verified includes medical record data, the state description data for the target object in the text data to be verified includes condition description data of a patient, and the state determination data for the target object in the text data to be verified includes condition diagnosis data for the patient;
    when acquiring the first feature vector corresponding to the state description data, the processor specifically executes:
    performing word segmentation on the condition description data to obtain the multiple words that make up the condition description data;
    acquiring a word vector corresponding to each of the multiple words that make up the condition description data, and generating the first feature vector corresponding to the condition description data according to the word vector corresponding to each word.
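The feature-vector step recited in claim 14 (segment the condition description into words, look up a word vector for each word, then combine the word vectors) might be sketched as follows. This is only an illustration under stated assumptions: the claims do not say how the word vectors are combined, so mean pooling is assumed here; whitespace splitting stands in for a real word segmenter; and `TOY_VECTORS`, `segment`, and `first_feature_vector` are hypothetical names with made-up values.

```python
from typing import Dict, List

# Hypothetical toy embedding table; a real system would load trained word vectors.
TOY_VECTORS: Dict[str, List[float]] = {
    "fever": [1.0, 0.0],
    "cough": [0.0, 1.0],
}


def segment(text: str) -> List[str]:
    """Split text into words. Whitespace splitting is an assumption that
    stands in for a proper word segmenter."""
    return text.lower().split()


def first_feature_vector(
    text: str, vectors: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the word vectors of all known words (mean pooling is an
    assumption; the claims only say the vector is generated from them)."""
    words = [w for w in segment(text) if w in vectors]
    if not words:
        # No known words: fall back to a zero vector of the right size.
        return [0.0] * dim
    sums = [0.0] * dim
    for w in words:
        for i, value in enumerate(vectors[w]):
            sums[i] += value
    return [s / len(words) for s in sums]
```

Calling `first_feature_vector("Fever cough", TOY_VECTORS, 2)` averages the two toy vectors component by component; words missing from the table are simply skipped.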
  15. A computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the following method:
    acquiring text data to be verified from any data source, the text data to be verified including state description data of a target object and state determination data for the target object;
    acquiring a first feature vector corresponding to the state description data, and inputting the first feature vector into a generator in a generative adversarial network so that the generator outputs a second feature vector, the generator being obtained through adversarial training, based on sample text data from at least two data sources, against at least two discriminators in the generative adversarial network, wherein each discriminator is trained on the sample text data of one of the at least two data sources;
    acquiring a third feature vector corresponding to the state determination data, and determining, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data.
  16. The computer-readable storage medium according to claim 15, wherein determining, according to the second feature vector and the third feature vector, whether the state determination data is erroneous data specifically includes:
    inputting the second feature vector and the third feature vector into a data pair matching model, and determining, based on an output result of the data pair matching model, whether the state determination data is erroneous data;
    wherein the data pair matching model is trained on at least one sample data pair and a matching label of each sample data pair, one sample data pair includes a fourth feature vector corresponding to the state description data in one piece of sample text data and a fifth feature vector corresponding to the state determination data in that sample text data, and the matching label of any sample data pair identifies whether the fourth feature vector and the fifth feature vector in that sample data pair match.
  17. The computer-readable storage medium according to claim 15 or 16, wherein the at least two data sources include a first data source and a second data source, the at least two discriminators include a first discriminator and a second discriminator, and before the text data to be verified is acquired, the program instructions, when executed by the processor, further cause the processor to execute:
    acquiring a training sample set, the training sample set including sample text data from the first data source and sample text data from the second data source, wherein one sample data pair includes the state description data in one piece of sample text data and a state determination label of that state description data;
    constructing the first discriminator based on the sample text data from the first data source in the training sample set, and constructing the second discriminator based on the sample text data from the second data source in the training sample set.
  18. The computer-readable storage medium according to claim 17, wherein the program instructions, when executed by the processor, further cause the processor to execute:
    acquiring the state description data in each piece of sample text data in the training sample set;
    inputting a first state description feature vector, corresponding to the state description data in each piece of sample text data, into the generator, and acquiring a second state description feature vector output by the generator;
    inputting the second state description feature vector into the first discriminator and the second discriminator respectively, and acquiring a first determination result probability distribution output by the first discriminator and a second determination result probability distribution output by the second discriminator;
    adjusting model parameters of the generator according to the first determination result probability distribution and the second determination result probability distribution, to obtain a generator that satisfies a convergence condition.
  19. The computer-readable storage medium according to claim 18, wherein the program instructions, when executed by the processor, further cause the processor to execute:
    calculating a first standard deviation of the multiple determination result probabilities included in the first determination result probability distribution, and a second standard deviation of the multiple determination result probabilities included in the second determination result probability distribution;
    when the first standard deviation and the second standard deviation are both less than or equal to a preset standard deviation threshold, determining that the generator, after adjustment of its model parameters, satisfies the convergence condition.
  20. The computer-readable storage medium according to claim 15, wherein the text data to be verified includes medical record data, the state description data for the target object in the text data to be verified includes condition description data of a patient, and the state determination data for the target object in the text data to be verified includes condition diagnosis data for the patient;
    acquiring the first feature vector corresponding to the state description data specifically includes:
    performing word segmentation on the condition description data to obtain the multiple words that make up the condition description data;
    acquiring a word vector corresponding to each of the multiple words that make up the condition description data, and generating the first feature vector corresponding to the condition description data according to the word vector corresponding to each word.
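The convergence condition recited in claims 13 and 19 (both discriminators' determination result probability distributions must have a standard deviation at or below a preset threshold) can be expressed as a short check. The sketch below is an assumption-laden illustration, not the applicant's implementation: the claims do not say whether a population or sample standard deviation is meant, so the population form (`statistics.pstdev`) is assumed, and `generator_converged` and the threshold value in the usage note are hypothetical.

```python
from statistics import pstdev
from typing import Sequence


def generator_converged(
    first_dist: Sequence[float],
    second_dist: Sequence[float],
    threshold: float,
) -> bool:
    """Return True when the standard deviations of both discriminators'
    output probabilities are less than or equal to the threshold.

    Population standard deviation is an assumption; the claims only
    refer to a "standard deviation".
    """
    first_std = pstdev(first_dist)
    second_std = pstdev(second_dist)
    return first_std <= threshold and second_std <= threshold
```

For example, `generator_converged([0.5, 0.5], [0.49, 0.51], 0.05)` holds because both distributions are nearly flat, whereas a peaked first distribution such as `[0.9, 0.1]` fails the check. A flat distribution presumably indicates that the discriminator can no longer separate the candidate determinations, which is the usual stopping intuition in adversarial training.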
PCT/CN2020/132478 2020-09-28 2020-11-27 Text data error detection method and apparatus, terminal device, and storage medium WO2021159814A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011042326.2A CN111883222B (en) 2020-09-28 2020-09-28 Text data error detection method and device, terminal equipment and storage medium
CN202011042326.2 2020-09-28

Publications (1)

Publication Number Publication Date
WO2021159814A1 true WO2021159814A1 (en) 2021-08-19

Family

ID=73198706

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/132478 WO2021159814A1 (en) 2020-09-28 2020-11-27 Text data error detection method and apparatus, terminal device, and storage medium

Country Status (2)

Country Link
CN (1) CN111883222B (en)
WO (1) WO2021159814A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883222B (en) * 2020-09-28 2020-12-22 平安科技(深圳)有限公司 Text data error detection method and device, terminal equipment and storage medium
CN112820367B (en) * 2021-01-11 2023-06-30 平安科技(深圳)有限公司 Medical record information verification method and device, computer equipment and storage medium
CN112863683B (en) * 2021-02-19 2023-07-25 平安科技(深圳)有限公司 Medical record quality control method and device based on artificial intelligence, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563995A (en) * 2017-08-14 2018-01-09 华南理工大学 A kind of confrontation network method of more arbiter error-duration models
CN109003678A (en) * 2018-06-12 2018-12-14 清华大学 A kind of generation method and system emulating text case history
CN110188172A (en) * 2019-05-31 2019-08-30 清华大学 Text based event detecting method, device, computer equipment and storage medium
CN111402979A (en) * 2020-03-24 2020-07-10 清华大学 Method and device for detecting consistency of disease description and diagnosis
CN111710383A (en) * 2020-06-16 2020-09-25 平安科技(深圳)有限公司 Medical record quality control method and device, computer equipment and storage medium
CN111883222A (en) * 2020-09-28 2020-11-03 平安科技(深圳)有限公司 Text data error detection method and device, terminal equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682397B (en) * 2016-12-09 2020-05-19 江西中科九峰智慧医疗科技有限公司 Knowledge-based electronic medical record quality control method
CN109308450A (en) * 2018-08-08 2019-02-05 杰创智能科技股份有限公司 A kind of face's variation prediction method based on generation confrontation network
CN109508669B (en) * 2018-11-09 2021-07-23 厦门大学 Facial expression recognition method based on generative confrontation network
CN109993072B (en) * 2019-03-14 2021-05-25 中山大学 Low-resolution pedestrian re-identification system and method based on super-resolution image generation
CN110910976A (en) * 2019-10-12 2020-03-24 平安国际智慧城市科技股份有限公司 Medical record detection method, device, equipment and storage medium
CN111126622B (en) * 2019-12-19 2023-11-03 中国银联股份有限公司 Data anomaly detection method and device
CN111444967B (en) * 2020-03-30 2023-10-31 腾讯科技(深圳)有限公司 Training method, generating method, device, equipment and medium for generating countermeasure network
CN111639547B (en) * 2020-05-11 2021-04-30 山东大学 Video description method and system based on generation countermeasure network
CN111696637A (en) * 2020-05-15 2020-09-22 平安科技(深圳)有限公司 Quality detection method and related device for medical record data
CN111696636B (en) * 2020-05-15 2023-09-22 平安科技(深圳)有限公司 Data processing method and device based on deep neural network

Also Published As

Publication number Publication date
CN111883222B (en) 2020-12-22
CN111883222A (en) 2020-11-03

Similar Documents

Publication Publication Date Title
WO2021159814A1 (en) Text data error detection method and apparatus, terminal device, and storage medium
US20180218127A1 (en) Generating a Knowledge Graph for Determining Patient Symptoms and Medical Recommendations Based on Medical Information
US20180218126A1 (en) Determining Patient Symptoms and Medical Recommendations Based on Medical Information
US9230060B2 (en) Associating records in healthcare databases with individuals
Zhou et al. PSweight: An R package for propensity score weighting analysis
CN111242793B (en) Medical insurance data abnormality detection method and device
CN112151170A (en) Method for calculating a score of a medical advice for use as a medical decision support
CN112201342B (en) Medical auxiliary diagnosis method, device, equipment and storage medium based on federal learning
CN109783479B (en) Data standardization processing method and device and storage medium
CN107767924A (en) Initial data checking method, device, electronic equipment and storage medium
CN115269613B (en) Patient main index construction method, system, equipment and storage medium
CN112883157A (en) Method and device for standardizing multi-source heterogeneous medical data
WO2021051496A1 (en) Diagnosis result identification and model training method, computer device, and storage medium
CN112035619A (en) Medical questionnaire screening method, device, equipment and medium based on artificial intelligence
Wang et al. Bayesian adaptive lasso for additive hazard regression with current status data
WO2020082796A1 (en) Method, device and apparatus for processing medical visit information based on data analysis, and medium
WO2020034874A1 (en) Medical document examining method and apparatus, computer device, and storage medium
CN113066531B (en) Risk prediction method, risk prediction device, computer equipment and storage medium
CN112800187B (en) Data mapping method, medical text data mapping method and device and electronic equipment
WO2021151330A1 (en) User grouping method, apparatus and device, and computer-readable storage medium
WO2021151343A1 (en) Test sample category determination method and apparatus for siamese network, and terminal device
BR102022016487A2 (en) METHOD FOR SCORING AND EVALUATION OF DATA FOR EXCHANGE
CN111899844B (en) Sample generation method and device, server and storage medium
WO2021114626A1 (en) Method for detecting quality of medical record data and related device
US11537742B2 (en) Sampling from a remote dataset with a private criterion

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20918617; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 20918617; Country of ref document: EP; Kind code of ref document: A1)