CN114637843A - Data processing method and device, electronic equipment and storage medium - Google Patents

Data processing method and device, electronic equipment and storage medium

Info

Publication number: CN114637843A
Authority: CN (China)
Prior art keywords: text, text data, model, data, post
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN202011482703.4A
Other languages: Chinese (zh)
Inventor: 陈谦
Current Assignee: Alibaba Group Holding Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN202011482703.4A
Publication of CN114637843A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present application provide a data processing method, apparatus, electronic device, and storage medium, relating to the technical field of artificial intelligence. The data processing method includes: pre-training a text post-processing model using first text data training samples and the labels corresponding to the first text data training samples, to obtain a pre-trained model; obtaining unlabeled text data and pseudo-labels of the unlabeled text data, the pseudo-labels being obtained by processing the unlabeled text data; updating the pre-trained model based on the unlabeled text data and the pseudo-labels, to obtain an updated model; and updating the updated model based on second text data training samples and the labels corresponding to the second text data training samples, to obtain a trained text post-processing model. The embodiments of the present application can make the text post-processing model more accurate.

Figure 202011482703

Description

Data processing method, apparatus, electronic device, and storage medium

Technical field

The embodiments of the present application relate to the technical field of artificial intelligence, and in particular to a data processing method, apparatus, electronic device, and storage medium.

Background

Text post-processing is the further processing of text obtained through automatic speech recognition (ASR). It typically comprises two tasks: punctuation prediction and disfluency detection. A text post-processing model can be built through multi-task learning and then used to post-process the text to be processed.
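
To make the two tasks concrete, the following is a minimal sketch of how per-token outputs for punctuation prediction and disfluency detection could be combined into post-processed text. The label encoding (one punctuation string and one boolean flag per token) and the function name `apply_post_processing` are assumptions made for illustration; the application does not specify a label format.

```python
from typing import List


def apply_post_processing(tokens: List[str], punct: List[str], disfluent: List[bool]) -> str:
    """Drop tokens flagged as disfluent and attach the predicted punctuation to the rest.

    Assumed encoding (hypothetical): punct[i] is the mark to append after tokens[i]
    ("" for none), and disfluent[i] is True when tokens[i] should be removed.
    """
    if not (len(tokens) == len(punct) == len(disfluent)):
        raise ValueError("tokens and both label sequences must have the same length")
    kept = []
    for token, mark, drop in zip(tokens, punct, disfluent):
        if drop:
            continue  # disfluency detection: remove fillers and repetitions
        kept.append(token + mark)  # punctuation prediction: restore marks
    return " ".join(kept)


# ASR-style input: no punctuation, one filler word and one repeated word.
tokens = ["um", "i", "i", "like", "tea", "do", "you"]
punct = ["", "", "", "", ".", "", "?"]
disfluent = [True, True, False, False, False, False, False]
print(apply_post_processing(tokens, punct, disfluent))  # i like tea. do you?
```

In a multi-task model, both label sequences would be predicted from a shared encoder; here they are hard-coded to show only the data layout.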

At present, text post-processing models are mainly trained on supervised text data. Specifically, a large amount of standard text data is first obtained from a standard library (for example, Wikipedia) to serve as training labels, and training samples are generated from that standard text data. The initial text post-processing model is then trained on these training samples and labels to obtain the corresponding model.

This process uses a large amount of standard text data for model training. However, such standard text data usually comes from a single source and covers a limited range of application domains, so the text post-processing model obtained by this training method has low accuracy.

Summary of the invention

The purpose of the present application is to provide a data processing method, apparatus, electronic device, and computer storage medium that at least partially solve the above problems in the prior art.

According to a first aspect of the embodiments of the present application, a data processing method is provided, including:

pre-training a text post-processing model using first text data training samples and the labels corresponding to the first text data training samples, to obtain a pre-trained model;

obtaining unlabeled text data and pseudo-labels of the unlabeled text data, the pseudo-labels being obtained by processing the unlabeled text data;

training and updating the pre-trained model based on the unlabeled text data and the pseudo-labels, to obtain an updated model;

training and updating the updated model based on second text data training samples and the labels corresponding to the second text data training samples, to obtain a trained text post-processing model.

According to a second aspect of the embodiments of the present application, a data processing method is provided, including:

obtaining text data to be processed;

inputting the text data to be processed into a text post-processing model, and obtaining processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of the first aspect.

According to a third aspect of the embodiments of the present application, a data processing method is provided, including:

receiving an instruction, input through the interface of an instant messaging application, instructing that input voice data be converted into text data;

performing text conversion on the voice data according to the instruction, to obtain text data to be processed;

inputting the text data to be processed into a text post-processing model, and obtaining processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of the first aspect.

According to a fourth aspect of the embodiments of the present application, a data processing method is provided, including:

receiving an instruction, input and set through an all-in-one device, instructing that input voice data be converted into text data;

performing text conversion on the voice data according to the instruction, to obtain text data to be processed;

inputting the text data to be processed into a text post-processing model, and obtaining processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of the first aspect.

According to a fifth aspect of the embodiments of the present application, a data processing method is provided, including:

receiving voice data uploaded by a public cloud client;

performing text conversion on the voice data, to obtain text data to be processed;

inputting the text data to be processed into a text post-processing model, and obtaining processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of the first aspect.

According to a sixth aspect of the embodiments of the present application, a data processing method is provided, including:

receiving text data to be processed uploaded by a public cloud client, wherein the text data to be processed is obtained by the public cloud client performing text conversion on received voice data;

inputting the text data to be processed into a text post-processing model, and obtaining processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of the first aspect.

According to a seventh aspect of the embodiments of the present application, a data processing apparatus is provided. The apparatus includes:

a model pre-training module, configured to pre-train a text post-processing model using first text data training samples and the labels corresponding to the first text data training samples, to obtain a pre-trained model;

an unlabeled text data and pseudo-label acquisition module, configured to obtain unlabeled text data and pseudo-labels of the unlabeled text data, the pseudo-labels being obtained by processing the unlabeled text data;

a first training update module, configured to train and update the pre-trained model based on the unlabeled text data and the pseudo-labels, to obtain an updated model;

a second training update module, configured to train and update the updated model based on second text data training samples and the labels corresponding to the second text data training samples, to obtain a trained text post-processing model.

According to an eighth aspect of the embodiments of the present application, a data processing apparatus is provided. The apparatus includes:

a to-be-processed text data acquisition module, configured to obtain text data to be processed;

a first processed text data acquisition module, configured to input the text data to be processed into a text post-processing model and obtain processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of the first aspect.

According to a ninth aspect of the embodiments of the present application, a data processing apparatus is provided. The apparatus includes:

a first instruction receiving module, configured to receive an instruction, input through the interface of an instant messaging application, instructing that input voice data be converted into text data;

a first text conversion module, configured to perform text conversion on the voice data according to the instruction, to obtain text data to be processed;

a second processed text data acquisition module, configured to input the text data to be processed into a text post-processing model and obtain processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of the first aspect.

According to a tenth aspect of the embodiments of the present application, a data processing apparatus is provided. The apparatus includes:

a second instruction receiving module, configured to receive an instruction, input and set through an all-in-one device, instructing that input voice data be converted into text data;

a second text conversion module, configured to perform text conversion on the voice data according to the instruction, to obtain text data to be processed;

a third processed text data acquisition module, configured to input the text data to be processed into a text post-processing model and obtain processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of the first aspect.

According to an eleventh aspect of the embodiments of the present application, a data processing apparatus is provided. The apparatus includes:

a voice data receiving module, configured to receive voice data uploaded by a public cloud client;

a third text conversion module, configured to perform text conversion on the voice data, to obtain text data to be processed;

a fourth processed text data acquisition module, configured to input the text data to be processed into a text post-processing model and obtain processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of the first aspect.

According to a twelfth aspect of the embodiments of the present application, a data processing apparatus is provided. The apparatus includes:

a to-be-processed text data receiving module, configured to receive text data to be processed uploaded by a public cloud client, wherein the text data to be processed is obtained by the public cloud client performing text conversion on received voice data;

a fifth processed text data acquisition module, configured to input the text data to be processed into a text post-processing model and obtain processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of the first aspect.

According to a thirteenth aspect of the embodiments of the present application, an electronic device is provided, including: one or more processors; and a computer-readable medium configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the data processing method of any one of the first to sixth aspects.

According to a fourteenth aspect of the embodiments of the present application, a computer-readable medium is provided, on which a computer program is stored; when executed by a processor, the program implements the data processing method of any one of the first to sixth aspects.

According to a fifteenth aspect of the embodiments of the present application, a computer program is provided, containing computer-executable instructions which, when executed, implement the data processing method of any one of the first to sixth aspects.

According to the data processing solution provided by the embodiments of the present application, the pre-trained model obtained from the first text data training samples and their corresponding labels is not used as the final text post-processing model. Instead, the pre-trained model is updated based on the unlabeled text data and its corresponding pseudo-labels to obtain an updated model, and the updated model is then updated again using the second text data training samples and their corresponding labels, yielding the trained text post-processing model. In other words, compared with a conventional model training process, the solution provided by the embodiments of the present application applies two further updates to the pre-trained model before the trained text post-processing model is obtained. The text post-processing model obtained with the data processing method provided by the embodiments of the present application is therefore more accurate.
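
The order of the three training stages can be sketched as follows. This is only an illustration of the control flow, not the application's implementation: the `fit` and `predict_label` callables, the `Model` stand-in (a plain list that logs each stage), and all sample sentences are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, label text)
Model = List[str]          # toy stand-in: a log of the training stages applied


def train_text_post_processor(
    model: Model,
    first_labeled: Sequence[Example],
    unlabeled: Sequence[str],
    second_labeled: Sequence[Example],
    fit: Callable[[Model, Sequence[Example]], Model],
    predict_label: Callable[[str], str],
) -> Model:
    # Stage 1: supervised pre-training on the first labeled set.
    pretrained = fit(model, first_labeled)
    # Stage 2: pseudo-label the unlabeled text, then update the pre-trained model.
    pseudo_labeled = [(text, predict_label(text)) for text in unlabeled]
    updated = fit(pretrained, pseudo_labeled)
    # Stage 3: update again on the second labeled set.
    return fit(updated, second_labeled)


def toy_fit(model: Model, data: Sequence[Example]) -> Model:
    """Hypothetical trainer: records how many examples each stage saw."""
    return model + [f"trained on {len(data)} examples"]


final_model = train_text_post_processor(
    model=[],
    first_labeled=[("um hello world", "Hello, world."), ("i i see", "I see.")],
    unlabeled=["so uh we met today", "okay okay thanks", "see you"],
    second_labeled=[("uh good morning", "Good morning.")],
    fit=toy_fit,
    predict_label=lambda text: text.capitalize() + ".",  # placeholder labeler
)
print(final_model)
```

With a real framework, `fit` would run gradient updates on the network parameters and `predict_label` would call the label prediction model; the staging logic would stay the same.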

Description of the drawings

Other features, objects, and advantages of the present application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the drawings below:

Fig. 1a is a flowchart of the steps of the data processing method in Embodiment 1 of the present application;

Fig. 1b is a schematic diagram of the data processing flow provided according to Embodiment 1 of the present application;

Fig. 2 is a flowchart of the steps of the data processing method in Embodiment 2 of the present application;

Fig. 3a is a flowchart of the steps of the data processing method in Embodiment 3 of the present application;

Fig. 3b is a schematic diagram of the data processing flow provided according to Embodiment 3 of the present application;

Fig. 4 is a flowchart of the steps of the data processing method in Embodiment 4 of the present application;

Fig. 5 is a flowchart of the steps of the data processing method in Embodiment 5 of the present application;

Fig. 6 is a flowchart of the steps of the data processing method in Embodiment 6 of the present application;

Fig. 7 is a flowchart of the steps of the data processing method in Embodiment 7 of the present application;

Fig. 8 is a schematic structural diagram of the data processing apparatus in Embodiment 8 of the present application;

Fig. 9 is a schematic structural diagram of the data processing apparatus in Embodiment 9 of the present application;

Fig. 10 is a schematic structural diagram of the data processing apparatus in Embodiment 10 of the present application;

Fig. 11 is a schematic structural diagram of the data processing apparatus in Embodiment 11 of the present application;

Fig. 12 is a schematic structural diagram of the data processing apparatus in Embodiment 12 of the present application;

Fig. 13 is a schematic structural diagram of the data processing apparatus in Embodiment 13 of the present application;

Fig. 14 is a schematic structural diagram of the electronic device in Embodiment 14 of the present application;

Fig. 15 shows the hardware structure of the electronic device in Embodiment 15 of the present application.

Detailed description

The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention.

It should be noted that, where no conflict arises, the embodiments of the present application and the features of those embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.

Referring to Fig. 1a, a flowchart of the steps of the data processing method according to Embodiment 1 of the present application is shown.

Specifically, the data processing method provided by this embodiment includes the following steps:

Step 101: pre-train a text post-processing model using first text data training samples and the labels corresponding to the first text data training samples, to obtain a pre-trained model.

In this step, the first text data training samples may be any text data obtained through automatic speech recognition. For example, disfluent text can be generated from the written text (standard text data) in an existing written corpus (standard text database); the generated disfluent text then serves as the first text data training samples, and the corresponding written text serves as the labels.

The text post-processing model in the embodiments of the present invention may be any deep learning model, for example a convolutional neural network model or a recurrent neural network model. The specific form of the text post-processing model is not limited here.

The number of training samples is usually very large, so obtaining their labels through manual annotation is costly. Therefore, in some optional embodiments, before this step is performed, standard text data may be obtained from a standard text database and corresponding disfluent text data may be generated from it using preset rules; the disfluent text data is used as the first text data training samples, and the standard text data is used as the labels corresponding to the first text data training samples. Compared with manual annotation, this approach reduces labor cost.

Here, disfluent text data means text data that contains repeated words or redundant filler words, or that does not read smoothly. The preset rules may be set by those skilled in the art according to actual needs. For example, a rule may filter out the punctuation marks in the standard text data and randomly insert preset filler words (such as "um" or "ah") into the filtered text; or it may filter out the punctuation marks and randomly repeat some of the words in the filtered text. In the embodiments of the present application, the specific content of the preset rules may be set according to the actual situation and is not limited here.
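
The two example rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the filler list, the insertion and repetition probabilities, whitespace tokenization, and the function names are all hypothetical, and `string.punctuation` covers only ASCII punctuation, so Chinese text would need its own punctuation set and a word segmenter.

```python
import random
import string
from typing import List, Tuple

FILLERS = ["um", "ah"]  # hypothetical preset filler words


def strip_punctuation(text: str) -> List[str]:
    """Remove ASCII punctuation and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).split()


def make_disfluent(
    standard_text: str,
    rng: random.Random,
    p_filler: float = 0.2,
    p_repeat: float = 0.2,
) -> Tuple[str, str]:
    """Return (training sample, label) built from one standard sentence.

    The sample has no punctuation, with fillers randomly inserted before words
    and words randomly repeated; the label is the untouched standard text.
    """
    noisy: List[str] = []
    for word in strip_punctuation(standard_text):
        if rng.random() < p_filler:
            noisy.append(rng.choice(FILLERS))  # rule 1: insert a filler word
        noisy.append(word)
        if rng.random() < p_repeat:
            noisy.append(word)                 # rule 2: repeat the word
    return " ".join(noisy), standard_text


sample, label = make_disfluent("Hello, world. How are you?", random.Random(0))
print(sample)
print(label)
```

Passing an explicit `random.Random` instance keeps the generated corpus reproducible across runs, which helps when comparing training configurations.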

Step 102: obtain unlabeled text data and pseudo-labels of the unlabeled text data, the pseudo-labels being obtained by processing the unlabeled text data.

In the embodiments of the present application, the pseudo-labels of the unlabeled text data can be obtained by performing a preliminary text post-processing operation on the unlabeled text data. Because the model has not finished training at this point, the labels obtained may not be sufficiently accurate, which is why they are called pseudo-labels.

In some optional embodiments, obtaining the unlabeled text data and its pseudo-labels may include: obtaining the unlabeled text data; and performing label prediction on the unlabeled text data with a label prediction model, to obtain the pseudo-labels of the unlabeled text data.

Optionally, the unlabeled text data can be obtained as follows:

obtain voice data to be recognized; recognize the voice data using automatic speech recognition, to obtain the unlabeled text data.

In this method, the voice data to be recognized is obtained first, and the unlabeled text data is then obtained through automatic speech recognition. A large amount of unlabeled text data can thus be obtained quickly, and the data is well suited to ASR scenarios.

Optionally, the label prediction model can be obtained in either of the following two ways:

In the first way, the label prediction model is obtained by training and updating the pre-trained model from step 101. Specifically, further text data training samples and their corresponding labels are obtained, and the network parameters of the pre-trained model from step 101 are trained and updated on them, yielding the label prediction model.

The further text data training samples may be text data awaiting post-processing that closely matches the application domain (target domain) of the text post-processing model. Correspondingly, to ensure the accuracy of the labels, these samples may be post-processed with human involvement to obtain their labels.

In the second way, the label prediction model is obtained by training and updating a different model that is larger and more accurate than the pre-trained model from step 101.

For example, an initial label prediction model may first be constructed that contains more network layers than the pre-trained model, or whose network layers each have a larger dimension than those of the pre-trained model. The initial label prediction model is then pre-trained using text from an existing corpus as training samples, to obtain a pre-trained label prediction model. Next, text data training samples that closely match the application domain (target domain) of the text post-processing model are obtained together with their labels, and the network parameters of the pre-trained label prediction model are trained and updated on them, finally yielding the trained label prediction model. Because this label prediction model has more network layers or larger layer dimensions, its accuracy is also higher.
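
The size condition on the label prediction model can be expressed as a simple check. This is a sketch assuming that a model's size is summarized by a layer count and a single per-layer dimension; the `ModelSize` class, its field names, and the example values are hypothetical and do not come from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSize:
    num_layers: int  # number of network layers
    hidden_dim: int  # dimension of each network layer (assumed uniform)


def is_larger_label_predictor(predictor: ModelSize, base: ModelSize) -> bool:
    """True if the predictor has more layers or a larger layer dimension than the base model."""
    return predictor.num_layers > base.num_layers or predictor.hidden_dim > base.hidden_dim


base_model = ModelSize(num_layers=4, hidden_dim=256)        # illustrative values
label_predictor = ModelSize(num_layers=12, hidden_dim=768)  # illustrative values
print(is_larger_label_predictor(label_predictor, base_model))  # True
```

A check like this could guard a training script against accidentally configuring a label prediction model that is no larger than the model it is meant to supervise.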

上述两种标签预测模型的获取方式相比:第一种方式,是在步骤101中得到的预训练模型的基础上,进行训练更新的,因此,实现过程较为简单,标签预测模型获取速度较快;第二种方式,并不是基于步骤101中得到的预训练模型的,而是基于另外的,比上述步骤101中得到的预训练模型更大规模、具有更高准确率的模型,进行训练更新得到的,因此,模型的训练过程较为复杂,标签预测模型获取速度较慢,但是,获取到的标签预测模型的准确率更高。Comparing the acquisition methods of the above two label prediction models: the first method is based on the pre-training model obtained in step 101, and is trained and updated. Therefore, the implementation process is relatively simple, and the acquisition speed of the label prediction model is faster. ; The second method is not based on the pre-training model obtained in step 101, but based on another model with a larger scale and higher accuracy than the pre-training model obtained in the above-mentioned step 101. Therefore, the training process of the model is more complicated, and the acquisition speed of the label prediction model is slow, but the accuracy of the acquired label prediction model is higher.

Specifically, in some optional embodiments, using a label prediction model to perform label prediction on the unlabeled text data to obtain the pseudo-labels of the unlabeled text data may include:

training and updating the pre-trained model based on third text data training samples and their corresponding labels to obtain the label prediction model; and using the label prediction model to perform label prediction on the unlabeled text data to obtain the pseudo-labels of the unlabeled text data.

The third text data training samples may be text data awaiting post-processing that closely matches the application field (target field) of the text post-processing model. Correspondingly, to ensure the accuracy of the labels, the third text data training samples may be post-processed manually to obtain their corresponding labels.

For example, the third text data training samples and their labels may come from a small-scale manually annotated spoken-language corpus. Specifically, the utterances in that corpus may serve as the third text data training samples, and the corresponding manual annotations as the labels.

In other optional embodiments, using a label prediction model to perform label prediction on the unlabeled text data to obtain the pseudo-labels of the unlabeled text data may include:

obtaining a pre-built initial label prediction model, wherein the label prediction model contains more network layers than the text post-processing model, and/or the dimension of each network layer in the label prediction model is larger than that of each network layer in the text post-processing model;

pre-training the initial label prediction model with fourth text data training samples and their corresponding labels to obtain a pre-trained label prediction model;

training and updating the pre-trained label prediction model based on fifth text data training samples and their corresponding labels to obtain the trained label prediction model;

using the trained label prediction model to perform label prediction on the unlabeled text data to obtain the pseudo-labels of the unlabeled text data.

The fourth text data training samples serve the same purpose as the first text data training samples in step 101 above: both are used for model pre-training. Therefore, in some optional embodiments, written text from an existing written-language corpus may likewise be used to generate disfluent (non-smooth) text; the generated disfluent text then serves as the fourth text data training samples, and the written text from which it was generated serves as the labels.
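As a rough illustration of generating disfluent text from written text, the sketch below inserts filler words and word repetitions at random. The filler list, the probabilities, and the function names are assumptions made for this example; the application does not specify the generation rules.

```python
import random
from typing import List, Tuple

# Hypothetical filler inventory; not taken from the application.
FILLERS = ["um", "uh", "er"]


def make_disfluent(sentence: str, rng: random.Random,
                   p_filler: float = 0.3, p_repeat: float = 0.2) -> str:
    """Insert fillers and repeated words into a fluent sentence."""
    out: List[str] = []
    for word in sentence.split():
        if rng.random() < p_filler:
            out.append(rng.choice(FILLERS))  # filler before the word
        out.append(word)
        if rng.random() < p_repeat:
            out.append(word)                 # simple repetition
    return " ".join(out)


def build_pretraining_pairs(corpus: List[str], seed: int = 0) -> List[Tuple[str, str]]:
    """Return (disfluent sample, fluent label) pairs."""
    rng = random.Random(seed)
    return [(make_disfluent(s, rng), s) for s in corpus]


pairs = build_pretraining_pairs(["the meeting starts at nine",
                                 "please send the report"])
for sample, label in pairs:
    print(repr(sample), "->", repr(label))
```

Each pair keeps the original written sentence as the label, so no manual annotation is needed for the pre-training data.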

The fifth text data training samples serve the same purpose as the third text data training samples above: both are used to train and update the model. Therefore, in some optional embodiments, utterances from a small-scale manually annotated spoken-language corpus may likewise be used as the fifth text data training samples, with the corresponding manual annotations as the labels.

Step 103: train and update the pre-trained model based on the unlabeled text data and the pseudo-labels to obtain an updated model.

The pre-trained model is obtained by pre-training on an existing corpus. Because the application fields covered by existing corpora are limited and match the application field of the text post-processing model only loosely, the accuracy of the pre-trained model is relatively low. For this reason, after the pre-trained model is obtained, a large amount of unlabeled text data covering more application fields (for example, ASR manually transcribed text) can be acquired, and the network parameters of the pre-trained model can then be trained and updated on this unlabeled text data and its pseudo-labels, yielding an updated model with higher accuracy.
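The pseudo-labeling that feeds step 103 can be sketched as follows. The `toy_label_predictor` here is a stand-in that simply deletes a fixed set of filler words; in the application the label prediction model is a trained network, so every name in this block is hypothetical.

```python
from typing import Callable, List, Tuple


def pseudo_label(label_predictor: Callable[[str], str],
                 unlabeled: List[str]) -> List[Tuple[str, str]]:
    """Pair each unlabeled text with the label predicted for it."""
    return [(text, label_predictor(text)) for text in unlabeled]


# Stand-in for a trained label prediction model (assumption for this sketch).
_FILLERS = {"um", "uh"}


def toy_label_predictor(text: str) -> str:
    return " ".join(w for w in text.split() if w not in _FILLERS)


pseudo_pairs = pseudo_label(toy_label_predictor,
                            ["um we should uh start now", "uh okay"])
print(pseudo_pairs)
```

The resulting pairs have the same shape as manually labeled samples, which is what lets the pre-trained model be updated on them in the usual supervised way.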

Step 104: train and update the updated model based on second text data training samples and their corresponding labels to obtain the trained text post-processing model.

Like the third and fifth text data training samples above, the second text data training samples in this step are used to train and update the model, and may be text data awaiting post-processing that closely matches the application field (target field) of the text post-processing model. Correspondingly, to ensure the accuracy of the labels, the second text data training samples may be post-processed manually to obtain their corresponding labels. For example, in some optional embodiments, utterances from a small-scale manually annotated spoken-language corpus may also be used as the second text data training samples, with the corresponding manual annotations as the labels.

According to the data processing method, apparatus, electronic device, and storage medium provided by the embodiments of the present application, the data processing method is as follows: a text post-processing model is pre-trained with first text data training samples and their corresponding labels to obtain a pre-trained model; unlabeled text data and the pseudo-labels obtained by processing that unlabeled text data are acquired; the pre-trained model is updated based on the unlabeled text data and the pseudo-labels to obtain an updated model; and the updated model is updated based on second text data training samples and their corresponding labels to obtain the trained text post-processing model.

In the embodiments of the present application, the pre-trained model obtained with the first text data training samples and their corresponding labels is not used as the final text post-processing model. Instead, the pre-trained model is updated based on the unlabeled text data and its corresponding pseudo-labels to obtain an updated model, and the updated model is then updated again with the second text data training samples and their corresponding labels to obtain the trained text post-processing model. In other words, compared with a conventional model training process, the model training method provided by the embodiments of the present application applies two further updates to the pre-trained model before the trained text post-processing model is obtained. The text post-processing model obtained by this data processing method is therefore more accurate.

The data processing method provided by the embodiments of the present application may be executed by any suitable device with data processing capability, including but not limited to terminals, mobile terminals, PCs, and servers.

Referring to FIG. 1b, FIG. 1b is a schematic diagram of the data processing flow provided by Embodiment 1 of the present application. The flow is briefly described below with reference to FIG. 1b and mainly includes the following.

After an initial text post-processing model is built, for example an initial Transformer model:

Step 1: pre-train the initial text post-processing model with first text data training samples and their corresponding labels to obtain a pre-trained model. Specifically, the initial Transformer model may be pre-trained on a large-scale existing written-language corpus: preset rules are applied to the corpus text to generate disfluent (non-smooth) data, which serves as the first text data training samples, and the corpus text from which the disfluent data was generated serves as the labels.

Step 2: obtain third text data training samples and their corresponding labels, and use them to train and update the above pre-trained model to obtain a label prediction model. Specifically, utterances from a small-scale manually annotated spoken-language corpus may serve as the third text data training samples, with the corresponding manual annotations as the labels.

Step 3: perform self-training with unlabeled text data obtained through automatic speech recognition. Specifically, use the above label prediction model to perform label prediction on the unlabeled text data obtained through automatic speech recognition to obtain its pseudo-labels; then train and update the pre-trained model based on the unlabeled text data and the pseudo-labels to obtain an updated model.

Step 4: train and update the updated model based on second text data training samples and their corresponding labels to obtain the trained text post-processing model. Specifically, utterances from a small-scale manually annotated spoken-language corpus may serve as the second text data training samples, with the corresponding manual annotations as the labels.
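The four steps above can be sketched end to end with a toy model. `ToyModel` is a deliberately trivial stand-in for the Transformer (it only learns which tokens to delete), and all names and sample sentences are assumptions for illustration; the point is the order of the stages and which model each stage starts from.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (text sample, label)


@dataclass
class ToyModel:
    """Stand-in for a Transformer: learns a set of tokens to delete."""
    drop_tokens: Set[str] = field(default_factory=set)
    history: List[str] = field(default_factory=list)

    def fit(self, pairs: List[Pair], stage: str) -> "ToyModel":
        for sample, label in pairs:
            self.drop_tokens |= set(sample.split()) - set(label.split())
        self.history.append(stage)
        return self

    def predict(self, text: str) -> str:
        return " ".join(w for w in text.split() if w not in self.drop_tokens)


def train_pipeline(first: List[Pair], third: List[Pair],
                   unlabeled: List[str], second: List[Pair]) -> ToyModel:
    pretrained = ToyModel().fit(first, "pretrain")                  # step 1
    predictor = copy.deepcopy(pretrained).fit(third, "predictor")   # step 2
    pseudo = [(t, predictor.predict(t)) for t in unlabeled]         # step 3: pseudo-labels
    updated = copy.deepcopy(pretrained).fit(pseudo, "self-train")   # step 3: update pre-trained model
    return updated.fit(second, "fine-tune")                         # step 4


model = train_pipeline(
    first=[("um hello there", "hello there")],
    third=[("uh good morning", "good morning")],
    unlabeled=["um uh see you er soon"],
    second=[("er thanks", "thanks")],
)
print(model.history)                             # ['pretrain', 'self-train', 'fine-tune']
print(model.predict("um er thanks uh everyone")) # thanks everyone
```

Note that the self-training stage starts from a copy of the pre-trained model rather than from the label prediction model, matching the description of step 3.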

Referring to FIG. 2, a flowchart of the steps of the data processing method of Embodiment 2 of the present application is shown.

Specifically, the data processing method provided by this embodiment includes the following steps:

Step 201: acquire text data to be processed.

Step 202: input the text data to be processed into a text post-processing model, and obtain the processed text data output by the text post-processing model.

The text post-processing model may be obtained by the data processing method of Embodiment 1 above, which is not repeated here.

In this embodiment, after the text data to be processed is acquired, it is input into the text post-processing model obtained by the data processing method of Embodiment 1, and the processed text data output by the text post-processing model is obtained.

In the data processing of Embodiment 1 above, the pre-trained model obtained with the first text data training samples and their corresponding labels is not used as the final text post-processing model. Instead, the pre-trained model is updated based on the unlabeled text data and its corresponding pseudo-labels to obtain an updated model, and the updated model is then updated again with the second text data training samples and their corresponding labels to obtain the trained text post-processing model. In other words, compared with a conventional model training process, the data processing method of Embodiment 1 applies two further updates to the pre-trained model before the trained text post-processing model is obtained. The text post-processing model obtained by the data processing method of Embodiment 1 is therefore more accurate.

Consequently, inputting the text data to be processed into the text post-processing model obtained by the data processing method of Embodiment 1 yields processed text data of higher accuracy.

The data processing method provided by the embodiments of the present application may be executed by any suitable device with data processing capability, including but not limited to terminals, mobile terminals, PCs, and servers.

Referring to FIG. 3a, a flowchart of the steps of the data processing method of Embodiment 3 of the present application is shown.

Specifically, the data processing method provided by this embodiment includes the following steps:

Step 301: acquire text data returned from online logs as the text data to be processed.

The text data returned from online logs is what the text post-processing model actually processes in the application stage; in other words, the application field of this data is precisely the target field of the text post-processing model.

This data is therefore taken as the text data to be processed. After the text post-processing model predicts pseudo-labels for the text data returned from online logs, that text data and its pseudo-labels are used as training samples to train and update (fine-tune) the model. As more text data is returned from online logs, the accuracy of the text post-processing model can be improved continuously.

Step 302: input the text data to be processed into the text post-processing model, and obtain the processed text data output by the text post-processing model.

The text post-processing model may be obtained by the data processing method of Embodiment 1 above.

The text data returned from online logs is input into the text post-processing model, which performs label prediction on it to produce pseudo-label data for the text data returned from online logs; this pseudo-label data is the processed text data output by the text post-processing model in this step.

Step 303: train and update the updated model based on the text data to be processed and the processed text data to obtain a transition model.

The updated model here is the updated model of Embodiment 1 above.

In this step, the text data returned from online logs and its pseudo-labels are used to fine-tune the updated model of Embodiment 1, that is, to train and update the network parameters of the model, yielding the transition model.

Step 304: train and update the transition model based on sixth text data training samples and their corresponding labels to obtain a hot standby model.

The sixth text data training samples in this step are used to train and update the model, and may likewise be text data awaiting post-processing that closely matches the application field (target field) of the text post-processing model. Correspondingly, to ensure the accuracy of the labels, the sixth text data training samples may be post-processed manually to obtain their corresponding labels.

For example, utterances from a small-scale manually annotated spoken-language corpus may be used as the sixth text data training samples, with the corresponding specific text content as the labels.

In this step, the small-scale manually annotated spoken-language data is used to fine-tune the transition model obtained in step 303, that is, to train and update the network parameters of the transition model, yielding the hot standby model.

Step 305: calculate the accuracy of the hot standby model and of the text post-processing model separately.

In some optional embodiments, the accuracy of the hot standby model and of the text post-processing model may be calculated as follows:

obtain seventh text data training samples and their corresponding labels; then, based on the seventh text data training samples and their corresponding labels, calculate the accuracy of the hot standby model and of the text post-processing model separately.

The seventh text data training samples in this step are used to verify model accuracy, and may be text data awaiting post-processing that closely matches the application field (target field) of the text post-processing model. Correspondingly, the seventh text data training samples may also be post-processed manually to obtain their corresponding labels. For example, utterances from a small-scale manually annotated spoken-language corpus may be used as the seventh text data training samples, with the corresponding specific text content as the labels.
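The application does not say how accuracy is defined. As one possible reading, the sketch below scores both models by exact match between the model output and the label on the same validation pairs; the metric, the helper names, and the toy models are all assumptions.

```python
from typing import Callable, Sequence, Set, Tuple


def exact_match_accuracy(predict: Callable[[str], str],
                         pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of samples whose prediction equals the label exactly."""
    if not pairs:
        raise ValueError("validation set must not be empty")
    correct = sum(1 for text, label in pairs if predict(text) == label)
    return correct / len(pairs)


def make_filter(drop: Set[str]) -> Callable[[str], str]:
    """Toy model: delete every token found in `drop`."""
    def predict(text: str) -> str:
        return " ".join(w for w in text.split() if w not in drop)
    return predict


# Hypothetical seventh text data samples and labels.
validation = [("um hello", "hello"), ("uh er thanks", "thanks")]
current_model = make_filter({"um", "uh"})
hot_standby_model = make_filter({"um", "uh", "er"})

acc_current = exact_match_accuracy(current_model, validation)      # 0.5
acc_standby = exact_match_accuracy(hot_standby_model, validation)  # 1.0
print(acc_current, acc_standby)
```

Scoring both models on the same held-out pairs keeps the comparison in the following steps fair.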

Step 306: when the accuracy of the text post-processing model is lower than that of the hot standby model, use the hot standby model as the new text post-processing model for the next text post-processing operation.

Step 307: when the accuracy of the hot standby model is lower than that of the text post-processing model, continue to use the text post-processing model for the next text post-processing operation.

In this embodiment, after the text data returned from online logs is acquired as the text data to be processed, it is input into the text post-processing model obtained by the data processing method of Embodiment 1 to obtain the processed text data output by that model. The updated model of Embodiment 1 is then trained and updated based on the text data to be processed and the processed text data to obtain a transition model; the transition model is trained and updated based on the sixth text data training samples and their corresponding labels to obtain a hot standby model; the accuracy of the hot standby model and of the text post-processing model is calculated separately; and when the accuracy of the text post-processing model is lower than that of the hot standby model, the hot standby model is used as the new text post-processing model for the next text post-processing operation.

In the data processing of Embodiment 1 above, the pre-trained model obtained with the first text data training samples and their corresponding labels is not used as the final text post-processing model. Instead, the pre-trained model is updated based on the unlabeled text data and its corresponding pseudo-labels to obtain an updated model, and the updated model is then updated again with the second text data training samples and their corresponding labels to obtain the trained text post-processing model. In other words, compared with a conventional model training process, the data processing method of Embodiment 1 applies two further updates to the pre-trained model before the trained text post-processing model is obtained. The text post-processing model obtained by the data processing method of Embodiment 1 is therefore more accurate.

Consequently, inputting the text data to be processed into the text post-processing model obtained by the data processing method of Embodiment 1 yields processed text data of higher accuracy.

In addition, in Embodiment 3 above, the text data returned from online logs is taken as the text data to be processed; the updated model obtained in Embodiment 1 is trained and updated based on this text data and its corresponding processed text data to obtain a transition model; the transition model is then trained and updated to obtain a hot standby model; and the accuracy of the hot standby model and of the text post-processing model is calculated separately, with the more accurate model used as the text post-processing model for the next text post-processing operation. The text post-processing model can thus be updated continuously with the text data returned from online logs, which further improves its accuracy.

The data processing method provided by the embodiments of the present application may be executed by any suitable device with data processing capability, including but not limited to terminals, mobile terminals, PCs, and servers.

Referring to FIG. 3b, FIG. 3b is a schematic diagram of the data processing flow provided by Embodiment 3 of the present application.

The data processing flow provided by Embodiment 3 is briefly described below with reference to FIG. 3b and mainly includes the following.

After the text post-processing model is obtained:

Step 1: self-train the text post-processing model with the text data returned from online logs. Specifically, input the unlabeled text data returned from online logs into the text post-processing model to obtain pseudo-labels for that data (the processed text data); then, based on the text data returned from online logs and its pseudo-labels, train and update the updated model (obtained in Embodiment 1) to obtain a transition model.

Step 2: train and update the transition model based on the sixth text data training samples and their labels to obtain a hot standby model.

Step 3: check whether the accuracy of the hot standby model is higher than that of the text post-processing model used in Step 1. If it is, replace that text post-processing model with the hot standby model, which becomes the new text post-processing model for the next text post-processing operation.
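One update cycle of this flow might look like the sketch below. `ToyModel`, the exact-match `accuracy` function, and the sample data are illustrative assumptions; keeping the current model on a tie is also an assumption, since the application only describes the two strict cases.

```python
import copy
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (text sample, label)


class ToyModel:
    """Stand-in model that learns a set of tokens to delete."""

    def __init__(self, drop: Iterable[str] = ()) -> None:
        self.drop = set(drop)

    def fit(self, pairs: List[Pair]) -> "ToyModel":
        for sample, label in pairs:
            self.drop |= set(sample.split()) - set(label.split())
        return self

    def predict(self, text: str) -> str:
        return " ".join(w for w in text.split() if w not in self.drop)


def accuracy(model: ToyModel, pairs: List[Pair]) -> float:
    # Exact-match accuracy; the metric is an assumption.
    return sum(model.predict(s) == t for s, t in pairs) / len(pairs)


def update_cycle(current: ToyModel, updated: ToyModel, log_texts: List[str],
                 sixth: List[Pair], seventh: List[Pair]) -> ToyModel:
    pseudo = [(t, current.predict(t)) for t in log_texts]  # pseudo-label the log text
    transition = copy.deepcopy(updated).fit(pseudo)        # updated model -> transition model
    hot_standby = transition.fit(sixth)                    # transition model -> hot standby model
    if accuracy(current, seventh) < accuracy(hot_standby, seventh):
        return hot_standby                                 # swap in the hot standby model
    return current                                         # otherwise keep the current model


current = ToyModel({"um", "uh"})   # deployed text post-processing model
updated = ToyModel({"um"})         # updated model from Embodiment 1
chosen = update_cycle(current, updated,
                      log_texts=["um hi uh all"],
                      sixth=[("er fine", "fine")],
                      seventh=[("er um yes", "yes")])
print(chosen is current, chosen.predict("er um yes"))
```

Working on a copy of the updated model leaves the original available for the next cycle, and the deployed model is replaced only when the hot standby model scores strictly higher.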

Referring to FIG. 4, a flowchart of the steps of the data processing method of Embodiment 4 of the present application is shown. An application scenario of this embodiment may be: converting instant messaging voice data in an instant messaging application into text, and post-processing the converted text data.

Specifically, the data processing method provided by this embodiment includes the following steps:

Step 401: receive an instruction, entered through the interface of an instant messaging application, that indicates the input voice data is to be converted into text data.

Step 402: convert the voice data into text according to the instruction to obtain the text data to be processed.

Specifically, automatic speech recognition may be used to convert the input voice data into text, thereby obtaining the text data to be processed.

Step 403: input the text data to be processed into the text post-processing model, and obtain the processed text data output by the text post-processing model.

The text post-processing model may be obtained by the data processing method of Embodiment 1 above, which is not repeated here.

In this embodiment, after the voice data is converted into text to obtain the text data to be processed, that text data is input into the text post-processing model obtained by the data processing method of Embodiment 1, and the processed text data output by the text post-processing model is obtained.
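The speech-to-text step followed by post-processing can be sketched as a two-stage function. Both `fake_asr` and `toy_post_process` are placeholders invented for this example: a real system would call an actual speech recognizer and the trained text post-processing model.

```python
from typing import Callable


def handle_voice_to_text(audio: bytes,
                         asr: Callable[[bytes], str],
                         post_process: Callable[[str], str]) -> str:
    """Convert voice data to text, then post-process the text."""
    raw_text = asr(audio)          # speech recognition yields the text to be processed
    return post_process(raw_text)  # the post-processing model yields the processed text


def fake_asr(audio: bytes) -> str:
    # Placeholder: pretends the audio bytes already contain the transcript.
    return audio.decode("utf-8")


def toy_post_process(text: str) -> str:
    # Placeholder for the trained model: strips a fixed set of filler words.
    return " ".join(w for w in text.split() if w not in {"um", "uh"})


result = handle_voice_to_text(b"um see you uh tomorrow", fake_asr, toy_post_process)
print(result)  # see you tomorrow
```

Passing the recognizer and the post-processing model in as callables keeps the flow the same whether the voice data comes from an instant messaging application or from another input device.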

In the data processing of Embodiment 1 above, the pre-trained model obtained with the first text data training samples and their corresponding labels is not used as the final text post-processing model. Instead, the pre-trained model is updated based on the unlabeled text data and its corresponding pseudo-labels to obtain an updated model, and the updated model is then updated again with the second text data training samples and their corresponding labels to obtain the trained text post-processing model. In other words, compared with a conventional model training process, the data processing method of Embodiment 1 applies two further updates to the pre-trained model before the trained text post-processing model is obtained. The text post-processing model obtained by the data processing method of Embodiment 1 is therefore more accurate.

Consequently, inputting the text data to be processed into the text post-processing model obtained by the data processing method of Embodiment 1 yields processed text data of higher accuracy.

Referring to FIG. 5, a flowchart of the steps of the data processing method of Embodiment 5 of the present application is shown. An application scenario of this embodiment may be: converting voice data input through an all-in-one device into text, and post-processing the converted text data.

Specifically, the data processing method provided by this embodiment includes the following steps:

Step 501: receive an instruction, entered through the all-in-one device, that indicates the input voice data is to be converted into text data.

Step 502: convert the voice data into text according to the instruction to obtain the text data to be processed.

Specifically, automatic speech recognition may be used to convert the input voice data into text, thereby obtaining the text data to be processed.

步骤503,将待处理文本数据输入文本后处理模型,获取文本后处理模型输出的处理后文本数据。Step 503: Input the text data to be processed into the text post-processing model, and obtain the processed text data output by the text post-processing model.

其中,文本后处理模型可以为基于上述实施例一的数据处理方法得到的,在此不再赘述。The text post-processing model may be obtained based on the data processing method in the first embodiment, and details are not described herein again.

本申请实施例中,在对语音数据进行文本转换,得到待处理文本数据之后,将待处理文本数据输入至基于上述实施例一的数据处理方法得到的文本后处理模型,进而得到文本后处理模型输出的处理后文本数据。In the embodiment of the present application, after text conversion is performed on the speech data to obtain the text data to be processed, the text data to be processed is input into the text post-processing model obtained based on the data processing method of the first embodiment, and then the text post-processing model is obtained. The output processed text data.

In Embodiment 1, the pre-trained model obtained from the first text data training samples and their corresponding labels is not used directly as the final text post-processing model. Instead, the pre-trained model is first updated using unlabeled text data and the corresponding pseudo-labels to obtain an updated model, and the updated model is then updated again using the second text data training samples and their corresponding labels to obtain the trained text post-processing model. In other words, compared with a conventional model training process, the data processing method of Embodiment 1 applies two further updates to the pre-trained model before the trained text post-processing model is obtained. The text post-processing model obtained by the data processing method of Embodiment 1 therefore has higher accuracy.

Consequently, inputting the text data to be processed into the text post-processing model obtained by the data processing method of Embodiment 1 yields processed text data with higher accuracy.
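Steps 501 to 503 can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in introduced only for illustration: `toy_asr` merely decodes bytes where a real system would run a speech recognizer, and `toy_post_process` drops two filler words where the application would call the trained text post-processing model of Embodiment 1.

```python
from dataclasses import dataclass
from typing import Callable


def toy_asr(audio: bytes) -> str:
    """Hypothetical ASR stand-in: treats the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")


def toy_post_process(text: str) -> str:
    """Hypothetical post-processing stand-in: removes two filler words."""
    fillers = {"um", "uh"}
    return " ".join(w for w in text.split() if w.lower() not in fillers)


@dataclass
class ConvertInstruction:
    """Step 501: instruction asking to convert the input voice data to text."""
    audio: bytes


def handle_instruction(
    instruction: ConvertInstruction,
    asr: Callable[[bytes], str] = toy_asr,
    post_process: Callable[[str], str] = toy_post_process,
) -> str:
    # Step 502: convert the voice data into the text data to be processed.
    text_to_process = asr(instruction.audio)
    # Step 503: feed that text to the post-processing model and return its output.
    return post_process(text_to_process)


result = handle_instruction(ConvertInstruction(audio=b"um I want uh two tickets"))
print(result)
```

Passing the recognizer and the post-processing model as parameters keeps the three steps visible while leaving both components replaceable.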

Referring to FIG. 6, a flowchart of the steps of the data processing method according to Embodiment 6 of the present application is shown. An application scenario of this embodiment may be: a client in a public cloud uploads voice data to a cloud server, and the cloud server converts the voice data into text and post-processes the resulting text data.

Specifically, the data processing method provided in this embodiment of the present application includes the following steps:

Step 601: Receive the voice data uploaded by the public cloud client.

Step 602: Perform text conversion on the voice data to obtain the text data to be processed.

Specifically, automatic speech recognition technology may be used to convert the received voice data into text, thereby obtaining the text data to be processed.

Step 603: Input the text data to be processed into the text post-processing model, and obtain the processed text data output by the text post-processing model.

The text post-processing model may be obtained by the data processing method of Embodiment 1, which is not described again here.

Further, after the cloud server obtains the processed text data, it may also return the processed text data to the public cloud client.

In this embodiment of the present application, after converting the voice data into text to obtain the text data to be processed, the cloud server inputs the text data to be processed into the text post-processing model obtained by the data processing method of Embodiment 1, and thereby obtains the processed text data output by the text post-processing model.

In Embodiment 1, the pre-trained model obtained from the first text data training samples and their corresponding labels is not used directly as the final text post-processing model. Instead, the pre-trained model is first updated using unlabeled text data and the corresponding pseudo-labels to obtain an updated model, and the updated model is then updated again using the second text data training samples and their corresponding labels to obtain the trained text post-processing model. In other words, compared with a conventional model training process, the data processing method of Embodiment 1 applies two further updates to the pre-trained model before the trained text post-processing model is obtained. The text post-processing model obtained by the data processing method of Embodiment 1 therefore has higher accuracy.

Referring to FIG. 7, a flowchart of the steps of the data processing method according to Embodiment 7 of the present application is shown. The application scenario of this embodiment may again be a public cloud scenario. Specifically: a client in the public cloud first converts the received voice data into text and then uploads the resulting text data to the cloud server, and the cloud server post-processes the uploaded text data.

Specifically, the data processing method provided in this embodiment of the present application includes the following steps:

Step 701: Receive the text data to be processed uploaded by the public cloud client.

The text data to be processed is obtained by the public cloud client performing text conversion on the received voice data. Specifically, automatic speech recognition technology may be used to convert the received voice data into text, thereby obtaining the text data to be processed.

Step 702: Input the text data to be processed into the text post-processing model, and obtain the processed text data output by the text post-processing model.

The text post-processing model may be obtained by the data processing method of Embodiment 1, which is not described again here.

Further, after the cloud server obtains the processed text data, it may also return the processed text data to the public cloud client.

In this embodiment of the present application, after receiving the text data to be processed, the cloud server inputs it into the text post-processing model obtained by the data processing method of Embodiment 1, and thereby obtains the processed text data output by the text post-processing model.

In Embodiment 1, the pre-trained model obtained from the first text data training samples and their corresponding labels is not used directly as the final text post-processing model. Instead, the pre-trained model is first updated using unlabeled text data and the corresponding pseudo-labels to obtain an updated model, and the updated model is then updated again using the second text data training samples and their corresponding labels to obtain the trained text post-processing model. In other words, compared with a conventional model training process, the data processing method of Embodiment 1 applies two further updates to the pre-trained model before the trained text post-processing model is obtained. The text post-processing model obtained by the data processing method of Embodiment 1 therefore has higher accuracy.
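The difference between the two public cloud scenarios (Embodiment 6, where the server receives voice data, and Embodiment 7, where the client has already performed text conversion) can be sketched in Python as two entry points on one server object. The class name `CloudServer`, its method names, and both toy functions are assumptions made for illustration; the application does not define any such interface.

```python
from typing import Callable


def toy_asr(audio: bytes) -> str:
    """Hypothetical ASR stand-in: treats the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")


def toy_post_process(text: str) -> str:
    """Hypothetical post-processing stand-in: collapses immediately repeated words."""
    out = []
    for word in text.split():
        if not out or out[-1].lower() != word.lower():
            out.append(word)
    return " ".join(out)


class CloudServer:
    """Illustrative server holding an ASR component and a post-processing model."""

    def __init__(self, asr: Callable[[bytes], str], post_process: Callable[[str], str]):
        self.asr = asr
        self.post_process = post_process

    def handle_voice_upload(self, audio: bytes) -> str:
        # Embodiment 6: steps 601 to 603, then return the result to the client.
        return self.post_process(self.asr(audio))

    def handle_text_upload(self, text_to_process: str) -> str:
        # Embodiment 7: steps 701 and 702; the server performs no speech recognition.
        return self.post_process(text_to_process)


server = CloudServer(toy_asr, toy_post_process)
audio = b"I I want to to book a room"

from_voice = server.handle_voice_upload(audio)  # server-side conversion
client_side_text = toy_asr(audio)               # client-side conversion
from_text = server.handle_text_upload(client_side_text)
print(from_voice, "|", from_text)
```

With the same recognizer on either side, both paths produce the same processed text; the two embodiments differ only in where the text conversion runs.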

Referring to FIG. 8, a schematic structural diagram of a data processing apparatus according to Embodiment 8 of the present application is shown.

The data processing apparatus provided in this embodiment of the present application includes:

a model pre-training module 801, configured to pre-train the text post-processing model using the first text data training samples and the labels corresponding to the first text data training samples, to obtain a pre-trained model;

an unlabeled text data and pseudo-label acquisition module 802, configured to acquire unlabeled text data and the pseudo-labels of the unlabeled text data obtained by processing the unlabeled text data;

a first training update module 803, configured to train and update the pre-trained model based on the unlabeled text data and the pseudo-labels, to obtain an updated model;

a second training update module 804, configured to train and update the updated model based on the second text data training samples and the labels corresponding to the second text data training samples, to obtain the trained text post-processing model.
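The order in which modules 801, 803 and 804 act on the model can be sketched in Python with a deliberately tiny stand-in. The class `ToyPostProcessor` is hypothetical and is not the model of the application: it only counts, per word, how often the word was deleted or kept between a sample and its label, and "training" simply means adding more counts. The pseudo-labelled pairs are taken as given here, as if module 802 had already produced them.

```python
from collections import Counter
from difflib import SequenceMatcher


class ToyPostProcessor:
    """Hypothetical stand-in: deletes words seen deleted more often than kept."""

    def __init__(self):
        self.deleted = Counter()
        self.kept = Counter()

    def update(self, pairs):
        """One training pass over (sample text, label text) pairs."""
        for sample, label in pairs:
            src, tgt = sample.lower().split(), label.lower().split()
            matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
            for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
                if tag == "equal":
                    self.kept.update(src[i1:i2])
                elif tag in ("delete", "replace"):
                    self.deleted.update(src[i1:i2])
        return self

    def __call__(self, text):
        return " ".join(
            w for w in text.split() if self.deleted[w.lower()] <= self.kept[w.lower()]
        )


def train_three_stages(first_labelled, pseudo_labelled, second_labelled):
    model = ToyPostProcessor()
    model.update(first_labelled)   # module 801: pre-training -> pre-trained model
    model.update(pseudo_labelled)  # module 803: first update -> updated model
    model.update(second_labelled)  # module 804: second update -> trained model
    return model


model = train_three_stages(
    first_labelled=[("um hello there", "hello there")],
    pseudo_labelled=[("uh see you soon", "see you soon")],
    second_labelled=[("well um thanks", "well thanks")],
)
print(model("um uh hello"))
```

The toy model is updated in place, so the pre-trained model, the updated model and the trained model are successive states of one object; a real implementation would more likely checkpoint each stage separately.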

Optionally, the apparatus of this embodiment of the present application further includes:

a standard text data and disfluent text data acquisition module, configured to acquire standard text data from a standard text database and generate the corresponding disfluent text data according to preset rules, and to use the disfluent text data as the first text data training samples and the standard text data as the labels corresponding to the first text data training samples.
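A minimal Python sketch of how such a module might build the first training set follows. The application does not state what the preset rules are, so the two rules used here (inserting a filler word before a word, and repeating a word) are assumptions, as are the filler list, the probabilities and all function names.

```python
import random

FILLERS = ("um", "uh")  # hypothetical filler vocabulary


def make_disfluent(standard_text, rng, filler_prob=0.3, repeat_prob=0.2):
    """Apply two assumed rules: insert a filler before a word, or repeat a word."""
    out = []
    for word in standard_text.split():
        if rng.random() < filler_prob:
            out.append(rng.choice(FILLERS))
        out.append(word)
        if rng.random() < repeat_prob:
            out.append(word)
    return " ".join(out)


def build_first_training_set(standard_texts, seed=0):
    """Return (sample, label) pairs: disfluent text as sample, standard text as label."""
    rng = random.Random(seed)
    return [(make_disfluent(text, rng), text) for text in standard_texts]


pairs = build_first_training_set(["please send the report today"])
for sample, label in pairs:
    print(repr(sample), "->", repr(label))
```

Because the label is the untouched standard text, no manual annotation is needed for this stage; the quality of the pairs depends entirely on how well the rules imitate real spoken disfluencies.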

Optionally, the unlabeled text data and pseudo-label acquisition module 802 includes:

an unlabeled text data unit, configured to acquire unlabeled text data;

a pseudo-label obtaining unit, configured to perform label prediction on the unlabeled text data using a label prediction model, to obtain the pseudo-labels of the unlabeled text data.

Optionally, the unlabeled text data unit is specifically configured to:

acquire voice data to be recognized;

recognize the voice data to be recognized using automatic speech recognition technology, to obtain the unlabeled text data.

Optionally, the pseudo-label obtaining unit is specifically configured to:

train and update the pre-trained model based on third text data training samples and the labels corresponding to the third text data training samples, to obtain the label prediction model;

perform label prediction on the unlabeled text data using the label prediction model, to obtain the pseudo-labels of the unlabeled text data.

Optionally, the pseudo-label obtaining unit is specifically configured to:

acquire a pre-built initial label prediction model, wherein the label prediction model contains more network layers than the text post-processing model, and/or the dimension of each network layer in the label prediction model is larger than the dimension of each network layer in the text post-processing model;

pre-train the initial label prediction model using fourth text data training samples and the labels corresponding to the fourth text data training samples, to obtain a pre-trained label prediction model;

train and update the pre-trained label prediction model based on fifth text data training samples and the labels corresponding to the fifth text data training samples, to obtain the trained label prediction model;

perform label prediction on the unlabeled text data using the trained label prediction model, to obtain the pseudo-labels of the unlabeled text data.
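The relationship between the larger label prediction model and the text post-processing model, and the pseudo-labelling step itself, can be sketched in Python as follows. The layer counts and dimensions are made-up values, `ModelConfig` and the function names are hypothetical, and `toy_label_model` is a trivial stand-in for a trained label prediction model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    num_layers: int   # number of network layers
    hidden_size: int  # dimension of each network layer


def is_larger(label_model: ModelConfig, post_model: ModelConfig) -> bool:
    """The 'and/or' condition: more layers, larger layer dimension, or both."""
    return (
        label_model.num_layers > post_model.num_layers
        or label_model.hidden_size > post_model.hidden_size
    )


def toy_label_model(text: str) -> str:
    """Hypothetical stand-in for the trained label prediction model."""
    fillers = {"um", "uh"}
    return " ".join(w for w in text.split() if w.lower() not in fillers)


def pseudo_label(label_model, unlabeled_texts):
    """Pair each unlabeled text with the label predicted for it."""
    return [(text, label_model(text)) for text in unlabeled_texts]


post_processing_cfg = ModelConfig(num_layers=4, hidden_size=256)    # made-up sizes
label_prediction_cfg = ModelConfig(num_layers=12, hidden_size=768)  # made-up sizes

size_ok = is_larger(label_prediction_cfg, post_processing_cfg)
pseudo_pairs = pseudo_label(toy_label_model, ["uh call me later", "see you um tomorrow"])
print(size_ok, pseudo_pairs)
```

The resulting pairs have the same shape as manually labelled samples, which is what allows them to be used for the first training update of the pre-trained model.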

The data processing apparatus of this embodiment of the present application is used to implement the corresponding data processing method of Embodiment 1 and has the beneficial effects of the corresponding method embodiment, which are not repeated here. In addition, for the functional implementation of each module in the data processing apparatus of this embodiment, reference may be made to the description of the corresponding parts of method Embodiment 1, which is likewise not repeated here.

Referring to FIG. 9, a schematic structural diagram of a data processing apparatus according to Embodiment 9 of the present application is shown.

The data processing apparatus provided in this embodiment of the present application includes:

a to-be-processed text data acquisition module 901, configured to acquire text data to be processed;

a first processed text data acquisition module 902, configured to input the text data to be processed into the text post-processing model and obtain the processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the data processing method of Embodiment 1.

Optionally, the to-be-processed text data acquisition module 901 is specifically configured to: acquire text data flowing back from online logs as the text data to be processed;

The apparatus of this embodiment of the present application further includes:

a transition model obtaining module, configured to, after the processed text data output by the text post-processing model is obtained, train and update the updated model of Embodiment 1 based on the text data to be processed and the processed text data, to obtain a transition model;

a hot standby model obtaining module, configured to train and update the transition model based on sixth text data training samples and the labels corresponding to the sixth text data training samples, to obtain a hot standby model;

an accuracy calculation module, configured to calculate the accuracy of the hot standby model and the accuracy of the text post-processing model respectively;

a text post-processing model update module, configured to, when the accuracy of the text post-processing model is lower than the accuracy of the hot standby model, use the hot standby model as the new text post-processing model for the next text post-processing operation.

Optionally, the apparatus of this embodiment of the present application further includes:

a text post-processing model retention module, configured to, when the accuracy of the hot standby model is lower than the accuracy of the text post-processing model, use the text post-processing model for the next text post-processing operation.

Optionally, the accuracy calculation module is specifically configured to:

acquire seventh text data training samples and the labels corresponding to the seventh text data training samples;

calculate the accuracy of the hot standby model and the accuracy of the text post-processing model respectively, based on the seventh text data training samples and the labels corresponding to the seventh text data training samples.
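The comparison performed by the accuracy calculation, update and retention modules can be sketched in Python as below. Two points are assumptions rather than statements of the application: accuracy is measured here as the fraction of samples whose output matches the label exactly, and on a tie the current model is kept (the text only covers the two strict inequalities). The toy models and all names are hypothetical.

```python
def accuracy(model, eval_pairs):
    """Assumed metric: fraction of samples whose output equals the label exactly."""
    if not eval_pairs:
        raise ValueError("eval_pairs must not be empty")
    correct = sum(1 for text, label in eval_pairs if model(text) == label)
    return correct / len(eval_pairs)


def select_model(current_model, hot_standby_model, eval_pairs):
    """Return the model to use for the next text post-processing operation."""
    if accuracy(current_model, eval_pairs) < accuracy(hot_standby_model, eval_pairs):
        return hot_standby_model  # update module: promote the hot standby model
    return current_model          # retention module (tie handling is an assumption)


def make_filler_filter(fillers):
    """Build a hypothetical toy model that removes the given filler words."""
    def model(text):
        return " ".join(w for w in text.split() if w not in fillers)
    return model


current = make_filler_filter({"um"})            # toy current post-processing model
hot_standby = make_filler_filter({"um", "uh"})  # toy hot standby model

# Stand-in for the seventh text data training samples and their labels.
seventh_pairs = [("um hi", "hi"), ("uh bye", "bye")]

chosen = select_model(current, hot_standby, seventh_pairs)
print(accuracy(current, seventh_pairs), accuracy(hot_standby, seventh_pairs))
```

Evaluating both models on the same held-out pairs is what makes the two accuracy values comparable; a different sample set per model would not support the swap decision.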

The data processing apparatus of this embodiment of the present application is used to implement the corresponding data processing method of method Embodiment 2 or Embodiment 3 and has the beneficial effects of the corresponding method embodiment, which are not repeated here. In addition, for the functional implementation of each module in the data processing apparatus of this embodiment, reference may be made to the description of the corresponding parts of method Embodiment 2 or Embodiment 3, which is likewise not repeated here.

Referring to FIG. 10, a schematic structural diagram of a data processing apparatus according to Embodiment 10 of the present application is shown.

The data processing apparatus provided in this embodiment of the present application includes:

a first instruction receiving module 1001, configured to receive an instruction, input through the interface of an instant messaging application, that instructs conversion of the input voice data into text data;

a first text conversion module 1002, configured to perform text conversion on the voice data according to the instruction, to obtain the text data to be processed;

a second processed text data acquisition module 1003, configured to input the text data to be processed into the text post-processing model and obtain the processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the data processing method of Embodiment 1.

The data processing apparatus of this embodiment of the present application is used to implement the corresponding data processing method of method Embodiment 4 and has the beneficial effects of the corresponding method embodiment, which are not repeated here. In addition, for the functional implementation of each module in the data processing apparatus of this embodiment, reference may be made to the description of the corresponding parts of method Embodiment 4, which is likewise not repeated here.

Referring to FIG. 11, a schematic structural diagram of a data processing apparatus according to Embodiment 11 of the present application is shown.

The data processing apparatus provided in this embodiment of the present application includes:

a second instruction receiving module 1101, configured to receive an instruction, input through the all-in-one device, that instructs conversion of the input voice data into text data;

a second text conversion module 1102, configured to perform text conversion on the voice data according to the instruction, to obtain the text data to be processed;

a third processed text data acquisition module 1103, configured to input the text data to be processed into the text post-processing model and obtain the processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the data processing method of Embodiment 1.

The data processing apparatus of this embodiment of the present application is used to implement the corresponding data processing method of method Embodiment 5 and has the beneficial effects of the corresponding method embodiment, which are not repeated here. In addition, for the functional implementation of each module in the data processing apparatus of this embodiment, reference may be made to the description of the corresponding parts of method Embodiment 5, which is likewise not repeated here.

Referring to FIG. 12, a schematic structural diagram of a data processing apparatus according to Embodiment 12 of the present application is shown.

The data processing apparatus provided in this embodiment of the present application includes:

a voice data receiving module 1201, configured to receive the voice data uploaded by the public cloud client;

a third text conversion module 1202, configured to perform text conversion on the voice data, to obtain the text data to be processed;

a fourth processed text data acquisition module 1203, configured to input the text data to be processed into the text post-processing model and obtain the processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the data processing method of Embodiment 1.

Optionally, the apparatus of this embodiment of the present application may further include:

a first processed text data return module, configured to return the processed text data to the public cloud client.

The data processing apparatus of this embodiment of the present application is used to implement the corresponding data processing method of method Embodiment 6 and has the beneficial effects of the corresponding method embodiment, which are not repeated here. In addition, for the functional implementation of each module in the data processing apparatus of this embodiment, reference may be made to the description of the corresponding parts of method Embodiment 6, which is likewise not repeated here.

Referring to FIG. 13, a schematic structural diagram of a data processing apparatus according to Embodiment 13 of the present application is shown.

The data processing apparatus provided in this embodiment of the present application includes:

a to-be-processed text data receiving module 1301, configured to receive the text data to be processed uploaded by the public cloud client, wherein the text data to be processed is obtained by the public cloud client performing text conversion on the received voice data;

a fifth processed text data acquisition module 1302, configured to input the text data to be processed into the text post-processing model and obtain the processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the data processing method of Embodiment 1.

Optionally, the apparatus of this embodiment of the present application may further include:

a second processed text data return module, configured to return the processed text data to the public cloud client.

The data processing apparatus of this embodiment of the present application is used to implement the corresponding data processing method of method Embodiment 7 and has the beneficial effects of the corresponding method embodiment, which are not repeated here. In addition, for the functional implementation of each module in the data processing apparatus of this embodiment, reference may be made to the description of the corresponding parts of method Embodiment 7, which is likewise not repeated here.

FIG. 14 is a schematic structural diagram of an electronic device according to Embodiment 14 of the present application. The electronic device may include:

one or more processors 1401;

a computer-readable medium 1402, which may be configured to store one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the data processing method of any one of Embodiment 1 to Embodiment 7.

FIG. 15 shows the hardware structure of an electronic device according to Embodiment 15 of the present application. As shown in FIG. 15, the hardware structure of the electronic device may include: a processor 1501, a communication interface 1502, a computer-readable medium 1503, and a communication bus 1504;

wherein the processor 1501, the communication interface 1502, and the computer-readable medium 1503 communicate with one another through the communication bus 1504;

Optionally, the communication interface 1502 may be an interface of a communication module, such as an interface of a GSM module;

The processor 1501 may be specifically configured to: pre-train the text post-processing model using the first text data training samples and the labels corresponding to the first text data training samples, to obtain a pre-trained model; acquire unlabeled text data and the pseudo-labels of the unlabeled text data obtained by processing the unlabeled text data; train and update the pre-trained model based on the unlabeled text data and the pseudo-labels, to obtain an updated model; and train and update the updated model based on the second text data training samples and the labels corresponding to the second text data training samples, to obtain the trained text post-processing model.

Alternatively, the processor 1501 may be configured to: acquire text data to be processed; input the text data to be processed into the text post-processing model, and obtain the processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of Embodiment 1.

Alternatively, the processor 1501 may be configured to: receive an instruction, input through the interface of an instant messaging application, that instructs conversion of the input voice data into text data; perform text conversion on the voice data according to the instruction, to obtain the text data to be processed; input the text data to be processed into the text post-processing model, and obtain the processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of Embodiment 1.

Alternatively, the processor 1501 may be configured to: receive an instruction, input through the all-in-one device, that instructs conversion of the input voice data into text data; perform text conversion on the voice data according to the instruction, to obtain the text data to be processed; input the text data to be processed into the text post-processing model, and obtain the processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of Embodiment 1.

Alternatively, the processor 1501 may be configured to: receive the voice data uploaded by the public cloud client; perform text conversion on the voice data, to obtain the text data to be processed; input the text data to be processed into the text post-processing model, and obtain the processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of Embodiment 1.

Alternatively, the processor 1501 may be configured to: receive the text data to be processed uploaded by the public cloud client, wherein the text data to be processed is obtained by the public cloud client performing text conversion on the received voice data; input the text data to be processed into the text post-processing model, and obtain the processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of Embodiment 1.

The processor 701 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

The computer-readable medium 703 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), and the like.

In particular, according to embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code configured to execute the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via a communication portion, and/or installed from a removable medium. When the computer program is executed by a central processing unit (CPU), the above-mentioned functions defined in the method of the present application are performed.

It should be noted that the computer-readable medium of the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transport a program configured for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical cable, RF, and the like, or any suitable combination of the foregoing.

Computer program code configured to perform the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions configured to implement the specified logical function. The specific embodiments above describe particular orderings, but these orderings are only exemplary; in a specific implementation there may be fewer or more steps, or the order of execution may be adjusted. That is, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The modules described in the embodiments of the present application may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, it may be described as: a processor including a model pre-training module, an unlabeled text data and pseudo-label acquisition module, a first training update module, and a second training update module. The names of these modules do not, in some cases, limit the modules themselves. For example, the model pre-training module may also be described as "a module that pre-trains a text post-processing model using first text data training samples and labels corresponding to the first text data training samples, to obtain a pre-trained model".
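The four modules named above correspond to four training stages: pre-training on labeled samples, obtaining pseudo-labels for unlabeled text, updating the pre-trained model with the pseudo-labeled text, and a final update on a second labeled set. The sketch below only illustrates the order of these stages and is not the training procedure of this application. The "model" is a toy lookup table, and `train`, `predict`, `build_post_processing_model` and `label_predictor` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]   # (input text, label text)
Model = Dict[str, str]   # toy "model": maps an input text to an output text


def train(model: Model, pairs: Sequence[Pair]) -> Model:
    """Toy training step: return a copy of the model updated with the pairs."""
    updated = dict(model)
    updated.update(pairs)
    return updated


def predict(model: Model, text: str) -> str:
    """Return the stored output, or the input unchanged if it was never seen."""
    return model.get(text, text)


def build_post_processing_model(
    first_samples: Sequence[Pair],
    unlabeled_texts: Sequence[str],
    label_predictor: Callable[[str], str],
    second_samples: Sequence[Pair],
) -> Model:
    # Stage 1: pre-train on the first labeled samples.
    pretrained = train({}, first_samples)
    # Stage 2: obtain pseudo-labels for the unlabeled text.
    pseudo_labeled: List[Pair] = [(t, label_predictor(t)) for t in unlabeled_texts]
    # Stage 3: update the pre-trained model with the pseudo-labeled text.
    updated = train(pretrained, pseudo_labeled)
    # Stage 4: update again with the second labeled samples.
    return train(updated, second_samples)
```

In this toy version a later stage simply overwrites earlier entries, which mirrors the idea that the final labeled samples have the last word; a real model would instead continue gradient-based training from the previous stage's parameters.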

In another aspect, the present application further provides a computer-readable medium storing a computer program which, when executed by a processor, implements the data processing method described in any one of Embodiments 1 to 7 above.

In another aspect, the present application further provides a computer program comprising computer-executable instructions which, when executed, implement the data processing method described in any one of Embodiments 1 to 7 above. In the embodiments of the present application, the computer program may include an app, and may also include a mini program or the like.

The expressions "first" and "second" as used in various embodiments of the present disclosure may modify various components regardless of order and/or importance, but these expressions do not limit the corresponding components. The expressions are used only to distinguish one element from another. For example, a first user equipment and a second user equipment represent different user equipments, although both are user equipments. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure.

When an element (e.g., a first element) is referred to as being "(operatively or communicatively) coupled with", "(operatively or communicatively) coupled to", or "connected to" another element (e.g., a second element), it should be understood that the one element is directly connected to the other element, or that the one element is indirectly connected to the other element via yet another element (e.g., a third element). In contrast, when an element (e.g., a first element) is referred to as being "directly connected" or "directly coupled" to another element (e.g., a second element), no element (e.g., a third element) is interposed between the two.

The above description is only a preferred embodiment of the present application and an illustration of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in this application is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) this application.

Claims (27)

1. A data processing method, the method comprising:
pre-training a text post-processing model using first text data training samples and labels corresponding to the first text data training samples, to obtain a pre-trained model;
acquiring unlabeled text data and pseudo-labels of the unlabeled text data obtained by processing the unlabeled text data;
training and updating the pre-trained model based on the unlabeled text data and the pseudo-labels, to obtain an updated model; and
training and updating the updated model based on second text data training samples and labels corresponding to the second text data training samples, to obtain a trained text post-processing model.

2. The method according to claim 1, wherein, before the pre-training of the text post-processing model using the first text data training samples and the labels corresponding to the first text data training samples to obtain the pre-trained model, the method further comprises:
acquiring standard text data from a standard text database, and generating corresponding non-smooth text data using preset rules; and
using the non-smooth text data as the first text data training samples, and using the standard text data as the labels corresponding to the first text data training samples.

3. The method according to claim 1 or 2, wherein the acquiring of unlabeled text data and pseudo-labels of the unlabeled text data obtained by processing the unlabeled text data comprises:
acquiring unlabeled text data; and
performing label prediction on the unlabeled text data using a label prediction model, to obtain the pseudo-labels of the unlabeled text data.

4. The method according to claim 3, wherein the acquiring of unlabeled text data comprises:
acquiring speech data to be recognized; and
recognizing the speech data to be recognized using automatic speech recognition technology, to obtain the unlabeled text data.

5. The method according to claim 3, wherein the performing of label prediction on the unlabeled text data using a label prediction model to obtain the pseudo-labels of the unlabeled text data comprises:
training and updating the pre-trained model based on third text data training samples and labels corresponding to the third text data training samples, to obtain the label prediction model; and
performing label prediction on the unlabeled text data using the label prediction model, to obtain the pseudo-labels of the unlabeled text data.

6. The method according to claim 3, wherein the performing of label prediction on the unlabeled text data using a label prediction model to obtain the pseudo-labels of the unlabeled text data comprises:
acquiring a pre-built initial label prediction model, wherein the label prediction model contains more network layers than the text post-processing model, and/or the dimension of each network layer in the label prediction model is greater than the dimension of each network layer in the text post-processing model;
pre-training the initial label prediction model using fourth text data training samples and labels corresponding to the fourth text data training samples, to obtain a pre-trained label prediction model;
training and updating the pre-trained label prediction model based on fifth text data training samples and labels corresponding to the fifth text data training samples, to obtain a trained label prediction model; and
performing label prediction on the unlabeled text data using the trained label prediction model, to obtain the pseudo-labels of the unlabeled text data.

7. A data processing method, the method comprising:
acquiring text data to be processed; and
inputting the text data to be processed into a text post-processing model, and obtaining processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of any one of claims 1-6.

8. The method according to claim 7, wherein the acquiring of the text data to be processed comprises:
acquiring text data returned from online logs as the text data to be processed;
and wherein, after the obtaining of the processed text data output by the text post-processing model, the method further comprises:
training and updating the updated model of any one of claims 1-6 based on the text data to be processed and the processed text data, to obtain a transition model;
training and updating the transition model based on sixth text data training samples and labels corresponding to the sixth text data training samples, to obtain a hot-standby model;
calculating the accuracy of the hot-standby model and the accuracy of the text post-processing model respectively; and
when the accuracy of the text post-processing model is lower than the accuracy of the hot-standby model, using the hot-standby model as a new text post-processing model for the next text post-processing operation.

9. The method according to claim 8, wherein, after the calculating of the accuracy of the hot-standby model and the accuracy of the text post-processing model respectively, the method further comprises:
when the accuracy of the hot-standby model is lower than the accuracy of the text post-processing model, using the text post-processing model for the next text post-processing operation.

10. The method according to claim 8 or 9, wherein the calculating of the accuracy of the hot-standby model and the accuracy of the text post-processing model respectively comprises:
acquiring seventh text data training samples and labels corresponding to the seventh text data training samples; and
calculating the accuracy of the hot-standby model and the accuracy of the text post-processing model respectively, based on the seventh text data training samples and the labels corresponding to the seventh text data training samples.

11. A data processing method, the method comprising:
receiving an instruction, input through an interface of an instant messaging application, that instructs conversion of input voice data into text data;
performing text conversion on the voice data according to the instruction, to obtain text data to be processed; and
inputting the text data to be processed into a text post-processing model, and obtaining processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of any one of claims 1-6.

12. A data processing method, the method comprising:
receiving an instruction, input and set through an all-in-one device, that instructs conversion of input voice data into text data;
performing text conversion on the voice data according to the instruction, to obtain text data to be processed; and
inputting the text data to be processed into a text post-processing model, and obtaining processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of any one of claims 1-6.

13. A data processing method, the method comprising:
receiving voice data uploaded by a public cloud client;
performing text conversion on the voice data, to obtain text data to be processed; and
inputting the text data to be processed into a text post-processing model, and obtaining processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of any one of claims 1-6.

14. The method according to claim 13, the method further comprising:
returning the processed text data to the public cloud client.

15. A data processing method, the method comprising:
receiving text data to be processed that is uploaded by a public cloud client, wherein the text data to be processed is obtained by the public cloud client performing text conversion on received voice data; and
inputting the text data to be processed into a text post-processing model, and obtaining processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of any one of claims 1-6.

16. The method according to claim 15, the method further comprising:
returning the processed text data to the public cloud client.

17. A data processing apparatus, the apparatus comprising:
a model pre-training module, configured to pre-train a text post-processing model using first text data training samples and labels corresponding to the first text data training samples, to obtain a pre-trained model;
an unlabeled text data and pseudo-label acquisition module, configured to acquire unlabeled text data and pseudo-labels of the unlabeled text data obtained by processing the unlabeled text data;
a first training update module, configured to train and update the pre-trained model based on the unlabeled text data and the pseudo-labels, to obtain an updated model; and
a second training update module, configured to train and update the updated model based on second text data training samples and labels corresponding to the second text data training samples, to obtain a trained text post-processing model.

18. A data processing apparatus, the apparatus comprising:
a to-be-processed text data acquisition module, configured to acquire text data to be processed; and
a first processed text data acquisition module, configured to input the text data to be processed into a text post-processing model and obtain processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of any one of claims 1-6.

19. A data processing apparatus, the apparatus comprising:
a first instruction receiving module, configured to receive an instruction, input through an interface of an instant messaging application, that instructs conversion of input voice data into text data;
a first text conversion module, configured to perform text conversion on the voice data according to the instruction, to obtain text data to be processed; and
a second processed text data acquisition module, configured to input the text data to be processed into a text post-processing model and obtain processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of any one of claims 1-6.

20. A data processing apparatus, the apparatus comprising:
a second instruction receiving module, configured to receive an instruction, input and set through an all-in-one device, that instructs conversion of input voice data into text data;
a second text conversion module, configured to perform text conversion on the voice data according to the instruction, to obtain text data to be processed; and
a third processed text data acquisition module, configured to input the text data to be processed into a text post-processing model and obtain processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of any one of claims 1-6.

21. A data processing apparatus, the apparatus comprising:
a voice data receiving module, configured to receive voice data uploaded by a public cloud client;
a third text conversion module, configured to perform text conversion on the voice data, to obtain text data to be processed; and
a fourth processed text data acquisition module, configured to input the text data to be processed into a text post-processing model and obtain processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of any one of claims 1-6.

22. The apparatus according to claim 21, the apparatus further comprising:
a first processed text data return module, configured to return the processed text data to the public cloud client.

23. A data processing apparatus, the apparatus comprising:
a to-be-processed text data receiving module, configured to receive text data to be processed that is uploaded by a public cloud client, wherein the text data to be processed is obtained by the public cloud client performing text conversion on received voice data; and
a fifth processed text data acquisition module, configured to input the text data to be processed into a text post-processing model and obtain processed text data output by the text post-processing model, wherein the text post-processing model is obtained by the method of any one of claims 1-6.

24. The apparatus according to claim 23, the apparatus further comprising:
a second processed text data return module, configured to return the processed text data to the public cloud client.

25. An electronic device, characterized by comprising: a processor; and a memory configured to store computer-executable instructions which, when executed, cause the processor to implement the method of any one of claims 1 to 6, or the method of any one of claims 7-16.

26. A storage medium, characterized in that the storage medium stores computer-executable instructions which, when executed, implement the method of any one of claims 1 to 6, or the method of any one of claims 7-16.

27. A computer program, characterized in that the computer program comprises computer-executable instructions which, when executed, implement the method of any one of claims 1 to 6, or the method of any one of claims 7-16.
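Claims 8 to 10 describe comparing a hot-standby model with the current text post-processing model on a labeled sample set and switching only when the current model is less accurate. A minimal sketch of that comparison follows, under stated assumptions: models are represented as plain callables from text to text, accuracy is the exact-match rate, and the names `accuracy` and `select_model` are hypothetical. The claims do not say what happens on a tie; this sketch keeps the current model in that case.

```python
from typing import Callable, Sequence, Tuple

TextModel = Callable[[str], str]
Sample = Tuple[str, str]  # (input text, label text)


def accuracy(model: TextModel, samples: Sequence[Sample]) -> float:
    """Exact-match accuracy of a model on labeled samples (an assumed metric)."""
    if not samples:
        raise ValueError("at least one evaluation sample is required")
    correct = sum(1 for text, label in samples if model(text) == label)
    return correct / len(samples)


def select_model(
    current_model: TextModel,
    hot_standby_model: TextModel,
    eval_samples: Sequence[Sample],
) -> TextModel:
    """Return the model to use for the next text post-processing operation."""
    current_acc = accuracy(current_model, eval_samples)
    standby_acc = accuracy(hot_standby_model, eval_samples)
    # Switch only when the current model is strictly less accurate.
    if current_acc < standby_acc:
        return hot_standby_model
    return current_model  # standby is worse, or the two are tied (assumption)
```

Evaluating both models on the same sample set, as claim 10 describes, keeps the comparison fair; the choice of exact match as the accuracy measure is only one possible option.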
CN202011482703.4A 2020-12-15 2020-12-15 Data processing method and device, electronic equipment and storage medium Pending CN114637843A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011482703.4A CN114637843A (en) 2020-12-15 2020-12-15 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114637843A true CN114637843A (en) 2022-06-17

Family

ID=81944636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011482703.4A Pending CN114637843A (en) 2020-12-15 2020-12-15 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114637843A (en)

Similar Documents

Publication Publication Date Title
JP7208952B2 (en) Method and apparatus for generating interaction models
CN112966712B (en) Language model training method and device, electronic equipment and computer readable medium
US11308671B2 (en) Method and apparatus for controlling mouth shape changes of three-dimensional virtual portrait
US10380996B2 (en) Method and apparatus for correcting speech recognition result, device and computer-readable storage medium
CN114637843A (en) Data processing method and device, electronic equipment and storage medium
CN112699991A (en) Method, electronic device, and computer-readable medium for accelerating information processing for neural network training
CN111968647B (en) Voice recognition method, device, medium and electronic equipment
CN111523640A (en) Training method and device of neural network model
WO2023273628A1 (en) Video loop recognition method and apparatus, computer device, and storage medium
US20200286470A1 (en) Method and apparatus for outputting information
US20240282027A1 (en) Method, apparatus, device and storage medium for generating animal figures
CN112133287A (en) Speech recognition model training method, speech recognition method and related device
CN115359314A (en) Model training method, image editing method, device, medium and electronic device
CN113434683A (en) Text classification method, device, medium and electronic equipment
CN114120975A (en) Method, device and storage medium for speech recognition punctuation recovery
CN113656573B (en) Text information generation method, device and terminal equipment
JP7640738B2 (en) Adaptive Visual Speech Recognition
CN112906403A (en) Semantic analysis model training method and device, terminal equipment and storage medium
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN114792388A (en) Image description character generation method and device and computer readable storage medium
CN113297974B (en) Model training method, information generation method, device, equipment and medium
CN112101257B (en) Training sample generation method, image processing method, device, equipment and medium
CN113822135A (en) Video processing method, device and equipment based on artificial intelligence and storage medium
CN115101075B (en) Voice recognition method and related device
CN115392374A (en) Radar intra-pulse modulation signal identification method and device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination