WO2021114634A1

WO2021114634A1 - Text annotation method, device, and storage medium

Info

Publication number: WO2021114634A1
Application number: PCT/CN2020/099493
Authority: WO
Inventors: 李文斌; 喻宁; 冯晶凌; 柳阳
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-05-28
Filing date: 2020-06-30
Publication date: 2021-06-17
Also published as: CN111695357A

Abstract

A text annotation method, a device, and a storage medium, related to the technical field for emotion recognition in artificial intelligence. The text annotation method comprises: acquiring a first text dataset from a first third-party platform, each piece of text data in the first text dataset comprising an emoji expression; annotating each piece of text data on the basis of the emoji expression of each piece of first text data in the first text dataset to produce a first annotation result for each piece of first text data, the first annotation result comprising a positive comment or a negative comment; producing a first training sample set on the basis of the first annotation result of each piece of first text data; using the first training sample set to train a first neural network; acquiring a second text dataset from a second third-party platform; and using the first neural network to annotate the second text dataset to produce a second annotation result for each piece of second text data in the second text dataset, the second annotation result comprising one of a positive comment, a negative comment, or a neutral comment.

Description

Text marking method, equipment and storage medium

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on May 28, 2020, with the application number 2020104658114 and the title of the invention "text labeling method and related products", the entire content of which is incorporated into this application by reference.

Technical field

This application relates to the technical field of emotion recognition in artificial intelligence, and specifically relates to a text labeling method, device and storage medium.

Background

With the development of artificial intelligence, neural networks are being applied ever more widely. For example, in the field of video surveillance, neural networks can be used to recognize people in surveillance videos; in the medical field, neural networks can be used to recognize tumors in MRI images; and in the field of text recognition, neural networks can be used for sentiment classification of text.

Although neural networks perform well at such recognition tasks, training a neural network first requires a sufficient number of training data sets of sufficiently high quality. Producing training data sets is very costly: high-quality original data must first be obtained from a database and then labeled. For example, when training a text emotion classification network, a large amount of semantically complete and emotionally clear text needs to be obtained, and this text is then labeled manually. However, the inventor found that, because the amount of text is extremely large, manual labeling requires a great deal of time and labor, and labeling efficiency is low.

Summary of the invention

The embodiments of the present application provide a text labeling method, device, and storage medium, which broaden the application scenarios of text annotation and improve the efficiency of text annotation.

The first aspect of the embodiments of the present application provides a text labeling method applied to an electronic device, including: the electronic device obtains a first text data set from a first third-party platform, where each piece of first text data in the first text data set includes an emoji expression; the electronic device labels each piece of first text data according to the emoji expression of each piece of first text data in the first text data set to obtain a first annotation result of each piece of first text data, where the first annotation result includes a positive evaluation or a negative evaluation; the electronic device obtains a first training sample set according to the first annotation result of each piece of first text data; the electronic device uses the first training sample set to train a first neural network; the electronic device obtains a second text data set from a second third-party platform; and the electronic device uses the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set, where the second annotation result includes one of a positive evaluation, a negative evaluation, or a neutral evaluation.

A second aspect of the embodiments of the present application provides an electronic device, including: an acquiring unit configured to acquire a first text data set from a first third-party platform, where each piece of first text data in the first text data set includes an emoji expression; a labeling unit configured to label each piece of first text data according to the emoji expression of each piece of first text data in the first text data set to obtain a first annotation result of each piece of first text data, where the first annotation result includes a positive evaluation or a negative evaluation; and a training unit configured to obtain a first training sample set according to the first annotation result of each piece of first text data and to use the first training sample set to train a first neural network. The acquiring unit is further configured to acquire a second text data set from a second third-party platform; the labeling unit is further configured to use the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set, where the second annotation result includes one of a positive evaluation, a negative evaluation, or a neutral evaluation.

The third aspect of the embodiments of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the processor, and the programs include instructions for executing the following steps: obtaining a first text data set from a first third-party platform, where each piece of first text data in the first text data set includes an emoji expression; labeling each piece of first text data according to the emoji expression of each piece of first text data in the first text data set to obtain a first annotation result of each piece of first text data, where the first annotation result includes a positive evaluation or a negative evaluation; obtaining a first training sample set according to the first annotation result of each piece of first text data; using the first training sample set to train a first neural network; obtaining a second text data set from a second third-party platform; and using the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set, where the second annotation result includes one of a positive evaluation, a negative evaluation, or a neutral evaluation.

The fourth aspect of the embodiments of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium is used to store a computer program, and the stored computer program is executed by a processor to implement the following steps: obtaining a first text data set from a first third-party platform, where each piece of first text data in the first text data set includes an emoji expression; labeling each piece of first text data according to the emoji expression of each piece of first text data in the first text data set to obtain a first annotation result of each piece of first text data, where the first annotation result includes a positive evaluation or a negative evaluation; obtaining a first training sample set according to the first annotation result of each piece of first text data; using the first training sample set to train a first neural network; obtaining a second text data set from a second third-party platform; and using the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set, where the second annotation result includes one of a positive evaluation, a negative evaluation, or a neutral evaluation.

It can be seen that in the embodiments of the present application, the text data is annotated by means of the emoji expressions it contains, and there is no need to perform semantic analysis on the text data, so the annotation is not restricted by the language of the text data, which broadens the application scenarios of text annotation; in addition, the text data can be annotated automatically through emoji expressions, without manual annotation, which saves human and material resources.

Description of the drawings

In order to more clearly describe the technical solutions in the embodiments of the present application, the following will briefly introduce the drawings needed in the description of the embodiments. Obviously, the drawings in the following description are some embodiments of the present application. For those of ordinary skill in the art, without creative work, other drawings can be obtained from these drawings.

FIG. 1 is a schematic flowchart of a labeling method provided by an embodiment of the application.

FIG. 2 is a schematic flowchart of another labeling method provided by an embodiment of the application.

FIG. 3 is a schematic flowchart of another labeling method provided by an embodiment of the application.

FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the application.

FIG. 5 is a block diagram of the functional unit composition of an electronic device provided by an embodiment of the application.

Detailed description

The technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

The terms "first", "second", "third" and "fourth" in the specification, claims, and drawings of this application are used to distinguish different objects, not to describe a specific order. In addition, the terms "including" and "having" and any variations of them are intended to cover non-exclusive inclusions. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted steps or units, or optionally also includes other steps or units inherent to such a process, method, product, or device.

Reference to "embodiments" herein means that specific features, results or characteristics described in conjunction with the embodiments may be included in at least one embodiment of the present application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.

The electronic devices in this application can include smartphones (such as Android phones, iOS phones, and Windows Phone phones), tablets, handheld computers, laptops, mobile Internet devices (MID), wearable devices, and so on. The above electronic devices are only examples and are not exhaustive. In practical applications, the electronic device may also include intelligent vehicle-mounted terminals, computer equipment, and so on.

Referring to FIG. 1, FIG. 1 is a schematic flowchart of a text labeling method provided by an embodiment of the application. The method is applied to an electronic device and includes the following steps.

101: The electronic device obtains the first text data set from the first third party platform.

Among them, the first third-party platform can be a social application such as Weibo, Twitter, or Facebook, or an e-commerce platform such as Amazon, Taobao, or Jingdong. That is, the first third-party platform is a third-party platform that contains a large amount of positively evaluated and negatively evaluated text data. The electronic device randomly obtains multiple pieces of first text data from the first third-party platform through an application programming interface (API) provided by that platform to obtain the first text data set. That is, the electronic device complies with the Robot protocol of the first third-party platform and obtains the first text data set from the first third-party platform through the platform's API.

In some possible implementation manners, since the first text data is obtained through the API of the first third-party platform without manual review, some of the first text data may not meet the requirements, for example because it contains no emoji expression or its text content is too short. Therefore, after multiple pieces of first text data are obtained, they are first cleaned to remove the first text data that contains no emoji expression or whose text content is too short, and the cleaned first text data constitutes the first text data set.
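As a minimal sketch of the compliance check mentioned above, the following assumes that the "Robot protocol" refers to a robots.txt style exclusion file. The rule lines, user agent name, and URLs are hypothetical, and the actual API request is left out because the description does not specify any platform's API.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rule lines that were already downloaded
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content of a first third-party platform (assumption).
EXAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]

# Only URLs that pass this check would be requested through the platform API.
print(allowed_by_robots(EXAMPLE_ROBOTS, "annotator-bot", "https://example.com/api/reviews"))
print(allowed_by_robots(EXAMPLE_ROBOTS, "annotator-bot", "https://example.com/private/data"))
```

In such a sketch the check runs once per candidate URL before any text data is requested.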

Therefore, each piece of first text data in the first text data set contains emoji expressions.
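The cleaning step can be sketched as follows. The emoji pattern is only a rough approximation of emoji code points, and the minimum length of 5 characters is a hypothetical value; the description does not state how "too short" is defined.

```python
import re

# Rough approximation of emoji code points (assumption): miscellaneous symbols
# and dingbats plus the main pictograph blocks.
EMOJI_PATTERN = re.compile("[\u2600-\u27BF\U0001F300-\U0001FAFF]")

# Hypothetical minimum number of non-emoji characters for a text to be kept.
MIN_TEXT_LENGTH = 5


def clean_first_text_data(texts, min_length=MIN_TEXT_LENGTH):
    """Keep only texts that contain an emoji and have enough text content."""
    cleaned = []
    for text in texts:
        content = EMOJI_PATTERN.sub("", text).strip()  # text without emoji
        if EMOJI_PATTERN.search(text) and len(content) >= min_length:
            cleaned.append(text)
    return cleaned


raw = [
    "Great phone, love it \U0001F600",  # kept: has emoji and enough text
    "\U0001F600",                       # dropped: no text content
    "No emoji in this sentence",        # dropped: no emoji
    "ok \U0001F44D",                    # dropped: text content too short
]
print(clean_first_text_data(raw))
```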

102: The electronic device labels each piece of first text data according to the emoji expression of each piece of first text data in the first text data set to obtain a first labeling result of each piece of first text data.

Wherein, the first labeling result includes a positive evaluation or a negative evaluation.

Exemplarily, once the first text data has been cleaned, each piece of first text data in the first text data set includes an emoji expression, and emoji expressions themselves carry an emotional evaluation. For example, the emoji [emoji image] expresses a positive evaluation, and the emoji [emoji image] expresses a negative evaluation. Therefore, the first emotion evaluation of each piece of first text data can be determined according to the emoji expression of each piece of first text data; then, each piece of first text data is labeled according to its first emotion evaluation, that is, an emotion tag is added to each piece of first text data. In other words, when the emoji expression of a piece of first text data belongs to the set of positive-evaluation emoji expressions, the first text data is marked as a positive evaluation, and when the emoji expression belongs to the set of negative-evaluation emoji expressions, the first text data is marked as a negative evaluation.

Wherein, for the first text data, the first annotation result includes a positive evaluation or a negative evaluation. The emotions corresponding to a positive evaluation include happiness, approval, appreciation, and so on, and the emotions corresponding to a negative evaluation include anger, pessimism, disapproval, and so on.

It should be noted that for some emoji expressions the corresponding emotional evaluation cannot be determined with certainty. For example, the emoji [emoji image] can be used to express happiness, which is a positive feeling, or to express sarcasm, which is a negative feeling. The first text data in the first text data set that contains such emoji expressions is not labeled; only the first text data containing emoji expressions that correspond to a positive evaluation or to a negative evaluation is labeled.
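The set-based labeling rule can be sketched as below. The members of both emoji sets are hypothetical, since the example emoji are not given in the text, and leaving a text unlabeled when it mixes positive and negative emoji is an added assumption rather than something the description states.

```python
# Hypothetical emoji sets (assumption): the description does not list them.
POSITIVE_EMOJI = {"\U0001F600", "\U0001F44D", "\U0001F60D"}  # grin, thumbs up, heart eyes
NEGATIVE_EMOJI = {"\U0001F621", "\U0001F44E", "\U0001F61E"}  # angry, thumbs down, sad


def label_by_emoji(text):
    """Return 'positive', 'negative', or None when no clear label applies."""
    chars = set(text)
    has_positive = bool(chars & POSITIVE_EMOJI)
    has_negative = bool(chars & NEGATIVE_EMOJI)
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    # Ambiguous emoji (in neither set) or mixed signals: leave unlabeled.
    return None


print(label_by_emoji("Fast delivery \U0001F44D"))
print(label_by_emoji("Broke after one day \U0001F621"))
print(label_by_emoji("Sure, great idea \U0001F642"))  # emoji not in either set
```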

Further, in order to improve the accuracy of the emoji-based annotation, the text content of each piece of first text data may be extracted and semantically analyzed to obtain the semantic information of each piece of first text data; a second sentiment evaluation of each piece of first text data is determined according to its semantic information; the first text data in the first text data set whose first sentiment evaluation and second sentiment evaluation are consistent is retained, and the first text data whose first sentiment evaluation and second sentiment evaluation are inconsistent is deleted. Double labeling through semantic analysis and emoji expressions reduces the labeling errors caused by relying on emoji expressions alone and improves the accuracy of labeling the first text data set.
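The consistency check between the emoji-based evaluation and the semantics-based evaluation can be sketched as follows. The keyword lexicon is a toy stand-in for semantic analysis, which the description does not specify, and all emoji sets and word lists are hypothetical.

```python
import re

# Hypothetical emoji sets and keyword lexicon (assumptions for illustration).
POSITIVE_EMOJI = {"\U0001F600", "\U0001F44D"}
NEGATIVE_EMOJI = {"\U0001F621", "\U0001F44E"}
POSITIVE_WORDS = {"great", "love", "good"}
NEGATIVE_WORDS = {"terrible", "hate", "bad"}


def first_evaluation(text):
    """Emoji-based evaluation: 'positive', 'negative', or None."""
    chars = set(text)
    if chars & POSITIVE_EMOJI and not chars & NEGATIVE_EMOJI:
        return "positive"
    if chars & NEGATIVE_EMOJI and not chars & POSITIVE_EMOJI:
        return "negative"
    return None


def second_evaluation(text):
    """Toy semantic evaluation based on keyword counts (stand-in only)."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    positive = len(words & POSITIVE_WORDS)
    negative = len(words & NEGATIVE_WORDS)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return None


def filter_consistent(texts):
    """Keep (text, label) pairs whose two evaluations agree; drop the rest."""
    kept = []
    for text in texts:
        first = first_evaluation(text)
        if first is not None and first == second_evaluation(text):
            kept.append((text, first))
    return kept


samples = [
    "I love this phone \U0001F600",
    "Great, it broke again \U0001F621",  # sarcastic: evaluations disagree
    "Terrible service \U0001F44E",
]
print(filter_consistent(samples))
```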

103: The electronic device obtains the first training sample set according to the first annotation result of each piece of first text data.

The labeled first text data is used as a labeled training sample, and the first training sample set is obtained.

104: The electronic device uses the first training sample set to train the first neural network.

Specifically, the initial parameters of the first neural network are constructed first, and the training samples in the first training sample set are input to the first neural network to obtain prediction results for the training samples; then, a loss function is constructed from the prediction results and the labeling results of the training samples, and the loss gradient is determined from the loss function; finally, the values of the parameters are updated by backpropagation based on the loss gradient and the gradient descent method. These steps are repeated until the first neural network converges, which completes the training of the first neural network.
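The training procedure can be illustrated with a deliberately small model. The sketch below uses a single-layer network (logistic regression) with a cross-entropy loss as a stand-in, since the description does not specify the architecture of the first neural network; the two-dimensional features, learning rate, and stopping tolerance are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_positive_probability(weights, bias, features):
    """Forward pass: probability that a sample is a positive evaluation."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(samples, labels, lr=0.5, tol=1e-4, max_epochs=5000):
    """Batch gradient descent until the loss change falls below tol."""
    n = len(samples)
    weights = [0.0] * len(samples[0])  # initial parameters
    bias = 0.0
    previous_loss = float("inf")
    for _ in range(max_epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        loss = 0.0
        for x, y in zip(samples, labels):
            p = predict_positive_probability(weights, bias, x)
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            error = p - y  # gradient of the loss w.r.t. the pre-activation
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        loss /= n
        # Gradient descent update of the parameters.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        if abs(previous_loss - loss) < tol:  # treated as convergence
            break
        previous_loss = loss
    return weights, bias


# Hypothetical features: [count of positive cues, count of negative cues].
X = [[1, 0], [2, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]  # 1 = positive evaluation, 0 = negative evaluation
weights, bias = train(X, y)
print(predict_positive_probability(weights, bias, [1, 0]) > 0.5)
```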

105: The electronic device obtains the second text data set from the second third party platform.

Among them, the second third-party platform may be a platform that publishes science and technology news, wiki entries, or summary text. That is, the second third-party platform is a third-party platform that contains a large amount of neutrally evaluated text data.

Similarly, the electronic device complies with the Robot protocol of the second third party platform, and obtains multiple pieces of second text data from the second third party platform through the API of the second third party platform to obtain the second text data set.

Of course, after multiple pieces of second text data are acquired, they can be cleaned to remove invalid second text data whose text content is too short.

106: The electronic device uses the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set.

Wherein, the second labeling result includes one of a positive evaluation, a negative evaluation, or a neutral evaluation.

Specifically, the electronic device uses the first neural network to classify each piece of second text data in the second text data set to obtain a first probability that the second text data is a positive evaluation and a second probability that the second text data is a negative evaluation. Second text data whose first probability is greater than the first threshold (that is, the sentiment evaluation of the second text data can be determined with certainty to be positive) is marked as a positive evaluation; second text data whose second probability is greater than the first threshold (that is, the sentiment evaluation of the second text data can be determined with certainty to be negative) is marked as a negative evaluation; and second text data whose first probability is less than the first threshold and greater than the second threshold (that is, it cannot be determined with certainty whether the sentiment evaluation of the second text data is positive or negative) is marked as a neutral evaluation.

Wherein, the first threshold may be 0.7, 0.75, 0.8 or other values. The second threshold may be 0.4, 0.45, 0.5 or other values.
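The thresholding rule can be sketched as below, using 0.7 and 0.4 from the example values above. The sketch reads the neutral rule as applying to the first (positive) probability, which is an interpretation of the description, and returning None for probabilities that fall outside all three rules is an added assumption.

```python
FIRST_THRESHOLD = 0.7   # one of the example values given in the description
SECOND_THRESHOLD = 0.4  # one of the example values given in the description


def second_annotation(p_positive, p_negative,
                      first_threshold=FIRST_THRESHOLD,
                      second_threshold=SECOND_THRESHOLD):
    """Map the two class probabilities to a second annotation result."""
    if p_positive > first_threshold:
        return "positive"
    if p_negative > first_threshold:
        return "negative"
    if second_threshold < p_positive < first_threshold:
        return "neutral"
    # Not covered by the three rules; left unlabeled here (assumption).
    return None


print(second_annotation(0.90, 0.10))
print(second_annotation(0.10, 0.90))
print(second_annotation(0.55, 0.45))
```

Leaving the uncovered band unlabeled keeps the sketch from guessing a label the description does not assign.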

It can be seen that, in the embodiment of the present application, the text data is annotated by means of the emoji expressions it contains, and there is no need to perform semantic analysis on the text data, so the annotation is not restricted by the language of the text data, which broadens the application scenarios of the annotation method; in addition, the text data can be annotated automatically through emoji expressions, without manual annotation, thereby saving human and material resources.

In some possible implementation manners, the method further includes: the electronic device obtains a second training sample set according to the second annotation result of each piece of second text data in the second text data set, that is, the second text data set together with the annotation results of its second text data forms a labeled second training sample set; then, the second training sample set is used to train a second neural network; any piece of comment data to be published is obtained, and the second neural network is used to classify it to obtain a classification result of the comment data to be published; and whether to publish the comment data is determined according to the classification result.

Where the comment data to be published is comment data to be posted on a news website, the comment data is published when the classification result is a positive evaluation or a neutral evaluation, and is not published when the classification result is a negative evaluation. Compared with the existing approach of reviewing comment data to be published manually, in this application the comment data to be published can be reviewed automatically through the second neural network, thereby saving human resources.

Where the comment data to be published is review data on an e-commerce platform, when the classification result is a positive evaluation or a negative evaluation, the review data to be published is checked against the user's purchase records to determine its authenticity, and if the review data to be published is determined to be a malicious review, it is not published. In this application, the review data to be published can be reviewed automatically through the second neural network to determine its authenticity, thereby saving human resources.

In some possible implementation manners, most of the second text data obtained from the second third-party platform is neutral text data, whereas most of the first text data obtained from the first third-party platform is positively evaluated or negatively evaluated text data. Therefore, in order to increase the number of positively evaluated and negatively evaluated training samples in the second training sample set, the second training sample set can be combined with the first training sample set to obtain a new second training sample set with sufficient training samples, and the second neural network is trained using the new second training sample set, thereby making the trained second neural network more accurate.

In some possible implementation manners, after determining the first emotional evaluation of each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, the method further includes: extracting the text content of each piece of first text data; converting the text content into a second emoji expression; determining the second emotional evaluation corresponding to each piece of first text data according to the second emoji expression; and determining whether the first sentiment evaluation and the second sentiment evaluation of each piece of first text data are consistent, and if they are consistent, labeling each piece of first text data according to its first sentiment evaluation. Through the text-to-emoji operation, the sentiment evaluation corresponding to each piece of first text data is verified, thereby improving the accuracy of the subsequent labeling of the first text data.

In some possible implementation manners, the method further includes: obtaining comment data of any user, the comment data being the user's comments on a target product, where the target product includes wealth management products; classifying the user's comment data using the second neural network to obtain a classification result of the user's comment data; screening target users according to the classification result, that is, taking users whose classification result is a positive evaluation as target users; and recommending the target product to the target users.
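The two publication decisions can be sketched as follows. The classification label is assumed to come from the second neural network. Treating a positive or negative review with no matching purchase record as malicious is a simplifying assumption, since the description does not say how authenticity is judged.

```python
def should_publish_news_comment(classification):
    """News website: publish positive or neutral comments, block negative ones."""
    return classification in {"positive", "neutral"}


def should_publish_ecommerce_review(classification, has_purchase_record):
    """E-commerce platform: check positive/negative reviews against purchases.

    Assumption: a positive or negative review without a matching purchase
    record is treated as a malicious review and is not published.
    """
    if classification in {"positive", "negative"} and not has_purchase_record:
        return False
    return True


print(should_publish_news_comment("neutral"))
print(should_publish_news_comment("negative"))
print(should_publish_ecommerce_review("negative", has_purchase_record=False))
```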
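Combining the two training sample sets can be sketched as a simple concatenation followed by a count per label, which makes the change in class balance visible. The sample texts and the (text, label) tuple layout are hypothetical.

```python
from collections import Counter


def merge_training_sets(first_set, second_set):
    """Combine both labeled sets and report how many samples each label has."""
    merged = list(second_set) + list(first_set)
    counts = Counter(label for _, label in merged)
    return merged, counts


# Hypothetical labeled samples as (text, label) pairs.
first_training_set = [("love it", "positive"), ("awful", "negative")]
second_training_set = [
    ("chip released today", "neutral"),
    ("article summary", "neutral"),
    ("impressive result", "positive"),
]

merged_set, label_counts = merge_training_sets(first_training_set, second_training_set)
print(label_counts)
```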

It can be seen that, in this embodiment, the second neural network is used to screen out users who are interested in the target product (financial management product) to ensure the accuracy of user screening and improve the success rate of recommendation.
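User screening can be sketched as a filter over classified comments. The keyword-based classifier below is a hypothetical stand-in for the second neural network, and the user identifiers and comments are invented for illustration.

```python
def toy_classify(comment):
    """Hypothetical stand-in for the second neural network."""
    return "positive" if "good" in comment.lower() else "negative"


def screen_target_users(user_comments, classify):
    """Return the users whose comment on the target product is positive."""
    return [
        user_id
        for user_id, comment in user_comments.items()
        if classify(comment) == "positive"
    ]


# Hypothetical comments on a wealth management product, keyed by user id.
comments = {"u1": "Good returns so far", "u2": "Too risky for me"}
print(screen_target_users(comments, toy_classify))
```

The target product would then be recommended only to the returned users.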

Referring to FIG. 2, FIG. 2 is a schematic flowchart of another text labeling method provided by an embodiment of this application. Content in this embodiment that is the same as in the embodiment shown in FIG. 1 is not described again here. The method is applied to an electronic device and includes the following steps.

201: The electronic device obtains the first text data set from the first platform.

202: The electronic device cleans each piece of first text data in the first text data set, deletes the first text data that does not contain emoji expressions to obtain a new first text data set, and uses the new first text data set as the first text data set.

203: The electronic device determines a first emotional evaluation of each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, and the first emotional evaluation includes a positive evaluation or a negative evaluation.

204: The electronic device extracts the text content of each piece of first text data, and performs semantic analysis on the text content of each piece of first text data to obtain semantic information of each piece of first text data.

205: The electronic device determines the second sentiment evaluation of each piece of first text data according to the semantic information of each piece of first text data.

206: The electronic device retains the first text data in the first text data set that has the same first sentiment evaluation and the second sentiment evaluation, and deletes the first text data in which the first sentiment evaluation and the second sentiment evaluation are inconsistent.

207: The electronic device labels the remaining first text data according to the first sentiment evaluation of the remaining first text data to obtain the first training sample set.

The remaining first text data is the first text data left in the first text data set after the first text data whose first sentiment evaluation and second sentiment evaluation are inconsistent has been deleted.

208: The electronic device uses the first training sample set to train the first neural network.

209: The electronic device obtains the second text data set from the second platform.

210: The electronic device uses the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set, where the second annotation result includes one of a positive evaluation, a negative evaluation, or a neutral evaluation.

It can be seen that in the embodiment of the present application, the comment data is annotated by means of the emoji expressions it contains, and there is no need to perform semantic analysis on the comment data, so the annotation is not restricted by the language of the comment data, which broadens the application scenarios of the labeling method; in addition, the comment data can be annotated automatically through emoji expressions, so training sample sets containing emotion classification labels can be obtained without manual labeling, thereby saving human and material resources; moreover, before the first text data set is annotated, it is cleaned to retain high-quality first text data, thereby improving the accuracy of the annotation.

Referring to FIG. 3, FIG. 3 is a schematic flowchart of another text labeling method provided by an embodiment of the application. Content in this embodiment that is the same as in the embodiments shown in FIG. 1 and FIG. 2 is not described again here. The method is applied to an electronic device and includes the following steps.

301: The electronic device obtains the first text data set from the first platform.

302: The electronic device cleans each piece of first text data in the first text data set, deletes the first text data that does not contain emoji expressions to obtain a new first text data set, and uses the new first text data set as the first text data set.

303: The electronic device determines a first emotional evaluation of each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, and the first emotional evaluation includes a positive evaluation or a negative evaluation.

304: The electronic device extracts the text content of each piece of first text data, performs semantic analysis on the text content of each piece of first text data, and obtains semantic information of each piece of first text data.

305: The electronic device determines the second sentiment evaluation of each piece of first text data according to the semantic information of each piece of first text data.

306: The electronic device retains the first text data in the first text data set that has the same first sentiment evaluation and the second sentiment evaluation, and deletes the first text data in which the first sentiment evaluation and the second sentiment evaluation are inconsistent.

307: The electronic device labels the remaining first text data according to the first sentiment evaluation of the remaining first text data to obtain a first training sample set.

308: The electronic device uses the first training sample set to train the first neural network.

309: The electronic device obtains the second text data set from the second platform.

310: The electronic device uses the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set, where the second annotation result includes one of a positive evaluation, a negative evaluation, or a neutral evaluation.

311: The electronic device obtains a second training sample set according to the second labeling result of each piece of second text data, and uses the second training sample set to train the second neural network.

312: The electronic device obtains any piece of comment data to be published, uses the second neural network to classify the comment data to obtain a classification result of the comment data, and determines whether to publish the comment data according to the classification result.
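The two-stage flow of steps 308 to 311 can be sketched as follows. This is only a minimal illustration under stated assumptions: the names `two_stage_training`, `Trainer` and `Classifier` are hypothetical, the training routines are passed in as callables because the source does not fix a network architecture, and second text data that the first network leaves unlabeled is simply skipped.

```python
from typing import Callable, List, Optional, Tuple

Sample = Tuple[str, str]                      # (text, label)
Classifier = Callable[[str], Optional[str]]   # text -> label, or None if undecided
Trainer = Callable[[List[Sample]], Classifier]


def two_stage_training(
    first_training_set: List[Sample],
    second_texts: List[str],
    train_first: Trainer,
    train_second: Trainer,
) -> Classifier:
    """Train the first network, label the second data set with it, train the second network."""
    first_net = train_first(first_training_set)        # step 308
    second_training_set: List[Sample] = []
    for text in second_texts:                          # steps 309-310
        label = first_net(text)
        if label is not None:                          # assumption: skip undecided texts
            second_training_set.append((text, label))
    return train_second(second_training_set)           # step 311
```

The returned classifier plays the role of the second neural network used in step 312.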

It can be seen that, in the embodiment of the present application, the text data is annotated by means of the emoji expressions in the text data, and the annotation itself does not need semantic analysis of the text data, so the annotation is not restricted by the language of the text data, which broadens the application scenarios of the labeling method. In addition, the text data can be annotated automatically by means of emoji expressions, and training sample sets containing emotion classification labels can be obtained without manual labeling, thereby saving human and material resources. Moreover, before the first text data set is annotated, it is cleaned so that only high-quality first text data is retained, thereby improving the accuracy of annotation. Finally, the trained second neural network is used to classify the comment data to be published, and comment data that does not meet the publication requirements is blocked automatically without human review, saving human resources.

Refer to FIG. 4, which is a schematic structural diagram of an electronic device according to an embodiment of the application. As shown in FIG. 4, the electronic device 400 includes a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by the processor, and the programs include instructions for performing the following steps: obtaining a first text data set from the first third party platform, where each piece of first text data in the first text data set includes an emoji expression; annotating each piece of first text data according to the emoji expression of each piece of first text data in the first text data set to obtain a first annotation result of each piece of first text data, where the first annotation result includes a positive evaluation or a negative evaluation; obtaining a first training sample set according to the first annotation result of each piece of first text data; training the first neural network using the first training sample set; obtaining a second text data set from the second third party platform; and annotating the second text data set using the first neural network to obtain a second annotation result of each piece of second text data in the second text data set, where the second annotation result includes one of a positive evaluation, a negative evaluation, or a neutral evaluation.

In some possible implementation manners, in terms of labeling each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, the processor is specifically configured to: determine the first sentiment evaluation of each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, where the first sentiment evaluation includes a positive evaluation or a negative evaluation; and label each piece of first text data according to the first sentiment evaluation of each piece of first text data.
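As a rough illustration of emoji-based labeling, the sketch below maps each emoji to a polarity and takes a majority vote. The table `EMOJI_SENTIMENT`, the function names, and the majority-vote rule are all assumptions made for this example; the source does not specify which emoji map to which evaluation or how several emoji in one text are combined.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical emoji-to-polarity table (not taken from the source).
EMOJI_SENTIMENT = {
    "😀": "positive",
    "👍": "positive",
    "😍": "positive",
    "😡": "negative",
    "👎": "negative",
    "😭": "negative",
}


def first_sentiment(text: str) -> Optional[str]:
    """Return 'positive' or 'negative' from the emoji in text, or None if undecided."""
    votes = Counter(EMOJI_SENTIMENT[ch] for ch in text if ch in EMOJI_SENTIMENT)
    if not votes or votes["positive"] == votes["negative"]:
        return None  # no known emoji, or a tie (assumed to be left unlabeled)
    return "positive" if votes["positive"] > votes["negative"] else "negative"


def label_by_emoji(texts: List[str]) -> List[Tuple[str, str]]:
    """Pair each text with its first sentiment evaluation, skipping undecided texts."""
    labeled = []
    for text in texts:
        sentiment = first_sentiment(text)
        if sentiment is not None:
            labeled.append((text, sentiment))
    return labeled
```

Because the lookup works on emoji characters only, it does not depend on the language of the surrounding text.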

In some possible implementation manners, after determining the first sentiment evaluation of each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, the processor is further configured to: extract the text content of each piece of first text data; perform semantic analysis on the text content of each piece of first text data to obtain the semantic information of each piece of first text data; determine the second sentiment evaluation of each piece of first text data according to the semantic information of each piece of first text data; and retain the first text data in the first text data set whose first sentiment evaluation is consistent with its second sentiment evaluation, and delete the first text data whose first sentiment evaluation is inconsistent with its second sentiment evaluation.
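The consistency check between the two evaluations could look like the following sketch. The semantic analysis is represented by a caller-supplied function, since the source does not name a specific analyzer; `toy_semantic_sentiment` is a purely hypothetical keyword rule used only to make the example runnable.

```python
from typing import Callable, Iterable, List, Tuple


def filter_consistent(
    samples: Iterable[Tuple[str, str]],
    semantic_sentiment: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Keep (text, first_sentiment) pairs whose semantic-based evaluation agrees."""
    kept = []
    for text, first in samples:
        second = semantic_sentiment(text)  # second sentiment evaluation
        if second == first:
            kept.append((text, first))
    return kept


# Hypothetical stand-in for the semantic analysis step.
def toy_semantic_sentiment(text: str) -> str:
    negative_words = ("bad", "terrible", "awful")
    lowered = text.lower()
    return "negative" if any(w in lowered for w in negative_words) else "positive"
```

A text such as "terrible service 👍", where the emoji and the wording disagree, would be dropped by this filter.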

In some possible implementation manners, before annotating the first text data set, the processor is further configured to: clean each piece of first text data in the first text data set, and delete the first text data that does not contain emoji expressions, to obtain a new first text data set; and use the new first text data set as the first text data set.
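The cleaning step, which keeps only first text data that contains at least one emoji expression, can be sketched as below. The Unicode ranges in `EMOJI_PATTERN` are an assumption that covers several common emoji blocks; they are not a complete definition of emoji and are not taken from the source.

```python
import re
from typing import Iterable, List

# Assumption: a few common Unicode blocks approximate "emoji expression".
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001F5FF"   # symbols and pictographs
    "\U0001F600-\U0001F64F"    # emoticons
    "\U0001F680-\U0001F6FF"    # transport and map symbols
    "\U0001F900-\U0001F9FF"    # supplemental symbols and pictographs
    "\u2600-\u27BF]"           # miscellaneous symbols and dingbats
)


def clean_dataset(texts: Iterable[str]) -> List[str]:
    """Return the new first text data set: only texts containing an emoji."""
    return [text for text in texts if EMOJI_PATTERN.search(text)]
```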

In some possible implementation manners, in terms of using the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set, the processor is specifically configured to: use the first neural network to classify each piece of second text data in the second text data set to obtain a first probability that each piece of second text data is a positive evaluation and a second probability that it is a negative evaluation; determine that the second annotation result of the second text data whose first probability is greater than the first threshold is a positive evaluation; determine that the second annotation result of the second text data whose second probability is greater than the first threshold is a negative evaluation; and determine that the second annotation result of the second text data whose first probability is less than the first threshold and greater than the second threshold is a neutral evaluation.
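The threshold rule can be written out as a small function. The default values 0.7 and 0.3 are hypothetical placeholders, since the source gives no numeric thresholds, and returning `None` when no rule applies is likewise an assumption of this sketch.

```python
from typing import Optional


def second_annotation(
    p_pos: float,
    p_neg: float,
    first_threshold: float = 0.7,   # hypothetical value
    second_threshold: float = 0.3,  # hypothetical value
) -> Optional[str]:
    """Map the first (positive) and second (negative) probabilities to a label."""
    if p_pos > first_threshold:
        return "positive"
    if p_neg > first_threshold:
        return "negative"
    if second_threshold < p_pos < first_threshold:
        return "neutral"
    return None  # the source states no rule for this case
```

For example, with these placeholder thresholds a text scored (0.5, 0.5) is labeled neutral, while a text scored (0.2, 0.6) matches none of the three rules.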

In some possible implementation manners, the processor is further configured to: obtain a second training sample set according to the second annotation result of each piece of second text data in the second text data set; use the second training sample set to train the second neural network; obtain any piece of comment data to be published; use the second neural network to perform sentiment classification on the comment data to be published to obtain a classification result of the comment data to be published; and determine, according to the classification result, whether to publish the comment data to be published.
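A publication gate driven by the classification result might be sketched as follows. The policy of blocking comments classified as negative is an assumption made for illustration; this passage only states that the decision depends on the classification result. `toy_classifier` is a hypothetical stand-in for the trained second neural network.

```python
from typing import Callable, FrozenSet


def should_publish(
    comment: str,
    classify: Callable[[str], str],
    blocked_labels: FrozenSet[str] = frozenset({"negative"}),  # assumed policy
) -> bool:
    """Return True if the comment's predicted label is not in the blocked set."""
    return classify(comment) not in blocked_labels


# Hypothetical stand-in for the trained second neural network.
def toy_classifier(comment: str) -> str:
    return "negative" if "hate" in comment.lower() else "positive"
```

Keeping the blocked labels as a parameter lets a platform apply a different publication policy without retraining the classifier.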

In some possible implementation manners, after obtaining a second training sample set according to the second annotation result of each piece of second text data in the second text data set, the processor is further configured to: combine the second training sample set with the first training sample set to obtain a new second training sample set; in terms of using the second training sample set to train the second neural network, the processor is specifically configured to: train the second neural network using the new second training sample set.
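Combining the two sample sets can be as simple as concatenation. The sketch below additionally drops exact duplicate (text, label) pairs, which is an assumption of this example rather than something the source requires; the function name is hypothetical.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[str, str]  # (text, label)


def merge_training_sets(second_set: Iterable[Sample], first_set: Iterable[Sample]) -> List[Sample]:
    """Concatenate both sets into the new second training sample set, without duplicates."""
    merged: List[Sample] = []
    seen = set()
    for sample in list(second_set) + list(first_set):
        if sample not in seen:
            seen.add(sample)
            merged.append(sample)
    return merged
```

The merged set contains positive, negative, and neutral labels, since only the second set carries neutral samples.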

Refer to FIG. 5, which is a block diagram of a functional unit composition of an electronic device provided by an embodiment of the present application. The electronic device 500 includes: an acquisition unit 510, a labeling unit 520, and a training unit 530.

The acquisition unit 510 is configured to obtain a first text data set from the first third party platform, where each piece of first text data in the first text data set includes an emoji expression; the labeling unit 520 is configured to label each piece of first text data according to the emoji expression of each piece of first text data in the first text data set to obtain a first annotation result of each piece of first text data, where the first annotation result includes a positive evaluation or a negative evaluation; the training unit 530 is configured to obtain a first training sample set according to the first annotation result of each piece of first text data, and to use the first training sample set to train the first neural network; the acquisition unit 510 is further configured to obtain a second text data set from the second third party platform; the labeling unit 520 is further configured to use the first neural network to label the second text data set to obtain a second annotation result of each piece of second text data in the second text data set, where the second annotation result includes one of a positive evaluation, a negative evaluation, or a neutral evaluation.

In some possible implementation manners, in terms of labeling each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, the labeling unit 520 is specifically configured to: determine the first sentiment evaluation of each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, where the first sentiment evaluation includes a positive evaluation or a negative evaluation; and label each piece of first text data according to the first sentiment evaluation of each piece of first text data.

In some possible implementation manners, the electronic device 500 further includes a cleaning unit 540. After the first sentiment evaluation of each piece of first text data is determined according to the emoji expression of each piece of first text data in the first text data set, the cleaning unit 540 is configured to: extract the text content of each piece of first text data; perform semantic analysis on the text content of each piece of first text data to obtain the semantic information of each piece of first text data; determine the second sentiment evaluation of each piece of first text data according to the semantic information of each piece of first text data; and retain the first text data in the first text data set whose first sentiment evaluation is consistent with its second sentiment evaluation, and delete the first text data whose first sentiment evaluation is inconsistent with its second sentiment evaluation.

In some possible implementation manners, the electronic device 500 further includes a cleaning unit 540. Before the first text data set is annotated, the cleaning unit 540 is configured to: clean each piece of first text data in the first text data set, and delete the first text data that does not contain emoji expressions, to obtain a new first text data set; and use the new first text data set as the first text data set.

In some possible implementation manners, in terms of using the first neural network to annotate the second text data set to obtain the second annotation result of each piece of second text data in the second text data set, the labeling unit 520 is specifically configured to: use the first neural network to classify each piece of second text data in the second text data set to obtain a first probability that each piece of second text data is a positive evaluation and a second probability that it is a negative evaluation; determine that the second annotation result of the second text data whose first probability is greater than the first threshold is a positive evaluation; determine that the second annotation result of the second text data whose second probability is greater than the first threshold is a negative evaluation; and determine that the second annotation result of the second text data whose first probability is less than the first threshold and greater than the second threshold is a neutral evaluation.

In some possible implementation manners, the electronic device 500 further includes a determining unit 550. The training unit 530 is further configured to obtain a second training sample set according to the second annotation result of each piece of second text data in the second text data set, and to use the second training sample set to train a second neural network. The determining unit 550 is configured to: obtain any piece of comment data to be published; use the second neural network to perform sentiment classification on the comment data to be published to obtain a classification result of the comment data to be published; and determine, according to the classification result, whether to publish the comment data to be published.

In some possible implementation manners, after obtaining a second training sample set according to the second annotation result of each piece of second text data in the second text data set, the training unit 530 is further configured to: combine the second training sample set with the first training sample set to obtain a new second training sample set; in terms of using the second training sample set to train the second neural network, the training unit 530 is specifically configured to: train the second neural network using the new second training sample set.

The embodiments of the present application also provide a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program, and the stored computer program is executed by a processor to implement the following steps: obtaining a first text data set from the first third party platform, where each piece of first text data in the first text data set includes an emoji expression; annotating each piece of first text data according to the emoji expression of each piece of first text data in the first text data set to obtain a first annotation result of each piece of first text data, where the first annotation result includes a positive evaluation or a negative evaluation; obtaining a first training sample set according to the first annotation result of each piece of first text data; training a first neural network using the first training sample set; obtaining a second text data set from a second third party platform; and annotating the second text data set using the first neural network to obtain a second annotation result of each piece of second text data in the second text data set, where the second annotation result includes one of a positive evaluation, a negative evaluation, or a neutral evaluation.

In some possible implementation manners, after determining the first sentiment evaluation of each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, the processor is further configured to: extract the text content of each piece of first text data; perform semantic analysis on the text content of each piece of first text data to obtain the semantic information of each piece of first text data; determine the second sentiment evaluation of each piece of first text data according to the semantic information of each piece of first text data; and retain the first text data in the first text data set whose first sentiment evaluation is consistent with its second sentiment evaluation, and delete the first text data whose first sentiment evaluation is inconsistent with its second sentiment evaluation.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that this application is not limited by the described sequence of actions, because according to this application some steps can be performed in another order or at the same time. In addition, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by this application.

In the above-mentioned embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in an embodiment, reference may be made to related descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division of the units is only a logical function division, and there may be other divisions in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical or other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit can be realized in the form of hardware or software program module.

If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer readable memory. Based on this understanding, the technical solution of the present application essentially or the part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a memory. A number of instructions are included to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned memory includes: U disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.

Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above-mentioned embodiments can be completed by a program instructing relevant hardware. The program can be stored in a computer-readable memory, and the memory can include: a flash disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, etc.

The embodiments of the application are described in detail above, and specific examples are used herein to illustrate the principles and implementation of the application. The descriptions of the above embodiments are only intended to help understand the methods and core ideas of the application. At the same time, those of ordinary skill in the art may, based on the ideas of the application, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the application.

Claims

A text labeling method, which is applied to electronic equipment, includes:

The electronic device obtains a first text data set from a first third party platform, and each piece of first text data in the first text data set includes an emoji expression;

The electronic device annotates each piece of first text data according to the emoji expression of each piece of first text data in the first text data set to obtain a first annotation result of each piece of first text data, where the first annotation result includes a positive evaluation or a negative evaluation;

The electronic device obtains the first training sample set according to the first annotation result of each piece of first text data;

The electronic device uses the first training sample set to train a first neural network;

The electronic device obtains the second text data set from the second third party platform;

The electronic device uses the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set, where the second annotation result includes one of a positive evaluation, a negative evaluation, or a neutral evaluation.
The method according to claim 1, wherein the electronic device marking each piece of first text data according to the emoji expression of each piece of first text data in the first text data set comprises:

Determine the first sentiment evaluation of each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, where the first sentiment evaluation includes a positive evaluation or a negative evaluation;

According to the first sentiment evaluation of each piece of first text data, mark each piece of first text data.
The method according to claim 2, wherein, after determining the first sentiment evaluation of each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, the method further comprises:

Extract the text content of each piece of first text data;

Perform semantic analysis on the text content of each piece of first text data to obtain semantic information of each piece of first text data;

Determine the second sentiment evaluation of each piece of first text data according to the semantic information of each piece of first text data;

Retaining the first text data in the first text data set whose first sentiment evaluation is consistent with its second sentiment evaluation, and deleting the first text data whose first sentiment evaluation is inconsistent with its second sentiment evaluation.
The method according to any one of claims 1 to 3, wherein, before the electronic device annotates the first text data set, the method further comprises:

Clean each piece of first text data in the first text data set, delete the first text data that does not contain emoji expressions, and obtain a new first text data set;

Use the new first text data set as the first text data set.
The method according to claim 1, wherein using the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set includes:

Use the first neural network to classify each piece of second text data in the second text data set to obtain a first probability of a positive evaluation and a second probability of a negative evaluation for each piece of second text data;

Determining that the second annotation result of the second text data whose first probability is greater than the first threshold is a positive evaluation;

Determining that the second annotation result of the second text data whose second probability is greater than the first threshold is a negative evaluation;

Determining that the second annotation result of the second text data whose first probability is less than the first threshold and greater than the second threshold is a neutral evaluation.
The method according to claim 1, wherein the method further comprises:

The electronic device obtains a second training sample set according to the second annotation result of each piece of second text data in the second text data set;

The electronic device uses the second training sample set to train a second neural network;

The electronic device obtains any piece of comment data to be published;

The electronic device uses the second neural network to perform emotional classification on the comment data to be published, and obtain a classification result of the comment data to be published;

The electronic device determines whether to publish the comment data to be published according to the classification result.
The method according to claim 6, wherein after the electronic device obtains the second training sample set according to the second annotation result of each piece of second text data in the second text data set, the method further comprises:

Combining the second training sample set with the first training sample set to obtain a new second training sample set;

The electronic device using the second training sample set to train a second neural network includes:

The electronic device uses the new second training sample set to train the second neural network.
An electronic device, including:

The acquiring unit is configured to acquire a first text data set from the first third party platform, and each piece of first text data in the first text data set includes an emoji expression;

The labeling unit is configured to label each piece of first text data according to the emoji expression of each piece of first text data in the first text data set to obtain a first labeling result of each piece of first text data, where the first labeling result includes a positive evaluation or a negative evaluation;

A training unit, configured to obtain a first training sample set according to the first annotation result of each piece of first text data, and use the first training sample set to train the first neural network;

The acquiring unit is further configured to acquire a second text data set from a second third party platform;

The labeling unit is further configured to use the first neural network to label the second text data set to obtain a second labeling result of each piece of second text data in the second text data set, and the second The labeling result includes one of positive evaluation, negative evaluation or neutral evaluation.
An electronic device, including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the processor, and the programs include instructions for performing the following steps:

Acquiring a first text data set from the first third party platform, where each piece of first text data in the first text data set includes an emoji expression;

According to the emoji expression of each piece of first text data in the first text data set, each piece of first text data is annotated to obtain a first annotation result of each piece of first text data, where the first annotation result includes a positive evaluation or a negative evaluation;

Obtain the first training sample set according to the first annotation result of each piece of first text data;

Use the first training sample set to train the first neural network;

Obtain the second text data set from the second third party platform;

Use the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set, where the second annotation result includes one of a positive evaluation, a negative evaluation, or a neutral evaluation.
The device according to claim 9, wherein, in terms of labeling each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, the processor is specifically configured to :

Determine the first sentiment evaluation of each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, where the first sentiment evaluation includes a positive evaluation or a negative evaluation;

According to the first sentiment evaluation of each piece of first text data, mark each piece of first text data.
The device according to claim 10, wherein, after determining the first sentiment evaluation of each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, the processor is further configured to:

Extract the text content of each piece of first text data;

Perform semantic analysis on the text content of each piece of first text data to obtain semantic information of each piece of first text data;

Determine the second sentiment evaluation of each piece of first text data according to the semantic information of each piece of first text data;

Retaining the first text data in the first text data set whose first sentiment evaluation is consistent with its second sentiment evaluation, and deleting the first text data whose first sentiment evaluation is inconsistent with its second sentiment evaluation.
The device according to any one of claims 9-11, wherein, before annotating the first text data set, the processor is further configured to: clean each piece of first text data in the first text data set, and delete the first text data that does not contain emoji expressions, to obtain a new first text data set; and use the new first text data set as the first text data set.
The device according to claim 9, wherein, in terms of using the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set, the processor is specifically configured to: use the first neural network to classify each piece of second text data in the second text data set to obtain a first probability that each piece of second text data is a positive evaluation and a second probability that it is a negative evaluation;

Determining that the second annotation result of the second text data whose first probability is greater than the first threshold is a positive evaluation;

Determining that the second annotation result of the second text data whose second probability is greater than the first threshold is a negative evaluation;

Determining that the second annotation result of the second text data whose first probability is less than the first threshold and greater than the second threshold is a neutral evaluation.
The device according to claim 9, wherein the processor is further configured to:

Obtaining a second training sample set according to the second annotation result of each piece of second text data in the second text data set;

Use the second training sample set to train a second neural network;

Get any piece of comment data to be published;

Using the second neural network to perform sentiment classification on the comment data to be published to obtain a classification result of the comment data to be published;

According to the classification result, it is determined whether to publish the comment data to be published.
The device according to claim 14, wherein, after obtaining the second training sample set according to the second annotation result of each piece of second text data in the second text data set, the processor is further configured to combine the second training sample set with the first training sample set to obtain a new second training sample set;

In terms of using the second training sample set to train the second neural network, the processor is specifically configured to: use the new second training sample set to train the second neural network.
A computer-readable storage medium, wherein the computer-readable storage medium is used to store a computer program, and the stored computer program is executed by a processor to implement the following steps:

Acquiring a first text data set from the first third party platform, where each piece of first text data in the first text data set includes an emoji expression;

According to the emoji expression of each piece of first text data in the first text data set, each piece of first text data is annotated to obtain a first annotation result of each piece of first text data, where the first annotation result includes a positive evaluation or a negative evaluation;

Obtain the first training sample set according to the first annotation result of each piece of first text data;

Use the first training sample set to train the first neural network;

Obtain the second text data set from the second third party platform;

Use the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set, where the second annotation result includes one of a positive evaluation, a negative evaluation, or a neutral evaluation.
The medium according to claim 16, wherein, in terms of labeling each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, the processor is specifically configured to :

Determine the first sentiment evaluation of each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, where the first sentiment evaluation includes a positive evaluation or a negative evaluation;

According to the first sentiment evaluation of each piece of first text data, mark each piece of first text data.
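The emoji-based annotation described above might be sketched as follows. The emoji lists, the majority-count rule, and the names `first_sentiment` and `annotate` are assumptions for illustration; the application does not specify which emoji map to which evaluation or how ties are handled.

```python
# Hypothetical emoji-to-sentiment lookup; the source does not list specific emoji.
POSITIVE_EMOJI = {"😀", "😊", "👍"}
NEGATIVE_EMOJI = {"😡", "😞", "👎"}

def first_sentiment(text):
    """Return 'positive', 'negative', or None when the emoji give no clear signal."""
    pos = sum(ch in POSITIVE_EMOJI for ch in text)
    neg = sum(ch in NEGATIVE_EMOJI for ch in text)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return None  # assumed handling: no emoji, or a tie

def annotate(texts):
    """Pair each text with its emoji-derived label, skipping undecided texts."""
    labeled = []
    for text in texts:
        label = first_sentiment(text)
        if label is not None:
            labeled.append((text, label))
    return labeled
```

For example, `annotate(["great phone 👍", "no emoji here"])` would keep only the first text, labeled as positive.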
The medium according to claim 17, wherein, after determining the first sentiment evaluation of each piece of first text data according to the emoji expression of each piece of first text data in the first text data set, the processor is further configured to:

Extract the text content of each piece of first text data;

Perform semantic analysis on the text content of each piece of first text data to obtain semantic information of each piece of first text data;

Determine the second sentiment evaluation of each piece of first text data according to the semantic information of each piece of first text data;

Retain, in the first text data set, the first text data for which the first sentiment evaluation and the second sentiment evaluation are consistent, and delete the first text data for which the first sentiment evaluation and the second sentiment evaluation are inconsistent.
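A minimal sketch of this consistency check is given below. The semantic analysis is not specified in this claim, so a hypothetical keyword lookup (`POSITIVE_WORDS`, `NEGATIVE_WORDS`) stands in for it here; a real implementation would presumably use a proper semantic model. The function names are likewise assumptions.

```python
# Hypothetical keyword lists standing in for the unspecified semantic analysis.
POSITIVE_WORDS = {"great", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "broken", "terrible"}

def second_sentiment(text_content):
    """Derive a sentiment evaluation from the text content alone (no emoji)."""
    words = set(text_content.lower().split())
    pos = len(words & POSITIVE_WORDS)
    neg = len(words & NEGATIVE_WORDS)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return None  # assumed: undecided text never matches the first evaluation

def filter_consistent(samples):
    """Keep (text_content, first_sentiment) pairs whose two evaluations agree."""
    return [(text, first) for text, first in samples
            if second_sentiment(text) == first]
```

In this sketch a text whose content gives no sentiment signal is treated as inconsistent and removed, which is one possible design choice rather than something the claim states.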
The medium according to any one of claims 16-18, wherein, before annotating the first text data set, the processor is further configured to: clean the first text data set by deleting the first text data that does not contain emoji expressions, to obtain a new first text data set; and use the new first text data set as the first text data set.
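The cleaning step could look roughly like the sketch below. The code-point ranges in `EMOJI_RANGES` are a coarse assumption used only to detect whether a text contains an emoji; they are not an exhaustive emoji definition and do not come from the application.

```python
# Rough, assumed code-point ranges; not an exhaustive emoji definition.
EMOJI_RANGES = ((0x1F300, 0x1FAFF), (0x2600, 0x27BF))

def contains_emoji(text):
    """Return True if any character falls inside one of the assumed ranges."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in EMOJI_RANGES)

def clean(first_text_data_set):
    """Delete texts without emoji and return the new first text data set."""
    return [text for text in first_text_data_set if contains_emoji(text)]
```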
The medium according to claim 16, wherein, in terms of using the first neural network to annotate the second text data set to obtain a second annotation result of each piece of second text data in the second text data set, the processor is specifically configured to: use the first neural network to classify each piece of second text data in the second text data set to obtain a first probability that each piece of second text data is a positive evaluation and a second probability that it is a negative evaluation;

Determine that the second annotation result of the second text data whose first probability is greater than a first threshold is a positive evaluation;

Determine that the second annotation result of the second text data whose second probability is greater than the first threshold is a negative evaluation;

Determine that the second annotation result of the second text data whose first probability is less than the first threshold and greater than a second threshold is a neutral evaluation.
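The threshold rules above can be sketched as a small function. The threshold values 0.7 and 0.3 and the name `second_annotation` are assumptions chosen for illustration; the claim does not give concrete values. Returning `None` when no rule applies is also an assumption of this sketch.

```python
def second_annotation(p_pos, p_neg, first_threshold=0.7, second_threshold=0.3):
    """Map the classifier's two probabilities to a second annotation result.

    p_pos: first probability (positive evaluation)
    p_neg: second probability (negative evaluation)
    Threshold defaults are hypothetical, with first_threshold > second_threshold.
    """
    if p_pos > first_threshold:
        return "positive"
    if p_neg > first_threshold:
        return "negative"
    if second_threshold < p_pos < first_threshold:
        return "neutral"
    return None  # assumed: no rule in the claim covers this case

# Hypothetical classifier outputs for three pieces of second text data.
results = [second_annotation(p, q) for p, q in [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5)]]
```

With these assumed thresholds, a confident positive, a confident negative, and an undecided prediction are labeled positive, negative, and neutral respectively.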