CN114065709A - Punctuation mark adding method and device, electronic equipment and storage medium - Google Patents

Punctuation mark adding method and device, electronic equipment and storage medium

Info

Publication number
CN114065709A
CN114065709A (application CN202111424072.5A)
Authority
CN
China
Prior art keywords
text data
network model
initial
neural network
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111424072.5A
Other languages
Chinese (zh)
Inventor
叶永龙
何维华
刘宝强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Skieer Information Technology Co ltd
Original Assignee
Shenzhen Skieer Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Skieer Information Technology Co ltd filed Critical Shenzhen Skieer Information Technology Co ltd
Priority to CN202111424072.5A priority Critical patent/CN114065709A/en
Publication of CN114065709A publication Critical patent/CN114065709A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a punctuation mark adding method and device, an electronic device, and a storage medium, wherein the punctuation mark adding method comprises the following steps: constructing an initial shallow neural network model; acquiring initial text data and performing data preprocessing on it to generate text data to be labeled; labeling the text data to be labeled to generate labeled pre-training text data; mapping the pre-training text data to the initial shallow neural network model for training to obtain a target shallow neural network model; acquiring preprocessed text data and mapping it to the target shallow neural network model for label prediction to obtain a label prediction result; and adding punctuation marks to the preprocessed text data according to the label prediction result to generate target text data. The invention enables the analysis of text without punctuation marks by adding punctuation quickly and automatically, with accurate results.

Description

Punctuation mark adding method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a punctuation mark adding method and device, electronic equipment, and a storage medium.
Background
With the continuous development of internet technology, more and more people are willing to share their views and opinions online. Analyzing this data reveals people's attitudes toward a given subject, and the results can inform different decisions for different groups. For example, when people publish their shopping and usage experience for a product, the merchant can analyze those opinions to learn users' views on different attributes of the product, identify its strengths and weaknesses, and improve it; a potential purchaser can analyze previous buyers' experience with the product to decide whether to buy it. However, owing to people's writing habits, texts without punctuation often appear. This hinders automatic analysis of the opinions expressed, making the analysis results insufficiently accurate, and also makes it harder for readers to understand the semantics of the text.
In the related art, there are methods that can add punctuation marks to text data lacking them. The most common is to add punctuation marks through statistical rules. This approach has two main problems. First, modifying and checking the added data consumes a large amount of manpower, and because people interpret text differently, different reviewers may reach different conclusions about the same text. Second, writing styles vary across fields and social platforms, so punctuation addition based on statistical rules does not transfer well across fields and platforms.
Disclosure of Invention
The invention aims to provide a punctuation mark adding method and device, an electronic device, and a storage medium that enable the analysis of text data without punctuation marks by adding punctuation quickly and automatically, with accurate results.
In order to solve the technical problems, the invention adopts a technical scheme that: there is provided a punctuation mark adding method, the method comprising:
constructing an initial shallow neural network model;
acquiring initial text data, and performing data preprocessing on the initial text data to generate text data to be labeled;
labeling the text data to be labeled to generate pre-training text data with labels;
mapping the pre-training text data to the initial shallow neural network model for training to obtain a target shallow neural network model;
acquiring preprocessed text data, and mapping the preprocessed text data to the target shallow neural network model for label prediction to obtain a label prediction result;
and adding punctuation marks to the preprocessed text data according to the label prediction result to generate target text data.
In order to solve the technical problem, the invention adopts another technical scheme that: there is provided a punctuation mark addition apparatus comprising means for performing the punctuation mark addition method as described above.
In order to solve the technical problem, the invention adopts another technical scheme that: an electronic device is provided, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus.
A memory for storing a computer program.
And a processor for implementing the steps of the punctuation mark adding method as described above when executing the program stored in the memory.
In order to solve the technical problem, the invention adopts another technical scheme that: a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the steps of the punctuation mark adding method as described above.
According to the method, an initial shallow neural network model is constructed and initial text data is imported to train it, exploiting the strong self-learning capacity of a shallow neural network. The resulting target shallow neural network model then enables rapid automatic processing of large amounts of data, saving manpower and material resources, avoiding manual errors, and improving the accuracy of the processing results. Data preprocessing is performed before the initial text data is imported, and the text data to be labeled generated by preprocessing is labeled to produce pre-training text data for training. Because the pre-training text data contains the result of the labeling processing, it can be used to train the initial shallow neural network model, so that the resulting target shallow neural network model also acquires the labeling capability. After the preprocessed text data is imported, the target shallow neural network model processes it according to what it learned during training, yielding a label prediction result for the preprocessed text data; punctuation marks are then added to the preprocessed text data according to the label prediction result, generating target text data with complete punctuation.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings that are needed to be used in the invention will be briefly described below, it being understood that the following drawings only show some embodiments of the invention and therefore should not be considered as limiting the scope, and that for a person skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a flow chart illustrating a punctuation mark adding method according to an embodiment of the present invention.
Fig. 2 is a sub-flow diagram of a punctuation mark adding method according to an embodiment of the present invention.
Fig. 3 is a sub-flow diagram of a punctuation mark adding method according to an embodiment of the present invention.
Fig. 4 is a sub-flow diagram of a punctuation mark adding method according to an embodiment of the present invention.
Fig. 5 is a sub-flow diagram of a punctuation mark adding method according to an embodiment of the present invention.
Fig. 6 is a sub-flow diagram of a punctuation mark adding method according to an embodiment of the present invention.
Fig. 7 is a sub-flow diagram of a punctuation mark adding method according to an embodiment of the present invention.
Fig. 8 is a schematic block diagram of an electronic device according to an embodiment of the present invention.
Fig. 9 is a schematic block diagram of a punctuation mark adding apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, wherein like reference numerals represent like elements in the drawings. It is apparent that the embodiments to be described below are only a part of the embodiments of the present invention, and not all of them. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the embodiments of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the invention. As used in the description of embodiments of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a punctuation mark adding method according to an embodiment of the present invention.
The invention provides a punctuation adding method, which comprises the following steps:
s101: and constructing an initial shallow neural network model.
Understandably, a neural network is a complex network system formed by a large number of simple, widely interconnected processing units (neurons); it reflects many basic features of human brain function and is a highly complex nonlinear dynamical learning system. A neural network has the capabilities of large-scale parallelism, distributed storage and processing, self-organization, self-adaptation, and self-learning. A neural network model is described on the basis of a mathematical model of the neuron and is characterized by its network topology, node characteristics, and learning rules. A shallow neural network model is a neural network model with a single hidden layer.
S102: acquiring initial text data, and performing data preprocessing on the initial text data to generate text data to be labeled.
Understandably, the initial text data comprises e-commerce comment data meeting set requirements on industry, field, and brand. Data preprocessing is performed on the initial text data, and the preprocessed result is used as the text data to be labeled. Preprocessing filters out noise data that might affect the analysis, improving the processing speed and the accuracy of subsequent analysis results.
S103: and labeling the text data to be labeled to generate pre-training text data with labels.
Understandably, after the text data to be labeled is labeled, the generated pre-training text data contains the labeling result; that is, it carries labels and can be used for subsequent training, so that the trained model also acquires the labeling capability.
S104: and mapping the pre-training text data to the initial shallow neural network model for training to obtain a target shallow neural network model.
Understandably, the neural network model has good representation learning ability.
S105: and acquiring preprocessed text data, and mapping the preprocessed text data to the target shallow neural network model for label prediction to obtain a label prediction result.
Understandably, the preprocessed text data comprises e-commerce comment data meeting set requirements on industry, field, and brand. The trained target shallow neural network model has the labeling capability: it can predict a label for each character of the preprocessed text data and determine which label type each character should be assigned.
S106: and adding punctuation marks to the preprocessed text data according to the label prediction result to generate target text data.
Understandably, after the label type corresponding to each character of the preprocessed text is determined in step S105, the positions where punctuation marks should be added can be determined from the label prediction result. Punctuation marks are added at those positions, and finally target text data with complete punctuation is generated, which can be used for accurate analysis of the e-commerce comment data.
According to the method, an initial shallow neural network model is constructed and initial text data is imported to train it, exploiting the strong self-learning capacity of a shallow neural network. The resulting target shallow neural network model then enables rapid automatic processing of large amounts of data, saving manpower and material resources, avoiding manual errors, and improving the accuracy of the processing results. Data preprocessing is performed before the initial text data is imported, and the text data to be labeled generated by preprocessing is labeled to produce pre-training text data for training. Because the pre-training text data contains the result of the labeling processing, it can be used to train the initial shallow neural network model, so that the resulting target shallow neural network model also acquires the labeling capability. After the preprocessed text data is imported, the target shallow neural network model processes it according to what it learned during training, yielding a label prediction result for the preprocessed text data; punctuation marks are then added to the preprocessed text data according to the label prediction result, generating target text data with complete punctuation.
Referring to fig. 2, fig. 2 is a sub-flow diagram illustrating a punctuation mark adding method according to an embodiment of the present invention.
Further, the acquiring the initial text data and performing data preprocessing on the initial text data to generate the text data to be labeled according to the present invention includes:
s201: sample text data is obtained.
Understandably, in order to improve the generalization capability of the target shallow neural network model, the sample text data contains e-commerce comment data from various fields and industries, which facilitates subsequent cross-field analysis of punctuation-free text. The sample text data may be acquired by uploading, downloading, and similar means. Generalization capability refers to a learning algorithm's adaptability to new samples: the purpose of learning is to capture the rules hidden behind the data, so that the trained network can also give appropriate output for data outside the training set that follows the same rules.
S202: and screening the sample text data according to a set requirement, wherein the screened sample text data is used as the initial text data.
Understandably, because the sources of the sample text data are broad, screening can be performed according to set requirements. For example, e-commerce comment data of eighteen brands across five fields (3C, household appliances, food, clothing, and skin care products) may be screened from massive sample text data and used as the initial text data.
S203: and acquiring the initial text data and performing data preprocessing.
Understandably, data preprocessing is performed, which can filter and discard useless data that affects the analysis speed and results.
S204: and taking the initial text data subjected to the data preprocessing as the text data to be annotated.
Referring to fig. 3, fig. 3 is a sub-flow diagram illustrating a punctuation mark adding method according to an embodiment of the present invention.
Further, the data preprocessing of the present invention includes:
s301: and screening to obtain initial text data which meets the requirement of the set data length and has punctuation marks.
Specifically, the filtering discards initial text data that meets the set data length requirement and has no punctuation marks.
S302: and screening to obtain initial text data containing Chinese data.
Specifically, the filtering discards initial text data that does not contain chinese data.
S303: and screening to obtain meaningful initial text data in the preprocessed text data.
Specifically, filtering discards meaningless initial text data in the preprocessed text data.
S304: and converting the expression modes and formats of punctuations/symbols/English/letters in the initial text data into uniform expression modes and formats according to the setting requirements.
Understandably, data preprocessing is performed, which can filter out useless initial text data that affects analysis speed and results.
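The filters of steps S301 to S304 can be sketched in code. The following is a minimal illustration, not the patent's implementation: the function names, the length thresholds, the punctuation set, and the full-width-to-half-width mapping are all assumptions chosen for the example (S303, discarding "meaningless" data, is a semantic judgment and is not automated here).

```python
import re

# Illustrative assumptions only: thresholds and character sets are examples.
MIN_LEN, MAX_LEN = 5, 200
PUNCT = set("，。！？、；：,.!?;:")
# S304: map common full-width punctuation to half-width so formats are uniform.
FULLWIDTH = {"，": ",", "。": ".", "！": "!", "？": "?", "；": ";", "：": ":"}

def has_chinese(text):
    # S302: at least one character in the CJK Unified Ideographs range.
    return re.search(r"[\u4e00-\u9fff]", text) is not None

def normalize(text):
    return "".join(FULLWIDTH.get(ch, ch) for ch in text)

def preprocess(samples):
    kept = []
    for s in samples:
        if not (MIN_LEN <= len(s) <= MAX_LEN):   # S301: length requirement
            continue
        if not any(ch in PUNCT for ch in s):     # S301: must contain punctuation
            continue
        if not has_chinese(s):                   # S302: must contain Chinese
            continue
        kept.append(normalize(s))                # S304: unify formats
    return kept
```

For instance, a short fragment, a Chinese review with punctuation, and an English-only review would be reduced to just the normalized Chinese review.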
Referring to fig. 4, fig. 4 is a sub-flow diagram illustrating a punctuation mark adding method according to an embodiment of the present invention.
Further, the labeling the text data to be labeled according to the present invention, and generating the pre-training text data includes:
s401: and acquiring text data to be marked.
Understandably, the text data to be labeled is subjected to data preprocessing and can be directly used for labeling processing.
S402: and segmenting the text data to be labeled according to a set requirement to generate fragment text data, labeling the first character of the fragment text data with a B label, labeling other characters of the fragment text data with an O label, and generating BO label text data.
Understandably, based on named entity recognition technology, the invention improves the BIO labeling scheme used in conventional named entity labeling and defines a text labeling scheme comprising a B label and an O label: the B label marks the first character of a text segment, and the O label marks every other character. First, the text data to be labeled is divided into fragment text data at its punctuation marks, so that each fragment is a complete piece of text. Then the first character of each fragment is labeled with a B label and the other characters of each fragment with O labels. All the labeled fragments are integrated and combined to generate the BO label text data.
S403: and taking the BO label text data as the pre-training text data.
Referring to fig. 5, fig. 5 is a sub-flow diagram illustrating a punctuation mark adding method according to an embodiment of the present invention.
Further, the constructing an initial shallow neural network model, mapping the pre-training text data to the initial shallow neural network model for training, and obtaining a target shallow neural network model includes:
s501: constructing a BilSTM + CRF network model capable of carrying out named entity identification, and taking the BiLSTM + CRF network model as an initial shallow neural network model.
Understandably, named entity recognition is a processing task in natural language processing: identifying named entities in text, which lays the foundation for tasks such as relationship extraction. In particular, it identifies various types of entities across fields and industries.
S502: and acquiring the pre-training text data, and mapping the pre-training text data to the BiLSTM+CRF network model for training.
S503: and taking the trained BiLSTM+CRF network model as the target shallow neural network model.
Understandably, the pre-training text data contains the labeling results, so after training the BiLSTM+CRF network model acquires the corresponding labeling capability; it can therefore process text data from multiple fields and industries, achieving cross-field application.
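The patent does not detail the internals of the BiLSTM+CRF model. As a hedged illustration of what the CRF layer contributes at prediction time, the sketch below performs Viterbi decoding over the two tags B and O: it selects the tag sequence maximizing the sum of per-character emission scores (which in the real model would come from the BiLSTM) and tag-to-tag transition scores. The function name and the toy score values in the usage note are assumptions, not the patent's parameters.

```python
def viterbi(emissions, transitions, tags=("B", "O")):
    """CRF-style Viterbi decoding: emissions is a list of {tag: score} dicts,
    one per character; transitions maps (prev_tag, tag) to a score. Returns
    the highest-scoring tag sequence."""
    n = len(emissions)
    # score[t]: best total score of any path ending in tag t at the current step
    score = {t: emissions[0][t] for t in tags}
    back = []  # back[i][t]: best predecessor tag of t at step i+1
    for i in range(1, n):
        new_score, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            new_score[t] = score[prev] + transitions[(prev, t)] + emissions[i][t]
            ptr[t] = prev
        score, back = new_score, back + [ptr]
    best = max(tags, key=lambda t: score[t])
    path = [best]
    for ptr in reversed(back):       # trace the best path backwards
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With toy emission scores favoring B at positions 0 and 2 and a transition penalty on B following B, the decoder returns B-O-B, i.e. two segments.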
Referring to fig. 6, fig. 6 is a sub-flow diagram illustrating a punctuation mark adding method according to an embodiment of the present invention.
Further, the obtaining of the preprocessed text data and mapping the preprocessed text data to the target shallow neural network model for tag prediction to obtain a tag prediction result includes:
s601: input text data is acquired.
Understandably, the input text data includes a large amount of e-commerce comment data of various fields, industries and brands, and is text data without punctuation marks.
S602: and processing the input text data, and taking the processed input text data as the preprocessed text data.
Understandably, the processing of the input text data includes cleaning, filtering, and screening out the text data meeting the set requirements for the next task.
S603: and mapping the preprocessed text data into the target shallow neural network model.
S604: and performing label prediction on characters contained in the preprocessed text data by using the target shallow neural network model to generate labeled preprocessed text data, and taking the labeled preprocessed text data as the label prediction result.
Understandably, after the preprocessed text data is mapped into the target shallow neural network model, the model predicts the B label or O label corresponding to each character according to its earlier training, generating predicted label text data with BO labels. Following the labeling scheme described above, any character carrying a B label is identified as the first character of a segment, so a punctuation mark can be added between that character and the preceding character. Once punctuation marks have been added before all B-labeled first characters (other than at the start of the text), the process is complete, and target text data with complete punctuation, usable for analysis, is generated.
Referring to fig. 7, fig. 7 is a sub-flow diagram illustrating a punctuation mark adding method according to an embodiment of the present invention.
Further, the adding punctuation marks to the preprocessed text data according to the label prediction result to generate the target text data in the present invention includes:
s701: and acquiring the pre-processing text data with the label.
S702: and judging the label type of the characters contained in the labeled preprocessed text data.
Understandably, the tag types include B tags and O tags.
S703: and if the label type of the character is judged to be a B label, adding punctuation marks at the position between the character and the previous character.
Step S703 further includes: if the label type is a B label, determining whether a character precedes the B-labeled character in the predicted label text data; if a preceding character exists, adding a punctuation mark between the two characters, and if not, adding no punctuation mark.
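Steps S701 to S703 reduce to a simple reconstruction rule, sketched below. The function name and the choice of a single default mark are illustrative assumptions; the patent describes only where a mark is inserted, not which mark is chosen.

```python
def add_punctuation(chars, tags, mark="，"):
    """S701-S703: given characters and their predicted BO tags, insert a
    punctuation mark before every B-labeled character, except when no
    character precedes it (i.e. at the start of the text)."""
    out = []
    for i, (ch, tag) in enumerate(zip(chars, tags)):
        if tag == "B" and i > 0:   # B label with a preceding character
            out.append(mark)
        out.append(ch)
    return "".join(out)
```

For example, characters 很好快 with predicted tags B-O-B are reconstructed as 很好，快: the leading B gets no mark, while the second B marks a segment boundary.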
The invention provides a punctuation mark adding method that adds punctuation marks to cross-field data based on named entity recognition technology. Labeling is automatic, with no manual participation in the punctuation adding process, and by extracting data of different brands, fields, and platforms as training data in the data selection stage, the method solves the problem of cross-field, cross-platform punctuation addition. In addition, the network model used in the invention is the widely used BiLSTM+CRF structure, whose number of layers and parameters is much smaller than that of the conventional BERT model, making it more practical for online real-time tasks while still achieving high accuracy.
Referring to fig. 8, fig. 8 is a schematic block diagram of an electronic device according to an embodiment of the invention.
The invention also provides an electronic device which comprises a processor 801, a communication interface 802, a memory 803 and a communication bus 804, wherein the processor 801, the communication interface 802 and the memory 803 are communicated with each other through the communication bus 804.
A memory 803 for storing a computer program.
The processor 801 is configured to implement the steps of the punctuation mark adding method as described above when executing the program stored in the memory.
Referring to fig. 9, fig. 9 is a schematic block diagram of a punctuation mark adding device according to an embodiment of the present invention.
The present invention also provides a punctuation mark adding device 900, comprising a unit for executing the punctuation mark adding method as described above, including:
a constructing unit 901, configured to construct an initial shallow neural network model;
a first obtaining unit 902, configured to obtain initial text data, perform data preprocessing on the initial text data, and generate text data to be annotated;
a labeling unit 903, configured to label the text data to be labeled, and generate labeled pre-training text data;
a mapping unit 904, configured to map the pre-training text data to the initial shallow neural network model for training, so as to obtain a target shallow neural network model;
a second obtaining unit 905, configured to obtain preprocessed text data, and map the preprocessed text data to the target shallow neural network model for performing label prediction to obtain a label prediction result.
An adding unit 906, configured to add punctuation marks to the preprocessed text data according to the label prediction result, so as to generate target text data.
In an embodiment, the obtaining initial text data and performing data preprocessing on the initial text data to generate text data to be labeled includes:
acquiring sample text data;
screening the sample text data according to a set requirement, wherein the screened sample text data is used as the initial text data;
acquiring the initial text data and performing data preprocessing;
and taking the initial text data subjected to the data preprocessing as the text data to be annotated.
In one embodiment, the data preprocessing comprises:
screening to retain initial text data that meets the set data length requirement and contains punctuation marks;
screening to retain initial text data containing Chinese data;
screening to retain meaningful initial text data;
and converting the expression modes and formats of punctuation marks, symbols, and English letters in the initial text data into uniform expression modes and formats according to the set requirements.
In an embodiment, the labeling the text data to be labeled to generate pre-training text data includes:
acquiring text data to be labeled;
segmenting the text data to be labeled according to a set requirement to generate fragment text data, labeling the first character of the fragment text data with a B label, labeling other characters of the fragment text data with an O label, and generating BO label text data;
and taking the BO label text data as the pre-training text data.
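As a non-limiting illustration, the BO labeling step above may be sketched as follows. The delimiter set and function name are assumptions introduced for the example; the disclosure only requires that the first character of each fragment receive a B label and all other characters an O label:

```python
import re

# Hypothetical delimiter set standing in for the set segmentation requirement:
# training text is cut into fragments at punctuation marks.
SPLIT_PUNCT = "，。！？；：,.!?;:"

def bo_label(text: str):
    """BO labeling sketch: for each punctuation-delimited fragment, the
    first character gets 'B' and every other character gets 'O'.
    The punctuation itself is dropped, so a 'B' marks a character that
    was preceded by a punctuation mark in the original text."""
    chars, labels = [], []
    fragments = [f for f in re.split(f"[{re.escape(SPLIT_PUNCT)}]", text) if f]
    for frag in fragments:
        for i, ch in enumerate(frag):
            chars.append(ch)
            labels.append("B" if i == 0 else "O")
    return chars, labels
```

For example, the punctuated sentence "你好，世界。" yields the character sequence 你/好/世/界 with labels B/O/B/O, which is the supervision signal for the shallow network.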
In an embodiment, the constructing an initial shallow neural network model, and mapping the pre-training text data to the initial shallow neural network model for training to obtain a target shallow neural network model includes:
constructing a BiLSTM + CRF network model capable of performing named entity recognition, and taking the BiLSTM + CRF network model as the initial shallow neural network model;
acquiring the pre-training text data, and mapping the pre-training text data to the BiLSTM + CRF network model for training;
and taking the trained BiLSTM + CRF network model as the target shallow neural network model.
In an embodiment, the obtaining of the preprocessed text data and mapping the preprocessed text data to the target shallow neural network model for label prediction to obtain a label prediction result includes:
acquiring input text data;
processing the input text data, and taking the processed input text data as the preprocessed text data;
mapping the preprocessed text data into the target shallow neural network model;
and performing label prediction on the characters contained in the preprocessed text data by using the target shallow neural network model to generate labeled preprocessed text data, and taking the labeled preprocessed text data as the label prediction result.
In an embodiment, the adding punctuation marks to the preprocessed text data according to the label prediction result, and generating target text data includes:
acquiring the labeled preprocessed text data;
judging the label type of each character contained in the labeled preprocessed text data;
and if the label type of a character is judged to be a B label, adding a punctuation mark between the character and the preceding character.
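As a non-limiting illustration, the insertion step above may be sketched as follows. Inserting a single fixed mark is a simplification assumed for the example; the described method could equally map distinct labels to distinct punctuation types:

```python
def add_punctuation(chars, labels, mark="，"):
    """Restoration sketch: insert a punctuation mark between a character
    labeled 'B' and the preceding character. The very first character is
    a fragment start by definition, so nothing is inserted before it."""
    out = []
    for i, (ch, label) in enumerate(zip(chars, labels)):
        if label == "B" and i > 0:
            out.append(mark)
        out.append(ch)
    return "".join(out)
```

Applied to the predicted sequence 你/好/世/界 with labels B/O/B/O, this yields the punctuated target text "你好，世界", inverting the BO labeling used at training time.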
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the punctuation mark adding method as described above.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and the various embodiments or examples described in this specification can be combined by those skilled in the art.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, while the invention has been described with respect to the above-described embodiments, it will be understood that the invention is not limited thereto but may be embodied with various modifications and changes.
While the invention has been described with reference to specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A punctuation mark adding method, the method comprising:
constructing an initial shallow neural network model;
acquiring initial text data, and performing data preprocessing on the initial text data to generate text data to be labeled;
labeling the text data to be labeled to generate pre-training text data with labels;
mapping the pre-training text data to the initial shallow neural network model for training to obtain a target shallow neural network model;
acquiring preprocessed text data, and mapping the preprocessed text data to the target shallow neural network model for label prediction to obtain a label prediction result;
and adding punctuation marks to the preprocessed text data according to the label prediction result to generate target text data.
2. The punctuation mark adding method of claim 1, wherein the obtaining initial text data, performing data preprocessing on the initial text data, and generating text data to be labeled comprises:
acquiring sample text data;
screening the sample text data according to a set requirement, wherein the screened sample text data is used as the initial text data;
acquiring the initial text data and performing data preprocessing;
and taking the initial text data subjected to the data preprocessing as the text data to be labeled.
3. The punctuation mark adding method of claim 2 wherein said data preprocessing comprises:
screening to obtain initial text data which meets the requirement of the set data length and has punctuation marks;
screening to obtain initial text data containing Chinese data;
screening to obtain meaningful initial text data;
and converting the representation and format of punctuation marks, symbols, and English letters in the initial text data into a uniform representation and format according to the set requirements.
4. The punctuation mark adding method of claim 1, wherein the labeling the text data to be labeled to generate pre-training text data comprises:
acquiring text data to be labeled;
segmenting the text data to be labeled according to a set requirement to generate fragment text data, labeling the first character of the fragment text data with a B label, labeling other characters of the fragment text data with an O label, and generating BO label text data;
and taking the BO label text data as the pre-training text data.
5. The punctuation mark adding method of claim 4, wherein the mapping the pre-training text data to the initial shallow neural network model for training to obtain a target shallow neural network model comprises:
constructing a BiLSTM + CRF network model capable of performing named entity recognition, and taking the BiLSTM + CRF network model as the initial shallow neural network model;
acquiring the pre-training text data, and mapping the pre-training text data to the BiLSTM + CRF network model for training;
and taking the trained BiLSTM + CRF network model as the target shallow neural network model.
6. The punctuation mark adding method of claim 5, wherein the obtaining of the preprocessed text data and mapping the preprocessed text data into the target shallow neural network model for label prediction to obtain a label prediction result comprises:
acquiring input text data;
processing the input text data, and taking the processed input text data as the preprocessed text data;
mapping the preprocessed text data into the target shallow neural network model;
and performing label prediction on the characters contained in the preprocessed text data by using the target shallow neural network model to generate labeled preprocessed text data, and taking the labeled preprocessed text data as the label prediction result.
7. The punctuation mark adding method of claim 6, wherein the adding punctuation marks to the preprocessed text data according to the label prediction result, and generating target text data comprises:
acquiring the labeled preprocessed text data;
judging the label type of each character contained in the labeled preprocessed text data;
and if the label type of a character is judged to be a B label, adding a punctuation mark between the character and the preceding character.
8. A punctuation mark adding apparatus comprising means for performing the punctuation mark adding method according to any one of claims 1 to 7.
9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the punctuation mark adding method of any one of claims 1 to 7 when executing a program stored on a memory.
10. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the punctuation mark adding method according to any one of claims 1 to 7.
CN202111424072.5A 2021-11-26 2021-11-26 Punctuation mark adding method and device, electronic equipment and storage medium Pending CN114065709A (en)


Publications (1)

Publication Number Publication Date
CN114065709A true CN114065709A (en) 2022-02-18


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674629A (en) * 2019-09-27 2020-01-10 上海智臻智能网络科技股份有限公司 Punctuation mark model and its training method, equipment and storage medium
CN111709242A (en) * 2020-06-01 2020-09-25 广州多益网络股份有限公司 Chinese punctuation mark adding method based on named entity recognition
CN112580326A (en) * 2019-09-27 2021-03-30 上海智臻智能网络科技股份有限公司 Punctuation mark model and training system thereof
CN113449489A (en) * 2021-07-22 2021-09-28 深圳追一科技有限公司 Punctuation mark marking method, punctuation mark marking device, computer equipment and storage medium
US20210365632A1 (en) * 2020-05-19 2021-11-25 International Business Machines Corporation Text autocomplete using punctuation marks



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 518057 401, block a, sharing building, No. 78, Keyuan North Road, songpingshan community, Xili street, Nanshan District, Shenzhen, Guangdong

Applicant after: Shenzhen Shukuo Information Technology Co.,Ltd.

Address before: 518057 401, block a, sharing building, No. 78, Keyuan North Road, songpingshan community, Xili street, Nanshan District, Shenzhen, Guangdong

Applicant before: SHENZHEN SKIEER INFORMATION TECHNOLOGY CO.,LTD.

RJ01 Rejection of invention patent application after publication

Application publication date: 20220218