CN110532391B - Text part-of-speech tagging method and device

Info

Publication number: CN110532391B
Application number: CN201910817945.5A
Authority: CN
Inventors: 李金锋; 杨绳春; 洪文龙
Original assignee: Wangsu Science and Technology Co Ltd
Current assignee: Wangsu Science and Technology Co Ltd
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2022-07-05
Anticipated expiration: 2039-08-30
Also published as: CN110532391A

Abstract

The invention discloses a method and a device for text part-of-speech tagging. The method comprises: determining the part of speech set by a user; acquiring a first class of words selected by the user from a sentence; dividing the sentence into a plurality of language segments according to the selected first class of words and storing the segments; and labeling the part of speech of the selected first class of words as the part of speech set by the user and displaying it. Because the words the user selects from the sentence are labeled with the part of speech the user has set, words sharing a part of speech can be labeled quickly in one pass, which effectively improves labeling efficiency. Because the sentence is divided into language segments according to the first class of words before being stored, the order of the segments within the sentence is preserved. Because the part of speech of the selected words is displayed, the labeling is visualized and labeling errors are easy to spot.

Description

Text part-of-speech tagging method and device

Technical Field

The embodiment of the invention relates to the technical field of machine learning, in particular to a method and a device for text part-of-speech tagging.

Background

Machine Learning (ML) is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills, and how it can reorganize its existing knowledge structures to continuously improve its own performance.

When a machine is trained, manual part-of-speech tagging of important texts is often needed to improve the accuracy of language processing. Traditional tools simply present a sentence and have the annotator manually type in the relevant words and label them. This is inefficient, the labeled words are stored without order, and if a word appears twice in succession in a sentence with different parts of speech, the two occurrences cannot be distinguished.

Disclosure of Invention

The embodiment of the invention provides a method and a device for text part-of-speech tagging, which are used for improving the efficiency of part-of-speech tagging.

In a first aspect, an embodiment of the present invention provides a method for text part-of-speech tagging, including:

determining the part of speech set by a user;

acquiring a first class of words selected from a sentence by the user;

and dividing the sentence into a plurality of language segments according to the selected first class of words and storing the segments, labeling the part of speech of the selected first class of words as the part of speech set by the user, and displaying the part of speech.

In this technical solution, the first class of words selected from the sentence by the user is labeled with the part of speech set by the user, so words sharing a part of speech can be labeled quickly in one pass, which effectively improves labeling efficiency. The sentence is divided into a plurality of language segments according to the first class of words before being stored, so the order of the segments within the sentence is preserved. The part of speech of the selected first class of words is displayed, so the labeling is visualized and labeling errors are easy to spot.

Optionally, after the part of speech of the selected first type of word is marked as the part of speech set by the user and displayed, the method further includes:

acquiring the part of speech modified by the user and a second class of words selected by the user;

and dividing the language segment containing the second class of words into a plurality of language segments according to the second class of words and storing them, and labeling the part of speech of the second class of words as the part of speech modified by the user.

In this technical solution, the part of speech modified by the user is obtained and used to label the second class of words, so the set part of speech can be changed quickly and words with different parts of speech can be labeled.

Optionally, dividing the sentence into a plurality of language segments according to the selected first class of words for storage, and labeling and displaying the part of speech of the selected first class of words as the part of speech set by the user, includes:

dividing the sentence into a plurality of language segments with the selected first class of words as the dividing line, and sorting and storing the segments;

and labeling the part of speech of the selected first class of words as the part of speech set by the user, and displaying the labeled part of speech in the sentence.

In this technical solution, the first class of words serves as the dividing line and the sentence is divided into a plurality of language segments that are sorted and stored, so the segments keep their order within the sentence and the accuracy of part-of-speech tagging is improved.

Optionally, after the part of speech of the selected first type word is labeled as the part of speech set by the user and displayed, the method further includes:

setting words marked as the part of speech set by the user as the same background color;

wherein, the background colors corresponding to the words with different parts of speech are different.

In the technical scheme, the background color can be set after the part of speech is labeled so as to distinguish words with different parts of speech.

Optionally, the method further includes:

acquiring words with marked parts of speech clicked by a user;

and changing the part of speech of the clicked word to unclassified, determining whether the words adjacent to it are unclassified, and if so, merging the word with the adjacent unclassified words and storing the result.

In this technical solution, a word whose part-of-speech label has been removed is merged with adjacent unclassified words, so that order is preserved.

Optionally, the part of speech includes, but is not limited to, unclassified, verb, noun, pronoun, adjective, numeral, quantifier, or stop word;

wherein words whose part of speech is unclassified do not display a part of speech.

In a second aspect, an embodiment of the present invention provides an apparatus for text part-of-speech tagging, including:

a determining unit, configured to determine a part of speech set by a user;

an obtaining unit, configured to obtain a first class of words selected from a sentence by the user;

and a processing unit, configured to divide the sentence into a plurality of language segments according to the selected first class of words and store the segments, label the part of speech of the selected first class of words as the part of speech set by the user, and display the part of speech.

Optionally, the processing unit is further configured to:

after the part of speech of the selected first class of words is labeled as the part of speech set by the user and displayed, controlling the obtaining unit to obtain the part of speech modified by the user and a second class of words selected by the user;

Optionally, the processing unit is specifically configured to:

Optionally, the processing unit is further configured to:

after the part of speech of the selected first class of words is marked as the part of speech set by the user and displayed, setting the words marked as the part of speech set by the user as the same background color;

Optionally, the processing unit is further configured to:

controlling the obtaining unit to obtain a labeled word clicked by the user;

wherein words whose part of speech is unclassified do not display a part of speech.

In a third aspect, an embodiment of the present invention further provides a computing device, including:

a memory for storing program instructions;

and a processor for calling the program instructions stored in the memory and executing the above method for text part-of-speech tagging according to the obtained program.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable non-volatile storage medium, which includes computer-readable instructions, and when the computer-readable instructions are read and executed by a computer, the computer is caused to perform the above method for text part-of-speech tagging.

Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention;

Fig. 2 is a flowchart illustrating a method for text part-of-speech tagging according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of text part-of-speech tagging according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of text part-of-speech tagging according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of text part-of-speech tagging according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of text part-of-speech tagging according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a device for text part-of-speech tagging according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 illustrates an exemplary system architecture to which embodiments of the present invention are applicable. The system architecture may be a server 100 that includes a processor 110, a communication interface 120, and a memory 130.

The communication interface 120 is used for communicating with a terminal device, sending and receiving the information exchanged with the terminal device.

The processor 110 is the control center of the server 100. It connects the various parts of the server 100 through various interfaces and lines, and performs the functions of the server 100 and processes data by running or executing the software programs and/or modules stored in the memory 130 and calling the data stored in the memory 130. Optionally, the processor 110 may include one or more processing units.

The memory 130 may be used to store software programs and modules, and the processor 110 executes various functional applications and data processing by operating the software programs and modules stored in the memory 130. The memory 130 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to a business process, etc. Further, the memory 130 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.

It should be noted that the structure shown in fig. 1 is only an example, and the embodiment of the present invention is not limited thereto.

Based on the above description, fig. 2 exemplarily shows a flow of a method for text part-of-speech tagging provided by the embodiment of the present invention, where the flow may be performed by a device for text part-of-speech tagging, and the device may be located in the server 100 shown in fig. 1, or may be the server 100.

As shown in fig. 2, the process specifically includes:

Step 201, determining the part of speech set by the user.

Before labeling words in a sentence, the user needs to set the part of speech to be applied. In the embodiment of the present invention, the parts of speech may include, but are not limited to, unclassified, verb, noun, pronoun, adjective, numeral, quantifier, and stop word, and entries may be added or removed as needed in a specific application. Words whose part of speech is unclassified display no part of speech, and all words in a newly loaded sentence are unclassified. As shown in Fig. 3, the parts of speech that the user can set include verb, noun, adjective, and stop word. After the file is loaded, the part of speech currently set is verb.
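The part-of-speech set and the initial state after loading can be sketched as follows. This is a minimal illustration only: the names `PartOfSpeech`, `display_label`, and `initial_state` are hypothetical, and the choice of verb as the default simply mirrors the Fig. 3 example.

```python
from enum import Enum


class PartOfSpeech(Enum):
    """Parts of speech listed in the description; entries can be added or removed."""
    UNCLASSIFIED = "unclassified"
    VERB = "verb"
    NOUN = "noun"
    PRONOUN = "pronoun"
    ADJECTIVE = "adjective"
    NUMERAL = "numeral"
    QUANTIFIER = "quantifier"
    STOP_WORD = "stop word"


def display_label(pos):
    """Return the text shown next to a word; unclassified words show nothing."""
    return "" if pos is PartOfSpeech.UNCLASSIFIED else pos.value


def initial_state(sentence, default=PartOfSpeech.VERB):
    """State right after loading: one unclassified record and a default part of speech."""
    return {"type": default, "segments": [(sentence, PartOfSpeech.UNCLASSIFIED)]}


state = initial_state("Some sentence to label.")
print(state["type"].value, repr(display_label(PartOfSpeech.UNCLASSIFIED)))
```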

Step 202, acquiring a first word selected by a user from a sentence.

When a user needs to label a word, the word must first be selected, generally by dragging the mouse across it. In a specific implementation, the first class of words selected by the user from the sentence can be captured through a click() function.

Step 203, dividing the sentence into a plurality of language segments for storage according to the selected first class of words, and marking the part of speech of the selected first class of words as the part of speech set by the user and displaying the part of speech.

After the first class of words selected by the user is obtained, the sentence is divided into a plurality of language segments with the selected words as the dividing line, and the segments are sorted and stored. The part of speech of the selected words is then labeled as the part of speech set by the user, and the label is displayed in the sentence. As shown in Fig. 4, the part of speech currently set by the user is verb; it can be saved in a variable Type, so that the current value of Type is "verb". Immediately after a whole sentence is loaded, there is a single record whose part of speech is "unclassified"; the initial unclassified sentence can be stored as shown in Table 1. Suppose the first class of words selected by the user is "apply for". The record containing "apply for", serial number 1, is found first. With "apply for" as the dividing line, the text content of serial number 1 in Table 1 is divided into three segments (an empty left or right side produces no segment): "At one time, is it possible to", "apply for", and "multiple VPSs? What is specifically done?". Each segment is assigned a serial number, as shown in Table 2.

TABLE 1

Serial number	Text content	Part of speech
1	At one time, is it possible to apply for multiple VPSs? What is specifically done?	Unclassified

TABLE 2

Serial number	Text content	Part of speech
1	At one time, is it possible to	Unclassified
2	apply for	Verb
3	multiple VPSs? What is specifically done?	Unclassified
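The step from Table 1 to Table 2 can be sketched as a small function that splits one unclassified segment around the selected range. This is an illustrative sketch, not the patented implementation: `split_segment` is a hypothetical name, the sentence is an English rendering of the example, and trimming whitespace around each piece is an assumption that suits English text.

```python
UNCLASSIFIED = "unclassified"


def split_segment(text, start, end, pos):
    """Split one unclassified segment around text[start:end].

    Returns (text, part_of_speech) pairs in their original order. An empty
    left or right remainder is dropped, so no empty segment is stored.
    """
    if not 0 <= start < end <= len(text):
        raise ValueError("selection must be a non-empty range inside the segment")
    word = text[start:end].strip()
    if not word:
        raise ValueError("selection contains only whitespace")
    pieces = [
        (text[:start].strip(), UNCLASSIFIED),  # left remainder
        (word, pos),                           # the selected words, now labeled
        (text[end:].strip(), UNCLASSIFIED),    # right remainder
    ]
    return [(t, p) for t, p in pieces if t]


sentence = "At one time, is it possible to apply for multiple VPSs? What is specifically done?"
start = sentence.index("apply for")
rows = split_segment(sentence, start, start + len("apply for"), "verb")
for serial, (text, pos) in enumerate(rows, start=1):
    print(serial, text, pos, sep="\t")
```

Working from selection offsets rather than from the typed word is what lets two occurrences of the same word receive different labels, since each occurrence has its own position in the sentence.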

To label words with different parts of speech, the part of speech modified by the user and a second class of words selected by the user can then be obtained. The language segment containing the second class of words is divided into a plurality of language segments according to those words and stored, and the part of speech of the second class of words is labeled as the part of speech modified by the user.

For example, as shown in Fig. 5, the user changes the part of speech to adjective, so the value of the Type variable becomes "adjective". The second class of words selected by the user is "is it possible to". From Table 2, the segment containing "is it possible to" has serial number 1. With "is it possible to" as the dividing line, that segment is divided into "At one time," and "is it possible to", which are sorted and stored in the order in which they occur in the original sentence, and the part of speech of "is it possible to" is labeled as adjective, as shown in Table 3.

TABLE 3

Serial number	Text content	Part of speech
1	At one time,	Unclassified
2	is it possible to	Adjective
3	apply for	Verb
4	multiple VPSs? What is specifically done?	Unclassified
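The full sequence from Table 1 to Table 3 can be sketched as a small stateful object that keeps the ordered segment list together with the current `Type` value. The class and method names below are hypothetical, the sentence is an English rendering of the example, and the word is located with `str.partition` purely for brevity; an interactive tool would more likely use the selection offsets so that repeated words are unambiguous.

```python
UNCLASSIFIED = "unclassified"


class LabelingSession:
    """Ordered segment list plus the currently set part of speech (the Type variable)."""

    def __init__(self, sentence, initial_type="verb"):
        self.segments = [(sentence, UNCLASSIFIED)]  # one unclassified record after loading
        self.current_type = initial_type

    def set_type(self, pos):
        """The user changes the part of speech to apply from now on."""
        self.current_type = pos

    def select(self, serial, word):
        """Label `word` inside segment `serial` and split that segment around it."""
        if not 1 <= serial <= len(self.segments):
            raise IndexError("no segment with that serial number")
        text, pos = self.segments[serial - 1]
        if pos != UNCLASSIFIED:
            raise ValueError("only unclassified text can be selected")
        if not word:
            raise ValueError("empty selection")
        left, found, right = text.partition(word)  # first occurrence only
        if not found:
            raise ValueError("word not found in segment")
        parts = [(left.strip(), UNCLASSIFIED), (word, self.current_type),
                 (right.strip(), UNCLASSIFIED)]
        # Replace the one old record with the non-empty pieces, keeping sentence order.
        self.segments[serial - 1:serial] = [p for p in parts if p[0]]

    def table(self):
        """Rows of (serial number, text content, part of speech)."""
        return [(n, t, p) for n, (t, p) in enumerate(self.segments, start=1)]


session = LabelingSession(
    "At one time, is it possible to apply for multiple VPSs? What is specifically done?")
session.select(1, "apply for")          # Type is "verb"
session.set_type("adjective")
session.select(1, "is it possible to")  # only segment 1 is split again
for row in session.table():
    print(*row, sep="\t")
```

Serial numbers are derived from list position rather than stored, so they stay consecutive after every split.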

Further, the label of a word whose part of speech has been labeled can be removed. Specifically, the labeled word clicked by the user is obtained, its part of speech is changed to unclassified, and it is merged with adjacent unclassified words or segments and stored.

It should be noted that within a sentence only text whose part of speech is unclassified can be selected; words that already carry a label cannot. Therefore, to remove the label of a word, the user only needs to click anywhere within that word, and the whole word is obtained.
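One way to picture this rule is a hit test that maps a click position to a segment and then decides which action is allowed. The sketch below is an assumption-laden illustration: `locate` and `on_click` are hypothetical names, the click is modeled as a character offset, and segments are assumed to be displayed joined by single spaces.

```python
UNCLASSIFIED = "unclassified"


def locate(segments, offset):
    """Return the 1-based serial number of the segment shown at `offset`, or None."""
    cursor = 0
    for serial, (text, _pos) in enumerate(segments, start=1):
        if cursor <= offset < cursor + len(text):
            return serial
        cursor += len(text) + 1  # skip the separating space between segments
    return None


def on_click(segments, offset):
    """Decide what a click means: free selection or removal of a whole label."""
    serial = locate(segments, offset)
    if serial is None:
        return None
    _text, pos = segments[serial - 1]
    if pos == UNCLASSIFIED:
        return ("select", serial)   # the user may select words inside this segment
    return ("unlabel", serial)      # any click inside a labeled word targets all of it


segments = [
    ("At one time,", UNCLASSIFIED),
    ("is it possible to", "adjective"),
    ("apply for", "verb"),
    ("multiple VPSs? What is specifically done?", UNCLASSIFIED),
]
print(on_click(segments, 20))  # a click inside the adjective segment
```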

For example, to cancel the label of the adjective with serial number 2 in Table 3, the user clicks anywhere in the region that word occupies, and the clicked labeled word is obtained. Its part of speech is changed to unclassified, and the adjacent segments are checked; any adjacent segment that is also unclassified can be merged with it. Table 3 shows that the segment with serial number 1 is unclassified, so the two are merged, and the list shown in Table 2 is finally obtained.
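The removal and merge step can be sketched as below. The function name `remove_label` is hypothetical, the segment texts are an English rendering of Table 3, and joining merged pieces with a single space is an assumption made for English text.

```python
UNCLASSIFIED = "unclassified"


def remove_label(segments, serial):
    """Reset segment `serial` to unclassified and merge it with unclassified neighbours.

    `segments` is an ordered list of (text, part_of_speech) pairs; a new list
    is returned. Because two unclassified segments are never left adjacent,
    checking one neighbour on each side is enough.
    """
    if not 1 <= serial <= len(segments):
        raise IndexError("no segment with that serial number")
    idx = serial - 1
    lo = idx - 1 if idx > 0 and segments[idx - 1][1] == UNCLASSIFIED else idx
    hi = idx + 1 if idx + 1 < len(segments) and segments[idx + 1][1] == UNCLASSIFIED else idx
    merged = " ".join(text for text, _ in segments[lo:hi + 1])
    return segments[:lo] + [(merged, UNCLASSIFIED)] + segments[hi + 1:]


table3 = [
    ("At one time,", UNCLASSIFIED),
    ("is it possible to", "adjective"),
    ("apply for", "verb"),
    ("multiple VPSs? What is specifically done?", UNCLASSIFIED),
]
table2 = remove_label(table3, 2)
for serial, (text, pos) in enumerate(table2, start=1):
    print(serial, text, pos, sep="\t")
```

Here the segment after the adjective is a verb, so only the left neighbour is merged and the verb record is left untouched.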

Optionally, after each word is labeled, all words labeled with the same part of speech may be given the same background color, with different parts of speech using different background colors, as shown in Fig. 6.
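A possible rendering of the background colors is sketched below as an HTML string builder. Everything specific here is an assumption for illustration: the color values, the `BACKGROUND` table, the `render_html` name, and the use of HTML `span` elements are not taken from the source.

```python
from html import escape

UNCLASSIFIED = "unclassified"

# Hypothetical palette: one distinct background color per part of speech.
BACKGROUND = {
    "verb": "#ffd54f",
    "noun": "#81d4fa",
    "adjective": "#a5d6a7",
    "stop word": "#e0e0e0",
}
assert len(set(BACKGROUND.values())) == len(BACKGROUND)  # colors must differ


def render_html(segments):
    """Render (text, part_of_speech) pairs; unclassified text gets no highlight."""
    parts = []
    for text, pos in segments:
        if pos == UNCLASSIFIED:
            parts.append(escape(text))
        else:
            parts.append(
                f'<span style="background-color:{BACKGROUND[pos]}" '
                f'title="{escape(pos)}">{escape(text)}</span>'
            )
    return " ".join(parts)


html_line = render_html([("At one time,", UNCLASSIFIED), ("apply for", "verb")])
print(html_line)
```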

When all words which can be labeled in one sentence are labeled, the submit button can be clicked to label the next sentence.

In the above embodiment, the part of speech set by the user is determined, the first class of words selected by the user from the sentence is obtained, the sentence is divided into a plurality of language segments according to the selected first class of words and stored, and the part of speech of the selected first class of words is labeled as the part of speech set by the user and displayed. Labeling the words the user selects with the part of speech the user has set effectively improves labeling efficiency, and dividing the sentence into language segments according to the first class of words before storing it preserves the order of the data.

Based on the same technical concept, Fig. 7 exemplarily shows the structure of an apparatus for text part-of-speech tagging provided by an embodiment of the present invention. The apparatus can execute the text part-of-speech tagging process, and may be located in the server 100 shown in Fig. 1 or may be the server 100 itself.

As shown in fig. 7, the apparatus specifically includes:

a determining unit 701 configured to determine a part of speech set by a user;

an obtaining unit 702, configured to obtain a first type of word selected from a sentence by a user;

the processing unit 703 is configured to divide the sentence into a plurality of language segments according to the selected first-class word, store the language segments, label the part of speech of the selected first-class word as the part of speech set by the user, and display the part of speech.

Optionally, the processing unit 703 is further configured to:

after the part of speech of the selected first class of words is labeled as the part of speech set by the user and displayed, controlling the obtaining unit 702 to obtain the part of speech modified by the user and a second class of words selected by the user;

Optionally, the processing unit 703 is specifically configured to:

Optionally, the processing unit 703 is further configured to:

controlling the obtaining unit 702 to obtain a labeled word clicked by the user;

and changing the part of speech of that word to unclassified, and merging it with adjacent unclassified words or segments for storage.

Based on the same technical concept, an embodiment of the present invention further provides a computing device, including:

a memory for storing program instructions;

Based on the same technical concept, the embodiment of the invention also provides a computer-readable non-volatile storage medium, which comprises computer-readable instructions, and when the computer-readable instructions are read and executed by a computer, the computer is enabled to execute the method for text part-of-speech tagging.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for text part-of-speech tagging, comprising:

determining the part of speech set by a user;

acquiring a first class of words selected from a sentence by the user;

dividing the sentence into a plurality of language segments for storage according to the selected first class of words, marking the part of speech of the selected first class of words as the part of speech set by the user and displaying the part of speech;

after the part of speech of the selected first kind of word is marked as the part of speech set by the user and displayed, the method further comprises the following steps:

acquiring the part of speech modified by the user and a second class of words selected by the user, wherein the part of speech modified by the user is the part of speech obtained when the user changes the previously set part of speech;

2. The method of claim 1, wherein the dividing the sentence into a plurality of language segments for storage according to the selected first type of word, and labeling and displaying the part of speech of the selected first type of word as the part of speech set by the user comprises:

dividing the sentence into a plurality of language segments by taking the selected first class of words as a segmentation line, and sequencing and storing the language segments;

3. The method of claim 1, wherein after the part of speech of the selected first kind of word is marked as the part of speech set by the user and displayed, the method further comprises:

4. The method of claim 3, wherein the method further comprises:

acquiring words with marked parts of speech clicked by a user;

5. The method of any one of claims 1 to 4, wherein the part of speech comprises unclassified, verb, noun, pronoun, adjective, numeral, quantifier, or stop word;

6. An apparatus for part-of-speech tagging of text, comprising:

a determining unit configured to determine a part of speech set by a user;

the processing unit is used for dividing the sentence into a plurality of language segments for storage according to the selected first class of words, marking the part of speech of the selected first class of words as the part of speech set by the user and displaying the part of speech;

the processing unit is further to:

after the part of speech of the selected first class of words is labeled as the part of speech set by the user and displayed, controlling the obtaining unit to obtain the part of speech modified by the user and a second class of words selected by the user, wherein the part of speech modified by the user is the part of speech obtained when the user changes the previously set part of speech;

7. A computing device, comprising:

a memory for storing program instructions;

a processor for calling program instructions stored in said memory to execute the method of any one of claims 1 to 5 in accordance with the obtained program.

8. A computer-readable non-transitory storage medium including computer-readable instructions which, when read and executed by a computer, cause the computer to perform the method of any one of claims 1 to 5.