CN110532391A

CN110532391A - Method and device for text part-of-speech tagging

Info

Publication number: CN110532391A
Application number: CN201910817945.5A
Authority: CN
Inventors: 李金锋; 杨绳春; 洪文龙
Original assignee: Wangsu Science and Technology Co Ltd
Current assignee: Wangsu Science and Technology Co Ltd
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2019-12-03
Anticipated expiration: 2039-08-30
Also published as: CN110532391B

Abstract

The invention discloses a method and device for text part-of-speech tagging. The method includes: determining the part of speech set by a user; obtaining the first-type words that the user selects from a sentence; dividing the sentence into multiple segments according to the selected first-type words and storing them; and tagging the selected first-type words with the part of speech set by the user and displaying the result. Because the words the user selects from the sentence are tagged with the part of speech the user has set, words of the same part of speech can be tagged quickly in a single pass, which effectively improves tagging efficiency. Dividing the sentence into multiple segments according to the first-type words and storing them preserves the order of the segments within the sentence, and displaying the part of speech of the selected first-type words is intuitive and makes tagging errors easy to spot.

Description

Method and device for text part-of-speech tagging

Technical field

Embodiments of the present invention relate to the field of machine learning, and in particular to a method and device for text part-of-speech tagging.

Background art

Machine learning (ML) is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other subjects. It studies how computers can simulate or implement human learning behaviour in order to acquire new knowledge or skills, and how they can reorganize existing knowledge structures to continuously improve their own performance.

To improve the accuracy of language processing, machine training generally requires people to tag important text with parts of speech by hand. Traditional tools simply present a sentence and have the annotator type in the relevant words and tag them. This is inefficient, and the tagged words are stored without order: if a word appears twice in succession within a sentence with different parts of speech, the two occurrences cannot be told apart.

Summary of the invention

Embodiments of the present invention provide a method and device for text part-of-speech tagging, so as to improve the efficiency of part-of-speech tagging.

In a first aspect, an embodiment of the present invention provides a method for text part-of-speech tagging, comprising:

Determining the part of speech set by a user;

Obtaining the first-type words that the user selects from a sentence;

Dividing the sentence into multiple segments according to the selected first-type words and storing them, and tagging the selected first-type words with the part of speech set by the user and displaying the result.

In the above technical solution, the first-type words that the user selects from the sentence are tagged with the part of speech set by the user, so words of the same part of speech can be tagged quickly in a single pass, which effectively improves tagging efficiency. Dividing the sentence into multiple segments according to the first-type words and storing them preserves the order of the segments within the sentence, and displaying the part of speech of the selected first-type words is intuitive and makes tagging errors easy to spot.

Optionally, after the selected first-type words are tagged with the part of speech set by the user and displayed, the method further includes:

Obtaining the part of speech modified by the user and the second-type words selected by the user;

Dividing the segment containing the second-type words into multiple segments according to the second-type words and storing them, and tagging the second-type words with the part of speech modified by the user.

In the above technical solution, obtaining the part of speech modified by the user and tagging the second-type words with it allows the set part of speech to be switched quickly, so that words of different parts of speech can be tagged.

Optionally, dividing the sentence into multiple segments according to the selected first-type words and storing them, and tagging the selected first-type words with the part of speech set by the user and displaying the result, comprises:

Using the selected first-type words as dividing lines, dividing the sentence into multiple segments and storing them in order;

Tagging the selected first-type words with the part of speech set by the user, and displaying the tagged part of speech in the sentence.

In the above technical solution, using the first-type words as dividing lines to divide the sentence into multiple segments that are stored in order keeps each segment in its position within the sentence and improves the accuracy of part-of-speech tagging.

Optionally, after the selected first-type words are tagged with the part of speech set by the user and displayed, the method further includes:

Setting the words tagged with the part of speech set by the user to the same background colour;

Wherein words of different parts of speech have different background colours.

In the above technical solution, a background colour can also be set after a part of speech is tagged, so that words of different parts of speech can be told apart.

Optionally, the method further includes:

Obtaining a tagged word that the user clicks;

Changing the part of speech of the tagged word to unclassified, and determining whether the words adjacent to the word whose part of speech was changed to unclassified are themselves unclassified; if so, merging the word whose part of speech was changed to unclassified with the adjacent unclassified words and storing the result.

In the above technical solution, merging a word whose tag has been deleted with the adjacent unclassified words preserves order.

Optionally, the part of speech includes, but is not limited to, unclassified, verb, noun, pronoun, adjective, numeral, quantifier or stop word;

Wherein words whose part of speech is unclassified are displayed without a part of speech.

In a second aspect, an embodiment of the present invention provides a device for text part-of-speech tagging, comprising:

A determination unit, configured to determine the part of speech set by a user;

An acquiring unit, configured to obtain the first-type words that the user selects from a sentence;

A processing unit, configured to divide the sentence into multiple segments according to the selected first-type words and store them, and to tag the selected first-type words with the part of speech set by the user and display the result.

Optionally, the processing unit is further configured to:

After the selected first-type words are tagged with the part of speech set by the user and displayed, control the acquiring unit to obtain the part of speech modified by the user and the second-type words selected by the user;

Optionally, the processing unit is specifically configured to:

Optionally, the processing unit is further configured to:

After the selected first-type words are tagged with the part of speech set by the user and displayed, set the words tagged with the part of speech set by the user to the same background colour;

Optionally, the processing unit is further configured to:

Control the acquiring unit to obtain a tagged word that the user clicks;

In a third aspect, an embodiment of the present invention further provides a computing device, comprising:

A memory, configured to store program instructions;

A processor, configured to call the program instructions stored in the memory and to execute the above method for text part-of-speech tagging according to the obtained program.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable non-volatile storage medium including computer-readable instructions which, when read and executed by a computer, cause the computer to execute the above method for text part-of-speech tagging.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a system architecture provided by an embodiment of the present invention;

Fig. 2 is a flow diagram of a method for text part-of-speech tagging provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of text part-of-speech tagging provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of text part-of-speech tagging provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of text part-of-speech tagging provided by an embodiment of the present invention;

Fig. 6 is a schematic diagram of text part-of-speech tagging provided by an embodiment of the present invention;

Fig. 7 is a structural schematic diagram of a device for text part-of-speech tagging provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Fig. 1 illustratively shows a system architecture to which embodiments of the present invention apply. The system architecture may be a server 100, which includes a processor 110, a communication interface 120 and a memory 130.

The communication interface 120 is used to communicate with terminal devices, sending and receiving the information they transmit.

The processor 110 is the control centre of the server 100. It connects the various parts of the server 100 through various interfaces and lines, and performs the functions of the server 100 and processes data by running or executing the software programs and/or modules stored in the memory 130 and calling the data stored in the memory 130. Optionally, the processor 110 may include one or more processing units.

The memory 130 can be used to store software programs and modules; the processor 110 executes various functional applications and data processing by running the software programs and modules stored in the memory 130. The memory 130 may mainly include a program storage area and a data storage area, where the program storage area can store the operating system, the application programs required by at least one function, and so on, and the data storage area can store data created during business processing. In addition, the memory 130 may include a high-speed random access memory and may also include a non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.

It should be noted that the structure shown in Fig. 1 is only an example, and the embodiments of the present invention are not limited to it.

Based on the above description, Fig. 2 illustratively shows the flow of a method for text part-of-speech tagging provided by an embodiment of the present invention. The flow may be executed by a device for text part-of-speech tagging, which may be located in the server 100 shown in Fig. 1 or may be the server 100 itself.

As shown in Fig. 2, the flow specifically includes:

Step 201: determine the part of speech set by the user.

Before tagging the words in a sentence, the user first needs to set the part of speech to be tagged. In this embodiment, the parts of speech may include, but are not limited to, unclassified, verb, noun, pronoun, adjective, numeral, quantifier and stop word; in a specific application, parts of speech can be added or removed as needed. Words whose part of speech is unclassified are displayed without a part of speech, and every word in a sentence that has not yet been tagged is unclassified. As shown in Fig. 3, the parts of speech the user can set include verb, noun, adjective and stop word. After the file is loaded, the part of speech currently set by the user is verb.
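The tag set and the currently set part of speech described in step 201 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `PartOfSpeech` and `TaggingSession` are hypothetical, and the default of verb simply mirrors the state described for Fig. 3.

```python
from enum import Enum


class PartOfSpeech(Enum):
    """Tag set listed in the description; it may be extended or reduced."""
    UNCLASSIFIED = "unclassified"
    VERB = "verb"
    NOUN = "noun"
    PRONOUN = "pronoun"
    ADJECTIVE = "adjective"
    NUMERAL = "numeral"
    QUANTIFIER = "quantifier"
    STOP_WORD = "stop word"


class TaggingSession:
    """Hypothetical holder for one sentence and the part of speech in effect."""

    def __init__(self, sentence: str) -> None:
        # Assumption: verb is the default after loading, as described for Fig. 3.
        self.current_type = PartOfSpeech.VERB
        # A freshly loaded sentence is a single unclassified record.
        self.records = [(sentence, PartOfSpeech.UNCLASSIFIED)]

    def set_type(self, pos: PartOfSpeech) -> None:
        """Record the part of speech the user has just set."""
        self.current_type = pos
```

Keeping the current part of speech as session state is what allows many words to be tagged in a row without choosing the tag again for each one.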

Step 202: obtain the first-type word that the user selects from the sentence.

When the user wants to tag a word, the word must first be selected, generally by dragging the mouse over it. In a specific implementation, obtaining the first-type word that the user selects from the sentence can be realized through a click() function.

Step 203: divide the sentence into multiple segments according to the selected first-type word and store them, and tag the selected first-type word with the part of speech set by the user and display the result.

After the first-type word selected by the user is obtained, the sentence can be divided into multiple segments with the selected first-type word as the dividing line, and the segments are stored in order; the selected first-type word is then tagged with the part of speech set by the user, and the tagged part of speech is displayed in the sentence. As shown in Fig. 4, the part of speech currently set by the user is verb, which can be saved in a variable Type, so that the current value of Type is "verb". When a sentence is first loaded, the whole sentence is stored as a single record of type "unclassified"; the storage of the initial unclassified sentence can be as shown in Table 1. The first-type word selected by the user is "apply for". The record containing "apply for" is located first, here the record with serial number 1. With "apply for" as the dividing line, the text content of serial number 1 in Table 1 is divided into three segments (no segment is created for the left or right side if it is empty): "Once can", "apply for" and "multiple VPS how specifically to do this". The segments are stored in table form and each is given a serial number, as shown in Table 2.

Table 1

Serial number	Text content	Part of speech
1	Once can apply for multiple VPS how specifically to do this	Unclassified

Table 2

Serial number	Text content	Part of speech
1	Once can	Unclassified
2	apply for	Verb
3	multiple VPS how specifically to do this	Unclassified
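The split that turns Table 1 into Table 2 can be sketched as follows. This is an illustration only: the function name `tag_selection`, the record layout and the use of character offsets to describe the selection are assumptions, not details given in the description. Serial numbers correspond to list positions plus one.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (text content, part of speech)
UNCLASSIFIED = "unclassified"


def tag_selection(records: List[Record], index: int,
                  start: int, end: int, pos: str) -> List[Record]:
    """Split records[index] around text[start:end] and tag the selection.

    Returns a new list; the input list is left unchanged.
    """
    text, current = records[index]
    if current != UNCLASSIFIED:
        raise ValueError("only unclassified segments can be selected")
    if not 0 <= start < end <= len(text):
        raise ValueError("selection is empty or out of range")
    pieces = [
        (text[:start], UNCLASSIFIED),  # left of the selection
        (text[start:end], pos),        # the selected word
        (text[end:], UNCLASSIFIED),    # right of the selection
    ]
    # An empty left or right side does not become a segment.
    pieces = [(t, p) for t, p in pieces if t]
    return records[:index] + pieces + records[index + 1:]
```

Because the new pieces are spliced in at the position of the record they replace, the stored order always matches the order of the text in the original sentence.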

To make it easier to tag words of different parts of speech, the part of speech modified by the user and the second-type word selected by the user can also be obtained. The segment containing the second-type word is then divided into multiple segments according to the second-type word and stored, and the second-type word is tagged with the part of speech modified by the user.

For example, as shown in Fig. 5, the part of speech modified by the user is adjective, so the value of the Type variable is changed to "adjective". The second-type word selected by the user is "can". From Table 2 it can be determined that "can" lies in the segment with serial number 1. With "can" as the dividing line, "Once can" is divided into "Once" and "can", which are stored in the order in which they appear in the original sentence, and "can" is tagged with the part of speech, as shown in Table 3.

Table 3

Serial number	Text content	Part of speech
1	Once	Unclassified
2	can	Adjective
3	apply for	Verb
4	multiple VPS how specifically to do this	Unclassified
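The description does not say how the segment containing the selected word is determined. One possible approach, assuming the selection is reported as a character offset into the whole sentence, is to walk the stored segments and accumulate their lengths. The helper `locate_segment` below is hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (text content, part of speech)


def locate_segment(records: List[Record], offset: int) -> Tuple[int, int]:
    """Map a character offset in the whole sentence to (record index, local offset)."""
    consumed = 0  # number of characters covered by the records seen so far
    for index, (text, _pos) in enumerate(records):
        if consumed <= offset < consumed + len(text):
            return index, offset - consumed
        consumed += len(text)
    raise IndexError("offset lies outside the sentence")
```

The returned record index plus one is the serial number used in the tables, and the local offset is where the split inside that segment would begin.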

Further, the tag of a word that has already been tagged can also be deleted. Specifically, the tagged word that the user clicks is obtained, its part of speech is changed to unclassified, and it is merged with its adjacent words or segments for storage where applicable.

It should be noted that only words whose part of speech is unclassified can be selected in a sentence; words that have already been tagged cannot be selected. Therefore, to delete the tag of a tagged word, the user only needs to click anywhere in the region of that word for it to be obtained.

For example, to cancel the tag on "can", the user clicks anywhere in the region occupied by "can", and the tagged word clicked by the user, "can", is obtained. From Table 3, "can" is found at serial number 2, and its part of speech is changed to unclassified. Its two adjacent records (serial numbers 1 and 3) are then examined, and any adjacent record whose part of speech is also unclassified can be merged with it. Table 3 shows that the word with serial number 1 is unclassified, so the two are merged, finally yielding the list shown in Table 2.
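The tag deletion and merge described above can be sketched as follows. The function name `untag` and the record layout are assumptions made for illustration; the merge rule itself (join with each neighbour that is unclassified) follows the description.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (text content, part of speech)
UNCLASSIFIED = "unclassified"


def untag(records: List[Record], index: int) -> List[Record]:
    """Reset records[index] to unclassified and merge it with unclassified neighbours."""
    text, pos = records[index]
    if pos == UNCLASSIFIED:
        return list(records)  # nothing to delete
    start, end = index, index + 1
    # Absorb the record on the left if it is unclassified.
    if index > 0 and records[index - 1][1] == UNCLASSIFIED:
        text = records[index - 1][0] + text
        start = index - 1
    # Absorb the record on the right if it is unclassified.
    if index + 1 < len(records) and records[index + 1][1] == UNCLASSIFIED:
        text = text + records[index + 1][0]
        end = index + 2
    return records[:start] + [(text, UNCLASSIFIED)] + records[end:]
```

Merging on deletion means no two unclassified segments are ever adjacent, so the stored records return to the same shape they had before the word was tagged.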

Optionally, after each word is tagged, the words tagged with the part of speech set by the user can also be set to the same background colour, with words of different parts of speech given different background colours. As Fig. 6 shows, each part of speech has its own background colour.
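One way to realize the per-tag background colour is to render the stored records as HTML. The description does not specify the display technology, so the HTML output, the colour values and the names `BACKGROUND` and `render` below are all assumptions made for illustration.

```python
from html import escape
from typing import List, Tuple

Record = Tuple[str, str]  # (text content, part of speech)

# Hypothetical colour table: one distinct background colour per part of speech.
BACKGROUND = {
    "verb": "#ffd54f",
    "noun": "#81d4fa",
    "adjective": "#a5d6a7",
    "stop word": "#e0e0e0",
}


def render(records: List[Record]) -> str:
    """Render records in sentence order; unclassified text gets no colour or label."""
    parts = []
    for text, pos in records:
        if pos == "unclassified":
            parts.append(escape(text))
        else:
            colour = BACKGROUND[pos]
            parts.append(
                f'<span style="background:{colour}" title="{escape(pos)}">'
                f"{escape(text)}</span>"
            )
    return "".join(parts)
```

Because the records are rendered in stored order, the coloured output reads as the original sentence with each tagged word highlighted in place.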

Once all the words in a sentence that can be tagged have been tagged, the user can click the submit button and move on to tagging the next sentence.

The above embodiment determines the part of speech set by the user, obtains the first-type words that the user selects from a sentence, divides the sentence into multiple segments according to the selected first-type words and stores them, and tags the selected first-type words with the part of speech set by the user and displays the result. Tagging the first-type words that the user selects from the sentence with the part of speech set by the user effectively improves the efficiency of part-of-speech tagging, and dividing the sentence into multiple segments according to the first-type words and storing them preserves the order of the data.

Based on the same technical concept, Fig. 7 illustratively shows the structure of a device for text part-of-speech tagging provided by an embodiment of the present invention. The device can execute the flow of text part-of-speech tagging, and may be located in the server 100 shown in Fig. 1 or may be the server 100 itself.

As shown in Fig. 7, the device specifically includes:

A determination unit 701, configured to determine the part of speech set by a user;

An acquiring unit 702, configured to obtain the first-type words that the user selects from a sentence;

A processing unit 703, configured to divide the sentence into multiple segments according to the selected first-type words and store them, and to tag the selected first-type words with the part of speech set by the user and display the result.

Optionally, the processing unit 703 is further configured to:

After the selected first-type words are tagged with the part of speech set by the user and displayed, control the acquiring unit 702 to obtain the part of speech modified by the user and the second-type words selected by the user;

Optionally, the processing unit 703 is specifically configured to:

Optionally, the processing unit 703 is further configured to:

Control the acquiring unit 702 to obtain a tagged word that the user clicks;

Change the part of speech of the tagged word to unclassified, and store it according to its adjacent words or segments.

Based on the same technical concept, an embodiment of the present invention further provides a computing device, comprising:

A memory, configured to store program instructions;

Based on the same technical concept, an embodiment of the present invention further provides a computer-readable non-volatile storage medium including computer-readable instructions which, when read and executed by a computer, cause the computer to execute the above method for text part-of-speech tagging.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims

1. A method for text part-of-speech tagging, characterized by comprising:

Determining the part of speech set by a user;

Obtaining the first-type words that the user selects from a sentence;

Dividing the sentence into multiple segments according to the selected first-type words and storing them, and tagging the selected first-type words with the part of speech set by the user and displaying the result.

2. The method according to claim 1, characterized in that, after the selected first-type words are tagged with the part of speech set by the user and displayed, the method further comprises:

Dividing the segment containing the second-type words into multiple segments according to the second-type words and storing them, and tagging the second-type words with the part of speech modified by the user.

3. The method according to claim 1, characterized in that dividing the sentence into multiple segments according to the selected first-type words and storing them, and tagging the selected first-type words with the part of speech set by the user and displaying the result, comprises:

Tagging the selected first-type words with the part of speech set by the user, and displaying the tagged part of speech in the sentence.

4. The method according to claim 1, characterized in that, after the selected first-type words are tagged with the part of speech set by the user and displayed, the method further comprises:

5. The method according to claim 4, characterized in that the method further comprises:

Obtaining a tagged word that the user clicks;

Changing the part of speech of the tagged word to unclassified, and determining whether the words adjacent to the word whose part of speech was changed to unclassified are themselves unclassified; if so, merging the word whose part of speech was changed to unclassified with the adjacent unclassified words and storing the result.

6. The method according to any one of claims 1 to 5, characterized in that the part of speech includes unclassified, verb, noun, pronoun, adjective, numeral, quantifier or stop word;

7. A device for text part-of-speech tagging, characterized by comprising:

A determination unit, configured to determine the part of speech set by a user;

An acquiring unit, configured to obtain the first-type words that the user selects from a sentence;

A processing unit, configured to divide the sentence into multiple segments according to the selected first-type words and store them, and to tag the selected first-type words with the part of speech set by the user and display the result.

8. The device according to claim 7, characterized in that the processing unit is further configured to:

After the selected words are tagged with the part of speech set by the user and displayed, control the acquiring unit to obtain the part of speech modified by the user and the second-type words selected by the user;

9. A computing device, characterized by comprising:

A memory, configured to store program instructions;

A processor, configured to call the program instructions stored in the memory and to execute, according to the obtained program, the method according to any one of claims 1 to 6.

10. A computer-readable non-volatile storage medium, characterized by including computer-readable instructions which, when read and executed by a computer, cause the computer to execute the method according to any one of claims 1 to 6.