CN116562281A

CN116562281A - Method, system and equipment for extracting new words in field based on part-of-speech markers

Info

Publication number: CN116562281A
Application number: CN202310826531.5A
Authority: CN
Inventors: 侯颖; 崔运鹏; 罗冠然; 黄杰; 王婷; 王末; 刘娟
Original assignee: Agricultural Information Institute of CAAS
Current assignee: Agricultural Information Institute of CAAS
Priority date: 2023-07-07
Filing date: 2023-07-07
Publication date: 2023-08-08

Abstract

The invention discloses a method, system and device for extracting new domain words based on part-of-speech tagging, relating to the field of natural language processing. The method comprises the following steps: performing word segmentation on the text to be processed to obtain a plurality of segmented words; tagging each segmented word with a part-of-speech tagging model to obtain part-of-speech tags; based on the part-of-speech tags, using a regular expression to select, from the text to be processed, candidate phrases that match a defined part-of-speech pattern; ranking the candidate phrases by their semantic similarity to the text to be processed, using a pre-trained language model; and filtering the ranked candidate phrases to extract new domain words. The invention can extract new domain words quickly and accurately.

Description

Method, system and equipment for extracting new words in field based on part-of-speech markers

Technical Field

The invention relates to the technical field of natural language processing, in particular to a method, a system and equipment for extracting new words in the field based on part-of-speech tagging.

Background

Chinese word segmentation is an important research topic in natural language processing. It is the first step of text mining and the basis of keyword extraction, text clustering, topic modeling, hot-spot analysis and similar tasks, so the quality of the segmentation result directly affects the accuracy of all subsequent text processing. Recognizing new words quickly, accurately and effectively is therefore important both for improving Chinese word segmentation and for improving working efficiency.

New Chinese word recognition is a topic of interest in data mining. Different fields and disciplines have their own professional terms, and existing word segmentation software has difficulty handling them individually and segmenting such special words accurately. Current research on new word discovery mainly focuses on rule matching, statistics, mutual information and n-gram models. Specifically, existing new word discovery methods have the following problems: 1) rule-based methods achieve high accuracy, but they consume considerable manpower and material resources and have poor extensibility and flexibility; 2) statistics-based methods are flexible, are not restricted to a particular field, and are more portable and extensible, but their accuracy is lower; 3) methods that combine mutual information with an n-gram model require the n-gram length to be defined in advance; the user usually does not know the optimal n-gram range and has to find a suitable one by experiment, and even when a suitable range is found, the returned phrases may still be grammatically incorrect.

Disclosure of Invention

The invention aims to provide a method, system and device for extracting new domain words based on part-of-speech tagging, in order to solve the problems of existing new word discovery methods: high consumption of manpower and material resources, poor extensibility and flexibility, low accuracy, and grammatically incorrect results.

In order to achieve the above object, the present invention provides the following solutions:

a field new word extraction method based on part-of-speech tagging comprises the following steps:

word segmentation processing is carried out on the text to be processed to obtain a plurality of word segments;

marking each word segment by using a part-of-speech tagging model to obtain a part-of-speech tag;

selecting candidate phrases matched with the defined part-of-speech patterns from the text to be processed by adopting a regular expression based on the part-of-speech marks;

sorting the candidate phrases according to semantic similarity of the candidate phrases and the text to be processed by using a pre-trained language model;

filtering and extracting new terms in the field from the sorted candidate phrases.

Optionally, after filtering and extracting new terms in the domain from the ranked candidate phrases, the method further includes:

the extracted domain new words are added to the user dictionary.

Optionally, performing word segmentation on the text to be processed to obtain a plurality of segmented words specifically comprises:

performing word segmentation on the text to be processed according to the domain professional vocabulary in the user dictionary to obtain a plurality of segmented words.

Optionally, filtering the ranked candidate phrases to extract new domain words specifically comprises:

filtering the ranked candidate phrases by a similarity threshold or by topN to extract new domain words.

The invention also provides a field new word extraction system based on part-of-speech tagging, which comprises:

the word segmentation processing unit is used for carrying out word segmentation processing on the text to be processed to obtain a plurality of word segments;

the part-of-speech tagging unit is used for tagging each word segment by using a part-of-speech tagging model to obtain a part-of-speech tag;

the candidate phrase selecting unit is used for selecting candidate phrases matched with the defined part-of-speech patterns from the text to be processed by adopting a regular expression based on the part-of-speech marks;

the ranking unit is used for ranking the candidate phrases according to the semantic similarity of the candidate phrases and the text to be processed by utilizing a pre-trained language model;

and the domain new word extraction unit is used for filtering and extracting domain new words from the ordered candidate phrases.

Optionally, the system further comprises:

an adding unit, used for adding the extracted new domain words to the user dictionary.

Optionally, the word segmentation processing unit performs word segmentation processing on the text to be processed according to the domain professional word list in the user dictionary to obtain a plurality of segmented words.

Optionally, the domain new word extracting unit filters and extracts domain new words from the ranked candidate phrases through a similarity threshold or topN.

The invention also provides electronic equipment, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic equipment to execute the method for extracting the new words in the field based on the part-of-speech markers.

The invention also provides a computer readable storage medium storing a computer program which when executed by a processor implements the part-of-speech tagging-based domain new word extraction method described above.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

(1) The invention utilizes the pre-trained language model and parts of speech to identify and extract new words from the field literature, does not need a large amount of marking training data, and reduces manpower consumption;

(2) The invention has good expandability and can be flexibly expanded to other fields;

(3) The invention can extract new words with correct grammar without the need of designating n-gram range by a user.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of the part-of-speech-tagging-based method for extracting new domain words according to an embodiment of the present invention;

FIG. 2 is an overall flowchart of the part-of-speech-tagging-based method for extracting new domain words according to the first embodiment of the present invention;

FIG. 3 is a flowchart of a word segmentation process;

FIG. 4 is a training flowchart of the part-of-speech tagging model.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The invention aims to provide a method, a system and equipment for extracting new words in the field based on part-of-speech tagging, which utilize a pre-trained language model and part-of-speech to identify and extract the new words, do not need a large amount of tagging training data, reduce manpower consumption, have good expandability and can extract the new words with correct grammar.

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

Example 1

The first embodiment of the invention provides a method for extracting new domain words based on part-of-speech tagging. As shown in FIG. 1 and FIG. 2, the method comprises the following steps:

S1: Perform word segmentation on the text to be processed to obtain a plurality of segmented words.

As shown in FIG. 3, an existing domain professional vocabulary extracted from the domain corpus is added to the user dictionary, and the text to be processed is segmented with this dictionary as the segmentation basis to obtain the segmentation result. New domain words extracted in the subsequent steps can also be added to the user dictionary, which improves the segmentation of later texts.

The domain professional vocabulary may consist of verified domain words such as valuable domain entities, entity attributes, proper nouns and terms. The quality of the words in the user dictionary has a strong influence on segmentation accuracy.
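As an illustration only, the dictionary-driven segmentation of step S1 can be sketched in Python as follows. The description does not name a segmentation algorithm, so forward maximum matching is used here as a simple stand-in; the function name `segment`, the sample vocabulary and the sample sentence are assumptions made for this sketch.

```python
def segment(text, user_dict, max_len=None):
    """Forward maximum matching: at each position take the longest dictionary word."""
    if max_len is None:
        max_len = max((len(w) for w in user_dict), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary word matches;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in user_dict:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical domain vocabulary loaded into the user dictionary.
user_dict = {"小麦", "防治", "技术"}
text = "小麦条锈病防治技术"
before = segment(text, user_dict)
# before == ["小麦", "条", "锈", "病", "防治", "技术"]

# Adding a newly extracted domain word changes later segmentation results.
user_dict.add("条锈病")
after = segment(text, user_dict)
# after == ["小麦", "条锈病", "防治", "技术"]
```

The second call shows the feedback loop described above: once a new domain word has been added to the user dictionary, it is no longer split into single characters.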

S2: Tag each segmented word with a part-of-speech tagging model to obtain part-of-speech tags.

The part of speech of each word in the text to be processed is tagged with a part-of-speech tagging model. The model may be an existing trained model, or a custom part-of-speech tagging model built with a neural network from the user's own domain data.

FIG. 4 is a schematic diagram of training a custom part-of-speech tagging model: the user's own domain data is used as training data, the training data is processed with the spaCy natural language processing toolkit to obtain texts and labels, a neural network is trained by gradient descent, and the trained network is saved as the part-of-speech tagging model.
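The training loop of FIG. 4 can be illustrated with a deliberately small sketch. It is not the model of the embodiment: a single-layer softmax classifier stands in for the neural network, it is trained with per-sample gradient descent, and the feature set, the tag names (JJ, NN, VV), the training pairs and all function names are assumptions chosen for the example.

```python
import math


def features(word):
    # Assumed feature set: the whole word and its last character.
    return ["w=" + word, "suf=" + word[-1:]]


def predict_probs(feats, tags, weights):
    # Softmax over the summed feature weights of each tag.
    scores = {t: sum(weights.get((f, t), 0.0) for f in feats) for t in tags}
    top = max(scores.values())
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def train_tagger(samples, epochs=100, lr=0.5):
    """Train a one-layer softmax classifier with per-sample gradient descent."""
    tags = sorted({tag for _, tag in samples})
    weights = {}  # (feature, tag) -> weight, missing entries count as 0.0
    for _ in range(epochs):
        for word, gold in samples:
            feats = features(word)
            probs = predict_probs(feats, tags, weights)
            for tag in tags:
                # Cross-entropy gradient for each tag score: p - 1[tag == gold].
                grad = probs[tag] - (1.0 if tag == gold else 0.0)
                for f in feats:
                    weights[(f, tag)] = weights.get((f, tag), 0.0) - lr * grad
    return tags, weights


def tag_words(words, model):
    tags, weights = model
    tagged = []
    for w in words:
        probs = predict_probs(features(w), tags, weights)
        tagged.append((w, max(tags, key=lambda t: probs[t])))
    return tagged


# Hypothetical labelled domain data: (word, part-of-speech tag) pairs.
training_data = [
    ("小麦", "NN"), ("条锈病", "NN"), ("技术", "NN"),
    ("防治", "VV"), ("新型", "JJ"), ("严重", "JJ"),
]
model = train_tagger(training_data)
tagged = tag_words(["新型", "小麦", "防治"], model)
```

In this toy setting the suffix feature lets the tagger generalise a little: an unseen word ending in 病 receives the tag learned for 条锈病.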

S3: Based on the part-of-speech tags, use a regular expression to select, from the text to be processed, candidate phrases that match the defined part-of-speech pattern.

Specifically: a vectorizer is initialized with parameters such as the part-of-speech matching pattern (for example, adjective tags matched by <J.*> followed by noun tags matched by <N.*>), the part-of-speech tags and a stop-word list. The vectorizer is then fitted on the text to be processed and learns the phrases that match the defined part-of-speech pattern, which yields suitable candidate phrases. After fitting, the vectorizer can convert the documents into a document-phrase matrix whose rows represent documents and whose columns represent phrases.
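The pattern matching of step S3 can be sketched with Python's standard `re` module. This is a minimal illustration, not the vectorizer of the embodiment: the default pattern (zero or more adjective tags followed by one or more noun tags), the treatment of stop words as phrase boundaries, the tag names and the function name `extract_candidates` are all assumptions.

```python
import re


def extract_candidates(tagged, pos_pattern=r"(?:<J[^>]*>)*(?:<N[^>]*>)+",
                       stop_words=(), joiner=""):
    """Return the phrases whose part-of-speech tag sequence matches pos_pattern."""
    tag_string = ""
    starts = []  # character offset in tag_string where each token's tag begins
    for word, tag in tagged:
        starts.append(len(tag_string))
        # A stop word gets a tag the pattern cannot match, so it ends a phrase.
        tag_string += "<STOP>" if word in stop_words else "<%s>" % tag
    candidates = []
    for m in re.finditer(pos_pattern, tag_string):
        # Map the matched character span back to token indices.
        first = starts.index(m.start())
        last = first
        while last + 1 < len(starts) and starts[last + 1] < m.end():
            last += 1
        phrase = joiner.join(word for word, _ in tagged[first:last + 1])
        if phrase not in candidates:
            candidates.append(phrase)
    return candidates


# Hypothetical output of steps S1 and S2: (word, tag) pairs.
tagged_text = [
    ("新型", "JJ"), ("小麦", "NN"), ("条锈病", "NN"),
    ("的", "DEG"), ("防治", "VV"), ("技术", "NN"),
]
candidates = extract_candidates(tagged_text)
# candidates == ["新型小麦条锈病", "技术"]
```

Because the phrase boundaries come from the tag pattern, the length of each candidate is determined by the grammar of the text rather than by a fixed n-gram range.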

The document-phrase matrix may be a phrase frequency count matrix or a phrase TF-IDF matrix. The frequency count matrix shows directly how often each phrase occurs in each document. TF is the term frequency and IDF is the inverse document frequency. TF-IDF is a statistical measure of how important a word is to one document in a document collection or corpus: the importance of a word increases in proportion to the number of times it appears in the document, and decreases as the number of documents in the corpus that contain it grows. Search engines often apply various forms of TF-IDF weighting as a measure or ranking of the relevance between documents and user queries.
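The two matrix forms can be illustrated as follows. The sketch uses one common unsmoothed TF-IDF variant, tf = count / phrases in the document and idf = log(N / document frequency); other variants exist and the description does not say which one is used. The function names and sample phrase lists are assumptions.

```python
import math


def document_phrase_matrix(docs_phrases):
    """docs_phrases holds one list of extracted phrases per document."""
    vocab = sorted({p for phrases in docs_phrases for p in phrases})
    # Rows are documents, columns are phrases, cells are occurrence counts.
    counts = [[phrases.count(p) for p in vocab] for phrases in docs_phrases]
    return vocab, counts


def tfidf_matrix(counts):
    n_docs = len(counts)
    n_phrases = len(counts[0]) if counts else 0
    # Document frequency: number of documents that contain each phrase.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_phrases)]
    matrix = []
    for row in counts:
        total = sum(row) or 1  # avoid division by zero for empty documents
        matrix.append([
            (row[j] / total) * math.log(n_docs / df[j]) if df[j] else 0.0
            for j in range(n_phrases)
        ])
    return matrix


# Hypothetical candidate phrases extracted from two documents.
docs = [["条锈病", "条锈病", "小麦"], ["小麦", "技术"]]
vocab, counts = document_phrase_matrix(docs)
weights = tfidf_matrix(counts)
```

In this example 小麦 occurs in both documents, so its idf is log(2/2) = 0 and its TF-IDF weight is zero everywhere, while 条锈病 occurs twice in the first document only and receives the largest weight there.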

S4: Rank the candidate phrases by their semantic similarity to the text to be processed, using a pre-trained language model.

The extracted candidate phrases are passed to the pre-trained language model, which generates the embeddings; the similarities are then computed, and the candidate phrases are ranked by similarity score.
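The ranking of step S4 can be sketched as embedding, cosine similarity and sorting. No pre-trained language model is loaded here: `toy_embed`, which counts character unigrams and bigrams, is only a placeholder so that the example runs with the standard library, and in practice it would be replaced by the embedding function of the chosen model. All names and the sample text are assumptions.

```python
import math
from collections import Counter


def toy_embed(text):
    # Placeholder for a pre-trained language model embedding:
    # a sparse vector of character unigram and bigram counts.
    grams = list(text) + [text[i:i + 2] for i in range(len(text) - 1)]
    return Counter(grams)


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_candidates(document, candidates, embed=toy_embed):
    """Return (phrase, similarity) pairs, most similar to the document first."""
    doc_vec = embed(document)
    scored = [(phrase, cosine(embed(phrase), doc_vec)) for phrase in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


document = "小麦条锈病防治技术研究,条锈病是小麦的主要病害"
ranked = rank_candidates(document, ["技术", "条锈病", "玉米"])
```

With this placeholder, 条锈病 ranks first because it overlaps most with the document, and 玉米, which shares no characters with it, receives a similarity of zero and ranks last.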

S5: Filter the ranked candidate phrases to extract new domain words.

New domain words are extracted from the ranked candidate phrases by setting a similarity threshold or topN, specifically: if the topN parameter is chosen, the top N ranked candidate phrases are taken directly as the new domain words; if the threshold method is chosen, each ranked candidate phrase whose similarity value is greater than the set threshold is taken as a new domain word.
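The two filtering options of step S5 can be sketched as a single function. The function name, the parameter names and the similarity scores in the example are illustrative assumptions, not values from the embodiment.

```python
def filter_new_words(ranked, top_n=None, threshold=None):
    """ranked holds (phrase, similarity) pairs sorted by similarity, highest first."""
    if (top_n is None) == (threshold is None):
        raise ValueError("choose exactly one of top_n or threshold")
    if top_n is not None:
        kept = ranked[:top_n]  # topN: take the first N ranked phrases directly
    else:
        # Threshold: keep phrases whose similarity is strictly greater.
        kept = [(p, s) for p, s in ranked if s > threshold]
    return [p for p, _ in kept]


# Made-up scores, for illustration only.
ranked_example = [("条锈病", 0.82), ("防治技术", 0.61), ("研究", 0.30)]
by_top_n = filter_new_words(ranked_example, top_n=2)
by_threshold = filter_new_words(ranked_example, threshold=0.5)

# The extracted words can then be added to the user dictionary used in step S1.
user_dictionary = {"小麦"}
user_dictionary.update(by_threshold)
```

topN fixes the number of words returned per text, whereas the threshold lets that number vary with how many candidates are close enough to the text.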

According to the method for extracting the new words in the field, which is provided by the invention, the pre-trained language model and parts of speech are utilized to extract the new words from the field literature, so that a large amount of marking training data is not needed, and the manpower consumption is reduced; the expandability is good, and the system can be flexibly expanded to other fields; the user does not need to specify the range of the n-gram, and new words with correct grammar can be extracted.

Example 2

In order to execute the corresponding method of the above embodiment to achieve the corresponding functions and technical effects, a new word extraction system in the field based on part-of-speech tagging is provided below.

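The system comprises:

a word segmentation processing unit, used for performing word segmentation on the text to be processed to obtain a plurality of segmented words;

a part-of-speech tagging unit, used for tagging each segmented word with a part-of-speech tagging model to obtain part-of-speech tags;

a candidate phrase selecting unit, used for selecting, based on the part-of-speech tags and by means of a regular expression, candidate phrases that match the defined part-of-speech pattern from the text to be processed;

a ranking unit, used for ranking the candidate phrases by their semantic similarity to the text to be processed, using a pre-trained language model;

a domain new word extraction unit, used for filtering the ranked candidate phrases to extract new domain words.

Further, the system also comprises:

an adding unit, used for adding the extracted new domain words to the user dictionary.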

Further, the word segmentation processing unit performs word segmentation processing on the text to be processed according to the domain professional word list in the user dictionary to obtain a plurality of segmented words.

Further, the domain new word extraction unit filters and extracts domain new words from the sorted candidate phrases through a similarity threshold value or topN.

Example 3

An electronic device according to a third embodiment of the present invention includes a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to execute the computer program to cause the electronic device to execute the part-of-speech tag-based domain new word extraction method provided in the first embodiment.

In practical applications, the electronic device may be a server.

In practical applications, the electronic device includes: at least one processor (processor), memory (memory), bus, and communication interface (communication interface).

The processor, the communication interface and the memory communicate with one another via the communication bus.

The communication interface is used for communicating with other devices.

The processor is used for executing a program, and specifically may execute the method described in the foregoing embodiment.

Specifically, the program may include program code, and the program code includes computer operating instructions.

The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the electronic device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.

The memory is used for storing the program. The memory may comprise high-speed RAM and may further comprise non-volatile memory, such as at least one disk memory.

Example 4

Based on the description of the third embodiment, the fourth embodiment of the present invention provides a storage medium having a computer program stored thereon, where the computer program is executable by a processor to implement the part-of-speech tagging-based domain new word extraction method of the first embodiment.

The part-of-speech tagging-based domain new word extraction system provided in the second embodiment of the present invention exists in various forms, including but not limited to:

(1) Mobile communication devices: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include smart phones (e.g., iPhone), multimedia phones, feature phones and low-end phones.

(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID and UMPC devices, such as the iPad.

(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players (e.g., iPod), handheld game consoles, e-book readers, smart toys and portable car navigation devices.

(4) Other electronic devices with data interaction functions.

Thus, particular embodiments of the present subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may be advantageous.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in the same piece or pieces of software and/or hardware when implementing the present invention. It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. Computer-readable media, as defined in the present invention, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.

Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In this specification, the embodiments are described progressively: each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments can be cross-referenced. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.

Specific examples have been used herein to explain the principles and embodiments of the present invention; the description of these examples is intended only to help in understanding the method of the invention and its core idea. Those of ordinary skill in the art may also, following the idea of the invention, make changes to the specific embodiments and the scope of application. In view of the foregoing, the content of this description should not be construed as limiting the invention.

Claims

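1. A method for extracting new domain words based on part-of-speech tagging, characterized in that the method comprises the following steps:

performing word segmentation on the text to be processed to obtain a plurality of segmented words;

tagging each segmented word with a part-of-speech tagging model to obtain part-of-speech tags;

based on the part-of-speech tags, using a regular expression to select, from the text to be processed, candidate phrases that match a defined part-of-speech pattern;

ranking the candidate phrases by their semantic similarity to the text to be processed, using a pre-trained language model;

filtering the ranked candidate phrases to extract new domain words.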

2. The method for extracting new terms in the field based on part-of-speech tagging according to claim 1, further comprising, after filtering the ranked candidate phrases to extract new terms in the field:

the extracted domain new words are added to the user dictionary.

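3. The method for extracting new domain words based on part-of-speech tagging according to claim 1, wherein performing word segmentation on the text to be processed to obtain a plurality of segmented words specifically comprises:

performing word segmentation on the text to be processed according to the domain professional vocabulary in the user dictionary to obtain a plurality of segmented words.

4. The method for extracting new domain words based on part-of-speech tagging according to claim 1, wherein filtering the ranked candidate phrases to extract new domain words specifically comprises:

filtering the ranked candidate phrases by a similarity threshold or by topN to extract new domain words.

5. A system for extracting new domain words based on part-of-speech tagging, characterized by comprising:

a word segmentation processing unit, used for performing word segmentation on the text to be processed to obtain a plurality of segmented words;

a part-of-speech tagging unit, used for tagging each segmented word with a part-of-speech tagging model to obtain part-of-speech tags;

a candidate phrase selecting unit, used for selecting, based on the part-of-speech tags and by means of a regular expression, candidate phrases that match a defined part-of-speech pattern from the text to be processed;

a ranking unit, used for ranking the candidate phrases by their semantic similarity to the text to be processed, using a pre-trained language model;

a domain new word extraction unit, used for filtering the ranked candidate phrases to extract new domain words.

6. The part-of-speech tagging based domain new word extraction system of claim 5, further comprising:

an adding unit, used for adding the extracted new domain words to the user dictionary.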

7. The part-of-speech tagging-based domain new word extraction system according to claim 5, wherein the word segmentation processing unit performs word segmentation processing on the text to be processed according to a domain specialized vocabulary in the user dictionary, so as to obtain a plurality of segmented words.

8. The part-of-speech tagging based domain new word extraction system of claim 5, wherein the domain new word extraction unit filters the ranked candidate phrases through a similarity threshold or topN to extract domain new words.

9. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the part-of-speech tag-based domain new word extraction method of any one of claims 1-4.

10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the part-of-speech tagging-based domain new word extraction method according to any one of claims 1-4.