CN112632999A - Named entity recognition model obtaining method, named entity recognition device and named entity recognition medium - Google Patents


Info

Publication number
CN112632999A
Authority
CN
China
Prior art keywords
word
word segmentation
segmentation result
frequency
single words
Prior art date
Legal status
Pending
Application number
CN202011510819.4A
Other languages
Chinese (zh)
Inventor
黄定帮
付志宏
何径舟
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011510819.4A
Publication of CN112632999A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a named entity recognition model acquisition method, a named entity recognition method, and corresponding apparatuses and media, relating to the artificial intelligence fields of natural language processing, deep learning, and computer vision. The model acquisition method may comprise: performing word segmentation on a training text, and forming a first sequence from the segmentation results; splitting each segmentation result in the first sequence that comprises at least two single characters and is a non-high-frequency word into its single characters; and taking the embedding vectors corresponding to the single characters in the split first sequence, together with the embedding vectors corresponding to the high-frequency words comprising at least two single characters, as the input of the named entity recognition model, and training the named entity recognition model. By applying the disclosed scheme, the model training effect and model performance can be improved, and the accuracy of the named entity recognition results obtained with the model improves correspondingly.

Description

Named entity recognition model obtaining method, named entity recognition device and named entity recognition medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and a medium for acquiring a named entity recognition model and recognizing a named entity in the fields of natural language processing, deep learning, and computer vision.
Background
Named Entity Recognition (NER) refers to recognizing entity words of specific categories in text, such as person names, place names, organization names, and the like.
The named entity recognition task is a typical sequence labeling task. It is currently implemented with a neural network model, which can be obtained by pre-training. Depending on how the model input is processed, implementations can be divided into two modes: one based on character granularity and one based on word granularity.
The character-granularity mode, moreover, requires the model to distinguish not only the type of an entity word but also its boundary, which makes model learning more difficult.
Disclosure of Invention
The present disclosure provides named entity recognition model acquisition and named entity recognition methods, apparatuses, and media.
A named entity recognition model acquisition method comprises the following steps:
performing word segmentation processing on a training text, and forming a first sequence from the segmentation results;
splitting each segmentation result in the first sequence that comprises at least two single characters and is a non-high-frequency word into its single characters; and
taking the embedding vectors corresponding to the single characters in the split first sequence and the embedding vectors corresponding to the high-frequency words comprising at least two single characters as the input of a named entity recognition model, and training the named entity recognition model.
A named entity recognition method comprises:
performing word segmentation processing on a text to be recognized, and forming a second sequence from the segmentation results;
splitting each segmentation result in the second sequence that comprises at least two single characters and is a non-high-frequency word into its single characters; and
inputting the embedding vectors corresponding to the single characters in the split second sequence and the embedding vectors corresponding to the high-frequency words comprising at least two single characters into a pre-trained named entity recognition model to obtain a named entity recognition result for the text to be recognized.
A named entity recognition model acquisition apparatus comprises a first word segmentation module, a first splitting module, and a model training module;
the first word segmentation module is configured to perform word segmentation processing on a training text and form a first sequence from the segmentation results;
the first splitting module is configured to split each segmentation result in the first sequence that comprises at least two single characters and is a non-high-frequency word into its single characters; and
the model training module is configured to take the embedding vectors corresponding to the single characters in the split first sequence and the embedding vectors corresponding to the high-frequency words comprising at least two single characters as the input of a named entity recognition model, and to train the named entity recognition model.
A named entity recognition apparatus comprises a second word segmentation module, a second splitting module, and a result acquisition module;
the second word segmentation module is configured to perform word segmentation processing on the text to be recognized and form a second sequence from the segmentation results;
the second splitting module is configured to split each segmentation result in the second sequence that comprises at least two single characters and is a non-high-frequency word into its single characters; and
the result acquisition module is configured to input the embedding vectors corresponding to the single characters in the split second sequence and the embedding vectors corresponding to the high-frequency words comprising at least two single characters into a pre-trained named entity recognition model to obtain a named entity recognition result for the text to be recognized.
An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.
A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as described above.
A computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
One embodiment of the above disclosure has the following advantages or benefits: by adopting a mixed character/word granularity, the influence of word segmentation errors on the model is reduced and new words are handled better; in addition, compared with the character-granularity mode, the model learning difficulty and the input sequence length can be reduced.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of an embodiment of a named entity recognition model acquisition method according to the present disclosure;
FIG. 2 is a flowchart of an embodiment of a named entity recognition method according to the present disclosure;
FIG. 3 is a schematic diagram of a named entity identification process according to the present disclosure;
FIG. 4 is a schematic structural diagram of an embodiment of a named entity recognition model acquisition apparatus 40 according to the present disclosure;
FIG. 5 is a schematic structural diagram of an embodiment of a named entity recognition apparatus 50 according to the present disclosure;
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist; e.g., A and/or B may mean: A exists alone, both A and B exist, or B exists alone. The character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
Fig. 1 is a flowchart of an embodiment of a named entity recognition model obtaining method according to the present disclosure. As shown in fig. 1, the following detailed implementation is included.
In step 101, word segmentation processing is performed on the training text, and a first sequence is formed from the segmentation results.
How the word segmentation is performed is not limited here; for example, an existing word segmenter may be used on the training text.
Any segmentation result may contain only one single character or a plurality of single characters. The segmentation results are used to form the first sequence.
In step 102, each segmentation result in the first sequence that comprises at least two single characters and is a non-high-frequency word is split into its single characters.
That is, for a segmentation result in the first sequence comprising at least two single characters, when that result is a non-high-frequency word, it can be split into single characters. Non-high-frequency words may also be called low-frequency words.
In step 103, the embedding vectors corresponding to the single characters in the split first sequence and the embedding vectors corresponding to the high-frequency words comprising at least two single characters are used as the input of the named entity recognition model, and the named entity recognition model is trained.
After splitting, the first sequence consists of single characters and of high-frequency words comprising at least two single characters, where the single characters include both the characters produced by splitting and the segmentation results containing only one character.
The embedding vectors corresponding to the single characters and to the high-frequency words in the first sequence can each be obtained in an existing manner and used as the input of the named entity recognition model to train it.
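As an illustration, the splitting of steps 101-102 and the assembly of the mixed-granularity input sequence can be sketched as follows; the function name and the representation of the high-frequency word list as a Python set are assumptions made for this example, not details taken from the disclosure:

```python
def split_to_mixed_granularity(segmentation_results, high_freq_words):
    """Turn a word-granularity segmentation into the mixed sequence
    described above: results with at least two single characters that
    are not high-frequency words are split into their characters;
    high-frequency words and single-character results pass through."""
    mixed = []
    for result in segmentation_results:
        if len(result) >= 2 and result not in high_freq_words:
            mixed.extend(result)  # split into its M single characters
        else:
            mixed.append(result)  # keep the unit whole
    return mixed

# A low-frequency name is split to characters; high-frequency words stay whole.
first_sequence = ["张三", "是", "中国", "篮球", "明星"]  # "Zhang San is a Chinese basketball star"
high_freq = {"中国", "篮球", "明星"}  # assumed high-frequency word list
print(split_to_mixed_granularity(first_sequence, high_freq))
# → ['张', '三', '是', '中国', '篮球', '明星']
```

The same routine serves both the training flow (first sequence) and the recognition flow (second sequence), since the splitting rule is identical in both.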
As can be seen, the above method embodiment adopts a mixed character/word granularity. Because the word segmentation error rate of non-low-frequency words, i.e., high-frequency words, is low, high-frequency words are input to the model at word granularity, avoiding the influence of segmentation errors on the model as far as possible. Low-frequency words are input to the model at character granularity, so the model can learn the word-formation rules of low-frequency words and thus cope with new words as far as possible. In addition, the existing word-granularity mode occupies a large amount of memory because of its large vocabulary, whereas the mixed-granularity mode reduces memory usage. Furthermore, compared with the existing character-granularity mode, the mixed-granularity mode can reduce the model learning difficulty and shorten the overall length of the input sequence. In short, the mixed character/word granularity can significantly improve the model training effect and model performance.
The named entity recognition model is a neural network model, for example a model based on a bidirectional long short-term memory network with a conditional random field layer (BiLSTM-CRF, Bidirectional Long Short-Term Memory with Conditional Random Field).
When training the named entity recognition model, for any training text, word segmentation may first be performed and a first sequence formed from the segmentation results; then each segmentation result in the first sequence that comprises at least two single characters and is a non-high-frequency word may be split into single characters.
Specifically, for any segmentation result comprising at least two single characters, if the result is not found in a pre-generated high-frequency word list, it can be determined to be a non-high-frequency word and split into M single characters, where M is the number of single characters it comprises.
For example, if a segmentation result a consists of character 1 and character 2 and is not in the high-frequency word list, it can be determined to be a non-high-frequency word and split into character 1 and character 2.
For another example, if a segmentation result b consists of characters 3, 4, 5, and 6 and is not in the high-frequency word list, it is determined to be a non-high-frequency word and is split into characters 3, 4, 5, and 6.
For any segmentation result comprising at least two single characters, if the result is found in the high-frequency word list, it can be determined to be a high-frequency word and kept unchanged, i.e., not split.
For example, if a segmentation result c consists of character 7 and character 8 and is in the high-frequency word list, it is determined to be a high-frequency word and kept unchanged.
Through this processing, high-frequency and non-high-frequency words can be distinguished conveniently and accurately, laying a good foundation for the subsequent processing.
The high-frequency word list may be generated in advance. The generation may comprise: performing word segmentation processing on each training text; counting the frequency of occurrence of each segmentation result; and taking the segmentation results whose frequency is greater than a predetermined threshold and which comprise at least two single characters as high-frequency words, the high-frequency words together forming the high-frequency word list.
A training text is a training sample of the named entity recognition model. The number of training texts is not limited and can be determined according to actual needs.
Word segmentation can be performed on each training text separately, for example with the same word segmenter as in step 101. The frequency of occurrence of each segmentation result can then be counted, and the results whose frequency exceeds the predetermined threshold are taken as high-frequency words, which form the high-frequency word list. The specific threshold value can be set according to actual needs. In addition, a high-frequency word is generally required to comprise at least two single characters.
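The list generation just described can be sketched with a simple frequency counter; the function name and the threshold value below are illustrative only:

```python
from collections import Counter

def build_high_freq_word_list(segmented_texts, threshold):
    """Count each segmentation result across all training texts and keep
    those occurring more than `threshold` times that comprise at least
    two single characters, as described above."""
    counts = Counter(result for text in segmented_texts for result in text)
    return {result for result, freq in counts.items()
            if freq > threshold and len(result) >= 2}

# Toy corpus of pre-segmented training texts.
corpus = [["中国", "篮球", "明星"], ["中国", "足球"], ["中国", "篮球"]]
print(build_high_freq_word_list(corpus, threshold=1))
# "中国" (3 occurrences) and "篮球" (2) exceed the threshold of 1;
# single-occurrence results and single characters are excluded.
```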
Further, if it is determined that a segmentation result conflicts with the boundary of a labeled entity word, that result can be filtered out, and the high-frequency words are then taken from the remaining segmentation results whose frequency exceeds the predetermined threshold and which comprise at least two single characters.
The training texts, as training samples of the named entity recognition model, carry corresponding labels, i.e., the different entity categories are annotated. During word segmentation, the boundary of a segmentation result may fail to match the boundary of a labeled entity word, for example when a single labeled entity word is cut into several segmentation results.
This processing reduces the segmentation error rate among the high-frequency words, thereby improving the accuracy of the subsequent processing based on the high-frequency word list.
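One plausible reading of this boundary-error filtering, sketched below, treats a segmentation result as erroneous when its character span overlaps a labeled entity span without coinciding with it; the half-open span convention and the function name are assumptions, not specified by the disclosure:

```python
def crosses_entity_boundary(token_span, entity_spans):
    """Return True if the token's half-open character span [start, end)
    overlaps a labeled entity span but does not coincide with it, i.e.
    the segmenter placed a word boundary inside (or across) an entity."""
    start, end = token_span
    for ent_start, ent_end in entity_spans:
        overlaps = start < ent_end and ent_start < end
        if overlaps and (start, end) != (ent_start, ent_end):
            return True
    return False

# An entity is labeled over characters [0, 3); a token covering [0, 2) cuts it.
entities = [(0, 3)]
print(crosses_entity_boundary((0, 2), entities))  # True  -> filter this result
print(crosses_entity_boundary((0, 3), entities))  # False -> boundaries match
print(crosses_entity_boundary((3, 4), entities))  # False -> outside the entity
```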
After the splitting process, the first sequence becomes a mixed-granularity sequence in which high-frequency words appear at word granularity and the other content, such as non-high-frequency words, appears at character granularity.
The embedding vector of each single character and the embedding vector of each high-frequency word in the first sequence can then be used as the input of the named entity recognition model to train it; obtaining embedding vectors and training a named entity recognition model can be done with existing techniques.
Fig. 2 is a flowchart of an embodiment of a named entity recognition method according to the present disclosure. As shown in fig. 2, the following detailed implementation is included.
In step 201, word segmentation processing is performed on the text to be recognized, and a second sequence is formed from the segmentation results.
In step 202, each segmentation result in the second sequence that comprises at least two single characters and is a non-high-frequency word is split into single characters.
In step 203, the embedding vectors corresponding to the single characters in the split second sequence and the embedding vectors corresponding to the high-frequency words comprising at least two single characters are input into the pre-trained named entity recognition model to obtain the named entity recognition result for the text to be recognized.
After training of the named entity recognition model is completed, it can be used to perform named entity recognition on a text to be recognized, and the accuracy of the recognition result is improved accordingly.
How the text to be recognized is segmented is not limited; for example, an existing word segmenter can be used.
The segmentation results form the second sequence, and each segmentation result in the second sequence comprising at least two single characters and being a non-high-frequency word is split into single characters. Specifically, for any segmentation result comprising at least two single characters, if it is not found in the pre-generated high-frequency word list, it can be determined to be a non-high-frequency word and split into M single characters, where M is the number of single characters it comprises.
For any segmentation result comprising at least two single characters, if it is found in the high-frequency word list, it can be determined to be a high-frequency word and kept unchanged, i.e., not split.
The embedding vectors corresponding to the single characters in the split second sequence and the embedding vectors corresponding to the high-frequency words comprising at least two single characters are then input into the named entity recognition model to obtain the named entity recognition result for the text to be recognized.
Based on the foregoing description, fig. 3 is a schematic diagram of a named entity identification process according to the present disclosure.
As shown in fig. 3, assume the text to be recognized is "Zhang San is a Chinese basketball star". Word segmentation is performed first, yielding the segmentation results "Zhang San", "is", "China", "basketball", and "star", which form the second sequence.
Based on the high-frequency word list, the segmentation results in the second sequence that comprise at least two single characters and are non-high-frequency words are then determined, i.e., low-frequency-word filtering is performed. The person name "Zhang San" is a non-high-frequency word, so it is split into the two single characters "Zhang" and "San", giving a second sequence consisting of "Zhang", "San", "is", "China", "basketball", and "star".
The embedding vectors corresponding to the single characters "Zhang", "San", and "is", together with the embedding vectors corresponding to the high-frequency words "China", "basketball", and "star", are then used as the input of the named entity recognition model to obtain the named entity recognition result for "Zhang San is a Chinese basketball star".
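To make the final step concrete, the sketch below maps a mixed sequence like the one in fig. 3 to toy embedding vectors through a single lookup table shared by single characters and high-frequency words; the embedding dimension and the random initialization are illustrative assumptions, not details from the disclosure:

```python
import random

random.seed(0)

# After low-frequency-word splitting, single characters and high-frequency
# words appear side by side, so one embedding table serves both kinds of unit.
mixed_sequence = ["张", "三", "是", "中国", "篮球", "明星"]  # "Zhang San is a Chinese basketball star"
dim = 4  # toy embedding size; real models use hundreds of dimensions
embedding_table = {unit: [random.uniform(-1.0, 1.0) for _ in range(dim)]
                   for unit in dict.fromkeys(mixed_sequence)}

# The sequence of vectors below is what the NER model receives as input.
model_input = [embedding_table[unit] for unit in mixed_sequence]
print(len(model_input), len(model_input[0]))  # 6 units, 4 dimensions each
```

In a trained system these vectors would be learned parameters rather than random draws; the point here is only the shared-table lookup over the mixed-granularity units.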
It is noted that, while for simplicity of explanation the foregoing method embodiments are described as a series of acts, those skilled in the art will appreciate that the present disclosure is not limited by the order of the acts described, since some steps may, in accordance with the present disclosure, occur in other orders or concurrently. Further, the embodiments described in the specification are preferred embodiments, and the acts and modules involved are not necessarily all required by the disclosure. For parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
The above is a description of embodiments of the method, and the embodiments of the apparatus are further described below.
Fig. 4 is a schematic structural diagram of an embodiment of a named entity recognition model acquisition apparatus 40 according to the present disclosure. As shown in fig. 4, the apparatus comprises a first word segmentation module 401, a first splitting module 402, and a model training module 403.
The first word segmentation module 401 is configured to perform word segmentation processing on a training text and form a first sequence from the segmentation results.
The first splitting module 402 is configured to split each segmentation result in the first sequence that comprises at least two single characters and is a non-high-frequency word into single characters.
The model training module 403 is configured to take the embedding vectors corresponding to the single characters in the split first sequence and the embedding vectors corresponding to the high-frequency words comprising at least two single characters as the input of the named entity recognition model, and to train the model.
For any training text, the first word segmentation module 401 performs word segmentation and forms a first sequence from the segmentation results; the first splitting module 402 then splits each segmentation result in the first sequence comprising at least two single characters and being a non-high-frequency word into single characters.
Specifically, for any segmentation result comprising at least two single characters, if the result is not found in the pre-generated high-frequency word list, the first splitting module 402 can determine it to be a non-high-frequency word and split it into M single characters, where M is the number of single characters it comprises.
For any segmentation result comprising at least two single characters, if the result is found in the high-frequency word list, the first splitting module 402 can determine it to be a high-frequency word and keep it unchanged, i.e., not split it.
Correspondingly, the apparatus shown in fig. 4 may further comprise a preprocessing module 400 configured to perform word segmentation processing on each training text, count the frequency of occurrence of each segmentation result, take the segmentation results whose frequency exceeds a predetermined threshold and which comprise at least two single characters as high-frequency words, and form the high-frequency word list from them.
The preprocessing module 400 may further filter out any segmentation result determined to conflict with the boundary of a labeled entity word, and take the high-frequency words from the remaining segmentation results whose frequency exceeds the predetermined threshold and which comprise at least two single characters.
The model training module 403 then uses the embedding vector of each single character and the embedding vector of each high-frequency word in the split first sequence as the input of the named entity recognition model to train it.
Fig. 5 is a schematic structural diagram of an embodiment of a named entity recognition apparatus 50 according to the present disclosure. As shown in fig. 5, the apparatus comprises a second word segmentation module 501, a second splitting module 502, and a result acquisition module 503.
The second word segmentation module 501 is configured to perform word segmentation processing on the text to be recognized and form a second sequence from the segmentation results.
The second splitting module 502 is configured to split each segmentation result in the second sequence that comprises at least two single characters and is a non-high-frequency word into single characters.
The result acquisition module 503 is configured to input the embedding vectors corresponding to the single characters in the split second sequence and the embedding vectors corresponding to the high-frequency words comprising at least two single characters into the pre-trained named entity recognition model to obtain the named entity recognition result for the text to be recognized.
Specifically, for any segmentation result comprising at least two single characters, if the result is not found in the pre-generated high-frequency word list, the second splitting module 502 can determine it to be a non-high-frequency word and split it into M single characters, where M is the number of single characters it comprises.
For any segmentation result comprising at least two single characters, if the result is found in the high-frequency word list, the second splitting module 502 can determine it to be a high-frequency word and keep it unchanged.
For a specific work flow of the device embodiments shown in fig. 4 and fig. 5, reference is made to the related description in the foregoing method embodiments, and details are not repeated.
In short, by adopting the scheme of the apparatus embodiments of the present disclosure, the training effect and performance of the named entity recognition model can be improved, and the accuracy of the named entity recognition results obtained with the model improves correspondingly.
The scheme of the present disclosure can be applied in the field of artificial intelligence, in particular in fields such as natural language processing, deep learning, and computer vision.
Artificial intelligence is the discipline of making computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and covers both hardware and software technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the various methods and processes described above, such as the methods described in this disclosure. For example, in some embodiments, the methods described in this disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by the computing unit 601, one or more steps of the methods described in the present disclosure may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured by any other suitable means (e.g., by means of firmware) to perform the methods described in the present disclosure.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A named entity recognition model acquisition method comprises the following steps:
performing word segmentation processing on the training text, and forming a first sequence by using each word segmentation result;
splitting each word segmentation result in the first sequence that comprises at least two single words and is a non-high-frequency word into the single words;
and taking the word embedding vector corresponding to the single word in the first sequence after the splitting treatment and the word embedding vector corresponding to the high-frequency word at least comprising two single words as the input of a named entity recognition model, and training the named entity recognition model.
2. The method of claim 1, wherein the splitting the segmentation result, which includes at least two words in the first sequence and is a non-high frequency word, into the words comprises:
for any word segmentation result comprising at least two single words, if it is determined that the word segmentation result is not in a pre-generated high-frequency word list, determining that the word segmentation result is a non-high-frequency word, and splitting the word segmentation result into M single words, wherein M represents the number of single words in the word segmentation result.
3. The method of claim 2, further comprising:
and if the word segmentation result is determined to be in the high-frequency word list, determining that the word segmentation result is a high-frequency word, and keeping the word segmentation result unchanged.
4. The method of claim 2, further comprising:
performing word segmentation processing on each training text;
respectively acquiring the occurrence frequency of each word segmentation result;
and taking the word segmentation result with the occurrence frequency larger than a preset threshold and at least comprising two single words as a high-frequency word, and forming the high-frequency word list by using each high-frequency word.
5. The method of claim 4, further comprising:
and if it is determined that the word segmentation boundary of an entity word in any word segmentation result is wrong, filtering out the word segmentation result, and taking, among the remaining word segmentation results, those with an occurrence frequency greater than a preset threshold and comprising at least two single words as high-frequency words.
6. A named entity recognition method, comprising:
performing word segmentation processing on the text to be recognized, and forming a second sequence by using word segmentation results;
splitting each word segmentation result in the second sequence that comprises at least two single words and is a non-high-frequency word into the single words;
and inputting the word embedding vector corresponding to the single word in the second sequence after the splitting treatment and the word embedding vector corresponding to the high-frequency word at least comprising two single words into a named entity recognition model obtained by pre-training to obtain a named entity recognition result corresponding to the text to be recognized.
7. The method of claim 6, wherein the splitting the segmentation result, which includes at least two words in the second sequence and is a non-high frequency word, into the words comprises:
for any word segmentation result comprising at least two single words, if it is determined that the word segmentation result is not in a pre-generated high-frequency word list, determining that the word segmentation result is a non-high-frequency word, and splitting the word segmentation result into M single words, wherein M represents the number of single words in the word segmentation result.
8. The method of claim 7, further comprising:
and if the word segmentation result is determined to be in the high-frequency word list, determining that the word segmentation result is a high-frequency word, and keeping the word segmentation result unchanged.
9. A named entity recognition model acquisition apparatus comprising: the system comprises a first segmentation module, a first splitting module and a model training module;
the first word segmentation module is used for carrying out word segmentation processing on the training text and forming a first sequence by using each word segmentation result;
the first splitting module is used for splitting each word segmentation result in the first sequence that comprises at least two single words and is a non-high-frequency word into the single words;
and the model training module is used for taking word embedding vectors corresponding to the single words in the first sequence after the splitting processing and word embedding vectors corresponding to high-frequency words at least comprising two single words as the input of a named entity recognition model, and training the named entity recognition model.
10. The apparatus of claim 9, wherein,
the first splitting module is used for determining, for any word segmentation result comprising at least two single words, that the word segmentation result is a non-high-frequency word and splitting it into M single words if it is determined that the word segmentation result is not in a pre-generated high-frequency word list, wherein M represents the number of single words in the word segmentation result.
11. The apparatus of claim 10, wherein,
the first splitting module is further configured to determine that the word segmentation result is a high-frequency word if it is determined that the word segmentation result is located in the high-frequency word list, and maintain the word segmentation result unchanged.
12. The apparatus of claim 10, further comprising: a preprocessing module;
the preprocessing module is used for performing word segmentation processing on each training text, respectively acquiring the occurrence frequency of each word segmentation result, taking the word segmentation result with the occurrence frequency larger than a preset threshold and at least comprising two single words as a high-frequency word, and forming the high-frequency word list by using each high-frequency word.
13. The apparatus of claim 12, wherein,
the preprocessing module is further used for filtering out a word segmentation result if it is determined that the word segmentation boundary of an entity word in the word segmentation result is wrong, and taking, among the remaining word segmentation results, those with an occurrence frequency greater than a preset threshold and comprising at least two single words as high-frequency words.
14. A named entity recognition apparatus comprising: the system comprises a second word segmentation module, a second splitting module and a result acquisition module;
the second word segmentation module is used for performing word segmentation processing on the text to be recognized and forming a second sequence by using word segmentation results;
the second splitting module is used for splitting each word segmentation result in the second sequence that comprises at least two single words and is a non-high-frequency word into the single words;
and the result acquisition module is used for inputting word embedding vectors corresponding to the single words in the second sequence after the splitting processing and word embedding vectors corresponding to high-frequency words comprising at least two single words into a named entity recognition model obtained by pre-training, so as to obtain a named entity recognition result corresponding to the text to be recognized.
15. The apparatus of claim 14, wherein,
the second splitting module is used for determining, for any word segmentation result comprising at least two single words, that the word segmentation result is a non-high-frequency word and splitting it into M single words if it is determined that the word segmentation result is not in a pre-generated high-frequency word list, wherein M represents the number of single words in the word segmentation result.
16. The apparatus of claim 15, wherein,
the second splitting module is further configured to determine that the word segmentation result is a high-frequency word if it is determined that the word segmentation result is located in the high-frequency word list, and maintain the word segmentation result unchanged.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202011510819.4A 2020-12-18 2020-12-18 Named entity recognition model obtaining method, named entity recognition device and named entity recognition medium Pending CN112632999A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011510819.4A CN112632999A (en) 2020-12-18 2020-12-18 Named entity recognition model obtaining method, named entity recognition device and named entity recognition medium


Publications (1)

Publication Number Publication Date
CN112632999A true CN112632999A (en) 2021-04-09

Family

ID=75317691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011510819.4A Pending CN112632999A (en) 2020-12-18 2020-12-18 Named entity recognition model obtaining method, named entity recognition device and named entity recognition medium

Country Status (1)

Country Link
CN (1) CN112632999A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115938365A (en) * 2023-03-09 2023-04-07 广州小鹏汽车科技有限公司 Voice interaction method, vehicle and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101118538A (en) * 2007-09-17 2008-02-06 中国科学院计算技术研究所 Method and system for recognizing feature lexical item in Chinese naming entity
CN108417210A (en) * 2018-01-10 2018-08-17 苏州思必驰信息科技有限公司 A kind of word insertion language model training method, words recognition method and system
CN109145303A (en) * 2018-09-06 2019-01-04 腾讯科技(深圳)有限公司 Name entity recognition method, device, medium and equipment
CN109741732A (en) * 2018-08-30 2019-05-10 京东方科技集团股份有限公司 Name entity recognition method, name entity recognition device, equipment and medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination