CN109325226B - Deep learning network-based term extraction method and device and storage medium - Google Patents

Deep learning network-based term extraction method and device and storage medium

Info

Publication number
CN109325226B
Authority
CN
China
Prior art keywords
term
deep learning
learning network
target text
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811052429.XA
Other languages
Chinese (zh)
Other versions
CN109325226A (en
Inventor
杨旭
杜翠凤
周善明
张添翔
叶绍恩
梁晓文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Jiesai Communication Planning And Design Institute Co ltd
GCI Science and Technology Co Ltd
Original Assignee
Guangzhou Jiesai Communication Planning And Design Institute Co ltd
GCI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Jiesai Communication Planning And Design Institute Co ltd, GCI Science and Technology Co Ltd
Priority to CN201811052429.XA
Publication of CN109325226A
Application granted
Publication of CN109325226B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention provides a term extraction method, a term extraction device and a storage medium based on a deep learning network, wherein the method comprises the following steps: carrying out term annotation on the target text; performing word segmentation processing on the labeled target text to obtain a word segmentation text and extracting keywords; training a pre-established RNN deep learning network according to the keywords to obtain a term prediction model, and obtaining a term prediction result output by the term prediction model; and training the pre-established CNN deep learning network according to the term prediction result and term label corresponding to the target text to obtain a term extraction model, and obtaining a term extraction result output by the term extraction model. The method integrates the RNN and the CNN deep learning networks to form a deeper deep learning network, and carries out term prediction and extraction on the target text according to the extracted keywords and the term labeling result of the target text, so that the term extraction rate can be effectively improved, and the extraction of Chinese terms of massive texts is realized.

Description

Deep learning network-based term extraction method and device and storage medium
Technical Field
The present invention relates to the field of term extraction technologies, and in particular, to a term extraction method and apparatus based on a deep learning network, and a storage medium.
Background
A term denotes a concept belonging to a profession or to a research direction within a field. Term extraction is a significant research topic in natural language processing and has broad application prospects, particularly in machine translation and cross-language information retrieval.
Traditional term extraction falls into three categories, most of which are carried out manually or semi-manually on the basis of corpus information. Their term extraction rate is low, and in today's era of information explosion it is difficult to extract Chinese terms from massive texts manually or semi-manually.
Disclosure of Invention
In view of this, the present invention provides a term extraction method, a term extraction device and a storage medium based on a deep learning network, which can improve the term extraction rate and thereby make it possible to extract Chinese terms from massive texts.
In order to achieve the above object, an aspect of the embodiments of the present invention provides a term extraction method based on a deep learning network, including:
carrying out term annotation on the target text;
performing word segmentation processing on the labeled target text to obtain a word segmentation text, and extracting keywords from the word segmentation text;
training a pre-established RNN deep learning network according to the keywords to obtain a term prediction model, and obtaining a term prediction result corresponding to the target text output by the term prediction model;
and training a pre-established CNN deep learning network according to the term prediction result and term label corresponding to the target text to obtain a term extraction model, and acquiring a term extraction result corresponding to the target text output by the term extraction model.
Preferably, the training of the pre-established RNN deep learning network according to the keyword specifically includes:
the hidden layer of the RNN deep learning network adopts an RNN network, and the output layer of the RNN deep learning network adopts a Softmax multilayer network;
and training the RNN of a hidden layer in the RNN deep learning network by using the keywords, and inputting an output result of the RNN into a Softmax multilayer network of an output layer of the RNN for training.
Preferably, before training the pre-established RNN deep learning network according to the keyword, the method further comprises:
performing word vector conversion on the extracted keywords to obtain a word sequence;
and training the pre-established RNN deep learning network by using the word sequence.
Preferably, the term labeling of the target text specifically includes:
carrying out term annotation on a target text by adopting an HANLP open source tool; and the term labeling result of each word in the target text comprises a word, a part of speech and a term boundary.
Preferably, the performing word segmentation processing on the labeled target text to obtain a word segmented text, and extracting keywords from the word segmented text specifically includes:
performing word segmentation processing on the labeled target text by adopting an HANLP open source tool to obtain a word segmentation text;
and extracting a term word and a plurality of words positioned in front of and behind the term word from the word segmentation text according to a term labeling result of each word in the target text to obtain a keyword corresponding to the target text.
Preferably, the training of the pre-established CNN deep learning network according to the term prediction result and the term label corresponding to the target text specifically includes:
the hidden layer of the CNN deep learning network adopts a CNN network, and the output layer of the CNN deep learning network adopts a Softmax multilayer network;
and training a CNN network of a hidden layer in the CNN deep learning network by using a term prediction result and a term label corresponding to the target text, and inputting an output result of the CNN network into a Softmax multilayer network of an output layer of the CNN network for training.
In another aspect, an embodiment of the present invention further provides a term extraction device based on a deep learning network, including:
the term labeling module is used for carrying out term labeling on the target text;
the keyword extraction module is used for performing word segmentation processing on the labeled target text to obtain a word segmentation text and extracting keywords from the word segmentation text;
the first training module is used for training a pre-established RNN deep learning network according to the keywords to obtain a term prediction model and obtain a term prediction result corresponding to the target text output by the term prediction model;
and the second training module is used for training the pre-established CNN deep learning network according to the term prediction result and the term label corresponding to the target text to obtain a term extraction model and acquiring the term extraction result corresponding to the target text output by the term extraction model.
Preferably, the keyword extraction module includes:
the segmentation processing unit is used for performing segmentation processing on the labeled target text by adopting an HANLP open source tool to obtain a segmentation text;
and the keyword acquisition unit is used for extracting term words and a plurality of words positioned in front of and behind the term words from the word segmentation text according to the term labeling result of each word in the target text to obtain keywords corresponding to the target text.
In another aspect, the present invention further provides a deep learning network-based term extraction apparatus, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the deep learning network-based term extraction method is implemented.
In another aspect, the embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where when the computer program runs, a device in which the computer-readable storage medium is located is controlled to perform the term extraction method based on a deep learning network as described above.
Compared with the prior art, the embodiment of the invention has the beneficial effects that: the term extraction method based on the deep learning network comprises the following steps: carrying out term annotation on the target text; performing word segmentation processing on the labeled target text to obtain a word segmentation text, and extracting keywords from the word segmentation text; training a pre-established RNN deep learning network according to the keywords to obtain a term prediction model, and obtaining a term prediction result corresponding to the target text output by the term prediction model; and training a pre-established CNN deep learning network according to the term prediction result and term label corresponding to the target text to obtain a term extraction model, and acquiring a term extraction result corresponding to the target text output by the term extraction model. The method integrates the RNN and the CNN deep learning networks to form a deeper deep learning network, and carries out term prediction and extraction on the target text according to the extracted keywords and the term labeling result of the target text, so that the term extraction rate can be effectively improved, and the extraction of Chinese terms of massive texts is realized.
Drawings
Fig. 1 is a schematic flowchart of a term extraction method based on a deep learning network according to an embodiment of the present invention;
Fig. 2 is a block flow diagram of the deep learning network-based term extraction method of Fig. 1;
Fig. 3 is a schematic block diagram of a term extraction device based on a deep learning network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Please refer to fig. 1, which is a flowchart illustrating a term extraction method based on a deep learning network according to an embodiment of the present invention. The method comprises the following steps:
S100: carrying out term annotation on the target text;
S200: performing word segmentation processing on the labeled target text to obtain a word segmentation text, and extracting keywords from the word segmentation text;
S300: training a pre-established RNN deep learning network according to the keywords to obtain a term prediction model, and acquiring a term prediction result corresponding to the target text output by the term prediction model;
S400: training a pre-established CNN deep learning network according to the term prediction result and term label corresponding to the target text to obtain a term extraction model, and acquiring a term extraction result corresponding to the target text output by the term extraction model.
The invention combines the RNN and CNN deep learning networks into a deeper deep learning network. The RNN deep learning network, which is strong at predicting the order of word sequences, predicts the word that follows each word, and the CNN deep learning network, which has a strong feature extraction capability, extracts terms automatically. This effectively improves the term extraction rate, enables the extraction of Chinese terms from massive texts, and greatly improves the accuracy of term identification and extraction.
In an optional embodiment, the training of the pre-established RNN deep learning network according to the keyword specifically includes:
the hidden layer of the RNN deep learning network adopts an RNN network, and the output layer of the RNN deep learning network adopts a Softmax multilayer network;
and training the RNN of a hidden layer in the RNN deep learning network by using the keywords, and inputting an output result of the RNN into a Softmax multilayer network of an output layer of the RNN for training.
In this embodiment, all the extracted keywords are input to the RNN deep learning network for repeated training, so that the order of the word sequence is learned automatically. Training stops when the loss function satisfies a certain condition, which yields a trained model (the term prediction model); once training is complete, the term prediction model performs prediction automatically.
Further, once the output of the loss function of the RNN deep learning network is obtained, the word prediction accuracy of the generated term prediction model is evaluated. If the accuracy reaches a preset system threshold (for example, 80%), the model is considered satisfactory and training stops; otherwise, the model is considered unsatisfactory, the parameters (for example, learning rate, number of epochs, batch size and Dropout) are readjusted, and training is repeated.
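The stop-or-readjust rule described above can be illustrated with a minimal sketch. It is not the patent's implementation: the function `search_until_accurate`, the callable `train_and_score` and the toy score table are all hypothetical, and the only values taken from the text are the 80% threshold and the four parameter names.

```python
def search_until_accurate(configs, train_and_score, threshold=0.80):
    """Try parameter settings in order and stop at the first acceptable model.

    configs: iterable of dicts with learning_rate, epochs, batch_size, dropout.
    train_and_score: hypothetical callable that trains with one setting and
    returns the resulting prediction accuracy in the range 0..1.
    Returns (config, accuracy, attempts), or (None, best_accuracy, attempts).
    """
    best_accuracy = 0.0
    attempts = 0
    for config in configs:
        attempts += 1
        accuracy = train_and_score(config)
        best_accuracy = max(best_accuracy, accuracy)
        if accuracy >= threshold:
            # The model is considered satisfactory, so training stops here.
            return config, accuracy, attempts
        # Otherwise the loop moves on to the next (readjusted) setting.
    return None, best_accuracy, attempts


# Toy stand-in for real training: accuracy is looked up by epoch count.
TOY_SCORES = {1: 0.62, 2: 0.71, 4: 0.86}


def fake_train_and_score(config):
    return TOY_SCORES[config["epochs"]]


candidate_configs = [
    {"learning_rate": 0.01, "epochs": 1, "batch_size": 32, "dropout": 0.5},
    {"learning_rate": 0.01, "epochs": 2, "batch_size": 32, "dropout": 0.5},
    {"learning_rate": 0.005, "epochs": 4, "batch_size": 64, "dropout": 0.3},
]
chosen, chosen_accuracy, attempts = search_until_accurate(
    candidate_configs, fake_train_and_score
)
```

In this toy run the third setting is the first whose score clears the threshold, so it is the one returned. How the parameters are actually readjusted between attempts is left open by the text, so the fixed candidate list here is only one possible choice.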
In an optional embodiment, before training the pre-established RNN deep learning network according to the keyword, the method further includes:
performing word vector conversion (Word2Vec) on the extracted keywords to obtain a word sequence; in this embodiment, the vector of each keyword has dimensions of 1 × 128;
and training the pre-established RNN deep learning network by using the word sequence.
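As a rough illustration of the data shape involved, the sketch below maps an ordered keyword list to a sequence of 1 × 128 vectors. It shows only the lookup step: the vectors are random placeholders rather than trained Word2Vec embeddings, and the names `build_embedding_table` and `to_vector_sequence` are hypothetical.

```python
import random

EMBED_DIM = 128  # the embodiment states that each keyword vector is 1 x 128


def build_embedding_table(vocabulary, seed=0):
    """Give every word a placeholder vector.

    A trained Word2Vec model would supply the real vectors; random values are
    used here only so that the example is self-contained.
    """
    rng = random.Random(seed)
    return {
        word: [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
        for word in vocabulary
    }


def to_vector_sequence(keywords, table):
    """Map an ordered keyword list to the ordered list of vectors fed to the RNN."""
    return [table[word] for word in keywords]


keywords = ["artificial", "intelligent", "industrial", "robot"]
table = build_embedding_table(sorted(set(keywords)))
sequence = to_vector_sequence(keywords, table)
```

The result is a list with one 128-element vector per keyword, in the same order as the keywords, which is the form a recurrent network consumes step by step.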
In an optional embodiment, the term labeling for the target text specifically includes:
carrying out term annotation on a target text by adopting an HANLP open source tool; and the term labeling result of each word in the target text comprises a word, a part of speech and a term boundary.
In an optional embodiment, the performing word segmentation processing on the labeled target text to obtain a word segmentation text, and extracting keywords from the word segmentation text specifically includes:
performing word segmentation processing on the labeled target text by adopting an HANLP open source tool to obtain a word segmentation text;
and extracting a term word and a plurality of words positioned in front of and behind the term word from the word segmentation text according to a term labeling result of each word in the target text to obtain a keyword corresponding to the target text.
In this embodiment, the TF-IDF algorithm of the TensorFlow tool is used to extract terms from the segmented text. All extracted terms are then ranked by weight, the top N terms by weight are selected, and the 3 words before and the 3 words after each of the N terms are extracted as the keywords of the target text. The weight of a term can be calculated as the ratio of the frequency with which the term appears in the target text to the sum of the frequencies of all terms appearing in the target text; the top N terms are those whose weights rank in the top 20%.
According to the invention, preliminary keyword extraction is performed on the target text to find frequently used words and form a new corpus, and the RNN deep learning network predicts the next word specifically for these frequently used words. This greatly improves computation speed and finds the frequently used words that reflect the subject characteristics of the target text. Because the invention takes only the 3 words before and the 3 words after each term, the size of the corpus and the term extraction time are greatly reduced.
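The weighting and windowing rules of this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's code: it implements only the frequency-ratio weight, the top-20% cut and the 3-word context window, it does not perform TF-IDF, and the function names and sample tokens are hypothetical.

```python
from collections import Counter


def term_weights(term_occurrences):
    """Weight = a term's frequency divided by the summed frequency of all terms."""
    counts = Counter(term_occurrences)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def top_terms(weights, top_percent=20):
    """Keep the highest-weighted top_percent of terms (rounded up, at least one)."""
    keep = max(1, (len(weights) * top_percent + 99) // 100)
    # Ties are broken alphabetically so that the result is deterministic.
    ranked = sorted(weights, key=lambda term: (-weights[term], term))
    return ranked[:keep]


def context_windows(tokens, terms, radius=3):
    """Return each occurrence of a kept term with up to `radius` words per side."""
    wanted = set(terms)
    windows = []
    for index, token in enumerate(tokens):
        if token in wanted:
            windows.append(tokens[max(0, index - radius): index + radius + 1])
    return windows


# Hypothetical segmented text and the term occurrences found in it.
tokens = ["promotion", "lets", "new", "artificial_intelligence",
          "technology", "and", "application", "achieve", "breakthroughs"]
term_occurrences = ["artificial_intelligence"] * 3 + [
    "industrial_robot", "policy", "enterprise", "layout"]

weights = term_weights(term_occurrences)
selected = top_terms(weights)
keywords = context_windows(tokens, selected)
```

With five distinct terms, the top 20% is a single term, and its window contains the term plus three words on each side. Near the start or end of a text the window is simply shorter.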
In an optional embodiment, the training of the pre-established CNN deep learning network according to the term prediction result and the term label corresponding to the target text specifically includes:
the hidden layer of the CNN deep learning network adopts a CNN network, and the output layer of the CNN deep learning network adopts a Softmax multilayer network;
and training a CNN network of a hidden layer in the CNN deep learning network by using a term prediction result and a term label corresponding to the target text, and inputting an output result of the CNN network into a Softmax multilayer network of an output layer of the CNN network for training.
In this embodiment, the term labeling result obtained in step S100 and the term prediction result output by the term prediction model in step S300 are input together into the CNN deep learning network for training, so that the features of terms are learned automatically. Training stops when the loss function satisfies a certain condition, which yields the term extraction model; once training is complete, the term extraction model performs term extraction automatically.
Further, once the output of the loss function of the CNN deep learning network is obtained, the term extraction accuracy of the term extraction model is evaluated. If the accuracy reaches a preset system threshold (for example, 80%), the model is considered satisfactory and training stops; otherwise, the model is considered unsatisfactory, the parameters (for example, learning rate, number of epochs, batch size and Dropout) are readjusted, and training continues.
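The structure named in this embodiment, a CNN hidden layer followed by a Softmax output layer, can be sketched as a forward pass in plain Python. Everything concrete here is an assumption made for illustration: the one-dimensional convolution over a sequence of feature vectors, the ReLU and max-over-time pooling, the two output classes, the layer sizes and the random (untrained) weights.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def conv1d_relu(sequence, kernel):
    """Slide one kernel (width x dim) along the sequence and apply ReLU."""
    width = len(kernel)
    responses = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(k * x
                    for k_row, x_row in zip(kernel, window)
                    for k, x in zip(k_row, x_row))
        responses.append(max(0.0, total))
    return responses


def cnn_classify(sequence, kernels, output_weights):
    """Convolution, then max-over-time pooling, then a softmax output layer."""
    pooled = [max(conv1d_relu(sequence, kernel)) for kernel in kernels]
    scores = [sum(w * p for w, p in zip(row, pooled)) for row in output_weights]
    return softmax(scores)


rng = random.Random(0)
DIM, WIDTH, NUM_KERNELS, NUM_CLASSES = 8, 2, 3, 2  # all sizes are assumed

sequence = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(4)]
kernels = [[[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(WIDTH)]
           for _ in range(NUM_KERNELS)]
output_weights = [[rng.uniform(-0.5, 0.5) for _ in range(NUM_KERNELS)]
                  for _ in range(NUM_CLASSES)]

# Index 0 is read as "term" and index 1 as "not a term" in this sketch.
probabilities = cnn_classify(sequence, kernels, output_weights)
```

The sequence must be at least as long as the kernel width, otherwise there is no window to pool over. Training (adjusting the kernels and output weights against the term labels) is omitted.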
For convenience of understanding, the principle and process of the deep learning network-based term extraction method according to the embodiment of the present invention are described with reference to fig. 2:
In steps S100 and S200, predetermined marks are used for term labeling, for example: nx denotes a noun and v denotes a verb; B denotes a term boundary, and O and I denote other boundaries. In this embodiment, the term to be extracted is the word whose term boundary mark is B. Taking 'in the field of artificial intelligence' as the target text, term labeling yields the result "in/v/O artificial intelligence/nx/B field/nx/I", where the first item of each word is the word itself, the second item is its part of speech and the third item is its term boundary mark; the items are separated by "/" and the words are separated by spaces. The term to be extracted is then obtained from the term labeling result of the target text, namely "artificial intelligence".
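Reading such a labeling result back can be sketched as below, assuming the layout that the text describes: three items per word joined by "/", and words separated by spaces. The regular expression and the function names are illustrative choices, not part of the patent or of HANLP.

```python
import re

# word / part of speech / boundary mark, followed by whitespace or end of text.
# The word part is matched lazily so that a translated multi-word token such as
# "artificial intelligence" stays in one piece.
TOKEN_PATTERN = re.compile(r"(\S.*?)/([A-Za-z]+)/([BIO])(?:\s+|$)")


def parse_annotation(annotated):
    """Split an annotated string into (word, part_of_speech, boundary) triples."""
    return TOKEN_PATTERN.findall(annotated)


def extract_terms(triples):
    """Return the words whose term boundary mark is B."""
    return [word for word, _pos, boundary in triples if boundary == "B"]


annotated = "in/v/O artificial intelligence/nx/B field/nx/I"
triples = parse_annotation(annotated)
terms = extract_terms(triples)
# triples -> [('in', 'v', 'O'), ('artificial intelligence', 'nx', 'B'), ('field', 'nx', 'I')]
# terms   -> ['artificial intelligence']
```

In segmented Chinese text the words themselves contain no spaces, so a plain split on whitespace and "/" would also work; the lazy pattern is used here only because the translated example contains a two-word token.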
In the keyword extraction of step S200, the open-source TensorFlow tool is used: terms are extracted with TF-IDF, the extracted terms are ranked by weight, the terms in the top 20% are selected, and the 3 words before and the 3 words after each selected term are extracted as keywords for RNN training. Take the following text as an example: "Artificial intelligence has been raised from the industry level to the national policy level, and government promotion means that artificial intelligence technology and applications are expected to achieve breakthroughs in 2017. In the industrial field, riding the express train of artificial intelligence, the development direction of industrial robots plus artificial intelligence has become a consensus among robot enterprises at home and abroad, and a number of major companies have begun to tighten their strategic layout in this direction." The terms extracted by the above steps are: artificial/intelligent/industrial/robotic. The 3 words before and the 3 words after each term are then extracted, i.e.: artificial/intelligent/slave/industry/layer, push/let/artificial/intelligent/technology/and/application,/carry/landing/artificial/intelligent/express/drive,/and/in/industry/field,/carry … This yields a corpus of keywords for training the RNN deep learning network.
In step S300, the input and output of an RNN deep learning network can be related in various ways; in the present invention, the hidden layer of the RNN deep learning network adopts an RNN network and the output layer adopts a Softmax multilayer network. When a keyword is input, the trained RNN deep learning network predicts the next word of the keyword, and continues until the set output length is reached. All the extracted keywords are input into the RNN deep learning network for repeated training, and training stops when the loss function satisfies a certain condition, which yields a trained model (the term prediction model). Once training is complete, the term prediction model performs prediction automatically; for example, when "artificial" is input, the term prediction model predicts that the word after "artificial" is "intelligent". For the terms "artificial/intelligent/industrial/robot" from the above steps, the term prediction model finds the word that follows each word, giving the specific results: artificial intelligence, industrial robot, intelligent realization and robot enterprise. In this way, the common phrase collocations in the text corpus are extracted by the RNN deep learning network.
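The prediction step described above, a recurrent hidden layer followed by a Softmax output layer that scores candidate next words, can be sketched as a forward pass in plain Python. This is an assumption-laden illustration: a plain tanh recurrent cell is used because the text does not name a cell type, the weights are random rather than trained, and the hidden size and vocabulary size are arbitrary. Only the 128-dimensional input follows the embodiment.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained network would have learned these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def rnn_next_word_probs(word_vectors, w_xh, w_hh, w_hy):
    """Run the recurrent hidden layer over the sequence, then the softmax layer."""
    hidden = [0.0] * len(w_hh)
    for x in word_vectors:
        from_input = matvec(w_xh, x)
        from_state = matvec(w_hh, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
    return softmax(matvec(w_hy, hidden))


rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM, VOCAB_SIZE = 128, 16, 5  # only 128 comes from the text
w_xh = random_matrix(HIDDEN_DIM, EMBED_DIM, rng)
w_hh = random_matrix(HIDDEN_DIM, HIDDEN_DIM, rng)
w_hy = random_matrix(VOCAB_SIZE, HIDDEN_DIM, rng)

# Three placeholder keyword vectors stand in for a real embedded keyword sequence.
sequence = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(3)]
probs = rnn_next_word_probs(sequence, w_xh, w_hh, w_hy)
predicted_index = max(range(VOCAB_SIZE), key=probs.__getitem__)
```

The index with the highest probability is taken as the predicted next word. Generating a longer continuation would repeat this step, feeding each predicted word back in until the set output length is reached.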
In step S400, the term labeling result obtained in step S100 is combined with the term prediction result, for example: artificial intelligence (term), industrial robot (term), intelligent realization, robot enterprise. The CNN network is then trained, and its final softmax layer performs the classification, marking "artificial intelligence" and "industrial robot" as terms and the other two as their corresponding types. Finally, through the softmax classifier and the powerful feature extraction capability of the CNN, the CNN deep learning network automatically learns the characteristics of terms and realizes term extraction.
Compared with the prior art, the term extraction method based on the deep learning network provided by the embodiment of the invention has the following advantages:
(1) The RNN deep learning network has strong prediction capability on the sequence of word sequences, realizes the prediction of the next word of each word, and the CNN deep learning network has strong feature extraction function;
(2) Preliminary keyword extraction is performed on the target text to find frequently used words and form a new corpus, and the RNN deep learning network predicts the next word specifically for these frequently used words, which greatly improves computation speed and finds the frequently used words that reflect the subject characteristics of the target text;
(3) The invention segments the first 3 words and the last 3 words of the term, greatly reduces the capacity of the corpus and reduces the term extraction time.
Please refer to fig. 3, which is a schematic block diagram of a deep learning network-based term extraction apparatus according to an embodiment of the present invention, the apparatus includes:
the term labeling module 1 is used for carrying out term labeling on the target text;
the keyword extraction module 2 is used for performing word segmentation processing on the labeled target text to obtain a word segmentation text and extracting keywords from the word segmentation text;
the first training module 3 is configured to train a pre-established RNN deep learning network according to the keyword to obtain a term prediction model, and obtain a term prediction result corresponding to the target text output by the term prediction model;
and the second training module 4 is configured to train a pre-established CNN deep learning network according to the term prediction result and the term label corresponding to the target text, obtain a term extraction model, and obtain a term extraction result corresponding to the target text output by the term extraction model.
The invention combines the RNN and CNN deep learning networks into a deeper deep learning network. The RNN deep learning network, which is strong at predicting the order of word sequences, predicts the word that follows each word, and the CNN deep learning network, which has a strong feature extraction capability, extracts terms automatically. This effectively improves the term extraction rate, enables the extraction of Chinese terms from massive texts, and greatly improves the accuracy of term identification and extraction.
In an optional embodiment, the hidden layer of the RNN deep learning network is an RNN network, and the output layer thereof is a Softmax multilayer network;
the first training module 3 is configured to train the RNN network of the hidden layer in the RNN deep learning network by using the keyword, and input an output result of the RNN network to the Softmax multilayer network of the output layer of the RNN network for training.
In an alternative embodiment, the apparatus further comprises:
the word vector conversion module is used for carrying out word vector conversion on the extracted keywords to obtain a word sequence;
the first training module 3 is configured to train a pre-established RNN deep learning network by using the word sequence.
In an optional embodiment, the term labeling module 1 is configured to perform term labeling on a target text by using the HANLP open source tool; and the term labeling result of each word in the target text comprises a word, a part of speech and a term boundary.
In an alternative embodiment, the keyword extraction module 2 includes:
the segmentation processing unit is used for performing segmentation processing on the labeled target text by adopting an HANLP open source tool to obtain a segmentation text;
and the keyword acquisition unit is used for extracting the term words and a plurality of words positioned in front of and behind the term words from the word segmentation text according to the term labeling result of each word in the target text to obtain the keywords corresponding to the target text.
In an optional embodiment, the hidden layer of the CNN deep learning network adopts a CNN network, and the output layer thereof adopts a Softmax multilayer network;
the second training module 4 is configured to train a CNN network of a hidden layer in the CNN deep learning network by using the term prediction result and the term label corresponding to the target text, and input an output result of the CNN network to a Softmax multilayer network of an output layer of the CNN network for training.
The term extraction device based on the deep learning network according to this embodiment is the apparatus counterpart of the term extraction method based on the deep learning network according to the foregoing embodiment. Its implementation principle and technical effects are the same as those of that method and are not repeated here.
In another aspect, the present invention further provides a deep learning network-based term extraction apparatus, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the deep learning network-based term extraction method is implemented.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the deep learning network based term extraction device. For example, the computer program may be divided into a term labeling module 1, a keyword extraction module 2, a first training module 3, and a second training module 4, and each module has the following specific functions: the term labeling module 1 is used for carrying out term labeling on the target text; the keyword extraction module 2 is used for performing word segmentation processing on the labeled target text to obtain a word segmentation text and extracting keywords from the word segmentation text; the first training module 3 is configured to train a pre-established RNN deep learning network according to the keyword to obtain a term prediction model, and obtain a term prediction result corresponding to the target text output by the term prediction model; and the second training module 4 is configured to train the pre-established CNN deep learning network according to the term prediction result and the term label corresponding to the target text to obtain a term extraction model, and obtain a term extraction result corresponding to the target text output by the term extraction model.
The term extraction device based on the deep learning network can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The term extraction device based on the deep learning network can comprise a processor and a memory, but is not limited to the processor and the memory. It will be understood by those skilled in the art that the schematic diagram is merely an example of the term extraction apparatus based on the deep learning network, and does not constitute a limitation of the term extraction apparatus based on the deep learning network, and may include more or less components than those shown in the figure, or combine some components, or different components, for example, the term extraction apparatus based on the deep learning network may further include an input and output device, a network access device, a bus, and the like.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the deep learning network-based term extraction device and connects the various parts of the whole device through various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the deep learning network-based term extraction device by running or executing the computer programs and/or modules stored in the memory and calling data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the device, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
If the modules/units integrated in the deep learning network-based term extraction device are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals or telecommunications signals.
In another aspect, the embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where when the computer program runs, a device in which the computer-readable storage medium is located is controlled to perform the term extraction method based on a deep learning network as described above.
Compared with the prior art, the embodiment of the invention has the beneficial effects that: the term extraction method based on the deep learning network comprises the following steps: carrying out term annotation on the target text; performing word segmentation processing on the labeled target text to obtain a word segmentation text, and extracting keywords from the word segmentation text; training a pre-established RNN deep learning network according to the keywords to obtain a term prediction model, and obtaining a term prediction result corresponding to the target text output by the term prediction model; and training a pre-established CNN deep learning network according to the term prediction result and term label corresponding to the target text to obtain a term extraction model, and acquiring a term extraction result corresponding to the target text output by the term extraction model. The method integrates the RNN and the CNN deep learning networks to form a deeper deep learning network, and carries out term prediction and extraction on the target text according to the extracted keywords and the term labeling result of the target text, so that the term extraction rate can be effectively improved, and the extraction of Chinese terms of massive texts is realized.
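The claims describe the first-stage network as a recurrent hidden layer followed by a Softmax output layer. A minimal forward-pass sketch of that shape is given below. It assumes a simple (Elman-style) recurrent cell with tanh activation, a single Softmax layer, randomly initialised untrained weights, and hypothetical sizes (4-dimensional word vectors, 3 hidden units, 2 labels). None of these choices comes from this document, and no training is performed.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, w_in, w_rec, w_out):
    """Run a simple recurrent layer over a sequence of word vectors and
    return one softmax distribution over labels per time step."""
    hidden = [0.0] * len(w_rec)  # initial hidden state
    outputs = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        outputs.append(softmax(matvec(w_out, hidden)))
    return outputs


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
EMBED, HIDDEN, LABELS = 4, 3, 2  # hypothetical sizes; labels: non-term / term
w_in = random_matrix(HIDDEN, EMBED, rng)
w_rec = random_matrix(HIDDEN, HIDDEN, rng)
w_out = random_matrix(LABELS, HIDDEN, rng)

# Stand-in for the word sequence obtained by word-vector conversion.
word_sequence = [[rng.uniform(-1, 1) for _ in range(EMBED)] for _ in range(5)]
predictions = rnn_forward(word_sequence, w_in, w_rec, w_out)
```

Each entry of `predictions` is a probability distribution over the assumed labels for one word; in a trained model these per-word distributions would play the role of the term prediction result passed on to the second stage.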
It should be noted that the above-described embodiments of the apparatus are merely illustrative, where the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (8)

1. A term extraction method based on a deep learning network is characterized by comprising the following steps:
carrying out term annotation on the target text;
performing word segmentation processing on the labeled target text to obtain a word segmentation text, and extracting keywords from the word segmentation text;
training a pre-established RNN deep learning network according to the keywords to obtain a term prediction model, and obtaining a term prediction result corresponding to the target text output by the term prediction model;
training a pre-established CNN deep learning network according to term prediction results and term labels corresponding to the target text to obtain a term extraction model, and obtaining term extraction results corresponding to the target text output by the term extraction model;
wherein performing word segmentation processing on the labeled target text to obtain a word segmentation text, and extracting keywords from the word segmentation text, specifically comprises:
performing word segmentation processing on the labeled target text by using the HanLP open-source tool to obtain a word segmentation text;
extracting term words from the word segmentation text according to the term labeling result of each word in the target text;
and sorting all extracted term words by weight, selecting the N highest-weighted term words, and extracting a plurality of words located before and after those N term words to serve as the keywords of the target text.
2. The deep learning network-based term extraction method according to claim 1, wherein the training of the pre-established RNN deep learning network according to the keyword specifically comprises:
the hidden layer of the RNN deep learning network adopts an RNN network, and the output layer of the RNN deep learning network adopts a Softmax multilayer network;
and training the RNN of a hidden layer in the RNN deep learning network by using the keywords, and inputting an output result of the RNN into a Softmax multilayer network of an output layer of the RNN for training.
3. The deep learning network-based term extraction method according to claim 1 or 2, wherein before training the pre-established RNN deep learning network according to the keyword, the method further comprises:
performing word vector conversion on the extracted keywords to obtain a word sequence;
and training the pre-established RNN deep learning network by using the word sequence.
4. The deep learning network-based term extraction method as claimed in claim 1, wherein the term labeling of the target text specifically includes:
performing term labeling on the target text by using the HanLP open-source tool, wherein the term labeling result of each word in the target text comprises the word, its part of speech, and its term boundary.
5. The deep learning network-based term extraction method according to claim 1, wherein the training of the pre-established CNN deep learning network according to the term prediction result and the term label corresponding to the target text specifically comprises:
the hidden layer of the CNN deep learning network adopts a CNN network, and the output layer of the CNN deep learning network adopts a Softmax multilayer network;
and training a CNN network of a hidden layer in the CNN deep learning network by using a term prediction result and a term label corresponding to the target text, and inputting an output result of the CNN network into a Softmax multilayer network of an output layer of the CNN network for training.
6. A term extraction device based on a deep learning network is characterized by comprising:
the term labeling module is used for carrying out term labeling on the target text;
the keyword extraction module is used for performing word segmentation processing on the labeled target text to obtain a word segmentation text and extracting keywords from the word segmentation text;
the first training module is used for training a pre-established RNN deep learning network according to the keywords to obtain a term prediction model and obtain a term prediction result corresponding to the target text output by the term prediction model;
the second training module is used for training a pre-established CNN deep learning network according to term prediction results and term labels corresponding to the target text to obtain a term extraction model and obtain term extraction results corresponding to the target text output by the term extraction model;
the keyword extraction module comprises:
the word segmentation unit is used for performing word segmentation processing on the labeled target text by using the HanLP open-source tool to obtain a word segmentation text;
a keyword obtaining unit, configured to extract term words from the word segmentation text according to the term labeling result of each word in the target text, sort all extracted term words by weight, select the N highest-weighted term words, and extract a plurality of words located before and after those N term words to serve as the keywords of the target text.
7. A deep learning network-based term extraction apparatus comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the deep learning network-based term extraction method according to any one of claims 1 to 5 when executing the computer program.
8. A computer-readable storage medium, comprising a stored computer program, wherein when the computer program runs, a device in which the computer-readable storage medium is located is controlled to execute the deep learning network-based term extraction method according to any one of claims 1 to 5.
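The keyword selection recited in claim 1 (rank the term words by weight, keep the top N, then collect the words around them) can be sketched as follows. The function name `select_keywords`, the `window` parameter, and the example weights are hypothetical. The claims do not say how the weights are computed or how many neighbouring words are taken, so the weights are treated here as given input, and the sketch assumes the selected term words are kept together with their neighbours.

```python
from typing import Dict, List


def select_keywords(tokens: List[str], term_weights: Dict[str, float],
                    top_n: int, window: int) -> List[str]:
    """Keep the top_n term words by weight plus `window` words on each side."""
    # Record the first position of each term word so ties break deterministically.
    first_pos: Dict[str, int] = {}
    for i, tok in enumerate(tokens):
        if tok in term_weights and tok not in first_pos:
            first_pos[tok] = i

    # Sort by descending weight, then by first appearance.
    ranked = sorted(first_pos, key=lambda t: (-term_weights[t], first_pos[t]))
    top_terms = set(ranked[:top_n])

    # Collect the index of every selected term word and its neighbours.
    keep = set()
    for i, tok in enumerate(tokens):
        if tok in top_terms:
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            keep.update(range(start, stop))

    # Return the kept words in their original order, without duplicates by position.
    return [tokens[i] for i in sorted(keep)]


tokens = ["the", "deep", "learning", "network", "extracts",
          "terms", "from", "massive", "text"]
weights = {"network": 0.9, "terms": 0.7, "text": 0.2}  # assumed example weights
keywords = select_keywords(tokens, weights, top_n=2, window=1)
```

With these example inputs, the two highest-weighted term words are "network" and "terms", so the result covers each of them plus one word on either side; "text" is dropped because its weight ranks third.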

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811052429.XA CN109325226B (en) 2018-09-10 2018-09-10 Deep learning network-based term extraction method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811052429.XA CN109325226B (en) 2018-09-10 2018-09-10 Deep learning network-based term extraction method and device and storage medium

Publications (2)

Publication Number Publication Date
CN109325226A CN109325226A (en) 2019-02-12
CN109325226B CN109325226B (en) 2023-04-14

Family

ID=65264707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811052429.XA Active CN109325226B (en) 2018-09-10 2018-09-10 Deep learning network-based term extraction method and device and storage medium

Country Status (1)

Country Link
CN (1) CN109325226B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918654B (en) * 2019-02-21 2022-12-27 厦门一品威客网络科技股份有限公司 Logo paraphrasing method, device and medium
CN111611340A (en) * 2019-02-26 2020-09-01 广州慧睿思通信息科技有限公司 Information extraction method and device, computer equipment and storage medium
CN111861610A (en) * 2019-04-30 2020-10-30 北京嘀嘀无限科技发展有限公司 Data processing method and device, electronic equipment and storage medium
CN110134767B (en) * 2019-05-10 2021-07-23 云知声(上海)智能科技有限公司 Screening method of vocabulary
CN112100320B (en) * 2020-07-23 2023-09-26 安徽米度智能科技有限公司 Term generating method, device and storage medium
CN112509640B (en) * 2020-10-22 2022-08-19 复旦大学 Gene ontology item name generation method and device and storage medium
CN113240485A (en) * 2021-05-10 2021-08-10 北京沃东天骏信息技术有限公司 Training method of text generation model, and text generation method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106815194A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 Model training method and device and keyword recognition method and device
CN105654135A (en) * 2015-12-30 2016-06-08 成都数联铭品科技有限公司 Image character sequence recognition system based on recurrent neural network
CN107818080A (en) * 2017-09-22 2018-03-20 新译信息科技(北京)有限公司 Term recognition methods and device

Also Published As

Publication number Publication date
CN109325226A (en) 2019-02-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant