WO2020151218A1 - Method and device for generating an electric power professional lexicon, and storage medium - Google Patents

Method and device for generating an electric power professional lexicon, and storage medium

Info

Publication number
WO2020151218A1
WO2020151218A1 (PCT/CN2019/099862)
Authority
WO
WIPO (PCT)
Prior art keywords
electric power
candidate words
words
professional
word
Prior art date
Application number
PCT/CN2019/099862
Other languages
English (en)
French (fr)
Inventor
庄莉
王秋琳
宋立华
张垚
陈江海
Original Assignee
福建亿榕信息技术有限公司
国网信息通信产业集团有限公司
国网浙江省电力有限公司
国家电网有限公司
国网信通亿力科技有限责任公司
Priority date
Filing date
Publication date
Application filed by 福建亿榕信息技术有限公司, 国网信息通信产业集团有限公司, 国网浙江省电力有限公司, 国家电网有限公司, 国网信通亿力科技有限责任公司
Publication of WO2020151218A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri

Definitions

  • the present invention relates to the field of natural language processing, in particular to a method and device for generating a professional vocabulary in the power industry, and a computer storage medium.
  • Word segmentation is a basic but very important technology in natural language processing. In Chinese, the single character is the most basic semantic unit; characters carry meaning of their own, but their ideographic ability is weak and their meanings are scattered, while words express meaning more strongly and can describe a thing more precisely. Therefore, in natural language processing the word (including single characters that form words on their own) is usually the most basic processing unit. For Latin-script languages such as English, spaces serve as word-boundary markers, so words can generally be extracted simply and accurately. In Chinese, apart from punctuation marks, characters run together with no explicit word boundaries, which makes word extraction difficult. Chinese word segmentation methods fall roughly into two categories: dictionary-based segmentation and sequence-labeling segmentation based on statistical models.
  • Dictionary-based segmentation is the more common and efficient word segmentation approach, but its prerequisite is an existing lexicon.
  • Power grid companies have accumulated a large amount of professional corpora in the power industry. To make full use of these corpora through text analysis and mining technology, a reasonably accurate and complete power-industry lexicon is urgently needed.
  • Current language analysis and processing methods mainly fall into the following categories:
  • CRF (Conditional Random Field): an undirected graph model that computes the conditional probability of output nodes given input nodes.
  • X and Y respectively represent the joint distributed random variables of the observation sequence to be labeled and the corresponding labeled sequence
  • the conditional random field (X, Y) is an undirected graph model conditioned on the observation sequence X.
  • The goal of the conditional random field is to maximize the joint probability of the label sequence given the observation sequence to be labeled.
  • The common practice of Scheme 1 is to annotate professional-domain words in a manually selected corpus, train a machine-learning CRF model on the annotated corpus, and then feed professional corpus into the model so that it recognizes professional words.
  • Scheme 1 recognizes professional words fairly well, but it presupposes that domain experts first annotate a large number of professional words in the corpus to provide training data for the CRF model.
  • The disadvantage of this scheme is that industry experts must take part in labeling the training data, the amount of data to label is large, and the process is inefficient.
  • The lexicon generation method based on statistical principles does not rely on an existing lexicon. Based on word frequency, mutual information, and left/right information entropy features, it extracts from a large-scale corpus every text fragment that could form a word, regardless of whether it is a professional word or a common word. All extracted fragments are then filtered by thresholds to obtain the lexicon.
  • The lexicon generation process of Scheme 2 is completely unsupervised and does not require industry experts to annotate the corpus, making it efficient; this is its main advantage.
  • Its shortcoming is that filtering professional words by word frequency, mutual information, and left/right entropy alone yields low accuracy.
  • the embodiments of the present application provide a method, device and computer storage medium for generating a professional electric power lexicon, which can at least solve the problem that the electric power professional lexicon in the prior art is messy and inaccurate, requires manual participation and cannot meet actual needs.
  • An embodiment of the present application provides a method for generating an electric power professional lexicon, which includes the following steps: obtaining power-related corpus; segmenting the corpus to obtain candidate words; calculating the mutual information value of the candidate words; and deleting words whose mutual information value is less than a preset mutual information threshold.
  • the step of segmenting related corpus to obtain candidate words includes:
  • performing fixed-length segmentation, according to a preset length, on the short sentences delimited by Chinese punctuation, obtaining multiple fixed-length segmentation results;
  • for each fixed-length segmentation result, intercepting its first n characters to obtain candidate words, where n is the initial step size; n is then increased by a preset value and the interception is repeated until n equals the preset length.
  • the method further includes: performing left and right information entropy calculation on the candidate words, and deleting words whose left and right information entropy is less than a preset left and right information entropy threshold.
  • The method further includes: performing component part-of-speech tagging on the candidate words, and deleting part-of-speech combinations that, according to the component parts of speech, do not form words.
  • the method further includes: calculating the TF-IDF value of the candidate words, sorting the candidate words according to the TF-IDF value, and presenting the sorted result to the user.
  • the embodiment of the present application also provides an electric power professional word database generating device, including:
  • the acquisition module is configured to acquire power-related corpus
  • the word segmentation module is configured to segment the power-related corpus to obtain candidate words
  • the calculation module is configured to calculate the mutual information value of the candidate words
  • the deleting module is configured to delete words whose mutual information value is less than a preset mutual information value threshold.
  • the word segmentation module includes:
  • the segmentation sub-module is configured to perform fixed-length segmentation, according to the preset length, on the short sentences delimited by Chinese punctuation, obtaining the fixed-length segmentation results;
  • the sub-module is further configured to intercept the first n characters of each fixed-length segmentation result to obtain candidate words, where n is the initial step size; n is then increased by a preset value and the interception is repeated until n equals the preset length.
  • the device further includes a left and right information entropy calculation module configured to perform left and right information entropy calculations on the candidate words, and delete words whose left and right information entropy is less than a preset left and right information entropy threshold.
  • The device further includes a tagging-deletion module configured to perform component part-of-speech tagging on the candidate words and delete part-of-speech combinations that do not form words.
  • the device further includes a sorting module configured to calculate the TF-IDF value of the candidate words, sort the candidate words according to the TF-IDF value, and present the sorted result to the user.
  • The embodiments of the present invention can at least segment the electric power professional corpus and perform the related calculations on the segmentation results, making the electric power professional lexicon more accurate and more practical without manual participation.
  • Deleting words whose mutual information value is below the mutual information threshold improves the efficiency of screening candidate words.
  • FIG. 1 is a schematic diagram of the implementation process of Embodiment 1 of the method for generating a professional electric power word database provided by the present invention
  • FIG. 2 is a schematic diagram of the implementation process of Embodiment 2 of the method for generating a professional electric power word database provided by the present invention;
  • Fig. 3 is a schematic diagram of the composition structure of the electric power professional word database generating device provided by the present invention.
  • Fig. 4 is a schematic diagram of the hardware structure of the electric power professional word database generating device provided by the present invention.
  • FIG. 1 is a schematic diagram of an implementation process of an embodiment of a method for generating a professional word database for electric power, including the following steps:
  • the threshold value of the mutual information value is preset and obtained based on experience, and may be a specific value or a range value, and is not specifically limited.
  • the segmentation of the power-related corpus is a full segmentation, which can be implemented in the following ways:
  • The short sentences delimited by Chinese punctuation are segmented to obtain the fixed-length segmentation results. For example, the obtained document is first split into short sentences at Chinese punctuation marks, and N-gram segmentation is then performed on each short sentence (an N-gram window size of 6 to 8 is recommended).
  • The advantage of this processing is that the fixed-length segmentation results are more comprehensive, and the chosen window size both allows further sub-segmentation and retains more usable results. For example, performing N-gram (fixed-length) segmentation with a window of 6 on the sentence "pole-changing induction motor with wound rotor" produces the following fixed-length segmentation results:
  • The first n characters of each fixed-length segmentation result are intercepted to obtain candidate words, where n is the initial step size and a positive integer; n is then increased by a preset value and the interception is repeated until n equals the preset length.
  • The step size is used to further refine the fixed-length segmentation results.
  • Strings of up to 6 characters are taken as candidate words, and the preset increment of n can be 1 or another integer. For example, re-segmenting "wound rotor" produces the following segmentation results:
  • The candidate words obtained through fixed-length segmentation followed by step-wise sub-segmentation are numerous and complete; splitting the operation into these two main steps also saves computing resources and avoids over-long candidate results.
  • In this way the validity of the candidate words for the electric power profession is effectively improved.
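The two-step full segmentation described above (split at punctuation, slide an N-gram window, then take prefixes of growing length) can be sketched in Python. This is a minimal illustration, not the patent's implementation; the function name and the punctuation set are assumptions:

```python
import re

def candidate_words(text, window=6, min_len=2):
    """Full segmentation: split the text into short sentences at
    punctuation, then emit every substring whose length is between
    min_len and window characters (the text recommends a window of
    6 to 8; min_len=2 mirrors the step size starting at 2)."""
    candidates = []
    # Split at Chinese and ASCII punctuation / whitespace (illustrative set).
    for clause in re.split(r"[，。；：、！？,.;:!?\s]+", text):
        for i in range(len(clause)):                # N-gram window start
            for n in range(min_len, window + 1):    # step size 2..window
                if i + n <= len(clause):
                    candidates.append(clause[i:i + n])
    return candidates
```

The result is deliberately over-complete; the mutual information and entropy filters below are what prune it down to plausible words.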
  • the mutual information value reflects the tightness of the combination of the various characters or words that make up the current word.
  • The calculation formula is as follows: MI(x, y) = log(p(x, y) / (p(x) · p(y))), where p(x) and p(y) are the probabilities that the components x and y of the candidate word appear separately, and p(x, y) is the probability that x and y appear together.
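As a sketch, the mutual information of a two-character candidate can be estimated from raw character and bigram frequencies. The estimation below is an assumption for illustration; the patent fixes the formula but not how p(x), p(y), and p(x, y) are estimated:

```python
import math
from collections import Counter

def mutual_information(word, corpus):
    """PMI of a two-character candidate: log(p(x,y) / (p(x) * p(y))).
    Probabilities are estimated from character and bigram counts over
    `corpus`, a plain string. Higher values mean the two characters
    combine more tightly than chance would predict."""
    assert len(word) == 2
    chars = Counter(corpus)
    bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    p_x = chars[word[0]] / len(corpus)
    p_y = chars[word[1]] / len(corpus)
    p_xy = bigrams[word] / (len(corpus) - 1)
    return math.log(p_xy / (p_x * p_y))
```

Candidates scoring below the preset threshold are deleted, which discards loose character pairings that merely co-occur by accident.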
  • the generating method further includes:
  • S104 Perform left and right information entropy calculation on the candidate words, set a left and right information entropy threshold, and delete words whose left and right information entropy is less than the threshold.
  • A true word can be used in many contexts, so it has a rich variety of left and right neighbors.
  • Information entropy expresses the richness of a word's left and right combinations in the corpus. Filtering by left/right information entropy thresholds removes cases where a fragment of a fixed phrase would otherwise be kept as a candidate word on its own, improving the industry applicability of the electric power professional lexicon.
  • The final lexicon is thereby more sound.
  • the left and right information entropy thresholds are preset and obtained based on experience, and may be specific numerical values or range values, and are not specifically limited.
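The left and right information entropy of a candidate can be computed over its neighboring characters. A minimal sketch follows; extracting neighbors by plain substring search is an assumption for illustration:

```python
import math
from collections import Counter

def lr_entropy(word, corpus):
    """Return (left entropy, right entropy) of `word` in `corpus`.
    A word used in many contexts has varied neighbors and therefore
    high entropy; a fragment locked inside a fixed phrase has nearly
    constant neighbors and entropy close to 0, so it gets filtered."""
    lefts, rights = [], []
    pos = corpus.find(word)
    while pos != -1:
        if pos > 0:
            lefts.append(corpus[pos - 1])
        end = pos + len(word)
        if end < len(corpus):
            rights.append(corpus[end])
        pos = corpus.find(word, pos + 1)

    def entropy(neighbors):
        total = len(neighbors)
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log(c / total)
                    for c in Counter(neighbors).values())

    return entropy(lefts), entropy(rights)
```

Candidates whose left or right entropy falls below the preset threshold are deleted, as described above.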
  • the generating method may further include:
  • S106 Perform component part-of-speech tagging on candidate words, and delete unformed part-of-speech combinations according to the component parts of speech.
  • the part-of-speech tagging tool can use existing technology.
  • Commonly used segmentation tools with part-of-speech tagging include jieba, nltk, HanLP, Ansj, etc. They are used to tag the parts of speech of the components of each candidate word, yielding the candidate's part-of-speech combination. Deletion is then performed according to that combination: combinations with a high probability of forming a word include noun + noun, verb + noun, and noun + verb, while combinations with a low probability include verb + verb, preposition + noun, preposition + verb, and adverb + verb. By designing a deletion rule table, candidates whose component parts of speech are verb + verb, preposition + noun, preposition + verb, adverb + verb, and so on can be deleted.
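The deletion rule table over component part-of-speech combinations can be sketched as follows. The single-letter tag codes (n, v, p, d) follow the jieba/ICTCLAS convention, and the rule set and data shape are illustrative assumptions; the tags themselves would come from an external tagger such as jieba or HanLP, which is not invoked here:

```python
# Combinations with a low probability of forming a word, per the rule
# table described above: verb+verb, preposition+noun, preposition+verb,
# adverb+verb (tag codes: n=noun, v=verb, p=preposition, d=adverb).
BAD_COMBOS = {("v", "v"), ("p", "n"), ("p", "v"), ("d", "v")}

def filter_by_pos(tagged_candidates):
    """Keep candidates whose component POS pattern is not in the
    deletion rule table. `tagged_candidates` is a list of
    (word, (tag, tag, ...)) pairs from a POS tagger."""
    return [word for word, tags in tagged_candidates
            if tuple(tags) not in BAD_COMBOS]
```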
  • the generating method further includes:
  • TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique.
  • TF-IDF is a statistical method used to evaluate the importance of a word to a document set or one of the documents in a corpus. The importance of a word increases in a positive correlation with the number of times it appears in the document, but at the same time it decreases in a negative correlation with the frequency of its appearance in the corpus.
  • The calculation formula is as follows: IDF(w) = log(N / N_w) and TF-IDF(w) = TF(w) × IDF(w), where N represents the total number of documents in the corpus and N_w represents the number of documents containing the word w.
  • The candidate words are sorted by the above TF-IDF value, and the sorted result is presented to the user with the most important candidates ranked first, which further improves the user experience.
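A sketch of the TF-IDF ranking step. The +1 smoothing in the IDF denominator is an assumption added to avoid division by zero (the patent states only N and N_w), and representing documents as plain strings is likewise illustrative:

```python
import math

def tfidf_rank(candidates, documents):
    """Rank candidate words by TF-IDF = tf(w) * log(N / (N_w + 1)),
    where N is the number of documents and N_w the number of
    documents containing w. Terms spread across every document score
    low; terms concentrated in few documents score high."""
    n_docs = len(documents)
    scores = {}
    for w in candidates:
        tf = sum(doc.count(w) for doc in documents)    # raw term frequency
        n_w = sum(1 for doc in documents if w in doc)  # document frequency
        scores[w] = tf * math.log(n_docs / (n_w + 1))
    return sorted(candidates, key=lambda w: scores[w], reverse=True)
```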
  • the generating method of the embodiment of the present invention includes:
  • S200: Obtain power-related corpus and segment it: perform fixed-length segmentation on the short sentences delimited by Chinese punctuation to obtain the fixed-length segmentation results. For example, first split the obtained document at Chinese punctuation, then perform N-gram segmentation on the resulting short sentences (N-gram window size 6), and finally sub-segment each fixed-length result with step sizes from 2 to 6 to obtain the candidate words.
  • S202 Perform mutual information value calculation on candidate words, set a mutual information value threshold, and delete words whose mutual information value is less than the mutual information value threshold.
  • S204 Perform left and right information entropy calculation on the candidate words, set a left and right information entropy threshold, and delete words whose left and right information entropy is less than the threshold.
  • S206 Perform component part-of-speech tagging on candidate words, and delete unformed part-of-speech combinations according to the component parts of speech.
  • S208 Calculate the TF-IDF value of the candidate words, and sort the candidate words according to the TF-IDF value.
  • the generation method of the embodiment of the present invention includes the following steps:
  • S200: Obtain power-related corpus and segment it. Specifically: perform fixed-length segmentation on the short sentences delimited by Chinese punctuation to obtain the fixed-length segmentation results. For example, first split the obtained document at Chinese punctuation, then perform N-gram segmentation on the resulting short sentences (N-gram window size 4), and finally sub-segment each fixed-length result with step sizes from 2 to 4 to obtain the candidate words.
  • S202 Perform mutual information value calculation on candidate words, set a mutual information value threshold, and delete words whose mutual information value is less than the mutual information value threshold.
  • S204 Perform left and right information entropy calculation on the candidate words, set a left and right information entropy threshold, and delete words whose left and right information entropy is less than the threshold.
  • S206 Perform component part-of-speech tagging on candidate words, and delete unformed part-of-speech combinations according to the component parts of speech.
  • S208 Calculate the TF-IDF value of the candidate words, and sort the candidate words according to the TF-IDF value.
  • the method in the embodiment of the present invention includes the following steps:
  • S200: Obtain power-related corpus and segment it: perform fixed-length segmentation on the short sentences delimited by Chinese punctuation to obtain the fixed-length segmentation results. For example, first split the obtained document at Chinese punctuation, then perform N-gram segmentation on the resulting short sentences (N-gram window size 8), and finally sub-segment each fixed-length result with step sizes from 2 to 8 to obtain the candidate words.
  • S202 Perform mutual information value calculation on candidate words, set a mutual information value threshold, and delete words whose mutual information value is less than the mutual information value threshold.
  • S204 Perform left and right information entropy calculation on the candidate words, set a left and right information entropy threshold, and delete words whose left and right information entropy is less than the threshold.
  • S206 Perform component part-of-speech tagging on candidate words, and delete unformed part-of-speech combinations according to the component parts of speech.
  • S208 Calculate the TF-IDF value of the candidate words, and sort the candidate words according to the TF-IDF value.
  • an embodiment of the present application also provides an apparatus for generating a professional electric power thesaurus.
  • the apparatus includes: an acquisition module 301, a word segmentation module 302, a calculation module 303, and a deletion module 304; among them,
  • the obtaining module 301 is configured to obtain power-related corpus
  • the word segmentation module 302 is configured to segment the power-related corpus to obtain candidate words
  • the calculation module 303 is configured to perform mutual information value calculation on candidate words
  • the deleting module 304 is configured to delete words whose mutual information value is less than a preset mutual information value threshold.
  • the word segmentation module further includes:
  • the word segmentation sub-module is configured to perform fixed-length word segmentation on the short sentences segmented from Chinese symbols according to the preset length, and obtain the fixed-length word segmentation result;
  • the segmentation sub-module is configured to intercept the first n characters of each fixed-length segmentation result to obtain candidate words, where n is the initial step size and a positive integer; n is then increased by a preset value and the interception is repeated until n equals the preset length.
  • the device further includes a left and right information entropy calculation module configured to perform left and right information entropy calculation on the candidate words, and delete words whose left and right information entropy is less than a preset left and right information entropy threshold.
  • The device further includes a tagging-deletion module configured to perform component part-of-speech tagging on the candidate words and delete part-of-speech combinations that do not form words.
  • the device further includes a sorting module configured to calculate the TF-IDF value of the candidate words, sort the candidate words according to the TF-IDF value, and present the sorted result to the user.
  • the electric power professional word library generating device provided in the above embodiment and the electric power professional word library generating method embodiment belong to the same concept.
  • The acquisition module 301, the word segmentation module 302, the calculation module 303, and the deletion module 304 can all be implemented by a digital signal processor (DSP), a central processing unit (CPU), a field-programmable gate array (FPGA), a microcontroller unit (MCU), or the like.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, it is used to perform at least the steps of the method shown in FIG. 1 or FIG. 2.
  • the computer-readable storage medium may specifically be a memory.
  • the memory may be the memory 42 shown in FIG. 4.
  • FIG. 4 is a schematic diagram of the hardware structure of an electric power professional word database generating device according to an embodiment of the application.
  • the device includes: a communication component 43 for data transmission, at least one processor 41, and a memory 42 for storing a computer program that can run on the processor 41.
  • the various components in the terminal are coupled together through the bus system 44. It can be understood that the bus system 44 is used to implement connection and communication between these components.
  • the bus system 44 also includes a power bus, a control bus, and a status signal bus. However, for the sake of clear description, various buses are marked as the bus system 44 in FIG. 4.
  • the processor 41 executes at least the steps of the method shown in FIG. 1 or FIG. 2 when executing the computer program.
  • the memory 42 may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memory.
  • The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the magnetic surface memory may be a magnetic disk memory or a magnetic tape memory.
  • the volatile memory may be a random access memory (RAM, Random Access Memory), which is used as an external cache.
  • By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), SyncLink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM).
  • the memory 42 described in the embodiment of the present application is intended to include, but is not limited to, these and any other suitable types of memory.
  • the method disclosed in the foregoing embodiment of the present application may be applied to the processor 41 or implemented by the processor 41.
  • the processor 41 may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 41 or instructions in the form of software.
  • the aforementioned processor 41 may be a general-purpose processor, a DSP, or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, and the like.
  • the processor 41 may implement or execute various methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or any conventional processor.
  • the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium, and the storage medium is located in the memory 42.
  • the processor 41 reads the information in the memory 42 and completes the steps of the foregoing method in combination with its hardware.
  • The above device may be implemented by one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), FPGAs, general-purpose processors, controllers, MCUs, microprocessors, or other electronic components, so as to perform the aforementioned method for generating the electric power professional lexicon.
  • this application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
  • the device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are executed on the computer or other programmable equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • At least, the electric power professional corpus can be segmented, and the related calculations, such as mutual information and left/right information entropy, can be performed on the segmentation results to obtain a more accurate and more practical electric power professional lexicon without manual participation.
  • Deleting candidates whose mutual information value is below the threshold removes character combinations that are unlikely to form words, as well as combinations that do not form compound words, improving the effectiveness of candidate screening in the embodiments of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A method and device for generating an electric power professional lexicon, the method comprising the following steps: obtaining power-related corpus; segmenting the corpus to obtain candidate words; calculating the mutual information value of the candidate words; setting a mutual information threshold; and deleting words whose mutual information value is less than the threshold. This solves the problem in the prior art that electric power professional lexicons are messy, inaccurate, and unable to meet actual needs.

Description

Method and device for generating an electric power professional lexicon, and storage medium
Cross-reference to related applications
This application is based on, and claims priority to, Chinese patent application No. 2019012201943140 filed on January 22, 2019, the content of which is incorporated herein by reference.
技术领域
本发明涉及自然语言处理领域,尤其涉及一种电力行业中专业词库的生成方法及装置、计算机存储介质。
Background art
Word segmentation is a basic but very important technique in natural language processing. In Chinese, the single character is the most elementary semantic unit; although a character carries meaning of its own, its expressive power is weak and its meaning diffuse, whereas a word expresses meaning more strongly and describes a thing more precisely. In natural language processing, therefore, the word (including single-character words) is normally the basic processing unit. For English and other Latin-script languages, spaces mark word boundaries, so words can generally be extracted simply and accurately. In Chinese, apart from punctuation, characters run together without explicit word boundaries, which makes word extraction difficult. Chinese word segmentation methods fall roughly into two categories: dictionary-based segmentation and sequence-labeling segmentation based on statistical models. Dictionary-based segmentation is the more common and efficient approach, but it presupposes a lexicon. Power grid companies have by now accumulated a large volume of professional corpora for the electric power industry; to exploit these corpora fully through text analysis and mining, an accurate and complete electric power industry lexicon is urgently needed.
Current language analysis and processing methods are mainly based on the following approaches:
1. Professional word discovery based on the CRF (conditional random field) algorithm
A CRF (conditional random field) is an undirected graphical model that computes the conditional probability of output nodes given input nodes. Let X and Y be jointly distributed random variables denoting the observation sequence to be labeled and the corresponding label sequence; the conditional random field (X, Y) is then an undirected graphical model conditioned on the observation sequence X. The objective of the conditional random field is to maximize the joint probability of the label sequence given the observation sequence to be labeled. The usual practice in scheme 1 is to annotate domain terms in a manually curated corpus, train a CRF model by machine learning on the annotated corpus, and finally feed professional corpora to the model so that it recognizes the professional words.
Scheme 1 recognizes professional words fairly well, but on the precondition that domain experts first annotate a large number of professional words in the corpus to provide training data for the CRF model. Its shortcoming is that industry experts must participate in annotating the training data, the volume of data to annotate is large, and efficiency is low.
2. Professional lexicon generation based on statistical principles
Lexicon generation based on statistical principles does not depend on an existing lexicon. Typically, using word-frequency, mutual-information, and left/right information entropy features, it extracts from a large-scale corpus every text fragment that might form a word, whether it is a professional word or an ordinary one. All the extracted words are then filtered by thresholds to obtain the lexicon.
Scheme 2's lexicon generation is fully unsupervised; it requires no corpus annotation by industry experts and is efficient, which is its main advantage. Its shortcoming is that screening professional words solely by word-frequency, mutual-information, and left/right-entropy features suffers from low accuracy.
Summary of the invention
To this end, embodiments of the present application provide a method and device for generating an electric power professional lexicon, and a computer storage medium, which can at least solve the prior-art problems that electric power professional lexicons are disorganized and inaccurate, require manual participation, and cannot meet practical needs.
To achieve the above objective, an embodiment of the present application provides a method for generating an electric power professional lexicon, comprising the following steps:
obtaining an electric-power-related corpus;
segmenting the electric-power-related corpus to obtain candidate words;
calculating mutual information values for the candidate words;
deleting words whose mutual information value is below a preset mutual information threshold.
In the above scheme, the step of segmenting the related corpus to obtain candidate words comprises:
performing fixed-length segmentation, according to a preset fixed length, on the short clauses delimited by Chinese punctuation, to obtain a plurality of fixed-length segmentation results;
for each fixed-length segmentation result, taking its first n characters as a candidate word, where n is an initial step size; then increasing n by a preset value and repeating the truncation step until n equals the preset fixed length.
In the above scheme, the method further comprises: calculating left/right information entropy for the candidate words, and deleting words whose left/right information entropy is below a preset left/right information entropy threshold.
In the above scheme, the method further comprises: tagging the part of speech of each candidate word's components, and deleting part-of-speech combinations that do not form words according to the component parts of speech.
In the above scheme, the method further comprises: calculating TF-IDF values for the candidate words, ranking the candidate words by TF-IDF value, and presenting the ranked results to the user.
An embodiment of the present application further provides a device for generating an electric power professional lexicon, comprising:
an obtaining module configured to obtain an electric-power-related corpus;
a segmentation module configured to segment the electric-power-related corpus to obtain candidate words;
a calculation module configured to calculate mutual information values for the candidate words;
a deletion module configured to delete words whose mutual information value is below a preset mutual information threshold.
In the above scheme,
the segmentation module comprises:
a cutting sub-module configured to perform fixed-length segmentation, according to a preset fixed length, on the short clauses delimited by Chinese punctuation, to obtain fixed-length segmentation results;
a word sub-module configured to take the first n characters of a fixed-length segmentation result as a candidate word, where n is an initial step size, then increase n by a preset value and repeat the truncation step until n equals the preset fixed length.
In the above scheme, the device further comprises a left/right information entropy calculation module configured to calculate left/right information entropy for the candidate words and delete words whose left/right information entropy is below a preset left/right information entropy threshold.
In the above scheme, the device further comprises a tagging-and-deletion module configured to tag the parts of speech of each candidate word's components and delete part-of-speech combinations that do not form words according to the component parts of speech.
In the above scheme, the device further comprises a ranking module configured to calculate TF-IDF values for the candidate words, rank the candidate words by TF-IDF value, and present the ranked results to the user.
In contrast to the prior art, embodiments of the present invention can, at least, segment an electric power professional corpus and run the relevant calculations on the segmentation results, so that the electric power professional lexicon becomes more accurate and more practical, without manual participation. In particular, deleting words whose mutual information value is below the mutual information threshold improves the effectiveness of candidate word screening.
Brief description of the drawings
Figure 1 is a schematic flowchart of embodiment 1 of the electric power professional lexicon generation method provided by the present invention;
Figure 2 is a schematic flowchart of embodiment 2 of the electric power professional lexicon generation method provided by the present invention;
Figure 3 is a schematic diagram of the composition of the electric power professional lexicon generation device provided by the present invention;
Figure 4 is a schematic diagram of the hardware configuration of the electric power professional lexicon generation device provided by the present invention.
Detailed description of the embodiments
To make the objectives, technical schemes, and advantages of the present application clearer, the technical schemes in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope protected by the present application. Provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily. The steps shown in the flowcharts of the drawings may be executed in a computer system such as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from that given here.
To explain in detail the technical content, structural features, objectives, and effects of the technical scheme, a detailed description follows with reference to specific embodiments and the accompanying drawings.
Referring to Figure 1, a schematic flowchart of an embodiment of the electric power professional lexicon generation method, which comprises the following steps:
S (step) 100: obtain an electric-power-related corpus;
S102: segment the related corpus to obtain candidate words;
S104: calculate mutual information values for the candidate words;
S106: delete words whose mutual information value is below the mutual information threshold.
Here, the mutual information threshold is preset and derived from experience; it may be a specific value or a range of values, and is not specifically limited.
As an optional implementation, the segmentation of the electric-power-related corpus is full segmentation, which may specifically be realized as follows:
Perform fixed-length segmentation, according to a preset fixed length, on the short clauses delimited by Chinese punctuation, to obtain fixed-length segmentation results; for example, first split the obtained file at Chinese punctuation, then apply N-gram segmentation to the punctuation-split corpus (a window size of 6-8 is recommended). The benefit of this treatment is that the fixed-length segmentation results are more comprehensive, and the window size suffices for further subdivision while covering more usable results. As an example, applying N-gram segmentation with a window of 6 (the fixed length) to the sentence "绕线型转子的变极感应电动机" ("pole-changing induction motor with wound rotor") yields the following fixed-length segmentation results:
绕线型转子的
线型转子的变
型转子的变极
转子的变极感
子的变极感应
的变极感应电
变极感应电动
极感应电动机
In a further step, for each fixed-length segmentation result, the first n characters may be taken as a candidate word, where n is the initial step size, a positive integer; n is then increased by a preset value and the truncation step repeated until n equals the preset fixed length. Specifically, the segmentation step size serves to refine the fixed-length segmentation results further. Setting n from 2 up to the preset fixed length in the example above amounts to taking the first 2 to 6 characters of each fixed-length result as candidate words; the preset increment of n may be 1 or another integer. Segmenting "绕线型转子的", for instance, yields the following results:
绕线
绕线型
绕线型转
绕线型转子
绕线型转子的;
With the above scheme, the candidate words obtained through fixed-length segmentation followed by sub-segmentation are numerous and comprehensive; at the same time, splitting the operation into two main steps saves computing resources and avoids overly long candidate words. The net effect is to improve the usefulness of the electric power candidate words.
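The two-step scheme above (fixed-length N-gram cuts, then prefix truncation) can be sketched as follows. This is a minimal illustration, not the patented implementation itself; the function and parameter names are ours:

```python
def full_segment(text, window=6, min_len=2):
    """Full segmentation of a punctuation-free clause:
    step 1 slides a fixed-length (N-gram) window over the clause;
    step 2 takes every prefix of length min_len..window of each
    window as a candidate word."""
    # Step 1: fixed-length cuts (N-gram window)
    ngrams = [text[i:i + window] for i in range(len(text) - window + 1)]
    # Step 2: prefixes of each cut, from min_len up to the full window
    candidates = []
    for gram in ngrams:
        for n in range(min_len, len(gram) + 1):
            candidates.append(gram[:n])
    return ngrams, candidates

# The example from the text: a window of 6 over the 13-character clause
ngrams, candidates = full_segment("绕线型转子的变极感应电动机", window=6)
```

With a window of 6 this reproduces the eight fixed-length cuts listed above, and the first cut "绕线型转子的" contributes the five prefixes shown.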
The mutual information value is then computed for the above candidate words:
The mutual information value reflects how tightly the characters or words composing the current candidate bind together; the larger the mutual information value, the more likely the candidate is a word. The formula is:
MI(x, y) = log( p(x, y) / ( p(x) · p(y) ) )
where p(x) and p(y) are the probabilities that the character or word components x and y of the candidate occur separately, and p(x, y) is the probability that x and y occur together. Computing the mutual information value and deleting words whose value is below the mutual information threshold removes character combinations that with high probability do not form words, as well as combinations that do not form compound words, while improving the effectiveness of candidate word screening.
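A minimal sketch of this score over a raw character corpus follows. Taking the minimum over all binary splits of the candidate is our own choice (the text does not fix a split strategy), as are all names:

```python
import math

def candidate_mi(corpus, candidate):
    """MI = log( p(x, y) / (p(x) * p(y)) ) for a candidate split into
    parts x and y, using empirical substring probabilities.  The score
    is minimised over all binary splits so that one tight seam cannot
    hide a loose one."""
    n = len(corpus)
    def p(s):  # empirical probability of substring s in the corpus
        return sum(1 for i in range(n - len(s) + 1)
                   if corpus[i:i + len(s)] == s) / n
    p_xy = p(candidate)
    if p_xy == 0:
        return float("-inf")
    scores = []
    for k in range(1, len(candidate)):
        x, y = candidate[:k], candidate[k:]
        if p(x) > 0 and p(y) > 0:
            scores.append(math.log(p_xy / (p(x) * p(y))))
    return min(scores) if scores else float("-inf")

corpus = "电动机电动机电动机转子转子电转"
```

In this toy corpus the recurring "电动" binds more tightly than the boundary-straddling "机电", so a threshold between the two scores keeps the former and deletes the latter.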
As one implementation, the generation method further comprises:
S104: calculate left/right information entropy for the candidate words, set a left/right information entropy threshold, and delete words whose left/right information entropy is below that threshold. A word can usually be used in many contexts, so it has many left and right combinations; information entropy can express how rich a word's left and right combinations are in the corpus. Screening by a left/right entropy threshold removes cases in which part of a fixed phrase is taken as a candidate word on its own, improving the industry applicability of the electric power professional lexicon; the final lexicon is sounder as a result. Here, the left/right information entropy threshold is preset and derived from experience; it may be a specific value or a range of values, and is not specifically limited.
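The left/right entropy can be sketched as the Shannon entropy of the characters seen immediately next to each occurrence of the word. A minimal illustration under our own naming, not the patented implementation:

```python
import math
from collections import Counter

def boundary_entropy(corpus, word):
    """Shannon entropy of the characters immediately left / right of
    every occurrence of `word`.  Low entropy on a side means the word
    almost always has the same neighbour there -- a hint that it is a
    fragment of a longer fixed phrase rather than a free-standing word."""
    lefts, rights = Counter(), Counter()
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            lefts[corpus[start - 1]] += 1
        end = start + len(word)
        if end < len(corpus):
            rights[corpus[end]] += 1
        start = corpus.find(word, start + 1)
    def h(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log(c / total)
                    for c in counter.values()) if total else 0.0
    return h(lefts), h(rights)

corpus = "变压器检修变压器试验变压器运行"
```

Here "变压" is always followed by "器" (right entropy 0, a phrase fragment to delete), while "变压器" is followed by three different characters (right entropy log 3) and would survive the entropy threshold.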
In some other embodiments, the generation method may further comprise:
S106: tag the part of speech of each candidate word's components and delete part-of-speech combinations that do not form words. The POS tagging can rely on existing technology; common segmentation tools with POS tagging include jieba, nltk, HanLP, and Ansj. They tag the parts of speech of the sub-tokens within a candidate word, yielding the candidate's POS-combination profile, and deletion then proceeds by that profile. For example, combinations with a high probability of forming words include noun+noun, verb+noun, and noun+verb, whereas combinations with a low probability include verb+verb, preposition+noun, preposition+verb, and adverb+verb. A deletion rule table can be designed to remove candidates whose component parts of speech are verb+verb, preposition+noun, preposition+verb, adverb+verb, and so on, optimizing the candidate words, raising their effectiveness, and better achieving the construction of the electric power professional lexicon.
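The rule-table deletion can be sketched as below. The deny patterns and the ICTCLAS-style tags (n=noun, v=verb, p=preposition, d=adverb) are our assumptions, as are the example pairs; in practice the tag sequences would come from a tagger such as jieba, HanLP, or Ansj:

```python
# Hypothetical deny list mirroring the low-probability combinations in
# the text: verb+verb, preposition+noun, preposition+verb, adverb+verb.
DENY_PATTERNS = {("v", "v"), ("p", "n"), ("p", "v"), ("d", "v")}

def filter_by_pos(tagged_candidates):
    """Keep only candidates whose component-POS sequence is not in the
    deny table.  `tagged_candidates` holds (word, tag-sequence) pairs."""
    return [word for word, tags in tagged_candidates
            if tuple(tags) not in DENY_PATTERNS]

kept = filter_by_pos([
    ("感应电动机", ("v", "n")),  # verb+noun: plausible compound, kept
    ("进行检修", ("v", "v")),    # verb+verb: deny rule, dropped
    ("在变电站", ("p", "n")),    # preposition+noun: deny rule, dropped
])
```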
As one implementation, the generation method further comprises:
S108: calculate TF-IDF (Term Frequency-Inverse Document Frequency, a weighting technique) values for the candidate words and rank the candidates by TF-IDF value. TF-IDF is a statistical method for assessing how important a word is to one document in a document set or corpus. A word's importance increases with the number of times it appears in the document, but decreases with its frequency across the corpus. The formula is:
TF-IDF(w, d) = n(w, d) × log( N / N_w )
where n(w, d) denotes the number of times word w appears in corpus d, N denotes the total number of corpora, and N_w the number of corpora containing word w. Ranking the candidate words by the computed TF-IDF value and presenting the ranked results to the user, with the most important words first, further improves the user experience.
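The weighting can be sketched as follows (a minimal illustration of the formula above; treating each corpus as a list of tokens is our own choice):

```python
import math

def tf_idf(term, doc, docs):
    """TF-IDF(w, d) = n(w, d) * log(N / N_w): the raw count of the term
    in the document, damped by how many of the N documents contain it.
    A term present in every document scores 0 regardless of frequency."""
    tf = doc.count(term)
    n_w = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / n_w) if n_w else 0.0

docs = [["绕线", "转子", "的"], ["感应", "电动机", "的"], ["绕线", "的"]]
```

Here the function word "的" appears in every document and scores 0, while the domain word "绕线" scores log(3/2) > 0 and ranks ahead of it.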
As shown in Figure 2, the generation method of an embodiment of the invention comprises:
S200: obtain an electric-power-related corpus and segment it: perform fixed-length segmentation on the short clauses delimited by Chinese punctuation to obtain fixed-length segmentation results, for example by first splitting the obtained file at Chinese punctuation and then applying N-gram segmentation (window size 6) to the punctuation-split corpus; then sub-segment each fixed-length segmentation result with a step size of 2 to 6 to obtain candidate words.
S202: calculate mutual information values for the candidate words, set a mutual information threshold, and delete words whose mutual information value is below the threshold.
S204: calculate left/right information entropy for the candidate words, set a left/right information entropy threshold, and delete words whose left/right information entropy is below that threshold.
S206: tag the parts of speech of each candidate word's components and delete part-of-speech combinations that do not form words.
S208: calculate TF-IDF values for the candidate words and rank them by TF-IDF value.
As one implementation, the generation method of an embodiment of the invention comprises the following steps:
S200: obtain an electric-power-related corpus; segmenting the related corpus may specifically be: perform fixed-length segmentation on the short clauses delimited by Chinese punctuation to obtain fixed-length segmentation results, for example by first splitting the obtained file at Chinese punctuation and then applying N-gram segmentation (window size 4) to the punctuation-split corpus; then sub-segment each fixed-length segmentation result with a step size of 2 to 4 to obtain candidate words.
S202: calculate mutual information values for the candidate words, set a mutual information threshold, and delete words whose mutual information value is below the threshold.
S204: calculate left/right information entropy for the candidate words, set a left/right information entropy threshold, and delete words whose left/right information entropy is below that threshold.
S206: tag the parts of speech of each candidate word's components and delete part-of-speech combinations that do not form words.
S208: calculate TF-IDF values for the candidate words and rank them by TF-IDF value.
As one implementation, the method of an embodiment of the invention comprises the following steps:
S200: obtain an electric-power-related corpus and segment it: perform fixed-length segmentation on the short clauses delimited by Chinese punctuation to obtain fixed-length segmentation results, for example by first splitting the obtained file at Chinese punctuation and then applying N-gram segmentation (window size 8) to the punctuation-split corpus; then sub-segment each fixed-length segmentation result with a step size of 2 to 8 to obtain candidate words.
S202: calculate mutual information values for the candidate words, set a mutual information threshold, and delete words whose mutual information value is below the threshold.
S204: calculate left/right information entropy for the candidate words, set a left/right information entropy threshold, and delete words whose left/right information entropy is below that threshold.
S206: tag the parts of speech of each candidate word's components and delete part-of-speech combinations that do not form words.
S208: calculate TF-IDF values for the candidate words and rank them by TF-IDF value.
Moreover, deleting any of steps S202 to S208 can still achieve the effect, and the order of any of steps S202 to S206 may be interchanged.
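The Figure 2 flow through step S202 can be sketched end to end as follows (clause splitting, candidate generation, MI filtering). The threshold, the minimum-split scoring choice, and all names are our assumptions; the S204-S208 stages would filter and rank the survivors in the same pattern:

```python
import math

def generate_lexicon(corpus, window=6, mi_threshold=0.0):
    """End-to-end sketch of steps S200-S202: split at Chinese
    punctuation, full-segment each clause into candidates, score each
    candidate by its minimum-split mutual information, and keep those
    that clear the threshold."""
    clauses = [c for c in corpus.replace("，", "。").split("。") if c]
    # S200: N-gram cuts plus prefix candidates (lengths 2..window)
    candidates = set()
    for clause in clauses:
        for i in range(max(len(clause) - window + 1, 1)):
            gram = clause[i:i + window]
            for n in range(2, len(gram) + 1):
                candidates.add(gram[:n])
    text = "".join(clauses)
    def p(s):  # empirical substring probability over the joined text
        return sum(1 for i in range(len(text) - len(s) + 1)
                   if text[i:i + len(s)] == s) / len(text)
    # S202: keep candidates whose minimum-split MI clears the threshold
    kept = []
    for w in candidates:
        mis = [math.log(p(w) / (p(w[:k]) * p(w[k:])))
               for k in range(1, len(w)) if p(w[:k]) and p(w[k:])]
        if mis and min(mis) >= mi_threshold:
            kept.append(w)
    return kept

kept = generate_lexicon("电动机启动。电动机检修。电动机试验", mi_threshold=1.0)
```

On this toy corpus the recurring term "电动机" clears a threshold of 1.0 and enters the lexicon.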
In addition, an embodiment of the present application further provides a device for generating an electric power professional lexicon. As shown in Figure 3, the device comprises: an obtaining module 301, a segmentation module 302, a calculation module 303, and a deletion module 304; wherein,
the obtaining module 301 is configured to obtain an electric-power-related corpus;
the segmentation module 302 is configured to segment the electric-power-related corpus to obtain candidate words;
the calculation module 303 is configured to calculate mutual information values for the candidate words;
the deletion module 304 is configured to delete words whose mutual information value is below a preset mutual information threshold.
As one implementation, the segmentation module further comprises:
a cutting sub-module configured to perform fixed-length segmentation, according to a preset fixed length, on the short clauses delimited by Chinese punctuation, to obtain fixed-length segmentation results;
a word sub-module configured to take the first n characters of a fixed-length segmentation result as a candidate word, where n is an initial step size, then increase n by a preset value and repeat the truncation step until n equals the preset fixed length; n is a positive integer.
As one implementation, the device further comprises a left/right information entropy calculation module configured to calculate left/right information entropy for the candidate words and delete words whose left/right information entropy is below a preset left/right information entropy threshold.
As one implementation, the device further comprises a tagging-and-deletion module configured to tag the parts of speech of each candidate word's components and delete part-of-speech combinations that do not form words according to the component parts of speech.
As one implementation, the device further comprises a ranking module configured to calculate TF-IDF values for the candidate words, rank the candidate words by TF-IDF value, and present the ranked results to the user.
The electric power professional lexicon generation device provided by the above embodiment shares the same conception as the method embodiments; its specific implementation process is detailed in the method embodiments and is not repeated here. The obtaining module 301, segmentation module 302, calculation module 303, and deletion module 304 can each be implemented by a digital signal processor (DSP), central processing unit (CPU), field-programmable gate array (FPGA), microcontroller (MCU), or the like.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program at least performs the steps of the method shown in Figure 1 or Figure 2. The computer-readable storage medium may specifically be a memory, such as the memory 42 shown in Figure 4.
An embodiment of the present application further provides an electric power professional lexicon generation device. Figure 4 is a schematic diagram of the hardware structure of this device. As shown in Figure 4, the device comprises: a communication component 43 for data transmission, at least one processor 41, and a memory 42 for storing a computer program runnable on the processor 41. The components in the terminal are coupled together by a bus system 44. It will be understood that the bus system 44 implements connection and communication among these components; besides a data bus, it includes a power bus, a control bus, and a status signal bus. For clarity, however, all the buses are labeled as the bus system 44 in Figure 4.
When executing the computer program, the processor 41 at least performs the steps of the method shown in Figure 1 or Figure 2.
It will be understood that the memory 42 may be volatile memory, non-volatile memory, or both. Non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), ferromagnetic random access memory (FRAM), flash memory, magnetic surface memory, an optical disc, or compact disc read-only memory (CD-ROM); magnetic surface memory may be disk storage or tape storage. Volatile memory may be random access memory (RAM), used as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as static RAM (SRAM), synchronous static RAM (SSRAM), dynamic RAM (DRAM), synchronous dynamic RAM (SDRAM), double data rate synchronous dynamic RAM (DDRSDRAM), enhanced synchronous dynamic RAM (ESDRAM), SyncLink dynamic RAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory 42 described in the embodiments of the present application is intended to include, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiments of the present application may be applied in, or implemented by, the processor 41. The processor 41 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by integrated logic circuits of hardware in the processor 41 or by instructions in the form of software. The processor 41 may be a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 41 may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the methods disclosed in the embodiments of the present application may be embodied directly as being completed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. A software module may reside in a storage medium located in the memory 42; the processor 41 reads the information in the memory 42 and completes the steps of the foregoing methods in combination with its hardware.
In an exemplary embodiment, the detection device may be implemented by one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), FPGAs, general-purpose processors, controllers, MCUs, microprocessors, or other electronic components, for performing the foregoing electric power professional lexicon generation method.
Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to its embodiments. It will be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
It should be noted that although the above embodiments have been described herein, they do not thereby limit the scope of patent protection of the present invention. Accordingly, changes and modifications made to the embodiments described herein based on the innovative concept of the present invention, or equivalent structural or process transformations made using the contents of the specification and drawings of the present invention, applying the above technical schemes directly or indirectly in other related technical fields, are all included within the scope of patent protection of the present invention.
Industrial applicability
In the embodiments of the present invention, at least, an electric power professional corpus can be segmented and the relevant calculations run on the segmentation results, such as mutual information values and left/right information entropy, so as to obtain a more accurate and more practical electric power professional lexicon without manual participation. In particular, deleting entries whose mutual information value is below the mutual information threshold removes character combinations that with high probability do not form words, as well as combinations that do not form compound words, improving the effectiveness of candidate word screening in the embodiments of the present invention.

Claims (12)

  1. A method for generating an electric power professional lexicon, comprising:
    obtaining an electric-power-related corpus;
    segmenting the electric-power-related corpus to obtain candidate words;
    calculating mutual information values for the candidate words;
    deleting words whose mutual information value is below a preset mutual information threshold.
  2. The method for generating an electric power professional lexicon according to claim 1, wherein segmenting the electric-power-related corpus to obtain candidate words comprises:
    performing fixed-length segmentation, according to a preset fixed length, on the short clauses delimited by Chinese punctuation, to obtain a plurality of fixed-length segmentation results;
    taking the first n characters of a fixed-length segmentation result as a candidate word, where n is an initial step size and a positive integer; increasing n by a preset value and repeating the truncation step until n equals the preset fixed length.
  3. The method for generating an electric power professional lexicon according to claim 1, wherein the method further comprises:
    calculating left/right information entropy for the candidate words;
    deleting words whose left/right information entropy is below a preset left/right information entropy threshold.
  4. The method for generating an electric power professional lexicon according to claim 1, wherein the method further comprises:
    tagging the part of speech of each candidate word's components, and deleting part-of-speech combinations that do not form words according to the component parts of speech.
  5. The method for generating an electric power professional lexicon according to any one of claims 1 to 4, wherein the method further comprises:
    calculating TF-IDF values for the candidate words;
    ranking the candidate words by TF-IDF value and presenting the ranked results.
  6. A device for generating an electric power professional lexicon, the device comprising:
    an obtaining module configured to obtain an electric-power-related corpus;
    a segmentation module configured to segment the electric-power-related corpus to obtain candidate words;
    a calculation module configured to calculate mutual information values for the candidate words;
    a deletion module configured to delete words whose mutual information value is below a preset mutual information threshold.
  7. The device for generating an electric power professional lexicon according to claim 6, wherein the segmentation module further comprises:
    a cutting sub-module configured to perform fixed-length segmentation, according to a preset fixed length, on the short clauses delimited by Chinese punctuation, to obtain fixed-length segmentation results;
    a word sub-module configured to take the first n characters of a fixed-length segmentation result as a candidate word, where n is an initial step size, then increase n by a preset value and repeat the truncation step until n equals the preset fixed length.
  8. The device for generating an electric power professional lexicon according to claim 6, wherein the device further comprises:
    a left/right information entropy calculation module configured to calculate left/right information entropy for the candidate words and delete words whose left/right information entropy is below a preset left/right information entropy threshold.
  9. The device for generating an electric power professional lexicon according to claim 6, wherein the device further comprises a tagging-and-deletion module configured to tag the parts of speech of each candidate word's components and delete part-of-speech combinations that do not form words according to the component parts of speech.
  10. The device for generating an electric power professional lexicon according to any one of claims 6 to 9, wherein the device further comprises a ranking module configured to calculate TF-IDF values for the candidate words, rank the candidate words by TF-IDF value, and present the ranked results.
  11. A computer storage medium storing computer-executable instructions, the computer-executable instructions being used to perform the method for generating an electric power professional lexicon according to any one of claims 1 to 5.
  12. A device for generating an electric power professional lexicon, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the steps of the method for generating an electric power professional lexicon according to any one of claims 1 to 5.
PCT/CN2019/099862 2019-01-22 2019-08-08 电力专业词库生成方法及装置、存储介质 WO2020151218A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910058614.8 2019-01-22
CN201910058614.8A CN109710947B (zh) 2019-01-22 2019-01-22 电力专业词库生成方法及装置

Publications (1)

Publication Number Publication Date
WO2020151218A1 true WO2020151218A1 (zh) 2020-07-30




Also Published As

Publication number Publication date
CN109710947A (zh) 2019-05-03
CN109710947B (zh) 2021-09-07


Legal Events

Code NENP: Non-entry into the national phase — Ref country code: DE
Code 122: PCT application non-entry in European phase — Ref document number: 19911530; Country of ref document: EP; Kind code of ref document: A1
Code 32PN: Public notification in the EP bulletin as address of the addressee cannot be established — Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 10.02.2023)