WO2019075967A1 - 企业名称识别方法、电子设备及计算机可读存储介质 - Google Patents


Info

Publication number
WO2019075967A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
sequence
label
specific
word vector
Application number
PCT/CN2018/076164
Other languages
English (en)
French (fr)
Inventor
徐冰
汪伟
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2019075967A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Definitions

  • The present application relates to the field of computer information technology, and in particular to an enterprise name recognition method, an electronic device, and a computer-readable storage medium.
  • Public opinion analysis requires structuring financial news, and the first step is to identify the enterprise entities in the news.
  • Traditional natural language processing techniques generally use conditional random fields or hidden Markov models for sequence modeling. However, these methods rely heavily on feature selection and generalize poorly. The enterprise name recognition methods in the prior art are therefore not well designed and need improvement.
  • In view of this, the present application proposes an enterprise name recognition method, an electronic device, and a computer-readable storage medium that automatically extract effective features through a combined LSTM+CRF model, utilize context information when tagging enterprise names, and make effective use of sentence-level label information in the tagging stage, thereby improving recognition precision and recall.
  • First, the present application provides an electronic device comprising a memory and a processor, the memory storing an enterprise name recognition system operable on the processor. When executed by the processor, the enterprise name recognition system implements the following steps:
  • receiving an input specific text sequence;
  • converting each Chinese character in the specific text sequence into a corresponding word vector, and inputting the converted word vectors into a recurrent neural network;
  • computing the state vector of each word vector through the recurrent neural network, and inputting the state vector of each word vector into a conditional random field;
  • converting the state vector of each word vector into a feature vector through a specific regression model, and decoding the feature vector of each word vector using the conditional random field and a preset ternary labeling rule to obtain a ternary label set for all Chinese characters in the specific text sequence, outputting the ternary label set of all Chinese characters through an optimal label sequence; and
  • identifying a specific enterprise name from the optimal label sequence according to the preset ternary labeling rule.
  • Preferably, the preset ternary labeling rule comprises: a first label indicating the first Chinese character of an enterprise name, a second label indicating the remaining Chinese characters of the enterprise name, and a third label indicating Chinese characters that do not belong to any enterprise name.
  • Preferably, the state vector includes a first hidden-layer state vector and a second hidden-layer state vector, and the calculation of the state vector includes: invoking the bidirectional long short-term memory module of the recurrent neural network, computing the first hidden-layer state vector of the current word vector from left to right based on the hidden-layer state vector of the preceding word vector, and computing the second hidden-layer state vector of the current word vector from right to left based on the hidden-layer state vector of the following word vector.
  • The calculation of the feature vector comprises: combining, through the specific regression model, the first hidden-layer state vector and the second hidden-layer state vector corresponding to each word vector to obtain the feature vector of each word vector.
  • Preferably, the optimal label sequence is obtained through a predetermined label sequence calculation formula, set as:

    s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

  • where X represents the feature vector of each word vector, y represents the label sequence to be predicted, n represents the number of Chinese characters in the specific text sequence, and i denotes the i-th Chinese character in the specific text sequence;
  • A represents a state transition matrix, A_{y_i, y_{i+1}} represents the probability of transitioning from the y_i-th label to the y_{i+1}-th label, and P_{i, y_i} represents the probability that the i-th Chinese character is marked with the y_i-th label; and
  • s(X, y) is the score that measures each candidate label sequence, and the optimal label sequence is obtained by maximizing s(X, y).
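As a concrete illustration of this scoring formula, the following is a minimal Python sketch. The label set, the matrices `A` and `P`, and all values are toy assumptions, not taken from the patent, and the start/end transition terms are omitted for simplicity:

```python
def sequence_score(A, P, y):
    """s(X, y): transition scores A[y_i][y_{i+1}] summed over adjacent
    label pairs, plus emission scores P[i][y_i] summed over positions."""
    transition = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    emission = sum(P[i][y[i]] for i in range(len(y)))
    return transition + emission

# Toy setup: labels 0=B, 1=I, 2=S; a four-character sequence.
A = [[0.1, 0.9, 0.0],   # transitions out of B
     [0.2, 0.6, 0.2],   # transitions out of I
     [0.5, 0.0, 0.5]]   # transitions out of S
P = [[0.8, 0.1, 0.1],   # per-character label scores
     [0.1, 0.8, 0.1],
     [0.1, 0.7, 0.2],
     [0.2, 0.1, 0.7]]
print(sequence_score(A, P, [0, 1, 1, 2]))  # score of the sequence B I I S
```

With these toy values the sequence B I I S scores higher than, for example, S S S S, which is the property the decoder exploits.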
  • Preferably, the identification of the specific enterprise name comprises: extracting, from the optimal label sequence, the Chinese characters corresponding to a consecutively labeled first label and all following second labels, and taking the extracted Chinese characters as the specific enterprise name.
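The extraction rule above can be sketched as follows. The helper name is hypothetical, and the character-level Chinese sentence is a plausible rendering of the patent's example "China Ping An has released a new product.", used here only for illustration:

```python
def extract_names(chars, tags):
    """Extract enterprise names: a first label 'B' followed by
    consecutive second labels 'I', per the preset ternary (B/I/S) rule."""
    names, i = [], 0
    while i < len(tags):
        if tags[i] == "B":
            j = i + 1
            while j < len(tags) and tags[j] == "I":
                j += 1
            names.append("".join(chars[i:j]))
            i = j
        else:
            i += 1
    return names

sentence = list("中国平安发布了新产品。")
tags = ["B", "I", "I", "I", "S", "S", "S", "S", "S", "S"]
print(extract_names(sentence, tags))  # ['中国平安']
```

Only the span tagged B I I I is returned; every character tagged S is ignored.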
  • In addition, the present application further provides an enterprise name recognition method applied to an electronic device, the method comprising:
  • receiving an input specific text sequence;
  • converting each Chinese character in the specific text sequence into a corresponding word vector, and inputting the converted word vectors into a recurrent neural network;
  • computing the state vector of each word vector through the recurrent neural network, and inputting the state vector of each word vector into a conditional random field;
  • converting the state vector of each word vector into a feature vector through a specific regression model, and decoding the feature vector of each word vector using the conditional random field and a preset ternary labeling rule to obtain a ternary label set for all Chinese characters in the specific text sequence, outputting the ternary label set of all Chinese characters through an optimal label sequence; and
  • identifying a specific enterprise name from the optimal label sequence according to the preset ternary labeling rule.
  • Preferably, the preset ternary labeling rule comprises: a first label indicating the first Chinese character of an enterprise name, a second label indicating the remaining Chinese characters of the enterprise name, and a third label indicating Chinese characters that do not belong to any enterprise name.
  • Preferably, the optimal label sequence is obtained through a predetermined label sequence calculation formula, set as:

    s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

  • where X represents the feature vector of each word vector, y represents the label sequence to be predicted, n represents the number of Chinese characters in the specific text sequence, and i denotes the i-th Chinese character in the specific text sequence;
  • A represents a state transition matrix, A_{y_i, y_{i+1}} represents the probability of transitioning from the y_i-th label to the y_{i+1}-th label, and P_{i, y_i} represents the probability that the i-th Chinese character is marked with the y_i-th label; and
  • s(X, y) is the score that measures each candidate label sequence, and the optimal label sequence is obtained by maximizing s(X, y).
  • Preferably, the identification of the specific enterprise name comprises: extracting, from the optimal label sequence, the Chinese characters corresponding to a consecutively labeled first label and all following second labels, and taking the extracted Chinese characters as the specific enterprise name.
  • Further, the present application provides a computer-readable storage medium storing an enterprise name recognition system, the enterprise name recognition system being executable by at least one processor to cause the at least one processor to perform the steps of the enterprise name recognition method described above.
  • Compared with the prior art, the electronic device, enterprise name recognition method, and computer-readable storage medium proposed by the present application automatically extract effective features through the combined LSTM+CRF model, utilize context information when identifying enterprise names, and make effective use of sentence-level label information in the tagging stage. Compared with traditional sequence modeling methods, the enterprise name recognition method proposed in the present application improves recognition precision and recall.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of an electronic device of the present application;
  • FIG. 2 is a schematic diagram of the program modules of an embodiment of the enterprise name recognition system in the electronic device of the present application;
  • FIG. 3 is a schematic flowchart of an embodiment of the enterprise name recognition method of the present application; and
  • FIG. 4 is an example diagram of enterprise name recognition in the present application.
  • Descriptions involving "first", "second", and the like in the present application are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature.
  • The technical solutions of the various embodiments may be combined with one another, provided such combinations can be realized by those of ordinary skill in the art; when a combination of technical solutions is contradictory or impossible to implement, that combination should be considered nonexistent and outside the scope of protection claimed by this application.
  • Referring to FIG. 1, which is a schematic diagram of an optional hardware architecture of the electronic device 2 of the present application, the electronic device 2 may include, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can communicate with one another through a system bus. Note that FIG. 1 shows only the electronic device 2 with components 21-23; not all illustrated components are required, and more or fewer components may be implemented instead.
  • The electronic device 2 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server, and may be an independent server or a server cluster composed of multiple servers.
  • The memory 21 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, or an optical disc.
  • In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as its hard disk or internal memory. In other embodiments, the memory 21 may be an external storage device of the electronic device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 2. Of course, the memory 21 may also include both an internal storage unit and an external storage device of the electronic device 2.
  • In this embodiment, the memory 21 is generally used to store the operating system and various application software installed on the electronic device 2, such as the program code of the enterprise name recognition system 20. The memory 21 may also be used to temporarily store various types of data that have been or are to be output.
  • In some embodiments, the processor 22 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is typically used to control the overall operation of the electronic device 2, such as performing control and processing related to data interaction or communication with the electronic device 2. In this embodiment, the processor 22 is configured to run program code or process data stored in the memory 21, for example to run the enterprise name recognition system 20.
  • The network interface 23 may comprise a wireless or wired network interface and is typically used to establish a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 connects the electronic device 2 to an external data platform through a network, establishing a data transmission channel and a communication connection between the electronic device 2 and the external data platform. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
  • Referring to FIG. 2, which is a program module diagram of an embodiment of the enterprise name recognition system 20 in the electronic device 2 of the present application, the enterprise name recognition system 20 may be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete the present application.
  • For example, the enterprise name recognition system 20 may be divided into a receiving module 201, a conversion module 202, a computing module 203, a labeling module 204, and an identification module 205.
  • A program module as used herein refers to a series of computer program instructions capable of performing a particular function, better suited than a whole program to describing the execution of the enterprise name recognition system 20 in the electronic device 2. The functions of program modules 201-205 are described in detail below.
  • The receiving module 201 is configured to receive an input specific text sequence.
  • In this embodiment, the specific text sequence is Chinese text, including Chinese characters and spaces, such as the news sentence "China Ping An has released a new product."
  • The conversion module 202 is configured to convert each Chinese character in the specific text sequence into a corresponding word vector and input the converted word vectors into a Recurrent Neural Network (RNN).
  • In this embodiment, the recurrent neural network uses Long Short-Term Memory (LSTM), preferably a bi-directional LSTM.
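The character-to-word-vector conversion can be pictured with a toy embedding table. The patent does not specify how the word vectors are obtained; the random vectors and the dimension below stand in for trained embeddings and are assumptions for illustration only:

```python
import random

random.seed(0)
EMB_DIM = 8  # toy dimension; the patent does not specify one

class CharEmbedding:
    """Map each Chinese character to a fixed-length word vector.
    Unseen characters get a freshly initialized vector (a common
    choice, not mandated by the patent)."""
    def __init__(self, dim):
        self.dim = dim
        self.table = {}

    def __call__(self, ch):
        if ch not in self.table:
            self.table[ch] = [random.uniform(-1, 1) for _ in range(self.dim)]
        return self.table[ch]

emb = CharEmbedding(EMB_DIM)
vectors = [emb(ch) for ch in "中国平安发布了新产品。"]
print(len(vectors), len(vectors[0]))  # 11 8
```

Each of the 11 characters (including the full stop) maps to one vector, and repeated characters map to the same vector.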
  • The computing module 203 is configured to calculate the state vector of each word vector through the recurrent neural network and input the state vector of each word vector into a Conditional Random Field (CRF).
  • In this embodiment, the state vector includes a first hidden-layer state vector and a second hidden-layer state vector, and its calculation specifically includes: invoking the bidirectional LSTM, computing the first hidden-layer state vector h_i of the current word vector from left to right based on the hidden-layer state vector of the preceding word vector, and computing the second hidden-layer state vector h_i' of the current word vector from right to left based on the hidden-layer state vector of the following word vector.
  • The first hidden-layer state vector h_i and the second hidden-layer state vector h_i' are essentially features of the originally input specific text sequence extracted automatically by the LSTM; this manner of feature extraction differs from traditional methods in that it does not depend on manual feature selection and generalizes well.
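The left-to-right and right-to-left recurrences that produce h_i and h_i' can be sketched as follows. A simplified tanh recurrent cell stands in for the LSTM cell (whose input/forget/output gates are omitted), and all dimensions and weights are toy assumptions:

```python
import math
import random

random.seed(1)
DIM = 4  # toy hidden size

def rnn_cell(W, U, b, x, h_prev):
    """Simplified recurrent cell: h = tanh(W x + U h_prev + b).
    A real LSTM cell adds gating, but the recurrence pattern is the same."""
    return [math.tanh(sum(W[k][j] * x[j] for j in range(len(x)))
                      + sum(U[k][j] * h_prev[j] for j in range(DIM))
                      + b[k])
            for k in range(DIM)]

def init(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def bidirectional_states(xs):
    """For each position i, compute h_i from the left context (via h_{i-1})
    and h_i' from the right context (via h_{i+1}), mirroring the patent's
    first and second hidden-layer state vectors."""
    in_dim = len(xs[0])
    Wf, Uf, bf = init(DIM, in_dim), init(DIM, DIM), [0.0] * DIM
    Wb, Ub, bb = init(DIM, in_dim), init(DIM, DIM), [0.0] * DIM
    h = [0.0] * DIM
    fwd = []
    for x in xs:                          # left to right: h_i
        h = rnn_cell(Wf, Uf, bf, x, h)
        fwd.append(h)
    h = [0.0] * DIM
    bwd = [None] * len(xs)
    for i in range(len(xs) - 1, -1, -1):  # right to left: h_i'
        h = rnn_cell(Wb, Ub, bb, xs[i], h)
        bwd[i] = h
    return fwd, bwd

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fwd, bwd = bidirectional_states(xs)
print(len(fwd), len(bwd), len(fwd[0]))  # 3 3 4
```

The key point is structural: h_i only ever sees positions to its left, h_i' only positions to its right, so together they cover the full context of position i.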
  • The labeling module 204 is configured to convert the state vector of each word vector into a feature vector X_i through a specific regression model (such as a softmax model), and to decode the feature vector X_i of each word vector using the conditional random field and the preset ternary labeling rule, obtaining the ternary label set (B, I, S) of all Chinese characters in the specific text sequence and outputting it through the optimal label sequence (denoted Y_i).
  • In this embodiment, the preset ternary labeling rule includes: the first label (e.g., "B") indicates the first Chinese character of an enterprise name, the second label (e.g., "I") indicates the remaining Chinese characters of the enterprise name, and the third label (e.g., "S") indicates Chinese characters that do not belong to any enterprise name.
  • Converting the state vector of each word vector into the feature vector X_i includes: combining, through the specific regression model (such as a softmax model), the first hidden-layer state vector h_i and the second hidden-layer state vector h_i' corresponding to each word vector to obtain the feature vector X_i of each word vector.
  • In this embodiment, the optimal label sequence Y_i is obtained through a predetermined label sequence calculation formula (Equation 1 below):

    s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}    (Equation 1)

  • where X represents the set of feature vectors X_i of the word vectors, i.e., X = (X_1, X_2, ..., X_n), y represents the label sequence to be predicted, and n is the number of Chinese characters in the specific text sequence;
  • A represents a state transition matrix, A_{y_i, y_{i+1}} represents the probability of transitioning from the y_i-th label to the y_{i+1}-th label, and P_{i, y_i} represents the probability that the i-th Chinese character is marked with the y_i-th label; and
  • s(X, y) is the score that measures each candidate label sequence; the best label sequence Y_i is obtained by maximizing s(X, y).
  • The CRF introduced in this embodiment actually models the output label triplets, performs the computation with dynamic programming, and finally produces the labels along the obtained optimal path; that is, the best label sequence Y_i is obtained by maximizing s(X, y).
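The dynamic-programming search mentioned here is conventionally a Viterbi decode; the following is a minimal sketch under that assumption, with a toy label set and toy `A` and `P` matrices (none of the values come from the patent):

```python
def viterbi(A, P):
    """Dynamic-programming (Viterbi) search for the label sequence
    maximizing s(X, y): transition scores A plus emission scores P."""
    n, m = len(P), len(P[0])
    score = [P[0][:]]                 # best path score ending in tag t at position 0
    back = []                         # backpointers for path recovery
    for i in range(1, n):
        row, ptr = [], []
        for t in range(m):
            prev = max(range(m), key=lambda s: score[i - 1][s] + A[s][t])
            row.append(score[i - 1][prev] + A[prev][t] + P[i][t])
            ptr.append(prev)
        score.append(row)
        back.append(ptr)
    t = max(range(m), key=lambda s: score[-1][s])
    path = [t]
    for ptr in reversed(back):        # trace the optimal path backwards
        t = ptr[t]
        path.append(t)
    return path[::-1]

labels = ["B", "I", "S"]
A = [[0.0, 1.0, 0.0],
     [0.0, 1.0, 1.0],
     [1.0, 0.0, 0.0]]
P = [[3.0, 0.0, 0.0],
     [0.0, 3.0, 0.0],
     [0.0, 3.0, 0.0],
     [0.0, 0.0, 3.0]]
print([labels[t] for t in viterbi(A, P)])  # ['B', 'I', 'I', 'S']
```

The search costs O(n * m^2) rather than the O(m^n) of enumerating every label sequence, which is what makes exact maximization of s(X, y) practical.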
  • The identification module 205 is configured to identify the specific enterprise name from the optimal label sequence according to the preset ternary labeling rule. Specifically, the identification comprises: extracting, from the optimal label sequence, the Chinese characters corresponding to a consecutively labeled first label and all following second labels, and taking the extracted Chinese characters as the specific enterprise name.
  • For example, for the input sentence "China Ping An has released a new product.", the optimal label sequence output by the LSTM+CRF model adopted in this application is {B, I, I, I, S, S, S, S, S, S}, which means that the enterprise name identified from the specific text sequence is "China Ping An", i.e., the Chinese characters corresponding to the consecutively labeled first label and second labels (B, I, I, I).
  • The final step of the LSTM+CRF model is to maximize s(X, y) to obtain the best label sequence {B, I, I, I, S, S, S, S, S, S}; that is, the s(X, y) of the optimal label sequence {B, I, I, I, S, S, S, S, S, S} is larger than that of any other sequence, so {B, I, I, I, S, S, S, S, S, S} is determined to be the best label sequence. Here {B, I, I, I} represents "China Ping An": these four characters form the enterprise name, because according to the preset ternary labeling rule, B represents the first Chinese character of an enterprise name, I represents its remaining Chinese characters, and S indicates Chinese characters that do not belong to any enterprise name.
  • It should be noted that the method adopted in the present application is applicable to both model training and model application. When training the model, the enterprise names serve as sample data, and the expected model output (the enterprise names) is known reference data. The LSTM model is trained to obtain parameters and variables such as the model's spatial dimensions and coefficient matrices, and the model is then adjusted against the known reference data until training yields a sufficiently reliable model. When the model is applied, a sentence to be analyzed is input into the model, and the enterprise names contained in it can be predicted.
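The patent does not spell out the training objective. Assuming standard CRF practice (an assumption, not the patent's text), training maximizes the log-likelihood log p(y|X) = s(X, y) - log Z(X), where the partition term Z(X) sums exp(s(X, y')) over all candidate label sequences and is computed efficiently with the forward algorithm:

```python
import math

def logsumexp(vals):
    """Numerically stable log(sum(exp(v))) over a list of scores."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def log_partition(A, P):
    """Forward algorithm: log of the sum of exp(s(X, y')) over ALL
    label sequences y', without enumerating them."""
    alpha = P[0][:]
    for i in range(1, len(P)):
        alpha = [logsumexp([alpha[s] + A[s][t] for s in range(len(alpha))]) + P[i][t]
                 for t in range(len(P[i]))]
    return logsumexp(alpha)

def log_likelihood(A, P, y):
    """log p(y | X) = s(X, y) - log Z(X); training would maximize this
    over the reference label sequences."""
    score = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1)) \
          + sum(P[i][y[i]] for i in range(len(y)))
    return score - log_partition(A, P)

A = [[0.0, 1.0, 0.0],
     [0.0, 1.0, 1.0],
     [1.0, 0.0, 0.0]]
P = [[3.0, 0.0, 0.0],
     [0.0, 3.0, 0.0],
     [0.0, 3.0, 0.0],
     [0.0, 0.0, 3.0]]
print(log_likelihood(A, P, [0, 1, 1, 2]))  # log-probability of B I I S (negative)
```

Because the probabilities of all candidate sequences must sum to one, the forward-algorithm result can be checked against brute-force enumeration on toy sizes.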
  • In summary, the enterprise name recognition system 20 proposed by the present application automatically extracts effective features through the combined LSTM+CRF model, utilizes context information when identifying enterprise names, and makes effective use of sentence-level label information in the labeling stage. Compared with traditional sequence modeling methods, the enterprise name recognition method proposed in the present application improves recognition precision and recall.
  • In addition, the present application also proposes an enterprise name recognition method.
  • Referring to FIG. 3, which is a schematic flowchart of an embodiment of the enterprise name recognition method of the present application, the order of execution of the steps in the flowchart may be changed and some steps may be omitted according to different requirements.
  • Step S31: receive an input specific text sequence.
  • In this embodiment, the specific text sequence is Chinese text, including Chinese characters and spaces, such as the news sentence "China Ping An has released a new product."
  • Step S32: convert each Chinese character in the specific text sequence into a corresponding word vector, and input the converted word vectors into a Recurrent Neural Network (RNN).
  • In this embodiment, the recurrent neural network uses Long Short-Term Memory (LSTM), preferably a bi-directional LSTM.
  • Step S33: calculate the state vector of each word vector through the recurrent neural network, and input the state vector of each word vector into a Conditional Random Field (CRF).
  • In this embodiment, the state vector includes a first hidden-layer state vector and a second hidden-layer state vector, and its calculation specifically includes: invoking the bidirectional LSTM, computing the first hidden-layer state vector h_i of the current word vector from left to right based on the hidden-layer state vector of the preceding word vector, and computing the second hidden-layer state vector h_i' of the current word vector from right to left based on the hidden-layer state vector of the following word vector.
  • The first hidden-layer state vector h_i and the second hidden-layer state vector h_i' are essentially features of the originally input specific text sequence extracted automatically by the LSTM; this manner of feature extraction differs from traditional methods in that it does not depend on manual feature selection and generalizes well.
  • Step S34: convert the state vector of each word vector into a feature vector X_i through a specific regression model (such as a softmax model), and decode the feature vector X_i of each word vector using the conditional random field and the preset ternary labeling rule to obtain the ternary label set (B, I, S) of all Chinese characters in the specific text sequence, outputting it through the optimal label sequence (denoted Y_i).
  • In this embodiment, the preset ternary labeling rule includes: the first label (e.g., "B") indicates the first Chinese character of an enterprise name, the second label (e.g., "I") indicates the remaining Chinese characters of the enterprise name, and the third label (e.g., "S") indicates Chinese characters that do not belong to any enterprise name.
  • Converting the state vector of each word vector into the feature vector X_i includes: combining, through the specific regression model (such as a softmax model), the first hidden-layer state vector h_i and the second hidden-layer state vector h_i' corresponding to each word vector to obtain the feature vector X_i of each word vector.
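One common reading of combining h_i and h_i' through a softmax-style regression model, and the one sketched below, is concatenation followed by a linear projection to one score per label and a softmax. This interpretation and all weight values are assumptions for illustration; the patent only names the model family:

```python
import math

def combine(h_fwd, h_bwd, W, b):
    """Concatenate the two hidden-layer state vectors and project them
    to one score per label; a softmax then yields P[i][t]-style
    probabilities over the labels B, I, S."""
    x = h_fwd + h_bwd                  # concatenation of h_i and h_i'
    scores = [sum(W[t][j] * x[j] for j in range(len(x))) + b[t]
              for t in range(len(b))]
    m = max(scores)                    # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

h_i      = [0.2, -0.1]                 # toy forward hidden state
h_i_back = [0.4, 0.3]                  # toy backward hidden state
W = [[0.5, 0.1, -0.2, 0.3],            # 3 labels (B, I, S) x 4 inputs
     [-0.1, 0.4, 0.2, 0.0],
     [0.2, -0.3, 0.1, 0.5]]
b = [0.0, 0.0, 0.0]
probs = combine(h_i, h_i_back, W, b)
print(round(sum(probs), 6))  # 1.0
```

The output row for each position i plays the role of P_{i, y_i} in Equation 1.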
  • In this embodiment, the optimal label sequence Y_i is obtained through a predetermined label sequence calculation formula (Equation 1 below):

    s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}    (Equation 1)

  • where X represents the set of feature vectors X_i of the word vectors, i.e., X = (X_1, X_2, ..., X_n), y represents the label sequence to be predicted, and n is the number of Chinese characters in the specific text sequence;
  • A represents a state transition matrix, A_{y_i, y_{i+1}} represents the probability of transitioning from the y_i-th label to the y_{i+1}-th label, and P_{i, y_i} represents the probability that the i-th Chinese character is marked with the y_i-th label; and
  • s(X, y) is the score that measures each candidate label sequence; the best label sequence Y_i is obtained by maximizing s(X, y).
  • The CRF introduced in this embodiment actually models the output label triplets, performs the computation with dynamic programming, and finally produces the labels along the obtained optimal path; that is, the best label sequence Y_i is obtained by maximizing s(X, y).
  • Step S35: identify the specific enterprise name from the optimal label sequence according to the preset ternary labeling rule. Specifically, the identification comprises: extracting, from the optimal label sequence, the Chinese characters corresponding to a consecutively labeled first label and all following second labels, and taking the extracted Chinese characters as the specific enterprise name.
  • For example, for the input sentence "China Ping An has released a new product.", the optimal label sequence output by the LSTM+CRF model adopted in this application is {B, I, I, I, S, S, S, S, S, S}, which means that the enterprise name identified from the specific text sequence is "China Ping An", i.e., the Chinese characters corresponding to the consecutively labeled first label and second labels (B, I, I, I).
  • The final step of the LSTM+CRF model is to maximize s(X, y) to obtain the best label sequence {B, I, I, I, S, S, S, S, S, S}; that is, the s(X, y) of the optimal label sequence {B, I, I, I, S, S, S, S, S, S} is larger than that of any other sequence, so {B, I, I, I, S, S, S, S, S, S} is determined to be the best label sequence. Here {B, I, I, I} represents "China Ping An": these four characters form the enterprise name, because according to the preset ternary labeling rule, B represents the first Chinese character of an enterprise name, I represents its remaining Chinese characters, and S indicates Chinese characters that do not belong to any enterprise name.
  • It should be noted that the method adopted in the present application is applicable to both model training and model application. When training the model, the enterprise names serve as sample data, and the expected model output (the enterprise names) is known reference data. The LSTM model is trained to obtain parameters and variables such as the model's spatial dimensions and coefficient matrices, and the model is then adjusted against the known reference data until training yields a sufficiently reliable model. When the model is applied, a sentence to be analyzed is input into the model, and the enterprise names contained in it can be predicted.
  • In summary, the enterprise name recognition method proposed by the present application automatically extracts effective features through the combined LSTM+CRF model, utilizes context information when identifying enterprise names, and makes effective use of sentence-level label information in the labeling stage. Compared with traditional sequence modeling methods, the enterprise name recognition method proposed in the present application improves recognition precision and recall.
  • Furthermore, the present application provides a computer-readable storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) storing an enterprise name recognition system 20, the enterprise name recognition system 20 being executable by at least one processor 22 to cause the at least one processor 22 to perform the steps of the enterprise name recognition method described above.
  • Through the description of the above embodiments, those skilled in the art can clearly understand that the foregoing embodiment methods can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation.
  • The technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present application.


Abstract

An enterprise name recognition method, comprising the steps of: receiving an input specific text sequence (S31); converting each Chinese character in the specific text sequence into a corresponding word vector and inputting it into a recurrent neural network (S32); computing the state vector of each word vector through the recurrent neural network and inputting it into a conditional random field (S33); converting the state vector of each word vector into a feature vector through a specific regression model, and decoding the feature vector of each word vector using the conditional random field and a preset ternary labeling rule to obtain the ternary label set of all Chinese characters in the specific text sequence, outputting the ternary label set of all Chinese characters through the optimal label sequence (S34); and identifying a specific enterprise name from the optimal label sequence according to the preset ternary labeling rule (S35). Enterprise name recognition accuracy can thereby be improved.

Description

Enterprise name recognition method, electronic device, and computer-readable storage medium
This application claims priority to Chinese patent application No. 201710960222.1, filed with the China Patent Office on October 16, 2017 and entitled "企业名称识别方法、电子设备及计算机可读存储介质" (Enterprise name recognition method, electronic device, and computer-readable storage medium), the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer information technology, and in particular to an enterprise name recognition method, an electronic device, and a computer-readable storage medium.
Background
Public opinion analysis requires structuring financial news, and the first step is to identify the enterprise entities in the news. Traditional natural language processing techniques generally use conditional random fields or hidden Markov models for sequence modeling; however, these methods rely heavily on feature selection and generalize poorly. The enterprise name recognition methods in the prior art are therefore not well designed and urgently need improvement.
Summary of the Invention
In view of this, the present application proposes an enterprise name recognition method, an electronic device, and a computer-readable storage medium that automatically extract effective features through a combined LSTM+CRF model, utilize context information when identifying enterprise names, make effective use of sentence-level label information in the tagging stage, and improve recognition precision and recall.
First, to achieve the above objective, the present application proposes an electronic device comprising a memory and a processor, the memory storing an enterprise name recognition system operable on the processor, the enterprise name recognition system implementing the following steps when executed by the processor:
receiving an input specific text sequence;
converting each Chinese character in the specific text sequence into a corresponding word vector, and inputting the converted word vectors into a recurrent neural network;
computing the state vector of each word vector through the recurrent neural network, and inputting the state vector of each word vector into a conditional random field;
converting the state vector of each word vector into a feature vector through a specific regression model, and decoding the feature vector of each word vector using the conditional random field and a preset ternary labeling rule to obtain the ternary label set of all Chinese characters in the specific text sequence, outputting the ternary label set of all Chinese characters through the optimal label sequence; and
identifying a specific enterprise name from the optimal label sequence according to the preset ternary labeling rule.
Preferably, the preset ternary labeling rule includes: a first label indicating the first Chinese character of an enterprise name, a second label indicating the remaining Chinese characters of the enterprise name, and a third label indicating Chinese characters that do not belong to any enterprise name.
Preferably, the state vector includes a first hidden-layer state vector and a second hidden-layer state vector;
the calculation of the state vector includes:
invoking the bidirectional long short-term memory module of the recurrent neural network, computing the first hidden-layer state vector of the current word vector from left to right based on the hidden-layer state vector of the preceding word vector, and computing the second hidden-layer state vector of the current word vector from right to left based on the hidden-layer state vector of the following word vector.
The calculation of the feature vector includes: combining, through the specific regression model, the first hidden-layer state vector and the second hidden-layer state vector corresponding to each word vector to obtain the feature vector of each word vector.
Preferably, the optimal label sequence is obtained through a predetermined label sequence calculation formula, set as:

s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where X represents the feature vector of each word vector, y represents the label sequence to be predicted, n represents the number of Chinese characters in the specific text sequence, and i denotes the i-th Chinese character in the specific text sequence;
A represents a state transition matrix, A_{y_i, y_{i+1}} represents the probability of transitioning from the y_i-th label to the y_{i+1}-th label, and P_{i, y_i} represents the probability that the i-th Chinese character is marked with the y_i-th label; and
s(X, y) is the score that measures each candidate label sequence, and the optimal label sequence is obtained by maximizing s(X, y).
Preferably, the identification of the specific enterprise name includes: extracting, from the optimal label sequence, the Chinese characters corresponding to a consecutively labeled first label and all second labels, and taking the extracted Chinese characters as the specific enterprise name.
In addition, to achieve the above objective, the present application further provides an enterprise name recognition method applied to an electronic device, the method comprising:
receiving an input specific text sequence;
converting each Chinese character in the specific text sequence into a corresponding word vector, and inputting the converted word vectors into a recurrent neural network;
computing the state vector of each word vector through the recurrent neural network, and inputting the state vector of each word vector into a conditional random field;
converting the state vector of each word vector into a feature vector through a specific regression model, and decoding the feature vector of each word vector using the conditional random field and a preset ternary labeling rule to obtain the ternary label set of all Chinese characters in the specific text sequence, outputting the ternary label set of all Chinese characters through the optimal label sequence; and
identifying a specific enterprise name from the optimal label sequence according to the preset ternary labeling rule.
Preferably, the preset ternary labeling rule includes: a first label indicating the first Chinese character of an enterprise name, a second label indicating the remaining Chinese characters of the enterprise name, and a third label indicating Chinese characters that do not belong to any enterprise name.
Preferably, the optimal label sequence is obtained through a predetermined label sequence calculation formula, set as:

s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where X represents the feature vector of each word vector, y represents the label sequence to be predicted, n represents the number of Chinese characters in the specific text sequence, and i denotes the i-th Chinese character in the specific text sequence;
A represents a state transition matrix, A_{y_i, y_{i+1}} represents the probability of transitioning from the y_i-th label to the y_{i+1}-th label, and P_{i, y_i} represents the probability that the i-th Chinese character is marked with the y_i-th label; and
s(X, y) is the score that measures each candidate label sequence, and the optimal label sequence is obtained by maximizing s(X, y).
Preferably, the identification of the specific enterprise name includes: extracting, from the optimal label sequence, the Chinese characters corresponding to a consecutively labeled first label and all second labels, and taking the extracted Chinese characters as the specific enterprise name.
Further, to achieve the above objective, the present application also provides a computer-readable storage medium storing an enterprise name recognition system executable by at least one processor to cause the at least one processor to perform the steps of the enterprise name recognition method described above.
Compared with the prior art, the electronic device, enterprise name recognition method, and computer-readable storage medium proposed in the present application automatically extract effective features through the combined LSTM+CRF model and utilize context information when identifying enterprise names, making effective use of sentence-level label information in the tagging stage. Compared with traditional sequence modeling methods, the enterprise name recognition method proposed in the present application improves recognition precision and recall.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an optional hardware architecture of the electronic device of the present application;
FIG. 2 is a schematic diagram of the program modules of an embodiment of the enterprise name recognition system in the electronic device of the present application;
FIG. 3 is a schematic flowchart of an embodiment of the enterprise name recognition method of the present application;
FIG. 4 is an example diagram of enterprise name recognition according to the present application.
Reference numerals:
Electronic device 2
Memory 21
Processor 22
Network interface 23
Enterprise name recognition system 20
Receiving module 201
Conversion module 202
Computing module 203
Labeling module 204
Recognition module 205
Process steps S31-S35
The implementation, functional features, and advantages of the objectives of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present application and not to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
It should be noted that descriptions involving "first", "second", and the like in the present application are for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, but only on the basis that they can be implemented by those of ordinary skill in the art; when a combination of technical solutions is contradictory or infeasible, such a combination shall be deemed not to exist and falls outside the scope of protection claimed by the present application.
It should further be noted that, as used herein, the terms "comprise", "include", or any variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises that element.
First, the present application proposes an electronic device 2.
Referring to FIG. 1, which is a schematic diagram of an optional hardware architecture of the electronic device 2 of the present application. In this embodiment, the electronic device 2 may include, but is not limited to, a memory 21, a processor 22, and a network interface 23 communicatively connected to one another through a system bus. It should be noted that FIG. 1 shows only the electronic device 2 with components 21-23; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The electronic device 2 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server, and may be an independent server or a server cluster composed of multiple servers.
The memory 21 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, and the like. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as its hard disk or internal memory. In other embodiments, the memory 21 may instead be an external storage device of the electronic device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the electronic device 2. Of course, the memory 21 may also include both the internal storage unit of the electronic device 2 and an external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the electronic device 2, such as the program code of the enterprise name recognition system 20. In addition, the memory 21 may be used to temporarily store various data that has been output or is to be output.
The processor 22 may, in some embodiments, be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 22 is generally used to control the overall operation of the electronic device 2, for example to perform control and processing related to data exchange or communication with the electronic device 2. In this embodiment, the processor 22 is used to run the program code stored in the memory 21 or to process data, for example to run the enterprise name recognition system 20.
The network interface 23 may include a wireless network interface or a wired network interface and is generally used to establish communication connections between the electronic device 2 and other electronic devices. For example, the network interface 23 is used to connect the electronic device 2 to an external data platform through a network and to establish a data transmission channel and a communication connection between the electronic device 2 and the external data platform. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
The application environment and the hardware structures and functions of the related devices of the various embodiments of the present application have now been described in detail. Hereinafter, the various embodiments of the present application are proposed based on the above application environment and related devices.
Referring to FIG. 2, which is a program module diagram of an embodiment of the enterprise name recognition system 20 in the electronic device 2 of the present application. In this embodiment, the enterprise name recognition system 20 may be divided into one or more program modules that are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to complete the present application. For example, in FIG. 2, the enterprise name recognition system 20 may be divided into a receiving module 201, a conversion module 202, a computing module 203, a labeling module 204, and a recognition module 205. A program module as referred to in the present application is a series of computer program instruction segments capable of completing a specific function, and is better suited than a whole program for describing the execution process of the enterprise name recognition system 20 in the electronic device 2. The functions of the program modules 201-205 are described in detail below.
The receiving module 201 is used to receive an input specific text sequence. In this embodiment, the specific text sequence consists of Chinese text, including Chinese characters and spaces, such as the news sentence "中国平安发布了新产品" ("Ping An of China released a new product").
The conversion module 202 is used to convert each Chinese character in the specific text sequence into a corresponding word vector x_i (i = 0, 1, 2, ..., n; vector dimension 100) and to input the converted word vectors into a recurrent neural network (RNN). In this embodiment, the recurrent neural network uses a Long Short-Term Memory (LSTM) module, preferably a bidirectional LSTM (Bi-directional LSTM).
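As an illustration of this conversion step, the following is a minimal Python sketch of a character-to-vector lookup. The vocabulary, the random initialization, and the names `vocab` and `embedding` are illustrative assumptions, not part of the patent; in practice the 100-dimensional vectors would be learned during training or pretrained.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical character vocabulary built from the example sentence.
vocab = {ch: idx for idx, ch in enumerate("中国平安发布了新产品")}
embedding = rng.normal(size=(len(vocab), 100))  # one 100-dim vector per character

def to_vectors(text):
    """Map each character x_i of the input sequence to its 100-dim word vector."""
    return np.stack([embedding[vocab[ch]] for ch in text])

x = to_vectors("中国平安发布了新产品")  # shape: (10 characters, 100 dims)
```

The resulting matrix `x` is what the conversion module would feed into the recurrent network.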
The computing module 203 is used to compute the state vector of each word vector through the recurrent neural network and to input the state vector of each word vector into a conditional random field (CRF). The state vector includes a first hidden-layer state vector and a second hidden-layer state vector.
Preferably, in this embodiment, the computation of the state vector specifically includes the following step:
invoking the bidirectional LSTM module of the recurrent neural network, computing, from left to right, the first hidden-layer state vector h_i of the current word vector x_i from the hidden-layer state vector h_{i-1} of the preceding word vector x_{i-1}, and computing, from right to left, the second hidden-layer state vector h_i' of the current word vector x_i from the hidden-layer state vector h_{i+1} of the following word vector x_{i+1}.
The above first hidden-layer state vector h_i and second hidden-layer state vector h_i' are in essence features automatically extracted by the LSTM from the original input text sequence. This way of extracting features differs from traditional methods: it does not depend on manual feature selection and generalizes well.
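The two directional passes can be sketched as follows. This is a simplified tanh recurrence standing in for the full LSTM gating, and the weight shapes and hidden dimension are illustrative assumptions; it is intended only to show how h_i depends on h_{i-1} in the forward pass and h_i' on h_{i+1} in the backward pass, and how the two are merged into a per-character feature vector.

```python
import numpy as np

def bidirectional_states(embeddings, W_f, U_f, W_b, U_b):
    """Compute forward (h_i) and backward (h_i') hidden states.

    A plain tanh recurrence stands in for the LSTM gates: the
    left-to-right pass uses h_{i-1}, the right-to-left pass h_{i+1}.
    """
    n, _ = embeddings.shape
    hdim = W_f.shape[0]
    h_fwd = np.zeros((n, hdim))
    h_bwd = np.zeros((n, hdim))
    prev = np.zeros(hdim)
    for i in range(n):                       # left to right
        prev = np.tanh(W_f @ embeddings[i] + U_f @ prev)
        h_fwd[i] = prev
    nxt = np.zeros(hdim)
    for i in reversed(range(n)):             # right to left
        nxt = np.tanh(W_b @ embeddings[i] + U_b @ nxt)
        h_bwd[i] = nxt
    return h_fwd, h_bwd

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 100))               # 10 characters, 100-dim word vectors
W_f, U_f = rng.normal(size=(50, 100)), rng.normal(size=(50, 50))
W_b, U_b = rng.normal(size=(50, 100)), rng.normal(size=(50, 50))
hf, hb = bidirectional_states(x, W_f, U_f, W_b, U_b)
features = np.concatenate([hf, hb], axis=1)  # merged feature vector X_i per character
```

Concatenation is one common way to merge the two directions; the patent leaves the merge to the regression model.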
The labeling module 204 is used to convert the state vector of each word vector into a feature vector X_i through a specific regression model (such as a softmax model), and to decode the feature vector X_i of each word vector using the conditional random field and a preset ternary labeling rule, obtaining the ternary label set (B, I, S) for all Chinese characters in the specific text sequence and outputting it as the optimal label sequence (denoted Y_i).
In this embodiment, the preset ternary labeling rule includes: a first label (e.g., "B") denotes the first Chinese character of an enterprise name, a second label (e.g., "I") denotes the remaining Chinese characters of the enterprise name, and a third label (e.g., "S") denotes Chinese characters that do not belong to an enterprise name.
Preferably, in this embodiment, converting the state vector of each word vector into the feature vector X_i includes: merging, through the specific regression model (such as a softmax model), the first hidden-layer state vector h_i and the second hidden-layer state vector h_i' corresponding to each word vector to obtain the feature vector X_i of each word vector.
Preferably, in this embodiment, the optimal label sequence Y_i is obtained through a predetermined label sequence formula (Formula 1 below).

s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}    (Formula 1)

where X denotes the set of feature vectors X_i of the word vectors, i.e., X = (X_1, X_2, ..., X_n); y denotes the label sequence to be predicted, i.e., y = (y_1, y_2, ..., y_n); n denotes the number of Chinese characters in the specific text sequence (n >= 1); in the formula for s(X, y), i denotes the i-th Chinese character in the specific text sequence (i >= 1); s(X, y) is the score measuring each candidate label sequence, and the optimal label sequence Y_i is obtained by maximizing s(X, y).
A denotes the state transition matrix, A_{y_i, y_{i+1}} denotes the probability of transitioning from label y_i to label y_{i+1}, and P_{i, y_i} denotes the probability that the i-th Chinese character is marked with label y_i.
In this embodiment, for the inputs X_i, an LSTM output probability matrix P of size n x k can be defined, where n denotes the number of Chinese characters in the specific text sequence (n >= 1) and k denotes the number of output labels (k = 3 in this embodiment), i.e., the number of labels in the ternary label set (B, I, S).
The CRF introduced in this embodiment in fact models the output label triples, performs the computation using dynamic programming, and finally labels according to the resulting optimal path, i.e., the optimal label sequence Y_i is obtained by maximizing s(X, y).
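A compact sketch of the scoring function s(X, y) and the dynamic-programming decode described above follows. The toy matrices `P_demo` and `A_demo` are made-up values for illustration, label indices 0, 1, 2 stand for B, I, S, and the start/end boundary transitions are omitted for brevity.

```python
import numpy as np

def score(P, A, y):
    """s(X, y): transition scores A[y_i, y_{i+1}] plus emission scores P[i, y_i]."""
    s = sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return s + sum(P[i, y[i]] for i in range(len(y)))

def viterbi(P, A):
    """Dynamic programming over label sequences: returns argmax_y s(X, y)."""
    n, k = P.shape
    dp = P[0].copy()                            # best score of a path ending in each label
    back = np.zeros((n, k), dtype=int)          # backpointers for path recovery
    for i in range(1, n):
        cand = dp[:, None] + A + P[i][None, :]  # previous label -> current label
        back[i] = cand.argmax(axis=0)
        dp = cand.max(axis=0)
    path = [int(dp.argmax())]
    for i in range(n - 1, 0, -1):               # follow backpointers right to left
        path.append(int(back[i, path[-1]]))
    return path[::-1]

# Toy example: 3 characters, 3 labels (0=B, 1=I, 2=S), made-up scores.
P_demo = np.array([[2.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [0.0, 0.0, 2.0]])
A_demo = np.zeros((3, 3))
best = viterbi(P_demo, A_demo)                  # -> [0, 1, 2]
```

The decode runs in O(n k^2) time rather than enumerating all k^n candidate sequences, which is the point of the dynamic-programming step the patent refers to.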
The recognition module 205 is used to identify a specific enterprise name from the optimal label sequence according to the preset ternary labeling rule. Preferably, the identification of the specific enterprise name includes: extracting from the optimal label sequence the Chinese characters corresponding to a consecutively labeled first label and all following second labels, and taking the extracted characters as the specific enterprise name.
For example, referring to FIG. 4, suppose the input specific text sequence is "中国平安发布了新产品". The optimal label sequence output by the LSTM+CRF model used in the present application is {B, I, I, I, S, S, S, S, S, S}, which means the enterprise name recognized from the specific text sequence is "中国平安", i.e., the Chinese characters corresponding to the consecutively labeled first and second labels (B, I, I, I). In the example of FIG. 4, the final step of the LSTM+CRF model obtains the optimal label sequence {B, I, I, I, S, S, S, S, S, S} by maximizing s(X, y): the s(X, y) of this sequence is larger than that of every other sequence, so {B, I, I, I, S, S, S, S, S, S} is determined to be the optimal label sequence. Here {B, I, I, I} indicates that the four characters 中国平安 form an enterprise name, since, according to the preset ternary labeling rule, B denotes the first character of an enterprise name, I denotes the remaining characters of the name, and S denotes characters that do not belong to an enterprise name.
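The extraction rule in this example can be sketched as a short function; the helper name `extract_names` is illustrative, not part of the patent.

```python
def extract_names(chars, labels):
    """Collect the characters under a leading 'B' label and its following run of 'I' labels."""
    names, current = [], ""
    for ch, lab in zip(chars, labels):
        if lab == "B":              # first character of an enterprise name
            if current:
                names.append(current)
            current = ch
        elif lab == "I" and current:
            current += ch           # remaining characters of the name
        else:                       # 'S': outside any enterprise name
            if current:
                names.append(current)
            current = ""
    if current:
        names.append(current)
    return names

tags = ["B", "I", "I", "I", "S", "S", "S", "S", "S", "S"]
found = extract_names("中国平安发布了新产品", tags)  # -> ["中国平安"]
```

Guarding the "I" branch on `current` means an "I" with no preceding "B" is ignored, which matches the rule that a name always starts with the first label.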
It should be noted that the method adopted in the present application applies to both model training and model application. During model training, enterprise full names serve as sample data, and the model's output results (enterprise short names) serve as known reference data. The LSTM model is trained step by step on a large amount of sample data to obtain parameters and variables such as the model's spatial dimensions and coefficient matrices, and the model is then adjusted against the known reference data until a reasonably reliable model is obtained. When the model is applied, a sentence to be analyzed is input into the model, which then predicts the enterprise names it contains.
Through the above program modules 201-205, the enterprise name recognition system 20 proposed in the present application, by combining an LSTM+CRF model, can automatically extract effective features, exploit context information when recognizing enterprise names, and make effective use of sentence-level label information at the tagging stage. Compared with traditional sequence modeling methods, the proposed enterprise name recognition method improves recognition precision and recall.
In addition, the present application also proposes an enterprise name recognition method.
Referring to FIG. 3, which is a schematic flowchart of an embodiment of the enterprise name recognition method of the present application. In this embodiment, depending on different requirements, the order of execution of the steps in the flowchart shown in FIG. 3 may be changed, and some steps may be omitted.
Step S31: receive an input specific text sequence. In this embodiment, the specific text sequence consists of Chinese text, including Chinese characters and spaces, such as the news sentence "中国平安发布了新产品".
Step S32: convert each Chinese character in the specific text sequence into a corresponding word vector x_i (i = 0, 1, 2, ..., n; vector dimension 100), and input the converted word vectors into a recurrent neural network (RNN). In this embodiment, the recurrent neural network uses a Long Short-Term Memory (LSTM) module, preferably a bidirectional LSTM (Bi-directional LSTM).
Step S33: compute the state vector of each word vector through the recurrent neural network, and input the state vector of each word vector into a conditional random field (CRF). The state vector includes a first hidden-layer state vector and a second hidden-layer state vector.
Preferably, in this embodiment, the computation of the state vector specifically includes the following step:
invoking the bidirectional LSTM module of the recurrent neural network, computing, from left to right, the first hidden-layer state vector h_i of the current word vector x_i from the hidden-layer state vector h_{i-1} of the preceding word vector x_{i-1}, and computing, from right to left, the second hidden-layer state vector h_i' of the current word vector x_i from the hidden-layer state vector h_{i+1} of the following word vector x_{i+1}.
The above first hidden-layer state vector h_i and second hidden-layer state vector h_i' are in essence features automatically extracted by the LSTM from the original input text sequence. This way of extracting features differs from traditional methods: it does not depend on manual feature selection and generalizes well.
Step S34: convert the state vector of each word vector into a feature vector X_i through a specific regression model (such as a softmax model), and decode the feature vector X_i of each word vector using the conditional random field and a preset ternary labeling rule, obtaining the ternary label set (B, I, S) for all Chinese characters in the specific text sequence and outputting it as the optimal label sequence (denoted Y_i).
In this embodiment, the preset ternary labeling rule includes: a first label (e.g., "B") denotes the first Chinese character of an enterprise name, a second label (e.g., "I") denotes the remaining Chinese characters of the enterprise name, and a third label (e.g., "S") denotes Chinese characters that do not belong to an enterprise name.
Preferably, in this embodiment, converting the state vector of each word vector into the feature vector X_i includes: merging, through the specific regression model (such as a softmax model), the first hidden-layer state vector h_i and the second hidden-layer state vector h_i' corresponding to each word vector to obtain the feature vector X_i of each word vector.
Preferably, in this embodiment, the optimal label sequence Y_i is obtained through a predetermined label sequence formula (Formula 1 below).

s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}    (Formula 1)

where X denotes the set of feature vectors X_i of the word vectors, i.e., X = (X_1, X_2, ..., X_n); y denotes the label sequence to be predicted, i.e., y = (y_1, y_2, ..., y_n); n denotes the number of Chinese characters in the specific text sequence (n >= 1); in the formula for s(X, y), i denotes the i-th Chinese character in the specific text sequence (i >= 1); s(X, y) is the score measuring each candidate label sequence, and the optimal label sequence Y_i is obtained by maximizing s(X, y).
A denotes the state transition matrix, A_{y_i, y_{i+1}} denotes the probability of transitioning from label y_i to label y_{i+1}, and P_{i, y_i} denotes the probability that the i-th Chinese character is marked with label y_i.
In this embodiment, for the inputs X_i, an LSTM output probability matrix P of size n x k can be defined, where n denotes the number of Chinese characters in the specific text sequence (n >= 1) and k denotes the number of output labels (k = 3 in this embodiment), i.e., the number of labels in the ternary label set (B, I, S).
The CRF introduced in this embodiment in fact models the output label triples, performs the computation using dynamic programming, and finally labels according to the resulting optimal path, i.e., the optimal label sequence Y_i is obtained by maximizing s(X, y).
Step S35: identify a specific enterprise name from the optimal label sequence according to the preset ternary labeling rule. Preferably, the identification of the specific enterprise name includes: extracting from the optimal label sequence the Chinese characters corresponding to a consecutively labeled first label and all following second labels, and taking the extracted characters as the specific enterprise name.
For example, referring to FIG. 4, suppose the input specific text sequence is "中国平安发布了新产品". The optimal label sequence output by the LSTM+CRF model used in the present application is {B, I, I, I, S, S, S, S, S, S}, which means the enterprise name recognized from the specific text sequence is "中国平安", i.e., the Chinese characters corresponding to the consecutively labeled first and second labels (B, I, I, I). In the example of FIG. 4, the final step of the LSTM+CRF model obtains the optimal label sequence {B, I, I, I, S, S, S, S, S, S} by maximizing s(X, y): the s(X, y) of this sequence is larger than that of every other sequence, so {B, I, I, I, S, S, S, S, S, S} is determined to be the optimal label sequence. Here {B, I, I, I} indicates that the four characters 中国平安 form an enterprise name, since, according to the preset ternary labeling rule, B denotes the first character of an enterprise name, I denotes the remaining characters of the name, and S denotes characters that do not belong to an enterprise name.
It should be noted that the method adopted in the present application applies to both model training and model application. During model training, enterprise full names serve as sample data, and the model's output results (enterprise short names) serve as known reference data. The LSTM model is trained step by step on a large amount of sample data to obtain parameters and variables such as the model's spatial dimensions and coefficient matrices, and the model is then adjusted against the known reference data until a reasonably reliable model is obtained. When the model is applied, a sentence to be analyzed is input into the model, which then predicts the enterprise names it contains.
Through the above steps S31-S35, the enterprise name recognition method proposed in the present application, by combining an LSTM+CRF model, can automatically extract effective features, exploit context information when recognizing enterprise names, and make effective use of sentence-level label information at the tagging stage. Compared with traditional sequence modeling methods, the proposed enterprise name recognition method improves recognition precision and recall.
Further, to achieve the above objective, the present application also provides a computer-readable storage medium (such as a ROM/RAM, magnetic disk, or optical disc) storing an enterprise name recognition system 20 executable by at least one processor 22, to cause the at least one processor 22 to perform the steps of the enterprise name recognition method described above.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is the better implementation. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present application.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, without thereby limiting the scope of the claims of the present application. The serial numbers of the above embodiments are for description only and do not represent the merits of the embodiments. In addition, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that shown here.
Those skilled in the art can implement the present application in various variants without departing from its scope and spirit; for example, a feature of one embodiment may be used in another embodiment to obtain yet another embodiment. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing an enterprise name recognition system executable on the processor, the enterprise name recognition system implementing the following steps when executed by the processor:
    receiving an input specific text sequence;
    converting each Chinese character in the specific text sequence into a corresponding word vector, and inputting the converted word vectors into a recurrent neural network;
    computing a state vector for each word vector through the recurrent neural network, and inputting the state vector of each word vector into a conditional random field;
    converting the state vector of each word vector into a feature vector through a specific regression model, decoding the feature vector of each word vector using the conditional random field and a preset ternary labeling rule to obtain a ternary label set for all Chinese characters in the specific text sequence, and outputting the ternary label set of all Chinese characters as an optimal label sequence; and
    identifying a specific enterprise name from the optimal label sequence according to the preset ternary labeling rule.
  2. The electronic device according to claim 1, characterized in that the preset ternary labeling rule includes: a first label denoting the first Chinese character of an enterprise name, a second label denoting the remaining Chinese characters of the enterprise name, and a third label denoting Chinese characters that do not belong to an enterprise name.
  3. The electronic device according to claim 2, characterized in that the state vector includes a first hidden-layer state vector and a second hidden-layer state vector;
    the computation of the state vector includes:
    invoking the bidirectional long short-term memory module of the recurrent neural network, computing, from left to right, the first hidden-layer state vector of the current word vector from the hidden-layer state vector of the preceding word vector, and computing, from right to left, the second hidden-layer state vector of the current word vector from the hidden-layer state vector of the following word vector;
    the computation of the feature vector includes: merging, through the specific regression model, the first hidden-layer state vector and the second hidden-layer state vector corresponding to each word vector to obtain the feature vector of each word vector.
  4. The electronic device according to claim 2, characterized in that the optimal label sequence is obtained through a predetermined label sequence formula, the predetermined label sequence formula being:
    s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}
    where X denotes the feature vectors of the word vectors, y denotes the label sequence to be predicted, n denotes the number of Chinese characters in the specific text sequence, and i denotes the i-th Chinese character in the specific text sequence;
    A denotes the state transition matrix, A_{y_i, y_{i+1}} denotes the probability of transitioning from label y_i to label y_{i+1}, and P_{i, y_i} denotes the probability that the i-th Chinese character is marked with label y_i; and
    s(X, y) is the score measuring each candidate label sequence; the optimal label sequence is obtained by maximizing s(X, y).
  5. The electronic device according to claim 3, characterized in that the optimal label sequence is obtained through a predetermined label sequence formula, the predetermined label sequence formula being:
    s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}
    where X denotes the feature vectors of the word vectors, y denotes the label sequence to be predicted, n denotes the number of Chinese characters in the specific text sequence, and i denotes the i-th Chinese character in the specific text sequence;
    A denotes the state transition matrix, A_{y_i, y_{i+1}} denotes the probability of transitioning from label y_i to label y_{i+1}, and P_{i, y_i} denotes the probability that the i-th Chinese character is marked with label y_i; and
    s(X, y) is the score measuring each candidate label sequence; the optimal label sequence is obtained by maximizing s(X, y).
  6. The electronic device according to claim 2, characterized in that the identification of the specific enterprise name includes: extracting from the optimal label sequence the Chinese characters corresponding to a consecutively labeled first label and all following second labels, and taking the extracted characters as the specific enterprise name.
  7. The electronic device according to any one of claims 3-5, characterized in that the identification of the specific enterprise name includes: extracting from the optimal label sequence the Chinese characters corresponding to a consecutively labeled first label and all following second labels, and taking the extracted characters as the specific enterprise name.
  8. An enterprise name recognition method applied to an electronic device, characterized in that the method comprises:
    receiving an input specific text sequence;
    converting each Chinese character in the specific text sequence into a corresponding word vector, and inputting the converted word vectors into a recurrent neural network;
    computing a state vector for each word vector through the recurrent neural network, and inputting the state vector of each word vector into a conditional random field;
    converting the state vector of each word vector into a feature vector through a specific regression model, decoding the feature vector of each word vector using the conditional random field and a preset ternary labeling rule to obtain a ternary label set for all Chinese characters in the specific text sequence, and outputting the ternary label set of all Chinese characters as an optimal label sequence; and
    identifying a specific enterprise name from the optimal label sequence according to the preset ternary labeling rule.
  9. The enterprise name recognition method according to claim 8, characterized in that the preset ternary labeling rule includes: a first label denoting the first Chinese character of an enterprise name, a second label denoting the remaining Chinese characters of the enterprise name, and a third label denoting Chinese characters that do not belong to an enterprise name.
  10. The enterprise name recognition method according to claim 9, characterized in that the state vector includes a first hidden-layer state vector and a second hidden-layer state vector;
    the computation of the state vector includes:
    invoking the bidirectional long short-term memory module of the recurrent neural network, computing, from left to right, the first hidden-layer state vector of the current word vector from the hidden-layer state vector of the preceding word vector, and computing, from right to left, the second hidden-layer state vector of the current word vector from the hidden-layer state vector of the following word vector;
    the computation of the feature vector includes: merging, through the specific regression model, the first hidden-layer state vector and the second hidden-layer state vector corresponding to each word vector to obtain the feature vector of each word vector.
  11. The enterprise name recognition method according to claim 9, characterized in that the optimal label sequence is obtained through a predetermined label sequence formula, the predetermined label sequence formula being:
    s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}
    where X denotes the feature vectors of the word vectors, y denotes the label sequence to be predicted, n denotes the number of Chinese characters in the specific text sequence, and i denotes the i-th Chinese character in the specific text sequence;
    A denotes the state transition matrix, A_{y_i, y_{i+1}} denotes the probability of transitioning from label y_i to label y_{i+1}, and P_{i, y_i} denotes the probability that the i-th Chinese character is marked with label y_i; and
    s(X, y) is the score measuring each candidate label sequence; the optimal label sequence is obtained by maximizing s(X, y).
  12. The enterprise name recognition method according to claim 10, characterized in that the optimal label sequence is obtained through a predetermined label sequence formula, the predetermined label sequence formula being:
    s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}
    where X denotes the feature vectors of the word vectors, y denotes the label sequence to be predicted, n denotes the number of Chinese characters in the specific text sequence, and i denotes the i-th Chinese character in the specific text sequence;
    A denotes the state transition matrix, A_{y_i, y_{i+1}} denotes the probability of transitioning from label y_i to label y_{i+1}, and P_{i, y_i} denotes the probability that the i-th Chinese character is marked with label y_i; and
    s(X, y) is the score measuring each candidate label sequence; the optimal label sequence is obtained by maximizing s(X, y).
  13. The enterprise name recognition method according to claim 9, characterized in that the identification of the specific enterprise name includes: extracting from the optimal label sequence the Chinese characters corresponding to a consecutively labeled first label and all following second labels, and taking the extracted characters as the specific enterprise name.
  14. The enterprise name recognition method according to any one of claims 10-12, characterized in that the identification of the specific enterprise name includes: extracting from the optimal label sequence the Chinese characters corresponding to a consecutively labeled first label and all following second labels, and taking the extracted characters as the specific enterprise name.
  15. A computer-readable storage medium storing an enterprise name recognition system executable by at least one processor, the enterprise name recognition system implementing the following steps when executed by the processor:
    receiving an input specific text sequence;
    converting each Chinese character in the specific text sequence into a corresponding word vector, and inputting the converted word vectors into a recurrent neural network;
    computing a state vector for each word vector through the recurrent neural network, and inputting the state vector of each word vector into a conditional random field;
    converting the state vector of each word vector into a feature vector through a specific regression model, decoding the feature vector of each word vector using the conditional random field and a preset ternary labeling rule to obtain a ternary label set for all Chinese characters in the specific text sequence, and outputting the ternary label set of all Chinese characters as an optimal label sequence; and
    identifying a specific enterprise name from the optimal label sequence according to the preset ternary labeling rule.
  16. The computer-readable storage medium according to claim 15, characterized in that the preset ternary labeling rule includes: a first label denoting the first Chinese character of an enterprise name, a second label denoting the remaining Chinese characters of the enterprise name, and a third label denoting Chinese characters that do not belong to an enterprise name.
  17. The computer-readable storage medium according to claim 16, characterized in that the state vector includes a first hidden-layer state vector and a second hidden-layer state vector;
    the computation of the state vector includes:
    invoking the bidirectional long short-term memory module of the recurrent neural network, computing, from left to right, the first hidden-layer state vector of the current word vector from the hidden-layer state vector of the preceding word vector, and computing, from right to left, the second hidden-layer state vector of the current word vector from the hidden-layer state vector of the following word vector;
    the computation of the feature vector includes: merging, through the specific regression model, the first hidden-layer state vector and the second hidden-layer state vector corresponding to each word vector to obtain the feature vector of each word vector.
  18. The computer-readable storage medium according to claim 16, characterized in that the optimal label sequence is obtained through a predetermined label sequence formula, the predetermined label sequence formula being:
    s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}
    where X denotes the feature vectors of the word vectors, y denotes the label sequence to be predicted, n denotes the number of Chinese characters in the specific text sequence, and i denotes the i-th Chinese character in the specific text sequence;
    A denotes the state transition matrix, A_{y_i, y_{i+1}} denotes the probability of transitioning from label y_i to label y_{i+1}, and P_{i, y_i} denotes the probability that the i-th Chinese character is marked with label y_i; and
    s(X, y) is the score measuring each candidate label sequence; the optimal label sequence is obtained by maximizing s(X, y).
  19. The computer-readable storage medium according to claim 17, characterized in that the optimal label sequence is obtained through a predetermined label sequence formula, the predetermined label sequence formula being:
    s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}
    where X denotes the feature vectors of the word vectors, y denotes the label sequence to be predicted, n denotes the number of Chinese characters in the specific text sequence, and i denotes the i-th Chinese character in the specific text sequence;
    A denotes the state transition matrix, A_{y_i, y_{i+1}} denotes the probability of transitioning from label y_i to label y_{i+1}, and P_{i, y_i} denotes the probability that the i-th Chinese character is marked with label y_i; and
    s(X, y) is the score measuring each candidate label sequence; the optimal label sequence is obtained by maximizing s(X, y).
  20. The computer-readable storage medium according to any one of claims 16-19, characterized in that the identification of the specific enterprise name includes: extracting from the optimal label sequence the Chinese characters corresponding to a consecutively labeled first label and all following second labels, and taking the extracted characters as the specific enterprise name.
PCT/CN2018/076164 2017-10-16 2018-02-10 Enterprise name recognition method, electronic device and computer-readable storage medium WO2019075967A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710960222.1 2017-10-16
CN201710960222.1A CN107797989A (zh) 2017-10-16 2017-10-16 Enterprise name recognition method, electronic device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2019075967A1 true WO2019075967A1 (zh) 2019-04-25

Family

ID=61533188

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/076164 WO2019075967A1 (zh) 2017-10-16 2018-02-10 Enterprise name recognition method, electronic device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN107797989A (zh)
WO (1) WO2019075967A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555182A * 2018-05-31 2019-12-10 中国电信股份有限公司 Method and apparatus for determining a user profile, and computer-readable storage medium
CN111209392B * 2018-11-20 2023-06-20 百度在线网络技术(北京)有限公司 Method, apparatus and device for mining polluting enterprises
CN109726266A * 2018-12-21 2019-05-07 珠海市小源科技有限公司 SMS signature processing method, device and computer-readable storage medium
CN109726397B * 2018-12-27 2024-02-02 网易(杭州)网络有限公司 Labeling method and apparatus for Chinese named entities, storage medium and electronic device
CN109885702A * 2019-01-17 2019-06-14 哈尔滨工业大学(深圳) Sequence labeling method, apparatus, device and storage medium in natural language processing
CN109815952A * 2019-01-24 2019-05-28 珠海市筑巢科技有限公司 Brand name recognition method, computer apparatus and computer-readable storage medium
CN110516241B * 2019-08-26 2021-03-02 北京三快在线科技有限公司 Geographic address parsing method and apparatus, readable storage medium and electronic device
CN112925961A * 2019-12-06 2021-06-08 北京海致星图科技有限公司 Intelligent question answering method and apparatus based on enterprise entities
CN111507108B * 2020-04-17 2021-03-19 腾讯科技(深圳)有限公司 Alias generation method and apparatus, electronic device and computer-readable storage medium
CN111914535B * 2020-07-31 2023-03-24 平安科技(深圳)有限公司 Word recognition method and apparatus, computer device and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN106202054A * 2016-07-25 2016-12-07 哈尔滨工业大学 Named entity recognition method for the medical field based on deep learning
CN106569998A * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
US20170127016A1 * 2015-10-29 2017-05-04 Baidu Usa Llc Systems and methods for video paragraph captioning using hierarchical recurrent neural networks
CN106980608A * 2017-03-16 2017-07-25 四川大学 Chinese electronic medical record word segmentation and named entity recognition method and system

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CN105975555A * 2016-05-03 2016-09-28 成都数联铭品科技有限公司 Enterprise abbreviation extraction method based on a bidirectional recurrent neural network
CN106294322A * 2016-08-04 2017-01-04 哈尔滨工业大学 LSTM-based Chinese zero-anaphora resolution method
CN106886516A * 2017-02-27 2017-06-23 竹间智能科技(上海)有限公司 Method and apparatus for automatically identifying sentence relations and entities
CN106980609A * 2017-03-21 2017-07-25 大连理工大学 Named entity recognition method using conditional random fields based on word vector representations
CN107122416B * 2017-03-31 2021-07-06 北京大学 Chinese event extraction method
CN107145483B * 2017-04-24 2018-09-04 北京邮电大学 Adaptive Chinese word segmentation method based on embedded representations
CN107203511B * 2017-05-27 2020-07-17 中国矿业大学 Named entity recognition method for web text based on neural network probabilistic disambiguation


Also Published As

Publication number Publication date
CN107797989A (zh) 2018-03-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18869404

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.10.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18869404

Country of ref document: EP

Kind code of ref document: A1